On Sunday, 16 October 2011 21:09:44 Thiago Macieira wrote:
> Here's an idea:
>     QAtomicInt ref;
>     int alloc;
>     union {
>         qptrdiff offset;
>         struct { int begin; int end; };
>     };
>     // size = 16 bytes
And here are two possibilities admitting defeat and going over 16 bytes:
Option 1:
    QAtomicInt ref;
    int alloc;
    union {
        qptrdiff begin;
        qint64 dummy;
    };
    int end;
    int flags;
    // size = 24 bytes
Advantages:
* 32 bits of flags available, reserving room for future expansion
* no fiddling with sign bits anywhere
Disadvantages:
* 32 bits wasted on 32-bit platforms: 4 of the union's 8 bytes will never be
used there
* assuming an allocator aligning to 16 bytes, the start of the data will
always be 8 bytes off, incurring a performance penalty with SSE2 operations
(> 99% of the cases)
* QVectors of SSE types will have 8 bytes of padding
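As a rough check of the Option 1 numbers, the layout can be sketched with standard C++ stand-ins. Here std::atomic<int>, std::ptrdiff_t and std::int64_t are assumptions substituting for QAtomicInt, qptrdiff and qint64, and the names Option1Header and dataMisalignment are hypothetical. The size assertion is expected to hold on common 32- and 64-bit ABIs, not necessarily everywhere.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the Option 1 header; not the actual Qt type.
struct Option1Header {
    std::atomic<int> ref;      // stands in for QAtomicInt
    int alloc;
    union {
        std::ptrdiff_t begin;  // stands in for qptrdiff
        std::int64_t dummy;    // keeps the union 64 bits wide on 32-bit platforms
    };
    int end;
    int flags;
};

// Offset modulo 16 of the first data byte, assuming the header itself
// starts on a 16-byte boundary and the data follows it immediately.
constexpr std::size_t dataMisalignment(std::size_t headerSize)
{
    return headerSize % 16;
}

static_assert(sizeof(Option1Header) == 24,
              "4 + 4 + 8 + 4 + 4 bytes on common ABIs");
static_assert(dataMisalignment(sizeof(Option1Header)) == 8,
              "with a 16-byte-aligning allocator the data lands 8 bytes off");
```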
Option 2:
    QAtomicInt ref;
    int flags;
    union {
        qptrdiff alloc;
        qint64 dummy;
    };
    qptrdiff begin;
    qptrdiff end;
    // size = 24 bytes on 32-bit platforms, 32 bytes on 64-bit platforms
Advantages:
* 32 bits of flags available
* size multiple of 16 on 64-bit platforms, for best SSE2 performance
* full 64-bit sizes for 64-bit machines, allowing for allocation of more than
2 GB of data. The same header could be used for a QHugeVector class that
operates on signed 64-bit sizes, allowing up to 8388608 TB of data
* No padding required for QVectors of SSE types
Disadvantages:
* on 64-bit platforms, 100% bigger than the original structure and 33% bigger
than Option 1 (32 bytes versus 16 and 24)
* 32 bits wasted on 32-bit platforms
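The Option 2 sizes can be checked the same way. This is a minimal sketch, again assuming std::atomic<int>, std::ptrdiff_t and std::int64_t as stand-ins for QAtomicInt, qptrdiff and qint64; Option2Header and expectedOption2Size are hypothetical names. The last assertion reproduces the 8388608 TB figure, reading TB as 2^40 bytes.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the Option 2 header; not the actual Qt type.
struct Option2Header {
    std::atomic<int> ref;      // stands in for QAtomicInt
    int flags;
    union {
        std::ptrdiff_t alloc;  // stands in for qptrdiff
        std::int64_t dummy;    // keeps the union 64 bits wide on 32-bit platforms
    };
    std::ptrdiff_t begin;
    std::ptrdiff_t end;
};

// 8 bytes (ref + flags) + 8 bytes (union) + two pointer-sized fields:
// 24 bytes with 4-byte qptrdiff, 32 bytes with 8-byte qptrdiff.
constexpr std::size_t expectedOption2Size()
{
    return 8 + 8 + 2 * sizeof(std::ptrdiff_t);
}

static_assert(sizeof(Option2Header) == expectedOption2Size(),
              "24 bytes on 32-bit, 32 bytes on 64-bit (common ABIs)");

// A signed 64-bit size covers 2^63 bytes, which is 2^23 = 8388608 units of 2^40 bytes.
static_assert(((std::uint64_t(1) << 63) >> 40) == 8388608,
              "2^63 bytes expressed in units of 2^40 bytes");
```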
On 32-bit machines, if the allocator produces 16-byte-aligned memory regions,
a 24-byte header will be wrong in >95% of the cases, causing SSE2 performance
penalties. However, if the allocator produces 8-byte-aligned memory regions,
as malloc in glibc does, we'll be wrong in just over 50% of the cases whether
the structure is 24 or 32 bytes long. So we gain nothing by making it 32 bytes
long on 32-bit machines.
The percentages are based on my experience attempting SIMD optimisations on
QString. Over a large sample, I found that 95%-99% of the data comes from
QString's own allocations and the rest (1-5%) comes from fromRawData. The
strings from fromRawData are evenly distributed across all possible
alignments, while the strings allocated by QString are evenly distributed
across the two possibilities (offsets 0 and 8) on 32-bit machines.
In other words, the histogram of QString data alignments (offset modulo 16),
on a 32-bit machine with an 8-byte-aligning allocator (like glibc's), should
be roughly like the following, whether the header is 16, 24 or 32 bytes long:
0 48.5%
2 0.5%
4 0.5%
6 0.5%
8 48.5%
10 0.5%
12 0.5%
14 0.5%
With a 16- or 32-byte header and an allocator giving 16-byte-aligned memory
regions, we should see:
0 96.5%
2 0.5%
4 0.5%
6 0.5%
8 0.5%
10 0.5%
12 0.5%
14 0.5%
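The two profiles above can be reproduced with a small model. This is a sketch under stated assumptions, not measured data: it takes 96% of strings from QString's own allocations and 4% from fromRawData (a point inside the 95%-99% and 1-5% ranges given earlier, chosen because it yields the 0.5% figures), and it assumes the allocator hands out every permitted start offset equally often. The names Histogram and alignmentHistogram are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Share (in percent) of string data starting at each even offset modulo 16.
// Index i corresponds to offset 2 * i.
using Histogram = std::array<double, 8>;

// headerSize:     header size in bytes, assumed to be even
// allocAlignment: alignment guaranteed by the allocator, assumed to be 8 or 16
Histogram alignmentHistogram(std::size_t headerSize, std::size_t allocAlignment)
{
    const double rawShare = 4.0;               // assumed fromRawData share
    const double ownShare = 100.0 - rawShare;  // QString's own allocations

    Histogram h;
    h.fill(rawShare / 8);  // fromRawData: spread evenly over the 8 even offsets

    // Own allocations: the header can start at any multiple of allocAlignment
    // modulo 16, and the data follows the header immediately.
    const std::size_t starts = 16 / allocAlignment;  // 2 for 8-byte, 1 for 16-byte
    for (std::size_t s = 0; s < starts; ++s) {
        const std::size_t dataOffset = (s * allocAlignment + headerSize) % 16;
        h[dataOffset / 2] += ownShare / starts;
    }
    return h;
}
```

Under this model, alignmentHistogram(24, 8) gives 48.5% at offsets 0 and 8, while alignmentHistogram(16, 16) and alignmentHistogram(32, 16) give 96.5% at offset 0.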
To make the 32-bit structure fit the latter profile above, we'd need to add
another 8 bytes to the header (bringing the total wastage to 12 bytes) and
hope for an allocator that aligns to 16 bytes. Using posix_memalign or
equivalent functions is likely to simply cause another 8 bytes of overhead
inside the allocators.
An alternative, and IMHO better, approach would be to always allocate 8 bytes
more than strictly needed and force d->begin to the 16-byte boundary. That
means that d->begin == 4 whenever d is misaligned. This approach would allow
us to achieve the above profile even on systems with allocators giving 8-byte-
aligned pointers, such as glibc 32-bit.
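A minimal sketch of how the forced alignment could be computed, assuming the data normally follows the header immediately and that the header address is at least 8-byte aligned. The function names are hypothetical. The sketch also assumes that d->begin counts 2-byte elements, as it would for QString data, which is one reading of where the value 4 comes from (8 bytes of padding).

```cpp
#include <cstddef>
#include <cstdint>

// Bytes to skip after the header so that the data starts on a 16-byte boundary.
// headerAddress: address of the header, assumed at least 8-byte aligned
// headerSize:    size of the header in bytes, assumed a multiple of 8
constexpr std::size_t paddingFor16(std::uintptr_t headerAddress, std::size_t headerSize)
{
    return (16 - (headerAddress + headerSize) % 16) % 16;
}

// The same padding expressed in 2-byte elements (an assumption about the
// unit in which begin is counted).
constexpr std::size_t beginInElements(std::uintptr_t headerAddress, std::size_t headerSize)
{
    return paddingFor16(headerAddress, headerSize) / 2;
}

// Example addresses are made up for illustration.
static_assert(paddingFor16(0x1000, 24) == 8, "data would start 8 bytes off");
static_assert(paddingFor16(0x1008, 24) == 0, "data already on a 16-byte boundary");
static_assert(beginInElements(0x1000, 24) == 4, "8 bytes of padding = 4 elements");
```

Because the padding is derived from the actual address, the same code yields zero padding whenever the data is already aligned, so the extra 8 bytes allocated are simply unused in that case.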
It would also allow us to adapt on-the-fly if the allocator is updated and
starts to give us 16-byte-aligned pointers on 32-bit, which would otherwise be
the worst-case scenario below:
0 0.5%
2 0.5%
4 0.5%
6 0.5%
8 96.5%
10 0.5%
12 0.5%
14 0.5%
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358
