Hello, I've just completed a review of the QMutex & family internals, as well as the Intel optimisation manuals' material on threading, data sharing, locking, and the Transactional Memory extensions.
Short story: I recommend de-inlining the QMutex code for Qt 5.0. We can re-inline later. I also recommend increasing the QBasicMutex / QMutex size to accommodate more data without the pointer indirection. Further, some work needs to be done to support Valgrind. Details below.

Long story follows.

Current state: QBasicMutex is an internal POD class that offers non-recursive locking. It's incredibly efficient on Linux, where the single pointer-sized member variable is enough to execute the futex operations in all cases (sketch 1 after the long story shows the general technique). On other platforms we get the same efficiency in the non-contended case, but incur a non-negligible performance penalty when contention happens. QBasicMutex's documentation also has a note saying that timed locks may not work and may cause memory leaks.

QMutex builds on QBasicMutex. It fixes the memory leak (I assume) and provides support for recursive mutexes. A recursive QMutex is simply a QMutex with an owner ID and a recursion counter (sketch 2).

QWaitCondition, QSemaphore and QReadWriteLock are not optimised at all. QWaitCondition is implemented using pthreads on Unix and an event queue on Windows. QSemaphore builds upon them, using one QMutex and one QWaitCondition (sketch 3). QReadWriteLock has one QMutex, two QWaitConditions and some other private data.

Valgrind's helgrind and DRD tools can currently operate on Qt locks by preloading a library and hijacking the QMutex functions. I have tested helgrinding Qt 4 applications in the past. I do not see a library version number in the symbols, so it's possible that helgrind would work unmodified with Qt 5 if we were to de-inline the functions. However, we should probably approach the Valgrind community to make sure that the Qt 5 mutexes work.

The Intel optimisation manual says that data sharing is most efficient when each thread or core operates on a disjoint set of cachelines. That is, if you have two threads running, they are most efficient when there are no writes by both threads to the same cacheline (or cache sector of 128 bytes on Pentium 4). Shared reading is fine. While this comes from the Intel optimisation manual, the recommendation is probably a good rule of thumb for any architecture. There are some other optimisation hints about using pipelined locks and about cache aliasing at 64 kB and 1 MB, but those are higher-level problems than we can solve at the lock level.

The TSX manual says that transactional memory contention happens at the cacheline level: a transaction aborts if it reads from a cacheline that is modified outside the transaction, or if it writes to a cacheline that is read from or written to outside the transaction. I do not believe this to be more of a problem than the optimisation guideline above, which is something for the higher-level organisation to deal with: do not put two independent lock variables in the same cacheline (sketch 4).

There are two types of instructions to start and finish a transaction. One pair (HLE) is backwards-compatible with existing processors and could be inserted into every single mutex lock and unlock, even in inline code (which might serve as a hint to Valgrind, for example). The other pair (RTM) requires checking the processor's CPUID first (sketch 5). However, there are no processors on the market with transactional memory support and I don't have access to a prototype (yet, anyway), so at this point we simply have no idea whether enabling transactions for all mutex locks is a good idea. If enabling them for all mutexes isn't a good idea, the code for being adaptive cannot be inlined, and we're further bound by the current inline code in QMutex.
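Below are a few illustrative sketches for the points above. They are simplified stand-ins written against the standard library and raw syscalls, not the actual Qt implementations.

Sketch 1: to make the futex point concrete, here is a minimal three-state futex lock in the style described above. This is written for this mail and follows the well-known futex-mutex design; it is not the QBasicMutex source, and it assumes std::atomic<int> has the layout of a plain int (true on Linux in practice).

    // Sketch 1: three-state futex lock -- 0 = unlocked, 1 = locked,
    // 2 = locked with waiters. One integer holds the entire lock state.
    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    class SketchMutex
    {
        std::atomic<int> state{0};

        long futexWait(int expectedValue)
        { return syscall(SYS_futex, &state, FUTEX_WAIT_PRIVATE, expectedValue, nullptr, nullptr, 0); }
        long futexWakeOne()
        { return syscall(SYS_futex, &state, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0); }

    public:
        void lock()
        {
            int c = 0;
            // Uncontended fast path: a single compare-and-swap, no syscall.
            if (state.compare_exchange_strong(c, 1, std::memory_order_acquire))
                return;
            // Contended slow path: advertise a waiter, then sleep in the kernel.
            if (c != 2)
                c = state.exchange(2, std::memory_order_acquire);
            while (c != 0) {
                futexWait(2);       // sleeps only while state is still 2
                c = state.exchange(2, std::memory_order_acquire);
            }
        }

        void unlock()
        {
            // Fast path: if nobody was waiting, a single atomic store suffices.
            if (state.exchange(0, std::memory_order_release) == 2)
                futexWakeOne();
        }
    };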
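Sketch 2: the recursive layer is little more than what the text says -- a non-recursive mutex plus an owner ID and a recursion counter. Illustrative only, with std::mutex standing in for the non-recursive QMutex:

    // Sketch 2: recursive mutex = plain mutex + owner ID + recursion counter.
    #include <atomic>
    #include <mutex>
    #include <thread>

    class SketchRecursiveMutex
    {
        std::mutex base;                       // stands in for the non-recursive QMutex
        std::atomic<std::thread::id> owner{};  // readable by non-owners, hence atomic
        int count = 0;                         // touched only by the owning thread

    public:
        void lock()
        {
            auto self = std::this_thread::get_id();
            // Only this thread can ever store its own ID, so a relaxed read
            // is enough to detect re-entry by the owner.
            if (owner.load(std::memory_order_relaxed) == self) {
                ++count;
                return;
            }
            base.lock();
            owner.store(self, std::memory_order_relaxed);
            count = 1;
        }

        void unlock()
        {
            if (--count == 0) {
                owner.store(std::thread::id(), std::memory_order_relaxed);
                base.unlock();
            }
        }
    };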
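Sketch 3: QSemaphore's composition is the textbook one; the sketch below shows the same structure using the standard-library equivalents (one mutex plus one wait condition guarding a counter):

    // Sketch 3: counting semaphore from one mutex and one wait condition.
    #include <condition_variable>
    #include <mutex>

    class SketchSemaphore
    {
        std::mutex mutex;
        std::condition_variable cond;
        int avail;

    public:
        explicit SketchSemaphore(int n = 0) : avail(n) {}

        void release(int n = 1)
        {
            std::lock_guard<std::mutex> locker(mutex);
            avail += n;
            cond.notify_all();                 // wake acquirers to re-check
        }

        void acquire(int n = 1)
        {
            std::unique_lock<std::mutex> locker(mutex);
            cond.wait(locker, [&] { return avail >= n; });
            avail -= n;
        }
    };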
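Sketch 4: the cacheline guideline is easy to apply mechanically. For example, with two independent locks:

    // Sketch 4: keep independently used locks on separate cachelines.
    // 64 bytes is the usual x86 cacheline size (128-byte sectors on Pentium 4).
    #include <mutex>

    struct BadLayout
    {
        std::mutex a;              // a and b likely share one cacheline, so
        std::mutex b;              // contention on one ping-pongs the other too
    };

    struct GoodLayout
    {
        alignas(64) std::mutex a;  // each lock gets a cacheline to itself
        alignas(64) std::mutex b;
    };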
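Sketch 5: for RTM (the CPUID-gated pair), lock elision would look roughly like the sketch below, using the _xbegin()/_xend() intrinsics from the TSX specification. Since no hardware exists yet, treat it purely as a sketch: a production version would additionally read the fallback lock's state inside the transaction so that a non-transactional lock holder forces an abort, and would gate everything on the RTM CPUID feature bit (leaf 7, EBX bit 11).

    // Sketch 5: RTM-based lock elision (untested -- no hardware exists yet).
    // Compile with RTM support (e.g. -mrtm with GCC).
    #include <immintrin.h>
    #include <mutex>

    static std::mutex fallbackLock;            // taken only when elision fails

    template <typename Body>
    void runElided(Body criticalSection)
    {
        unsigned status = _xbegin();           // XBEGIN: start the transaction
        if (status == _XBEGIN_STARTED) {
            criticalSection();                 // runs transactionally
            _xend();                           // XEND: commit atomically
            return;
        }
        // Aborted (data conflict, capacity, interrupt, ...): run under the lock.
        std::lock_guard<std::mutex> locker(fallbackLock);
        criticalSection();
    }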
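Sketch 6: to make the size expansion (recommendation P1 below) concrete, all three optimised implementations could share the shape "lock state + wait structure". A hypothetical layout, member names invented for illustration and Qt's usual headers assumed:

    // Sketch 6: hypothetical expanded QBasicMutex layout (names invented).
    struct SketchBasicMutexData
    {
    #if defined(Q_OS_LINUX)
        QBasicAtomicInt state;     // futex word: lock state and waiters in one int
    #elif defined(Q_OS_WIN)
        QBasicAtomicInt state;     // lock state
        Qt::HANDLE event;          // wait structure: one extra pointer (a HANDLE)
    #elif defined(Q_OS_MAC)
        QBasicAtomicInt state;     // lock state
        semaphore_t semaphore;     // wait structure: a Mach semaphore
    #else
        QMutexPrivate *d;          // generic Unix: allocated once, in the constructor
    #endif
    };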
Recommendations (in priority order):

(P0) De-inline the QBasicMutex locking functions until we solve some or all of the problems below.

(P1) Expand the size of QBasicMutex and QMutex to accommodate more data, for Mac and Windows. On Windows, the extra data required is one pointer (a HANDLE), whereas on Mac it's a semaphore_t. Depending on the next task, we might need a bit more space. At first glance, all three implementations would work fine with a lock state plus the wait structure -- with the futex implementation doing both in one single integer (sketch 6 above illustrates this).

(P1) Approach the Valgrind developers to ensure that helgrind and DRD work with Qt 5.

(P2) Optimise the Mac and Windows implementations to avoid allocating a dynamic d-pointer in case of contention. In fact, remove the need for dynamic d-pointer allocation altogether: Mac, Windows and Linux should never do it, while the generic Unix implementation should always do it, in the constructor.

(P2) Investigate whether the recursive QMutex can benefit from the expanded size of QMutex too.

(P2) Analyse QReadWriteLock and see whether expanding the size of the structure, as for QMutex, would be beneficial.

(P3) Investigate TSX support for QMutex, whether by using HLE or RTM, and whether unconditional or adaptive -- making sure that QMutex unlocking by way of QWaitCondition waiting keeps the correct semantics.

(P3) Investigate TSX support for QReadWriteLock.

(P4) Optimise the implementations (at least the Linux one) by reading the assembly.

If, at the completion of the above tasks, we conclude that inlining the QMutex locking would be beneficial, with minimal side effects, we can re-inline it. The same applies to QReadWriteLock.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
     Intel Sweden AB - Registration Number: 556189-6027
     Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden