Re: [Development] Mutex future directions
On quarta-feira, 23 de maio de 2012 04.54.11, bradley.hug...@nokia.com wrote: On May 22, 2012, at 11:01 AM, ext Thiago Macieira wrote: And I don't think they very much liked the idea of spinning while trying to acquire a lock (consumes power)... at least we have a call to QThread::yieldCurrentThread(), though it would be interesting to see what happens if we replace it with a PAUSE instruction. Have you seen the spin code in 4.8? It does use QThread::yieldCurrentThread(). ;) Indeed it does. I'm saying that it would be interesting to check what happens if we replace it with a simple PAUSE instruction. We might be able to achieve the same result with a much smaller overhead. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden signature.asc Description: This is a digitally signed message part. ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
Hi Bradley, And I don't think they very much liked the idea of spinning while trying to acquire a lock (consumes power)... at least we have a call to QThread::yieldCurrentThread(), though it would be interesting to see what happens if we replace it with a PAUSE instruction. Have you seen the spin code in 4.8? It does use QThread::yieldCurrentThread(). ;) I guess it depends a bit on if you want to yield to another thread or not ;-). Of course: burning cpu time is another way to solve this ;-) Greetings, Carsten -- Bradley T. Hughes bradley.hug...@nokia.com ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On quarta-feira, 23 de maio de 2012 22.52.55, Carsten Breuer wrote: Hi Bradley, And I don't think they very much liked the idea of spinning while trying to acquire a lock (consumes power)... at least we have a call to QThread::yieldCurrentThread(), though it would be interesting to see what happens if we replace it with a PAUSE instruction. Have you seen the spin code in 4.8? It does use QThread::yieldCurrentThread(). ;) I guess it depends a bit on if you want to yield to another thread or not ;-). Of course: burning cpu time is another way to solve this ;-) According to the Intel manual, if you *don't* yield and you don't have the PAUSE instruction, you may make things worse. For one thing, if you don't yield, the processor will continue executing your code, without giving a chance for other processes to run. There's no guarantee even in a multiprocessor system that the other thread is running. And even if it's running, if it was scheduled to another thread on the same hyperthreaded core, the processor may not execute the that thread because this thread still has instructions to execute. To make matters worse, if you spin too quickly, you issue a LOT of atomic compare-and-swap commands, which block the bus. So even if the other thread is scheduled by the OS and is running in a different core, it may have memory stall issues. Hence one of the most counter-intuitive (at first glance) recommendations: to make your spin lock loops faster, insert a PAUSE :-) -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden signature.asc Description: This is a digitally signed message part. ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
Hi Thiago, thanks for clearing this :-) I guess it depends a bit on if you want to yield to another thread or not ;-). Of course: burning cpu time is another way to solve this ;-) According to the Intel manual, if you *don't* yield and you don't have the PAUSE instruction, you may make things worse. What do you exactly mean with a PAUSE instruction. Is this a assembler processor command? For one thing, if you don't yield, the processor will continue executing your code, without giving a chance for other processes to run. There's no guarantee even in a multiprocessor system that the other thread is running. And even if it's running, if it was scheduled to another thread on the same hyperthreaded core, the processor may not execute the that thread because this thread still has instructions to execute. Interesting. Wouldn't it be better to change YieldCurrentThread? I guess, most of us are not aware of this and some of us learned in the 386 century that give back CPU performance to the OS is always a good thing. I guess it is even nowadays a good thing on single core machines. Isn't it? Thanks and Best Regards, Carsten ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On quarta-feira, 23 de maio de 2012 23.44.57, Carsten Breuer wrote: Hi Thiago, thanks for clearing this :-) I guess it depends a bit on if you want to yield to another thread or not ;-). Of course: burning cpu time is another way to solve this ;-) According to the Intel manual, if you *don't* yield and you don't have the PAUSE instruction, you may make things worse. What do you exactly mean with a PAUSE instruction. Is this a assembler processor command? The PAUSE instruction on x86. I don't know of equivalent in other architectures. Trivia: the instruction is encoded as REP NOP. For one thing, if you don't yield, the processor will continue executing your code, without giving a chance for other processes to run. There's no guarantee even in a multiprocessor system that the other thread is running. And even if it's running, if it was scheduled to another thread on the same hyperthreaded core, the processor may not execute the that thread because this thread still has instructions to execute. Interesting. Wouldn't it be better to change YieldCurrentThread? I guess, most of us are not aware of this and some of us learned in the 386 century that give back CPU performance to the OS is always a good thing. I guess it is even nowadays a good thing on single core machines. Isn't it? The yieldCurrentThread call is a system call: it tells the OS to think about scheduling another thread. The PAUSE instruction is a processor-only hint. They have different purposes and we should benchmark. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden signature.asc Description: This is a digitally signed message part. ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On May 21, 2012, at 2:58 PM, ext Thiago Macieira wrote: On segunda-feira, 21 de maio de 2012 08.34.32, bradley.hug...@nokia.com wrote: On May 18, 2012, at 8:34 PM, ext Thiago Macieira wrote: Recommendations (priority): (P0) de-inline QBasicMutex locking functions until we solve some or all of the below problems I agree with this, so that it gives a chance to fix the performance regressions on Mac at a later date (since it probably won't be fixed before 5.0 is released). Some notes from the IRC discussion this morning between Olivier, Brad and myself: * QMutex contended performance has dropped considerably on Mac from 4.8 to 5.0 (it's 10x slower) * QMutex contended performance on Mac is now actually similar to the pthread_mutex_t performance (read: contended QMutex on 4.8 is 10x faster than pthread_mutex_t) * changing the QMutex implementation to use the generic Unix codepath on Mac makes it 2x slower * the non-Linux code in QBasicMutex::lockInternal is considered complex and hard to read by both Brad and myself Brad: could you please provide what is, to the best of your knowledge today, the combination of tricks that made 4.8 fast? The trick was the adaptive spin, added and modified over a series of commits in 4.8. The biggest gain was on Mac, Linux performance didn't change noticibly, Windows did get a small gain too (as far as I recall). * QMutex de-inlining and the Mac performance issues are orthogonal. * QMutex de-inlining should be understood more correctly as: removing the testAndSet calls from the inline functions. The inline functions should remain inline. * The de-inlining is important for Valgrind (helgrind / DRD) to work properly, even in release mode Lars and I had a conversation in the hallway about how QMutex performance on Windows. It's been a while since I last tested, but I recall that QMutex didn't out perform CRITICAL_SECTIONs. De-inlining is necessary so that we can make QMutex nothing more than a wrapper around CRITICAL_SECTION (since the latter performs better). So far, we've got 3 votes for de-inlining: Thiago, Lars, and myself. For the few cases where inlining matters, we can inline inside Qt at those locations (QMetaObject::activate() would be the first place to check). Note that there's another trick that QMutex can apply under valgrind but QBasicMutex cannot: if the QMutex constructor initialises the d pointer to anything non-null and different from 3, the inlined testAndSet will fail, so valgrind can properly hijack the lock and unlock functions. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development -- Bradley T. Hughes bradley.hug...@nokia.com ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On terça-feira, 22 de maio de 2012 07.12.26, bradley.hug...@nokia.com wrote: So far, we've got 3 votes for de-inlining: Thiago, Lars, and myself. For the few cases where inlining matters, we can inline inside Qt at those locations (QMetaObject::activate() would be the first place to check). Turns out that I got some unexpected extra ammo for this decision. Apparently, some people from Intel found out about this thread and came to me to ask WTH I was thinking. Fortunately, they also offered help :-) We discussed a few ideas back and forth. Summarising: - inlining the lock code should be done on a need-to-do basis, only where the performance is really critical - such inlined locks should or should not use Transactional Memory only after we've done testing to ensure the performance - for generic locks, the overhead of a function call is usually negligible and will be drowned out by the rest of the work to be done - for those generic locks, transactional memory use needs to be adaptative. That means we need to use the RTM instructions, which require a CPUID check. That check and the presence of the abort handler mean that we transactional memory code would be non-inline. - if we were to keep the inlined locking code, we'd need to change the default value of QMutex's d pointer so that the inlined compare-and-swap always fails. That means the code path for locking a mutex on x86 will always include a *failed* compare-and-exchange on Haswell. And I don't think they very much liked the idea of spinning while trying to acquire a lock (consumes power)... at least we have a call to QThread::yieldCurrentThread(), though it would be interesting to see what happens if we replace it with a PAUSE instruction. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden signature.asc Description: This is a digitally signed message part. ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On May 22, 2012, at 11:01 AM, ext Thiago Macieira wrote: And I don't think they very much liked the idea of spinning while trying to acquire a lock (consumes power)... at least we have a call to QThread::yieldCurrentThread(), though it would be interesting to see what happens if we replace it with a PAUSE instruction. Have you seen the spin code in 4.8? It does use QThread::yieldCurrentThread(). ;) -- Bradley T. Hughes bradley.hug...@nokia.com ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On May 18, 2012, at 8:34 PM, ext Thiago Macieira wrote: Recommendations (priority): (P0) de-inline QBasicMutex locking functions until we solve some or all of the below problems I agree with this, so that it gives a chance to fix the performance regressions on Mac at a later date (since it probably won't be fixed before 5.0 is released). -- Bradley T. Hughes bradley.hug...@nokia.com ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On segunda-feira, 21 de maio de 2012 08.34.32, bradley.hug...@nokia.com wrote: On May 18, 2012, at 8:34 PM, ext Thiago Macieira wrote: Recommendations (priority): (P0) de-inline QBasicMutex locking functions until we solve some or all of the below problems I agree with this, so that it gives a chance to fix the performance regressions on Mac at a later date (since it probably won't be fixed before 5.0 is released). Some notes from the IRC discussion this morning between Olivier, Brad and myself: * QMutex contended performance has dropped considerably on Mac from 4.8 to 5.0 (it's 10x slower) * QMutex contended performance on Mac is now actually similar to the pthread_mutex_t performance (read: contended QMutex on 4.8 is 10x faster than pthread_mutex_t) * changing the QMutex implementation to use the generic Unix codepath on Mac makes it 2x slower * the non-Linux code in QBasicMutex::lockInternal is considered complex and hard to read by both Brad and myself Brad: could you please provide what is, to the best of your knowledge today, the combination of tricks that made 4.8 fast? * QMutex de-inlining and the Mac performance issues are orthogonal. * QMutex de-inlining should be understood more correctly as: removing the testAndSet calls from the inline functions. The inline functions should remain inline. * The de-inlining is important for Valgrind (helgrind / DRD) to work properly, even in release mode Note that there's another trick that QMutex can apply under valgrind but QBasicMutex cannot: if the QMutex constructor initialises the d pointer to anything non-null and different from 3, the inlined testAndSet will fail, so valgrind can properly hijack the lock and unlock functions. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden signature.asc Description: This is a digitally signed message part. ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On domingo, 20 de maio de 2012 02.17.53, Olivier Goffart wrote: On Saturday 19 May 2012 19:05:58 Thiago Macieira wrote: On sábado, 19 de maio de 2012 18.34.30, Olivier Goffart wrote: Hi, Regarding valgrind: *) On debug build, nothing is inlined. *) If we keep it inline, then we would just need a patch like this [1] -fno-inline doesn't help because of -fvisibility-inlines-hidden. The call cannot be rerouted to valgrind. Visibility does not really matter for valgrind. It does address redirection, using the debug symbols. I see... After playing with an application under helgrind, in the debugger, I see it doesn't actually use ELF symbol interposition or even my second option of inserting jumps. Since valgrind is a CPU interpreter, it simply knows when you've reached the beginning of an intercepted function and transfers control to the interceptor. Provided it knows that the same function can exist in multiple libraries, it's fine. Anyway, we still need to approach the valgrind community and settle the question. The annotation you added might help, but as I said, adding instructions -- even if they produce no architectural change -- still consumes CPU resources. I'd like to benchmark the annotation vs the function call. Yes, they have a cost which I am not sure we want to pay on release build. But since we may want to helgrind release builds... Indeed, but note that what it says about transactions that abort too often. If the transaction aborts, then the code needs to be re-run non-transactionally, with the lock. That means decreased performance and increased power consumption. Yes, but we are talking about the rare case in which a QMutex is shared between two different objects compiled with different version of Qt. And in that unlikely case, one can just recompile to fix the performance issue. That's not what I meant. I meant that, if we were to add the XACQUIRE and XRELEASE prefixes to all mutexes, we might end up with worse performance of 5.0 and 5.1 applications when run on Haswell because we've never tested it. Then again, I am asking for a slightly decreased performance for all situations. Indeed, QMutex can be used for all sort of cases. There can be also way too much code in the critical section to fit into the transaction cache. Or maybe there is side effects. QMutexLocker lock(mutex) qDebug() What now? does it also restart the transaction? Yes, a SYSENTER will definitely cause a transaction abort. Which is why it might be a good idea to use RTM instead of HLE in QMutex, so we know which mutexes abort and we don't try again next time. So it is probably bad to do the lock elision within QMutex... We need to test it on real hardware to see if it works. But my point is that the current QMutex architecture does not keep us from using lock elision later. Mixing different builds of QMutex might be even worse. If a lock is acquired with HLE and released without, the transaction will keep running until it aborts. And I have no clue what happens if you XRELEASE when no transaction is running. It will definitely cause trouble if we use RTM. Anyway, it might be something we can fix for 5.2, but are we prepared to take the chance? -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden signature.asc Description: This is a digitally signed message part. ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
A few comments from my side: * I am not a big fan of inlining the public classes neither. It feels a bit like over-optimizing at the wrong place. Reason is that this might make it a impossible later on to refactor our code. * It's ok to have some low level inline class for internal use if that helps us in performance critical parts (such as signal/slot code). Nothing says this and the public QMutex class have to be the same. * IMO we should really try to allow using tools such as helgrind also with release build * it would be good to see how much of a real world (ie. not with artificial benchmarks) difference you get due to inlining the mutex code. Is it really relevant? Cheers, Lars On 5/19/12 12:36 AM, ext Olivier Goffart oliv...@woboq.com wrote: On Friday 18 May 2012 23:25:47 Thiago Macieira wrote: QBasicMutex is an internal POD class that offers non-recursive locking. It's incredibly efficient for Linux, where the single pointer-sized member variable is enough to execute the futex operations, in all cases. For other platforms, we get the same efficiency in non-contended cases, but incur a non-negligible performance penalty when contention happens. Which non-negligible performance penalty are you takling about here? The need to allocate memory and do the test-and-set loop for setting the private. There is no memory allocation, it is using the QFreeList. And this happens when we are about to do a context switch. So I beleive the test-and-set loop should be neglectible. Right, and users use the classes as intended... :-P (QBasicMutex is an undocumented internal class) Introducing the noop valgrind code (a 32-bit rotate) still consumes CPU resources. It will consume front-end decoding, one ALU port, the retirement, not to mention increased code size. There's no free lunch. Slower than a function call (which will likely spill register)? I don't beleive so. (moreever, we are talking about debug build, right?) That is not relevant. Transactional memory requires a different set of primitives. QMutex has nothing to do with transactional memory. Then you didn't read the manual. I'm going to ignore what you said until you read the manual because transactional memory has everythig to do with QMutex. Please, either believe me or read the manual. Do you have a link to that manuel? Transactional memory is about detecting conflicts between transactions, and rolling back. Mutexes are about locking until the resource is free Transactional memory could be used to simplify the code that allocate the QMutexPrivate. But I doubt the hypothetical gain is even worth the function call. Because remember the uncontended case is much more critical than the contended case. (There is no point in winning a dozens of cycles if we are going to pay anyway several hendreds in the context switch that follow) But fast uncontended case is critical because it allows the use of mutex in part that might or not be used with threads. Example: QObject. We need to put mutexes while emiting a signal because maybe it is used in multiple thread. But we don't want to pay too much when used in a single thread (the common case) (P2) optimise the Mac and Windows implementations to avoid the need for allocating a dynamic d pointer in case of contention. In fact, remove the need for dynamic d-pointer allocation altogether: Mac, Windows and Linux should never do it, while the generic Unix implementation should do it all the time in the constructor The allocation of the d-ponter is not that dynamic. It is already quite optimized in Qt5 My suggestion was to avoid the QMutexPrivate allocation on Mac and Windows, since they require just a bit of initialisation. Now, given what you said about semaphore_t, we may not be able to do it on the Mac. But we can try to apply the same optimisation for Windows -- the initialisation is a call to CreateEvent. But CreateEvent still probably allocate memory behind. I beleive there is more important priority right now than touching QMutex, which already had its lifting for Qt5. I disagree. If we shoot ourselves in the foot by not being able to support TSX and valgrind in Qt 5, we've lost a lot. That's why I propose de-inlining for 5.0, so we have enough time to investigate those potential drawbacks. I think the inline is not a problem for valgrind. And I don't think we can gain much with TSX. And even if we could, we still do pretty much everything in a binary compatible way despite the inlines (We can say 2 means unlocked and 3 locked, then the old code fallbacks to the non-inline case (The first lock would handle the 0 or 1)) Is it not however shooting ourself in the feet not to inline it? Because we hardly can inline it later in a binary compatible way. -- Olivier Woboq - Qt services and support - http://woboq.com ___ Development mailing list
Re: [Development] Mutex future directions
On sábado, 19 de maio de 2012 00.36.56, Olivier Goffart wrote: On Friday 18 May 2012 23:25:47 Thiago Macieira wrote: Which non-negligible performance penalty are you takling about here? The need to allocate memory and do the test-and-set loop for setting the private. There is no memory allocation, it is using the QFreeList. And this happens when we are about to do a context switch. So I beleive the test-and-set loop should be neglectible. True. My point is that it has a penalty higher than the Linux futex implementation does and, at least on the Windows case, I think it's possible to avoid it. Right, and users use the classes as intended... :-P (QBasicMutex is an undocumented internal class) Again, true, but even our own developers make mistakes. As exhibit A, take Q_GLOBAL_STATIC_WITH_INITIALIZER. See my email Q_GLOBAL_STATIC_WITH_INITIALIZER is harmful; please stop using it. I'm not sure there's a way out for QBasicMutex, not without making it non-POD. Introducing the noop valgrind code (a 32-bit rotate) still consumes CPU resources. It will consume front-end decoding, one ALU port, the retirement, not to mention increased code size. There's no free lunch. Slower than a function call (which will likely spill register)? I don't beleive so. No, not slower. But there's a cost anyway, including a cost that isn't measured in CPU cycles -- the cost of opportunity of debugging. (moreever, we are talking about debug build, right?) No. We're talking about release builds too. That is not relevant. Transactional memory requires a different set of primitives. QMutex has nothing to do with transactional memory. Then you didn't read the manual. I'm going to ignore what you said until you read the manual because transactional memory has everythig to do with QMutex. Please, either believe me or read the manual. Do you have a link to that manuel? http://software.intel.com/file/41604 (see chapter 8) See also http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions Transactional memory is about detecting conflicts between transactions, and rolling back. Mutexes are about locking until the resource is free And the Intel TSX allows us to do both: they allow us to remove the lock and execute the same code optimistically, hoping there are no conflicts. If there are, it will roll back and will lock; if there wasn't, when you unlock it will atomically commit the transaction. Our QMutex is optimised for the non-contended case. The TSX will allow us to to even optimise the contended case where there was no data confilct. Transactional memory could be used to simplify the code that allocate the QMutexPrivate. But I doubt the hypothetical gain is even worth the function call. Because remember the uncontended case is much more critical than the contended case. (There is no point in winning a dozens of cycles if we are going to pay anyway several hendreds in the context switch that follow) But fast uncontended case is critical because it allows the use of mutex in part that might or not be used with threads. Example: QObject. We need to put mutexes while emiting a signal because maybe it is used in multiple thread. But we don't want to pay too much when used in a single thread (the common case) That's a very good example of where we want QMutex to be transactional: we need the mutex because we need to be sure that no other thread is modifying the connection list while the signal activation is reading the lists. However, we know that it's an extremely rare case for that to happen. If we have TSX in the signal emission path, then two threads could read from the signal list simultaneously. In other words, with TSX, QMutes becomes a QReadWriteLock with no extra data cost. My suggestion was to avoid the QMutexPrivate allocation on Mac and Windows, since they require just a bit of initialisation. Now, given what you said about semaphore_t, we may not be able to do it on the Mac. But we can try to apply the same optimisation for Windows -- the initialisation is a call to CreateEvent. But CreateEvent still probably allocate memory behind. True, but will you let me at least try? If you have data saying that allocating a windows event with CreateEvent for every single QMutex has a higher cost than the free list, please add that comment to qmutex_win.cpp. If you don't have such data, then I will try this. I disagree. If we shoot ourselves in the foot by not being able to support TSX and valgrind in Qt 5, we've lost a lot. That's why I propose de-inlining for 5.0, so we have enough time to investigate those potential drawbacks. I think the inline is not a problem for valgrind. Then please add *now* the necessary valgrind macros and prove to me that it works. Note that proving that it works is a necessary condition for keeping them inline. It's not a sufficient condition. And I don't think we can gain much with TSX.
Re: [Development] Mutex future directions
On sábado, 19 de maio de 2012 07.30.56, lars.kn...@nokia.com wrote: * it would be good to see how much of a real world (ie. not with artificial benchmarks) difference you get due to inlining the mutex code. Is it really relevant? Here's a suggestion for one: the Creator C++ parser. I remember Roberto had some benchmarks for this, though I don't remember if the benchmarks were threaded as well. I know that the actual parser in Creator is threaded via QtConcurrent. Another one might be the threaded scene graph. I recommend using the Linux perf(1) tool for benchmarking[*]. The benchlib tickcounter mode works too, but perf provides more information in the results. I'll be surprised if the cost of the function call shows up as more than 1% when parsing a medium-sized project (such as a small Qt library like QtSvg or QtDBus). [*] I've got a perf-mode for benchlib as a pending feature for Qt 5.1 -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden signature.asc Description: This is a digitally signed message part. ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On sábado, 19 de maio de 2012 10.26.03, Thiago Macieira wrote: But fast uncontended case is critical because it allows the use of mutex in part that might or not be used with threads. Example: QObject. We need to put mutexes while emiting a signal because maybe it is used in multiple thread. But we don't want to pay too much when used in a single thread (the common case) That's a very good example of where we want QMutex to be transactional: we need the mutex because we need to be sure that no other thread is modifying the connection list while the signal activation is reading the lists. However, we know that it's an extremely rare case for that to happen. If we have TSX in the signal emission path, then two threads could read from the signal list simultaneously. As Lars's email pointed out, this is a non-example since it is located inside QtCore. We can use the inline mutex as much as we want here and change it in any way we want in future versions. For that matter, the signal emission need not use even QBasicMutex. If we want a RWlock semantic, we can change to a different structure at any time, including a TSX-optimised one. In other words, with TSX, QMutes becomes a QReadWriteLock with no extra data cost. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden signature.asc Description: This is a digitally signed message part. ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On Saturday 19 May 2012 07:30:56 lars.kn...@nokia.com wrote: A few comments from my side: * I am not a big fan of inlining the public classes neither. It feels a bit like over-optimizing at the wrong place. Reason is that this might make it a impossible later on to refactor our code. In general, I agree. (This is why I was againt the fact that the new QMetaType API is inline) But in this case, we are talking about very little specially crafted instructions, on a critical path. Nothink compared to the complexity of, say, QString or QVector. The instructions are made so that any unexpected value will fallback to non inline code. * It's ok to have some low level inline class for internal use if that helps us in performance critical parts (such as signal/slot code). Nothing says this and the public QMutex class have to be the same. QBasicMutex it is internal. (but is still used directly by QMutex) * IMO we should really try to allow using tools such as helgrind also with release build This would indeed ease the use case of developper who are developping against packaged version of Qt. But I beleive they should still get a debug version of Qt, because the debug version offer much more ASSERT and qWarning that are stripped out of a release version. * it would be good to see how much of a real world (ie. not with artificial benchmarks) difference you get due to inlining the mutex code. Is it really relevant? I agree with that. One needs to run benchmark :-) -- Olivier Woboq - Qt services and support - http://woboq.com ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
Hi, Regarding valgrind: *) On debug build, nothing is inlined. *) If we keep it inline, then we would just need a patch like this [1] Regarding Transactional memory: *) Very interesting *) Notice the end of section 8.2.1: Improper use of hints will not cause functional bugs though it may expose latent bugs already in the code.. So in other words, we can use XAQUIRE and XRELEASE without any problem in inline code, without binary compaibility issue *) The other transactional model does not really apply, but even if it would, we could still use some different value for locked and unlocked and the previously inline code falls back to the non-inline code. *) debugging tools (valgrind) could also detect the XAQUIRE and XRELEASE prefix. Regarding increasing the size *) I think it is valuable to have small memory overhead per mutexes. *) I think recursive mutex don't deserve improvements on the detriment of normal ones *) I think there would not be improvement on Windows. [1] http://paste.kde.org/482474/ -- Olivier Woboq - Qt services and support - http://woboq.com On Saturday 19 May 2012 10:26:03 Thiago Macieira wrote: Introducing the noop valgrind code (a 32-bit rotate) still consumes CPU resources. It will consume front-end decoding, one ALU port, the retirement, not to mention increased code size. There's no free lunch. Slower than a function call (which will likely spill register)? I don't beleive so. No, not slower. But there's a cost anyway, including a cost that isn't measured in CPU cycles -- the cost of opportunity of debugging. (moreever, we are talking about debug build, right?) No. We're talking about release builds too. That is not relevant. Transactional memory requires a different set of primitives. QMutex has nothing to do with transactional memory. Then you didn't read the manual. I'm going to ignore what you said until you read the manual because transactional memory has everythig to do with QMutex. Please, either believe me or read the manual. Do you have a link to that manuel? http://software.intel.com/file/41604 (see chapter 8) See also http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions Transactional memory is about detecting conflicts between transactions, and rolling back. Mutexes are about locking until the resource is free And the Intel TSX allows us to do both: they allow us to remove the lock and execute the same code optimistically, hoping there are no conflicts. If there are, it will roll back and will lock; if there wasn't, when you unlock it will atomically commit the transaction. Our QMutex is optimised for the non-contended case. The TSX will allow us to to even optimise the contended case where there was no data confilct. Transactional memory could be used to simplify the code that allocate the QMutexPrivate. But I doubt the hypothetical gain is even worth the function call. Because remember the uncontended case is much more critical than the contended case. (There is no point in winning a dozens of cycles if we are going to pay anyway several hendreds in the context switch that follow) But fast uncontended case is critical because it allows the use of mutex in part that might or not be used with threads. Example: QObject. We need to put mutexes while emiting a signal because maybe it is used in multiple thread. But we don't want to pay too much when used in a single thread (the common case) That's a very good example of where we want QMutex to be transactional: we need the mutex because we need to be sure that no other thread is modifying the connection list while the signal activation is reading the lists. However, we know that it's an extremely rare case for that to happen. If we have TSX in the signal emission path, then two threads could read from the signal list simultaneously. In other words, with TSX, QMutes becomes a QReadWriteLock with no extra data cost. My suggestion was to avoid the QMutexPrivate allocation on Mac and Windows, since they require just a bit of initialisation. Now, given what you said about semaphore_t, we may not be able to do it on the Mac. But we can try to apply the same optimisation for Windows -- the initialisation is a call to CreateEvent. But CreateEvent still probably allocate memory behind. True, but will you let me at least try? If you have data saying that allocating a windows event with CreateEvent for every single QMutex has a higher cost than the free list, please add that comment to qmutex_win.cpp. If you don't have such data, then I will try this. I disagree. If we shoot ourselves in the foot by not being able to support TSX and valgrind in Qt 5, we've lost a lot. That's why I propose de-inlining for 5.0, so we have enough time to investigate those potential
Re: [Development] Mutex future directions
snip, QMutex future-direction-discussion I've followed this very technical discussion from the beginning-- impressive array of topics. Thanks to all digging into this, and great thanks to Thiago for opening the issue. I have a question on the point of recursive-locks: I understand recursive-locks (e.g., recursive-lock-count) were not part of the original QMutex, and were added later. I understand it's merely a count-lock-wrapper around the *real* lock. Further, I understand that recursive-locks can help get-one-out-of-a-corner-in-which-they-find-themselves, where the lock was already taken by the current thread and we don't want to deadlock when accidentally attempting the lock again. This problem tends to occur in designs where the lock is implied for (small) transactional operations, *and* for (large) transactional operations that tend to trigger implicit execution of those (small) transactional operations *within* the large transactional operation. I agree with Oliver: *) I think recursive mutex don't deserve improvements on the detriment of normal ones From a practical standpoint, I understand why recursive mutexes exist (see description above). However, from a *logical/cleanliness* standpoint, every time I've used them, I've later re-factored to where I do *not* do mutex-recursion. (Currently I use no recursive mutexes.) Do people really use them intentionally (do designs legitimately require them), or are they as I suspect, a way to get-yourself-out-of-trouble when your design failed to account for proper transactional granularity? --charley ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On sábado, 19 de maio de 2012 18.34.30, Olivier Goffart wrote: Hi, Regarding valgrind: *) On debug build, nothing is inlined. *) If we keep it inline, then we would just need a patch like this [1] -fno-inline doesn't help because of -fvisibility-inlines-hidden. The call cannot be rerouted to valgrind. The annotation you added might help, but as I said, adding instructions -- even if they produce no architectural change -- still consumes CPU resources. I'd like to benchmark the annotation vs the function call. Regarding Transactional memory: *) Very interesting *) Notice the end of section 8.2.1: Improper use of hints will not cause functional bugs though it may expose latent bugs already in the code.. So in other words, we can use XAQUIRE and XRELEASE without any problem in inline code, without binary compaibility issue Indeed, but note that what it says about transactions that abort too often. If the transaction aborts, then the code needs to be re-run non-transactionally, with the lock. That means decreased performance and increased power consumption. Note also that all x87 instructions will abort, so any transactions around x87 code (32-bit floating point) would cause aborts. At this point, we don't know which mutex locks we should make transactional. As I said, neither you nor I have a Haswell prototype to test on. At the earliest, we'll be able to test this for Qt 5.2. *) The other transactional model does not really apply, but even if it would, we could still use some different value for locked and unlocked and the previously inline code falls back to the non-inline code. RTM might apply because we can get extra information about the abort and figure out whether that particular lock should use transactions the next time. *) debugging tools (valgrind) could also detect the XAQUIRE and XRELEASE prefix. Regarding increasing the size *) I think it is valuable to have small memory overhead per mutexes. *) I think recursive mutex don't deserve improvements on the detriment of normal ones *) I think there would not be improvement on Windows. If we don't improve for recursive mutexes and I simply don't try Windows or Mac, then there's no need to increase the mutex size. Then I need only to look into QReadWriteLock. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden signature.asc Description: This is a digitally signed message part. ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On sábado, 19 de maio de 2012 10.48.22, Charley Bay wrote: snip, QMutex future-direction-discussion I've followed this very technical discussion from the beginning-- impressive array of topics. Thanks to all digging into this, and great thanks to Thiago for opening the issue. I have a question on the point of recursive-locks: I understand recursive-locks (e.g., recursive-lock-count) were not part of the original QMutex, and were added later. I understand it's merely a count-lock-wrapper around the *real* lock. Actually, no. QMutex has always supported recursive mutexes. The current implementation is such that only the non-recursive kind is optimised. The reason for that is that implementing them with futexes, Mach semaphores and Windows events was quite easy. By bypassing pthreads, we gain a little in performance. Implementing the recursive mutex on top of that wasn't very difficult. Though, to be honest, it requires constant reviewing by other people. We had the same solution in libdbus-1 and it had race conditions gone unnoticed for years. It also took the glibc developers 2 years to get all the locking primitives right on pthread. Further, I understand that recursive-locks can help get-one-out-of-a-corner-in-which-they-find-themselves, where the lock was already taken by the current thread and we don't want to deadlock when accidentally attempting the lock again. This problem tends to occur in designs where the lock is implied for (small) transactional operations, *and* for (large) transactional operations that tend to trigger implicit execution of those (small) transactional operations *within* the large transactional operation. Often, it's for the big locks that you can't completely assure that no recursion will happen. The small ones usually protect a very limited code path and it's doable to prove that it will never recurse. For that reason, since you have lots of code to protect and you're likely not to be locking and unlocking that often, the small loss of performance is acceptable. That said, the recursive lock isn't that much worse. It's actually quite good. I agree with Oliver: *) I think recursive mutex don't deserve improvements on the detriment of normal ones From a practical standpoint, I understand why recursive mutexes exist (see description above). However, from a *logical/cleanliness* standpoint, every time I've used them, I've later re-factored to where I do *not* do mutex-recursion. (Currently I use no recursive mutexes.) Do people really use them intentionally (do designs legitimately require them), or are they as I suspect, a way to get-yourself-out-of-trouble when your design failed to account for proper transactional granularity? There are a few cases where recursive locks are legitimate uses. For example, in QtDBus, one of the mutexes in QDBusConnection is recursive because we are calling out to the libdbus-1 library to handle some condition, which may cause it to raise further changes to the socket notifiers and timers via the same callbacks that are used by other threads. The change can be executed on this thread while the mutex is being held, but not so by other threads. In other cases, you're probably right that the recursive mutex is just a gimmick to I can't prove it will not recurse. But then you need to be careful in writing your code that operates under the lock, like I did for QtDBus. A recursion from the same thread can be just as bad as simultaneous incursion from another thread -- e.g., imagine you kept an iterator to a vector, but the recursed call caused a reallocation. So in fact, the rules of thumb about locks are: - protect the smallest code path that you really need to protect - do not call out to functions you don't control, never call back to the user To protect the smallest code path, you often need lots of locks and lots of locking, which is the goal of the current implementation by Olivier and Brad (cheap to create, cheap to lock if not contended). -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden signature.asc Description: This is a digitally signed message part. ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On Saturday 19 May 2012 19:05:58 Thiago Macieira wrote: On sábado, 19 de maio de 2012 18.34.30, Olivier Goffart wrote: Hi, Regarding valgrind: *) On debug build, nothing is inlined. *) If we keep it inline, then we would just need a patch like this [1] -fno-inline doesn't help because of -fvisibility-inlines-hidden. The call cannot be rerouted to valgrind. Visibility does not really matter for valgrind. It does address redirection, using the debug symbols. You can see it works by trying the attached wrapper.c The annotation you added might help, but as I said, adding instructions -- even if they produce no architectural change -- still consumes CPU resources. I'd like to benchmark the annotation vs the function call. Yes, they have a cost which I am not sure we want to pay on release build. Regarding Transactional memory: *) Notice the end of section 8.2.1: Improper use of hints will not cause functional bugs though it may expose latent bugs already in the code.. So in other words, we can use XAQUIRE and XRELEASE without any problem in inline code, without binary compatibility issue Indeed, but note that what it says about transactions that abort too often. If the transaction aborts, then the code needs to be re-run non-transactionally, with the lock. That means decreased performance and increased power consumption. Yes, but we are talking about the rare case in which a QMutex is shared between two different objects compiled with different version of Qt. And in that unlikely case, one can just recompile to fix the performance issue. Note also that all x87 instructions will abort, so any transactions around x87 code (32-bit floating point) would cause aborts. At this point, we don't know which mutex locks we should make transactional. As I said, neither you nor I have a Haswell prototype to test on. At the earliest, we'll be able to test this for Qt 5.2. Indeed, QMutex can be used for all sort of cases. There can be also way too much code in the critical section to fit into the transaction cache. Or maybe there is side effects. QMutexLocker lock(mutex) qDebug() What now? does it also restart the transaction? So it is probably bad to do the lock elision within QMutex... We need to test it on real hardware to see if it works. But my point is that the current QMutex architecture does not keep us from using lock elision later. -- Olivier Woboq - Qt services and support - http://woboq.com /* * Wrapper for helgrind that works with Qt5 mutexes * * Compile and run with: gcc -shared -fPIC -o wrapper.so wrapper.c LD_PRELOAD=wrapper.so valgrind -tool=helgrind Qt5 application * * A debug build is required * * Olivier Goffart ogoff...@woboq.com */ #include stdio.h #include valgrind/valgrind.h #include valgrind/helgrind.h void I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex4lockEv)( void *mutex ) { OrigFn fn; VALGRIND_GET_ORIG_FN(fn); // printf(LOCK %p \n, mutex); DO_CREQ_v_WW(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_PRE, void*, mutex, long, 0); CALL_FN_v_W(fn, mutex); DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_POST, void*, mutex); } void I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex6unlockEv)( void *mutex ) { OrigFn fn; VALGRIND_GET_ORIG_FN(fn); // printf(UNLOCK %p \n, mutex); DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_UNLOCK_PRE, void*, mutex); CALL_FN_v_W(fn, mutex); DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_UNLOCK_POST, void*, mutex); } void I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex7tryLockEi)( void *mutex, int timeout ) { OrigFn fn; VALGRIND_GET_ORIG_FN(fn); // printf(TRYLOCK %p %d\n, mutex, timeout); DO_CREQ_v_WW(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_PRE, void*, mutex, long, 1); CALL_FN_v_WW(fn, mutex, timeout); DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_POST, void*, mutex); } ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On Sunday 20 May 2012 02:17:53 Olivier Goffart wrote: You can see it works by trying the attached wrapper.c Actually, I realize my implementation of the hook for tryLock was wrong as i forgot the return value. Fixed, just in case someone wants to use it. -- Olivier Woboq - Qt services and support - http://woboq.com/* * Wrapper for helgrind that works with Qt5 mutexes * * Compile and run with: gcc -shared -fPIC -o wrapper.so wrapper.c LD_PRELOAD=wrapper.so valgrind -tool=helgrind Qt5 application * * A debug build is required * * Olivier Goffart ogoff...@woboq.com */ #include stdio.h #include valgrind/valgrind.h #include valgrind/helgrind.h void I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex4lockEv)( void *mutex ) { OrigFn fn; VALGRIND_GET_ORIG_FN(fn); // printf(LOCK %p \n, mutex); DO_CREQ_v_WW(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_PRE, void*, mutex, long, 0); CALL_FN_v_W(fn, mutex); DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_POST, void*, mutex); } void I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex6unlockEv)( void *mutex ) { OrigFn fn; VALGRIND_GET_ORIG_FN(fn); // printf(UNLOCK %p \n, mutex); DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_UNLOCK_PRE, void*, mutex); CALL_FN_v_W(fn, mutex); DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_UNLOCK_POST, void*, mutex); } long I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex7tryLockEi)( void *mutex, int timeout ) { OrigFn fn; long ret; VALGRIND_GET_ORIG_FN(fn); // printf(TRYLOCK %p %d\n, mutex, timeout); DO_CREQ_v_WW(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_PRE, void*, mutex, long, 1); CALL_FN_W_WW(ret, fn, mutex, timeout); if (ret 0xff) DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_POST, void*, mutex); return ret; } ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On sexta-feira, 18 de maio de 2012 20.34.51, Thiago Macieira wrote: (P1) expand the size of QBasicMutex and QMutex to accommodate more data, for Mac and Windows. On Windows, the extra data required is one pointer (a HANDLE), whereas on Mac it's a semaphore_t. Depending on the next task, we might need a bit more space the next task was the recursive Mutex one, before I reordered my recommendations around. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center Intel Sweden AB - Registration Number: 556189-6027 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden signature.asc Description: This is a digitally signed message part. ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] Mutex future directions
On Friday 18 May 2012 20:34:51 Thiago Macieira wrote: Hello I've just completed a review of the QMutex family internals, as well as the Intel optimisation manuals about threading, data sharing, locking and the Transactional Memory extensions. Short story: I recommend de-inlining QMutex code for Qt 5.0. We can re-inline later. I also recommend increasing the QBasicMutex / QMutex size to accommodate more data without the pointer indirection. Further, some work needs to be done to support Valgrind. Details below. The inlining showed to improve a lot the performences for performence-critical operation where two function calls matters. It improved performence of signal emition for example. Now, if you can prove me that the code of the function call is neglectible compared to the atomic operation, maybe i can change my mind. But last time i checked, it was usefull. As for the size, I think it is really nice to have a light QMutex. That way, you can have more QMutex, and then more granularity. None of your arguments convinces me it is a good idea to increase the size. Assigning a QMutexPrivate to a contended QMutex should be fast enough. Long story follows. Current state; QBasicMutex is an internal POD class that offers non-recursive locking. It's incredibly efficient for Linux, where the single pointer-sized member variable is enough to execute the futex operations, in all cases. For other platforms, we get the same efficiency in non-contended cases, but incur a non-negligible performance penalty when contention happens. Which non-negligible performance penalty are you takling about here? QBasicMutex's documentation also has a note saying that timed locks may not work and may cause memory leaks. It works. But it might leak the QMutexPrivate if the QMutex destructor is not run. This is ususally not a problem because QBasicMutex is meant to be a global object. (also, who uses tryLock with a timeout?) QMutex builds on QBasicMutex. It fixes the memory leak (I assume) and provides support for recursive mutexes. A recursive QMutex is simply a QMutex with an owner ID and a recursion counter. Notice that recusrsive mutex should be advised against. (google for recursive mutex: http://www.zaval.org/resources/library/butenhof1.html ) QWaitCondition, QSemaphore, and QReadWriteLock are not optimised at all[...] Yes, feel free to optimize, but this is unrelated to QMutex. Valgrind's helgrind and DRD tools can currently operate on Qt locks by preloading a library and hijacking QMutex functions. I have tested helgrinding Qt4 applications in the past. I do not see a library version number in the symbols, so it's possible that helgrind would work unmodified with Qt 5 if we were to de-inline the functions. However, we should probably approach the Valgrind community to make sure that the Qt 5 mutexes work. Yes, there should be some NOOP we could introduce in debug to hint those tools. But that can work on the inline code. The Intel optimisation manual says that data sharing is most efficient when each thread or core is operating on a disjoint set of cachelines. That is, if you have two threads running, they are most there are no writes by both threads to the same cacheline (or cache sector of 128 bytes on Pentium 4). Shared reading is fine. While this is the Intel optimisation manual, the recommendation is probably a good rule of thumb for any architecture. There some other optimisation hints about using pipelined locks and about cache aliasing at 64 kB and 1 MB, but those are higher-level problems than we can solve at the lock level. The TSX manual says that transactional memory contention happens at the cacheline level. That is, if the transaction reads from a cacheline that is modified outside the transaction, or if the transaction writes to a cacheline that is read from or written to outside the transaction. I do not believe this to be more of a problem than the above optimisation guideline, which is something for the higher-level organisation: do not put two independent lock variables in the same cacheline. There are two types of instructions to start and finish a transaction. One pair is backwards compatible with existing processors and could be inserted to every single mutex lock and unlock, even in inline code (which might serve as a hint to valgrind, for example). The other pair requires checking the processor CPUID first. However, there are no processors in the market with transactional memory support and I don't have access to a prototype (yet, anyway), so at this point we simply have no idea whether enabling transactions for all mutex locks is a good idea. If enabling for them all isn't a good idea, the code for being adaptative cannot be inlined and we're further bound by the current inline code in QMutex. That is not relevant. Transactional memory requires a different set of primitives. QMutex has nothing to do
Re: [Development] Mutex future directions
On sexta-feira, 18 de maio de 2012 22.45.22, Olivier Goffart wrote: On Friday 18 May 2012 20:34:51 Thiago Macieira wrote: Hello I've just completed a review of the QMutex family internals, as well as the Intel optimisation manuals about threading, data sharing, locking and the Transactional Memory extensions. Short story: I recommend de-inlining QMutex code for Qt 5.0. We can re-inline later. I also recommend increasing the QBasicMutex / QMutex size to accommodate more data without the pointer indirection. Further, some work needs to be done to support Valgrind. Details below. The inlining showed to improve a lot the performences for performence-critical operation where two function calls matters. It improved performence of signal emition for example. Now, if you can prove me that the code of the function call is neglectible compared to the atomic operation, maybe i can change my mind. But last time i checked, it was usefull. No, I can't prove that. Of course that doing operations X and Y is more expensive than doing just Y. The function calls do have an impact in performance. But it's not performance I want to trade off. As for the size, I think it is really nice to have a light QMutex. That way, you can have more QMutex, and then more granularity. None of your arguments convinces me it is a good idea to increase the size. Assigning a QMutexPrivate to a contended QMutex should be fast enough. But it's not necessary. In fact, the size of QMutex would be platform dependent. On Linux, it would remain the size of a pointer, on Windows it would be two sizes of a pointer, on the Mac it would be the size of semaphore_t plus an integer (plus padding). For a recursive mutex, I don't mind keeping the private. But I want to investigate whether we can have a recursive QMutex without indirection with just a few more bytes. Long story follows. Current state; QBasicMutex is an internal POD class that offers non-recursive locking. It's incredibly efficient for Linux, where the single pointer-sized member variable is enough to execute the futex operations, in all cases. For other platforms, we get the same efficiency in non-contended cases, but incur a non-negligible performance penalty when contention happens. Which non-negligible performance penalty are you takling about here? The need to allocate memory and do the test-and-set loop for setting the private. QBasicMutex's documentation also has a note saying that timed locks may not work and may cause memory leaks. It works. But it might leak the QMutexPrivate if the QMutex destructor is not run. This is ususally not a problem because QBasicMutex is meant to be a global object. (also, who uses tryLock with a timeout?) Right, and users use the classes as intended... :-P Valgrind's helgrind and DRD tools can currently operate on Qt locks by preloading a library and hijacking QMutex functions. I have tested helgrinding Qt4 applications in the past. I do not see a library version number in the symbols, so it's possible that helgrind would work unmodified with Qt 5 if we were to de-inline the functions. However, we should probably approach the Valgrind community to make sure that the Qt 5 mutexes work. Yes, there should be some NOOP we could introduce in debug to hint those tools. But that can work on the inline code. Introducing the noop valgrind code (a 32-bit rotate) still consumes CPU resources. It will consume front-end decoding, one ALU port, the retirement, not to mention increased code size. There's no free lunch. However, there are no processors in the market with transactional memory support and I don't have access to a prototype (yet, anyway), so at this point we simply have no idea whether enabling transactions for all mutex locks is a good idea. If enabling for them all isn't a good idea, the code for being adaptative cannot be inlined and we're further bound by the current inline code in QMutex. That is not relevant. Transactional memory requires a different set of primitives. QMutex has nothing to do with transactional memory. Then you didn't read the manual. I'm going to ignore what you said until you read the manual because transactional memory has everythig to do with QMutex. Please, either believe me or read the manual. Recommendations (priority): (P0) de-inline QBasicMutex locking functions until we solve some or all of the below problems What problems exactly? The ones listed below: (P1) expand the size of QBasicMutex and QMutex to accommodate more data, for Mac and Windows. On Windows, the extra data required is one pointer (a HANDLE), whereas on Mac it's a semaphore_t. Depending on the next task, we might need a bit more space. At first glance, all three implementations would work fine with a lock state and the wait structure -- with the futex implementation doing both in one single integer. Those HANDLE or
Re: [Development] Mutex future directions
On Friday 18 May 2012 23:25:47 Thiago Macieira wrote: QBasicMutex is an internal POD class that offers non-recursive locking. It's incredibly efficient for Linux, where the single pointer-sized member variable is enough to execute the futex operations, in all cases. For other platforms, we get the same efficiency in non-contended cases, but incur a non-negligible performance penalty when contention happens. Which non-negligible performance penalty are you takling about here? The need to allocate memory and do the test-and-set loop for setting the private. There is no memory allocation, it is using the QFreeList. And this happens when we are about to do a context switch. So I beleive the test-and-set loop should be neglectible. Right, and users use the classes as intended... :-P (QBasicMutex is an undocumented internal class) Introducing the noop valgrind code (a 32-bit rotate) still consumes CPU resources. It will consume front-end decoding, one ALU port, the retirement, not to mention increased code size. There's no free lunch. Slower than a function call (which will likely spill register)? I don't beleive so. (moreever, we are talking about debug build, right?) That is not relevant. Transactional memory requires a different set of primitives. QMutex has nothing to do with transactional memory. Then you didn't read the manual. I'm going to ignore what you said until you read the manual because transactional memory has everythig to do with QMutex. Please, either believe me or read the manual. Do you have a link to that manuel? Transactional memory is about detecting conflicts between transactions, and rolling back. Mutexes are about locking until the resource is free Transactional memory could be used to simplify the code that allocate the QMutexPrivate. But I doubt the hypothetical gain is even worth the function call. Because remember the uncontended case is much more critical than the contended case. (There is no point in winning a dozens of cycles if we are going to pay anyway several hendreds in the context switch that follow) But fast uncontended case is critical because it allows the use of mutex in part that might or not be used with threads. Example: QObject. We need to put mutexes while emiting a signal because maybe it is used in multiple thread. But we don't want to pay too much when used in a single thread (the common case) (P2) optimise the Mac and Windows implementations to avoid the need for allocating a dynamic d pointer in case of contention. In fact, remove the need for dynamic d-pointer allocation altogether: Mac, Windows and Linux should never do it, while the generic Unix implementation should do it all the time in the constructor The allocation of the d-ponter is not that dynamic. It is already quite optimized in Qt5 My suggestion was to avoid the QMutexPrivate allocation on Mac and Windows, since they require just a bit of initialisation. Now, given what you said about semaphore_t, we may not be able to do it on the Mac. But we can try to apply the same optimisation for Windows -- the initialisation is a call to CreateEvent. But CreateEvent still probably allocate memory behind. I beleive there is more important priority right now than touching QMutex, which already had its lifting for Qt5. I disagree. If we shoot ourselves in the foot by not being able to support TSX and valgrind in Qt 5, we've lost a lot. That's why I propose de-inlining for 5.0, so we have enough time to investigate those potential drawbacks. I think the inline is not a problem for valgrind. And I don't think we can gain much with TSX. And even if we could, we still do pretty much everything in a binary compatible way despite the inlines (We can say 2 means unlocked and 3 locked, then the old code fallbacks to the non-inline case (The first lock would handle the 0 or 1)) Is it not however shooting ourself in the feet not to inline it? Because we hardly can inline it later in a binary compatible way. -- Olivier Woboq - Qt services and support - http://woboq.com ___ Development mailing list Development@qt-project.org http://lists.qt-project.org/mailman/listinfo/development