Re: [Development] Mutex future directions

2012-05-23 Thread Thiago Macieira
On quarta-feira, 23 de maio de 2012 04.54.11, bradley.hug...@nokia.com wrote:
 On May 22, 2012, at 11:01 AM, ext Thiago Macieira wrote:
  And I don't think they very much liked the idea of spinning while trying
  to
  acquire a lock (consumes power)... at least we have a call to
  QThread::yieldCurrentThread(), though it would be interesting to see what
  happens if we replace it with a PAUSE instruction.

 Have you seen the spin code in 4.8? It does use
 QThread::yieldCurrentThread(). ;)

Indeed it does. I'm saying that it would be interesting to check what happens
if we replace it with a simple PAUSE instruction. We might be able to achieve
the same result with a much smaller overhead.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden




Re: [Development] Mutex future directions

2012-05-23 Thread Carsten Breuer
Hi Bradley,

 And I don't think they very much liked the idea of spinning while trying to 
 acquire a lock (consumes power)... at least we have a call to 
 QThread::yieldCurrentThread(), though it would be interesting to see what 
 happens if we replace it with a PAUSE instruction.
 Have you seen the spin code in 4.8? It does use 
 QThread::yieldCurrentThread(). ;)

I guess it depends a bit on whether you want to yield to another thread or
not ;-). Of course, burning CPU time is another way to solve this ;-)

Greetings,


Carsten

 
 --
 Bradley T. Hughes
 bradley.hug...@nokia.com
 


Re: [Development] Mutex future directions

2012-05-23 Thread Thiago Macieira
On quarta-feira, 23 de maio de 2012 22.52.55, Carsten Breuer wrote:
 Hi Bradley,

  And I don't think they very much liked the idea of spinning while trying
  to
  acquire a lock (consumes power)... at least we have a call to
  QThread::yieldCurrentThread(), though it would be interesting to see what
  happens if we replace it with a PAUSE instruction.
 
  Have you seen the spin code in 4.8? It does use
  QThread::yieldCurrentThread(). ;)
 I guess it depends a bit on if you want to yield to another thread or
 not ;-). Of course: burning cpu time is another way to solve this ;-)

According to the Intel manual, if you *don't* yield and you don't have the
PAUSE instruction, you may make things worse.

For one thing, if you don't yield, the processor will continue executing your
code without giving other processes a chance to run. There's no guarantee,
even in a multiprocessor system, that the other thread is running. And even if
it is running, if it was scheduled onto the sibling logical processor of the
same hyperthreaded core, the processor may not execute that thread because
this thread still has instructions to execute.

To make matters worse, if you spin too quickly, you issue a LOT of atomic
compare-and-swap operations, which block the bus. So even if the other thread
is scheduled by the OS and is running on a different core, it may suffer
memory stalls.

Hence one of the most counter-intuitive (at first glance) recommendations: to
make your spin lock loops faster, insert a PAUSE :-)
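
For illustration, a test-and-test-and-set spin of that kind might look
roughly like the sketch below (not the QMutex code; it assumes x86 and the
_mm_pause() intrinsic from <immintrin.h>):

#include <immintrin.h>   // _mm_pause(), x86 only
#include <atomic>

// Sketch of a spin lock that avoids hammering the bus: the inner loop only
// does relaxed loads plus PAUSE, and the locked compare-exchange is retried
// only once the lock looks free again.
struct SpinLock {
    std::atomic<int> locked {0};

    void lock() {
        for (;;) {
            int expected = 0;
            if (locked.compare_exchange_weak(expected, 1,
                                             std::memory_order_acquire))
                return;                       // got the lock
            while (locked.load(std::memory_order_relaxed) != 0)
                _mm_pause();                  // spin without issuing atomics
        }
    }

    void unlock() { locked.store(0, std::memory_order_release); }
};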

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden




Re: [Development] Mutex future directions

2012-05-23 Thread Carsten Breuer
Hi Thiago,

thanks for clearing this up :-)

 I guess it depends a bit on if you want to yield to another thread or
 not ;-). Of course: burning cpu time is another way to solve this ;-)
 
 According to the Intel manual, if you *don't* yield and you don't have the 
 PAUSE instruction, you may make things worse. 

What exactly do you mean by a PAUSE instruction? Is it an assembler-level
processor instruction?

 For one thing, if you don't yield, the processor will continue executing your 
 code, without giving a chance for other processes to run. There's no 
 guarantee 
 even in a multiprocessor system that the other thread is running. And even if 
 it's running, if it was scheduled to another thread on the same hyperthreaded 
 core, the processor may not execute the that thread because this thread still 
 has instructions to execute.

Interesting. Wouldn't it be better to change yieldCurrentThread()?
I guess most of us are not aware of this, and some of us learned back in the
386 era that giving CPU time back to the OS is always a good thing. I guess
it is still a good thing nowadays on single-core machines, isn't it?

Thanks and Best Regards,



Carsten


Re: [Development] Mutex future directions

2012-05-23 Thread Thiago Macieira
On quarta-feira, 23 de maio de 2012 23.44.57, Carsten Breuer wrote:
 Hi Thiago,

 thanks for clearing this :-)

  I guess it depends a bit on if you want to yield to another thread or
  not ;-). Of course: burning cpu time is another way to solve this ;-)
 
  According to the Intel manual, if you *don't* yield and you don't have the
  PAUSE instruction, you may make things worse.

 What do you exactly mean with a PAUSE instruction. Is this a assembler
 processor command?

The PAUSE instruction on x86. I don't know of an equivalent in other
architectures.

Trivia: the instruction is encoded as REP NOP.
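
For reference, a few equivalent ways to emit it from C or C++ (a sketch;
availability varies by compiler):

asm volatile("pause");      // the mnemonic
asm volatile("rep; nop");   // identical encoding (F3 90); plain NOP on old CPUs
_mm_pause();                // intrinsic from <immintrin.h> (GCC, Clang, ICC, MSVC)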


  For one thing, if you don't yield, the processor will continue executing
  your code, without giving a chance for other processes to run. There's no
  guarantee even in a multiprocessor system that the other thread is
  running. And even if it's running, if it was scheduled to another thread
  on the same hyperthreaded core, the processor may not execute the that
  thread because this thread still has instructions to execute.

 Interesting. Wouldn't it be better to change YieldCurrentThread?
 I guess, most of us are not aware of this and some of us
 learned in the 386 century that give back CPU performance to the OS is
 always a good thing. I guess it is even nowadays a good thing on single
 core machines. Isn't it?

The yieldCurrentThread call is a system call: it tells the OS to think about
scheduling another thread. The PAUSE instruction is a processor-only hint.
They have different purposes and we should benchmark.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden




Re: [Development] Mutex future directions

2012-05-22 Thread bradley.hughes

On May 21, 2012, at 2:58 PM, ext Thiago Macieira wrote:

 On segunda-feira, 21 de maio de 2012 08.34.32, bradley.hug...@nokia.com wrote:
 On May 18, 2012, at 8:34 PM, ext Thiago Macieira wrote:
 Recommendations (priority):
 
 (P0) de-inline QBasicMutex locking functions until we solve some or all of
 the below problems
 
 I agree with this, so that it gives a chance to fix the performance
 regressions on Mac at a later date (since it probably won't be fixed before
 5.0 is released).
 
 Some notes from the IRC discussion this morning between Olivier, Brad and 
 myself:
 
 * QMutex contended performance has dropped considerably on Mac from 4.8 to 
 5.0 (it's 10x slower)
 * QMutex contended performance on Mac is now actually similar to the 
 pthread_mutex_t performance (read: contended QMutex on 4.8 is 10x faster than 
 pthread_mutex_t)
 * changing the QMutex implementation to use the generic Unix codepath on Mac 
 makes it 2x slower
 * the non-Linux code in QBasicMutex::lockInternal is considered complex and 
 hard to read by both Brad and myself
 
 Brad: could you please provide what is, to the best of your knowledge today, 
 the combination of tricks that made 4.8 fast?

The trick was the adaptive spin, added and modified over a series of commits
in 4.8. The biggest gain was on Mac; Linux performance didn't change
noticeably, and Windows got a small gain too (as far as I recall).

 * QMutex de-inlining and the Mac performance issues are orthogonal.
 * QMutex de-inlining should be understood more correctly as: removing the 
 testAndSet calls from the inline functions. The inline functions should 
 remain 
 inline.
 * The de-inlining is important for Valgrind (helgrind / DRD) to work 
 properly, even in release mode

Lars and I had a conversation in the hallway about QMutex performance on
Windows. It's been a while since I last tested, but I recall that QMutex
didn't outperform CRITICAL_SECTIONs. De-inlining is necessary so that we can
make QMutex nothing more than a wrapper around a CRITICAL_SECTION (since the
latter performs better).
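
A rough sketch of what such a thin wrapper could look like (illustrative
only, not an actual patch):

#include <windows.h>

// Hypothetical thin wrapper: a mutex that is nothing more than a
// CRITICAL_SECTION. The critical section already spins briefly before
// falling back to a kernel wait (tunable via
// InitializeCriticalSectionAndSpinCount if wanted).
class WinMutex {
public:
    WinMutex()  { InitializeCriticalSection(&cs); }
    ~WinMutex() { DeleteCriticalSection(&cs); }
    void lock()    { EnterCriticalSection(&cs); }
    bool tryLock() { return TryEnterCriticalSection(&cs) != 0; }
    void unlock()  { LeaveCriticalSection(&cs); }
private:
    CRITICAL_SECTION cs;
};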

So far, we've got 3 votes for de-inlining: Thiago, Lars, and myself. For the 
few cases where inlining matters, we can inline inside Qt at those locations 
(QMetaObject::activate() would be the first place to check).

 Note that there's another trick that QMutex can apply under valgrind but 
 QBasicMutex cannot: if the QMutex constructor initialises the d pointer to 
 anything non-null and different from 3, the inlined testAndSet will fail, so 
 valgrind can properly hijack the lock and unlock functions.
 
 -- 
 Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden

--
Bradley T. Hughes
bradley.hug...@nokia.com



Re: [Development] Mutex future directions

2012-05-22 Thread Thiago Macieira
On terça-feira, 22 de maio de 2012 07.12.26, bradley.hug...@nokia.com wrote:
 So far, we've got 3 votes for de-inlining: Thiago, Lars, and myself. For the
 few cases where inlining matters, we can inline inside Qt at those
 locations (QMetaObject::activate() would be the first place to check).

Turns out that I got some unexpected extra ammo for this decision. Apparently,
some people from Intel found out about this thread and came to me to ask WTH I
was thinking. Fortunately, they also offered help :-)

We discussed a few ideas back and forth. Summarising:

 - inlining the lock code should be done on a need-to-do basis, only where the
performance is really critical

 - whether such inlined locks should use Transactional Memory should be
decided only after we've done testing to confirm the performance

 - for generic locks, the overhead of a function call is usually negligible
and will be drowned out by the rest of the work to be done

 - for those generic locks, transactional memory use needs to be adaptive.
That means we need to use the RTM instructions, which require a CPUID check.
That check and the presence of the abort handler mean that the transactional
memory code would be non-inline.

 - if we were to keep the inlined locking code, we'd need to change the
default value of QMutex's d pointer so that the inlined compare-and-swap
always fails. That means the code path for locking a mutex on x86 will always
include a *failed* compare-and-exchange on Haswell.

And I don't think they very much liked the idea of spinning while trying to
acquire a lock (consumes power)... at least we have a call to
QThread::yieldCurrentThread(), though it would be interesting to see what
happens if we replace it with a PAUSE instruction.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden




Re: [Development] Mutex future directions

2012-05-22 Thread bradley.hughes

On May 22, 2012, at 11:01 AM, ext Thiago Macieira wrote:
 And I don't think they very much liked the idea of spinning while trying to 
 acquire a lock (consumes power)... at least we have a call to 
 QThread::yieldCurrentThread(), though it would be interesting to see what 
 happens if we replace it with a PAUSE instruction.

Have you seen the spin code in 4.8? It does use QThread::yieldCurrentThread(). 
;)

--
Bradley T. Hughes
bradley.hug...@nokia.com



Re: [Development] Mutex future directions

2012-05-21 Thread bradley.hughes

On May 18, 2012, at 8:34 PM, ext Thiago Macieira wrote:

 Recommendations (priority):
 
 (P0) de-inline QBasicMutex locking functions until we solve some or all of 
 the 
 below problems

I agree with this, so that it gives a chance to fix the performance regressions 
on Mac at a later date (since it probably won't be fixed before 5.0 is 
released).

--
Bradley T. Hughes
bradley.hug...@nokia.com



Re: [Development] Mutex future directions

2012-05-21 Thread Thiago Macieira
On segunda-feira, 21 de maio de 2012 08.34.32, bradley.hug...@nokia.com wrote:
 On May 18, 2012, at 8:34 PM, ext Thiago Macieira wrote:
  Recommendations (priority):
 
  (P0) de-inline QBasicMutex locking functions until we solve some or all of
  the below problems

 I agree with this, so that it gives a chance to fix the performance
 regressions on Mac at a later date (since it probably won't be fixed before
 5.0 is released).

Some notes from the IRC discussion this morning between Olivier, Brad and
myself:

 * QMutex contended performance has dropped considerably on Mac from 4.8 to
5.0 (it's 10x slower)
 * QMutex contended performance on Mac is now actually similar to the
pthread_mutex_t performance (read: contended QMutex on 4.8 is 10x faster than
pthread_mutex_t)
 * changing the QMutex implementation to use the generic Unix codepath on Mac
makes it 2x slower
 * the non-Linux code in QBasicMutex::lockInternal is considered complex and
hard to read by both Brad and myself

Brad: could you please provide what is, to the best of your knowledge today,
the combination of tricks that made 4.8 fast?

 * QMutex de-inlining and the Mac performance issues are orthogonal.
 * QMutex de-inlining should be understood more correctly as: removing the
testAndSet calls from the inline functions. The inline functions should remain
inline.
 * The de-inlining is important for Valgrind (helgrind / DRD) to work
properly, even in release mode

Note that there's another trick that QMutex can apply under valgrind but
QBasicMutex cannot: if the QMutex constructor initialises the d pointer to
anything non-null and different from 3, the inlined testAndSet will fail, so
valgrind can properly hijack the lock and unlock functions.
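
A sketch of that idea, with made-up names rather than the real qmutex.h code:

#include <QAtomicPointer>

struct SketchMutex {
    QAtomicPointer<void> d;          // 0 = unlocked, 0x1 = locked, uncontended

    void lock() {
        // Inline fast path: succeeds only if d is 0.
        if (!d.testAndSetAcquire(nullptr, reinterpret_cast<void *>(0x1)))
            lockInternal();          // out of line, so helgrind can intercept it
    }
    void lockInternal();             // defined in the .cpp, never inlined
};

// Constructing d with any value that is neither 0 nor the "locked" sentinel
// (say, reinterpret_cast<void *>(0x4)) makes the inline testAndSet fail every
// time, so all locking goes through the interceptable out-of-line functions.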

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden




Re: [Development] Mutex future directions

2012-05-20 Thread Thiago Macieira
On domingo, 20 de maio de 2012 02.17.53, Olivier Goffart wrote:
 On Saturday 19 May 2012 19:05:58 Thiago Macieira wrote:
  On sábado, 19 de maio de 2012 18.34.30, Olivier Goffart wrote:
   Hi,
  
   Regarding valgrind:
*) On debug build, nothing is inlined.
*) If we keep it inline, then we would just need a patch like this [1]
 
  -fno-inline doesn't help because of -fvisibility-inlines-hidden. The call
  cannot be rerouted to valgrind.

 Visibility does not really matter for valgrind. It does address redirection,
 using the debug symbols.

I see...

After playing with an application under helgrind, in the debugger, I see it
doesn't actually use ELF symbol interposition or even my second option of
inserting jumps. Since valgrind is a CPU interpreter, it simply knows when
you've reached the beginning of an intercepted function and transfers control
to the interceptor.

Provided it knows that the same function can exist in multiple libraries, it's
fine.

Anyway, we still need to approach the valgrind community and settle the
question.

  The annotation you added might help, but as I said, adding instructions --
  even if they produce no architectural change -- still consumes CPU
  resources. I'd like to benchmark the annotation vs the function call.

 Yes, they have a cost which I am not sure we want to pay on release build.

But since we may want to helgrind release builds...

  Indeed, but note what it says about transactions that abort too often.
  If the transaction aborts, then the code needs to be re-run
  non-transactionally, with the lock. That means decreased performance and
  increased power consumption.

 Yes, but we are talking about the rare case in which a QMutex is shared
 between two different objects compiled with different version of Qt.
 And in that unlikely case, one can just recompile to fix the performance
 issue.

That's not what I meant. I meant that, if we were to add the XACQUIRE and
XRELEASE prefixes to all mutexes, we might end up with worse performance for
5.0 and 5.1 applications when run on Haswell, because we've never tested it.

Then again, I am asking for slightly decreased performance in all situations.

 Indeed, QMutex can be used for all sorts of cases. There can also be way too
 much code in the critical section to fit into the transaction cache. Or
 maybe there are side effects:

 QMutexLocker lock(mutex)
 qDebug() << "What now?"  -- does it also restart the transaction?

Yes, a SYSENTER will definitely cause a transaction abort.

Which is why it might be a good idea to use RTM instead of HLE in QMutex, so
we know which mutexes abort and we don't try again next time.

 So it is probably bad to do the lock elision within QMutex...
 We need to test it on real hardware to see if it works.

 But my point is that the current QMutex architecture does not keep us from
 using lock elision later.

Mixing different builds of QMutex might be even worse. If a lock is acquired
with HLE and released without, the transaction will keep running until it
aborts. And I have no clue what happens if you XRELEASE when no transaction is
running. It will definitely cause trouble if we use RTM.

Anyway, it might be something we can fix for 5.2, but are we prepared to take
the chance?

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden




Re: [Development] Mutex future directions

2012-05-19 Thread lars.knoll
A few comments from my side:

* I am not a big fan of inlining the public classes either. It feels a bit
like over-optimizing in the wrong place. The reason is that this might make
it impossible to refactor our code later on.
* It's OK to have some low-level inline class for internal use if that helps
us in performance-critical parts (such as the signal/slot code). Nothing says
this and the public QMutex class have to be the same.
* IMO we should really try to allow using tools such as helgrind with release
builds too.
* It would be good to see how much of a real-world (i.e. not artificial
benchmark) difference you get from inlining the mutex code. Is it really
relevant?

Cheers,
Lars

On 5/19/12 12:36 AM, ext Olivier Goffart oliv...@woboq.com wrote:

On Friday 18 May 2012 23:25:47 Thiago Macieira wrote:
   QBasicMutex is an internal POD class that offers non-recursive locking.
   It's incredibly efficient for Linux, where the single pointer-sized
   member variable is enough to execute the futex operations, in all cases.
   For other platforms, we get the same efficiency in non-contended cases,
   but incur a non-negligible performance penalty when contention happens.
  
  Which non-negligible performance penalty are you talking about here?
 
 The need to allocate memory and do the test-and-set loop for setting the
 private.

There is no memory allocation; it is using the QFreeList.
And this happens when we are about to do a context switch, so I believe the
test-and-set loop should be negligible.

 Right, and users use the classes as intended... :-P

(QBasicMutex is an undocumented internal class)

 Introducing the noop valgrind code (a 32-bit rotate) still consumes CPU
 resources. It will consume front-end decoding, one ALU port, the retirement,
 not to mention increased code size. There's no free lunch.

Slower than a function call (which will likely spill registers)? I don't
believe so.

(Moreover, we are talking about a debug build, right?)

  That is not relevant.
  Transactional memory requires a different set of primitives. QMutex has
  nothing to do with transactional memory.
 
 Then you didn't read the manual. I'm going to ignore what you said until you
 read the manual, because transactional memory has everything to do with
 QMutex.
 
 Please, either believe me or read the manual.

Do you have a link to that manual?

Transactional memory is about detecting conflicts between transactions, and
rolling back.
Mutexes are about locking until the resource is free.

Transactional memory could be used to simplify the code that allocates the
QMutexPrivate.

But I doubt the hypothetical gain is even worth the function call.

Because remember that the uncontended case is much more critical than the
contended case. (There is no point in winning a dozen cycles if we are going
to pay several hundred anyway in the context switch that follows.)

But a fast uncontended case is critical because it allows the use of a mutex
in parts that might or might not be used with threads. Example: QObject. We
need to take mutexes while emitting a signal because maybe it is used in
multiple threads. But we don't want to pay too much when used in a single
thread (the common case).

   (P2) optimise the Mac and Windows implementations to avoid the need for
   allocating a dynamic d pointer in case of contention. In fact, remove the
   need for dynamic d-pointer allocation altogether: Mac, Windows and Linux
   should never do it, while the generic Unix implementation should do it all
   the time in the constructor
  
  The allocation of the d-pointer is not that dynamic.
  It is already quite optimized in Qt 5.
 
 My suggestion was to avoid the QMutexPrivate allocation on Mac and Windows,
 since they require just a bit of initialisation. Now, given what you said
 about semaphore_t, we may not be able to do it on the Mac. But we can try to
 apply the same optimisation for Windows -- the initialisation is a call to
 CreateEvent.

But CreateEvent still probably allocates memory behind the scenes.

  I believe there are more important priorities right now than touching
  QMutex, which already had its overhaul for Qt 5.
 
 I disagree. If we shoot ourselves in the foot by not being able to support
 TSX and valgrind in Qt 5, we've lost a lot. That's why I propose
 de-inlining for 5.0, so we have enough time to investigate those potential
 drawbacks.

I think the inlining is not a problem for valgrind.

And I don't think we can gain much with TSX.

And even if we could, we could still do pretty much everything in a binary
compatible way despite the inlines. (We can say 2 means unlocked and 3 means
locked; then the old code falls back to the non-inline case, and the first
lock would handle the 0 or 1.)

Isn't it, however, shooting ourselves in the foot not to inline it? Because
we can hardly inline it later in a binary compatible way.

-- 
Olivier

Woboq - Qt services and support - http://woboq.com

Re: [Development] Mutex future directions

2012-05-19 Thread Thiago Macieira
On sábado, 19 de maio de 2012 00.36.56, Olivier Goffart wrote:
 On Friday 18 May 2012 23:25:47 Thiago Macieira wrote:
   Which non-negligible performance penalty are you talking about here?
  
  The need to allocate memory and do the test-and-set loop for setting the
  private.
 
 There is no memory allocation; it is using the QFreeList.
 And this happens when we are about to do a context switch, so I believe the
 test-and-set loop should be negligible.

True. My point is that it has a higher penalty than the Linux futex
implementation does and, at least in the Windows case, I think it's possible
to avoid it.
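
For readers who haven't followed the futex discussion: the reason the Linux
path needs no per-mutex allocation is that one integer plus the futex syscall
is already a complete non-recursive lock. Below is a simplified sketch in the
spirit of Drepper's "Futexes Are Tricky", not the Qt implementation:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>

static void futexWait(void *addr, int val)
{ syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, val, nullptr, nullptr, 0); }
static void futexWakeOne(void *addr)
{ syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0); }

struct FutexLock {                    // 0 = unlocked, 1 = locked, 2 = contended
    std::atomic<int> state {0};

    void lock() {
        int c = 0;
        if (state.compare_exchange_strong(c, 1))    // uncontended fast path
            return;
        if (c != 2)
            c = state.exchange(2);                  // announce a waiter
        while (c != 0) {
            futexWait(&state, 2);                   // sleep until it changes
            c = state.exchange(2);                  // try to grab it again
        }
    }

    void unlock() {
        if (state.exchange(0) == 2)                 // there may be waiters
            futexWakeOne(&state);
    }
};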

  Right, and users use the classes as intended... :-P

 (QBasicMutex is an undocumented internal class)

Again, true, but even our own developers make mistakes. As exhibit A, take
Q_GLOBAL_STATIC_WITH_INITIALIZER. See my email
"Q_GLOBAL_STATIC_WITH_INITIALIZER is harmful; please stop using it".

I'm not sure there's a way out for QBasicMutex, not without making it non-POD.

  Introducing the noop valgrind code (a 32-bit rotate) still consumes CPU
  resources. It will consume front-end decoding, one ALU port, the
  retirement, not to mention increased code size. There's no free lunch.

 Slower than a function call (which will likely spill registers)? I don't
 believe so.

No, not slower. But there's a cost anyway, including a cost that isn't
measured in CPU cycles -- the opportunity cost of debugging.

 (Moreover, we are talking about a debug build, right?)

No. We're talking about release builds too.

   That is not relevant.
   Transactional memory requires a different set of primitives. QMutex has
   nothing to do with transactional memory.
 
  Then you didn't read the manual. I'm going to ignore what you said until
  you read the manual, because transactional memory has everything to do with
  QMutex.
 
  Please, either believe me or read the manual.

 Do you have a link to that manual?

http://software.intel.com/file/41604 (see chapter 8)

See also
http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions

 Transactional memory is about detecting conflicts between transactions, and
 rolling back.
 Mutexes are about locking until the resource is free

And Intel TSX allows us to do both: it allows us to remove the lock and
execute the same code optimistically, hoping there are no conflicts. If there
are, it will roll back and take the lock; if there weren't, when you unlock,
it will atomically commit the transaction.

Our QMutex is optimised for the non-contended case. TSX will allow us to
optimise even the contended case where there was no data conflict.

 Transactional memory could be used to simplify the code that allocate the
 QMutexPrivate.

 But I doubt the hypothetical gain is even worth the function call.

 Because remember that the uncontended case is much more critical than the
 contended case. (There is no point in winning a dozen cycles if we are
 going to pay several hundred anyway in the context switch that follows.)

 But a fast uncontended case is critical because it allows the use of a mutex
 in parts that might or might not be used with threads. Example: QObject. We
 need to take mutexes while emitting a signal because maybe it is used in
 multiple threads. But we don't want to pay too much when used in a single
 thread (the common case).

That's a very good example of where we want QMutex to be transactional: we
need the mutex because we need to be sure that no other thread is modifying
the connection list while the signal activation is reading the lists. However,
we know that it's an extremely rare case for that to happen. If we have TSX in
the signal emission path, then two threads could read from the signal list
simultaneously.

In other words, with TSX, QMutex becomes a QReadWriteLock with no extra data
cost.
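
As a rough illustration of that idea (not proposed Qt code; it assumes RTM
support was already confirmed via CPUID, as discussed earlier), lock elision
with the RTM intrinsics could look like this:

#include <immintrin.h>   // _xbegin, _xend, _xabort, _mm_pause; build with -mrtm
#include <atomic>

struct ElidableSpinLock {
    std::atomic<int> held {0};

    template <typename F> void runCritical(F body) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Optimistic path: read (but never write) the lock word, so that
            // a real locker conflicts with this transaction and aborts it.
            if (held.load(std::memory_order_relaxed) == 0) {
                body();
                _xend();             // commit atomically; no lock was taken
                return;
            }
            _xabort(0xff);           // the lock is actually held right now
        }
        // Fallback after an abort (conflict, capacity, syscall, no RTM):
        // take the lock for real.
        int expected = 0;
        while (!held.compare_exchange_weak(expected, 1,
                                           std::memory_order_acquire)) {
            expected = 0;
            _mm_pause();
        }
        body();
        held.store(0, std::memory_order_release);
    }
};

Two threads whose critical sections only read, or write disjoint cachelines,
both commit without ever writing the lock word -- which is the
read-write-lock-like behaviour described above.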

  My suggestion was to avoid the QMutexPrivate allocation on Mac and
  Windows,
  since they require just a bit of initialisation. Now, given what you said
  about semaphore_t, we may not be able to do it on the Mac. But we can try
  to apply the same optimisation for Windows -- the initialisation is a
  call to CreateEvent.

 But CreateEvent still probably allocate memory behind.

True, but will you let me at least try?

If you have data saying that allocating a windows event with CreateEvent for
every single QMutex has a higher cost than the free list, please add that
comment to qmutex_win.cpp.

If you don't have such data, then I will try this.

  I disagree. If we shoot ourselves in the foot by not being able to support
  TSX and valgrind in Qt 5, we've lost a lot. That's why I propose
  de-inlining for 5.0, so we have enough time to investigate those potential
  drawbacks.

 I think the inline is not a problem for valgrind.

Then please add *now* the necessary valgrind macros and prove to me that it
works.

Note that proving that it works is a necessary condition for keeping them
inline. It's not a sufficient condition.

 And I don't think we can gain much with TSX.


Re: [Development] Mutex future directions

2012-05-19 Thread Thiago Macieira
On sábado, 19 de maio de 2012 07.30.56, lars.kn...@nokia.com wrote:
 * it would be good to see how much of a real world (ie. not with
 artificial benchmarks) difference you get due to inlining the mutex code.
 Is it really relevant?

Here's a suggestion for one: the Creator C++ parser. I remember Roberto had
some benchmarks for this, though I don't remember if the benchmarks were
threaded as well. I know that the actual parser in Creator is threaded via
QtConcurrent.

Another one might be the threaded scene graph.

I recommend using the Linux perf(1) tool for benchmarking[*]. The benchlib
tickcounter mode works too, but perf provides more information in the results.
I'll be surprised if the cost of the function call shows up as more than 1%
when parsing a medium-sized project (such as a small Qt library like QtSvg or
QtDBus).

[*] I've got a perf-mode for benchlib as a pending feature for Qt 5.1
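
A minimal contended-mutex benchmark along these lines could look like the
sketch below (illustrative only; the class and file names are made up, and it
would be run with -tickcounter or under perf stat):

#include <QtTest>
#include <QMutex>
#include <QtConcurrent/QtConcurrent>

class tst_ContendedMutex : public QObject
{
    Q_OBJECT
private slots:
    void contendedLockUnlock()
    {
        QMutex mutex;
        QBENCHMARK {
            // Four threads hammering the same mutex.
            auto work = [&mutex]() {
                for (int i = 0; i < 10000; ++i) {
                    mutex.lock();
                    mutex.unlock();
                }
            };
            QFuture<void> futures[4];
            for (int i = 0; i < 4; ++i)
                futures[i] = QtConcurrent::run(work);
            for (int i = 0; i < 4; ++i)
                futures[i].waitForFinished();
        }
    }
};

QTEST_MAIN(tst_ContendedMutex)
#include "tst_contendedmutex.moc"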

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden




Re: [Development] Mutex future directions

2012-05-19 Thread Thiago Macieira
On sábado, 19 de maio de 2012 10.26.03, Thiago Macieira wrote:
  But fast uncontended case is critical because it allows the use of mutex
  in
  part that might or not be used with threads. Example: QObject. We need to
  put mutexes while emiting a signal because maybe it is used in multiple
  thread. But we don't want to pay too much when used in a single thread
  (the
  common case)

 That's a very good example of where we want QMutex to be transactional: we
 need the mutex because we need to be sure that no other thread is modifying
 the connection list while the signal activation is reading the lists.
 However, we know that it's an extremely rare case for that to happen. If we
 have TSX in the signal emission path, then two threads could read from the
 signal list simultaneously.

As Lars's email pointed out, this is a non-example since it is located inside
QtCore. We can use the inline mutex as much as we want here and change it in
any way we want in future versions.

For that matter, the signal emission need not even use QBasicMutex. If we want
read-write-lock semantics, we can change to a different structure at any time,
including a TSX-optimised one.

 In other words, with TSX, QMutex becomes a QReadWriteLock with no extra
 data cost.
--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden




Re: [Development] Mutex future directions

2012-05-19 Thread Olivier Goffart
On Saturday 19 May 2012 07:30:56 lars.kn...@nokia.com wrote:
 A few comments from my side:
 
 * I am not a big fan of inlining the public classes neither. It feels a
 bit like over-optimizing at the wrong place. Reason is that this might
 make it a impossible later on to refactor our code.

In general, I agree. (This is why I was against making the new QMetaType API
inline.)

But in this case, we are talking about very few specially crafted
instructions on a critical path.

Nothing compared to the complexity of, say, QString or QVector.

The instructions are made so that any unexpected value falls back to
non-inline code.

 * It's ok to have some low level inline class for internal use if that
 helps us in performance critical parts (such as signal/slot code). Nothing
 says this and the public QMutex class have to be the same.

QBasicMutex is internal (but it is still used directly by QMutex).

 * IMO we should really try to allow using tools such as helgrind also with
 release build

This would indeed ease the use case of developers who are developing against
packaged versions of Qt.

But I believe they should still get a debug version of Qt, because the debug
version offers many more asserts and qWarnings that are stripped out of a
release build.

 * it would be good to see how much of a real world (ie. not with
 artificial benchmarks) difference you get due to inlining the mutex code.
 Is it really relevant?

I agree with that. One needs to run benchmarks :-)

-- 
Olivier

Woboq - Qt services and support - http://woboq.com



Re: [Development] Mutex future directions

2012-05-19 Thread Olivier Goffart
Hi,

Regarding valgrind:
 *) On debug builds, nothing is inlined.
 *) If we keep it inline, then we would just need a patch like this [1]

Regarding Transactional memory:
 *) Very interesting
 *) Notice the end of section 8.2.1: "Improper use of hints will not cause
functional bugs though it may expose latent bugs already in the code."
So in other words, we can use XACQUIRE and XRELEASE without any problem in
inline code, without binary compatibility issues
 *) The other transactional model does not really apply, but even if it did,
we could still use different values for locked and unlocked, and the
previously inlined code would fall back to the non-inline code.
 *) Debugging tools (valgrind) could also detect the XACQUIRE and XRELEASE
prefixes.

Regarding increasing the size:
 *) I think it is valuable to have a small memory overhead per mutex.
 *) I think recursive mutexes don't deserve improvements to the detriment of
normal ones.
 *) I think there would not be an improvement on Windows.


[1] http://paste.kde.org/482474/
 
-- 
Olivier

Woboq - Qt services and support - http://woboq.com



Re: [Development] Mutex future directions

2012-05-19 Thread Charley Bay
snip, QMutex future-direction-discussion

I've followed this very technical discussion from the beginning--
impressive array of topics.  Thanks to all digging into this, and great
thanks to Thiago for opening the issue.

I have a question on the point of recursive-locks:

I understand recursive-locks (e.g., recursive-lock-count) were not part
of the original QMutex, and were added later.  I understand it's merely a
count-lock-wrapper around the *real* lock.

Further, I understand that recursive-locks can help
get-one-out-of-a-corner-in-which-they-find-themselves, where the lock was
already taken by the current thread and we don't want to deadlock when
accidentally attempting the lock again.  This problem tends to occur in
designs where the lock is implied for (small) transactional operations,
*and* for (large) transactional operations that tend to trigger implicit
execution of those (small) transactional operations *within* the large
transactional operation.

I agree with Olivier:

 *) I think recursive mutexes don't deserve improvements to the detriment of
normal ones


From a practical standpoint, I understand why recursive mutexes exist (see
description above).  However, from a *logical/cleanliness* standpoint,
every time I've used them, I've later re-factored to where I do *not* do
mutex-recursion.  (Currently I use no recursive mutexes.)

Do people really use them intentionally (do designs legitimately require
them), or are they, as I suspect, a way to get yourself out of trouble
when your design failed to account for proper transactional granularity?

--charley


Re: [Development] Mutex future directions

2012-05-19 Thread Thiago Macieira
On sábado, 19 de maio de 2012 18.34.30, Olivier Goffart wrote:
 Hi,

 Regarding valgrind:
  *) On debug build, nothing is inlined.
  *) If we keep it inline, then we would just need a patch like this [1]

-fno-inline doesn't help because of -fvisibility-inlines-hidden. The call
cannot be rerouted to valgrind.

The annotation you added might help, but as I said, adding instructions --
even if they produce no architectural change -- still consumes CPU resources.
I'd like to benchmark the annotation vs the function call.

 Regarding Transactional memory:
  *) Very interesting
  *) Notice the end of section 8.2.1: "Improper use of hints will not cause
 functional bugs though it may expose latent bugs already in the code."
 So in other words, we can use XACQUIRE and XRELEASE without any problem
 in inline code, without binary compatibility issues

Indeed, but note what it says about transactions that abort too often. If
the transaction aborts, then the code needs to be re-run non-transactionally,
with the lock. That means decreased performance and increased power
consumption.

Note also that all x87 instructions will abort, so any transactions around x87
code (32-bit floating point) would cause aborts.

At this point, we don't know which mutex locks we should make transactional.
As I said, neither you nor I have a Haswell prototype to test on. At the
earliest, we'll be able to test this for Qt 5.2.

  *) The other transactional model does not really apply, but even if it
 did, we could still use different values for locked and unlocked, and
 the previously inlined code would fall back to the non-inline code.

RTM might apply because we can get extra information about the abort and figure
out whether that particular lock should use transactions the next time.

  *) Debugging tools (valgrind) could also detect the XACQUIRE and XRELEASE
 prefixes.

 Regarding increasing the size
  *) I think it is valuable to have a small memory overhead per mutex.
  *) I think recursive mutexes don't deserve improvements to the detriment of
 normal ones.
  *) I think there would not be an improvement on Windows.

If we don't improve for recursive mutexes and I simply don't try Windows or
Mac, then there's no need to increase the mutex size.

Then I need only to look into QReadWriteLock.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden




Re: [Development] Mutex future directions

2012-05-19 Thread Thiago Macieira
On sábado, 19 de maio de 2012 10.48.22, Charley Bay wrote:
 snip, QMutex future-direction-discussion

 I've followed this very technical discussion from the beginning--
 impressive array of topics.  Thanks to all digging into this, and great
 thanks to Thiago for opening the issue.

 I have a question on the point of recursive-locks:

 I understand recursive-locks (e.g., recursive-lock-count) were not part
 of the original QMutex, and were added later.  I understand it's merely a
 count-lock-wrapper around the *real* lock.

Actually, no. QMutex has always supported recursive mutexes.

The current implementation is such that only the non-recursive kind is
optimised. The reason for that is that implementing them with futexes, Mach
semaphores and Windows events was quite easy. By bypassing pthreads, we gain a
little in performance.

Implementing the recursive mutex on top of that wasn't very difficult. Though,
to be honest, it requires constant reviewing by other people. We had the same
solution in libdbus-1 and it had race conditions that went unnoticed for
years. It also took the glibc developers two years to get all the locking
primitives right in pthreads.
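
For reference, the owner-id-plus-counter scheme is roughly the sketch below
(simplified; a production version needs exactly the careful memory ordering
and review mentioned above):

#include <QMutex>
#include <QThread>
#include <QAtomicPointer>

class RecursiveSketch {
    QMutex inner;                        // the real, non-recursive lock
    QAtomicPointer<void> owner;          // thread currently holding it (or 0)
    int count = 0;                       // recursion depth, touched by the owner only
public:
    void lock() {
        Qt::HANDLE self = QThread::currentThreadId();
        if (owner.load() == self) {      // only ever true for the holder
            ++count;                     // re-entering: just bump the counter
            return;
        }
        inner.lock();                    // first entry: take the real lock
        owner.store(self);
        count = 1;
    }
    void unlock() {
        if (--count == 0) {
            owner.store(nullptr);
            inner.unlock();              // last exit: release the real lock
        }
    }
};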

 Further, I understand that recursive-locks can help
 get-one-out-of-a-corner-in-which-they-find-themselves, where the lock was
 already taken by the current thread and we don't want to deadlock when
 accidentally attempting the lock again.  This problem tends to occur in
 designs where the lock is implied for (small) transactional operations,
 *and* for (large) transactional operations that tend to trigger implicit
 execution of those (small) transactional operations *within* the large
 transactional operation.

Often, it's for the big locks that you can't completely assure that no
recursion will happen. The small ones usually protect a very limited code path
and it's doable to prove that it will never recurse.

For that reason, since you have lots of code to protect and you're likely not
to be locking and unlocking that often, the small loss of performance is
acceptable. That said, the recursive lock isn't that much worse. It's actually
quite good.

 I agree with Oliver:

  *) I think recursive mutex don't deserve improvements on the detriment of
 normal ones
 

 From a practical standpoint, I understand why recursive mutexes exist (see
 description above).  However, from a *logical/cleanliness* standpoint,
 every time I've used them, I've later re-factored to where I do *not* do
 mutex-recursion.  (Currently I use no recursive mutexes.)

 Do people really use them intentionally (do designs legitimately require
 them), or are they as I suspect, a way to get-yourself-out-of-trouble
 when your design failed to account for proper transactional granularity?

There are a few cases where recursive locks are legitimate uses. For example,
in QtDBus, one of the mutexes in QDBusConnection is recursive because we are
calling out to the libdbus-1 library to handle some condition, which may cause
it to raise further changes to the socket notifiers and timers via the same
callbacks that are used by other threads. The change can be executed on this
thread while the mutex is being held, but not so by other threads.

In other cases, you're probably right that the recursive mutex is just a
gimmick for "I can't prove it will not recurse". But then you need to be
careful in writing your code that operates under the lock, like I did for
QtDBus. A recursion from the same thread can be just as bad as simultaneous
incursion from another thread -- e.g., imagine you kept an iterator to a
vector, but the recursed call caused a reallocation.

So in fact, the rules of thumb about locks are:
 - protect the smallest code path that you really need to protect
 - do not call out to functions you don't control, never call back to the user

To protect the smallest code path, you often need lots of locks and lots of
locking, which is the goal of the current implementation by Olivier and Brad
(cheap to create, cheap to lock if not contended).

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden




Re: [Development] Mutex future directions

2012-05-19 Thread Olivier Goffart
On Saturday 19 May 2012 19:05:58 Thiago Macieira wrote:
 On sábado, 19 de maio de 2012 18.34.30, Olivier Goffart wrote:
  Hi,
  
  Regarding valgrind:
   *) On debug build, nothing is inlined.
   *) If we keep it inline, then we would just need a patch like this [1]
 
 -fno-inline doesn't help because of -fvisibility-inlines-hidden. The call
 cannot be rerouted to valgrind.

Visibility does not really matter for valgrind. It does address redirection, 
using the debug symbols.

You can see it works by trying the attached wrapper.c

 The annotation you added might help, but as I said, adding instructions --
 even if they produce no architectural change -- still consumes CPU
 resources. I'd like to benchmark the annotation vs the function call.

Yes, they have a cost which I am not sure we want to pay in a release build.


  Regarding Transactional memory:

   *) Notice the end of section 8.2.1: "Improper use of hints will not cause
  functional bugs though it may expose latent bugs already in the
  code." So in other words, we can use XACQUIRE and XRELEASE without any
  problem in inline code, without binary compatibility issues
 
 Indeed, but note that what it says about transactions that abort too often.
 If the transaction aborts, then the code needs to be re-run
 non-transactionally, with the lock. That means decreased performance and
 increased power consumption.

Yes, but we are talking about the rare case in which a QMutex is shared
between two different objects compiled with different versions of Qt.
And in that unlikely case, one can just recompile to fix the performance 
issue.

 Note also that all x87 instructions will abort, so any transactions around
 x87 code (32-bit floating point) would cause aborts.
 
 At this point, we don't know which mutex locks we should make transactional.
 As I said, neither you nor I have a Haswell prototype to test on. At the
 earliest, we'll be able to test this for Qt 5.2.

Indeed, QMutex can be used for all sorts of cases. There can also be way too
much code in the critical section to fit into the transaction cache.
Or maybe there are side effects:

QMutexLocker lock(mutex)
qDebug() << "What now?"  -- does it also restart the transaction?


So it is probably bad to do the lock elision within QMutex...  
We need to test it on real hardware to see if it works.

But my point is that the current QMutex architecture does not keep us from 
using lock elision later.
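
A hedged sketch of what such elision could look like on the lock word itself,
using GCC's HLE memory-model bits (a GCC 4.8+ x86 extension, possibly needing
-mhle; illustrative only, not a proposed patch):

static int lockWord;    // 0 = free, 1 = held

static void hleLock()
{
    // XACQUIRE-prefixed exchange: starts hardware lock elision on Haswell,
    // behaves as an ordinary locked exchange on older CPUs.
    while (__atomic_exchange_n(&lockWord, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE) != 0) {
        while (__atomic_load_n(&lockWord, __ATOMIC_RELAXED))
            __builtin_ia32_pause();     // spin politely while it is held
    }
}

static void hleUnlock()
{
    // XRELEASE-prefixed store: ends the elided region, or releases the lock.
    __atomic_store_n(&lockWord, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}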

-- 
Olivier

Woboq - Qt services and support - http://woboq.com
/*
 * Wrapper for helgrind that works with Qt5 mutexes
 * 
 * Compile and run with:
 
gcc -shared -fPIC -o wrapper.so wrapper.c

LD_PRELOAD=wrapper.so valgrind --tool=helgrind <Qt5 application>
   
 *  
 * A debug build is required   
 * 
 *  Olivier Goffart ogoff...@woboq.com
 */

#include <stdio.h>
#include <valgrind/valgrind.h>
#include <valgrind/helgrind.h>

void I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex4lockEv)( void *mutex )
{
   OrigFn fn;
   VALGRIND_GET_ORIG_FN(fn);
   
//   printf("LOCK %p \n", mutex);
   
   DO_CREQ_v_WW(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_PRE, void*, mutex, long, 0);
   CALL_FN_v_W(fn, mutex);
   DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_POST, void*, mutex);
}

void I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex6unlockEv)( void *mutex )
{
   OrigFn fn;
   VALGRIND_GET_ORIG_FN(fn);
   
//   printf("UNLOCK %p \n", mutex);
   
   DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_UNLOCK_PRE, void*, mutex);
   CALL_FN_v_W(fn, mutex);
   DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_UNLOCK_POST, void*, mutex);
}

void I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex7tryLockEi)( void *mutex, int timeout )
{
   OrigFn fn;
   VALGRIND_GET_ORIG_FN(fn);
   
 //  printf("TRYLOCK %p %d\n", mutex, timeout);
   
   DO_CREQ_v_WW(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_PRE, void*, mutex, long, 1);
   CALL_FN_v_WW(fn, mutex, timeout);
   DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_POST, void*, mutex);
}




Re: [Development] Mutex future directions

2012-05-19 Thread Olivier Goffart
On Sunday 20 May 2012 02:17:53 Olivier Goffart wrote:

 You can see it works by trying the attached wrapper.c


Actually, I realize my implementation of the hook for tryLock was wrong, as I
forgot the return value.

Fixed, just in case someone wants to use it.

-- 
Olivier

Woboq - Qt services and support - http://woboq.com/*
 * Wrapper for helgrind that works with Qt5 mutexes
 * 
 * Compile and run with:
 
gcc -shared -fPIC -o wrapper.so wrapper.c

LD_PRELOAD=wrapper.so valgrind --tool=helgrind <Qt5 application>
   
 *  
 * A debug build is required   
 * 
 *  Olivier Goffart ogoff...@woboq.com
 */

#include <stdio.h>
#include <valgrind/valgrind.h>
#include <valgrind/helgrind.h>

void I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex4lockEv)( void *mutex )
{
   OrigFn fn;
   VALGRIND_GET_ORIG_FN(fn);
   
//   printf("LOCK %p \n", mutex);
   
   DO_CREQ_v_WW(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_PRE, void*, mutex, long, 0);
   CALL_FN_v_W(fn, mutex);
   DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_POST, void*, mutex);
}

void I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex6unlockEv)( void *mutex )
{
   OrigFn fn;
   VALGRIND_GET_ORIG_FN(fn);
   
//   printf("UNLOCK %p \n", mutex);
   
   DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_UNLOCK_PRE, void*, mutex);
   CALL_FN_v_W(fn, mutex);
   DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_UNLOCK_POST, void*, mutex);
}

long I_WRAP_SONAME_FNNAME_ZU(Za,_ZN11QBasicMutex7tryLockEi)( void *mutex, int timeout )
{
   OrigFn fn;
   long ret;
   VALGRIND_GET_ORIG_FN(fn);

   
 //  printf("TRYLOCK %p %d\n", mutex, timeout);
   
   DO_CREQ_v_WW(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_PRE, void*, mutex, long, 1);
   CALL_FN_W_WW(ret, fn, mutex, timeout);
   if (ret & 0xff)
DO_CREQ_v_W(_VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_POST, void*, mutex);
   
   return ret;
}




Re: [Development] Mutex future directions

2012-05-18 Thread Thiago Macieira
On sexta-feira, 18 de maio de 2012 20.34.51, Thiago Macieira wrote:
 (P1) expand the size of QBasicMutex and QMutex to accommodate more data,
 for  Mac and Windows. On Windows, the extra data required is one pointer (a
 HANDLE), whereas on Mac it's a semaphore_t. Depending on the next task, we
 might need a bit more space

The next task was the recursive mutex one, before I reordered my
recommendations.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden




Re: [Development] Mutex future directions

2012-05-18 Thread Olivier Goffart
On Friday 18 May 2012 20:34:51 Thiago Macieira wrote:
 Hello
 
 I've just completed a review of the QMutex  family internals, as well as
 the Intel optimisation manuals about threading, data sharing, locking and
 the Transactional Memory extensions.
 
 Short story: I recommend de-inlining QMutex code for Qt 5.0. We can
 re-inline later. I also recommend increasing the QBasicMutex / QMutex size
 to accommodate more data without the pointer indirection. Further, some
 work needs to be done to support Valgrind. Details below.

The inlining showed a big performance improvement for performance-critical
operations where two function calls matter. It improved the performance of
signal emission, for example.
Now, if you can prove to me that the cost of the function call is negligible
compared to the atomic operation, maybe I can change my mind. But last time I
checked, it was useful.


As for the size, I think it is really nice to have a light QMutex. That way, 
you can have more QMutexes, and thus more granularity.
None of your arguments convinces me it is a good idea to increase the size. 
Assigning a QMutexPrivate to a contended QMutex should be fast enough.


 Long story follows.
 
 Current state;
 
 QBasicMutex is an internal POD class that offers non-recursive locking. It's
 incredibly efficient for Linux, where the single pointer-sized member
 variable is enough to execute the futex operations, in all cases. For other
 platforms, we get the same efficiency in non-contended cases, but incur a
 non-negligible performance penalty when contention happens. 

Which non-negligible performance penalty are you talking about here?

 QBasicMutex's documentation also has a note saying that timed locks may not
 work and may cause memory leaks.

It works.  But it might leak the QMutexPrivate if the QMutex destructor is not 
run.  This is usually not a problem because QBasicMutex is meant to be a 
global object. (Also, who uses tryLock with a timeout?)
 
 QMutex builds on QBasicMutex. It fixes the memory leak (I assume) and
 provides support for recursive mutexes. A recursive QMutex is simply a
 QMutex with an owner ID and a recursion counter.

Note that recursive mutexes should be advised against.

(google for recursive mutex:  
http://www.zaval.org/resources/library/butenhof1.html )


 QWaitCondition, QSemaphore, and QReadWriteLock are not optimised at all[...]

Yes, feel free to optimize, but this is unrelated to QMutex.

 Valgrind's helgrind and DRD tools can currently operate on Qt locks by
 preloading a library and hijacking QMutex functions. I have tested
 helgrinding Qt4 applications in the past. I do not see a library version
 number in the symbols, so it's possible that helgrind would work unmodified
 with Qt 5 if we were to de-inline the functions. However, we should
 probably approach the Valgrind community to make sure that the Qt 5 mutexes
 work.

Yes, there should be some no-op we could introduce in debug builds to hint those 
tools.
But that can work with the inline code.

 The Intel optimisation manual says that data sharing is most efficient when
 each thread or core is operating on a disjoint set of cachelines. That is,
 if you have two threads running, they are most efficient when there are no writes by both
 threads to the same cacheline (or cache sector of 128 bytes on Pentium 4).
 Shared reading is fine. While this is the Intel optimisation manual, the
 recommendation is probably a good rule of thumb for any architecture.
 
 There are some other optimisation hints about using pipelined locks and about
 cache aliasing at 64 kB and 1 MB, but those are higher-level problems than
 we can solve at the lock level.
 
 The TSX manual says that transactional memory contention happens at the
 cacheline level. That is, if the transaction reads from a cacheline that is
 modified outside the transaction, or if the transaction writes to a
 cacheline that is read from or written to outside the transaction. I do not
 believe this to be more of a problem than the above optimisation guideline,
 which is something for the higher-level organisation: do not put two
 independent lock variables in the same cacheline.
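
A minimal sketch of that last guideline, assuming a 64-byte cacheline and made-up names: pad each independently used lock word so that two of them never share a cacheline.

#include <QAtomicInt>

struct PaddedLockWord
{
    QAtomicInt lockWord;                     // one independent lock word
    char padding[64 - sizeof(QAtomicInt)];   // fill the rest of the (assumed) 64-byte cacheline
};

// e.g. PaddedLockWord slots[workerCount]; rather than QAtomicInt slots[workerCount];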
 
 There are two types of instructions to start and finish a transaction. One
 pair is backwards compatible with existing processors and could be inserted
 to every single mutex lock and unlock, even in inline code (which might
 serve as a hint to valgrind, for example). The other pair requires checking
 the processor CPUID first.
 
 However, there are no processors in the market with transactional memory
 support and I don't have access to a prototype (yet, anyway), so at this
 point we simply have no idea whether enabling transactions for all mutex
 locks is a good idea. If enabling for them all isn't a good idea, the code
 for being adaptative cannot be inlined and we're further bound by the
 current inline code in QMutex.

That is not relevant.
Transactional memory requires a different set of primitives. QMutex has 
nothing to do with transactional memory.
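
For reference, the backwards-compatible pair referred to above is the XACQUIRE/XRELEASE prefix form, and the CPUID-gated pair is XBEGIN/XEND. A rough, hypothetical lock-elision sketch using the latter through the compiler intrinsics follows; it is not anything in Qt, needs a TSX-capable CPU and -mrtm, and the toy lock word and names are assumptions.

#include <immintrin.h>
#include <QAtomicInt>
#include <QThread>

static QAtomicInt lockWord;                 // toy lock: 0 = free, 1 = held

void elidedLock()
{
    if (_xbegin() == _XBEGIN_STARTED) {
        if (lockWord.load() == 0)           // the lock word enters the transaction's read set
            return;                         // run the critical section speculatively
        _xabort(0xff);                      // really held: abort and fall back
    }
    // transaction aborted or never started: take the lock for real
    while (!lockWord.testAndSetAcquire(0, 1))
        QThread::yieldCurrentThread();
}

void elidedUnlock()
{
    if (_xtest())                           // still inside a transaction
        _xend();                            // commit without ever writing the lock word
    else
        lockWord.storeRelease(0);           // we took it for real: normal unlock
}

This is also why the cacheline note in the quote matters: a write from outside the transaction to the same cacheline as the elided lock word aborts the transaction.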

Re: [Development] Mutex future directions

2012-05-18 Thread Thiago Macieira
On sexta-feira, 18 de maio de 2012 22.45.22, Olivier Goffart wrote:
 On Friday 18 May 2012 20:34:51 Thiago Macieira wrote:
  Hello
 
  I've just completed a review of the QMutex & family internals, as well as
  the Intel optimisation manuals about threading, data sharing, locking and
  the Transactional Memory extensions.
 
  Short story: I recommend de-inlining QMutex code for Qt 5.0. We can
  re-inline later. I also recommend increasing the QBasicMutex / QMutex size
  to accommodate more data without the pointer indirection. Further, some
  work needs to be done to support Valgrind. Details below.

 Inlining showed a significant performance improvement for
 performance-critical operations where two function calls matter. It
 improved the performance of signal emission, for example.
 Now, if you can prove to me that the cost of the function call is negligible
 compared to the atomic operation, maybe I can change my mind. But last time
 I checked, it was useful.

No, I can't prove that. Of course doing operations X and Y is more
expensive than doing just Y.

The function calls do have an impact on performance. But it's not performance
I want to trade off.

 As for the size, I think it is really nice to have a light QMutex. That way,
 you can have more QMutexes, and thus more granularity.
 None of your arguments convinces me it is a good idea to increase the size.
 Assigning a QMutexPrivate to a contended QMutex should be fast enough.

But it's not necessary. In fact, the size of QMutex would be platform
dependent. On Linux, it would remain the size of a pointer, on Windows it
would be twice the size of a pointer, on the Mac it would be the size of
semaphore_t plus an integer (plus padding).

For a recursive mutex, I don't mind keeping the private. But I want to
investigate whether we can have a recursive QMutex without indirection with
just a few more bytes.
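
To make the "just a few more bytes" concrete, the owner-plus-counter idea from earlier in the thread can be sketched as below. This is an illustration with made-up names, not the Qt implementation, and it assumes unlock() is only ever called by the thread that locked.

#include <QMutex>
#include <QThread>
#include <QAtomicPointer>

class RecursiveMutexSketch
{
public:
    RecursiveMutexSketch() : count(0) { }
    void lock()
    {
        Qt::HANDLE self = QThread::currentThreadId();
        if (owner.loadAcquire() == self) {   // we already hold it: just count
            ++count;
            return;
        }
        mutex.lock();                        // otherwise take the real lock
        owner.storeRelease(self);
        count = 1;
    }
    void unlock()
    {
        if (--count == 0) {                  // last unlock by the owner
            owner.storeRelease(0);
            mutex.unlock();
        }
    }
private:
    QMutex mutex;
    QAtomicPointer<void> owner;              // thread id of the current holder
    int count;                               // recursion depth, touched only by the owner
};

The extra state on top of the plain lock is one pointer-sized owner field and one int, which is the size argument being made.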

  Long story follows.
 
  Current state;
 
  QBasicMutex is an internal POD class that offers non-recursive locking.
  It's incredibly efficient for Linux, where the single pointer-sized
  member variable is enough to execute the futex operations, in all cases.
  For other platforms, we get the same efficiency in non-contended cases,
  but incur a non-negligible performance penalty when contention happens.

 Which non-negligible performance penalty are you talking about here?

The need to allocate memory and do the test-and-set loop for setting the
private.

  QBasicMutex's documentation also has a note saying that timed locks may
  not
  work and may cause memory leaks.

 It works.  But it might leak the QMutexPrivate if the QMutex destructor is
 not run.  This is usually not a problem because QBasicMutex is meant to be
 a global object. (also, who uses tryLock with a timeout?)

Right, and users use the classes as intended... :-P

  Valgrind's helgrind and DRD tools can currently operate on Qt locks by
  preloading a library and hijacking QMutex functions. I have tested
  helgrinding Qt4 applications in the past. I do not see a library version
  number in the symbols, so it's possible that helgrind would work
  unmodified
  with Qt 5 if we were to de-inline the functions. However, we should
  probably approach the Valgrind community to make sure that the Qt 5
  mutexes
  work.

 Yes, there should be some no-op we could introduce in debug builds to hint those
 tools.
 But that can work with the inline code.

Introducing the no-op valgrind code (a 32-bit rotate) still consumes CPU
resources. It will consume front-end decoding, one ALU port and retirement
bandwidth, not to mention the increased code size. There's no free lunch.
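
To make the trade-off concrete: the hint being discussed amounts to issuing, from inside the (debug) lock code itself, the same client requests that the preloaded wrapper earlier in the thread issues from outside, for example via the helper macros in valgrind's helgrind.h. Outside valgrind they execute only the magic no-op preamble. A toy sketch with an assumed stand-in lock, not the real Qt code:

#include <valgrind/helgrind.h>
#include <QAtomicInt>
#include <QThread>

static QAtomicInt state;                     // toy lock: 0 = unlocked, 1 = locked

void annotatedLock()
{
    VALGRIND_HG_MUTEX_LOCK_PRE(&state, 0);   // 0: this is not a tryLock
    while (!state.testAndSetAcquire(0, 1))
        QThread::yieldCurrentThread();       // stand-in for the real slow path
    VALGRIND_HG_MUTEX_LOCK_POST(&state);
}

void annotatedUnlock()
{
    VALGRIND_HG_MUTEX_UNLOCK_PRE(&state);
    state.storeRelease(0);
    VALGRIND_HG_MUTEX_UNLOCK_POST(&state);
}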

  However, there are no processors in the market with transactional memory
  support and I don't have access to a prototype (yet, anyway), so at this
  point we simply have no idea whether enabling transactions for all mutex
  locks is a good idea. If enabling for them all isn't a good idea, the code
  for being adaptative cannot be inlined and we're further bound by the
  current inline code in QMutex.

 That is not relevant.
 Transactional memory requires a different set of primitives. QMutex has
 nothing to do with transactional memory.

Then you didn't read the manual. I'm going to ignore what you said until you
read the manual, because transactional memory has everything to do with QMutex.

Please, either believe me or read the manual.

  Recommendations (priority):
 
  (P0) de-inline QBasicMutex locking functions until we solve some or all of
  the below problems

 What problems exactly?

The ones listed below:

  (P1) expand the size of QBasicMutex and QMutex to accommodate more data,
  for Mac and Windows. On Windows, the extra data required is one pointer
  (a HANDLE), whereas on Mac it's a semaphore_t. Depending on the next
  task, we might need a bit more space. At first glance, all three
  implementations would work fine with a lock state and the wait structure
  -- with the futex implementation doing both in one single integer.

 Those HANDLE or 

Re: [Development] Mutex future directions

2012-05-18 Thread Olivier Goffart
On Friday 18 May 2012 23:25:47 Thiago Macieira wrote:
   QBasicMutex is an internal POD class that offers non-recursive locking.
   It's incredibly efficient for Linux, where the single pointer-sized
   member variable is enough to execute the futex operations, in all cases.
   For other platforms, we get the same efficiency in non-contended cases,
   but incur a non-negligible performance penalty when contention happens.
  
  Which non-negligible performance penalty are you talking about here?
 
 The need to allocate memory and do the test-and-set loop for setting the
 private.

There is no memory allocation; it is using the QFreeList.
And this happens when we are about to do a context switch, so I believe the 
test-and-set loop should be negligible.

 Right, and users use the classes as intended... :-P

(QBasicMutex is an undocumented internal class)

 Introducing the no-op valgrind code (a 32-bit rotate) still consumes CPU
 resources. It will consume front-end decoding, one ALU port and retirement
 bandwidth, not to mention the increased code size. There's no free lunch.

Slower than a function call (which will likely spill registers)?  I don't 
believe so.

(Moreover, we are talking about debug builds, right?)

  That is not relevant.
  Transactional memory requires a different set of primitives. QMutex has
  nothing to do with transactional memory.
 
 Then you didn't read the manual. I'm going to ignore what you said until you
  read the manual, because transactional memory has everything to do with
 QMutex.
 
 Please, either believe me or read the manual.

Do you have a link to that manual?

Transactional memory is about detecting conflicts between transactions, and 
rolling back.
Mutexes are about locking until the resource is free.

Transactional memory could be used to simplify the code that allocates the 
QMutexPrivate.

But I doubt the hypothetical gain is even worth the function call.

Because, remember, the uncontended case is much more critical than the contended 
case. (There is no point in winning a dozen cycles if we are going to pay 
several hundred anyway in the context switch that follows.)

But a fast uncontended case is critical because it allows the use of a mutex in 
code that may or may not be used with threads. Example: QObject. We need to take 
mutexes while emitting a signal because it may be used in multiple threads, 
but we don't want to pay too much when it is used in a single thread (the common 
case).
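
For readers following along, the pattern both sides are arguing about looks roughly like the sketch below (made-up names, not the real Qt 5 internals): the uncontended path is a single inline compare-and-swap, and only contention goes out of line. Here the slow path merely yields in a loop; a real one would sleep on a futex, semaphore or event and wake waiters on unlock.

#include <QAtomicInt>
#include <QThread>

class MutexSketch
{
public:
    void lock()
    {
        // uncontended case: a single atomic compare-and-swap and nothing else
        if (!state.testAndSetAcquire(0, 1))
            lockSlowPath();                  // contended: out of line in the real design
    }
    void unlock()
    {
        state.storeRelease(0);               // a real blocking design must also wake waiters
    }
private:
    void lockSlowPath()
    {
        while (!state.testAndSetAcquire(0, 1))
            QThread::yieldCurrentThread();   // placeholder for a futex/semaphore wait
    }
    QAtomicInt state;                        // 0 = unlocked, 1 = locked
};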

   (P2) optimise the Mac and Windows implementations to avoid the need for
   allocating a dynamic d pointer in case of contention. In fact, remove
   the
   need for dynamic d-pointer allocation altogether: Mac, Windows and Linux
   should never do it, while the generic Unix implementation should do it
   all
   the time in the constructor
  
  The allocation of the d-pointer is not that dynamic.
  It is already quite optimized in Qt 5.
 
 My suggestion was to avoid the QMutexPrivate allocation on Mac and Windows,
 since they require just a bit of initialisation. Now, given what you said
 about semaphore_t, we may not be able to do it on the Mac. But we can try to
 apply the same optimisation for Windows -- the initialisation is a call to
 CreateEvent.

But CreateEvent probably still allocates memory behind the scenes.
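
For concreteness, the extra per-mutex data being discussed for Windows is essentially one auto-reset event handle. A hedged sketch with assumed naming and no error handling:

#include <windows.h>

struct Win32WaitSlot
{
    HANDLE event;
    void init()    { event = CreateEvent(NULL, FALSE, FALSE, NULL); } // auto-reset, initially unsignalled
    void wait()    { WaitForSingleObject(event, INFINITE); }          // block until someone signals us
    void wakeOne() { SetEvent(event); }                               // release one waiter, then auto-reset
    void destroy() { CloseHandle(event); }
};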

  I believe there are more important priorities right now than touching QMutex,
  which already had its face-lift for Qt 5.
 
 I disagree. If we shoot ourselves in the foot by not being able to support
 TSX and valgrind in Qt 5, we've lost a lot. That's why I propose
 de-inlining for 5.0, so we have enough time to investigate those potential
 drawbacks.

I think the inlining is not a problem for valgrind. 

And I don't think we can gain much with TSX. 

And even if we could, we could still do pretty much everything in a binary 
compatible way despite the inlines (we can say 2 means unlocked and 3 locked; 
then the old code falls back to the non-inline case, and the first lock would 
handle the 0 or 1).

Is it not, however, shooting ourselves in the foot by not inlining it? Because we 
can hardly inline it later in a binary compatible way.

-- 
Olivier

Woboq - Qt services and support - http://woboq.com
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development