[ https://issues.apache.org/jira/browse/QPID-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226110#comment-13226110 ]

[email protected] commented on QPID-3890:
-----------------------------------------------------



bq.  On 2012-03-07 21:30:28, Andrew Stitcher wrote:
bq.  > This is great work.
bq.  > 
bq.  > The most important thing on my mind, though, is how do we ensure that work like this maintains correctness?
bq.  > 
bq.  > Do we have some rules/guidelines for the use of the Queue structs that we can use to inspect the code and satisfy ourselves of its correctness? A simple rule like this would be "take the blah lock before capturing or changing the foo state". It looks like we might have had rules like that, but these changes are so broad that it's difficult for me to inspect the code and reconstruct what rules now apply (if any).
bq.  > 
bq.  > On another point - I see you've added the use of an RWLock. It's always good to benchmark with a plain mutex before using one of these, as they are generally higher overhead; unless the use is "read lock mostly, write lock seldom", plain mutexes can win.
bq.  
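For illustration, a minimal sketch of the kind of rule Andrew asks for, using hypothetical members (not the actual qpid::broker::Queue fields) and the plain mutex he suggests benchmarking first: the rule is stated once, on the members it guards, and the statistic is atomic so it can legitimately be updated after the lock is released.

    // Sketch only: hypothetical members, not the real qpid::broker::Queue.
    #include <atomic>
    #include <deque>
    #include <mutex>

    class Queue {
        // RULE: messageLock must be held while touching any member tagged (L).
        std::mutex messageLock;
        std::deque<int> messages;               // (L) the message container
        std::atomic<unsigned long> enqueued{0}; // stat: atomic, so it may be
                                                // updated outside the lock
    public:
        void push(int m) {
            {
                std::lock_guard<std::mutex> l(messageLock); // take the lock...
                messages.push_back(m);                      // ...then touch (L)
            }
            ++enqueued; // outside the lock: safe only because it is atomic and
                        // no invariant ties it to the container contents
        }
    };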
bq.  Kenneth Giusti wrote:
bq.      Re: locking rules - good point, Andrew.  Most (all?) of these changes I made simply by manually reviewing the existing code's locking behavior and trying to identify what can safely be moved outside the locks.  I'll update this patch with comments that explicitly call out those data members & actions that require the lock to be held.
bq.      
bq.      The RWLock - hmm... that may be an unintentional change: at first I had a separate lock for the queue observer container, which was rarely written to.  I abandoned that change and put the queue observers under the queue lock, since otherwise the order of events (as seen by the observers) could be re-arranged.
bq.  
bq.  Kenneth Giusti wrote:
bq.      Doh - sorry, replied with way too low a caffeine/blood ratio.  You're referring to the lock in the listener - yes.  Gordon points out that this may be unsafe, so I need to think about it a bit more.
bq.  
bq.  Gordon Sim wrote:
bq.      The order of events as seen by observers can be re-arranged with the patch as it stands (at least those for acquires and dequeues; enqueues would still be ordered correctly with respect to each other).
bq.      
bq.      Regarding the listeners access, I think you could fix it by e.g. having an atomic counter that was incremented after pushing the message but before populating the notification set, checking that count before trying to acquire the message on the consume side, and then re-checking it before adding the consumer to the listeners if there was no message. If the number changes between the last two checks, you would retry the acquisition.
bq.  
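For illustration, a minimal sketch of the counter protocol Gordon describes, with hypothetical stub helpers standing in for the real acquire/listener operations:

    #include <atomic>

    struct Consumer {};

    // Hypothetical stand-ins for the real queue operations:
    bool tryAcquireMessage(Consumer&) { return false; } // pop a message if one is visible
    void addToListeners(Consumer&) {}                   // park the consumer for notification

    // Producer side (not shown): push the message, then ++pushCount, then
    // populate the notification set -- in that order.
    std::atomic<unsigned> pushCount{0};

    bool consume(Consumer& c) {
        for (;;) {
            unsigned seen = pushCount.load();      // check 1: before the acquire
            if (tryAcquireMessage(c)) return true; // got a message: done
            if (pushCount.load() == seen) {        // check 2: before parking
                addToListeners(c);                 // no push raced us: safe to wait
                return false;
            }
            // The counter moved between the two checks, so a push may have
            // landed after our failed acquire: retry the acquisition.
        }
    }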
bq.  Kenneth Giusti wrote:
bq.      Re: listeners - makes sense.  I thought the consistency of the listener could be decoupled from the queue, but clearly that doesn't make sense (and probably buys us very little to begin with).
bq.      
bq.      Re: ordering of events as seen by observers - I want to guarantee that the order of events is preserved with respect to the observers; otherwise stuff will break as it stands (like flow control, which relies on correct ordering).  Gordon - can you explain the acquire/dequeue reordering case?  Does this patch introduce that reordering, or has it been possible all along?
bq.  
bq.  Gordon Sim wrote:
bq.      My mistake, I misread the patch, sorry! I think you are right that the ordering is preserved.
bq.  
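For illustration, a minimal sketch (hypothetical types, not the real QueueObserver interface) of why the observer callbacks stay under the queue lock:

    #include <deque>
    #include <mutex>
    #include <vector>

    struct Observer {
        virtual void enqueued(int m) = 0;
        virtual ~Observer() {}
    };

    class Queue {
        std::mutex messageLock;
        std::deque<int> messages;         // guarded by messageLock
        std::vector<Observer*> observers; // guarded by messageLock
    public:
        void enqueue(int m) {
            std::lock_guard<std::mutex> l(messageLock);
            messages.push_back(m);
            for (Observer* o : observers)
                o->enqueued(m); // callback ordered with the change it reports
        }
    };
    // If the notification loop ran after the locked scope instead, thread A
    // could enqueue before thread B yet notify after it: observers would see
    // the events re-arranged, breaking anything that depends on their order
    // (e.g. flow control).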
bq.  Gordon Sim wrote:
bq.      Re "I thought the consistency of the listener could be decoupled from the queue, but clearly that doesn't make sense (and probably buys us very little to begin with)": I would have thought that this was one of the more significant changes, no? (Along with moving the mgmt stats outside the lock.)
bq.  
bq.  Kenneth Giusti wrote:
bq.      I've moved the listeners.populate() calls back under the lock.  There is a performance degradation, but there's still an overall improvement:
bq.      
bq.      Funnel Test:  (3 worker threads)
bq.      
bq.      trunk:                        patched:                      patch-2 (locked listeners):
bq.      tp(m/s)                       tp(m/s)                       tp(m/s)
bq.      
bq.      S1    :: S2    :: R1          S1    :: S2    :: R1          S1    :: S2    :: R1
bq.      73144 :: 72524 :: 112914      91120 :: 83278 :: 133730      91857 :: 87240 :: 133878
bq.      70299 :: 68878 :: 110905      91079 :: 87418 :: 133649      84184 :: 82794 :: 126980
bq.      70002 :: 69882 :: 110241      83957 :: 83600 :: 131617      81846 :: 79640 :: 124192
bq.      70523 :: 68681 :: 108372      83911 :: 82845 :: 128513      82942 :: 80433 :: 123311
bq.      68135 :: 68022 :: 107949      85999 :: 81863 :: 125575      80533 :: 79706 :: 120317
bq.      
bq.      
bq.      
bq.      Quad Shared Test (1 queue, 2 senders x 2 Receivers,  4 worker threads)
bq.      
bq.      trunk:                               patched:                             patch-2 (locked listeners):
bq.      tp(m/s)                              tp(m/s)                              tp(m/s)
bq.      
bq.      S1    :: S2    :: R1    :: R2        S1    :: S2    :: R1    :: R2        S1    :: S2    :: R1    :: R2
bq.      52826 :: 52364 :: 52386 :: 52275     64826 :: 64780 :: 64811 :: 64638     64184 :: 64170 :: 64247 :: 63992
bq.      51457 :: 51269 :: 51399 :: 51213     64630 :: 64157 :: 64562 :: 64157     63589 :: 63524 :: 63628 :: 63428
bq.      50557 :: 50432 :: 50468 :: 50366     64023 :: 63832 :: 64092 :: 63748     63718 :: 63685 :: 63772 :: 63555
bq.      
bq.      
bq.      
bq.      Federation Funnel (worker threads = 3 upstream, 2 down)
bq.      
bq.      trunk:                        patched:                      patch-2 (locked listeners):
bq.      tp(m/s)                       tp(m/s)                       tp(m/s)
bq.      
bq.      S1    :: S2    :: R1          S1    :: S2    :: R1          S1    :: S2    :: R1
bq.      86150 :: 84146 :: 87665       90877 :: 88534 :: 108418      90953 :: 90269 :: 105194
bq.      89188 :: 88319 :: 82469       87014 :: 86753 :: 107386      86468 :: 85530 :: 100796
bq.      88221 :: 85298 :: 82455       89790 :: 88573 :: 104119      89169 :: 88839 ::  92910
bq.

That's very interesting. There are only two changes I can see on the critical path for these tests:

  (i) moving the update of the mgmt stats outside the lock (very sensible, as we already know that adds some overhead and the locking is not required);

  (ii) the Consumer::filter() and Consumer::accept() calls are also moved out on the consumer side; as filter() currently always returns true, I think this boils down to the credit check in accept().

It would be interesting to know the relative impact of these.
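For illustration, a much-simplified sketch of where those two pieces of work now sit relative to the lock (hypothetical names; the real code must also handle an accept() failure by leaving the message acquirable, which is omitted here):

    #include <atomic>
    #include <deque>
    #include <mutex>

    struct Message {};
    struct Consumer {
        bool filter(const Message&) { return true; } // currently always true
        bool accept(const Message&) { return true; } // i.e. the credit check
    };

    class Queue {
        std::mutex messageLock;
        std::deque<Message> messages;               // guarded by messageLock
        std::atomic<unsigned long> dequeueCount{0}; // mgmt statistic
    public:
        bool getNextMessage(Message& m, Consumer& c) {
            {
                std::lock_guard<std::mutex> l(messageLock);
                if (messages.empty()) return false;
                m = messages.front();
                messages.pop_front();
            }
            ++dequeueCount;                    // (i) stats updated after unlock
            return c.filter(m) && c.accept(m); // (ii) filter/accept (the credit
                                               //      check) outside the lock
        }
    };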


- Gordon


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4232/#review5691
-----------------------------------------------------------


On 2012-03-07 19:23:36, Kenneth Giusti wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4232/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-03-07 19:23:36)
bq.  
bq.  
bq.  Review request for qpid, Andrew Stitcher, Alan Conway, Gordon Sim, and Ted Ross.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  In order to improve our parallelism across threads, this patch reduces the scope over which the Queue's message lock is held.
bq.  
bq.  I've still got more testing to do, but the simple test cases below show pretty good performance improvements on my multi-core box.  I'd like to get some early feedback, as there are a lot of little changes to the queue locking that will need vetting.
bq.  
bq.  As far as performance is concerned - I've run the following tests on an 8-core Xeon X5550 2.67GHz box.  I used qpid-send and qpid-receive to generate traffic, all of it in parallel.  All numbers are in messages/second, sorted by best overall throughput.
bq.  
bq.  Test 1: Funnel Test (shared queue, 2 producers, 1 consumer)
bq.  
bq.  test setup and execute info:
bq.  qpidd --auth no -p 8888 --no-data-dir --worker-threads=3
bq.  qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
bq.  qpid-receive -b $SRC_HOST -a srcQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
bq.  qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
bq.  qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
bq.  
bq.  trunk:                        patched:
bq.  tp(m/s)                       tp(m/s)
bq.  
bq.  S1    :: S2    :: R1          S1    :: S2    :: R1
bq.  73144 :: 72524 :: 112914      91120 :: 83278 :: 133730
bq.  70299 :: 68878 :: 110905      91079 :: 87418 :: 133649
bq.  70002 :: 69882 :: 110241      83957 :: 83600 :: 131617
bq.  70523 :: 68681 :: 108372      83911 :: 82845 :: 128513
bq.  68135 :: 68022 :: 107949      85999 :: 81863 :: 125575
bq.  
bq.  
bq.  Test 2: Quad Client Shared Test (1 queue, 2 senders x 2 Receivers)
bq.  
bq.  test setup and execute info:
bq.  qpidd --auth no -p 8888 --no-data-dir --worker-threads=4
bq.  qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
bq.  qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
bq.  qpid-receive -b $SRC_HOST -a srcQ1 -f -m 2000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
bq.  qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
bq.  qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
bq.  
bq.  trunk:                               patched:
bq.  tp(m/s)                              tp(m/s)
bq.  
bq.  S1    :: S2    :: R1    :: R2        S1    :: S2    :: R1    :: R2
bq.  52826 :: 52364 :: 52386 :: 52275     64826 :: 64780 :: 64811 :: 64638
bq.  51457 :: 51269 :: 51399 :: 51213     64630 :: 64157 :: 64562 :: 64157
bq.  50557 :: 50432 :: 50468 :: 50366     64023 :: 63832 :: 64092 :: 63748
bq.  
bq.  **Note: pre-patch, qpidd ran steadily at about 270% CPU during this test.  With the patch, qpidd's CPU utilization was steady at 300%.
bq.  
bq.  
bq.  Test 3: Federation Funnel (two federated brokers, 2 producers upstream, 1 consumer downstream)
bq.  
bq.  test setup and execute info:
bq.  qpidd --auth no -p 8888 --no-data-dir --worker-threads=3 -d
bq.  qpidd --auth no -p 9999 --no-data-dir --worker-threads=2 -d
bq.  qpid-config -b $SRC_HOST add queue srcQ1 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0
bq.  qpid-config -b $SRC_HOST add queue srcQ2 --max-queue-size=600000000 --max-queue-count=2000000 --flow-stop-size=0 --flow-stop-count=0
bq.  qpid-config -b $DST_HOST add queue dstQ1 --max-queue-size=1200000000 --max-queue-count=4000000 --flow-stop-size=0 --flow-stop-count=0
bq.  qpid-config -b $DST_HOST add exchange fanout dstX1
bq.  qpid-config -b $DST_HOST bind dstX1 dstQ1
bq.  qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ1
bq.  qpid-route --ack=1000 queue add $DST_HOST $SRC_HOST dstX1 srcQ2
bq.  qpid-receive -b $DST_HOST -a dstQ1 -f -m 4000000 --capacity 2000 --ack-frequency 1000 --print-content no --report-total &
bq.  qpid-send -b $SRC_HOST -a srcQ1 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
bq.  qpid-send -b $SRC_HOST -a srcQ2 -m 2000000 --content-size 300 --capacity 2000 --report-total --sequence no --timestamp no &
bq.  
bq.  
bq.  trunk:                        patched:
bq.  tp(m/s)                       tp(m/s)
bq.  
bq.  S1    :: S2    :: R1          S1    :: S2    :: R1
bq.  86150 :: 84146 :: 87665       90877 :: 88534 :: 108418
bq.  89188 :: 88319 :: 82469       87014 :: 86753 :: 107386
bq.  88221 :: 85298 :: 82455       89790 :: 88573 :: 104119
bq.  
bq.  Still TBD:
bq.    - fix message group errors
bq.    - verify/fix cluster
bq.  
bq.  
bq.  This addresses bug QPID-3890.
bq.      https://issues.apache.org/jira/browse/qpid-3890
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/qpid/cpp/src/qpid/broker/Queue.h 1297185 
bq.    /trunk/qpid/cpp/src/qpid/broker/Queue.cpp 1297185 
bq.    /trunk/qpid/cpp/src/qpid/broker/QueueListeners.h 1297185 
bq.    /trunk/qpid/cpp/src/qpid/broker/QueueListeners.cpp 1297185 
bq.  
bq.  Diff: https://reviews.apache.org/r/4232/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  unit test (non-cluster)
bq.  performance tests as described above.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Kenneth
bq.  
bq.


> Improve performance by reducing the use of the Queue's message lock
> -------------------------------------------------------------------
>
>                 Key: QPID-3890
>                 URL: https://issues.apache.org/jira/browse/QPID-3890
>             Project: Qpid
>          Issue Type: Improvement
>          Components: C++ Broker
>    Affects Versions: 0.16
>            Reporter: Ken Giusti
>            Assignee: Ken Giusti
>            Priority: Minor
>             Fix For: Future
>
>
> The mutex that protects the Queue's message container is held longer than it really should be (e.g. across statistics updates, etc.).
> Reducing the amount of code executed while the lock is held will improve the performance of the broker in a multi-processor environment.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
