> On 11 Dec 2019, at 20:29, thierry bordaz <[email protected]> wrote:
> 
> 
> 
> On 12/11/19 1:21 AM, William Brown wrote:
>> 
>>> On 10 Dec 2019, at 19:15, thierry bordaz <[email protected]> wrote:
>>> 
>>> Hi William,
>>> 
>>> Thanks for these very interesting results.
>>> It would help if you could provide the stap scripts, so we can be sure of 
>>> how you are accounting the latency.
>> Yes, I plan to put them into a PR soon once I have done a bit more data 
>> collection and polishing of the script setup.
>> 
>>> Also, just to be sure, is latency a synonym for response time?
>> Yep, here I mean the execution time of a single operation.
>> 
>>> Regarding the comparison of the tails between 1 client and 16 clients: it 
>>> looks to me that the tails are equivalent. The more clients we have, the 
>>> more long latencies we see, so as a first approach I would rule out a 
>>> contention effect.
>> I disagree: the tail is much more pronounced in the 16 client version, and 
>> there is a significant shift of response times from the 32768 bucket to the 
>> 65536 bucket, indicating that many operations are now being "held up".
> 
> You are probably right.
> There is a response time shift, but I would expect a major contention effect 
> to shift the graph further and flatten it toward higher response times.
> Whatever the answer is, I think the important point is to agree on the 
> method, and the reports do show something suspicious.

In a highly concurrent system like DS we shouldn't be seeing tail latency or 
spikes; we should be seeing all responses within a really small window, 
especially when the searches are all the "same query" for a single uid. That's 
why I'm looking at the spikes first.
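
To make it concrete while the PR is pending, the measurement is roughly the 
following shape (a simplified sketch, give or take details; the ns-slapd path 
is whatever your install uses). It records the wall-clock nanoseconds between 
entry to and return from do_search per thread, and prints a log2 histogram 
when the script is interrupted, so the latency being accounted is the 
execution time of do_search for each operation and nothing more:

global entry_ns
global latencies

# Stamp each thread on entry to do_search ...
probe process("/usr/sbin/ns-slapd").function("do_search") {
    entry_ns[tid()] = gettimeofday_ns()
}

# ... and account the elapsed nanoseconds on return.
probe process("/usr/sbin/ns-slapd").function("do_search").return {
    t = entry_ns[tid()]
    if (t) {
        latencies <<< gettimeofday_ns() - t
        delete entry_ns[tid()]
    }
}

# Printed on ^C; assumes at least one sample was collected.
probe end {
    printf("Distribution of latencies (in nanoseconds) for %d samples\n",
           @count(latencies))
    printf("max/avg/min: %d/%d/%d\n",
           @max(latencies), @avg(latencies), @min(latencies))
    print(@hist_log(latencies))
}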

> 
>> 
>>> Regarding the ratio of shorter to longer latencies (assuming the searches 
>>> are equivalent), it is interesting to understand why we see such an effect. 
>>> One possible cause I was thinking of is the impact of the DB threads 
>>> (checkpointing or trickling). But even if a similar long tail exists in 
>>> ldbm_back, its absolute value is much lower than that of opshare_search: in 
>>> short, ldbm_back accounted for at most 4ms while opshared accounted for 
>>> 67ms. So there is something else (aci, network, frontend..).
>>> 
>>> Regarding USDT, I think it is a very good idea. However, just to share 
>>> some recent stap experience, I found it intrusive: in short, I did not get 
>>> the same throughput with and without it. In my case that was not a problem, 
>>> as I wanted to investigate a reproducible contention. But if we want 
>>> support/users/customers to use it, the performance impact in production 
>>> will be a valid concern.
>> I haven't noticed any "intrusiveness" from USDT so far? How were you running 
>> the stap scripts?
> I did not add probes in DS. I was just using stap to gather the durations of 
> specific functions per operation (plugin, backend or core).
> 
> I do not recall the level of "intrusiveness", as I was looking for contention 
> data more than for performance figures.

Hmmm, okay. I've found stap extremely fast and it doesn't get in the way, so 
without seeing your stap scripts I'm not sure what would explain the 
difference.
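
On the USDT side specifically: the sys/sdt.h marks compile to a single nop at 
the probe site until a tool arms them, so having them present in a production 
build should cost effectively nothing, and mark probes are resolved from the 
ELF notes rather than debuginfo, which also avoids the symbol resolution 
problem with optimised builds. As a sketch of the consumer side (the mark name 
and the ns-slapd path here are hypothetical; the real probe names will be 
whatever lands in the PR):

# Count how often a hypothetical "do_search__entry" mark fires, per second.
global hits
probe process("/usr/sbin/ns-slapd").mark("do_search__entry") { hits++ }
probe timer.s(1) { printf("%d do_search/s\n", hits); hits = 0 }

Running the same load with and without something like this attached is a cheap 
way to put a number on the overhead question.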

> 
>> 
>>> best regards
>>> thierry
>>> 
>>> 
>>> 
>>> On 12/9/19 6:16 AM, William Brown wrote:
>>>> Hi all,
>>>> 
>>>> Following last week's flamegraph runs, I wanted to find more detail on 
>>>> exactly what was happening. While the flamegraphs highlighted that a 
>>>> change existed between single and multithreaded servers, they did not 
>>>> help to isolate where the change was occurring.
>>>> 
>>>> To understand this I have started to work on a set of systemtap scripts 
>>>> that we can use to profile 389ds. These will be included in a future PR.
>>>> 
>>>> Here are the histograms from an initial test of profiling "do_search"
>>>> 
>>>> 1 thread
>>>> 
>>>> stap test_do_search.stp
>>>> ^CDistribution of latencies (in nanoseconds) for 441148 samples
>>>> max/avg/min: 35911542/85694/38927
>>>>     value |-------------------------------------------------- count
>>>>      8192 |                                                        0
>>>>     16384 |                                                        0
>>>>     32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 167285
>>>>     65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  239280
>>>>    131072 |@@@@@                                               25788
>>>>    262144 |@                                                    8530
>>>>    524288 |                                                      252
>>>>   1048576 |                                                        6
>>>>   2097152 |                                                        1
>>>>   4194304 |                                                        3
>>>>   8388608 |                                                        0
>>>>  16777216 |                                                        2
>>>>  33554432 |                                                        1
>>>>  67108864 |                                                        0
>>>> 134217728 |                                                        0
>>>> 
>>>> 16 thread
>>>> 
>>>>  stap test_do_search.stp
>>>> ^CDistribution of latencies (in nanoseconds) for 407806 samples
>>>> max/avg/min: 100315928/112407/39368
>>>>     value |-------------------------------------------------- count
>>>>      8192 |                                                        0
>>>>     16384 |                                                        0
>>>>     32768 |@@@@@@@@@@@@@@@@@@@@                               100281
>>>>     65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  249656
>>>>    131072 |@@@@@@@                                             37837
>>>>    262144 |@@@                                                 18322
>>>>    524288 |                                                     1171
>>>>   1048576 |                                                      203
>>>>   2097152 |                                                       90
>>>>   4194304 |                                                       74
>>>>   8388608 |                                                       83
>>>>  16777216 |                                                       58
>>>>  33554432 |                                                       21
>>>>  67108864 |                                                       10
>>>> 134217728 |                                                        0
>>>> 268435456 |                                                        0
>>>> 
>>>> 
>>>> It's interesting to note the tail latency here: on the 16 thread version 
>>>> we see ~67,000 fewer operations in the 32768 bucket, shifting mostly 
>>>> through the 131072 and 262144 buckets, as well as a much greater number of 
>>>> calls in the tail. With 1 thread no operation made it to 67108864, but 10 
>>>> did with 16 threads, along with ~200 more in the 1048576 bucket, and ~1500 
>>>> more that are greater than 524288. This kind of tailing means we have 
>>>> "spikes" of latency throughout the execution, which then have a flow-on 
>>>> effect, increasing the latency of other operations.
>>>> 
>>>> These are all in nanoseconds, so this means most operations are in 
>>>> do_search for 0.000131072 seconds or less, while there are spikes of 
>>>> operations taking between roughly 0.001048576 and 0.067108864 seconds to 
>>>> complete (an 8x to 512x increase in execution time).
>>>> 
>>>> From this point I took two measurements, of back_ldbm and of the access 
>>>> log, knowing these were easy targets for potential sources of latency. 
>>>> Both were performed against the 16 thread server. I didn't need to repeat 
>>>> this on the 1 thread server because I'm now looking for tail latency, not 
>>>> comparisons. Because we know there is an increase of ~1500 high-latency 
>>>> events, we are looking for similar numbers to appear in the other tails.
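>>>> 
>>>> For reference, the same entry/return histogram shape can cover several 
>>>> functions in one script by keying on probefunc(). A rough sketch (here 
>>>> ldbm_back_search stands in for the exact back_ldbm entry point that is 
>>>> probed, and the ns-slapd path may differ):
>>>> 
>>>> global entry_ns
>>>> global lat
>>>> 
>>>> # Stamp entry and account the elapsed time per probed function.
>>>> probe process("/usr/sbin/ns-slapd").function("ldbm_back_search"),
>>>>       process("/usr/sbin/ns-slapd").function("slapi_log_access") {
>>>>     entry_ns[tid(), probefunc()] = gettimeofday_ns()
>>>> }
>>>> 
>>>> probe process("/usr/sbin/ns-slapd").function("ldbm_back_search").return,
>>>>       process("/usr/sbin/ns-slapd").function("slapi_log_access").return {
>>>>     t = entry_ns[tid(), probefunc()]
>>>>     if (t) {
>>>>         lat[probefunc()] <<< gettimeofday_ns() - t
>>>>         delete entry_ns[tid(), probefunc()]
>>>>     }
>>>> }
>>>> 
>>>> # One histogram per probed function, printed on ^C.
>>>> probe end {
>>>>     foreach (f in lat) {
>>>>         printf("Distribution of %s latencies (in nanoseconds)\n", f)
>>>>         print(@hist_log(lat[f]))
>>>>     }
>>>> }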
>>>> 
>>>> 
>>>> ^CDistribution of back_ldbm latencies (in nanoseconds) for 364223 samples
>>>> max/avg/min: 4757744/31192/13622
>>>>    value |-------------------------------------------------- count
>>>>     2048 |                                                        0
>>>>     4096 |                                                        0
>>>>     8192 |@@                                                  14318
>>>>    16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  261265
>>>>    32768 |@@@@@@@@@@@@                                        63533
>>>>    65536 |@@@                                                 20624
>>>>   131072 |                                                     4062
>>>>   262144 |                                                      304
>>>>   524288 |                                                       68
>>>>  1048576 |                                                       33
>>>>  2097152 |                                                       15
>>>>  4194304 |                                                        1
>>>>  8388608 |                                                        0
>>>> 16777216 |                                                        0
>>>> 
>>>> This shows some tail latency in back_ldbm, however it's infrequent: only a 
>>>> handful of spikes, and they taper off quickly as they approach 2097152. 
>>>> These should certainly be looked at, as they likely add up and affect 
>>>> do_search.
>>>> 
>>>> 
>>>> stap -v test_access_log.stp
>>>> ^CDistribution of slapi_log_access latencies (in nanoseconds) for 1284350 
>>>> samples
>>>> max/avg/min: 23843992/5138/1036
>>>>    value |-------------------------------------------------- count
>>>>      256 |                                                        0
>>>>      512 |                                                        0
>>>>     1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     379710
>>>>     2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        361243
>>>>     4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  412480
>>>>     8192 |@@@@@@@@@@                                          83815
>>>>    16384 |@@@@                                                40087
>>>>    32768 |                                                     5076
>>>>    65536 |                                                     1253
>>>>   131072 |                                                      406
>>>>   262144 |                                                      130
>>>>   524288 |                                                       72
>>>>  1048576 |                                                       33
>>>>  2097152 |                                                       16
>>>>  4194304 |                                                       19
>>>>  8388608 |                                                        1
>>>> 16777216 |                                                        9
>>>> 33554432 |                                                        0
>>>> 67108864 |                                                        0
>>>> 
>>>> 
>>>> From this I can see that while some tail latency exists in some conditions 
>>>> of back_ldbm, there is a much longer and more prominent tail in the access 
>>>> log, with nearly 2000 calls at or above the 65536 mark - remember, in 
>>>> do_search most operations as a whole land in the 65536 bucket, so we have 
>>>> latency spikes in slapi_log_access that are as large as or larger than 
>>>> entire search operations.
>>>> 
>>>> Now, we can see "most" operations are firmly below 16384. This is what we 
>>>> want to see, but it's interesting that slapi_log_access can take up to a 
>>>> quarter of a whole operation's processing in the average case.
>>>> 
>>>> What's truly telling here is the tail out to 16777216, with ~250 
>>>> operations exceeding 262144 ns. This shows that within slapi_log_access we 
>>>> have large spiked delays in the behaviour, which affect not only the tail 
>>>> latency of do_search as a whole, but also cause other slapi_log_access 
>>>> calls to be "held up".
>>>> 
>>>> My next steps from here will be:
>>>> 
>>>> * To add USDT probes to the logs and back_ldbm to get finer detail about 
>>>> what is causing the excessive latency.
>>>>   * these probes are also needed to resolve what appears to be a symbol 
>>>> resolution issue with systemtap when optimisations are enabled.
>>>> * To add probes in other parts of the code base to get better visibility 
>>>> into where time is being spent as an operation moves through the server.
>>>> * To run a lock contention profile (I was unable to do this today due to 
>>>> bugs in systemtap)
>>>> * To document the setup procedures
>>>> * Commit all these into a PR
>>>> * Document some actionable improvements we can make to the server's 
>>>> performance.
>>>> 
>>>> 
>>>> Hope that is interesting to everyone,
>>>> 
>>>> 
>>>> --
>>>> Sincerely,
>>>> 
>>>> William
>> --
>> Sincerely,
>> 
>> William
>> 

--
Sincerely,

William
_______________________________________________
389-devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/[email protected]
