On 01/10/2021 01:32, Han Zhou wrote:
On Thu, Sep 30, 2021 at 2:03 PM Anton Ivanov
<[email protected]> wrote:
On 30/09/2021 20:48, Han Zhou wrote:
On Thu, Sep 30, 2021 at 7:34 AM Anton Ivanov
<[email protected]> wrote:
Summary of findings.
1. The numbers from the perf test do not align with ovn-heater,
which is much closer to a realistic load. On some tests where
heater shows a 5-10% end-to-end improvement from parallelization,
the perf test gives worse results. You spotted this one correctly.
Example of the northd averages pulled out of the test report
via grep and sed:
127.489353
131.509458
116.088205
94.721911
119.629756
114.896258
124.811069
129.679160
106.699905
134.490338
112.106713
135.957658
132.471111
94.106849
117.431450
115.861592
106.830657
132.396905
107.092542
128.945760
94.298464
120.455510
136.910426
134.311765
115.881292
116.918458
These values are all over the place - this is not a
reproducible test.
2. In its present state you need to re-run it 30+ times and
take an average. The standard deviation of the values for
the northd loop is > 10%. Compared to that, the
reproducibility of ovn-heater is significantly better: I
usually get less than 0.5% difference between runs if there
were no iteration failures. I would suggest using that instead
for performance comparisons until we have
figured out what affects the perf test.
3. It uses the short-term running average in its reports,
which is probably wrong because the last several values skew
it significantly. (A quick illustration of the difference is below.)
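For comparison, something along these lines (an illustrative sketch,
not the perf-test code) would report a whole-run mean and deviation
instead of a short running average:

#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative only: contrast a whole-run mean and standard deviation
 * with a short running average, which is dominated by the last few
 * samples. */
static void
summarize(const double *samples, size_t n, size_t window)
{
    double sum = 0.0, sq = 0.0, tail = 0.0;
    size_t start = window < n ? n - window : 0;

    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
        sq += samples[i] * samples[i];
    }
    for (size_t i = start; i < n; i++) {
        tail += samples[i];
    }

    double mean = sum / n;
    double stddev = sqrt(sq / n - mean * mean);

    printf("mean %.2f  stddev %.2f (%.1f%%)  last-%zu avg %.2f\n",
           mean, stddev, 100.0 * stddev / mean, n - start,
           tail / (n - start));
}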
I will look into all of these.
Thanks for the summary! However, I think there is a bigger
problem (probably related to my environment) than the stability
of the test (make check-perf TESTSUITEFLAGS="--rebuild") itself.
As I mentioned in an earlier email, I observed even worse results
with a large-scale topology closer to a real-world deployment of
ovn-k8s, just by testing with the command:
ovn-nbctl --print-wait-time --wait=sb sync
This command simply triggers a change in the NB_Global table and
waits for northd to complete the full recompute and update the SB.
It doesn't have to be the "sync" command; any change to the NB DB
produces a similar result (e.g.: ovn-nbctl --print-wait-time
--wait=sb ls-add ls1)
Without parallel:
ovn-northd completion: 7807ms
With parallel:
ovn-northd completion: 41267ms
Is this with current master or prior to these patches?
1. Prior to these patches there was an issue where, when loading a
large database for the first time, the hash for the first
iteration was not sized correctly. These numbers sound
about right for when that bug was around.
The patches are included. The commit id is 9242f27f63 as mentioned in
my first email.
2. At present there should be NO DIFFERENCE in a single compute cycle
on an existing database between a run with parallelization and one
without, when dp-groups are in use. This is because the first cycle
does not use parallel compute. It is disabled in order to arrive at
the correct hash sizing for future cycles by letting the hash
auto-scale; the principle is sketched below.
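As a simplified sketch of the principle (illustrative names, not the
actual northd code):

#include <stdbool.h>
#include <stddef.h>

#include "openvswitch/hmap.h"

/* Illustrative sketch: the first (serial) cycle lets hmap_insert()
 * grow the map to its natural size; later (parallel) cycles pre-size
 * the map from that learned count, because the fast parallel inserts
 * never resize it. */
static void
size_lflow_map(struct hmap *lflows, size_t learned_count, bool first_run)
{
    hmap_init(lflows);
    if (!first_run) {
        /* Expand up front to hold 'learned_count' elements so no
         * resize happens while worker threads are inserting. */
        hmap_reserve(lflows, learned_count);
    }
}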
Yes, I understand this, and I did enable dp-groups for the above
"ovn-nbctl sync" test, so the numbers I showed above for "with
parallel" were for the 2nd run and onwards. For the first round the
result is exactly the same as without parallel.
I just tried disabling dp-groups for the large-scale "ovn-nbctl sync"
test (after taking some effort to squeeze out memory space on my
desktop), and the result shows that the parallel build performs
slightly better (although it is 3x slower than with dp-groups and
without parallel, which is expected). The results are summarized below:
ovn-northd completion times:
without parallel, with dp-group:    7807ms
with parallel, with dp-group:       41267ms
without parallel, without dp-group: 27996ms
with parallel, without dp-group:    26584ms
Now the interesting part:
I implemented a POC of a hash-based mutex array that replaces the rw
lock in the function do_ovn_lflow_add_pd(), and the performance is
greatly improved for the dp-group test:
with parallel, with dp-group (hash-based mutex): 5081ms
This is 8x faster than the current parallel code and 30% faster than
without parallel. This result looks much more reasonable to me. My
theory is that when using parallel with dp-groups, the rwlock
contention is causing the low CPU utilization of the threads and the
overall slowness on my machine. I will refine the POC into a formal
patch and send it for review, hopefully by tomorrow. The rough shape
of the POC is sketched below.
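A simplified sketch of the idea (names and sizes are made up for
illustration, not the final patch): a fixed array of mutexes indexed
by the low bits of the lflow hash, so two threads only contend when
their flows land on the same slot instead of all writers serializing
on one rwlock.

#include <pthread.h>
#include <stdint.h>

#define LFLOW_LOCK_BITS  10
#define LFLOW_LOCK_SLOTS (1u << LFLOW_LOCK_BITS)

static pthread_mutex_t lflow_locks[LFLOW_LOCK_SLOTS];

static void
lflow_locks_init(void)
{
    for (uint32_t i = 0; i < LFLOW_LOCK_SLOTS; i++) {
        pthread_mutex_init(&lflow_locks[i], NULL);
    }
}

/* Take the slot lock that covers 'hash' around the hmap insert in
 * do_ovn_lflow_add_pd(). */
static void
lflow_hash_lock(uint32_t hash)
{
    pthread_mutex_lock(&lflow_locks[hash & (LFLOW_LOCK_SLOTS - 1)]);
}

static void
lflow_hash_unlock(uint32_t hash)
{
    pthread_mutex_unlock(&lflow_locks[hash & (LFLOW_LOCK_SLOTS - 1)]);
}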
Cool. The older implementation, prior to moving to the rwlock, was based
on exactly that. I found a couple of issues with it, which is why I
switched to the rwlock.
Namely, access to the lflow hash size is not controlled, and the
hash size ends up corrupt because different threads modify it without a
lock. In the worst-case scenario you end up with a dog's breakfast in
that entire cache line.
So you need a couple of extra macros to insert fast without touching the
hash size, along the lines of the sketch below.
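Roughly like this (a sketch of the idea; the actual fast-insert helpers
are in lib/ovn-parallel-hmap.h):

#include <stdint.h>

#include "openvswitch/hmap.h"

/* Sketch of a "fast" insert: link the node into its bucket but do
 * NOT update the shared element count, so parallel writers (each
 * holding the lock that covers this hash) never dirty the cache
 * line holding hmap->n.  The count has to be fixed up later,
 * single-threaded. */
static inline void
lflow_hmap_insert_fast(struct hmap *map, struct hmap_node *node,
                       uint32_t hash)
{
    struct hmap_node **bucket = &map->buckets[hash & map->mask];

    node->hash = hash;
    node->next = *bucket;
    *bucket = node;
}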
This in turn leaves you with a hash which you cannot resize to the
correct size for searching, post-processing lflows, and
reconciliation. You will probably need the post-processing
optimization patch which I submitted a couple of weeks back. Instead
of using an HMAPX to hold the single flows and modifying the lflow
hash in place, it rebuilds the hash completely and replaces the
original one. At that point you have the right element count and the
hash is resized to its optimum size, roughly as sketched below.
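The shape of that rebuild, as a sketch (not the patch itself):

#include <stddef.h>

#include "openvswitch/hmap.h"

/* Sketch of the post-processing rebuild: count the nodes by walking
 * the buckets (the fast inserts never maintained hmap->n), then
 * relink everything into a fresh, correctly sized map. */
static void
lflow_hmap_rebuild(struct hmap *old)
{
    struct hmap fresh;
    size_t n = 0;

    for (size_t i = 0; i <= old->mask; i++) {
        for (struct hmap_node *node = old->buckets[i]; node;
             node = node->next) {
            n++;
        }
    }

    hmap_init(&fresh);
    hmap_reserve(&fresh, n);            /* one resize, up front */

    for (size_t i = 0; i <= old->mask; i++) {
        struct hmap_node *node = old->buckets[i];
        while (node) {
            struct hmap_node *next = node->next;
            hmap_insert(&fresh, node, node->hash);
            node = next;
        }
    }

    hmap_destroy(old);                  /* frees the old bucket array */
    *old = fresh;
    hmap_moved(old);                    /* fix up after moving in memory */
}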
By the way, there is one more option - you may want to try switching to
fat_rwlock (lib/fat-rwlock.h in OVS), which is supposed to decrease
contention and make things faster, though that probably will not be
enough. A sketch of what that would look like is below.
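Something like this (illustrative wrappers, not real code from the tree):

#include "fat-rwlock.h"    /* lib/fat-rwlock.h in the OVS tree */

/* Sketch only: fat_rwlock keeps per-thread state for readers, so
 * taking the read side does not bounce a shared cache line the way
 * a plain ovs_rwlock does.  Writers still serialize globally, which
 * is why this alone is probably not enough here. */
static struct fat_rwlock lflow_rwlock;

static void
lflow_lock_init(void)
{
    fat_rwlock_init(&lflow_rwlock);
}

static void
lflow_read(void)
{
    fat_rwlock_rdlock(&lflow_rwlock);   /* cheap, per-thread */
    /* ... lookups ... */
    fat_rwlock_unlock(&lflow_rwlock);
}

static void
lflow_write(void)
{
    fat_rwlock_wrlock(&lflow_rwlock);   /* still globally exclusive */
    /* ... insert ... */
    fat_rwlock_unlock(&lflow_rwlock);
}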
I still do not get why your results are so different from the ovn-heater
tests, but that is something I will look into separately.
Brgds,
Thanks,
Han
--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/