On 01/10/2021 01:32, Han Zhou wrote:


On Thu, Sep 30, 2021 at 2:03 PM Anton Ivanov <[email protected]> wrote:

    On 30/09/2021 20:48, Han Zhou wrote:


    On Thu, Sep 30, 2021 at 7:34 AM Anton Ivanov
    <[email protected]> wrote:

        Summary of findings.

        1. The numbers from the perf test do not align with
        ovn-heater, which is much closer to a realistic load. On some
        tests where heater shows a 5-10% end-to-end improvement with
        parallelization, the perf test shows worse results. You
        spotted this one correctly.

        Example of the northd averages pulled out of the test report
        via grep and sed:

           127.489353
           131.509458
           116.088205
           94.721911
           119.629756
           114.896258
           124.811069
           129.679160
           106.699905
           134.490338
           112.106713
           135.957658
           132.471111
           94.106849
           117.431450
           115.861592
           106.830657
           132.396905
           107.092542
           128.945760
           94.298464
           120.455510
           136.910426
           134.311765
           115.881292
           116.918458

        These values are all over the place - this is not a
        reproducible test.

        2. In its present state you need to re-run it 30+ times and
        take an average. The standard deviation of the values for
        the northd loop is > 10%. Compared to that, the
        reproducibility of ovn-heater is significantly better: I
        usually get less than 0.5% difference between runs if there
        were no iteration failures. I would suggest using that instead
        if you want to do performance comparisons, until we have
        figured out what affects the perf test.

        3. The reports use the short-term running average, which is
        probably wrong because the last several values skew it very
        significantly.

        I will look into all of these.

    Thanks for the summary! However, I think there is a bigger
    problem (probably related to my environment) than the stability
    of the test (make check-perf TESTSUITEFLAGS="--rebuild") itself.
    As I mentioned in an earlier email, I observed even worse results
    with a large-scale topology closer to a real-world deployment of
    ovn-k8s, just testing with the command:
        ovn-nbctl --print-wait-time --wait=sb sync

    This command simply triggers a change in the NB_Global table and
    waits for northd to complete the full recompute and update the SB.
    It doesn't have to be the "sync" command; any change to the NB DB
    produces a similar result (e.g.: ovn-nbctl --print-wait-time
    --wait=sb ls-add ls1)

    Without parallel:
    ovn-northd completion: 7807ms

    With parallel:
    ovn-northd completion: 41267ms

    Is this with current master or prior to these patches?

    1. There was an issue prior to these where the hash on first
    iteration with an existing database when loading a large database
    for the first time was not sized correctly. These numbers sound
    about right when this bug was around.

The patches are included. The commit id is 9242f27f63 as mentioned in my first email.

    2. There should be NO DIFFERENCE in a single compute cycle with an
    existing database between runs with and without parallel when dp
    groups are enabled at present. This is because the first cycle does
    not use parallel compute; it is disabled in order to achieve the
    correct hash sizing for future cycles by auto-scaling the hash.
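That first-cycle behavior can be sketched roughly as follows (hypothetical names, not the actual ovn-northd code): run the first recompute serially, record the resulting flow count, and only then allow parallel builds with a pre-sized hash.

```c
#include <stdbool.h>
#include <stddef.h>

struct lflow_state {
    size_t last_flow_count;   /* flows built in the previous cycle */
    bool sized;               /* do we have a trustworthy size yet? */
};

/* Decide whether this cycle may use parallel build, and with what
 * pre-sized capacity. Names here are hypothetical. */
static bool
use_parallel_build(struct lflow_state *st, size_t *capacity_out)
{
    if (!st->sized) {
        /* First cycle: run serially; the size is recorded afterwards. */
        *capacity_out = 0;
        return false;
    }
    /* Later cycles: pre-size the hash from the previous count so the
     * worker threads never need to resize it concurrently. */
    *capacity_out = st->last_flow_count;
    return true;
}

/* Called at the end of every recompute cycle. */
static void
end_of_cycle(struct lflow_state *st, size_t flows_built)
{
    st->last_flow_count = flows_built;
    st->sized = true;
}
```

This is why, with an existing database, the first run with parallel enabled should match the non-parallel numbers exactly.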

Yes, I understand this and I did enable dp-group for the above "ovn-nbctl sync" test, so the number I showed above for "with parallel" was for the 2nd run and onwards. For the first round the result is exactly the same as without parallel.

I just tried disabling dp-groups for the large-scale "ovn-nbctl sync" test (after some effort squeezing out memory space on my desktop), and the result shows that parallel build performs slightly better (although it is 3x slower than with dp-group & without parallel, which is expected). The results are summarized together below:

Without parallel, with dp-group:
ovn-northd completion: 7807ms

With parallel, with dp-group:
ovn-northd completion: 41267ms

without parallel, without dp-group:
ovn-northd completion: 27996ms

with parallel, without dp-group:
ovn-northd completion: 26584ms

Now the interesting part:
I implemented a POC of a hash-based mutex array that replaces the rwlock in the function do_ovn_lflow_add_pd(), and the performance is greatly improved for the dp-group test:

with parallel, with dp-group (hash based mutex):
ovn-northd completion: 5081ms

This is 8x faster than the current parallel one and 30% faster than without parallel. This result looks much more reasonable to me. My theory is that when using parallel with dp-group, the rwlock contention is causing the low CPU utilization of the threads and the overall slowness on my machine. I will refine the POC to a formal patch and send it for review, hopefully by tomorrow.
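For readers following along, the hash-based mutex array idea can be sketched roughly like this (hypothetical names; the actual patch may differ). Instead of one global rwlock serializing every insertion, each hash class gets its own mutex, so threads only contend when their hashes collide modulo the array size:

```c
#include <pthread.h>
#include <stdint.h>

#define LFLOW_LOCK_COUNT 1024   /* must be a power of two */

static pthread_mutex_t lflow_locks[LFLOW_LOCK_COUNT];

static void
lflow_locks_init(void)
{
    for (int i = 0; i < LFLOW_LOCK_COUNT; i++) {
        pthread_mutex_init(&lflow_locks[i], NULL);
    }
}

/* Map a flow hash to its dedicated mutex. */
static pthread_mutex_t *
lflow_lock_for_hash(uint32_t hash)
{
    return &lflow_locks[hash & (LFLOW_LOCK_COUNT - 1)];
}

/* Usage inside a do_ovn_lflow_add_pd()-style helper would then be:
 *
 *     pthread_mutex_lock(lflow_lock_for_hash(hash));
 *     ... insert into / update the bucket chain for 'hash' ...
 *     pthread_mutex_unlock(lflow_lock_for_hash(hash));
 */
```

Two inserts only block each other when they hash to the same lock, which would explain the much better CPU utilization compared to a single rwlock that every thread writes to.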

Cool. The older implementation prior to going to rwlock was based on that.

I found a couple of issues with it, which is why I switched to the rwlock.

Namely, access to the lflow hash size is not controlled, and the hash size ends up corrupt because different threads modify it without a lock. In the worst-case scenario you end up with a dog's breakfast in that entire cache line.

So you need a couple of extra macros to insert fast without touching the hash size.

This in turn leaves you with a hash that you cannot resize to the correct size for searching during lflow post-processing and reconciliation. You will probably need the post-processing optimization patch which I submitted a couple of weeks back: instead of using an HMAPX to hold the single flows and modifying the lflow hash in place, it rebuilds the hash completely and replaces the original one. At that point you have the right size, and the hash is resized optimally.
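A toy illustration of the "insert fast without touching the size" idea (hypothetical names, not the actual OVS hmap macros): threads link nodes into their buckets without updating the shared element count, so no single counter cache line is written by every thread; the count is fixed up in one pass afterwards, before the hash is rebuilt at its optimal size.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
    struct node *next;
    uint32_t hash;
};

struct tiny_hmap {
    struct node **buckets;
    uint32_t mask;    /* bucket count - 1, power of two */
    size_t n;         /* element count: NOT touched by the fast insert */
};

/* Per-bucket insert that never writes 'n'. Callers must still
 * serialize on the bucket, e.g. with a per-hash lock. */
static void
tiny_hmap_insert_fast_no_count(struct tiny_hmap *h, struct node *node,
                               uint32_t hash)
{
    struct node **bucket = &h->buckets[hash & h->mask];
    node->hash = hash;
    node->next = *bucket;
    *bucket = node;
}

/* After the parallel phase: walk the buckets once to fix up 'n' so
 * the hash can be resized for efficient searching. */
static void
tiny_hmap_recount(struct tiny_hmap *h)
{
    size_t n = 0;
    for (uint32_t i = 0; i <= h->mask; i++) {
        for (struct node *p = h->buckets[i]; p; p = p->next) {
            n++;
        }
    }
    h->n = n;
}
```

The recount-and-rebuild step is the part the post-processing patch mentioned above would take care of in the real code.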

By the way, there is one more option: you may want to try switching to fat-rwlock, which is supposed to decrease contention and make things faster. Though that probably will not be enough.

I still do not get why your results are so different from the ovn-heater tests, but that is something I will look into separately.

Brgds,


Thanks,
Han


--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev