On 01/10/2021 01:32, Han Zhou wrote:


On Thu, Sep 30, 2021 at 2:03 PM Anton Ivanov <[email protected]> wrote:

    On 30/09/2021 20:48, Han Zhou wrote:


    On Thu, Sep 30, 2021 at 7:34 AM Anton Ivanov
    <[email protected]> wrote:

        Summary of findings.

        1. The numbers from the perf test do not align with
        ovn-heater, which is much closer to a realistic load. On some
        tests where heater shows a 5-10% end-to-end improvement with
        parallelization, the perf test shows worse results. You
        spotted this one correctly.

        Example of the northd averages pulled out of the test report
        via grep and sed:

           127.489353
           131.509458
           116.088205
           94.721911
           119.629756
           114.896258
           124.811069
           129.679160
           106.699905
           134.490338
           112.106713
           135.957658
           132.471111
           94.106849
           117.431450
           115.861592
           106.830657
           132.396905
           107.092542
           128.945760
           94.298464
           120.455510
           136.910426
           134.311765
           115.881292
           116.918458

        These values are all over the place - this is not a
        reproducible test.

        2. In its present state you need to re-run it 30+ times and
        take an average. The standard deviation of the values for
        the northd loop is > 10%. Compared to that, the
        reproducibility of ovn-heater is significantly better: I
        usually get less than 0.5% difference between runs if there
        were no iteration failures. I would suggest using that instead
        if you want to do performance comparisons, until we have
        figured out what affects the perf test.

        3. The reports use the short-term running average, which is
        probably wrong because the last several values skew it very
        significantly.

        I will look into all of these.

    Thanks for the summary! However, I think there is a bigger
    problem (probably related to my environment) than the stability
    of the test (make check-perf TESTSUITEFLAGS="--rebuild") itself.
    As I mentioned in an earlier email, I observed even worse results
    with a large-scale topology closer to a real-world deployment of
    ovn-k8s, just testing with the command:
        ovn-nbctl --print-wait-time --wait=sb sync

    This command simply triggers a change in the NB_Global table and
    waits for northd to complete the full recompute and update the SB.
    It doesn't have to be the "sync" command; any change to the NB DB
    produces a similar result (e.g.: ovn-nbctl --print-wait-time
    --wait=sb ls-add ls1)

    Without parallel:
    ovn-northd completion: 7807ms

    With parallel:
    ovn-northd completion: 41267ms

    Is this with current master or prior to these patches?

    1. There was an issue prior to these where the hash on first
    iteration with an existing database when loading a large database
    for the first time was not sized correctly. These numbers sound
    about right when this bug was around.

The patches are included. The commit id is 9242f27f63 as mentioned in my first email.

    2. There should be NO DIFFERENCE in a single compute cycle with an
    existing database between runs with and without parallel when dp
    groups are enabled at present. This is because the first cycle does
    not use parallel compute; it is disabled in order to achieve the
    correct hash sizing for future cycles by auto-scaling the hash.
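That first-cycle behavior can be sketched roughly as follows (hypothetical names, not the actual ovn-northd code): run the first recompute serially, record the resulting flow count, and only then allow parallel builds with a pre-sized hash.

```c
#include <stdbool.h>
#include <stddef.h>

struct lflow_state {
    size_t last_flow_count;   /* flows built in the previous cycle */
    bool sized;               /* do we have a trustworthy size yet? */
};

/* Decide whether this cycle may use parallel build, and with what
 * pre-sized capacity. Names here are hypothetical. */
static bool
use_parallel_build(struct lflow_state *st, size_t *capacity_out)
{
    if (!st->sized) {
        /* First cycle: run serially; the size is recorded afterwards. */
        *capacity_out = 0;
        return false;
    }
    /* Later cycles: pre-size the hash from the previous count so the
     * worker threads never need to resize it concurrently. */
    *capacity_out = st->last_flow_count;
    return true;
}

/* Called at the end of every recompute cycle. */
static void
end_of_cycle(struct lflow_state *st, size_t flows_built)
{
    st->last_flow_count = flows_built;
    st->sized = true;
}
```

This is why, with an existing database, the first run with parallel enabled should match the non-parallel numbers exactly.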

Yes, I understand this and I did enable dp-group for the above "ovn-nbctl sync" test, so the number I showed above for "with parallel" was for the 2nd run and onwards. For the first round the result is exactly the same as without parallel.

I just tried disabling dp-groups for the large-scale "ovn-nbctl sync" test (after some effort squeezing out memory space on my desktop), and the result shows that parallel build performs slightly better (although it is 3x slower than with dp-group & without parallel, which is expected). The results are summarized together below:

Without parallel, with dp-group:
ovn-northd completion: 7807ms

With parallel, with dp-group:
ovn-northd completion: 41267ms

without parallel, without dp-group:
ovn-northd completion: 27996ms

with parallel, without dp-group:
ovn-northd completion: 26584ms

Now the interesting part:
I implemented a POC of a hash-based mutex array that replaces the rwlock in the function do_ovn_lflow_add_pd(), and the performance is greatly improved for the dp-group test:

with parallel, with dp-group (hash based mutex):
ovn-northd completion: 5081ms

This is 8x faster than the current parallel one and 30% faster than without parallel. This result looks much more reasonable to me. My theory is that when using parallel with dp-group, the rwlock contention is causing the low CPU utilization of the threads and the overall slowness on my machine. I will refine the POC to a formal patch and send it for review, hopefully by tomorrow.
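For readers following along, the hash-based mutex array idea can be sketched roughly like this (hypothetical names; the actual patch may differ). Instead of one global rwlock serializing every insertion, each hash class gets its own mutex, so threads only contend when their hashes collide modulo the array size:

```c
#include <pthread.h>
#include <stdint.h>

#define LFLOW_LOCK_COUNT 1024   /* must be a power of two */

static pthread_mutex_t lflow_locks[LFLOW_LOCK_COUNT];

static void
lflow_locks_init(void)
{
    for (int i = 0; i < LFLOW_LOCK_COUNT; i++) {
        pthread_mutex_init(&lflow_locks[i], NULL);
    }
}

/* Map a flow hash to its dedicated mutex. */
static pthread_mutex_t *
lflow_lock_for_hash(uint32_t hash)
{
    return &lflow_locks[hash & (LFLOW_LOCK_COUNT - 1)];
}

/* Usage inside a do_ovn_lflow_add_pd()-style helper would then be:
 *
 *     pthread_mutex_lock(lflow_lock_for_hash(hash));
 *     ... insert into / update the bucket chain for 'hash' ...
 *     pthread_mutex_unlock(lflow_lock_for_hash(hash));
 */
```

Two inserts only block each other when they hash to the same lock, which would explain the much better CPU utilization compared to a single rwlock that every thread writes to.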

Cool. The older implementation prior to going to rwlock was based on that.

I found a couple of issues with it, which is why I switched to the rwlock.

Namely, access to the lflow hash size is not controlled, and the hash size ends up corrupt because different threads modify it without a lock. In the worst-case scenario you end up with a dog's breakfast in that entire cache line.

So you need a couple of extra macros to insert fast without touching the hash size.

This in turn leaves you with a hash that you cannot resize to the correct size for searching during lflow post-processing and reconciliation. You will probably need the post-processing optimization patch which I submitted a couple of weeks back: instead of using an HMAPX to hold the single flows and modifying the lflow hash in place, it rebuilds the hash completely and replaces the original one. At that point you have the right size, and the hash is resized optimally.
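A toy illustration of the "insert fast without touching the size" idea (hypothetical names, not the actual OVS hmap macros): threads link nodes into their buckets without updating the shared element count, so no single counter cache line is written by every thread; the count is fixed up in one pass afterwards, before the hash is rebuilt at its optimal size.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
    struct node *next;
    uint32_t hash;
};

struct tiny_hmap {
    struct node **buckets;
    uint32_t mask;    /* bucket count - 1, power of two */
    size_t n;         /* element count: NOT touched by the fast insert */
};

/* Per-bucket insert that never writes 'n'. Callers must still
 * serialize on the bucket, e.g. with a per-hash lock. */
static void
tiny_hmap_insert_fast_no_count(struct tiny_hmap *h, struct node *node,
                               uint32_t hash)
{
    struct node **bucket = &h->buckets[hash & h->mask];
    node->hash = hash;
    node->next = *bucket;
    *bucket = node;
}

/* After the parallel phase: walk the buckets once to fix up 'n' so
 * the hash can be resized for efficient searching. */
static void
tiny_hmap_recount(struct tiny_hmap *h)
{
    size_t n = 0;
    for (uint32_t i = 0; i <= h->mask; i++) {
        for (struct node *p = h->buckets[i]; p; p = p->next) {
            n++;
        }
    }
    h->n = n;
}
```

The recount-and-rebuild step is the part the post-processing patch mentioned above would take care of in the real code.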

By the way, there is one more option: you may want to try switching to fat-rwlock, which is supposed to decrease contention and make things faster. Though that probably will not be enough.

I still do not get why your results are so different from the ovn-heater tests, but that is something I will look into separately.

Brgds,


Thanks,
Han


--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev