Hi Harry,
Great news --- we got a performance improvement!
With Tim's new build from last Friday, I have a new set of performance
numbers. We got about a 20% performance improvement with all three
enhancements included --- whether the loss limit is 0.1% or 0.002%.
There was no improvement if only the first two enhancements were included.
The summary report is right below. Please review, and let me
know if the current performance meets your expectations.
Thanks!
Jean
=====================================
OVS RPMs: openvswitch2.15-2.15.0-23.el8fdp.x86_64,
openvswitch2.15-debuginfo-2.15.0-23.el8fdp.x86_64
EMC: off
Trex binary search for the next two tests --- 64 bytes, 0.1% loss
limit, 1024 flows, 60 secs trial / 600 secs validation
Test case #1
* default --- not including any of the three enhancements
* 12.4 Mpps (100%)
Test case #2
* Including all three enhancements --- ovs-appctl
dpif-netdev/subtable-lookup-prio-set avx512_gather 3; ovs-appctl
dpif-netdev/dpif-set dpif_avx512; ovs-appctl
dpif-netdev/miniflow-parser-set study
* 14.9 Mpps (120.1%)
Trex binary search for the next two tests --- 64 bytes, 0.002% loss
limit, 1024 flows, 60 secs trial / 600 secs validation
Test case #1
* default --- not including any of the three enhancements
* 12.1 Mpps (100%)
Test case #2
* Including all three enhancements --- ovs-appctl
dpif-netdev/subtable-lookup-prio-set avx512_gather 3; ovs-appctl
dpif-netdev/dpif-set dpif_avx512 ; ovs-appctl
dpif-netdev/miniflow-parser-set study
* 14.6 Mpps (120.6%)
*NOTE: There is no performance gain if only the first two enhancements
are included.*
Bonus run with the latest openvswitch2.15 FDP build ---
openvswitch2.15-2.15.0-23.el8fdp
* 0.1% loss limit
* 10.1 Mpps
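One note on the "EMC: off" setting above: the three enhancement commands
are quoted verbatim in the test cases, but the EMC setting is not. As a
minimal sketch, one common way to turn EMC off is to disable EMC
insertion via other_config (assumption: this knob, rather than some other
method, is what was used for these runs):

# Disable EMC insertion so packets are classified by SMC/dpcls instead
ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0
# The default insertion probability (1/100) can be restored afterwards
ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=100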
On 5/27/21 10:12 AM, Van Haaren, Harry wrote:
(Sorry for top posting – HTML/email-client disagree with inline
answers below :/ )
Hi Jean,
Yes I expect drops of 0.002% to be enough tolerance. (If the
performance does not improve to Scalar (or higher) please try a much
higher value like 0.1%.)
Regards, -Harry
*From:* Jean Hsiao <[email protected]>
*Sent:* Thursday, May 27, 2021 3:06 PM
*To:* Van Haaren, Harry <[email protected]>; Timothy Redaelli
<[email protected]>; Amber, Kumar <[email protected]>;
[email protected]; Jean Hsiao <[email protected]>
*Cc:* [email protected]; [email protected]; Stokes, Ian
<[email protected]>; Christian Trautman <[email protected]>;
Ferriter, Cian <[email protected]>
*Subject:* Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations
On 5/26/21 4:52 PM, Van Haaren, Harry wrote:
Hi Jean,
Thanks for the very detailed info, this was very useful in
identifying the differences between the test setup that we have
been running and what your configuration is.
The issue being observed by your testing is due to multiple
root-causes; hence this has taken some time to identify, apologies
for the delay in response!
In order to reproduce the numbers from the OVS Optimization
whitepaper (where the AVX512 patch performance results are stated),
please make the following adjustments to the testing setup:
1. Change from "No Drop Rate" testing, to "packet blast" testing.
Hi Harry,
Just to clarify: you want higher PvP Mpps while allowing some
drops.
If that's the case, we can relax the loss ratio from 0 to 0.002% --- 20
losses per million packets. If you want a higher loss rate, please specify.
Thanks!
Jean
2. Turn off EMC
With the above two changes, your results are expected to show
approximately the same value-add as we show in the whitepaper.
Please note there remain differences: the whitepaper compares 2nd
generation vs 3rd generation Xeon processors, while the benchmarks
here are on 1st generation.
(For those who prefer CPU code names, the whitepaper measures
Cascade Lake vs Ice Lake, and the results here are Skylake.)
Regarding the changes (as above)
1. I understand that OVS deployments _are_ sensitive to
No-Drop-Rate performance, and we will work on a solution that
has better-than-scalar-code performance in this usage too.
2. EMC is handling all traffic today, which hides the DPCLS
optimizations as no packets hit DPCLS. (This explains why
"master" and "avx512_gather 3" performance are both 15.2 mpps).
If you can make time to re-run the benchmarks with the above 2
changes, I expect we will see performance benefits.
To reduce the amount of benchmarking work, running just the "Plus All
Three" workload would be sufficient to show the value of the avx512
optimizations.
Thanks for the input on performance here! Regards, -Harry
*From:* Jean Hsiao <[email protected]>
*Sent:* Friday, May 14, 2021 5:29 PM
*To:* Van Haaren, Harry <[email protected]>; Timothy Redaelli
<[email protected]>; Amber, Kumar <[email protected]>;
[email protected]
*Cc:* [email protected]; [email protected]; Stokes, Ian
<[email protected]>; Christian Trautman <[email protected]>
*Subject:* Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure +
Optimizations
Hi Harry,
Please take a look. Let me know if you need more info.
Thanks!
Jean
*1&3)*
Master run: See attachments master-1-0 and master-1-1.
Plus avx512: See attachments avx512-1-0 and avx512-1-1
NOTE: On a quick look I don't see any avx512 function(s). Is that
because EMC is doing all the work?
*2)* NOTE: I am using different commands --- pmd-stats-clear and
pmd-stats-show; now you can see 100% processing cycles.
*[root@netqe29 jhsiao]# ovs-appctl dpif-netdev/pmd-stats-clear
[root@netqe29 jhsiao]# ovs-appctl dpif-netdev/pmd-stats-show*
pmd thread numa_id 0 core_id 24:
packets received: 79625792
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 79625760
smc hits: 0
megaflow hits: 0
avg. subtable lookups per megaflow hit: 0.00
miss with success upcall: 0
miss with failed upcall: 0
avg. packets per output batch: 32.00
idle cycles: 0 (0.00%)
*processing cycles: 27430462544 (100.00%)*
avg cycles per packet: 344.49 (27430462544/79625792)
avg processing cycles per packet: 344.49 (27430462544/79625792)
pmd thread numa_id 0 core_id 52:
packets received: 79771872
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 79771872
smc hits: 0
megaflow hits: 0
avg. subtable lookups per megaflow hit: 0.00
miss with success upcall: 0
miss with failed upcall: 0
avg. packets per output batch: 32.00
idle cycles: 0 (0.00%)
*processing cycles: 27430498048 (100.00%)*
avg cycles per packet: 343.86 (27430498048/79771872)
avg processing cycles per packet: 343.86 (27430498048/79771872)
main thread:
packets received: 0
packet recirculations: 0
avg. datapath passes per packet: 0.00
emc hits: 0
smc hits: 0
megaflow hits: 0
avg. subtable lookups per megaflow hit: 0.00
miss with success upcall: 0
miss with failed upcall: 0
avg. packets per output batch: 0.00
[root@netqe29 jhsiao]#
On 5/14/21 11:33 AM, Van Haaren, Harry wrote:
Hi Jean,
Apologies for top post – just a quick note here today. Thanks
for all the info, good amount of detail.
1 & 3)
Unfortunately the "perf top" output seems to be of a binary
without debug symbols, so it is not possible to see what is
what. (Apologies, I should have specified to include debug
symbols & then we can see function names like "dpcls_lookup"
and "miniflow_extract" instead of 0x00001234 :) I would be
interested in the output with function-names, if that's possible?
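Something along these lines should give symbol names (a sketch only,
assuming a debuginfo package matching the running build is available,
and using the PMD core ids from your pmd-stats-show output):

# Install debug symbols for the running openvswitch2.15 build (assumes a
# matching -debuginfo package exists for the special build)
dnf debuginfo-install openvswitch2.15
# Sample only the two PMD hyperthreads (24 and 52 in your stats output)
perf top -C 24,52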
2) Is it normal that the vswitch datapath core is >= 80% idle?
This seems a little strange – and might hint that the
bottleneck is not on the OVS vswitch datapath cores? (from
your pmd-perf-stats below):
- idle iterations: 17444555421 ( 84.1 % of used cycles)
- busy iterations: 84924866 ( 15.9 % of used cycles)
4) Ah yes, no-drop testing, "failed to converge" suddenly
makes a lot of sense, thanks!
Regards, -Harry
*From:* Jean Hsiao <[email protected]>
*Sent:* Thursday, May 13, 2021 3:27 PM
*To:* Van Haaren, Harry <[email protected]>; Timothy Redaelli
<[email protected]>; Amber, Kumar <[email protected]>;
[email protected]; Jean Hsiao <[email protected]>
*Cc:* [email protected]; [email protected]; Stokes, Ian
<[email protected]>; Christian Trautman <[email protected]>
*Subject:* Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure +
Optimizations
On 5/11/21 7:35 AM, Van Haaren, Harry wrote:
-----Original Message-----
From: Timothy Redaelli <[email protected]>
Sent: Monday, May 10, 2021 6:43 PM
To: Amber, Kumar <[email protected]>; [email protected]
Cc: [email protected]; [email protected]; [email protected];
Van Haaren, Harry <[email protected]>
Subject: Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure
+ Optimizations
<snip patchset details for brevity>
Hi,
we (as Red Hat) did some tests with a "special" build created on top
of master (a019868a6268 at that time) with the 2 series ("DPIF
Framework + Optimizations" and "MFEX Infrastructure + Optimizations")
cherry-picked.
The spec file was also modified in order to add "-msse4.2 -mpopcnt"
to OVS CFLAGS.
Hi Timothy,
Thanks for testing and reporting back your findings! Most
of the configuration is clear to me, but I have a few open
questions inline below for context.
The performance numbers reported in the email below do not
show benefit when enabling AVX512, which contradicts our
recent whitepaper on benchmarking an Optimized Deployment
of OVS, which includes the AVX512 patches you've
benchmarked too.
Specifically Table 8. for DPIF/MFEX patches, and Table 9.
for the overall optimizations at a platform level are
relevant:
https://networkbuilders.intel.com/solutionslibrary/open-vswitch-optimized-deployment-benchmark-technology-guide
Based on the differences between these performance
reports, there must be some discrepancy in our
testing/measurements.
I hope that the questions below help us understand any
differences so we can all measure the benefits from these
optimizations.
Regards, -Harry
RPM=openvswitch2.15-2.15.0-37.avx512.1.el8fdp (the
"special" build with
the patches backported)
* Master --- 15.2 Mpps
* Plus "avx512_gather 3" Only --- 15.2 Mpps
* Plus "dpif-set dpif_avx512" Only --- 10.1 Mpps
* Plus "miniflow-parser-set study" --- Failed to
converge
* Plus all three --- 13.5 Mpps
Open questions:
1) Is CPU frequency turbo enabled in any scenario, or
always pinned to the 2.6 GHz base frequency?
- A "perf top -C x,y" (where x,y are datapath
hyperthread ids) would be interesting to compare with 3)
below.
See the attached screenshots for two samples --- master-0 and
master-1.
2) "plus Avx512 gather 3" (aka, DPCLS in AVX512), we see
same performance. Is DPCLS in use, or is EMC doing all the
work?
- The output of " ovs-appctl dpif-netdev/pmd-perf-show"
would be interesting to understand where packets are
classified.
EMC is doing all the work --- see the log below. This could explain
why setting avx512 is not helping.
NOTE: Our initial study showed that disabling EMC didn't help
avx512 win the case.
[root@netqe29 jhsiao]# ovs-appctl
dpif-netdev/subtable-lookup-prio-get
Available lookup functions (priority : name)
0 : autovalidator
*1 : generic*
0 : avx512_gather
[root@netqe29 jhsiao]#
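For reference, the same appctl interface can be used to raise the
avx512 implementation's priority and then re-check which lookup
function is preferred (this simply repeats the prio-set command quoted
earlier in the thread, followed by the prio-get used above):

ovs-appctl dpif-netdev/subtable-lookup-prio-set avx512_gather 3
ovs-appctl dpif-netdev/subtable-lookup-prio-get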
sleep 60; ovs-appctl dpif-netdev/pmd-perf-show
Time: 13:54:40.213
Measurement duration: 2242.679 s
pmd thread numa_id 0 core_id 24:
Iterations: 17531214131 (0.13 us/it)
- Used TSC cycles: 5816810246080 (100.1 % of total cycles)
- idle iterations: 17446464548 ( 84.1 % of used cycles)
- busy iterations: 84749583 ( 15.9 % of used cycles)
Rx packets: 2711982944 (1209 Kpps, 340 cycles/pkt)
Datapath passes: 2711982944 (1.00 passes/pkt)
- EMC hits: 2711677677 (100.0 %)
- SMC hits: 0 ( 0.0 %)
- Megaflow hits: 305261 ( 0.0 %, 1.00 subtbl
lookups/hit)
- Upcalls: 6 ( 0.0 %, 0.0 us/upcall)
- Lost upcalls: 0 ( 0.0 %)
Tx packets: 2711982944 (1209 Kpps)
Tx batches: 84749583 (32.00 pkts/batch)
Time: 13:54:40.213
Measurement duration: 2242.675 s
pmd thread numa_id 0 core_id 52:
Iterations: 17529480287 (0.13 us/it)
- Used TSC cycles: 5816709563052 (100.1 % of total cycles)
- idle iterations: 17444555421 ( 84.1 % of used cycles)
- busy iterations: 84924866 ( 15.9 % of used cycles)
Rx packets: 2717592640 (1212 Kpps, 340 cycles/pkt)
Datapath passes: 2717592640 (1.00 passes/pkt)
- EMC hits: 2717280240 (100.0 %)
- SMC hits: 0 ( 0.0 %)
- Megaflow hits: 312362 ( 0.0 %, 1.00 subtbl
lookups/hit)
- Upcalls: 6 ( 0.0 %, 0.0 us/upcall)
- Lost upcalls: 0 ( 0.0 %)
Tx packets: 2717592608 (1212 Kpps)
Tx batches: 84924866 (32.00 pkts/batch)
[root@netqe29 jhsiao]#
3) "dpif-set dpif_avx512" only. The performance here is
very strange, with ~30% reduction, while our testing shows
performance improvement.
- A "perf top" here (compared vs step 1) would be
helpful to see what is going on
See avx512-0 and avx512-1 attachments.
4) "miniflow parser set study", I don't understand what is
meant by "Failed to converge"?
This is a 64-byte, 0-loss run. So, "failed to converge" means
the binary search failed to find a meaningful Mpps value. This
can happen when drops occur --- even 1 out
of a million packets.
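To make the failure mode concrete, here is a minimal sketch (not the
actual Trex search code) of the kind of rate binary search involved;
run_trial is a hypothetical callback that offers traffic at a given
rate for one trial and returns the measured loss ratio. With a 0-loss
limit, a single dropped packet fails every trial, so no passing rate
is ever found:

# Sketch only --- not the Trex implementation.
def find_max_rate(run_trial, max_rate_mpps, loss_limit=0.0, step_mpps=0.05):
    lo, hi = 0.0, max_rate_mpps
    best = None
    while hi - lo > step_mpps:
        rate = (lo + hi) / 2.0
        if run_trial(rate) <= loss_limit:
            best = rate   # trial passed within the loss limit: try higher
            lo = rate
        else:
            hi = rate     # too many drops at this rate: try lower
    return best           # None here corresponds to "failed to converge"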
- Is the traffic running in your benchmark
Ether()/IP()/UDP() ?
- Note that the only traffic pattern accelerated today
is Ether()/IP()/UDP() (see patch
https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
for details). The next revision of the patchset will
include other traffic patterns, for example
Ether()/Dot1Q()/IP()/UDP() and Ether()/IP()/TCP().
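For clarity, a short Scapy-style sketch of what "Ether()/IP()/UDP()"
means in practice --- the addresses, ports, and flow-spreading scheme
below are illustrative only, not the profile used in your runs:

# Illustrative only: 64-byte Ether()/IP()/UDP() frames spread over n_flows
# flows by varying the destination IP.
from scapy.all import Ether, IP, UDP, Raw

def build_flows(n_flows=1024, frame_size=64):
    pkts = []
    for i in range(n_flows):
        pkt = (Ether() /
               IP(src="16.0.0.1", dst="48.0.%d.%d" % (i // 256, i % 256)) /
               UDP(sport=1025, dport=1025))
        pad = frame_size - 4 - len(pkt)   # leave 4 bytes for the Ethernet FCS
        pkts.append(pkt / Raw(b"\x00" * max(pad, 0)))
    return pkts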
RPM=openvswitch2.15-2.15.0-15.el8fdp (w/o "-msse4.2
-mpopcnt")
* 15.2 Mpps
5) What CFLAGS "-march=" CPU ISA and "-O" optimization
options are being used for the package?
- It is likely that "-msse4.2 -mpopcnt" is already
implied if -march=corei7 or Nehalem for example.
Tim, can you answer this question?
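(One quick way to check, in case it is useful: dump the compiler's
predefined macros for the -march value used in the spec file and look
for the SSE4.2/POPCNT defines --- -march=corei7 below is only an
example value:)

# Does this -march already imply SSE4.2 and POPCNT?
gcc -march=corei7 -dM -E -x c /dev/null | grep -E '__SSE4_2__|__POPCNT__'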
P2P benchmark
* ovs-dpdk/25 Gb i40e <-> trex/i40e
* single queue two pmd's --- two HT's out of a CPU
core.
Host CPU
Model name: Intel(R) Xeon(R) Gold 6132 CPU @
2.60GHz
Thanks for detailing the configuration, and looking
forward to understanding the configuration/performance better.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev