On 5/14/21 11:33 AM, Van Haaren, Harry wrote:

Hi Jean,

Apologies for top post – just a quick note here today. Thanks for all the info, good amount of detail.

1 & 3)

Unfortunately the "perf top" output seems to be of a binary without debug symbols, so it is not possible to see what is what. Apologies, I should have specified to include debug symbols; with them we can see function names like "dpcls_lookup" and "miniflow_extract" instead of 0x00001234 :) I would be interested in the output with function names, if that's possible?

Not sure if the debug symbols for Tim's master build are still around. Will check.
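
If the debug build does turn up, a symbol-level profile can be captured roughly
like this (a sketch: the debuginfo package name is assumed from the RPM name
below, and core IDs 24,52 are taken from the pmd-perf-show output):

  dnf debuginfo-install openvswitch2.15   # assumed package name
  perf top -C 24,52 -g                    # profile only the PMD hyperthreads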

2) Is it normal that the vswitch datapath core is >= 80% idle? This seems a little strange – and might hint that the bottleneck is not on the OVS vswitch datapath cores? (from your pmd-perf-stats below):

  - idle iterations: 17444555421  ( 84.1 % of used cycles)
  - busy iterations:     84924866  ( 15.9 % of used cycles)

Should be almost 100% busy. Not sure if the pmd-perf stats have a clear option like *ovs-appctl dpif-netdev/pmd-stats-clear*.
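
The counters can be reset with pmd-stats-clear (it also resets the
pmd-perf-show numbers), so one option is to clear once traffic is in steady
state and re-sample over a fixed window, e.g.:

  ovs-appctl dpif-netdev/pmd-stats-clear
  sleep 60
  ovs-appctl dpif-netdev/pmd-perf-show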

4) Ah yes, no-drop testing, "failed to converge" suddenly makes a lot of sense, thanks!

Regards, -Harry

*From:* Jean Hsiao <[email protected]>
*Sent:* Thursday, May 13, 2021 3:27 PM
*To:* Van Haaren, Harry <[email protected]>; Timothy Redaelli <[email protected]>; Amber, Kumar <[email protected]>; [email protected]; Jean Hsiao <[email protected]>
*Cc:* [email protected]; [email protected]; Stokes, Ian <[email protected]>; Christian Trautman <[email protected]>
*Subject:* Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations

On 5/11/21 7:35 AM, Van Haaren, Harry wrote:

        -----Original Message-----
        From: Timothy Redaelli <[email protected]>
        Sent: Monday, May 10, 2021 6:43 PM
        To: Amber, Kumar <[email protected]>; [email protected]
        Cc: [email protected]; [email protected]; [email protected];
        Van Haaren, Harry <[email protected]>
        Subject: Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations

    <snip patchset details for brevity>

        Hi,

        we (as Red Hat) did some tests with a "special" build created on top
        of master (a019868a6268 at that time) with the 2 series ("DPIF
        Framework + Optimizations" and "MFEX Infrastructure + Optimizations")
        cherry-picked.

        The spec file was also modified in order to add "-msse4.2 -mpopcnt"
        to OVS CFLAGS.

    Hi Timothy,

    Thanks for testing and reporting back your findings! Most of the
    configuration is clear to me, but I have a few open questions
    inline below for context.

    The performance numbers reported in the email below do not show benefit
    when enabling AVX512, which contradicts our recent whitepaper on
    benchmarking an Optimized Deployment of OVS, which includes the AVX512
    patches you've benchmarked too.

    Specifically, Table 8 for the DPIF/MFEX patches and Table 9 for the
    overall optimizations at a platform level are relevant:

    https://networkbuilders.intel.com/solutionslibrary/open-vswitch-optimized-deployment-benchmark-technology-guide

    Based on the differences between these performance reports, there
    must be some discrepancy in our testing/measurements.

    I hope that the questions below help us understand any differences
    so we can all measure the benefits from these optimizations.

    Regards, -Harry

        RPM=openvswitch2.15-2.15.0-37.avx512.1.el8fdp (the "special" build
        with the patches backported)

           * Master --- 15.2 Mpps

           * Plus "avx512_gather 3" Only --- 15.2 Mpps

           * Plus "dpif-set dpif_avx512" Only --- 10.1 Mpps

           * Plus "miniflow-parser-set study" --- Failed to converge

           * Plus all three --- 13.5 Mpps
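
For reference, the three configurations above map to the following appctl
commands. The first is in upstream OVS; the other two are added by the
DPIF/MFEX series under test, so the exact syntax here is assumed from those
patches:

  ovs-appctl dpif-netdev/subtable-lookup-prio-set avx512_gather 3
  ovs-appctl dpif-netdev/dpif-set dpif_avx512
  ovs-appctl dpif-netdev/miniflow-parser-set study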

    Open questions:

    1) Is CPU frequency turbo enabled in any scenario, or always
    pinned to the 2.6 GHz base frequency?

       - A "perf top -C x,y"   (where x,y are datapath hyperthread
    ids) would be interesting to compare with 3) below.

See the attached screenshots for two samples --- master-0 and master-1.

    2) "plus Avx512 gather 3" (aka, DPCLS in AVX512), we see same
    performance. Is DPCLS in use, or is EMC doing all the work?

       - The output of " ovs-appctl dpif-netdev/pmd-perf-show" would
    be interesting to understand where packets are classified.

EMC is doing all the work --- see the log below. This could explain why enabling avx512 is not helping.

NOTE: Our initial study showed that disabling EMC didn't help avx512 come out ahead either.
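
For reference, one common way to take EMC fully out of the picture is to
disable EMC insertion (a sketch; this may not be exactly how the initial
study disabled it):

  ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0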

[root@netqe29 jhsiao]# ovs-appctl dpif-netdev/subtable-lookup-prio-get
Available lookup functions (priority : name)
  0 : autovalidator
*1 : generic*
  0 : avx512_gather
[root@netqe29 jhsiao]#

sleep 60; ovs-appctl dpif-netdev/pmd-perf-show


Time: 13:54:40.213
Measurement duration: 2242.679 s

pmd thread numa_id 0 core_id 24:

  Iterations:         17531214131  (0.13 us/it)
  - Used TSC cycles: 5816810246080  (100.1 % of total cycles)
  - idle iterations:  17446464548  ( 84.1 % of used cycles)
  - busy iterations:     84749583  ( 15.9 % of used cycles)
  Rx packets:          2711982944  (1209 Kpps, 340 cycles/pkt)
  Datapath passes:     2711982944  (1.00 passes/pkt)
  - EMC hits:          2711677677  (100.0 %)
  - SMC hits:                   0  (  0.0 %)
  - Megaflow hits:         305261  (  0.0 %, 1.00 subtbl lookups/hit)
  - Upcalls:                    6  (  0.0 %, 0.0 us/upcall)
  - Lost upcalls:               0  (  0.0 %)
  Tx packets:          2711982944  (1209 Kpps)
  Tx batches:            84749583  (32.00 pkts/batch)

Time: 13:54:40.213
Measurement duration: 2242.675 s

pmd thread numa_id 0 core_id 52:

  Iterations:         17529480287  (0.13 us/it)
  - Used TSC cycles: 5816709563052  (100.1 % of total cycles)
  - idle iterations:  17444555421  ( 84.1 % of used cycles)
  - busy iterations:     84924866  ( 15.9 % of used cycles)
  Rx packets:          2717592640  (1212 Kpps, 340 cycles/pkt)
  Datapath passes:     2717592640  (1.00 passes/pkt)
  - EMC hits:          2717280240  (100.0 %)
  - SMC hits:                   0  (  0.0 %)
  - Megaflow hits:         312362  (  0.0 %, 1.00 subtbl lookups/hit)
  - Upcalls:                    6  (  0.0 %, 0.0 us/upcall)
  - Lost upcalls:               0  (  0.0 %)
  Tx packets:          2717592608  (1212 Kpps)
  Tx batches:            84924866  (32.00 pkts/batch)
[root@netqe29 jhsiao]#
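
As a rough sanity check (assuming the PMDs run at the 2.6 GHz base frequency
with no turbo): 2.6e9 cycles/s divided by 340 cycles/pkt is about 7.6 Mpps per
PMD, or roughly 15.3 Mpps across the two hyperthreads, which lines up with the
15.2 Mpps zero-loss result.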


    3) "dpif-set dpif_avx512" only. The performance here is very
    strange, with ~30% reduction, while our testing shows performance
    improvement.

       - A "perf top" here (compared vs step 1) would be helpful to
    see what is going on

See avx512-0 and avx512-1 attachments.

    4) "miniflow parser set study", I don't understand what is meant
    by "Failed to converge"?

This is a 64-byte zero-loss run. So "failed to converge" means the binary search failed to settle on a meaningful Mpps value. This can happen when drops keep occurring --- even 1 out of a million packets.

       - Is the traffic running in your benchmark Ether()/IP()/UDP() ?

       - Note that the only traffic pattern accelerated today is
    Ether()/IP()/UDP() (see patch
    https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
    for details). The next revision of the patchset will include other
    traffic patterns, for example Ether()/Dot1Q()/IP()/UDP() and
    Ether()/IP()/TCP().

        RPM=openvswitch2.15-2.15.0-15.el8fdp (w/o "-msse4.2 -mpopcnt")

           * 15.2 Mpps

    5) What CFLAGS "-march=" CPU ISA and "-O" optimization options are
    being used for the package?

       - It is likely that "-msse4.2 -mpopcnt" is already implied if
    -march=corei7 or Nehalem for example.

Tim, can you answer this question?
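
In case it helps answer this, the ISA flags implied by a given -march value
and the distro default build flags can be checked on the build host with
something like (illustrative commands):

  gcc -Q --help=target -march=corei7 | grep -E 'sse4.2|popcnt'
  rpm --eval '%{optflags}'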

        P2P benchmark

           * ovs-dpdk/25 Gb i40e <-> trex/i40e

           * single queue, two PMDs --- two HTs of one CPU core.

        Host CPU

        Model name:          Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
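
For reference, pinning the two PMD hyperthreads is normally done via
pmd-cpu-mask; the mask below assumes logical CPUs 24 and 52 (as seen in the
pmd-perf-show output) and is only illustrative:

  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10000001000000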

    Thanks for detailing the configuration, and looking forward to
    understanding the configuration/performance better.

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
