Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations

Van Haaren, Harry Wed, 02 Jun 2021 13:55:59 -0700

Hi Jean,

Thanks for the good news – indeed great that you have reproduced the 
performance benefits!


1.20x is a good result, and in line with expectations for the 1st Generation 
Xeon/Skylake CPUs being tested.
For 3rd generation Xeon/Icelake CPUs there are some additional benefits due to 
AVX512-VBMI ISA availability,
which will be used automatically when the MFEX "study" command is run.

Thanks again for your help & performance testing here. Regards, -Harry


From: Jean Hsiao <[email protected]>
Sent: Tuesday, June 1, 2021 7:06 PM
To: Van Haaren, Harry <[email protected]>; Timothy Redaelli 
<[email protected]>; Amber, Kumar <[email protected]>; 
[email protected]; Jean Hsiao <[email protected]>
Cc: [email protected]; [email protected]; Stokes, Ian <[email protected]>; 
Christian Trautman <[email protected]>; Ferriter, Cian 
<[email protected]>
Subject: Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations


Hi Harry,

Great News --- got performance improvement!

With Tim's new build from last Friday, I have a new set of performance numbers. 
I got about a 20% performance improvement with all three enhancements included 
--- whether the loss limit is 0.1% or 0.002%. There was no improvement when 
only the first two enhancements were included.

The summary report is attached right below. Please review, and let me know if 
the current performance meets your expectation.

Thanks!

Jean

=====================================

OVS RPMS=openvswitch2.15-2.15.0-23.el8fdp.x86_64, 
openvswitch2.15-debuginfo-2.15.0-23.el8fdp.x86_64

EMC Off

Trex binary search for next two tests --- 64 bytes, 0.1% loss limit, 1024 
flows, 60 secs trial/600 secs validation

Test case #1

  *   default --- not including any of the three enhancements
  *   Mpps=12.4 Mpps (100%)

Test case #2

  *   Including all three enhancements --- ovs-appctl 
dpif-netdev/subtable-lookup-prio-set avx512_gather 3; ovs-appctl 
dpif-netdev/dpif-set dpif_avx512 ; ovs-appctl dpif-netdev/miniflow-parser-set 
study
  *   Mpps=14.9 Mpps (120.1%)

Trex binary search for next two tests --- 64 bytes, 0.002% loss limit, 1024 
flows, 60 secs trial/600 secs validation

Test case #1

  *   default --- not including any of the three enhancements
  *   Mpps=12.1 Mpps (100%)

Test case #2

  *   Including all three enhancements --- ovs-appctl 
dpif-netdev/subtable-lookup-prio-set avx512_gather 3; ovs-appctl 
dpif-netdev/dpif-set dpif_avx512 ; ovs-appctl dpif-netdev/miniflow-parser-set 
study
  *   Mpps=14.6 Mpps (120.6%)

NOTE: There is no performance gain if only the first two enhancements are 
included.
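
As a quick sanity check on the percentages in the report, the relative 
throughput can be recomputed from the rounded Mpps figures. This is only a 
sketch: the helper name relative_throughput is made up for illustration, and 
because the inputs are rounded to one decimal place the result can differ from 
the reported percentage in the last digit.

```python
def relative_throughput(baseline_mpps: float, optimized_mpps: float) -> float:
    """Return the optimized throughput as a percentage of the baseline."""
    if baseline_mpps <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * optimized_mpps / baseline_mpps


# Rounded (baseline, optimized) Mpps values taken from the summary report.
runs = {
    "0.1% loss limit": (12.4, 14.9),
    "0.002% loss limit": (12.1, 14.6),
}

for name, (base, opt) in runs.items():
    pct = relative_throughput(base, opt)
    print(f"{name}: {pct:.1f}% of baseline ({pct - 100:.1f}% gain)")
```
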

Bonus run with latest openvswitch2.15 fdp --- openvswitch2.15-2.15.0-23.el8fdp

  *   0.1% loss limit
  *   10.1 Mpps


On 5/27/21 10:12 AM, Van Haaren, Harry wrote:
(Sorry for top posting – HTML/email-client disagree with inline answers below 
:/ )

Hi Jean,

Yes I expect drops of 0.002% to be enough tolerance. (If the performance does 
not improve to Scalar (or higher) please try a much higher value like 0.1%.)
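
To put the two loss limits mentioned here in concrete terms, a percentage can 
be converted into a number of tolerated lost packets. The helper name 
allowed_losses is hypothetical and only serves this illustration.

```python
def allowed_losses(loss_limit_percent: float, packets_sent: int) -> float:
    """Number of lost packets tolerated for a loss limit given in percent."""
    return packets_sent * loss_limit_percent / 100.0


for limit in (0.002, 0.1):
    per_million = allowed_losses(limit, 1_000_000)
    print(f"{limit}% loss limit -> about {per_million:g} lost packets per million")
```
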

Regards, -Harry


From: Jean Hsiao <[email protected]>
Sent: Thursday, May 27, 2021 3:06 PM
To: Van Haaren, Harry <[email protected]>; Timothy Redaelli 
<[email protected]>; Amber, Kumar <[email protected]>; 
[email protected]; Jean Hsiao <[email protected]>
Cc: [email protected]; [email protected]; Stokes, Ian <[email protected]>; 
Christian Trautman <[email protected]>; Ferriter, Cian 
<[email protected]>
Subject: Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations



On 5/26/21 4:52 PM, Van Haaren, Harry wrote:
Hi Jean,

Thanks for the very detailed info, this was very useful in identifying the 
differences between the test setup that we have been running and what your 
configuration is.
The issue being observed by your testing is due to multiple root-causes; hence 
this has taken some time to identify, apologies for the delay in response!

In order to reproduce the numbers from the OVS Optimization whitepaper (where 
the AVX512 patch performance results are stated),
please make the following adjustments to the testing setup:

  1.  Change from "No Drop Rate" testing, to "packet blast" testing.

Hi Harry,

Just to clarify: you want a higher PvP Mpps figure while allowing some drops.

If that's the case, we can relax the loss ratio from 0 to 0.002% --- 20 losses 
per million. If you want a higher loss rate, please specify.

Thanks!

Jean

  2.  Turn off EMC

With the above two changes, your results are expected to show approximately 
the same value-add as we show in the whitepaper.
Please note there remain differences: the whitepaper compares 2nd generation vs 
3rd generation Xeon processors, while benchmarks here are 1st Generation.
(For those more familiar with CPU code names, the whitepaper measures 
Cascade-Lake vs Icelake, and the results here are Skylake).

Regarding the changes (as above)

  1.  I understand that OVS deployments _are_ sensitive to No-Drop-Rate 
performance, and we will work on a solution that has better-than-scalar-code 
performance in this usage too.
  2.  EMC is handling all traffic today, which hides the DPCLS optimizations as 
no packets hit DPCLS. (This explains why "master" and "avx512_gather 3" 
performance are both 15.2 Mpps).

If you can make time to re-run the benchmarks with the above 2 changes, I 
expect we will see performance benefits.
To reduce the amount of benchmarking work, just running the "Plus All Three" 
workload would be sufficient to show the value of the avx512 optimizations.

Thanks for the input on performance here! Regards, -Harry


From: Jean Hsiao <[email protected]>
Sent: Friday, May 14, 2021 5:29 PM
To: Van Haaren, Harry <[email protected]>; Timothy Redaelli 
<[email protected]>; Amber, Kumar <[email protected]>; 
[email protected]
Cc: [email protected]; [email protected]; Stokes, Ian <[email protected]>; 
Christian Trautman <[email protected]>
Subject: Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations


Hi Harry,

Please take a look. Let me know if you need more info.

Thanks!

Jean

1&3)

Master run: See attachments master-1-0 and master-1-1.

Plus avx512: See attachments avx512-1-0 and avx512-1-1

NOTE: At a quick look I don't see any avx512 function(s) around. Is it because 
EMC is doing all the work?

2) NOTE: I am using different commands --- pmd-stats-clear and pmd-stats-show; 
now you can see 100% processing cycles.

[root@netqe29 jhsiao]# ovs-appctl dpif-netdev/pmd-stats-clear
[root@netqe29 jhsiao]# ovs-appctl dpif-netdev/pmd-stats-show
pmd thread numa_id 0 core_id 24:
  packets received: 79625792
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 79625760
  smc hits: 0
  megaflow hits: 0
  avg. subtable lookups per megaflow hit: 0.00
  miss with success upcall: 0
  miss with failed upcall: 0
  avg. packets per output batch: 32.00
  idle cycles: 0 (0.00%)
  processing cycles: 27430462544 (100.00%)
  avg cycles per packet: 344.49 (27430462544/79625792)
  avg processing cycles per packet: 344.49 (27430462544/79625792)
pmd thread numa_id 0 core_id 52:
  packets received: 79771872
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 79771872
  smc hits: 0
  megaflow hits: 0
  avg. subtable lookups per megaflow hit: 0.00
  miss with success upcall: 0
  miss with failed upcall: 0
  avg. packets per output batch: 32.00
  idle cycles: 0 (0.00%)
  processing cycles: 27430498048 (100.00%)
  avg cycles per packet: 343.86 (27430498048/79771872)
  avg processing cycles per packet: 343.86 (27430498048/79771872)
main thread:
  packets received: 0
  packet recirculations: 0
  avg. datapath passes per packet: 0.00
  emc hits: 0
  smc hits: 0
  megaflow hits: 0
  avg. subtable lookups per megaflow hit: 0.00
  miss with success upcall: 0
  miss with failed upcall: 0
  avg. packets per output batch: 0.00
[root@netqe29 jhsiao]#
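
The "avg cycles per packet" lines in the output above are simply processing 
cycles divided by packets received. The sketch below recomputes them from the 
printed counters and, as a rough extra step, converts them into a packet rate. 
That conversion assumes the cycle counter ticks at the CPU's 2.6 GHz nominal 
frequency, which this thread does not confirm, so treat the Mpps figures as an 
estimate only.

```python
# Counters copied from the pmd-stats-show output above (core_id 24 and 52).
pmd_stats = {
    24: {"packets": 79625792, "processing_cycles": 27430462544},
    52: {"packets": 79771872, "processing_cycles": 27430498048},
}

# Assumption: cycles are counted at the 2.6 GHz nominal frequency.
ASSUMED_HZ = 2.6e9


def cycles_per_packet(packets: int, cycles: int) -> float:
    """Average processing cycles spent on each received packet."""
    return cycles / packets


total_mpps = 0.0
for core, stats in pmd_stats.items():
    cpp = cycles_per_packet(stats["packets"], stats["processing_cycles"])
    mpps = ASSUMED_HZ / cpp / 1e6  # packets per second if the core is 100% busy
    total_mpps += mpps
    print(f"core {core}: {cpp:.2f} cycles/pkt, ~{mpps:.2f} Mpps")

print(f"both PMD threads together: ~{total_mpps:.1f} Mpps")
```
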
On 5/14/21 11:33 AM, Van Haaren, Harry wrote:
Hi Jean,

Apologies for top post – just a quick note here today. Thanks for all the info, 
good amount of detail.

1 & 3)
Unfortunately the "perf top" output seems to be of a binary without debug 
symbols, so it is not possible to see what is what. (Apologies, I should have 
specified to include debug symbols & then we can see function names like 
"dpcls_lookup" and "miniflow_extract" instead of 0x00001234 :) I would be 
interested in the output with function-names, if that's possible?

2) Is it normal that the vswitch datapath core is >= 80% idle? This seems a 
little strange – and might hint that the bottleneck is not on the OVS vswitch 
datapath cores? (from your pmd-perf-stats below):
  - idle iterations:  17444555421  ( 84.1 % of used cycles)
  - busy iterations:     84924866  ( 15.9 % of used cycles)

4) Ah yes, no-drop testing, "failed to converge" suddenly makes a lot of sense, 
thanks!

Regards, -Harry

From: Jean Hsiao <[email protected]>
Sent: Thursday, May 13, 2021 3:27 PM
To: Van Haaren, Harry <[email protected]>; Timothy Redaelli 
<[email protected]>; Amber, Kumar <[email protected]>; 
[email protected]; Jean Hsiao <[email protected]>
Cc: [email protected]; [email protected]; Stokes, Ian <[email protected]>; 
Christian Trautman <[email protected]>
Subject: Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations



On 5/11/21 7:35 AM, Van Haaren, Harry wrote:

-----Original Message-----
From: Timothy Redaelli <[email protected]>
Sent: Monday, May 10, 2021 6:43 PM
To: Amber, Kumar <[email protected]>; [email protected]
Cc: [email protected]; [email protected]; [email protected]; Van Haaren, Harry 
<[email protected]>
Subject: Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations



<snip patchset details for brevity>







Hi,

we (as Red Hat) did some tests with a "special" build created on top of
master (a019868a6268 at that time) with the 2 series ("DPIF
Framework + Optimizations" and "MFEX Infrastructure + Optimizations")
cherry-picked.
The spec file was also modified in order to add "-msse4.2 -mpopcnt"
to OVS CFLAGS.



Hi Timothy,



Thanks for testing and reporting back your findings! Most of the configuration 
is clear to me, but I have a few open questions inline below for context.



The performance numbers reported in the email below do not show a benefit when 
enabling AVX512, which contradicts our recent whitepaper on benchmarking an 
Optimized Deployment of OVS, which includes the AVX512 patches you've 
benchmarked too.
Specifically, Table 8 (DPIF/MFEX patches) and Table 9 (overall optimizations 
at a platform level) are relevant:

https://networkbuilders.intel.com/solutionslibrary/open-vswitch-optimized-deployment-benchmark-technology-guide



Based on the differences between these performance reports, there must be some 
discrepancy in our testing/measurements.

I hope that the questions below help us understand any differences so we can 
all measure the benefits from these optimizations.



Regards, -Harry





RPM=openvswitch2.15-2.15.0-37.avx512.1.el8fdp (the "special" build with
the patches backported)



   * Master --- 15.2 Mpps

   * Plus "avx512_gather 3" Only --- 15.2 Mpps

   * Plus "dpif-set dpif_avx512" Only --- 10.1 Mpps

   * Plus "miniflow-parser-set study" --- Failed to converge

   * Plus all three --- 13.5 Mpps



Open questions:

1) Is CPU frequency turbo enabled in any scenario, or always pinned to the 2.6 
GHz base frequency?

   - A "perf top -C x,y"   (where x,y are datapath hyperthread ids) would be 
interesting to compare with 3) below.
See attached screenshots for two samples --- master-0 and master-1

2) "plus Avx512 gather 3" (aka, DPCLS in AVX512), we see same performance. Is 
DPCLS in use, or is EMC doing all the work?

   - The output of " ovs-appctl dpif-netdev/pmd-perf-show" would be interesting 
to understand where packets are classified.

EMC is doing all the work --- see log below. This could explain why setting 
avx512 is not helping.

NOTE: Our initial study showed that disabling EMC didn't help avx512 win the 
case.

[root@netqe29 jhsiao]# ovs-appctl dpif-netdev/subtable-lookup-prio-get
Available lookup functions (priority : name)
  0 : autovalidator
  1 : generic
  0 : avx512_gather
[root@netqe29 jhsiao]#

sleep 60; ovs-appctl dpif-netdev/pmd-perf-show

Time: 13:54:40.213
Measurement duration: 2242.679 s

pmd thread numa_id 0 core_id 24:

  Iterations:         17531214131  (0.13 us/it)
  - Used TSC cycles: 5816810246080  (100.1 % of total cycles)
  - idle iterations:  17446464548  ( 84.1 % of used cycles)
  - busy iterations:     84749583  ( 15.9 % of used cycles)
  Rx packets:          2711982944  (1209 Kpps, 340 cycles/pkt)
  Datapath passes:     2711982944  (1.00 passes/pkt)
  - EMC hits:          2711677677  (100.0 %)
  - SMC hits:                   0  (  0.0 %)
  - Megaflow hits:         305261  (  0.0 %, 1.00 subtbl lookups/hit)
  - Upcalls:                    6  (  0.0 %, 0.0 us/upcall)
  - Lost upcalls:               0  (  0.0 %)
  Tx packets:          2711982944  (1209 Kpps)
  Tx batches:            84749583  (32.00 pkts/batch)

Time: 13:54:40.213
Measurement duration: 2242.675 s

pmd thread numa_id 0 core_id 52:

  Iterations:         17529480287  (0.13 us/it)
  - Used TSC cycles: 5816709563052  (100.1 % of total cycles)
  - idle iterations:  17444555421  ( 84.1 % of used cycles)
  - busy iterations:     84924866  ( 15.9 % of used cycles)
  Rx packets:          2717592640  (1212 Kpps, 340 cycles/pkt)
  Datapath passes:     2717592640  (1.00 passes/pkt)
  - EMC hits:          2717280240  (100.0 %)
  - SMC hits:                   0  (  0.0 %)
  - Megaflow hits:         312362  (  0.0 %, 1.00 subtbl lookups/hit)
  - Upcalls:                    6  (  0.0 %, 0.0 us/upcall)
  - Lost upcalls:               0  (  0.0 %)
  Tx packets:          2717592608  (1212 Kpps)
  Tx batches:            84924866  (32.00 pkts/batch)
[root@netqe29 jhsiao]#
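
To see where packets are classified, the hit counters in the pmd-perf-show 
output above can be turned into shares of all datapath passes. The sketch uses 
the core_id 24 numbers as printed; the one-decimal percentages in the tool 
output hide the small fraction of packets that reach the megaflow lookup, which 
is the stage DPCLS serves. The helper name share_percent is made up for this 
illustration.

```python
# Counters copied from the pmd-perf-show output above (core_id 24).
datapath_passes = 2711982944
hits = {
    "EMC": 2711677677,
    "SMC": 0,
    "Megaflow": 305261,
    "Upcalls": 6,
}


def share_percent(count: int, total: int) -> float:
    """Share of datapath passes handled by one classification stage."""
    return 100.0 * count / total if total else 0.0


for stage, count in hits.items():
    pct = share_percent(count, datapath_passes)
    print(f"{stage:<10} {count:>12}  {pct:8.4f} %")

# Every datapath pass is accounted for by exactly one stage.
assert sum(hits.values()) == datapath_passes
```
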










3) "dpif-set dpif_avx512" only. The performance here is very strange, with ~30% 
reduction, while our testing shows performance improvement.

   - A "perf top" here (compared vs step 1) would be helpful to see what is 
going on
See avx512-0 and avx512-1 attachments.

4) "miniflow parser set study", I don't understand what is meant by "Failed to 
converge"?
This is a 64-byte, 0-loss run. So "Failed to converge" means the binary search 
failed to find a meaningful Mpps value. This could be because drops are 
happening --- possibly as few as 1 in a million packets.
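
A minimal sketch of why a zero-loss binary search can fail to converge. This 
is not TRex's actual search code; search_max_rate, clean_dut and leaky_dut are 
hypothetical names, and both devices are simulated. The point is only that a 
device dropping even 1 packet per million never passes a 0% loss trial, while 
the same device passes easily with a 0.002% limit.

```python
from typing import Callable, Optional


def search_max_rate(
    measure_loss: Callable[[float], float],  # rate in Mpps -> loss in percent
    max_rate: float,
    loss_limit: float,
    precision: float = 0.1,
) -> Optional[float]:
    """Binary-search the highest rate whose loss stays within loss_limit.

    Returns None when no trial passes, i.e. the search fails to converge.
    """
    low, high = 0.0, max_rate
    best = None
    while high - low > precision:
        rate = (low + high) / 2
        if measure_loss(rate) <= loss_limit:
            best, low = rate, rate  # trial passed: try a higher rate
        else:
            high = rate             # trial failed: back off
    return best


def clean_dut(rate: float) -> float:
    """Simulated device: lossless up to 12 Mpps, drops the excess above that."""
    return max(0.0, rate - 12.0) / rate * 100.0


def leaky_dut(rate: float) -> float:
    """Simulated device that always drops 1 packet per million (0.0001%)."""
    return 0.0001


print(search_max_rate(clean_dut, 20.0, loss_limit=0.0))
print(search_max_rate(leaky_dut, 20.0, loss_limit=0.0))    # None: never passes
print(search_max_rate(leaky_dut, 20.0, loss_limit=0.002))
```
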







   - Is the traffic running in your benchmark Ether()/IP()/UDP() ?

   - Note that the only traffic pattern accelerated today is Ether()/IP()/UDP() 
(see patch 
https://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/
 for details). The next revision of the patchset will include other traffic 
patterns, for example Ether()/Dot1Q()/IP()/UDP() and Ether()/IP()/TCP().





RPM=openvswitch2.15-2.15.0-15.el8fdp (w/o "-msse4.2 -mpopcnt")

   * 15.2 Mpps



5) What CFLAGS "-march=" CPU ISA and "-O" optimization options are being used 
for the package?

   - It is likely that "-msse4.2 -mpopcnt" is already implied if -march=corei7 
or Nehalem for example.


Tim, can you answer this question?

P2P benchmark

   * ovs-dpdk/25 Gb i40e <-> trex/i40e

   * single queue, two PMDs --- two hyperthreads (HTs) of one CPU core.



Host CPU

Model name:          Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz



Thanks for detailing the configuration, and looking forward to understanding 
the configuration/performance better.


_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
