Hi Pierrick,

Thanks for sending this detailed report.
I think the anomalies you are seeing may be due to the choice of NIC. Not 
that the Mellanox NIC is bad, but it is a NIC that not many people have tested 
NFVbench with yet.
I am copying Michael, as he may have used this same Mellanox NIC with NFVbench 
in a CNCF benchmarking project (VM and container benchmarking with k8s).


More inline…


From: <[email protected]> on behalf of "Pierrick Louin via 
Lists.Opnfv.Org" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 31, 2019 at 9:03 AM
To: "Alec Hothan (ahothan)" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: [opnfv-tech-discuss] #nfvbench - Looking for line rate performances 
using NFVbench - UPDATED

Hi Alec,

I'm Pierrick Louin working in the team of François-Régis Menguy @orange-Labs.

In our last experiments using the NFVbench application, we came across some 
issues, and a workaround, while trying to reach high performance from the bench 
at a high line rate.
Maybe you can help us understand what happens in the following observations.

The raw traces are attached as text files.

(sorry for resending, I hope my subscription works at last)

--------------------------------------------------------------------------------------------

CONFIGURATION & SOFTWARE
************************

These reference tests are performed over a simple loopback link between the 2 
ports of a NIC (either wired directly or through a hardware switch).
We studied the following two cases:

NIC Intel X520 (10 Gbits/s)
NIC Mellanox ConnectX-5 (25 Gbits/s)

Note that the list of CPU threads reserved for the generator is not 
optimized. There is room for tuning...
However, this was not the point of these tests since, as we will see, the 
performance issue appears to be related to the RX process.

[alec]
The reference NICs that we use are Intel X710, XL710 and XXV710 (10G, 40G and 
25G) and we have not seen any particular issues when rerunning benchmarks over 
and over again using the same TRex instance.
Intel X520 is clearly not a preferred NIC because it misses lots of features 
only available in newer generation NICs (offloads).
I may be able to get hold of a Mellanox 25G NIC soon, and this report is 
definitely a good step towards optimizing for that NIC.


The T-Rex generator runs with the same settings whether it is used 
standalone or wrapped in NFVbench.

TRex version: v2.59
NFVbench version: 3.5.0 (3.5.1.dev1)

Warning: I found that NFVbench performance is improved - in the Mellanox tests 
- provided that a T-Rex server has been launched and stopped once since the 
last reboot.

Otherwise there is no way to obtain a 2 x 25 Gbits/s TX throughput, but only 
~ 2 x 5.2 Gbits/s with Mellanox.
We still have to investigate this T-Rex issue, which is not addressed 
hereafter (some difference in initial module loading?).

[alec]
Reboot of the server that runs nfvbench?
Are you saying that once you launch and restart TRex, all future NFVbench runs 
(with the second instance of TRex) work better? Please explain this in detail 
(e.g. describe the exact steps in sequence that you performed):
1) Reboot server
2) Start nfvbench for the first time (will launch TRex for the first time) = 
poor performance (quantify)
3) Restart TRex
4) Run same benchmark = good performance (quantify)



We have slightly patched the NFVbench code in order to make some processing 
optional and/or configurable from the command line (some of it for debugging 
purposes).

--cores CMD_CORES       Override the T-Rex 'cores' parameter
--cache-size CACHE_SIZE Specify the FE cache size (default: 0, flow-count if < 0)
--service-mode          Enable T-Rex service mode
--extra-stats           Enable extra flow stats (on high load traffic)
--no-latency-stats      Disable flow stats for latency traffic
--no-latency-streams    Disable latency measurements (no streams)
--ipdb-mask IPDB_MASK   Allow specific breakpoints for the ipdb debugger
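For illustration, the options above could be declared with a standard argparse
setup; the option names are the ones from our patch, but the parser structure
shown here is only a hypothetical sketch, not the actual NFVbench code:

```python
import argparse

# Hypothetical sketch of how the patched options could be declared;
# not the actual NFVbench argument parser.
parser = argparse.ArgumentParser(prog="nfvbench")
parser.add_argument("--cores", type=int, dest="cmd_cores",
                    help="Override the T-Rex 'cores' parameter")
parser.add_argument("--cache-size", type=int, default=0,
                    help="FE cache size (default: 0, flow-count if < 0)")
parser.add_argument("--service-mode", action="store_true",
                    help="Enable T-Rex service mode")
parser.add_argument("--extra-stats", action="store_true",
                    help="Enable extra flow stats (on high load traffic)")
parser.add_argument("--no-latency-stats", action="store_true",
                    help="Disable flow stats for latency traffic")
parser.add_argument("--no-latency-streams", action="store_true",
                    help="Disable latency measurements (no streams)")

opts = parser.parse_args(["--cache-size", "10000", "--no-latency-stats"])
print(opts.cache_size, opts.no_latency_stats)  # → 10000 True
```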

TESTS
*****

We consider the smallest packet size (64-byte L2 frames) in order to assess 
the maximum achievable throughput.

Four tests are performed for each of the NICs tested - at a 100% rate and at 
the NDR rate.

1) Preliminary tests, performed with a basic scenario: 'pik.py' launched from a 
T-Rex console (derived from the script 'bench.py' shipped with the T-Rex 
application).

2) Tests performed using NFVbench (v3.5.0) with its native code:
     High-rate generic streams (for BW measurement) and low-rate streams (for 
latency assessment) are configured into the T-Rex generator in order to allow 
stats computation on transmitted/received streams.

Actually, we had left a hard-coded cache_size of 10000 in the calls to 
STLScVmRaw() in 'trex_gen.py' in all the cases, even this one.



=>  This caching mode allows far better performance.

(In our code release, we now control this parameter via a command-line 
option.)
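To make the role of cache_size concrete: with a field-engine (FE) cache, the
generator precomputes a pool of packet variants once and then replays them,
instead of rewriting header fields for every single packet sent. The toy model
below (plain Python, not TRex code; names and numbers are illustrative only)
sketches that trade-off:

```python
import itertools

def build_packets(flow_count, cache_size):
    """Toy model of a field-engine cache: precompute at most
    cache_size packet variants, then cycle through them forever."""
    n = flow_count if cache_size <= 0 else min(cache_size, flow_count)
    # Precompute n variants once (here, just distinct source 'IPs').
    pool = ["10.0.%d.%d" % (i >> 8, i & 0xFF) for i in range(n)]
    return itertools.cycle(pool)

# With cache_size=4, only 4 distinct packets are ever built,
# however many packets end up being sent on the wire.
gen = build_packets(flow_count=10000, cache_size=4)
sent = [next(gen) for _ in range(12)]
print(len(set(sent)))  # → 4
```

This is why a large cache helps at high rates: the per-packet cost drops from
"run the field engine" to "fetch the next precomputed buffer".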

[alec]
Can you provide some numbers with/without cache?
I have not tested the cache size option for STLScVmRaw but this is definitely 
something worth trying on Intel NIC X710 family.


3) Tests performed using NFVbench where we have disabled, in the 'trex_gen.py' 
script, the instructions that tag the generated traffic for the purpose of 
further statistics (as far as we understand it) - this change is made in the 
calls to STLStream().

[alec]
If you provide me the diff, I can tell you what it does.


4) Tests performed using NFVbench where we keep the flow-stats property for 
the latency streams only.
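For what it is worth, the policy of case (4) can be written down as a tiny
decision helper. This is a pure-Python sketch of the stream-tagging logic, not
the actual trex_gen.py patch; in TRex terms, "latency" would translate into
passing an STLFlowLatencyStats(pg_id=...) object as the flow_stats argument of
STLStream(), "stats" into STLFlowStats(pg_id=...), and None into no tagging at
all:

```python
def flow_stats_mode(is_latency_stream, no_latency_stats=False,
                    extra_stats=False):
    """Hypothetical helper mirroring case (4): keep flow stats on the
    latency streams only, unless disabled from the command line.
    Returns which kind of flow-stats object to attach to the stream."""
    if is_latency_stream:
        # Latency streams keep their stats (needed for latency figures),
        # unless --no-latency-stats was given.
        return None if no_latency_stats else "latency"
    # High-rate BW streams carry no flow-stats tag in case (4);
    # --extra-stats restores per-flow accounting on them (as in case 2).
    return "stats" if extra_stats else None

print(flow_stats_mode(is_latency_stream=True))   # → latency
print(flow_stats_mode(is_latency_stream=False))  # → None
```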

FIRST ANALYSIS
**************

The T-Rex test allows us to check that we have no bottleneck on the 
generator/analyzer + SUT side.

The NFVbench results show acceptable performance only when dealing with the 10 
Gbits/s line.

Using NFVbench with its unmodified behaviour (case 2), the line rate is far 
from being reached with a 50 Gbits/s line:


=>  8.56 Gbits/s instead of 50 Gbits/s (L1)

[alec]
Is that with/without cache, with/without restart?
As mentioned above, we have not seen such issue with Intel NIC (25G/40G).


Unless there are some special reasons for activating heavy flow-stats RX 
processing, we suggest working as in case (4).

We keep the latency assessment, while the traffic counters seem to suffice 
for measuring the BW performance.

Of course, it may depend on the NIC's capabilities for offloading traffic 
measurements.
This is why we also make the flow-stats activation optional.

[alec]
We actually had a mode to disable latency completely at one point but decided 
to always leave it on, as we did not see any negative side effect. We can 
certainly reinstate the option to disable latency for runs that do not care 
about latency and prioritize highest throughput.


FURTHER ANALYSIS
****************

However, looking closer at the performance obtained in case (3),
we can see a significantly reduced rate on the TX (and therefore RX) side:

19.09 Gbits/s instead of 20 Gbits/s (L1) - value is stable between launches
48.42 Gbits/s instead of 50 Gbits/s (L1) - actually variable between launches

Note that the NDR measurement does not show any warning in that case.

This is not our target case but it got me thinking...

Thus, I tried a case (5) where we keep the flow stats for the BW streams only:

  - the TX packet rate is reduced, as in case (3), for the Intel X520
  - the throughput is limited by the line rate for the Mellanox ConnectX-5

<=> this is unexpected with regard to our hypotheses.

[alec]
I would not use the X520 to make any judgment because this NIC has proven hard 
to work with (it is difficult to get consistently good numbers for all use 
cases).
But this observation seems to indicate that flow stats for latency streams are 
costly.
Flow stats are important in nfvbench because they allow us to measure exact 
packet accounting per chain. In case of drops we know exactly which chain(s) 
are dropping and in what direction.
So maybe something to work with TRex team to optimize.



CONCLUSION
**********

It looks like we are missing something in our understanding.
We are not sure that our workaround does not hide some side effects.
So far, though, we can use it for our present needs.



=>  At least we succeeded in proving that 2x10 and 2x25 Gbits/s line-rate 
performance can be achieved using NFVbench.

Looking forward to hearing from you.

[alec]
This is great investigative work for a NIC that I have not used.
What I would suggest is to upstream the cache-size option. Then I’ll be able 
to test it on the Intel NIC family.
It is worth considering upstreaming the no-latency option as well.

Thanks

  Alec


-=-=-=-=-=-=-=-=-=-=-=-
View/Reply Online (#23415): 
https://lists.opnfv.org/g/opnfv-tech-discuss/message/23415
-=-=-=-=-=-=-=-=-=-=-=-