Hi Pierrick,

Thanks for sending this detailed report. I think the anomalies you are seeing may be due to the choice of NIC. Not that the Mellanox NIC is bad, but it is one that not many people have tested NFVbench with yet. I am copying Michael as he may have used this same Mellanox NIC with NFVbench in a CNCF benchmarking project (VM and container benchmarking with k8s).
More inline...

From: <[email protected]> on behalf of "Pierrick Louin via Lists.Opnfv.Org" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 31, 2019 at 9:03 AM
To: "Alec Hothan (ahothan)" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: [opnfv-tech-discuss] #nfvbench - Looking for line rate performances using NFVbench - UPDATED

Hi Alec,

I'm Pierrick Louin, working in the team of François-Régis Menguy @ Orange Labs. In our last experiments with the NFVbench application, we ran into some issues, and found a workaround, while trying to reach high performance from the bench at a high line rate. Maybe you can help us understand what happens in the following observations. The raw traces are attached as text files. (Sorry for resending, I hope my subscription works at last.)

--------------------------------------------------------------------------------------------

CONFIGURATION & SOFTWARE
************************

These reference tests are performed over a simple loopback link between the 2 ports of a NIC (either wired or through a hardware switch). We studied the two following cases:
- NIC Intel X520 (10 Gbits/s)
- NIC Mellanox ConnectX-5 (25 Gbits/s)

Note that the list of CPU threads reserved for the generator is not optimized; there is room for tuning. However, this was not the point of these tests since, as we will see, the performance issue appears related to the RX process.

[alec] The reference NICs that we use are Intel X710, XXV710 and XL710 (10G, 25G and 40G) and we have not seen any particular issues when rerunning benchmarks over and over again using the same TRex instance. Intel X520 is clearly not a preferred NIC because it misses lots of features only available in newer-generation NICs (offload). I may be able to get hold of a Mellanox 25G NIC soon, and this report is definitely a good step towards optimizing for that NIC.
The T-Rex generator will run with the same settings whether it is used standalone or wrapped in NFVbench.

TRex version: v2.59
NFVbench version: 3.5.0 (3.5.1.dev1)

Warning: I found that NFVbench performance is improved - in the Mellanox tests - provided that a T-Rex server has been launched and stopped once since the last reboot. Otherwise there is no way to obtain a 2 x 25 Gbits/s TX throughput, but only ~ 2 x 5.2 Gbits/s with Mellanox. We have to investigate this T-Rex issue, which is not addressed hereafter (some different initial module loading?).

[alec] Reboot of the server that runs nfvbench? Are you saying that once you launch and restart TRex, all future NFVbench runs (with the second instance of TRex) work better? Please explain this in detail (e.g. describe the exact steps in sequence that you performed):
1. Reboot server
2. Start nfvbench for the first time (will launch TRex for the first time) = poor performance (quantify)
3. Restart TRex
4. Run same benchmark = good performance (quantify)

We have slightly patched the NFVbench code in order to make some processing optional and/or configurable from the command line (some for debugging purposes):

--cores CMD_CORES        Override the T-Rex 'cores' parameter
--cache-size CACHE_SIZE  Specify the FE cache size (default: 0, flow-count if < 0)
--service-mode           Enable T-Rex service mode
--extra-stats            Enable extra flow stats (on high-load traffic)
--no-latency-stats       Disable flow stats for latency traffic
--no-latency-streams     Disable latency measurements (no streams)
--ipdb-mask IPDB_MASK    Allow specific breakpoints for the ipdb debugger

TESTS
*****

We consider the smallest packet size (64 bytes L2) in order to assess the maximum achievable throughput. Four tests are performed for each of the NICs tested - at 100% rate and at the NDR.

1) Preliminary tests, performed with a basic scenario: 'pik.py' launched from a T-Rex console (derived from the script 'bench.py' shipped with the T-Rex application).
2) Tests performed using NFVbench (v3.5.0) with its native coding: high-rate generic streams (for BW measurement) and low-rate streams (for latency assessment) are configured into the T-Rex generator in order to allow stats computing on transmitted/received streams. However, we had actually left a hard-coded cache_size of 10000 in the call to STLScVmRaw() in 'trex_gen.py', in all cases including this one.
=> This caching mode allows far better performance. (In our code release, we now control this parameter from a command-line parameter.)

[alec] Can you provide some numbers with/without cache? I have not tested the cache size option for STLScVmRaw but this is definitely something worth trying on the Intel X710 NIC family.

3) Tests performed using NFVbench where we have disabled, in the 'trex_gen.py' script, the instructions that tag the generated traffic for the purpose of further statistics (as far as we understand it) - this change is made in the calls to STLStream().

[alec] If you provide me the diff I can tell you what it does.

4) Tests performed using NFVbench where we keep the flow stats property for the latency streams only.

FIRST ANALYSIS
**************

The T-Rex test allows us to check that we have no bottleneck on the generator/analyzer + SUT side. The NFVbench results show acceptable performance only when dealing with the 10 Gbits/s line. Using NFVbench with its unmodified behaviour (case 2), the line rate is far from being reached with a 50 Gbps line:
=> 8.56 Gbits/s instead of 50 Gbits/s (L1)

[alec] Is that with/without cache, with/without restart? As mentioned above, we have not seen such an issue with Intel NICs (25G/40G).

Unless there are some special reasons for activating a heavy flow-stats RX processing, we suggest working in case (4): we keep the latency assessment while the traffic counters seem sufficient for measuring BW performance. Of course, it may depend on the NIC's capabilities for offloading traffic measurements.
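For reference, the gap between the L1 figures quoted in this thread and the corresponding L2 throughput is large at 64-byte packets. A quick sketch of the arithmetic, assuming standard Ethernet framing (20 bytes of preamble + SFD + inter-frame gap per frame on the wire):

```python
# L1 (wire) accounting adds 20 bytes to each L2 frame:
# 7 bytes preamble + 1 byte SFD + 12 bytes inter-frame gap.
L2_SIZE = 64       # smallest Ethernet frame (bytes, including FCS)
L1_OVERHEAD = 20   # preamble + SFD + IFG (bytes)

def max_pps(l1_bps, l2_size=L2_SIZE):
    """Maximum packets/s on a line running at l1_bps (L1 bits per second)."""
    return l1_bps / ((l2_size + L1_OVERHEAD) * 8)

def l2_bps(l1_bps, l2_size=L2_SIZE):
    """L2 throughput corresponding to a fully loaded L1 line."""
    return max_pps(l1_bps, l2_size) * l2_size * 8

print(f"10G port: {max_pps(10e9) / 1e6:.2f} Mpps")           # -> 14.88 Mpps
print(f"2 x 25G : {max_pps(50e9) / 1e6:.2f} Mpps")           # -> 74.40 Mpps
print(f"L2 rate at 2 x 25G L1: {l2_bps(50e9) / 1e9:.2f} Gbit/s")  # -> 38.10
```

So sustaining 50 Gbits/s L1 at the smallest packet size requires the generator and RX path to handle roughly 74.4 Mpps, which is why any per-packet flow-stats processing shows up so strongly in these tests.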
This is why we also make the flow-stats activation optional.

[alec] We actually had a mode to disable latency completely at one point but decided to always leave it on as we did not see any negative side effect. We can certainly reinstate the option to disable latency for runs that do not care about latency and prioritize highest throughput.

FURTHER ANALYSIS
****************

However, looking closer at the performance obtained in case (3), we can see a significantly reduced rate on the TX (and therefore RX) side:
- 19.09 Gbits/s instead of 20 Gbits/s (L1) - value is stable between launches
- 48.42 Gbits/s instead of 50 Gbits/s (L1) - actually variable between launches

Note that the NDR measurement does not show any warning then. This is not our target case, but it got me thinking... Thus, I tried a fifth case (5) where we keep the flow stats for the BW streams only:
- the TX packet rate is reduced as in case (3) for the Intel X520
- the throughput is limited by the line rate for the Mellanox ConnectX-5
<=> this is unexpected given our hypotheses.

[alec] I would not use the X520 to make any judgment because this NIC has been shown to be hard to work with (to get consistently good numbers for all use cases). But this observation seems to indicate that flow stats for latency streams are costly. Flow stats are important in nfvbench because they allow us to measure exact packet accounting per chain: in case of drops, we know exactly which chain(s) are dropping and in what direction. So maybe this is something to work on with the TRex team to optimize.

CONCLUSION
**********

It looks like we are missing something in our comprehension. We are not sure that our workaround does not hide some side effects; so far, we can use it for our present needs.
=> At least we succeeded in proving that 2x10 & 2x25 Gbits/s line-rate performance can be achieved using NFVbench.

Waiting to hear from you.

[alec] This is great investigative work for a NIC that I have not used.
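As a side note, the patched --cache-size semantics listed earlier (default 0, flow count if negative) could be resolved along these lines before the value is handed to STLScVmRaw(); this is only an illustrative sketch, and the helper name effective_cache_size is hypothetical, not part of NFVbench:

```python
import argparse

parser = argparse.ArgumentParser()
# Mirrors the patched option described in this thread (wiring is illustrative).
parser.add_argument('--cache-size', dest='cache_size', type=int, default=0,
                    help='FE cache size (default: 0, flow-count if < 0)')

def effective_cache_size(cache_size, flow_count):
    """Resolve the value to pass as STLScVmRaw(..., cache_size=...):
    a negative setting means 'use the flow count', 0 leaves caching off."""
    return flow_count if cache_size < 0 else cache_size

opts = parser.parse_args(['--cache-size', '-1'])
print(effective_cache_size(opts.cache_size, flow_count=10000))  # -> 10000
```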
What I would suggest is to upstream the cache-size option; then I'll be able to test it on the Intel NIC family. It is worth considering upstreaming the no-latency option as well.

Thanks

Alec
