On 8/31/2024 2:38 PM, Mihai Brodschi wrote:
> Hi Ferruh,
>
> Apologies for the late response.
> I've run some performance tests for the two proposed solutions.
> In the tables below, the rte_memcpy results correspond to this patch.
> The 2xpktmbuf_alloc results correspond to the other proposed solution.
>
> bash commands:
> server# ./dpdk-testpmd --vdev=net_memif0,id=1,role=server,bsize=1024,rsize=8 --single -l<SERVER_CORES> --file=test1 -- --nb-cores <NB_CORES> --txq <NB_CORES> --rxq <NB_CORES> --burst <BURST> -i
> client# ./dpdk-testpmd --vdev=net_memif0,id=1,role=client,bsize=1024,rsize=8,zero-copy=yes --single -l<CLIENT_CORES> --file=test2 -- --nb-cores <NB_CORES> --txq <NB_CORES> --rxq <NB_CORES> --burst <BURST> -i
>
> testpmd commands:
> client:
> testpmd> start
> server:
> testpmd> start tx_first
>
>
> CPU: AMD EPYC 7713P
> RAM: DDR4-3200
> OS: Debian 12
> DPDK: 22.11.1
> SERVER_CORES=72,8,9,10,11
> CLIENT_CORES=76,12,13,14,15
>
> Results:
> ==================================================================
> |                          | 1 CORE     | 2 CORES    | 4 CORES   |
> ==================================================================
> | unpatched burst=32       | 9.95 Gbps  | 19.24 Gbps | 36.4 Gbps |
> ------------------------------------------------------------------
> | 2xpktmbuf_alloc burst=32 | 9.86 Gbps  | 18.88 Gbps | 36.6 Gbps |
> ------------------------------------------------------------------
> | 2xpktmbuf_alloc burst=31 | 9.17 Gbps  | 18.69 Gbps | 35.1 Gbps |
> ------------------------------------------------------------------
> | rte_memcpy burst=32      | 9.54 Gbps  | 19.10 Gbps | 36.6 Gbps |
> ------------------------------------------------------------------
> | rte_memcpy burst=31      | 9.39 Gbps  | 18.53 Gbps | 35.5 Gbps |
> ==================================================================
>
>
> CPU: Intel Core i7-14700HX
> RAM: DDR5-5600
> OS: Ubuntu 24.04.1
> DPDK: 23.11.1
> SERVER_CORES=0,1,3,5,7
> CLIENT_CORES=8,9,11,13,15
>
> Results:
> ==================================================================
> |                          | 1 CORE     | 2 CORES    | 4 CORES   |
> ==================================================================
> | unpatched burst=32       | 15.52 Gbps | 27.35 Gbps | 46.8 Gbps |
> ------------------------------------------------------------------
> | 2xpktmbuf_alloc burst=32 | 15.49 Gbps | 27.68 Gbps | 46.4 Gbps |
> ------------------------------------------------------------------
> | 2xpktmbuf_alloc burst=31 | 14.98 Gbps | 26.75 Gbps | 45.2 Gbps |
> ------------------------------------------------------------------
> | rte_memcpy burst=32      | 15.99 Gbps | 28.44 Gbps | 49.3 Gbps |
> ------------------------------------------------------------------
> | rte_memcpy burst=31      | 14.85 Gbps | 27.32 Gbps | 46.3 Gbps |
> ==================================================================
>
Hi Mihai,

Thank you for the extensive testing.

The problematic case is "burst=31": between the '2xpktmbuf_alloc' and
'rte_memcpy' methods there is only a small difference, and neither is
consistently better than the other.
In this case I will proceed with the current patch.

>
> On 19/07/2024 12:03, Ferruh Yigit wrote:
>> On 7/8/2024 12:45 PM, Ferruh Yigit wrote:
>>> On 7/8/2024 4:39 AM, Mihai Brodschi wrote:
>>>>
>>>>
>>>> On 07/07/2024 21:46, Mihai Brodschi wrote:
>>>>>
>>>>>
>>>>> On 07/07/2024 18:18, Mihai Brodschi wrote:
>>>>>>
>>>>>>
>>>>>> On 07/07/2024 17:05, Ferruh Yigit wrote:
>>>>>>>
>>>>>>> My expectation is that the numbers should be like the following:
>>>>>>>
>>>>>>> Initially:
>>>>>>> size = 256
>>>>>>> head = 0
>>>>>>> tail = 0
>>>>>>>
>>>>>>> In the first refill:
>>>>>>> n_slots = 256
>>>>>>> head = 256
>>>>>>> tail = 0
>>>>>>>
>>>>>>> Subsequent run where 32 slots are used:
>>>>>>> head = 256
>>>>>>> tail = 32
>>>>>>> n_slots = 32
>>>>>>> rte_pktmbuf_alloc_bulk(mq, buf[head & mask], n_slots);
>>>>>>> head & mask = 0
>>>>>>> // So it fills the first 32 elements of the buffer, which is in bounds
>>>>>>>
>>>>>>> This will continue as above; the combination of filling only the gap
>>>>>>> and masking the head with 'mask' provides the wrapping required.
>>>>>>
>>>>>> If I understand correctly, this works only if eth_memif_rx_zc always
>>>>>> processes a number of packets which is a power of 2, so that the
>>>>>> ring's head always wraps around at the end of a refill loop, never in
>>>>>> the middle of it.
>>>>>> Is there any reason this should be the case?
>>>>>> Maybe the tests don't trigger the crash because this condition holds
>>>>>> true for them?
>>>>>
>>>>> Here's how to reproduce the crash on DPDK stable 23.11.1, using testpmd:
>>>>>
>>>>> Server:
>>>>> # ./dpdk-testpmd --vdev=net_memif0,id=1,role=server,bsize=1024,rsize=8 --single-file-segments -l2,3 --file-prefix test1 -- -i
>>>>>
>>>>> Client:
>>>>> # ./dpdk-testpmd --vdev=net_memif0,id=1,role=client,bsize=1024,rsize=8,zero-copy=yes --single-file-segments -l4,5 --file-prefix test2 -- -i
>>>>> testpmd> start
>>>>>
>>>>> Server:
>>>>> testpmd> start tx_first
>>>>> testpmd> set burst 15
>>>>>
>>>>> At this point, the client crashes with a segmentation fault.
>>>>> Before the burst is set to 15, its default value is 32.
>>>>> If the receiver processes packets in bursts of size 2^N, the crash
>>>>> does not occur.
>>>>> Setting the burst size to any power of 2 works; anything else crashes.
>>>>> After applying this patch, the crashes are completely gone.
>>>>
>>>> Sorry, this might not crash with a segmentation fault. To confirm the
>>>> mempool is corrupted, please compile DPDK with debug=true and the
>>>> c_args -DRTE_LIBRTE_MEMPOOL_DEBUG.
>>>> You should see the client panic when changing the burst size to not be
>>>> a power of 2.
>>>> This also works on the latest main branch.
>>>>
>>>
>>> Hi Mihai,
>>>
>>> Right, if the buffer size is not a multiple of the burst size, the
>>> issue is valid.
>>> And as there is a requirement to have the buffer size be a power of
>>> two, the burst should have the same.
>>> I assume this issue was not caught before because the default burst
>>> size is 32.
>>>
>>> Can you please share the performance impact of the change, with the two
>>> possible solutions we discussed above?
>>>
>>> The other option is to add this as a limitation of memif zero copy, but
>>> this won't be good for usability.
>>>
>>> We can decide based on the performance numbers.
>>>
>>>
>>
>> Hi Jakub,
>>
>> Do you have any comment on this?
>>
>> I think we should either document this as a limitation of the driver,
>> or fix it, and if so we need to decide on the fix.
>>
>
>
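
For readers following the thread, here is a minimal sketch of the refill
logic under discussion. It is illustrative only (the names RING_SIZE,
refill_single_bulk and 'buffers' are invented for this sketch, not the
actual eth_memif_rx_zc code) and it assumes a power-of-two ring size, as
memif requires. It shows why a single rte_pktmbuf_alloc_bulk() call
starting at 'head & mask' can write past the end of the buffer array when
the burst size is not a power of two:

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define RING_SIZE 256                        /* power of two, as memif requires */
#define RING_MASK (RING_SIZE - 1)

static struct rte_mbuf *buffers[RING_SIZE];  /* mbuf pointers backing the ring */

/* Refill as described in the quoted walkthrough: one bulk allocation
 * starting at the masked head index.  'head' and 'tail' are free-running
 * counters that only ever increase. */
static int
refill_single_bulk(struct rte_mempool *mp, uint16_t head, uint16_t tail)
{
	uint16_t n_slots = RING_SIZE - (uint16_t)(head - tail); /* free slots */
	uint16_t idx = head & RING_MASK;

	/*
	 * If idx + n_slots > RING_SIZE, this call writes mbuf pointers past
	 * the end of 'buffers'.  With burst = 32 the head only ever stops at
	 * multiples of 32, so every refill ends exactly at the array
	 * boundary and the overrun never triggers; with burst = 31 or 15 the
	 * head eventually stops mid-array and the refill overruns.
	 */
	return rte_pktmbuf_alloc_bulk(mp, &buffers[idx], n_slots);
}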

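Along the same lines, a rough sketch of the two refill strategies compared
in the benchmarks above. Again, this is only an illustration of the idea,
not the submitted patch; the helper names are invented and error handling
is simplified:

#include <stdint.h>
#include <rte_common.h>
#include <rte_memcpy.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Option A ("2xpktmbuf_alloc"): split the refill at the wrap point and do
 * up to two bulk allocations directly into the ring's buffer array. */
static int
refill_split(struct rte_mempool *mp, struct rte_mbuf **buffers,
	     uint16_t size, uint16_t head, uint16_t n_slots)
{
	uint16_t idx = head & (size - 1);
	uint16_t first = RTE_MIN(n_slots, (uint16_t)(size - idx));
	int ret;

	ret = rte_pktmbuf_alloc_bulk(mp, &buffers[idx], first);
	if (ret < 0 || first == n_slots)
		return ret;
	/* The second segment restarts at index 0 after the wrap.  A failure
	 * here would have to release the first segment; omitted for brevity. */
	return rte_pktmbuf_alloc_bulk(mp, &buffers[0], n_slots - first);
}

/* Option B ("rte_memcpy"): allocate into a temporary linear array, then
 * copy the pointers into the ring in at most two segments. */
static int
refill_copy(struct rte_mempool *mp, struct rte_mbuf **buffers,
	    uint16_t size, uint16_t head, uint16_t n_slots)
{
	struct rte_mbuf *tmp[size];          /* VLA just to keep the sketch short */
	uint16_t idx = head & (size - 1);
	uint16_t first = RTE_MIN(n_slots, (uint16_t)(size - idx));

	if (rte_pktmbuf_alloc_bulk(mp, tmp, n_slots) < 0)
		return -1;
	rte_memcpy(&buffers[idx], tmp, first * sizeof(struct rte_mbuf *));
	if (first < n_slots)
		rte_memcpy(&buffers[0], &tmp[first],
			   (n_slots - first) * sizeof(struct rte_mbuf *));
	return 0;
}

The trade-off in these sketches: Option A writes into the ring array
directly but may need two mempool calls per refill, while Option B always
makes a single mempool call at the cost of an extra pointer copy.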
