Hi Alex,

Thanks for the suggestion! It turns out that the overhead in skb_copy and netdev_alloc_skb came from the kernel debugging option for the SLUB memory allocator (CONFIG_SLUB_DEBUG) that I had turned on. That is why memory allocation was taking so much longer and dragging my RX throughput down.
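For reference, these are the options involved (a rough summary of the relevant Kconfig knobs, going from the SLUB documentation rather than a dump of my exact .config):

    CONFIG_SLUB=y
    CONFIG_SLUB_DEBUG=y       # only compiles the debug support in
    CONFIG_SLUB_DEBUG_ON=y    # actually enables the checks at boot
    #
    # The checks can also be toggled on the kernel command line:
    #   slub_debug=     enable full debugging
    #   slub_debug=-    disable it even when CONFIG_SLUB_DEBUG_ON=y

With the checks active, each allocation pays for things like poisoning and red-zone verification, which would explain the extra time I was seeing in netdev_alloc_skb.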
In our case, we are trying to deliver a software-based MR-SRIOV system. We run the PF driver on one host (H1) and multiple VF drivers on another host (H2). Between H1 and H2 there is a memory-sharing/interrupt-forwarding device that lets the VFs on H2 communicate with the PF on H1.

Right now my RX performance reaches about 9 Gbit/s but is a little unstable:
* About every 10 seconds the throughput drops to almost zero and then resumes full speed.
Has anyone run into this issue before? Any suggestions are appreciated!

[  3] local 192.168.1.4 port 35451 connected with 192.168.1.21 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   428 MBytes  3.59 Gbits/sec
[  3]  1.0- 2.0 sec  0.00 Bytes   0.00 bits/sec
[  3]  2.0- 3.0 sec  1.00 GBytes  8.62 Gbits/sec
[  3]  3.0- 4.0 sec  1.07 GBytes  9.21 Gbits/sec
[  3]  4.0- 5.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3]  5.0- 6.0 sec  1.09 GBytes  9.35 Gbits/sec
[  3]  6.0- 7.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3]  7.0- 8.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3]  8.0- 9.0 sec  1.07 GBytes  9.16 Gbits/sec
[  3]  9.0-10.0 sec  0.00 Bytes   0.00 bits/sec   --> drop to 0 bps
[  3] 10.0-11.0 sec  1.01 GBytes  8.71 Gbits/sec
[  3] 11.0-12.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3] 12.0-13.0 sec  1.09 GBytes  9.33 Gbits/sec
[  3] 13.0-14.0 sec  1.09 GBytes  9.36 Gbits/sec
[  3] 14.0-15.0 sec  1.09 GBytes  9.39 Gbits/sec
[  3] 15.0-16.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3] 16.0-17.0 sec  1.09 GBytes  9.32 Gbits/sec
[  3] 17.0-18.0 sec  1.09 GBytes  9.40 Gbits/sec
[  3] 18.0-19.0 sec   295 MBytes  2.47 Gbits/sec
[  3] 19.0-20.0 sec  0.00 Bytes   0.00 bits/sec   --> drop to 0 bps
[  3] 20.0-21.0 sec  1.03 GBytes  8.80 Gbits/sec
[  3] 21.0-22.0 sec  1.09 GBytes  9.39 Gbits/sec
[  3] 22.0-23.0 sec  1.09 GBytes  9.36 Gbits/sec
[  3] 23.0-24.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3] 24.0-25.0 sec  81.9 MBytes   687 Mbits/sec
[  3] 25.0-26.0 sec  0.00 Bytes   0.00 bits/sec   --> drop to 0 bps
[  3] 26.0-27.0 sec  1.02 GBytes  8.80 Gbits/sec

Thanks a lot!
William

On Wed, May 23, 2012 at 12:08 AM, Alexander Duyck <[email protected]> wrote:
> On 05/22/2012 05:43 AM, William Tu wrote:
>> Hey guys,
>>
>> I'm William Tu from Stony Brook University. I'm currently working on
>> an ixgbevf driver. Due to some special requirements, I need to
>> pre-allocate a pool of contiguous RX and TX buffers (4 MB total in my
>> case). I chopped the pool into multiple pages and assigned them
>> one by one to the RX and TX rings. I also implemented a bitmap to
>> manage allocation and freeing within this DMA pool.
>>
>> When a packet comes in, the ixgbevf device DMAs it into the RX
>> buffer. My modified ixgbevf driver then has to do an skb_copy to copy
>> the whole packet out of the pre-allocated pool, so that the Linux
>> kernel can later free the copied skb while the buffer in the
>> pre-allocated pool is returned to the pool. The same idea applies to
>> transmission.
>>
>> Everything worked fine until I noticed poor receive performance: I
>> get 9.4 Gbit/s TX but only 1 Gbit/s RX. I looked into the problem and
>> found that my driver spends quite a long time in
>> 1. skb_copy in ixgbevf_clean_rx_irq, and
>> 2. netdev_alloc_skb_ip_align (in ixgbevf_alloc_rx_buffers).
>>
>> Compared with the original ixgbevf code, most drivers use
>> dma_map_single/dma_unmap_single, i.e. streaming DMA mappings. I,
>> however, use a coherent DMA mapping (dma_alloc_coherent) to allocate
>> one big DMA buffer and assign each piece to the RX ring. I'm wondering
>> about the performance impact of dma_alloc_coherent -- is it possible
>> that my poor performance is caused by this?
>>
>> Thanks a lot!
>> William
>>
> Hi William,
>
> It sounds like you are taking on quite a bit of overhead with the
> skb_copy and netdev allocation calls. You may want to consider finding
> a means of reducing that overhead.
>
> What you are describing for Rx doesn't sound too different from the
> current ixgbe receive path. For the ixgbe receive path we are using
> pages that we mapped as streaming DMA; however, instead of unmapping
> them after the receive is complete, we simply call
> dma_sync_single_range_for_cpu on the half we received the packet in and
> dma_sync_single_range_for_device on the half we are going to give back
> to the device. This essentially allows us to mimic a coherent-style
> mapping and to hold on to the page for an extended period of time. To
> avoid most of the overhead of having a locked-down buffer, we use the
> page to store the data section of the frames, and only store the packet
> header in the skb->data portion. This allows us to reuse buffers with
> minimal overhead compared to the copying approach you described. The
> code for ixgbe to do this is in either the 3.4 kernel, or our latest
> ixgbe driver available on e1000.sf.net.
>
> Thanks,
>
> Alex
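Just to check that I'm reading your description of the ixgbe receive path correctly, here is a rough sketch of what I think you mean. This is illustrative only: the my_* names are mine, it is not the actual ixgbe code, and the page-reuse bookkeeping is stripped down.

    #include <linux/kernel.h>
    #include <linux/string.h>
    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/skbuff.h>
    #include <linux/netdevice.h>
    #include <linux/dma-mapping.h>

    /* One receive buffer: a page mapped once and reused half by half. */
    struct my_rx_buffer {
        struct page *page;          /* mapped once with dma_map_page()         */
        dma_addr_t dma;             /* streaming mapping of the whole page     */
        unsigned int page_offset;   /* which half (0 or PAGE_SIZE/2) is in use */
    };

    /* Map the page once as a streaming mapping and keep it across receives. */
    static int my_map_rx_page(struct device *dev, struct my_rx_buffer *bi)
    {
        bi->page = alloc_page(GFP_ATOMIC);
        if (!bi->page)
            return -ENOMEM;

        bi->dma = dma_map_page(dev, bi->page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, bi->dma)) {
            __free_page(bi->page);
            return -ENOMEM;
        }
        bi->page_offset = 0;
        return 0;
    }

    /*
     * Completion path: sync only the half the NIC wrote instead of unmapping,
     * put the header in skb->data, leave the payload in the page, then flip
     * to the other half and hand it back to the device.
     */
    static struct sk_buff *my_build_rx_skb(struct device *dev,
                                           struct net_device *netdev,
                                           struct my_rx_buffer *bi,
                                           unsigned int pkt_len)
    {
        unsigned int hlen = min_t(unsigned int, 128, pkt_len); /* header size is arbitrary here */
        void *va = page_address(bi->page) + bi->page_offset;
        struct sk_buff *skb;

        dma_sync_single_range_for_cpu(dev, bi->dma, bi->page_offset,
                                      PAGE_SIZE / 2, DMA_FROM_DEVICE);

        /* Small skb holds only the packet header. */
        skb = netdev_alloc_skb_ip_align(netdev, hlen);
        if (!skb)
            return NULL;
        memcpy(skb_put(skb, hlen), va, hlen);

        if (pkt_len > hlen) {
            /*
             * Attach the payload as a page fragment (3.4-style signature with
             * a truesize argument).  The skb now owns one reference to the
             * page, so take another one for the driver before reusing it.
             */
            skb_add_rx_frag(skb, 0, bi->page, bi->page_offset + hlen,
                            pkt_len - hlen, PAGE_SIZE / 2);
            get_page(bi->page);
        }

        /*
         * Flip to the other half and give it back to the hardware; the RX
         * descriptor would be refilled with bi->dma + bi->page_offset.  The
         * real driver also checks page_count() before reusing a half that the
         * stack may still hold; that bookkeeping is omitted here.
         */
        bi->page_offset ^= PAGE_SIZE / 2;
        dma_sync_single_range_for_device(dev, bi->dma, bi->page_offset,
                                         PAGE_SIZE / 2, DMA_FROM_DEVICE);
        return skb;
    }

If I have that right, the per-packet cost becomes a small header copy plus two dma_sync calls, instead of a full skb_copy and a fresh netdev_alloc_skb for every frame.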
