Hi Alex,

Thanks for the suggestion! It turns out that the overhead of skb_copy
and netdev_alloc_skb was caused by the kernel debugging option for the
SLUB memory allocator (CONFIG_SLUB_DEBUG), which I had turned on.
That's why memory allocation took so much longer and dragged down my
RX throughput.
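
For reference, here is a simplified sketch of the two allocation hot
spots I was seeing. This is not my actual driver code; the functions
are trimmed down to just the calls in question. Both end up in the
slab allocator, which is why CONFIG_SLUB_DEBUG hurt so much:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* RX clean path (my modified ixgbevf_clean_rx_irq): copy the frame out
 * of the pre-allocated coherent pool so the stack gets its own skb and
 * the pool slot can be recycled right away. */
static struct sk_buff *rx_copy_out_of_pool(struct sk_buff *pool_skb)
{
	return skb_copy(pool_skb, GFP_ATOMIC);	/* slab allocation + memcpy */
}

/* RX refill path (ixgbevf_alloc_rx_buffers): allocate a fresh skb for
 * the ring; this is the netdev_alloc_skb_ip_align() call that showed
 * up in the profile. */
static struct sk_buff *rx_refill_skb(struct net_device *netdev,
				     unsigned int bufsz)
{
	return netdev_alloc_skb_ip_align(netdev, bufsz);	/* slab allocation */
}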

In our case, we are trying to deliver a software-based MR-SRIOV
system. We run the PF driver on one host (H1) and multiple VF drivers
on another host (H2). Between H1 and H2 there is a memory-sharing /
interrupt-forwarding device that lets the VFs on H2 communicate with
the PF on H1.

Right now my RX performance reaches about 9 Gbits/sec, but it is a
little unstable:
* About every 10 seconds the throughput drops to almost zero and then
resumes full speed again. Has anyone run into this issue before? Any
suggestions are appreciated! (I've also put a small instrumentation
sketch after the iperf output below.)

[  3] local 192.168.1.4 port 35451 connected with 192.168.1.21 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   428 MBytes  3.59 Gbits/sec
[  3]  1.0- 2.0 sec  0.00 Bytes  0.00 bits/sec
[  3]  2.0- 3.0 sec  1.00 GBytes  8.62 Gbits/sec
[  3]  3.0- 4.0 sec  1.07 GBytes  9.21 Gbits/sec
[  3]  4.0- 5.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3]  5.0- 6.0 sec  1.09 GBytes  9.35 Gbits/sec
[  3]  6.0- 7.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3]  7.0- 8.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3]  8.0- 9.0 sec  1.07 GBytes  9.16 Gbits/sec
[  3]  9.0-10.0 sec  0.00 Bytes  0.00 bits/sec          --> drop to 0 bps
[  3] 10.0-11.0 sec  1.01 GBytes  8.71 Gbits/sec
[  3] 11.0-12.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3] 12.0-13.0 sec  1.09 GBytes  9.33 Gbits/sec
[  3] 13.0-14.0 sec  1.09 GBytes  9.36 Gbits/sec
[  3] 14.0-15.0 sec  1.09 GBytes  9.39 Gbits/sec
[  3] 15.0-16.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3] 16.0-17.0 sec  1.09 GBytes  9.32 Gbits/sec
[  3] 17.0-18.0 sec  1.09 GBytes  9.40 Gbits/sec
[  3] 18.0-19.0 sec   295 MBytes  2.47 Gbits/sec
[  3] 19.0-20.0 sec  0.00 Bytes  0.00 bits/sec      --> drop to 0 bps
[  3] 20.0-21.0 sec  1.03 GBytes  8.80 Gbits/sec
[  3] 21.0-22.0 sec  1.09 GBytes  9.39 Gbits/sec
[  3] 22.0-23.0 sec  1.09 GBytes  9.36 Gbits/sec
[  3] 23.0-24.0 sec  1.09 GBytes  9.38 Gbits/sec
[  3] 24.0-25.0 sec  81.9 MBytes   687 Mbits/sec
[  3] 25.0-26.0 sec  0.00 Bytes  0.00 bits/sec      --> drop to 0 bps
[  3] 26.0-27.0 sec  1.02 GBytes  8.80 Gbits/sec
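
To narrow this down, I'm planning to add a crude debug counter to my
refill path to see whether the RX ring runs out of buffers around the
time the throughput drops to zero. This is only a rough sketch; the
counter and helper names below are made up:

#include <linux/atomic.h>
#include <linux/printk.h>

/* Hypothetical debug counter: bumped whenever the refill path cannot
 * hand a buffer back to the ring (bitmap pool exhausted or
 * netdev_alloc_skb_ip_align() failure). */
static atomic_t rx_refill_failures = ATOMIC_INIT(0);

static void note_rx_refill_failure(void)
{
	atomic_inc(&rx_refill_failures);
	if (printk_ratelimit())
		pr_info("ixgbevf: rx refill failed %d times so far\n",
			atomic_read(&rx_refill_failures));
}

If the counter climbs exactly when iperf reports 0 bits/sec, the stall
is probably on my buffer-recycling side rather than in the stack.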

Thanks a lot!
William

On Wed, May 23, 2012 at 12:08 AM, Alexander Duyck
<[email protected]> wrote:
> On 05/22/2012 05:43 AM, William Tu wrote:
>> Hey guys,
>>
>> I'm William Tu from Stony Brook University. I'm currently working on
>> an ixgbevf driver. Due to some special requirements, I need to
>> pre-allocate a pool of contiguous RX and TX buffers (4 MB total in my
>> case). I chopped the pool into multiple pages and assigned them
>> one-by-one to the RX and TX ring buffers. I also implemented a bitmap
>> to manage allocation and freeing within this DMA pool.
>>
>> When a packet arrives, the ixgbevf device DMAs it into the RX
>> buffer. Then my modified version of the ixgbevf driver needs to do an
>> "skb_copy" to copy the whole packet out of the pre-allocated pool, so
>> that the Linux kernel can later free the copied skb while the buffer
>> in the pre-allocated pool is freed separately. The same idea applies
>> to transmission.
>>
>> Everything worked fine until I noticed poor receive performance: I
>> got 9.4 Gbps TX but only 1 Gbps RX. I looked into the problem and
>> found my driver spent quite a long time in
>> 1. skb_copy in ixgbevf_clean_rx_irq and
>> 2. netdev_alloc_skb_ip_align (in ixgbevf_alloc_rx_buffers).
>>
>> Compared with the original ixgbevf code, I found that most drivers
>> use dma_map_single/dma_unmap_single, i.e. streaming DMA mappings.
>> However, I'm using a coherent DMA mapping (dma_alloc_coherent) to
>> allocate one big DMA buffer and assigning each piece to the RX ring.
>> I'm wondering about the performance impact of using
>> dma_alloc_coherent; is it possible that my poor performance is
>> caused by this?
>>
>>
>> Thanks a lot!
>> William
>>
> Hi William,
>
> It sounds like you are taking on quite a bit of overhead with the
> skb_copy and netdev allocation calls.  You may want to consider finding
> a means of reducing that overhead.
>
> What you are describing for Rx doesn't sound too different from the
> current ixgbe receive path.  For the ixgbe receive path we are using
> pages that we mapped as a streaming DMA, however instead of un-mapping
> them after the receive is complete we are simply calling
> dma_sync_single_range_for_cpu on the half we received the packet in and
> calling dma_sync_single_range_for_device on the half we are going to
> give back to the device.  This essentially allows us to mimic a coherent
> style mapping and to hold on to the page for an extended period of
> time.  To avoid most of the overhead for having a locked down buffer we
> are using the page to store the data section of the frames, and only
> storing the packet header in the skb->data portion.  This allows us to
> reuse buffers with minimal overhead for doing so versus the copying
> approach you described.  The code for ixgbe to do this is in either the
> 3.4 kernel, or our latest ixgbe driver available on e1000.sf.net.
>
> Thanks,
>
> Alex
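
(Adding a note below Alex's reply for my own reference: my reading of
the half-page reuse scheme he describes above, as a rough sketch.
None of the function or parameter names below are from the real ixgbe
code.)

#include <linux/dma-mapping.h>
#include <linux/skbuff.h>

/* The page stays mapped as streaming DMA for its whole lifetime.  After
 * a receive completes in one half of the page, that half is synced for
 * the CPU and attached to the skb as a page fragment (only the header
 * goes into skb->data in the real driver); the other half is synced
 * back for the device and reused. */
static void rx_reuse_half_page(struct device *dev, dma_addr_t dma,
			       struct page *page, unsigned int rx_half_off,
			       unsigned int reuse_half_off,
			       unsigned int half_size,
			       struct sk_buff *skb, unsigned int pkt_len)
{
	/* make the half that just received a packet visible to the CPU */
	dma_sync_single_range_for_cpu(dev, dma, rx_half_off, half_size,
				      DMA_FROM_DEVICE);

	/* hand the payload to the stack as a page fragment, no copy */
	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
			rx_half_off, pkt_len, half_size);

	/* give the other half back to the hardware for the next receive */
	dma_sync_single_range_for_device(dev, dma, reuse_half_off,
					 half_size, DMA_FROM_DEVICE);
}

If I understand it correctly, this avoids both the per-packet skb_copy
and the per-packet streaming map/unmap, which is exactly the overhead
I was trying to work around with the coherent pool.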
