Re: [networking-discuss] [driver-discuss] A question: can be avoid using ´bcopy´ in Tx of the NIC driver ?

Brian Xu - Sun Microsystems - Beijing China Tue, 03 Mar 2009 00:49:10 -0800

Garrett D'Amore wrote:

Brian Xu - Sun Microsystems - Beijing China wrote:
Garrett D'Amore wrote:
Brian Xu - Sun Microsystems - Beijing China wrote:
zeeshanul huq - Sun Microsystems - Beijing China wrote:
Hi Brian,
The overhead of it is not only dma binding, but also unbinding.
If no copybuf is used, the overhead of unbinding is quite small compared to binding.
For small packets, even the unbinding can start to be expensive. Lock contention becomes a concern.
Some other drawbacks are:
1) We have to hold the MBLKs until packet transmission completes. With bcopy we are able to free them immediately. So when the system is close to running out of MBLKs, bcopy works better.
I don't know when running out of MBLKs occurs. When the system is short of kernel memory? If that is the case, then the extra bcopy also consumes kernel memory.
Actually, IOMMU resources are a bigger issue.
With bcopy, the pre-allocated DMA buffer also occupies an IOMMU entry. Without bcopy, more IOMMU entries are needed and are allocated on the fly. So do you mean there may not be enough IOMMU entries? Please clarify.
Without bcopy, you might have to allocate more IOMMU entries. It's a bigger problem on the rx path when you do loanup and buffer recycling (using esballoc), but even on the tx side, if you have a packet that is spread across multiple pages (or chained mbufs even!), then you might need more IOMMU entries. And usually you still have the IOMMU entries for bcopy, because you *really* want to bcopy for small packets unless you want to have terrible small packet performance.
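
To make the "spread across multiple pages or chained" point concrete, here is a minimal user-space sketch that counts how many page-sized mappings a fragment chain could need in the worst case. It is only an illustration, not driver code: `struct frag`, `pages_spanned()` and `chain_mappings()` are hypothetical names, the page size is a parameter, and a real IOMMU or DMA framework may coalesce or split mappings differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one fragment of a packet chain. */
struct frag {
	uintptr_t addr;	/* virtual address of the fragment data */
	size_t    len;	/* number of bytes in the fragment */
};

/* Pages touched by [addr, addr + len); page_size must be a power of two. */
static size_t
pages_spanned(uintptr_t addr, size_t len, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	if (len == 0)
		return (0);
	uintptr_t first = addr / page_size;
	uintptr_t last = (addr + len - 1) / page_size;
	return ((size_t)(last - first + 1));
}

/* Worst case for a chain: every fragment is mapped on its own. */
static size_t
chain_mappings(const struct frag *frags, size_t nfrags, size_t page_size)
{
	size_t total = 0;
	for (size_t i = 0; i < nfrags; i++)
		total += pages_spanned(frags[i].addr, frags[i].len, page_size);
	return (total);
}
```

With a copy into a pre-mapped buffer, the same packet would reuse a mapping that already exists, which is the contrast being drawn above.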

OK. I see.

Thanks,
Brian

   -- Garrett
Thanks,
Brian
2) Some drivers, like bge, have only a small number of Tx buffer descriptors (BDs). bcopy ensures one BD per transmit packet, while dma_bind may require more than one. So with dma_bind, the driver will run out of Tx BDs more quickly during heavy traffic.
Yes. This is reasonable.
That's part of the reason why we use both bcopy and dma_bind in our NIC driver. I agree we need a faster DMA binding and unbinding solution.
What I suggested is one way to get much faster dma binding.
Of course, the original binding is also kept to meet the bcopy requirement.
Sure. I think a lot more analysis is required here before we make any significant changes, though.
   -- Garrett
Thanks,
Brian
Regards,
Zeeshanul Huq

Garrett D'Amore wrote:
Brian Xu - Sun Microsystems - Beijing China wrote:
Hi there,

I have a question here:
Why do all of the NIC drivers have to bcopy the MBLKs for transmit? (Some of them always bcopy, and others bcopy below a packet-length threshold.)
I think one of the reasons is that the overhead of setting up DMA on the fly is greater than the overhead of bcopy for short packets. I want to know if this is the case and if there are any other reasons.
Yes. For any reasonably sized packet (ETHERMTU or smaller), bcopy is faster on *all* recent hardware. (This is confirmed even on an older 300MHz Via C3.) (Hmm... I've heard that for some Niagara systems this might not be true, however. But I've not tested it myself.)
I think the situation is different with jumbo frames, though.
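
As a rough illustration of the "bcopy under a threshold" pattern discussed here, the following user-space sketch copies small packets into a pre-allocated bounce buffer and reports that larger ones should take the bind path. Everything in it is an assumption for illustration: `TX_COPY_THRESHOLD`, `struct tx_slot` and `tx_prepare()` are hypothetical names, `memcpy()` stands in for the kernel's bcopy, and the threshold value is not taken from any particular driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	TX_COPY_THRESHOLD	1500	/* assumed: copy anything up to ETHERMTU */

enum tx_path { TX_PATH_COPY, TX_PATH_BIND };

/* Hypothetical per-descriptor state: a bounce buffer mapped once at attach. */
struct tx_slot {
	unsigned char bounce[TX_COPY_THRESHOLD];
	size_t        len;	/* bytes currently staged in bounce[] */
};

/*
 * Decide how to send one packet. Small packets are copied, so the caller
 * can free the original buffer right away. Larger packets are left for
 * the (not shown) on-the-fly DMA bind path.
 */
static enum tx_path
tx_prepare(struct tx_slot *slot, const void *pkt, size_t len)
{
	assert(slot != NULL && pkt != NULL);
	if (len <= sizeof (slot->bounce)) {
		memcpy(slot->bounce, pkt, len);
		slot->len = len;
		return (TX_PATH_COPY);
	}
	slot->len = 0;
	return (TX_PATH_BIND);
}
```

In this sketch the copy path always consumes exactly one slot, while the bind path would need one descriptor per DMA cookie.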
If what I guess is the major cause, I have a proposal, and I want to hear your advice on whether it makes sense.
The most time-consuming part of DMA setup is the DMA bind, more specifically calling into the VM layer to get the PFN for the vaddr (hat_getpfnum()), since it needs to search the huge page table. MBLKs, however, are essentially slab objects: the PFN has already been determined in the slab layer, and for most of their usage we only touch the magazine layer, where the PFN is predetermined. That is, the PFN should be considered constructed state, but we don't leverage it for DMA bind.
In storage, we have a field 'b_shadow' in buf(9S) to store the recently used pages, through which the PFNs can be obtained easily. So in the case that b_shadow works, ddi_dma_buf_bind_handle() is much faster than ddi_dma_mem_bind_handle(). Another example: by moving the DMA bind of the HBA driver (mpt) from the Tx path to the kmem cache constructor, the mpt driver got a 26% throughput increase. See CR6707308.
If the mblk could store the PFN info and we had a ddi_dma_mblk_bind_handle()-like interface, then I think it would benefit the performance of the NIC drivers. I consulted the PAE, and got an answer that bcopy is typically about 10-15% of a NIC TX workload.
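
The "PFN as constructed state" idea can be sketched in user space as follows. This is only a toy model under loud assumptions: `slow_getpfn()` is a stand-in for a page-table walk such as hat_getpfnum() and simply pretends the PFN equals the virtual page number, `struct tx_buf`, `tx_buf_construct()` and `tx_buf_bind()` are hypothetical names, and no such mblk bind interface exists in the DDI as far as this thread states. The bind helper only covers the first page of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	PAGE_SHIFT	12	/* assumed 4 KiB pages */

static unsigned long slow_lookups;	/* counts simulated page-table walks */

/* Stand-in for the expensive translation; pretends PFN == virtual page. */
static uint64_t
slow_getpfn(const void *va)
{
	slow_lookups++;
	return ((uint64_t)((uintptr_t)va >> PAGE_SHIFT));
}

/* Hypothetical buffer object; cached_pfn is part of its constructed state. */
struct tx_buf {
	unsigned char *data;
	uint64_t       cached_pfn;
};

/* Cache-constructor analogue: pay for the translation once per object. */
static void
tx_buf_construct(struct tx_buf *b, unsigned char *data)
{
	assert(b != NULL && data != NULL);
	b->data = data;
	b->cached_pfn = slow_getpfn(data);
}

/* Fast "bind": rebuild the device address from the cached PFN, no lookup. */
static uint64_t
tx_buf_bind(const struct tx_buf *b)
{
	uint64_t offset = (uintptr_t)b->data & ((1u << PAGE_SHIFT) - 1);
	return ((b->cached_pfn << PAGE_SHIFT) | offset);
}
```

However many times such a buffer is reused for transmit, the simulated lookup runs only once, at construction, which is the saving the proposal is after.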
There are things that can be done to make DMA faster, better, and simpler. In an ideal world, the GLDv3 could do most of this work, and the mblk could just carry the ddi_dma_cookie with it.
   -- Garrett
Thanks,
Brian

_______________________________________________
driver-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/driver-discuss
_______________________________________________
networking-discuss mailing list
[email protected]

