Garrett D'Amore wrote:
Brian Xu - Sun Microsystems - Beijing China wrote:
Garrett D'Amore wrote:
Brian Xu - Sun Microsystems - Beijing China wrote:
Garrett D'Amore wrote:
Brian Xu - Sun Microsystems - Beijing China wrote:
Garrett D'Amore wrote:
Brian Xu - Sun Microsystems - Beijing China wrote:
Hi there,
I have a question here:
Why do all of the NIC drivers have to bcopy the mblks for
transmit? (Some of them always bcopy, and others bcopy only
below a threshold packet length.)
I think one reason is that the overhead of setting up DMA on
the fly is greater than the overhead of bcopy for short
packets. I want to know if this is the case and if there are
any other reasons.
Yes. For any reasonably sized packet (ETHERMTU or smaller),
bcopy is faster on *all* recent hardware. (This is confirmed
even on an older 300MHz VIA C3.) (Hmm... I've heard that for
some Niagara systems this might not be true, however. But I've
not tested it myself.)
Even with bcopy, a pre-bound DMA resource is still needed.
So the bcopy-size threshold comes down to whether the
overhead of a DMA bind on the fly is greater than the overhead
of a bcopy into a pre-bound DMA address. The hardware itself
only knows that DMA is needed.
The pre-bound DMA setup cost is paid at attach() time, and doesn't
play a role here. So you have to compare the cost of bcopy() vs. the
cost of ddi_dma_addr_setup().
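
To make that comparison concrete, here is a rough sketch of what the
tx path ends up deciding. None of this is taken from any particular
driver: xx_softc_t, xx_tx_buf_t, xx_get_tx_buf(), xx_fill_desc(),
XX_BCOPY_THRESH, XX_SUCCESS and XX_FAILURE are invented names; only
the DDI/STREAMS calls are real.

#include <sys/types.h>
#include <sys/stream.h>
#include <sys/strsun.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

/*
 * Illustrative sketch only.  tb_kaddr, tb_buf_dmah and tb_cookie
 * describe a copy buffer that was allocated and DMA-bound once, at
 * attach() time; tb_bind_dmah is a spare handle kept for on-the-fly
 * binds.
 */
static int
xx_send(xx_softc_t *sc, mblk_t *mp)
{
	xx_tx_buf_t	*tb = xx_get_tx_buf(sc);
	size_t		len = msgsize(mp);

	if (len <= XX_BCOPY_THRESH) {
		/* Copy into the buffer that was bound at attach() time. */
		mcopymsg(mp, tb->tb_kaddr);	/* copies all of mp, frees it */
		(void) ddi_dma_sync(tb->tb_buf_dmah, 0, len,
		    DDI_DMA_SYNC_FORDEV);
		xx_fill_desc(sc, tb->tb_cookie.dmac_laddress, len);
	} else {
		/* Bind the mblk in place: pay the VM/HAT lookup cost now. */
		ddi_dma_cookie_t	cookie;
		uint_t			ccount;

		if (ddi_dma_addr_bind_handle(tb->tb_bind_dmah, NULL,
		    (caddr_t)mp->b_rptr, MBLKL(mp),
		    DDI_DMA_WRITE | DDI_DMA_STREAMING, DDI_DMA_DONTWAIT,
		    NULL, &cookie, &ccount) != DDI_DMA_MAPPED)
			return (XX_FAILURE);	/* caller requeues or frees mp */
		xx_fill_desc(sc, cookie.dmac_laddress, cookie.dmac_size);
		/* further cookies, b_cont chains, unbind at tx-done omitted */
	}
	return (XX_SUCCESS);
}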
That is really what I meant.
There is a lot of additional complexity for tx as well, because
you have to deal with the fact that packets may cross page
boundaries and require multiple DMA cookies. Not all drivers can
deal well with multiple descriptors per packet.
Just as with ddi_dma_buf_bind_handle(), the shadow page
list records all the mapped physical pages, so you don't have to
worry about crossing page boundaries.
Ah, but the *driver* does, because in the absence of an IOMMU you
need to be able to allocate more than one descriptor. You have
some call overhead as well... multiple ddi_dma_sync() calls per
packet, probably, and ddi_dma_nextcookie() and such.
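
For reference, the shape of that per-packet work is roughly the
fragment below (sc, mp and dmah come from the surrounding driver
context; xx_ring_avail() and xx_fill_desc() are invented). Each
cookie costs a descriptor, and the bind/nextcookie/sync calls are the
per-packet overhead being discussed.

ddi_dma_cookie_t	cookie;
uint_t			ccount, i;

if (ddi_dma_addr_bind_handle(dmah, NULL, (caddr_t)mp->b_rptr, MBLKL(mp),
    DDI_DMA_WRITE | DDI_DMA_STREAMING, DDI_DMA_DONTWAIT, NULL,
    &cookie, &ccount) != DDI_DMA_MAPPED)
	return (B_FALSE);

if (!xx_ring_avail(sc, ccount)) {	/* need ccount free descriptors */
	(void) ddi_dma_unbind_handle(dmah);
	return (B_FALSE);
}

for (i = 0; i < ccount; i++) {
	xx_fill_desc(sc, cookie.dmac_laddress, cookie.dmac_size);
	if (i + 1 < ccount)
		ddi_dma_nextcookie(dmah, &cookie);
}
(void) ddi_dma_sync(dmah, 0, 0, DDI_DMA_SYNC_FORDEV);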
Yes, that is a problem. So there would be a trade-off.
It might not sound like much, but on hot code paths every additional
function call adds overhead. You don't have to make many extra
function calls before you catch up to the cost of bcopy. For
example, ignoring memory for the moment, a bcopy of a 1024-byte packet
might require fewer than 150 instruction cycles.
I think a fast binding may lower the threshold for using bcopy.
It would.
But IMO, we're probably optimizing the wrong part of the stack here.
bcopy is 10% of the performance hit, according to RPE. What about the
other 90%?
I think the 10% means that bcopy consumes 10% of the CPU cycles that
the NIC driver's Tx path consumes.
Also note that you will *never* make the cost of transferring data
*zero*. You have to look at how much better dma binding would be than
bcopy. Already we know it's very close for full-MTU frames. If you
can make DMA binding 30% cheaper, is it going to really change the
balance of performance that much? I doubt it. But, if you can
eliminate stack overheads, lock contention, etc., then you might be
much better served.
I am not sure either how much a fast bind would benefit performance,
which is why I asked the question on this alias. :-)
I'd rather avoid continuing to grossly complicate device drivers with
DMA details unless there is a significant benefit to doing so. Right
now, for ethernet, I'm not sure there is. (Again, jumbo frames
change the trade-off a lot, primarily because they eliminate most of
the other overhead, so that bcopy dominates.)
Even if we had fast bind interfaces, they would not make the device
driver more complicated.
Thanks,
Brian
For typical traffic, on typical segments, you can't use jumbo frames,
so spending all your effort trying to make DMA work faster is probably
not the best use of your energy.
-- Garrett
I still don't know if there are reasons other than the
overhead of DMA setup.
Complexity. There are various concerns involved, such as a race
between _fini() and esballoc() (for the rx path).
Also you have to worry about alignment. Not all hardware can
transmit arbitrarily aligned packets. With all the work you wind
up doing to make this work correctly, you get very little
performance benefit. So it's rarely worth the pain and suffering.
For regular MTU frames, it just isn't worth it, ever. On
reasonably modern hardware, anyway.
For alignment, doing what the large-packet transmit path (DMA bind on
the fly) does should be OK, I think.
Packets may be aligned on *any* boundary. In fact, they are often
*not* 32-bit aligned, but 16-bit aligned. Not all hardware can deal
with off-half-word alignment.
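
In practice that usually comes down to something like the hypothetical
guard below (xx_send_bcopy(), xx_send_bind(), XX_BCOPY_THRESH and the
assumed 4-byte hardware alignment requirement are all invented): force
the copy path whenever the payload doesn't sit where the hardware can
take it, regardless of length.

#define	XX_TX_ALIGN	4

if (MBLKL(mp) <= XX_BCOPY_THRESH ||
    ((uintptr_t)mp->b_rptr & (XX_TX_ALIGN - 1)) != 0)
	return (xx_send_bcopy(sc, mp));	/* copy into the pre-bound buffer */
else
	return (xx_send_bind(sc, mp));	/* DMA-bind the payload in place */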
Now, when the packet is longer than the threshold, the stock NIC
drivers use DMA bind on the fly. How do they cope with alignment
then?
Thanks,
Brian
-- Garrett
Thanks,
Brian
For rx, you can eliminate a lot of the DMA costs by recycling
buffers. But the complexity to do this "well" without introducing
potential panics is high. Almost every driver that has tried has
gotten this wrong at some point. Some of them are still wrong.
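
For illustration, the usual loaned-buffer shape looks something like
the sketch below. xx_rxbuf_t, xx_softc_t, xx_rxbuf_recycle(),
sc_rx_loaned, sc_detaching and sc_detach_cv are invented; only the
DDI/STREAMS calls are real. The hazard is that the free routine runs
whenever the upper layers finally free the mblk, so detach()/_fini()
must wait until every loaned buffer has come home.

#include <sys/types.h>
#include <sys/stream.h>
#include <sys/strsun.h>
#include <sys/atomic.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

static void
xx_rxbuf_free(caddr_t arg)
{
	xx_rxbuf_t	*rb = (xx_rxbuf_t *)arg;
	xx_softc_t	*sc = rb->rb_sc;

	xx_rxbuf_recycle(sc, rb);		/* back onto the free list */
	if (atomic_dec_32_nv(&sc->sc_rx_loaned) == 0 && sc->sc_detaching)
		cv_signal(&sc->sc_detach_cv);	/* last loaned buffer is home */
}

static mblk_t *
xx_rx_loan(xx_softc_t *sc, xx_rxbuf_t *rb, size_t len)
{
	mblk_t	*mp;

	rb->rb_frtn.free_func = xx_rxbuf_free;
	rb->rb_frtn.free_arg = (caddr_t)rb;

	mp = desballoc((uchar_t *)rb->rb_kaddr, len, 0, &rb->rb_frtn);
	if (mp == NULL)
		return (NULL);			/* caller falls back to bcopy */

	atomic_inc_32(&sc->sc_rx_loaned);
	mp->b_wptr = mp->b_rptr + len;
	return (mp);
}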
-- Garrett
Thanks,
Brian
I think the situation is different with jumbo frames, though.
If what I guessed above is the major cause, I have a proposal, and I
would like to hear your advice on whether it makes sense.
The most time-consuming part of DMA setup is the DMA bind;
more specifically, calling into the VM layer to get the PFN
for the vaddr (hat_getpfnum()), since it needs to search the huge
page table. But mblks are essentially slab objects, so the PFN
has already been determined in the slab layer, and for most of
their lifetime we only touch the magazine layer, where the PFN
is already known. That is, the PFN should be treated as
constructed state, but we don't leverage it for the DMA bind.
In storage, we have a field, b_shadow, in buf(9S) that stores the
recently used pages, from which the PFNs can easily be obtained. So in
the cases where b_shadow is populated, ddi_dma_buf_bind_handle() is
much faster than ddi_dma_addr_bind_handle().
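
Roughly, the two bind flavors being contrasted look like the fragment
below (dmah, bp and mp are assumed from the surrounding context, the
handle is assumed already allocated, and the two calls are
alternatives, not both issued on one handle):

ddi_dma_cookie_t	cookie;
uint_t			ccount;
int			rv;

/* Storage-style: the page list can come straight from bp->b_shadow. */
rv = ddi_dma_buf_bind_handle(dmah, bp,
    DDI_DMA_READ | DDI_DMA_STREAMING, DDI_DMA_SLEEP, NULL,
    &cookie, &ccount);

/* Network-style: every page behind mp->b_rptr needs a lookup at bind time. */
rv = ddi_dma_addr_bind_handle(dmah, NULL, (caddr_t)mp->b_rptr, MBLKL(mp),
    DDI_DMA_WRITE | DDI_DMA_STREAMING, DDI_DMA_SLEEP, NULL,
    &cookie, &ccount);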
Another example: by moving the DMA bind in the HBA driver (mpt)
from the Tx path to the kmem cache constructor, the mpt driver got a
26% throughput increase. See CR 6707308.
If the mblk could store the PFN info and we had a
ddi_dma_mblk_bind_handle()-like interface, then I think it would
benefit the performance of the NIC drivers. I consulted
PAE and got the answer that bcopy is typically about 10-15%
of a NIC TX workload.
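
Purely as a sketch of the shape I have in mind (no such interface
exists today; the name and signature are hypothetical), something
like:

/*
 * Hypothetical interface, for discussion only.  The idea is that the
 * PFN/cookie state would be kept with the mblk (by the allocator or
 * by GLDv3), so the bind degenerates to a lookup.
 */
int
ddi_dma_mblk_bind_handle(ddi_dma_handle_t handle, mblk_t *mp,
    uint_t flags, int (*callback)(caddr_t), caddr_t arg,
    ddi_dma_cookie_t *cookiep, uint_t *ccountp);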
There are things that can be done to make DMA faster, better, and
simpler. In an ideal world, GLDv3 could do most of this
work, and the mblk could just carry the ddi_dma_cookie with it.
-- Garrett
Thanks,
Brian
_______________________________________________
driver-discuss mailing list
driver-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/driver-discuss