Brian Xu - Sun Microsystems - Beijing China wrote:
Garrett D'Amore wrote:
Brian Xu - Sun Microsystems - Beijing China wrote:
Garrett D'Amore wrote:
Brian Xu - Sun Microsystems - Beijing China wrote:
Garrett D'Amore wrote:
Brian Xu - Sun Microsystems - Beijing China wrote:
Hi there,
I have a question here:
Why do all of the NIC drivers have to bcopy the MBLKs for transmit?
(some of them bcopy always, and some others bcopy under a
threshold of the packet length).
I think one of the reasons is that the overhead of setting up DMA on
the fly is greater than the overhead of bcopy for short packets.
I want to know if this is the case and if there are any other
reasons.
Yes. For any reasonably sized packet (ETHERMTU or smaller), bcopy
is faster on *all* recent hardware. (This is confirmed on even
an older 300MHz Via C3.) (Hmm... I've heard that for some
Niagara systems this might not be true, however. But I've not
tested it myself.)
Even with bcopy, there is still a need for a pre-bound DMA resource.
So the bcopy size threshold depends on whether the overhead of a DMA
bind on the fly is greater than the overhead of a bcopy to a
pre-bound DMA address. The hardware itself only knows that DMA is
needed.
You pay for the pre-bound DMA setup at attach() time, so it doesn't
play a role. So you have to compare the cost of bcopy() vs. the cost
of ddi_dma_addr_setup().
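As a rough illustration of the comparison described above, here is a minimal userland sketch of the copy-vs-bind decision. It is not taken from any real driver: TX_COPY_THRESHOLD, tx_choose_path() and tx_copy_to_prebound() are hypothetical names, the threshold value is arbitrary, and memcpy() stands in for the kernel's bcopy() into a buffer that would have been bound at attach() time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tunable: packets at or below this size take the copy path. */
#define	TX_COPY_THRESHOLD	512

typedef enum { TX_COPY, TX_BIND } tx_path_t;

/*
 * Pick the transmit path for one packet. Small packets are copied into a
 * buffer whose DMA binding was paid for once, up front; larger packets
 * would be bound on the fly (not modeled here).
 */
static tx_path_t
tx_choose_path(size_t pkt_len, size_t copy_threshold)
{
	return (pkt_len <= copy_threshold ? TX_COPY : TX_BIND);
}

/*
 * Copy path: memcpy() stands in for bcopy() into the pre-bound buffer.
 * Returns the number of bytes the (hypothetical) descriptor would cover.
 */
static size_t
tx_copy_to_prebound(void *prebound, size_t prebound_len,
    const void *pkt, size_t pkt_len)
{
	assert(pkt_len <= prebound_len);
	memcpy(prebound, pkt, pkt_len);
	return (pkt_len);
}
```

A driver that always copies is the special case where the threshold is at least as large as the biggest frame it will ever send.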
That is really what I meant.
There is a lot of additional complexity for tx as well, because
you have to deal with the fact that packets may cross page
boundaries and require multiple DMA cookies. This adds a lot of
complexity, and not all drivers can deal well with multiple
descriptors per packet.
Just as we do for ddi_dma_buf_bind_handle, the shadow page list
records all the mapped physical pages, so you don't have to worry
about crossing page boundaries.
Ah, but the *driver* does, because in the absence of an IOMMU you
need to be able to allocate more than one descriptor. You have some
call overhead as well... multiple ddi_dma_sync() calls per packet,
probably, and ddi_dma_nextcookie() and such.
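To make the page-boundary point concrete, the following userland sketch computes an upper bound on the number of DMA cookies (and so descriptors) one packet buffer could need. It assumes no IOMMU and that no two adjacent virtual pages are physically contiguous. tx_worst_case_cookies() is a hypothetical helper; in a real driver the actual count is whatever the bind call reports and ddi_dma_nextcookie() walks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Worst-case cookie count for a buffer of 'len' bytes starting at 'vaddr':
 * one cookie per page the buffer touches. page_size must be a power of two.
 */
static size_t
tx_worst_case_cookies(uintptr_t vaddr, size_t len, size_t page_size)
{
	uintptr_t first, last;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	if (len == 0)
		return (0);

	first = vaddr / page_size;		/* page holding first byte */
	last = (vaddr + len - 1) / page_size;	/* page holding last byte */
	return ((size_t)(last - first) + 1);
}
```

With 4 KB pages, a 1500-byte frame that starts near the end of a page spans two pages, so a driver that can only use one descriptor per packet cannot bind it directly under these assumptions.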
Yes, that is a problem. So there would be a trade-off.
It might not sound like much, but on hot code paths every additional
function call adds overhead. You don't have to make many extra
function calls before you catch up to the cost of bcopy. For
example, ignoring memory for the moment, bcopy of a 1024 byte packet
might require fewer than 150 instruction cycles.
I think a fast binding may lower the threshold for the bcopy.
It would.
But IMO, we're probably optimizing the wrong part of the stack here.
bcopy is 10% of the performance hit, according to RPE. What about the
other 90%?
Also note that you will *never* make the cost of transferring data
*zero*. You have to look at how much better dma binding would be than
bcopy. Already we know it's very close for full MTU frames. If you can
make DMA binding 30% cheaper, is it going to really change the balance
of performance that much? I doubt it. But, if you can eliminate stack
overheads, lock contention, etc., then you might be much better served.
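The arithmetic behind this argument can be sketched as follows. This is illustration only: it assumes that data movement is the 10% share of per-packet cost quoted above and that a faster binding removes 30% of that share. overall_speedup() is a hypothetical helper, and neither figure is a new measurement.

```c
#include <assert.h>

/*
 * Overall speedup when a component that accounts for 'fraction' of the
 * total cost is made cheaper by 'reduction' (both in the range 0..1).
 * The new total cost is 1 - fraction * reduction of the old one.
 */
static double
overall_speedup(double fraction, double reduction)
{
	assert(fraction >= 0.0 && fraction <= 1.0);
	assert(reduction >= 0.0 && reduction <= 1.0);
	return (1.0 / (1.0 - fraction * reduction));
}
```

Under those assumptions overall_speedup(0.10, 0.30) is about 1.03, and even removing the 10% share entirely (reduction of 1.0) gives only about 1.11, which is why the remaining 90% is the more promising target.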
I'd rather avoid continuing to grossly complicate device drivers with
DMA details unless there is a significant benefit to doing so. Right
now, for ethernet, I'm not sure there is. (Again, Jumbo Frames changes
the trade off, a lot. Primarily because it eliminates most of the other
overhead so that bcopy dominates.)
For typical traffic, on typical segments, you can't use jumbo frames, so
spending all your effort trying to make dma work faster is probably not
the best use of your energy.
-- Garrett
I still don't know if there are reasons other than the
overhead of DMA setup.
Complexity. There are various concerns involved, such as a race
between _fini() and esballoc (for the rx path).
Also you have to worry about alignment. Not all hardware can
transmit arbitrarily aligned packets. With all the work you wind
up doing to make this work correctly, you get very little
performance benefit. So it's rarely worth the pain and suffering.
For regular MTU frames, it just isn't worth it, ever. On
reasonably modern hardware, anyway.
For alignment, doing what the large-packet transmit path (DMA bind
on the fly) does is OK, I think.
Packets may be aligned on *any* boundary. In fact, they are often
*not* 32-bit aligned, but 16-bit aligned. Not all hardware can deal
with off-half-word alignment.
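The alignment concern can be sketched in userland C as a simple check that a transmit path might apply before trying to bind a buffer directly. tx_is_aligned() and tx_must_copy() are hypothetical helpers, and the required alignment is an assumed per-device parameter, not a value from any particular NIC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if 'p' is aligned to 'align' bytes; align must be a power of two. */
static int
tx_is_aligned(const void *p, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (((uintptr_t)p & (align - 1)) == 0);
}

/*
 * Nonzero if the packet data cannot be handed to the hardware as is and
 * has to be copied into a suitably aligned buffer instead.
 */
static int
tx_must_copy(const void *pkt, size_t hw_align)
{
	return (!tx_is_aligned(pkt, hw_align));
}
```

A buffer that is 16-bit but not 32-bit aligned passes the check for hw_align of 2 and fails it for hw_align of 4, which is the case described above for hardware that needs word alignment.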
Now when the packet is longer than the threshold, the stock NIC
drivers use DMA bind on the fly. Then how do they cope with
alignment?
Thanks,
Brian
-- Garrett
Thanks,
Brian
For rx, you can eliminate a lot of the DMA costs by recycling
buffers. But the complexity to do this "well" without introducing
potential panics is high. Almost every driver that has tried has
gotten this wrong at some point. Some of them are still wrong.
-- Garrett
Thanks,
Brian
I think the situation is different with jumbo frames, though.
If what I guess is the major cause, I have a proposal and I want
to hear your advice whether it makes sense.
The most time-consuming action in the DMA setup is the DMA
bind; more specifically, calling into the VM layer to get the PFN
for the vaddr (hat_getpfnum()), since it needs to search the huge
page table. MBLKs, however, are essentially slab objects: the PFN
has already been determined in the slab layer, and for most of
their usage we only touch the magazine layer, where the PFN is
predetermined. That is, the PFN should be considered constructed
state, but we don't leverage it for the DMA bind.
In storage, we have a field 'b_shadow' in buf(9S) to store the
pages which were recently used, through which the PFNs can be
easily obtained. So in the case that b_shadow works,
ddi_dma_buf_bind_handle() is much faster than
ddi_dma_mem_bind_handle().
Another example: by moving the DMA bind of the HBA driver (mpt) from
the Tx path to the kmem cache constructor, the mpt driver gained a
26% throughput increase. See CR6707308.
If the mblk could store the PFN info and we had a
ddi_dma_mblk_bind_handle()-like interface, then I think it would
benefit the performance of the NIC drivers. I consulted the
PAE, and got an answer that the bcopy is typically about 10-15%
of a NIC TX workload.
There are things we can do to make DMA faster, better, and
simpler. In an ideal world, the GLDv3 could do most of this
work, and the mblk could just carry the ddi_dma_cookie with it.
-- Garrett
Thanks,
Brian
_______________________________________________
driver-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/driver-discuss
_______________________________________________
networking-discuss mailing list
[email protected]