Garrett D'Amore <> wrote:
> Brian Xu - Sun Microsystems - Beijing China wrote:
>> zeeshanul huq - Sun Microsystems - Beijing China wrote:
>>> Hi Brian,
>>> 
>>> The overhead is not only the DMA binding, but also the unbinding.
>> If no copybuf is used, the overhead of the unbinding is quite
>> small compared to the binding.
> For small packets, even the unbinding can start to be expensive.  Lock
> contention becomes a concern.
On x86 systems with an IOMMU enabled, TX-side DMA binding/unbinding is
a big performance obstacle for 10Gb NICs because of the time consumed
by IOMMU IOTLB flushing. A recent test showed that reducing the number
of DMA binding/unbinding operations on the TX side would greatly
improve 10Gb NIC performance.
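
For illustration, here is a minimal sketch of the per-packet bind/unbind
cycle being discussed, assuming a generic Solaris NIC tx routine; the
function name mydrv_tx_bind_one and the pre-allocated handle are
hypothetical, while the DDI calls are the standard 9F ones.

    #include <sys/types.h>
    #include <sys/stream.h>
    #include <sys/strsun.h>
    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    static int
    mydrv_tx_bind_one(ddi_dma_handle_t dmah, mblk_t *mp,
        ddi_dma_cookie_t *cookiep, uint_t *ccountp)
    {
            /* Bind: walks the VM layer (hat_getpfnum()) for each page. */
            if (ddi_dma_addr_bind_handle(dmah, NULL, (caddr_t)mp->b_rptr,
                MBLKL(mp), DDI_DMA_WRITE | DDI_DMA_STREAMING,
                DDI_DMA_DONTWAIT, NULL, cookiep, ccountp) != DDI_DMA_MAPPED)
                    return (DDI_FAILURE);

            /* ... program the TX descriptors, wait for tx completion ... */

            /*
             * Unbind: with an x86 IOMMU enabled, this is where the IOTLB
             * flush cost shows up, once per transmitted packet.
             */
            (void) ddi_dma_unbind_handle(dmah);
            return (DDI_SUCCESS);
    }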

> 
>>> And some other shortcomings are:
>>> 1) We have to hold the MBLKs until packet transmission completes.
>>> With bcopy we are able to free them immediately, so when the system
>>> is close to running out of MBLKs, bcopy works better.
>> I don't know when running out of MBLKs occurs. Is it when the
>> system is short of kernel memory? If so, the extra bcopy also
>> consumes kernel memory.
> 
> Actually, IOMMU resources are a bigger issue.
> 
>>> 2) Some drivers, like bge, have only a small number of TX buffer
>>> descriptors. With bcopy, one BD per transmit packet is guaranteed,
>>> while a packet may require more than one BD with dma_bind, so a
>>> driver using dma bind will run out of TX BDs more quickly under
>>> heavy traffic. Yes, this is reasonable.
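
A rough sketch of the BD-consumption difference described in 2),
using the same DDI headers as the earlier sketch; mydrv_fill_bd() and
the pre-bound copy buffer (copybuf/copy_ck) are hypothetical.

    /* Hypothetical helper that consumes one TX buffer descriptor. */
    extern void mydrv_fill_bd(ddi_dma_cookie_t *);

    /* Returns the number of TX BDs consumed for one mblk. */
    static uint_t
    mydrv_tx_one(ddi_dma_handle_t dmah, mblk_t *mp,
        boolean_t use_dma_bind, caddr_t copybuf, ddi_dma_cookie_t *copy_ck)
    {
            ddi_dma_cookie_t cookie;
            uint_t ccount, i;

            if (!use_dma_bind) {
                    /* bcopy into a pre-bound buffer always costs one BD. */
                    bcopy(mp->b_rptr, copybuf, MBLKL(mp));
                    mydrv_fill_bd(copy_ck);
                    return (1);
            }

            /* One packet may map to several cookies, one TX BD each. */
            if (ddi_dma_addr_bind_handle(dmah, NULL, (caddr_t)mp->b_rptr,
                MBLKL(mp), DDI_DMA_WRITE | DDI_DMA_STREAMING,
                DDI_DMA_DONTWAIT, NULL, &cookie, &ccount) != DDI_DMA_MAPPED)
                    return (0);
            for (i = 0; i < ccount; i++) {
                    mydrv_fill_bd(&cookie);
                    if (i + 1 < ccount)
                            ddi_dma_nextcookie(dmah, &cookie);
            }
            return (ccount);
    }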
>>> 
>>> That's part of the reason why we use both bcopy and dma_bind in
>>> our NIC driver. I agree we need a faster dma binding and unbinding
>>> solution.
>> What I suggested is one way to get much faster dma binding.
>> Of course, the original binding is also kept to meet the bcopy
>> requirement.
> 
> Sure.  I think a lot more analysis is required here before we do any
> significant changes, though.
> 
>     -- Garrett
>> 
>> Thanks,
>> Brian
>>> 
>>> Regards,
>>> Zeeshanul Huq
>>> 
>>> Garrett D'Amore wrote:
>>>> Brian Xu - Sun Microsystems - Beijing China wrote:
>>>>> Hi there,
>>>>> 
>>>>> I have a question here:
>>>>> Why do all of the NIC drivers have to bcopy the MBLKs for
>>>>> transmit? (Some of them always bcopy, and others bcopy only when
>>>>> the packet length is under a threshold.)
>>>>> 
>>>>> I think one of the reasons is that the overhead of setting up
>>>>> DMA on the fly is greater than the overhead of bcopy for short
>>>>> packets. I want to know if this is the case and whether there are
>>>>> any other reasons.
>>>> 
>>>> Yes.  For any reasonably sized packet (ETHERMTU or smaller),
>>>> bcopy is faster on *all* recent hardware.  (This is confirmed
>>>> even on an older 300MHz Via C3.)  (Hmm... I've heard that for
>>>> some Niagara systems this might not be true, however.  But I've
>>>> not tested it myself.)
>>>> 
>>>> I think the situation is different with jumbo frames, though.
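
A sketch of the threshold pattern most drivers use for exactly this
reason; the mydrv soft state, the tx_bcopy_thresh tunable, and the two
helper routines are hypothetical (msgsize() comes from <sys/strsun.h>).

    typedef struct mydrv {
            size_t  tx_bcopy_thresh;        /* copy/bind cutoff, e.g. ETHERMTU */
            /* ... */
    } mydrv_t;

    extern int mydrv_tx_copy(mydrv_t *, mblk_t *, size_t);
    extern int mydrv_tx_bind(mydrv_t *, mblk_t *, size_t);

    static int
    mydrv_send(mydrv_t *drvp, mblk_t *mp)
    {
            size_t len = msgsize(mp);

            if (len <= drvp->tx_bcopy_thresh) {
                    /*
                     * Small packet: copy into a pre-allocated, pre-bound
                     * DMA buffer and free the mblk immediately.
                     */
                    return (mydrv_tx_copy(drvp, mp, len));
            }

            /*
             * Large packet (e.g. jumbo frame): bind the mblk's pages
             * directly and hold the mblk until the transmit completes.
             */
            return (mydrv_tx_bind(drvp, mp, len));
    }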
>>>> 
>>>>> 
>>>>> If what I guessed is the major cause, I have a proposal and
>>>>> would like your advice on whether it makes sense.
>>>>> 
>>>>> The most time-consuming part of the DMA setup is the DMA bind,
>>>>> more specifically, calling into the VM layer to get the PFN for
>>>>> the vaddr (hat_getpfnum()), since it needs to search the huge
>>>>> page table. For the MBLKs, however, which are essentially slab
>>>>> objects, the PFN has already been determined in the slab layer,
>>>>> and for most of their usage we only touch the magazine layer,
>>>>> where the PFN is already known. That is, the PFN should be
>>>>> considered part of the constructed state, but we don't leverage
>>>>> it for the DMA bind.
>>>>> 
>>>>> In storage, we have a field 'b_shadow' in buf(9S) to store the
>>>>> recently used pages, through which the PFNs can be obtained
>>>>> easily. So in cases where b_shadow works,
>>>>> ddi_dma_buf_bind_handle() is much faster than
>>>>> ddi_dma_addr_bind_handle().
>>>>> As another example, moving the DMA bind in the HBA driver (mpt)
>>>>> from the TX path to the kmem cache constructor gave the mpt
>>>>> driver a 26% throughput increase. See CR 6707308.
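
A sketch of that constructor approach, assuming a hypothetical
per-command structure; my_dma_attr, my_acc_attr and MY_CMD_BUFSZ stand
in for the driver's real DMA/access attributes and buffer size, and the
headers are as in the earlier sketch plus <sys/kmem.h>.

    #define MY_CMD_BUFSZ    4096            /* placeholder payload size */

    static ddi_dma_attr_t           my_dma_attr;    /* filled in elsewhere */
    static ddi_device_acc_attr_t    my_acc_attr;    /* filled in elsewhere */

    typedef struct my_cmd {
            ddi_dma_handle_t        dmah;
            ddi_acc_handle_t        acch;
            caddr_t                 kaddr;
            ddi_dma_cookie_t        cookie;
            uint_t                  ccount;
    } my_cmd_t;

    /* Runs once per slab object instead of once per I/O on the hot path. */
    static int
    my_cmd_constructor(void *buf, void *arg, int kmflags)
    {
            my_cmd_t *cmd = buf;
            dev_info_t *dip = arg;
            size_t real;
            int (*cb)(caddr_t) = (kmflags & KM_NOSLEEP) ?
                DDI_DMA_DONTWAIT : DDI_DMA_SLEEP;

            if (ddi_dma_alloc_handle(dip, &my_dma_attr, cb, NULL,
                &cmd->dmah) != DDI_SUCCESS)
                    return (-1);
            if (ddi_dma_mem_alloc(cmd->dmah, MY_CMD_BUFSZ, &my_acc_attr,
                DDI_DMA_STREAMING, cb, NULL, &cmd->kaddr, &real,
                &cmd->acch) != DDI_SUCCESS) {
                    ddi_dma_free_handle(&cmd->dmah);
                    return (-1);
            }
            if (ddi_dma_addr_bind_handle(cmd->dmah, NULL, cmd->kaddr, real,
                DDI_DMA_RDWR | DDI_DMA_STREAMING, cb, NULL,
                &cmd->cookie, &cmd->ccount) != DDI_DMA_MAPPED) {
                    ddi_dma_mem_free(&cmd->acch);
                    ddi_dma_free_handle(&cmd->dmah);
                    return (-1);
            }
            return (0);
    }

    /*
     * cache = kmem_cache_create("my_cmd_cache", sizeof (my_cmd_t), 0,
     *     my_cmd_constructor, my_cmd_destructor, NULL, dip, NULL, 0);
     */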
>>>>> 
>>>>> If the mblk could store the PFN info and we had a
>>>>> ddi_dma_mblk_bind_handle()-like interface, then I think it would
>>>>> benefit the performance of the NIC drivers. I consulted PAE and
>>>>> was told that bcopy is typically about 10-15% of a NIC TX
>>>>> workload.
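
The interface named above does not exist today; purely as a
hypothetical sketch, it might mirror ddi_dma_buf_bind_handle(9F),
with the cookies built from PFNs cached in the mblk at
slab-construction time.

    /* Hypothetical, mirroring ddi_dma_buf_bind_handle(9F). */
    int
    ddi_dma_mblk_bind_handle(ddi_dma_handle_t handle, mblk_t *mp,
        uint_t flags, int (*waitfp)(caddr_t), caddr_t arg,
        ddi_dma_cookie_t *cookiep, uint_t *ccountp);

    /* Hypothetical tx-path usage: */
    if (ddi_dma_mblk_bind_handle(dmah, mp,
        DDI_DMA_WRITE | DDI_DMA_STREAMING, DDI_DMA_DONTWAIT, NULL,
        &cookie, &ccount) == DDI_DMA_MAPPED) {
            /*
             * Cookies come from PFNs already cached in the mblk, so
             * no hat_getpfnum() lookups on the hot path.
             */
    }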
>>>> 
>>>> There are things we can do to make DMA faster, better, and
>>>> simpler.  In an ideal world, the GLDv3 could do most of this work,
>>>> and the mblk could just carry the ddi_dma_cookie with it.
>>>> 
>>>>    -- Garrett
>>>>> 
>>>>> Thanks,
>>>>> Brian
>>>>> 
>>>> 
>> 
> 

Liu Jiang (Gerry)
OpenSolaris, OTC, SSG, Intel
