Garrett D'Amore <> wrote:
> Brian Xu - Sun Microsystems - Beijing China wrote:
>> zeeshanul huq - Sun Microsystems - Beijing China wrote:
>>> Hi Brian,
>>>
>>> The overhead of it is not only dma binding, but also unbinding.
>> If no copybuf is used, the overhead of the unbinding is quite small
>> compared to the binding.
> For small packets, even the unbinding can start to be expensive. Lock
> contention becomes a concern.

On x86 systems with the IOMMU enabled, TX-side DMA binding/unbinding is a
big performance obstacle for 10Gb NICs because of the time consumed by
IOMMU IOTLB flushing. A recent test shows that it would greatly improve
10Gb NIC performance if we could reduce DMA binding/unbinding operations
on the TX side.
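To make the bcopy-vs-bind tradeoff concrete, here is a minimal sketch (not
taken from any particular driver) of the usual threshold scheme: small
fragments are copied into a pre-bound buffer so the mblk can be freed
immediately and no unbind is needed, while large fragments are bound on
the fly and held until completion. The xx_tx_buf_t layout, the
XX_TX_COPY_THRESHOLD value and the descriptor-fill details are
hypothetical placeholders.

/*
 * Minimal sketch (hypothetical names): per-fragment choice between
 * bcopy into a pre-bound buffer and on-the-fly DMA binding.
 */
#include <sys/types.h>
#include <sys/systm.h>
#include <sys/stream.h>
#include <sys/strsun.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

#define XX_TX_COPY_THRESHOLD    512     /* bytes; assumed tunable */

typedef struct xx_tx_buf {
        ddi_dma_handle_t  txb_dma_handle; /* per-buffer DMA handle */
        caddr_t           txb_kaddr;      /* pre-bound copy buffer */
        ddi_dma_cookie_t  txb_cookie;     /* cookie of the copy buffer */
        mblk_t            *txb_mp;        /* mblk held in the bind case */
        boolean_t         txb_bound;      /* B_TRUE if bind was used */
} xx_tx_buf_t;

static int
xx_tx_fill_fragment(xx_tx_buf_t *txb, mblk_t *mp)
{
        size_t len = MBLKL(mp);
        ddi_dma_cookie_t cookie;
        uint_t ccount;

        if (len <= XX_TX_COPY_THRESHOLD) {
                /*
                 * Small fragment: copy into the pre-bound buffer; the
                 * caller can free the mblk right away and no unbind is
                 * needed at reclaim time.
                 */
                bcopy(mp->b_rptr, txb->txb_kaddr, len);
                txb->txb_bound = B_FALSE;
                /* descriptor uses txb->txb_cookie.dmac_laddress and len */
                return (DDI_SUCCESS);
        }

        /*
         * Large fragment: bind the mblk data directly.  The mblk must be
         * held until the hardware signals completion, and the handle must
         * be unbound in the tx-reclaim path.
         */
        if (ddi_dma_addr_bind_handle(txb->txb_dma_handle, NULL,
            (caddr_t)mp->b_rptr, len, DDI_DMA_WRITE | DDI_DMA_STREAMING,
            DDI_DMA_DONTWAIT, NULL, &cookie, &ccount) != DDI_DMA_MAPPED)
                return (DDI_FAILURE);

        txb->txb_mp = mp;
        txb->txb_bound = B_TRUE;
        /* descriptor(s) use the 'ccount' cookies starting at 'cookie' */
        return (DDI_SUCCESS);
}

In the bind case the reclaim path has to call ddi_dma_unbind_handle() and
freemsg() once the hardware is done with the descriptor, which is exactly
the unbind and lock-contention cost discussed above.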
>
>>> And some other shortcomings are:
>>> 1) We have to hold the MBLKs until packet transmission completes. With
>>> bcopy we are able to free them immediately. So when the system is
>>> close to running out of MBLKs, bcopy works better.
>> I don't know when running out of MBLKs occurs. When the system is
>> short of kernel memory? If that is the case, then the extra bcopy also
>> consumes kernel memory.
>
> Actually, IOMMU resources are a bigger issue.
>
>>> 2) Some drivers like bge only have a small number of TX buffer
>>> descriptors. With bcopy, one BD per transmit packet is ensured, while
>>> more than one may be required with dma_bind. So using dma bind, it
>>> will run out of TX BDs more quickly during heavy traffic. Yes. This
>>> is reasonable.
>>>
>>> That's part of the reason why we use both bcopy and dma_bind in our
>>> NIC driver. I agree we need a faster dma binding and unbinding
>>> solution.
>> What I suggested is one way to get much faster dma binding.
>> Of course, the original binding is also kept to meet the bcopy
>> requirement.
>
> Sure. I think a lot more analysis is required here before we do any
> significant changes, though.
>
> -- Garrett
>>
>> Thanks,
>> Brian
>>>
>>> Regards,
>>> Zeeshanul Huq
>>>
>>> Garrett D'Amore wrote:
>>>> Brian Xu - Sun Microsystems - Beijing China wrote:
>>>>> Hi there,
>>>>>
>>>>> I have a question here:
>>>>> Why do all of the NIC drivers have to bcopy the MBLKs for transmit?
>>>>> (Some of them always bcopy, and others bcopy under a threshold of
>>>>> the packet length.)
>>>>>
>>>>> I think one of the reasons is that the overhead of setting up DMA on
>>>>> the fly is greater than the overhead of bcopy for short packets. I
>>>>> want to know if this is the case and if there are any other
>>>>> reasons.
>>>>
>>>> Yes. For any reasonably sized packet (ETHERMTU or smaller), bcopy
>>>> is faster on *all* recent hardware. (This is confirmed on even an
>>>> older 300MHz Via C3.) (Hmm... I've heard that for some Niagara
>>>> systems this might not be true, however. But I've not tested it
>>>> myself.)
>>>>
>>>> I think the situation is different with jumbo frames, though.
>>>>
>>>>>
>>>>> If what I guess is the major cause, I have a proposal, and I want
>>>>> to hear your advice on whether it makes sense.
>>>>>
>>>>> The most time-consuming action in the DMA setup is the DMA bind;
>>>>> more specifically, calling into the VM layer to get the PFN for the
>>>>> vaddr (hat_getpfnum()), since it needs to search the huge page
>>>>> table. For the MBLKs, which are essentially slab objects, the PFN
>>>>> has already been determined in the slab layer, and for most of
>>>>> their usage we only touch the magazine layer, where the PFN is
>>>>> predetermined. That is, the PFN should be considered constructed
>>>>> state, but we don't leverage it for the DMA bind.
>>>>>
>>>>> In storage, we have a field 'b_shadow' in buf(9S) to store the
>>>>> recently used pages, through which the PFNs can be obtained
>>>>> easily. So in the case where b_shadow works,
>>>>> ddi_dma_buf_bind_handle() is much faster than
>>>>> ddi_dma_mem_bind_handle().
>>>>> Another example: by moving the DMA bind in the HBA driver (mpt)
>>>>> from the TX path to the kmem cache constructor, the mpt driver got
>>>>> a 26% throughput increase. See CR 6707308.
>>>>>
>>>>> If the mblk could store the PFN info and we had a
>>>>> ddi_dma_mblk_bind_handle()-like interface, then I think it would
>>>>> benefit the performance of the NIC drivers.
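As a rough illustration of the CR 6707308 direction mentioned above, here
is a hedged sketch of a kmem cache whose constructor allocates and binds a
per-object DMA buffer once, so the hot path only copies data into it and
reuses the precomputed cookie instead of binding/unbinding per packet. The
xx_softc_t/xx_txb_t names, the buffer size and the attribute setup are
assumptions for illustration only, not code from mpt or any other driver.

/*
 * Hedged sketch: pre-allocate and pre-bind a per-object DMA buffer in a
 * kmem cache constructor so the transmit path pays no bind/unbind cost.
 * All names and sizes here are hypothetical.
 */
#include <sys/types.h>
#include <sys/kmem.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

#define XX_TXB_SIZE     2048            /* assumed copy-buffer size */

typedef struct xx_softc {
        dev_info_t              *xx_dip;
        ddi_dma_attr_t          xx_dma_attr;    /* filled in at attach */
        ddi_device_acc_attr_t   xx_acc_attr;
        kmem_cache_t            *xx_txb_cache;
} xx_softc_t;

typedef struct xx_txb {
        ddi_dma_handle_t        txb_dmah;
        ddi_acc_handle_t        txb_acch;
        caddr_t                 txb_kaddr;
        ddi_dma_cookie_t        txb_cookie;     /* computed once, reused */
} xx_txb_t;

static int
xx_txb_ctor(void *buf, void *arg, int kmflag)
{
        xx_txb_t *txb = buf;
        xx_softc_t *sc = arg;
        size_t real_len;
        uint_t ccount;
        int (*cb)(caddr_t) = (kmflag == KM_SLEEP) ?
            DDI_DMA_SLEEP : DDI_DMA_DONTWAIT;

        if (ddi_dma_alloc_handle(sc->xx_dip, &sc->xx_dma_attr, cb, NULL,
            &txb->txb_dmah) != DDI_SUCCESS)
                return (-1);

        if (ddi_dma_mem_alloc(txb->txb_dmah, XX_TXB_SIZE, &sc->xx_acc_attr,
            DDI_DMA_STREAMING, cb, NULL, &txb->txb_kaddr, &real_len,
            &txb->txb_acch) != DDI_SUCCESS) {
                ddi_dma_free_handle(&txb->txb_dmah);
                return (-1);
        }

        if (ddi_dma_addr_bind_handle(txb->txb_dmah, NULL, txb->txb_kaddr,
            real_len, DDI_DMA_WRITE | DDI_DMA_STREAMING, cb, NULL,
            &txb->txb_cookie, &ccount) != DDI_DMA_MAPPED) {
                ddi_dma_mem_free(&txb->txb_acch);
                ddi_dma_free_handle(&txb->txb_dmah);
                return (-1);
        }

        return (0);
}

static void
xx_txb_dtor(void *buf, void *arg)
{
        xx_txb_t *txb = buf;

        (void) ddi_dma_unbind_handle(txb->txb_dmah);
        ddi_dma_mem_free(&txb->txb_acch);
        ddi_dma_free_handle(&txb->txb_dmah);
}

/*
 * At attach time (hypothetical):
 *
 * sc->xx_txb_cache = kmem_cache_create("xx_txb_cache",
 *     sizeof (xx_txb_t), 0, xx_txb_ctor, xx_txb_dtor, NULL, sc, NULL, 0);
 */

A ddi_dma_mblk_bind_handle()-style interface would aim at the same goal
from the other side: reusing the PFN information the allocator already
has, instead of pre-binding private copy buffers.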
>>>>> I consulted the PAE and got an answer that the bcopy is typically
>>>>> about 10-15% of a NIC TX workload.
>>>>
>>>> There are things that we can do to make DMA faster, better, and
>>>> simpler. In an ideal world, the GLDv3 could do most of this work,
>>>> and the mblk could just carry the ddi_dma_cookie with it.
>>>>
>>>> -- Garrett
>>>>>
>>>>> Thanks,
>>>>> Brian

Liu Jiang (Gerry)
OpenSolaris, OTC, SSG, Intel

_______________________________________________
driver-discuss mailing list
driver-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/driver-discuss