Dana H. Myers wrote:
> Garrett D'Amore wrote:
>
>> I agree with Dana's sentiments here.
>>
>> I'll add one thing from my own experience: the pain and suffering of
>> dealing with scatter/gather is not justified on modern systems with
>> ordinary-sized Ethernet frames. In fact, experience shows that just
>> copying frames into a contiguous preallocated buffer (you can use
>> mcopymsg()) is more efficient than trying to worry about scatter/gather.
>> (Scatter/gather is only really applicable when you map buffers directly
>> without copying them anyway.)
>>
> ... or when your NIC has fixed transmit/receive resources.
>
> I think Steven might have been using the expression "scatter/gather" a
> little specifically, though. The DP8390 doesn't do scatter/gather in the
> sense that payload data is contained in arbitrary memory buffers; the ring
> managed by the 8390 is a fixed allocation. So we're really talking about a
> segmented ring buffer, in which case it isn't *that* complicated to manage
> copy-in/out.
>
> In 1995, I had no trouble saturating a 10Base-T Ethernet with a
> PCnet-PCI card on 100MHz Pentium systems without driving the
> CPU utilization through the roof, and that driver copied frames in
> and out of a segmented ring buffer - so I'm comfortable agreeing
> that a modern PCIe system would have no trouble keeping a
> 100Mb/s adapter fed without high CPU or bus utilization.
>
> More generally speaking, though - I'd be surprised if > 1 GbE NICs
> performed well with copy-in/out; CPU utilization is probably the dominant
> cost, and the size of the frame likely matters less than the sheer
> transfer rate a good NIC can sustain.
>
The problem is that the cost of a bcopy of ~1500 bytes becomes less than
the cost of the various DMA (or DVMA) related contortions. For TX it's
almost impossible to make direct binding work well, since you have to
bind/unbind each packet (at least one DMA operation per packet -- often
more than that, since packets frequently come in split across a page
boundary). For RX it's less clear -- you can use esballoc() kinds of
techniques to get good performance, but you have to reuse the RX buffers,
which means potentially leaking DMA resources (there's no guarantee when,
or if, an mblk will be freed) and additional lock-related overhead to make
sure you don't have a detach()/_fini() related race condition. (You can't
unload the driver while the upper layers are holding any of your mblks.)
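For concreteness, the copy-TX path amounts to walking the mblk chain and
copying each fragment into one preallocated contiguous buffer that was
DMA-bound once at attach time. A toy sketch -- the mblk_t here is a
simplified stand-in for the real STREAMS structure, and flatten_frame() is
my name, not a DDI routine:

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a STREAMS message block (one packet fragment). */
typedef struct mblk {
    unsigned char *b_rptr;   /* first valid byte */
    unsigned char *b_wptr;   /* one past the last valid byte */
    struct mblk   *b_cont;   /* next fragment in the chain */
} mblk_t;

/*
 * mcopymsg(9F)-style TX path: flatten every fragment into a single
 * preallocated contiguous buffer.  Returns the frame length, or 0 if
 * the frame doesn't fit.
 */
static size_t
flatten_frame(const mblk_t *mp, unsigned char *buf, size_t buflen)
{
    size_t off = 0;

    for (; mp != NULL; mp = mp->b_cont) {
        size_t frag = (size_t)(mp->b_wptr - mp->b_rptr);

        if (off + frag > buflen)
            return (0);
        memcpy(buf + off, mp->b_rptr, frag);
        off += frag;
    }
    return (off);
}
```

The point is that the one memcpy per fragment replaces a per-packet
bind/unbind cycle entirely; the bounce buffer's DMA mapping never changes.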
If I were writing a 10GbE driver, I'd use direct DMA mapping for Jumbo
frames, but I'd stick with bcopy/mcopymsg style for regular MTU frames
-- at least until someone convinced me that doing otherwise was worthwhile.
(Btw, the Neptune folks did this sort of performance analysis, and came to
much the same conclusion -- bcopy is faster/better than the alternatives in
most cases.)
What would be *really* cool would be an API for Jumbo frames that let you
"swap" a pair of pages. Think of it as a super-fast alternative to bcopy.
(Or a hardware function to copy an entire frame.) There are probably some
tricks here that could be explored -- but I doubt any of them are useful
for 1500-byte frames.
-- Garrett
> Cheers,
> Dana
>
> _______________________________________________
> driver-discuss mailing list
> [email protected]
> http://mail.opensolaris.org/mailman/listinfo/driver-discuss
>