James Carlson wrote:
Doing zero-copy from user space to kernel means flipping page table
entries.  When we've looked into this in the recent past, we've[1]
found that the cost of doing this completely dwarfs the cost of
copying the data on fairly modern architectures.

I don't know if that's still true.  Perhaps it's something to put on
the VM wish list: "can we flip pages, or are we still stuck there?"

If I remember rightly (and I'm sure I'll get corrected!) it's not
the remap that kills you (in the sense of setting up a new mapping
for the destination address-space), but the shootdown of the old map
(so that the source address-space program can't tread on the data
once ownership has been transferred).

The latter doesn't matter on inbound data if you regard the kernel
and NIC as trustworthy (I'm prepared to trust the kernel, but
quite often I'm doubtful about I/O devices.  This is my FT heritage
speaking, I'm afraid).

So, transmit data is the problem here.  One approach might be to
track the migrations of the source process across CPUs.  If the process
has never migrated and the source buffer isn't shared, mapping shootdown
can be fast, as there's no inter-CPU communication to be done.
(Traditionally a demap requires a TLB-shootdown broadcast to all CPUs
in the system, to get them to do any required TLB invalidations.)
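
To make that concrete, here's a rough sketch of the decision the VM
layer would have to make when tearing down the transmit mapping.  Plain
C, and every helper and structure name below is invented for
illustration - this is not anything in the Solaris source:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Invented bookkeeping: which CPUs the process has ever run on,
     * kept as a simple bitmask for the sake of the sketch. */
    typedef struct proc_cpu_hist {
        uint64_t ph_ran_on;       /* bit N set => has run on CPU N    */
        bool     ph_buf_shared;   /* transmit buffer mapped elsewhere */
    } proc_cpu_hist_t;

    /* Invented helpers standing in for the real VM/HAT operations. */
    void clear_pte_range(void *va, size_t len);
    void tlb_invalidate_local(void *va, size_t len);
    void tlb_shootdown_broadcast(void *va, size_t len);
    int  current_cpu_id(void);

    void
    demap_tx_buffer(proc_cpu_hist_t *ph, void *va, size_t len)
    {
        clear_pte_range(va, len);      /* drop the mappings themselves */

        if (!ph->ph_buf_shared &&
            ph->ph_ran_on == (UINT64_C(1) << current_cpu_id())) {
            /* Process never migrated and the buffer isn't shared:
             * only this CPU's TLB can hold a stale translation,
             * so a purely local invalidation is enough. */
            tlb_invalidate_local(va, len);
        } else {
            /* Traditional case: cross-call every CPU so each one
             * flushes any cached translation for this range. */
            tlb_shootdown_broadcast(va, len);
        }
    }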

Another approach is to throw hardware at the problem (I don't expect
this to ever be implemented).  This is really a cache-coherency issue,
with the TLBs acting as caches of the mapping translations: add snooping
capability to the TLBs, so that a demap on one CPU automatically
invalidates any stale entries held by the others.

A third approach (quite likely to be done) is to redefine the interface.
An fbufs-style or ICSC extended-sockets-style interface can place the
requirement on the programmer of the userland transmit code not to
tread on the outbound buffer once it has been offered to the kernel
for transmission.  This has great potential for new or reworked
applications, but obviously does not help legacy ones.
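
For what it's worth, the kind of contract I have in mind might look
something like the following.  The names are purely illustrative - not
the fbufs paper's API, nor the ICSC one:

    #include <stddef.h>
    #include <string.h>

    /* Illustrative buffer handle; the kernel would hand out
     * page-aligned, kernel-allocated buffers. */
    typedef struct fbuf fbuf_t;

    fbuf_t *fbuf_alloc(int sock, size_t len);  /* caller owns buffer  */
    void   *fbuf_data(fbuf_t *fb);             /* start of payload    */
    int     fbuf_send(int sock, fbuf_t *fb, size_t len);
                                               /* ownership passes to
                                                * the kernel; caller
                                                * must not touch fb
                                                * afterwards          */

    int
    example_tx(int sock, const void *msg, size_t len)
    {
        fbuf_t *fb = fbuf_alloc(sock, len);

        if (fb == NULL)
            return (-1);
        memcpy(fbuf_data(fb), msg, len);  /* fill while we own it */
        return (fbuf_send(sock, fb, len));
        /* No access to fb from here on, so the kernel never needs a
         * shootdown to protect the in-flight data from the sender. */
    }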



Another aspect of ZC should be mentioned up front, and this time
transmit is easy and receive is hard:  packet payloads are
inconveniently sized.

This doesn't worry transmit; hardware to slice-and-dice user buffers
into packet-sized pieces is easy, and so is scatter/gather to wrap
payloads with protocol headers.
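
By way of illustration, the descriptor list a driver hands to the NIC
for one such packet looks roughly like this (descriptor layout
invented; every real NIC has its own format):

    #include <stdint.h>

    /* Invented descriptor layout: one fragment per descriptor,
     * gathered by the NIC's DMA engine into a single frame. */
    typedef struct tx_desc {
        uint64_t td_paddr;   /* physical address of the fragment */
        uint32_t td_len;     /* fragment length in bytes         */
        uint32_t td_flags;   /* start/end-of-packet markers      */
    } tx_desc_t;

    #define TD_SOP  0x1u     /* first fragment of the frame */
    #define TD_EOP  0x2u     /* last fragment of the frame  */

    /* One MSS-sized packet: kernel-built headers, then an arbitrary
     * slice of the (pinned) user transmit buffer. */
    static void
    build_tx_packet(tx_desc_t d[2], uint64_t hdr_pa, uint32_t hdr_len,
        uint64_t payload_pa, uint32_t payload_len)
    {
        d[0] = (tx_desc_t){ hdr_pa,     hdr_len,     TD_SOP };
        d[1] = (tx_desc_t){ payload_pa, payload_len, TD_EOP };
    }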

On receive, however, the stack must deal with whatever comes off the
wire (apart from iSCSI, RDMA etc.), which will (a) require arbitrary
protocol headers to be removed, and (b) not be nicely page-aligned or
page-sized.
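
To see why, consider where the payload actually lands within a
received frame.  The numbers below are the usual minimum header sizes,
assuming no IP/TCP options, VLAN tags or tunnelling:

    #include <stddef.h>

    /* Typical minimum header sizes; options, VLAN tags, tunnels etc.
     * make the offset even less predictable. */
    enum { ETH_HDR = 14, IPV4_HDR = 20, TCP_HDR = 20 };

    /* Offset of the first payload byte within the DMA'd frame:
     * 54 bytes in the simplest case, i.e. nowhere near a page
     * boundary, and the payload length is whatever the sender's
     * MSS happened to be. */
    static size_t
    payload_offset(void)
    {
        return (ETH_HDR + IPV4_HDR + TCP_HDR);
    }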

Again, the probable attack on the issue requires a nontraditional
interface - which won't help legacy applications and will be harder
to program to.

The "shovel hardware" attack could be either full protocol offload
(but I dislike the maintenance and scaling implications)
or curiosities like arbitrary-byte-resolution MMU capability.

An intermediate stage towards protocol offload would be to dedicate
one (or more) cores of a Niagara-like CPU to network-stack work,
and tightly couple them to the NIC hardware.  Preferably in the
same chip, though that seems to have dropped off Sun's roadmap :-(
The software part of the stack would then be part of the Solaris
kernel, so my maintenance concerns are assuaged.


- Jeremy Harris
_______________________________________________
networking-discuss mailing list
[email protected]
