Rusty Russell wrote:
On Mon, 2007-04-09 at 10:10 +0300, Avi Kivity wrote:
Rusty Russell wrote:
        I'm a little puzzled by your response.  Hmm...

        lguest's userspace network frontend does exactly as many copies as
Ingo's in-host-kernel code.  One from the Guest, one to the Guest.
kvm pvnet is suboptimal now. The number of copies could be reduced by two (to zero), by constructing an skb that points to guest memory. Right now, this can only be done in-kernel.

Sorry, you lost me here.  You mean both input and output copies can be
eliminated?  Or are you talking about another two copies somewhere?

On the transmit path, current kvm pvnet has two copies:

1.  on the guest side, the driver copies the skb data into the shared ring
2. on the host side, the device copies the data from the ring into a newly allocated skb

Both of these copies can be eliminated with a host-side kernel. With current userspace interfaces, only one copy can be eliminated.
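
To illustrate the two transmit copies listed above, here is a toy userspace model in C. It is only a sketch: `ring_slot`, `guest_xmit`, `host_recv`, `host_map` and `RING_SLOT_SIZE` are hypothetical names invented for this example, not kvm pvnet code, and a plain malloc'ed buffer stands in for the skb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define RING_SLOT_SIZE 2048  /* hypothetical slot size, not taken from kvm pvnet */

/* Hypothetical shared-ring slot holding one packet. */
struct ring_slot {
    size_t len;
    unsigned char data[RING_SLOT_SIZE];
};

unsigned copies;  /* number of payload copies performed so far */

/* Copy 1: the guest driver copies the packet data into the shared ring. */
void guest_xmit(struct ring_slot *slot, const void *pkt, size_t len)
{
    assert(len <= RING_SLOT_SIZE);
    memcpy(slot->data, pkt, len);
    slot->len = len;
    copies++;
}

/* Copy 2: the host device copies the data out of the ring into a freshly
 * allocated buffer (standing in for a new skb). The caller frees it. */
unsigned char *host_recv(const struct ring_slot *slot)
{
    unsigned char *buf = malloc(slot->len);
    if (buf == NULL)
        return NULL;
    memcpy(buf, slot->data, slot->len);
    copies++;
    return buf;
}

/* Copyless alternative: the host refers to the guest's buffer directly,
 * the way an skb pointing at guest memory would. No bytes are moved. */
const unsigned char *host_map(const void *guest_pkt)
{
    return guest_pkt;
}
```

Calling `guest_xmit()` followed by `host_recv()` leaves `copies` at 2, while `host_map()` leaves it unchanged. The catch, which the model does not show, is that the guest must not reuse its buffer until the host says it is done with it.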

Similar logic applies to receive, except that one copy must remain.

But I don't get this "we can enhance the kernel but not userspace" vibe
8(

I've been waiting for network aio since ~2003. If it arrives in the next few days, I'm all for it; far more than just kvm could use it profitably. But I'm not going to write that interface myself.

Moreover, some things just don't lend themselves to a userspace abstraction. If we want to expose tso (tcp segmentation offload), we can easily do so with a kernel driver since the kernel interfaces are all tso aware. Tacking on tso awareness to tun/tap is doable, but at the very least weird.

With current userspace networking interfaces, one cannot build a network device that has less than one copy on transmit, because sendmsg() *must* copy the data (as there is no completion notification).
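
The copy semantics described above can be observed from userspace. The following sketch (the helper name `send_then_overwrite` is made up for this example, and it uses an AF_UNIX datagram socketpair rather than a real network device) sends a buffer with sendmsg(), overwrites the buffer as soon as the call returns, and then reads what the peer receives.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send `len` bytes of `buf` with sendmsg(), overwrite `buf` as soon as the
 * call returns, then receive on the peer socket into `out`.
 * Returns the number of bytes received, or -1 on error. */
ssize_t send_then_overwrite(char *buf, size_t len, char *out, size_t outlen)
{
    int sv[2];

    assert(buf != NULL && out != NULL);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    ssize_t n = sendmsg(sv[0], &msg, 0);
    if (n >= 0) {
        /* No completion event exists to wait for: once sendmsg() has
         * returned, the caller is free to reuse the buffer. */
        memset(buf, 'X', len);
        n = recv(sv[1], out, outlen, 0);
    }

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

The peer still sees the original bytes, because the kernel took its own copy before sendmsg() returned. A copyless send would need some way to tell the sender when the buffer may be reused.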

Why are you talking about sendmsg()?  Perhaps this is where we're
getting tangled up.

We're dealing with the tun/tap device here, not a socket.


Hmm. tun actually has aio_write implemented, but it seems synchronous. So does the read path.

If these are made truly asynchronous, and the write path is also made copyless, then we might have something workable. I still cringe at having a pagetable walk in order to deliver a 1500-byte packet.


sendfilev(), even if it existed, cannot be used: it is copyless, but lacks completion notification. It is useful only on unchanging data like read-only files.

Again, sendfile is a *much* harder problem than sending a single packet
once, which is the question here.

sendfile() is a *different* problem. It doesn't need completion because the data is assumed not to change under it.

Consider that the guest may be issuing a megabyte-sized sendfile() which is broken into 17 tso frames. We need to preserve the large structures as much as possible or we end up repeating the simple "single packet once" path 700 times.
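
The frame counts above follow from a ceiling division. This small C helper (the name `frames_needed` is made up here) reproduces them under two assumptions that are not stated in the message: a megabyte is 1048576 bytes, and a tso frame carries at most 65535 bytes, with header overhead ignored in both cases.

```c
#include <assert.h>
#include <stddef.h>

/* Number of frames needed to carry `total` bytes when each frame holds at
 * most `per_frame` bytes (ceiling division). */
size_t frames_needed(size_t total, size_t per_frame)
{
    assert(per_frame > 0);
    return (total + per_frame - 1) / per_frame;
}
```

With those assumptions, `frames_needed(1048576, 65535)` gives 17 and `frames_needed(1048576, 1500)` gives 700.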

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

