On Wed, Dec 14, 2016 at 2:46 PM, barret rhoden <[email protected]> wrote:

> On 2016-12-14 at 13:41 Dan Cross wrote:
> > I have some questions/concerns about this. In general, I think we'd be
> > better off supporting a virtual switch layer for layer two and then
> > having a traditional PNAT that's stateful for layer 3 (and above)
> > protocols. I think we can do this fairly simply by extending the
> > existing bridge device, building a NAT layer that fronts that and that
> > we bind over e.g. an ethernet interface, and then layering multiple IP
> > stacks on top of that. For that other layer, we could maybe take some
> > existing solution like PF and plug it into the virtual switch
> > environment.
>
> That seems more complicated than what I'm doing.  Multiple IP stacks?
> Where are these IP addresses coming from?  Keep in mind the scenario I
> outlined.


So consider QEMU's assignment of RFC1918 addresses to VMs. The actual
distribution mechanism is, I think, something like DHCP (or just a known
set of addresses that the QEMU user then divvies up as suits the scenario).
In some senses, it doesn't much matter: these things would be hidden behind
a NAT, and the overall abstraction would be that there is a switched, NAT'd
virtual network provided to the VMs. But the network the VM connects to
would (or at least *could*) be entirely local to the host. Maybe it would
look like an ethernet, or maybe like something simpler. As far as the
outside is concerned, there'd be one IP address it interacted with to talk
to the guest and host. I think we'd get what you want, but at layer 2
instead of 2/3/4.
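For concreteness, here's a small sketch of the layout I mean, using what I believe are QEMU's user-mode defaults (the 10.0.2.0/24 network, host/gateway at 10.0.2.2):

```python
import ipaddress

# QEMU user-mode networking defaults (as I understand them): an RFC1918
# /24 that exists only inside the emulator, invisible to the outside.
net = ipaddress.ip_network("10.0.2.0/24")
gateway = ipaddress.ip_address("10.0.2.2")       # what the guest routes to
first_guest = ipaddress.ip_address("10.0.2.15")  # first DHCP lease

assert net.is_private                # RFC1918: never routed externally
assert gateway in net and first_guest in net
```

The point being that nothing outside the host ever sees these addresses; only the NAT's external address is visible.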

As for the stacks.... The Plan 9 networking stuff we've inherited already
supports separate IP stacks (unless we stripped that out? I confess I
haven't looked), and the guest OS in a VM supplies its own stack. Note that
this doesn't change with the bypass mechanism: both Linux and Akaros have
separate stacks but the same IP address, and each thinks they "own" the IP
address that's assigned to them. How does bypass work for talking between
VMs on the same host? It *sounds* like we'd have to pop the frame out onto
the ethernet unless we intercepted it in the virtio layer, in which case
we've built half of a virtual layer 2 switch.
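To illustrate what I mean by "half of a virtual layer 2 switch," here's a toy MAC-learning loop (purely a sketch, no relation to any existing Akaros code): learn source addresses, forward known destinations, flood the rest.

```python
# Toy MAC-learning switch: the minimum needed to keep guest-to-guest
# frames on the host instead of popping them out onto the ethernet.
class L2Switch:
    def __init__(self, ports):
        self.ports = ports            # port id -> list of delivered frames
        self.table = {}               # learned MAC -> port id

    def input(self, in_port, src_mac, dst_mac, frame):
        self.table[src_mac] = in_port # learn where src lives
        out = self.table.get(dst_mac)
        if out is not None and out != in_port:
            self.ports[out].append(frame)        # known dest: one port
        elif out is None:                        # unknown dest / broadcast
            for p, q in self.ports.items():
                if p != in_port:
                    q.append(frame)              # flood everywhere else

sw = L2Switch({0: [], 1: [], 2: []})
sw.input(0, "aa:aa", "ff:ff", b"arp-who-has")   # broadcast: flooded
sw.input(1, "bb:bb", "aa:aa", b"arp-reply")     # learned dest: port 0 only
assert sw.ports[0] == [b"arp-reply"]
assert sw.ports[2] == [b"arp-who-has"]
```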

> > What's the immediate motivation for this work?
>
> Virtual machines.  We have no way of providing networking to our guests
> and our host at the same time.  I have the nasty devether hack that I
> run, but will not commit.  Qemu has a nice solution.  This work
> basically allows us to implement what they have.
>

Yes. I think what I'm suggesting is closer to the QEMU model on TUN/TAP. We
can possibly do the TUN/TAP style much more cleanly than it must be done
under Linux.

> I don't have a spare IP address that I can give to the guest.  In a
> world where that happens, we can pursue something like openvswitch, or
> even the setup we currently have.  Long term, I'd like our VMs to
> support both types of networking for the guest.


I'm not sure who has definitively said that you don't have a spare IP
address for the guest; there are any number of IPs in unroutable
private/shareable ranges that may be used for such things. If the network
is entirely internal to a single host, who is to say one cannot arbitrarily
use one? Sure, there has to be some local policy around this within an
*organization*, but that's a separate issue.

Further, I think that long-term the use case strictly implies a PNAT and
virtual switch for the guest.

> > This concerns me: not every protocol uses the port abstraction. For
> > example, how does one handle ICMP?
>
> The easiest thing is like qemu does: ignore it.  usermode networking in
> qemu doesn't really deal with ICMP.


Aha. I'm suggesting TUN/TAP instead of usermode networking.

> > Not every protocol is connection-oriented. I suspect it would not be
> > long until we found ourselves maintaining a fair amount of state akin
> > to simply building a full NAT, the scope of which is pretty big.
>
> I didn't go into details on this, since it's not the purpose of the
> kernel bypass, but the VMM / virtio-net NAT will maintain state and
> my current plan is to time out connections.


I see; in that case we're most of the way towards a virtual layer 2 switch
and a PNAT layer.
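For what it's worth, the per-flow state such a NAT keeps is small; a toy sketch of the mapping-with-idle-timeout idea mentioned above (all names hypothetical, not an existing Akaros API):

```python
import time

# Sketch of VMM-side NAT state: map each guest flow to an external port
# and reclaim the mapping once it has been idle too long.
IDLE_TIMEOUT = 300.0  # seconds before an idle mapping is reclaimed

class NatTable:
    def __init__(self):
        self.flows = {}  # (proto, guest_ip, guest_port) -> (ext_port, last_used)

    def outbound(self, proto, guest_ip, guest_port, now=None):
        now = time.monotonic() if now is None else now
        key = (proto, guest_ip, guest_port)
        if key not in self.flows:
            ext_port = 49152 + len(self.flows)  # toy allocator, not robust
            self.flows[key] = (ext_port, now)
        ext_port, _ = self.flows[key]
        self.flows[key] = (ext_port, now)       # refresh the idle timer
        return ext_port

    def expire(self, now):
        self.flows = {k: v for k, v in self.flows.items()
                      if now - v[1] < IDLE_TIMEOUT}

nat = NatTable()
p = nat.outbound("tcp", "10.0.2.15", 4500, now=0.0)
nat.expire(now=301.0)        # idle flow is reclaimed
assert p == 49152 and nat.flows == {}
```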

> > > If real-addr mode is a pain, then I'll scrap it, but I figured it'd
> > > help out with system administration.  And I think I can make it work
> > > with no real changes.  At first, arp seems like it might be hard, but I
> > > think the host side of virtio-net responds with itself as the target
> > > for any ARPs.
> > >
> > > The details don't matter much - suffice to say, userspace will need to
> > > reserve ports in a protocol and preferably access IP headers in
> > > packets.
> >
> > One advantage of the namespace style network stack is that it's
> > recursively stackable. One can implement something like this by
> > providing a filesystem that composes with the real /net and itself
> > presents a /net; the implication is that one could prototype entirely
> > in user space and not involve the kernel at all: much in the way that
> > the window system under Plan 9 can run itself recursively in a window.
>
> The discussion about 'real-addr' mode has little to do with the kernel
> interface.  I don't see how another /net helps here.  This is the
> interface to the virtual machine. It answers questions like "what IP
> does the guest think it has" and "what IP does the guest route packets
> to".  We'll see when I flesh it out a little more, but it might be as
> simple as "don't hardcode the guest to 10.0.2.2."


Right now, doesn't the host end of the virtio network work with a bridge
device that talks to /net? It sounds like yes. In that case, one can extend
the bridge to be a switch (after all, one may think of a bridge as a
point-to-point switch) and plug that into a /net that's actually a NAT
device that is itself implemented on top of a "real" /net. In other words,
this gives you something like TUN/TAP.

> > Scenario: I want to detect whether a guest running e.g. Linux is reachable
> > using 'ping': how does one accommodate that? Even if I were on the host
> > running the VM, I don't see how I could do that.
>
> You don't, just like qemu's user-mode networking.  Try pinging a guest
> on Qemu that isn't set up with tun/tap.


This doesn't feel like a viable long-term solution. I think usermode
networking is a bit of a hack because most systems don't support TUN/TAP.
However, I think our architecture allows us to do something analogous to
TUN/TAP rather more easily than it must be done on e.g. Linux (this is
where the stackability stuff comes into play). Consider: would the QEMU
authors have done user-mode networking if TUN/TAP had been universally
available? It's a speculative question, but I think it's relevant.

> > I could see this working for UDP and TCP, but going beyond requires
> > another mechanism. Would we be better off porting over a full NAT
> > implementation? I believe NATs have been written for Plan 9 before;
> > one may be floating around that we could leverage to this end. By
> > extending devbridge to be devswitch and marrying that to a NAT, I
> > think we'd end up with a really nice solution indeed, that would
> > neatly solve all of the related problems.
>
> Would it be workable for VMMs natting traffic for guests running on Plan
> 9?
>
> If there's something less complicated than what I'm doing and it
> actually works, then I'd be interested.  But people have been talking
> about this stuff for a while now and we haven't had anything concrete.


I think we should ask Jim/Charles if there's something out there that's
available. This does keep coming up, but honestly I haven't seen anyone
hurting for it.

> > Hmm; I'm not sure I understand what you mean when you say "The flag
> > will be one-way".
>
> I mean that a conversation, once in bypass mode, cannot be turned back
> into a regular conversation.  It'll actually be a state in ip.h (like
> Connected and Announced).  Note that technically conversations are
> never freed in #ip - they just get reused.  When the conv gets reused,
> it'll no longer be a Bypass.


Oh ok; I see now.

> > Am I understanding the proposal correctly in that `bypass` would only be
> > used on the input side, and on the output side the assumption would
> > be that the stack receives a fully-formed TCP segment from the
> > thing-you-bypassed-to, which would have already handled e.g. the
> > sequence-number issue? I think that's the case but want to confirm.
>
> Yes, but to be clear, bypass occurs on both sides.  It bypasses the
> *kernel's* protocol implementation.  The user gets the IP packet, fresh
> from the IP stack. It can then do what it wants, preferably
> participating in TCP.  It gives the kernel raw IP packets, after it
> already set up the TCP headers.
>
> It's as if the user can send and receive raw IP packets, but for a
> particular {protocol, port}.


Ah, makes sense now. Thanks.

> > For the case of migrating things from a guest to a native service,
> > what specifically is wrong with giving native and guest different IP
> > addresses
>
> I don't always have extra IP addresses to hand out.  Consider a case in
> a datacenter where the IP address of the node is fixed.  I'd like to
> run a VM.  What IP do I give it?


Something in an RFC1918 or RFC6598 range?
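Both ranges are easy to check mechanically; for example (RFC1918 private space plus the RFC6598 shared address space, 100.64.0.0/10):

```python
import ipaddress

# The ranges suggested above: addresses that are safe to use for a
# host-internal guest network behind a NAT.
rfc1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
rfc6598 = ipaddress.ip_network("100.64.0.0/10")  # shared/CGN space

guest = ipaddress.ip_address("10.0.2.15")
assert any(guest in n for n in rfc1918)          # private: fine behind NAT
assert ipaddress.ip_address("100.64.0.1") in rfc6598
```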

> > and introducing user-level proxies on the guest that
> > simply forward to the native side?
>
> That requires modifications to the guest.  I can avoid that by
> interposing at the VMM / virtio-net layer.  For right now, I just want
> to ssh into both Akaros and Linux.


If "running a service" on the guest is defined as modifying the guest, then
yes. But then we're already running services on the guest.

> > Splitting the port space for a given protocol so that we can share IP
> > addresses feels like a hack to me. This would get us some percentage
> > of the way to a PNAT, and at that point, I would have to ask why not
> > just implement a real PNAT? That seems to be the standard solution to
> > this problem.
>
> Ultimately these solutions split the protocol port space, so I don't
> see that as being a hack.  Someone has to do it somewhere in
> the codebase.  When the packet hits the networking stack, we have to
> decide something at some point.


The thing is, this crosses many layers of the stack, but in separate
places. The idea of a PNAT is that one does that (necessarily, as you point
out) but it's done in one place. That is, Linux (or whatever) and Akaros
don't both think they are IP address w.x.y.z; there's one intelligent agent
that handles the relevant mappings.

> > > One thing that wasn't obvious to me until I looked into it was that we
> > > don't need to involve the protocols at all.  Initially, I thought
> > > we'd need to check with TCP's data structures to make sure we get a
> > > free port or something.  However, #ip handles the ports for all of
> > > the protocols.  It might be called "IP", but it knows more than
> > > just IP headers.
> > >
> >
> > This does conflate things across the stack, though. Specifically, this
> > proposal would support sharing IP addresses between various
> > applications (to the extent that e.g. a VMMCP running Linux is just
> > an application as far as Akaros is concerned) by segmenting the port
> > space for a subset of protocols that layer on top of IP.
>
> I don't see this as a problem.  We already share IP addresses between
> various applications.  The webserver gets port 80, sshd gets 22.  What
> if I want to run my webserver in a Linux VM?  From Akaros's
> perspective, it's just a different TCP stack running.  It's broader
> than just VMs - any app that wants to run their own TCP stack can do
> so.
>

Except that Linux thinks it owns the entire stack; that's perhaps the best
summation of my concern in that area. It also feels like we're retreading a
well-worn path and going x% of the way towards implementing the canonical
solution. If that is the case, we perhaps ought to just fully implement the
canonical solution.

> As a side note, user-level TCP is something we talked about a long time
> ago - some people want to do this, but not being able to dedicate cores
> to it might lead to latency problems.


I vaguely remember those discussions. IP/ICMP/IGMP in the kernel but
UDP/TCP etc in the apps? In that case, the kernel would *have* to maintain
a mapping of quintuple to conversation so it would know where to deliver a
packet or segment. I'm not sure that changes much vis-a-vis this particular
idea on a VM, though (the difference being that e.g. Linux in a VM is also
doing IP but kinda-sorta not really).
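The mapping itself is simple enough; a sketch of the quintuple lookup the kernel would need, with a wildcard fallback for listeners (names illustrative, not real Akaros code):

```python
# Kernel-side demux for user-level TCP/UDP: map the five-tuple of an
# inbound packet to the conversation (and thus application) that owns it.
convs = {}  # (proto, local_ip, local_port, remote_ip, remote_port) -> owner

def register(proto, lip, lport, rip, rport, owner):
    convs[(proto, lip, lport, rip, rport)] = owner

def deliver(proto, lip, lport, rip, rport):
    # Exact match first (established conversation), then a wildcard
    # entry standing in for a listener.
    owner = convs.get((proto, lip, lport, rip, rport))
    if owner is None:
        owner = convs.get((proto, lip, lport, "*", 0))
    return owner

register("tcp", "192.0.2.1", 22, "*", 0, "user-sshd")
assert deliver("tcp", "192.0.2.1", 22, "198.51.100.7", 40000) == "user-sshd"
```

This is essentially the same decision a PNAT or the `bypass` flag has to make; the difference is only where in the stack it happens.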

> > If the idea is to do this really really quickly to scratch an
> > immediate use case, then that's one thing, but I don't know that I
> > understand what that use case is. On the other hand, I don't think
> > this is a viable long-term solution, and for just migrating things
> > one-by-one from a guest to a native application while retaining the
> > appearance of a single externally-facing IP address, then user-level
> > proxies give the same effect and don't modify the kernel at all.
>
> It's not just for moving services between VM and host.  It's so we can
> do any concurrent networking at all.
>
> Here's what we currently do:
> - I run with a nasty hack that sends all TCP port 23 traffic to the
>   guest, everything else to Akaros
>

(Side note: 23 is the telnet port. I find it funny that it's really running
SSH. :-))

> - Fergus turns off the host's stack.  Guest gets everything, host
>   nothing.
>

Rather, he just doesn't initialize it on that interface.

> - Not sure what Ron/Gan do.  Probably the same: no host networking
>
> I didn't view this as a quick and dirty hack.  I viewed it as a simple
> way to have kernel support for our guest's networking requirements,
> similar to qemu's user-mode networking (shared port space).  And we get
> potential user-mode TCP for free, at least on a port-by-port basis for
> apps that want to try it out.  (I have none in mind at this point).
>

The tie-in with usermode TCP is interesting. I wonder, however, if we can do
that now by having a user-mode program implement its own tcp/* filesystem
and bind it over /net/tcp. That would presumably open /ether/$I/data and
just read/write raw IP datagrams. I don't know what the kernel would do in
that case without a `bypass`, but it may be an interesting experiment.

> The added kernel support helps the VMM by not having to fake being the
> TCP endpoint.  If you grep qemu's code for TCP, there's a mountain of
> stuff that I don't want to get involved with.


That's why I want to side-step the entire issue by going to layer 2 and NAT.

It may be that `bypass` is necessary for e.g. usermode TCP and something
analogous to usermode networking for special applications, but I don't
think it's sufficient for the long-term.

        - Dan C.

-- 
You received this message because you are subscribed to the Google Groups 
"Akaros" group.