On 2016-12-14 at 13:41 Dan Cross wrote:
> I have some questions/concerns about this. In general, I think we'd be
> better off supporting a virtual switch layer for layer two and then having
> a traditional PNAT that's stateful for layer 3 (and above) protocols. I
> think we can do this fairly simply by extending the existing bridge device,
> building a NAT layer that fronts that and that we bind over e.g. an
> ethernet interface, and then layering multiple IP stacks on top of that.
> For that other layer, we could maybe take some existing solution like PF
> and plug it into the virtual switch environment.

That seems more complicated than what I'm doing.  Multiple IP stacks?
Where are these IP addresses coming from?  Keep in mind the scenario I
outlined.

> What's the immediate motivation for this work?

Virtual machines.  We have no way of providing networking to our guests
and our host at the same time.  I have the nasty devether hack that I
run, but will not commit.  Qemu has a nice solution.  This work
basically allows us to implement what they have.

I don't have a spare IP address that I can give to the guest.  In a
world where that happens, we can pursue something like openvswitch, or
even the setup we currently have.  Long term, I'd like our VMs to
support both types of networking for the guest.


> This concerns me: not every protocol uses the port abstraction. For
> example, how does one handle ICMP?

The easiest thing is to do what qemu does: ignore it.  User-mode
networking in qemu doesn't really deal with ICMP.


> Not every protocol is connection-oriented. I suspect it would not be long
> until we found ourselves maintaining a fair amount of state akin to simply
> building a full NAT, the scope of which is pretty big.

I didn't go into details on this, since it's not the purpose of the
kernel bypass, but the VMM / virtio-net NAT will maintain state and
my current plan is to time out connections.
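For a sense of what that state looks like, here's a minimal sketch of a NAT table entry with a last-seen timestamp and a timeout check (all names and the timeout value are hypothetical, not actual VMM code):

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Sketch of a NAT table entry kept at the VMM / virtio-net layer.
 * Names and the timeout value are illustrative only. */
#define NAT_TIMEOUT 300   /* seconds of silence before reaping */

struct nat_entry {
    uint16_t guest_port;
    uint16_t host_port;
    time_t   last_seen;   /* updated on every packet */
    int      in_use;
};

/* A periodic sweep would free entries that have gone quiet. */
static int nat_expired(const struct nat_entry *e, time_t now)
{
    return e->in_use && (now - e->last_seen) > NAT_TIMEOUT;
}
```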


> > If real-addr mode is a pain, then I'll scrap it, but I figured it'd
> > help out with system administration.  And I think I can make it work
> > with no real changes.  At first, arp seems like it might be hard, but I
> > think the host side of virtio-net responds with itself as the target
> > for any ARPs.
> >
> > The details don't matter much - suffice to say, userspace will need to
> > reserve ports in a protocol and preferably access IP headers in packets.
> >  
> 
> One advantage of the namespace style network stack is that it's recursively
> stackable. One can implement something like this by providing a filesystem
> that composes with the real /net and itself presents a /net; the
> implication is that one could prototype entirely in user space and not
> involve the kernel at all: much in the way that the window system under
> Plan 9 can run itself recursively in a window.

The discussion about 'real-addr' mode has little to do with the kernel
interface.  I don't see how another /net helps here.  This is the
interface to the virtual machine. It answers questions like "what IP
does the guest think it has" and "what IP does the guest route packets
to".  We'll see when I flesh it out a little more, but it might be as
simple as "don't hardcode the guest to 10.0.2.2."


> Scenario: I want to detect whether a guest running e.g. Linux is reachable
> using 'ping': how does one accommodate that? Even if I were on the host
> running the VM, I don't see how I could do that.

You don't, just like qemu's user-mode networking.  Try pinging a guest
on Qemu that isn't set up with tun/tap.


> I could see this working for UDP and TCP, but going beyond requires another
> mechanism. Would we be better off porting over a full NAT implementation? I
> believe NATs have been written for Plan 9 before; one may be floating
> around that we could leverage to this end. By extending devbridge to be
> devswitch and marrying that to a NAT, I think we'd end up with a really
> nice solution indeed, that would neatly solve all of the related problems.

Would it be workable for VMMs to NAT traffic for guests running on
Plan 9?

If there's something less complicated than what I'm doing and it
actually works, then I'd be interested.  But people have been talking
about this stuff for a while now and we haven't had anything concrete.


> Hmm; I'm not sure I understand what you mean when you say "The flag will be
> one-way".

I mean that a conversation, once in bypass mode, cannot be turned back
into a regular conversation.  It'll actually be a state in ip.h (like
Connected and Announced).  Note that technically conversations are
never freed in #ip - they just get reused.  When the conv gets reused,
it'll no longer be a Bypass.
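Roughly, a sketch of the state idea (names are mine, modeled on the existing states; not the actual ip.h code):

```c
#include <assert.h>

/* Hypothetical conversation states, modeled on the Connected and
 * Announced states in ip.h.  Names are illustrative, not actual
 * Akaros code. */
enum convstate { Idle, Connected, Announced, Bypass };

struct conv {
    enum convstate state;
};

/* Entering bypass is one-way: there is no call that leaves it. */
static void conv_set_bypass(struct conv *cv)
{
    cv->state = Bypass;
}

/* The only way out is when #ip recycles the conv for a brand-new
 * conversation; at that point it is no longer a Bypass. */
static void conv_reuse(struct conv *cv)
{
    cv->state = Idle;
}
```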


> Am I understanding the proposal correctly in that `bypass` would only be
> used on the input side, and on the output side the assumption would
> be that the stack receives a fully-formed TCP segment from the
> thing-you-bypassed-to, which would have already handled e.g. the
> sequence-number issue? I think that's the case but want to confirm.

Yes, but to be clear, bypass occurs on both sides.  It bypasses the
*kernel's* protocol implementation.  The user gets the IP packet fresh
from the IP stack and can then do what it wants, preferably
participating in TCP.  It hands the kernel raw IP packets with the TCP
headers already set up.

It's as if the user can send and receive raw IP packets, but for a
particular {protocol, port}.
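Concretely, the kernel would divert packets by matching a key like this (a sketch; struct and names are mine, not Akaros's):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bypass key: the conv diverts raw IP payloads for
 * exactly one {protocol, local port} pair. */
struct bypass_key {
    uint8_t  proto;   /* e.g. 6 for TCP, 17 for UDP */
    uint16_t lport;   /* local port, host byte order */
};

/* Does an inbound packet go to the bypass conv instead of the
 * kernel's protocol layer? */
static int bypass_match(const struct bypass_key *k,
                        uint8_t pkt_proto, uint16_t pkt_dport)
{
    return k->proto == pkt_proto && k->lport == pkt_dport;
}
```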


> For the case of migrating things from a guest to a native service,
> what specifically is wrong with giving native and guest different IP
> addresses

I don't always have extra IP addresses to hand out.  Consider a case in
a datacenter where the IP address of the node is fixed.  I'd like to
run a VM.  What IP do I give it? 


> and introducing user-level proxies on the guest that
> simply forward to the native side? 

That requires modifications to the guest.  I can avoid that by
interposing at the VMM / virtio-net layer.  For right now, I just want
to ssh into both Akaros and Linux.


> Splitting the port space for a given protocol so that we can share IP
> addresses feels like a hack to me. This would get us some percentage
> of the way to a PNAT, and at that point, I would have to ask why not
> just implement a real PNAT? That seems to be the standard solution to
> this problem.

Ultimately these solutions all split the protocol port space, so I
don't see that as a hack.  Someone has to do it somewhere in the
codebase: when a packet hits the networking stack, something has to
decide where it goes.
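To illustrate where the split happens, here's a toy version of port reservation (entirely hypothetical structure; #ip's real data structures differ):

```c
#include <assert.h>
#include <stdint.h>

/* Toy per-protocol port table.  Each port is Free, owned by the
 * kernel's stack, or reserved by a user-level stack (e.g. a VMM). */
enum port_owner { Free, Kernel, User };

static enum port_owner tcp_ports[65536];

/* The port-space split happens here, at reservation time: a port
 * has exactly one owner, so demux is unambiguous. */
static int reserve_port(uint16_t port, enum port_owner who)
{
    if (tcp_ports[port] != Free)
        return -1;    /* taken; kernel and user can't both have it */
    tcp_ports[port] = who;
    return 0;
}
```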


> > One thing that wasn't obvious to me until I looked into it was that we
> > don't need to involve the protocols at all.  Initially, I thought
> > we'd need to check with TCP's data structures to make sure we get a
> > free port or something.  However, #ip handles the ports for all of
> > the protocols.  It might be called "IP", but it knows more than
> > just IP headers.
> >  
> 
> This does conflate things across the stack, though. Specifically, this
> proposal would support sharing IP addresses between various
> applications (to the extent that e.g. a VMMCP running Linux is just
> an application as far as Akaros is concerned) by segmenting the port
> space for a subset of protocols that layer on top of IP.

I don't see this as a problem.  We already share IP addresses between
various applications.  The webserver gets port 80, sshd gets 22.  What
if I want to run my webserver in a Linux VM?  From Akaros's
perspective, it's just a different TCP stack running.  It's broader
than just VMs - any app that wants to run their own TCP stack can do
so.  

As a side note, user-level TCP is something we talked about a long time
ago - some people want to do this, but not being able to dedicate cores
to it might lead to latency problems.


> If the idea is to do this really really quickly to scratch an
> immediate use case, then that's one thing, but I don't know that I
> understand what that use case is. On the other hand, I don't think
> this is a viable long-term solution, and for just migrating things
> one-by-one from a guest to a native application while retaining the
> appearance of a single externally-facing IP address, then user-level
> proxies give the same effect and don't modify the kernel at all.

It's not just for moving services between VM and host.  It's so we can
do any concurrent networking at all.  

Here's what we currently do:
- I run with a nasty hack that sends all TCP port 23 traffic to the
  guest, everything else to Akaros
- Fergus turns off the host's stack.  Guest gets everything, host
  nothing.
- Not sure what Ron/Gan do.  Probably the same: no host networking

I didn't view this as a quick and dirty hack.  I viewed it as a simple
way to have kernel support for our guest's networking requirements,
similar to qemu's user-mode networking (shared port space).  And we get
potential user-mode TCP for free, at least on a port-by-port basis for
apps that want to try it out.  (I have none in mind at this point).

The added kernel support spares the VMM from having to fake being the
TCP endpoint.  If you grep qemu's code for TCP, there's a mountain of
stuff that I don't want to get involved with.

Barret

-- 
You received this message because you are subscribed to the Google Groups 
"Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.
