On 2016-12-14 at 15:43 Dan Cross <[email protected]> wrote:
> to the guest and host. I think we'd get what you want, but at layer 2
> instead of 2/3/4.

Except we need stuff at the higher layers too.  Two packets show up at
Akaros, one for port 22 and another for 23.  We also need to handle the
case where the guest sends out an unsolicited packet, say to a
webserver on port 80.  We need to be able to trace how those packets
and the responses flow through various parts of the stack.  Also, how
many processes are involved here, and how does it hook up to the guest? 

Somewhere, someone is reserving port 23 or whatever at layer 4 and
receiving on it.  It needs to know how to send that traffic to the
guest.  It also needs to know how to catch any packet that comes from
the guest and route it out through the host's IP stack in such a way
that it receives the return traffic for that TCP connection.

If there is Plan 9 software that does that and can easily be ported to
Akaros, then great.  Otherwise, I don't want to get involved in writing
NAT software that mucks with TCP or whatnot.  Maybe it's actually easy
and I'm missing something.  Tracing the flow of packets and which code
is responsible for what will clear that up.  

Incidentally, the bypass solution is mostly layer 3.  The real work at
layer 4 is done in the guest/app and in the remote machine.

> As for the stacks.... The Plan 9 networking stuff we've inherited already
> supports separate IP stacks (unless we stripped that out? I confess I
> haven't looked)

I've looked; it's there.  Though there are two things in this area:
multiple NICs per IP stack with different IPs (which we've done before,
where the IP stack knows how to route them), and actual separate
instances of #ip (differentiated by the spec).  No one has tried that
on Akaros. 

> How does bypass work for talking between
> VMs on the same host? It *sounds* like we'd have to pop the frame out onto
> the ethernet unless we intercepted it in the virtio layer, in which case
> we've built half of a virtual layer 2 switch.

That's not the use case here (just as it isn't for qemu's usermode
networking), but you could do it.

If the VMM was in 'real-addr' networking mode, then Guest A would need
to connect to ROUTER_IP port X (where X is forwarded to Guest B).  The
IP routing in the kernel will handle it.  It'd be just like a guest
connecting to a host service (which is a use case).

If we wanted something fancier, we could run a DHCP server or whatever
in the VMM.  I don't think we need that.

> Yes. I think what I'm suggesting is closer to the QEMU model on TUN/TAP. We
> can possibly do the TUN/TAP-style much more cleanly than it must be done
> under Linux.

Linux has a few options for guest networking, one of which is user-mode
networking.  I don't see why we can't have multiple ones too.

> I'm not sure who has definitively said that you don't have a spare IP
> address for the guest; there are any number of IPs in un-routable
> private/sharable ranges that may be used for such things; if entirely
> internal to a single host then who is to say one cannot arbitrarily use
> one? Sure, there has got to be some local policy around this within an
> *organization*, but that's a separate issue.

We're conflating two different things.  Yes, if a network is
virtualized, then you can do whatever you want.  That's what qemu does
with 10.0.2.2.  That's also what I'd do with the 'real-addr' mode.  It's
all virtual, so I can tell the guest it has any addr I want - including
the host addr!

The comment about not having free IPs is for the *other* virtualization
scenario: where guests have globally visible and routable IPs.  This is
where something like OpenVSwitch comes in.  That is a topic for another
time, and is not the use case I need to support right now.

Last week I mentioned three things we need in the area of virtual
networking:
1) externally visible IPs for guests (maybe openvswitch)
2) any form of concurrent networking for the guest and host
3) a way to communicate between the host and guest.

In the Linux world, you can do qemu's user-mode networking or the
tun/tap for 2 and 3.  Same with Akaros.  My thought was that bypass is
relatively simple, much like user-mode networking.  A tun/tap style
approach seems more complicated and might require software that we
don't have yet.

I also have a legitimate need to have the guest think it has the real IP
address of the node.  That's what Fergus et al are doing with the
virtio-block pass-through.

> Right now, doesn't the host-end of the virtio network work with a bridge
> device that talks to /net? Sounds like yes.

No.  There is no bridge device either.  Check out virtio-net's code for
details.  

Basically, it just acts like snoopy and wiretaps the ethernet
connection.  That's the -n0 flag you pass to vmrunkernel: which NIC to
tap.  It gets all traffic, and it blasts traffic back out.  There's no
multiplexing of protocols or anything.

That's why I have my "Nasty Devether Hack" patch on the tip of my vmm
branch (which is not merged anywhere, but which is what I tell people
to run if they want to test things).

> > You don't, just like qemu's user-mode networking.  Try pinging a guest
> > on Qemu that isn't set up with tun/tap.
> 
> This doesn't feel like a viable long-term solution. I think usermode
> networking is a bit of a hack because most systems don't support TAP/TUN.

It's pretty good - we've been using it for a while with Akaros
development.  One machine I have uses TUN/TAP with bridges, the other
usermode networking.  Both are fine.  I'd be fine with either approach
on Akaros, but I prefer one that is simple and can actually work. 

> Consider: would the QEMU authors have done user-mode if TUN/TAP had been 
> universally available?
> Sure, it's a speculative question but I think relevant.

Maybe.  Usermode qemu comes with all the 10.0.2.2 and 10.0.2.15 stuff
built in.  For tun/tap, you probably have to do more.  At least I did:
brctl, tunctl, ifconfig br0, run a dhcp/dns server, etc.  But it's
been years since I set it up.  Actually, you can see a lot of the
stuff I do.  Check out scripts/kvm-up.sh and the instructions in
GETTING_STARTED.
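For reference, a hypothetical sketch of what that setup usually looks like on
Linux (the device names br0/tap0/eth0 are illustrative; the real commands are
in scripts/kvm-up.sh):

```shell
brctl addbr br0              # create a software bridge
brctl addif br0 eth0         # attach the physical NIC to it
tunctl -u $USER -t tap0      # create a persistent tap device for qemu
brctl addif br0 tap0         # attach the tap to the bridge
ifconfig eth0 0.0.0.0 up     # the NIC just carries frames; no IP here
ifconfig br0 up
dhclient br0                 # the host's IP now lives on the bridge
```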

> I think we should ask Jim/Charles if there's something out there that's
> available. 

That'd be great if it existed.

> This does keep coming up, but honestly I haven't seen anyone
> hurting for it.

I've been bringing this up for months, off-list in discussions, and as
recently as last week.  Fergus uses shitty, lossy serial connections in
lieu of ssh on some machines.  I have a devether hack that filters
TCP port 23 and sends it to the guest.  It's getting ridiculous.  

> The thing is, this crosses many layers of the stack, but in separate
> places. The idea of a PNAT is that one does that (necessarily, as you point
> out) but it's done in one place. That is, Linux (or whatever) and Akaros
> don't both think they are IP address w.x.y.z; there's one intelligent agent
> that handles the relevant mappings.

It's actually not that bad.  Layer 2 is hardly involved at all.  It's
emulated in the VMM.  Layer 3 (IP) is done by Akaros.  Layer 4
(TCP/UDP) by the VM itself.  Note that the VM is always doing Layer 4
stuff, and that the bypass solution does almost nothing in Layer 4 -
it's about bypassing layer 4.  

Linux and Akaros are *not* fighting for the same IP address /
protocol / port / anything.  The bypass is a tool to implement a NAT,
one that doesn't even get involved in Layer 4.  Linux *thinks* it has a
particular IP (10.0.2.15 in qemu style, or HOST_IP in real-addr
style).  But the only code that ever sees that IP address is the NAT
module in virtio-net, which knows to rewrite it (if necessary).
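To make the "rewrite it" part concrete, here's a hypothetical sketch (plain
Python, not the actual virtio-net code) of swapping the source address in an
IPv4 header and refreshing the header checksum.  Note it only touches layer
3; a NAT that mucked with TCP would also have to fix the TCP pseudo-header
checksum, which is exactly the kind of work I want to avoid.

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit words; caller zeroes the checksum
    field first.  Summing a valid header (checksum included) yields 0."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def rewrite_src_ip(packet: bytes, new_src: bytes) -> bytes:
    """Replace the IPv4 source address (header bytes 12-15) and recompute
    the header checksum (bytes 10-11).  The payload is untouched."""
    ihl = (packet[0] & 0x0F) * 4          # header length in bytes
    hdr = bytearray(packet[:ihl])
    hdr[12:16] = new_src
    hdr[10:12] = b"\x00\x00"              # zero the old checksum
    hdr[10:12] = struct.pack("!H", ipv4_checksum(bytes(hdr)))
    return bytes(hdr) + packet[ihl:]
```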

> Except that Linux thinks it owns the entire stack; that's perhaps the best
> summation of my concern in that area.

It thinks it does, but it doesn't.  It owns its own stack, and it only
listens on the ports that the VMM set up (e.g. 23).

> The tie-in with usermode TCP is interesting. I wonder, however if we can do
> that now by having a user-mode program implement its own tcp/* filesystem
> and bind it over /net/tcp. That would presumably open /ether/$I/data and
> just read/write raw IP datagrams. I don't know what the kernel would do in
> that case without a `bypass` but it may be an interesting experiment.

You'd need an IP address.  If you used the host IP address, then you'd
conflict with the native stack.  Both stacks would get resets when they
hear responses to traffic from the *other* stack.  That's what we have
now, minus the fake 9p server at /net/tcp.  That's why I have the
"Nasty Devether Hack" patch.  The namespace stuff would just trick a
process to use its TCP stack, which virtio-net does for the guest.
(Both the namespace and virtio-net are layers of interposition).

Barret

-- 
You received this message because you are subscribed to the Google Groups 
"Akaros" group.