On Tue, Dec 13, 2016 at 6:36 PM, Barret Rhoden <[email protected]> wrote:
> Hi -
>
> Here's what I'm thinking for the #ip 'bypass' command, which can be
> applied to conversations. Feel free to forward to Jim / Charles /
> etc. (#ip is the old #I (devip.c)).
>
> This can be used by both virtio-net as well as user-level TCP
> implementations (on a port-by-port basis, where the app does its own
> TCP).
>
I have some questions/concerns about this. In general, I think we'd be
better off supporting a virtual switch at layer two and then having a
traditional, stateful PNAT for layer-3 (and above) protocols. I think we
can do this fairly simply by extending the existing bridge device,
building a NAT layer that fronts it and that we bind over e.g. an
Ethernet interface, and then layering multiple IP stacks on top of that.
For the NAT layer, we could perhaps take an existing solution like PF
and plug it into the virtual switch environment.
> Scenario: (feel free to skip, but this is the motivation). I want to
> build something for our virtual machines where we can support the
> guest's networking, with NAT, similar to what qemu does.
What's the immediate motivation for this work?
> We'll
> eventually need something else for the non-NAT, every-VM-gets-an-IP
> that openvswitch uses.
> - we can reserve some number of ports (for given protocols) on the
> host. inbound connections get forwarded to the guest, with suitable
> IP addr/port rewriting.
>
This concerns me: not every protocol uses the port abstraction. For
example, how does one handle ICMP?
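To make the ICMP point concrete: stateful NATs translate ICMP echo by treating the query identifier as a pseudo-port, which is exactly the kind of state a port-keyed bypass has no slot for. A toy illustration (the header layout is from RFC 792; the function and its name are mine, not from any proposed code):

```c
/*
 * Stateful NATs translate ICMP echo request/reply by treating the
 * 16-bit query identifier (bytes 4-5 of the ICMP header) as a
 * pseudo-port.  A mechanism keyed purely on TCP/UDP ports has nowhere
 * to hang this state.
 */
static int icmp_echo_id(const unsigned char *icmp)
{
	if (icmp[0] != 8 && icmp[0] != 0)  /* type: echo request (8) or reply (0) */
		return -1;                 /* not an echo message; no pseudo-port */
	return icmp[4] << 8 | icmp[5];     /* identifier, big-endian */
}
```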
> - when the guest initiates connections, we create a conversation and
> rewrite the IP header for using the new port.
>
Not every protocol is connection-oriented. I suspect it would not be long
until we found ourselves maintaining a fair amount of state akin to simply
building a full NAT, the scope of which is pretty big.
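For the cases that do fit the model, the rewrite itself is mostly bookkeeping: patch the port and fix the checksum incrementally rather than re-summing the packet (RFC 1624). A sketch, with names of my own invention:

```c
#include <stdint.h>

/*
 * Incremental one's-complement checksum fix-up (RFC 1624): when a NAT
 * rewrites one 16-bit field (e.g. a TCP/UDP port) from oldval to
 * newval, the checksum can be patched without re-summing the packet.
 */
static uint16_t cksum_adjust(uint16_t sum, uint16_t oldval, uint16_t newval)
{
	uint32_t s = (uint16_t)~sum;     /* undo the final complement */
	s += (uint16_t)~oldval;          /* subtract old value (one's complement) */
	s += newval;                     /* add new value */
	s = (s & 0xffff) + (s >> 16);    /* fold carries back in */
	s = (s & 0xffff) + (s >> 16);
	return ~s;
}

/* Reference checksum over 16-bit words, for checking the fix-up. */
static uint16_t cksum(const uint16_t *p, int nwords)
{
	uint32_t s = 0;
	while (nwords-- > 0)
		s += *p++;
	s = (s & 0xffff) + (s >> 16);
	s = (s & 0xffff) + (s >> 16);
	return ~s;
}
```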
> Overall, for a given protocol, host port X == guest port Y. For IP
> addrs, I'll support both qemu style and a "real addr" style. Qemu
> style is 10.0.2.15 Guest_IP, 10.0.2.2 Router_IP (which doubles as the
> host).
>
> "real addr" style is where guest_IP == the host's IP, and the Router_IP
> is the host's router IP. 'real addr' mode is used when the guest wants
> to config itself as if it was the host. We use this when we do our
> 'pass-through' virtual machines, where the VM gets the real disk that
> Linux would otherwise use, and that disk has scripts that set a static
> IP.
>
> If real-addr mode is a pain, then I'll scrap it, but I figured it'd
> help out with system administration. And I think I can make it work
> with no real changes. At first, arp seems like it might be hard, but I
> think the host side of virtio-net responds with itself as the target
> for any ARPs.
>
> The details don't matter much - suffice to say, userspace will need to
> reserve ports in a protocol and preferably access IP headers in packets.
>
One advantage of the namespace style network stack is that it's recursively
stackable. One can implement something like this by providing a filesystem
that composes with the real /net and itself presents a /net; the
implication is that one could prototype entirely in user space and not
involve the kernel at all: much in the way that the window system under
Plan 9 can run itself recursively in a window.
> Anyways, the kernel change to support this is a protocol bypass
> mechanism that happens on a port-by-port basis. The kernel handles the
> IP layer still. The actual protocol layer in #ip is ignored, and the
> user communicates directly with #ip, just like a protocol would.
>
Scenario: I want to detect whether a guest running e.g. Linux is reachable
using 'ping': how does one accommodate that? Even if I were on the host
running the VM, I don't see how I could do that.
I could see this working for UDP and TCP, but going beyond that requires
another mechanism. Would we be better off porting over a full NAT
implementation? I believe NATs have been written for Plan 9 before; one
may be floating around that we could leverage. By extending devbridge
into a devswitch and marrying that to a NAT, I think we'd end up with a
really nice solution, one that neatly solves all of the related problems.
> As a review, packets come into the networking stack in a function like
> ipiput{4,6}. After defrag and other routing stuff, it eventually gets
> handed off to a proto's receive method (e.g. tcpiput). The packet gets
> passed with the IP header intact. Eventually, the protocol strips all
> headers and dumps the data in conv->rq.
>
> For outgoing packets, when data gets dropped in the conv->wq, the proto
> usually has a 'kick' method, which does the protocol magic (e.g.
> udpkick, which puts a UDP header layer on it) and then pushes the
> packets to ipoput{4,6}. ipoput figures out the route/interface to send
> the packet, maybe fragments the packet, deals with xsum, then blasts the
> packet out the medium. At that point, it's in the ethermedium, which
> sorts out the arp address and all that.
>
>
> Here's where the changes come in. The "bypass" command will be just
> like "bind", which mostly just sets the local port (either tries to
> use a specific port and fails if it is in use, or it finds a random
> available port). bypass will additionally set a flag in the conv
> marking it as a protocol-bypasser. The flag will be one-way.
>
> When #ip is about to call a proto rcv method, if the flag is set, it'll
> drop the packet, IP headers and all, into the Qdata rq. When the user
> writes a packet to the conv's Qdata, if the flag is set, we'll just
> pass the blob, expecting it to have IP headers, directly to ipoput
> (which will inspect it to see whether it's v4 or v6).
>
> Side note: we'll have to verify that the kernel doesn't
> particularly trust any part of the IP header that the user gets
> (for instance, it looks like it sets the fragmentation bits on
> its own).
>
> Side note 2: the block data will be in the form of the actual
> IP packet. the IP addr will *not* be in the plan 9 struct IP
> format. (where v6 and v4 take the same space).
>
> From the user's perspective, for a given conversation, it reads and
> writes IP packets, just like they are being routed. For virtio-net, it
> can do whatever rewrite of addr/port it wants when it talks to the
> guest. For user-level TCP, it can run whatever algorithm it wants.
> The bypass interface puts the user in the same spot of the #ip stack
> where the protocol would take over.
>
Hmm; I'm not sure I understand what you mean when you say "The flag will be
one-way". Specifically, how does this work vis-a-vis TCP having to
acknowledge a sequence number on the next packet sent to the distant end?
Am I understanding the proposal correctly in that `bypass` would only be
used on the input side, and on the output side the assumption would be that
the stack receives a fully-formed TCP segment from the
thing-you-bypassed-to, which would have already handled e.g. the
sequence-number issue? I think that's the case but want to confirm.
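If I'm reading the proposal right, the dispatch change is roughly the following shape. This is a toy model for discussion only: `Conv`, the queue, and the handler signature are simplified stand-ins, not Akaros's actual devip.c types.

```c
#include <string.h>

enum { BYPASS = 1 };

typedef struct Conv Conv;
struct Conv {
	int flags;              /* BYPASS set by the 'bypass' ctl command */
	unsigned char rq[64];   /* stand-in for the Qdata read queue */
	int rqlen;
	void (*protorcv)(Conv*, unsigned char*, int);  /* e.g. tcpiput */
};

/* Stand-in for a protocol rcv method: the normal path strips the
 * headers before queueing data for the user. */
static void tcprcv(Conv *c, unsigned char *pkt, int n)
{
	int hdr = 20 + 20;  /* IPv4 + TCP headers, no options, for the toy */
	memcpy(c->rq, pkt + hdr, n - hdr);
	c->rqlen = n - hdr;
}

/* Sketch of the proposed change to ipiput{4,6}: bypassed conversations
 * get the packet verbatim, IP headers and all. */
static void ipdeliver(Conv *c, unsigned char *pkt, int n)
{
	if (c->flags & BYPASS) {
		memcpy(c->rq, pkt, n);
		c->rqlen = n;
	} else {
		c->protorcv(c, pkt, n);
	}
}
```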
> The trick is that the bypass happens on a port-by-port basis. The
> protocol stack (e.g. TCP) is only bypassed for specific conversations,
> and those conversations have specific ports (the port that we passed to
> "bind" earlier, or -1 which gave us any old port). So from a remote
> machine's perspective, there is no difference between a normal TCP port
> and a bypass port. Among other things, this allows us to move services
> between the host and guest. Much like qemu's port forwarding. But we
> still get the benefits of Akaros's IP stack, such as checksum offload
> and whatever else. Plus, we want to use the host's routing tables -
> not the guest's.
>
For the case of migrating things from a guest to a native service, what
specifically is wrong with giving native and guest different IP addresses,
and introducing user-level proxies on the guest that simply forward to the
native side? Barring that, a PNAT would accomplish the same thing, but I
argue more cleanly.
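The proxy idea is also cheap to build: the data path is just a copy loop between two file descriptors, one per direction per connection. A sketch (error handling trimmed, names mine):

```c
#include <unistd.h>

/*
 * The data path of a user-level port proxy: shuttle bytes from one
 * file descriptor to another until EOF.  A real proxy would run one of
 * these per direction per connection; this is just the core loop.
 */
static long proxy_splice(int from, int to)
{
	char buf[4096];
	long total = 0;
	ssize_t n;

	while ((n = read(from, buf, sizeof buf)) > 0) {
		ssize_t off = 0;
		while (off < n) {  /* handle short writes */
			ssize_t w = write(to, buf + off, n - off);
			if (w < 0)
				return -1;
			off += w;
		}
		total += n;
	}
	return n < 0 ? -1 : total;
}
```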
> This approach differs from the nasty hacks we've been doing recently
> where the VMM and virtio-net directly hack into #ether, sending all
> packets, unfiltered, to the NIC. By doing it at the #ip layer, we can
> split the port-space of a protocol.
>
Splitting the port space for a given protocol so that we can share IP
addresses feels like a hack to me. This would get us some percentage of the
way to a PNAT, and at that point, I would have to ask why not just
implement a real PNAT? That seems to be the standard solution to this
problem.
> One thing that wasn't obvious to me until I looked into it was that we
> don't need to involve the protocols at all. Initially, I thought we'd
> need to check with TCP's data structures to make sure we get a free
> port or something. However, #ip handles the ports for all of the
> protocols. It might be called "IP", but it knows more than just IP
> headers.
>
This does conflate things across the stack, though. Specifically, this
proposal would support sharing IP addresses between various applications
(to the extent that e.g. a VMMCP running Linux is just an application as
far as Akaros is concerned) by segmenting the port space for a subset of
the protocols that layer on top of IP.
> Anyway, the kernel change is pretty small, and I can probably do it in
> less time than it took to write this email. =) The virtio-net stuff
> will take a bit longer. One little thing is that reading from Qdata is
> a stream, so the user will need to figure out IP packet boundaries.
> It'd require some more invasive changes to get around that (actually
> not that bad - need a syscall for block read/write. maybe I'll do
> that too.).
>
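On the boundary question: delimiting packets in the stream is doable without new syscalls, since both IP versions carry a length in a fixed-size prefix. A sketch (assumes the buffer is positioned at a packet head; the name is mine):

```c
/*
 * Length of the IP packet whose first bytes are at b (at least 6 bytes
 * must be available).  With Qdata as a byte stream, the reader must
 * reconstruct packet boundaries; both IP versions make this possible
 * from the first few bytes.  Returns -1 on an unrecognized version.
 */
static int ippktlen(const unsigned char *b)
{
	switch (b[0] >> 4) {
	case 4:  /* IPv4: total length (header included) in bytes 2-3 */
		return b[2] << 8 | b[3];
	case 6:  /* IPv6: fixed 40-byte header + payload length in bytes 4-5 */
		return 40 + (b[4] << 8 | b[5]);
	default:
		return -1;
	}
}
```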
If the idea is to do this really quickly to scratch an immediate
use case, then that's one thing, but I don't know that I understand what
that use case is. On the other hand, I don't think this is a viable
long-term solution, and for just migrating things one-by-one from a guest
to a native application while retaining the appearance of a single
externally-facing IP address, then user-level proxies give the same effect
and don't modify the kernel at all.
> Oh, one other thing - if the guest sends virtio-net fragmented packets,
> I think the VMM might need to reassemble them. Otherwise, they'll get
> sent as-is to the remote machine. (which could also be a local port.
> that's how I picture having the guest access local services will work
> - just like someone tried to connect to HOST_IP:PORT).
>
This is something that existing NAT solutions already know how to deal
with. E.g., if we ported PF to Akaros and layered it on top of the existing
/net stack we'd get this for free. That said, porting a system as
complicated as PF would be non-trivial and certainly more work than this
proposal. But I think that should serve as a warning that the problem
domain is very complex. I think we'd find ourselves in a situation of
wanting to do more complicated things very quickly.
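For reference, whoever ends up reassembling (the VMM or a NAT), detecting that a v4 packet is a fragment at all is cheap, since the flags and offset share one 16-bit field:

```c
/*
 * An IPv4 packet is a fragment if the More Fragments flag is set or
 * the fragment offset is nonzero; both live in bytes 6-7 of the
 * header (3 flag bits, then a 13-bit offset in 8-byte units).
 */
static int ip4_is_frag(const unsigned char *h)
{
	int fragword = h[6] << 8 | h[7];
	return (fragword & 0x2000) ||  /* MF flag: more fragments follow */
	       (fragword & 0x1fff);    /* nonzero offset: not the first piece */
}
```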
> Let me know if you spot any issues.
- Dan C.
--
You received this message because you are subscribed to the Google Groups
"Akaros" group.