Hi -
Here's what I'm thinking for the #ip 'bypass' command, which can be
applied to conversations. Feel free to forward to Jim / Charles /
etc. (#ip is the old #I (devip.c)).
This can be used both by virtio-net and by user-level TCP
implementations (on a port-by-port basis, where the app does its own
TCP).
Scenario: (feel free to skip, but this is the motivation). I want to
build something for our virtual machines where we can support the
guest's networking, with NAT, similar to what qemu does. We'll
eventually need something else for the non-NAT, every-VM-gets-an-IP
setup that openvswitch uses.
- we can reserve some number of ports (for given protocols) on the
host. inbound connections get forwarded to the guest, with suitable
IP addr/port rewriting.
- when the guest initiates connections, we create a conversation and
rewrite the IP header to use the new port.
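The rewrite in the second bullet looks roughly like this. This is a
user-space sketch, not actual #ip code: `nat_rewrite_dst` and `ip_cksum`
are hypothetical names, and it assumes IPv4 with a TCP/UDP payload. The
transport checksum covers a pseudo-header with the IP addrs, so it needs
a matching fixup too (omitted here, or left to checksum offload).

```c
#include <stdint.h>
#include <stddef.h>

/* Recompute the 16-bit one's-complement checksum over an IPv4 header.
 * Hypothetical helper; real stacks often update checksums incrementally. */
static uint16_t ip_cksum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i + 1 < len; i += 2)
		sum += (hdr[i] << 8) | hdr[i + 1];
	if (len & 1)
		sum += hdr[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ~sum & 0xffff;
}

/* Rewrite the destination IP and TCP/UDP port of an inbound packet,
 * e.g. HOST_IP:X -> GUEST_IP:Y, then patch the IP header checksum. */
void nat_rewrite_dst(uint8_t *pkt, uint32_t new_ip, uint16_t new_port)
{
	size_t ihl = (pkt[0] & 0xf) * 4;	/* IP header length */

	/* bytes 16..19: destination address */
	pkt[16] = new_ip >> 24;
	pkt[17] = new_ip >> 16;
	pkt[18] = new_ip >> 8;
	pkt[19] = new_ip;
	/* transport dest port: bytes 2..3 past the IP header */
	pkt[ihl + 2] = new_port >> 8;
	pkt[ihl + 3] = new_port;
	/* zero and recompute the IP header checksum (bytes 10..11) */
	pkt[10] = pkt[11] = 0;
	uint16_t ck = ip_cksum(pkt, ihl);
	pkt[10] = ck >> 8;
	pkt[11] = ck;
}
```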
Overall, for a given protocol, host port X == guest port Y. For IP
addrs, I'll support both qemu style and a "real addr" style. Qemu
style is Guest_IP = 10.0.2.15 and Router_IP = 10.0.2.2 (which doubles
as the host).
"real addr" style is where guest_IP == the host's IP, and the Router_IP
is the host's router IP. 'real addr' mode is used when the guest wants
to config itself as if it was the host. We use this when we do our
'pass-through' virtual machines, where the VM gets the real disk that
Linux would otherwise use, and that disk has scripts that set a static
IP.
If real-addr mode is a pain, then I'll scrap it, but I figured it'd
help out with system administration. And I think I can make it work
with no real changes. At first, ARP seems like it might be hard, but I
think the host side of virtio-net responds with itself as the target
for any ARPs.
The details don't matter much - suffice it to say, userspace will need to
reserve ports in a protocol and preferably access IP headers in packets.
Anyways, the kernel change to support this is a protocol bypass
mechanism that happens on a port-by-port basis. The kernel handles the
IP layer still. The actual protocol layer in #ip is ignored, and the
user communicates directly with #ip, just like a protocol would.
As a review, packets come into the networking stack in a function like
ipiput{4,6}. After defrag and other routing stuff, it eventually gets
handed off to a proto's receive method (e.g. tcpiput). The packet gets
passed with the IP header intact. Eventually, the protocol strips all
headers and dumps the data in conv->rq.
For outgoing packets, when data gets dropped in the conv->wq, the proto
usually has a 'kick' method, which does the protocol magic (e.g.
udpkick, which puts a UDP header layer on it) and then pushes the
packets to ipoput{4,6}. ipoput figures out the route/interface to send
the packet, maybe fragments the packet, deals with xsum, then blasts the
packet out the medium. At that point, it's in the ethermedium, which
sorts out the arp address and all that.
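The receive side of that review can be modeled roughly like this. It's a
toy user-space sketch: `struct proto`, `protos[]`, `ip_dispatch`, and
`demo_rcv` are simplified stand-ins for the real structures in devip.c,
and the send side (the 'kick' method) is left out.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy stand-ins for the real structures in devip.c. */
struct proto {
	uint8_t ipproto;			/* IP protocol number: 6 = TCP, 17 = UDP */
	void (*rcv)(uint8_t *pkt, size_t len);	/* e.g. tcpiput: gets IP header intact */
};

#define NPROTO 4
static struct proto protos[NPROTO];

static int rcv_calls;	/* demo counter so the stand-in rcv is observable */

static void demo_rcv(uint8_t *pkt, size_t len)
{
	(void)pkt;
	(void)len;
	rcv_calls++;
}

/* ipiput-ish: after defrag and routing, hand the packet (IP header
 * still intact) to the protocol matching byte 9 of the IPv4 header. */
int ip_dispatch(uint8_t *pkt, size_t len)
{
	for (int i = 0; i < NPROTO; i++) {
		if (protos[i].rcv && protos[i].ipproto == pkt[9]) {
			protos[i].rcv(pkt, len);
			return 0;
		}
	}
	return -1;	/* no protocol claimed it: drop */
}
```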
Here's where the changes come in. The "bypass" command will be just
like "bind", which mostly just sets the local port (either tries to
use a specific port and fails if it is in use, or it finds a random
available port). bypass will additionally set a flag in the conv
marking it as a protocol-bypasser. The flag will be one-way.
When #ip is about to call a proto rcv method, if the flag is set, it'll
drop the packet, IP headers and all, into the Qdata rq. When the user
writes a packet to the conv's Qdata, if the flag is set, we'll just
pass the blob, expecting it to have IP headers, directly to ipoput
(we'll inspect it to see if it is v4 or v6).
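That v4-vs-v6 inspection is just a nibble check. A minimal sketch
(`ip_version` is a hypothetical helper name):

```c
#include <stdint.h>

/* The write-side inspection: the top nibble of the first byte of an IP
 * packet is its version, which picks ipoput4 vs ipoput6. */
int ip_version(const uint8_t *pkt)
{
	return pkt[0] >> 4;	/* 4 for IPv4, 6 for IPv6 */
}
```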
Side note: we'll have to verify that the kernel doesn't
particularly trust any part of the IP header that the user gets
(for instance, it looks like it sets the fragmentation bits on
its own).
Side note 2: the block data will be in the form of the actual
IP packet. The IP addrs will *not* be in the Plan 9 struct IP
format (where v6 and v4 take the same space).
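For reference, the internal format that side note contrasts with keeps
every address in 16 bytes, storing v4 addrs as v4-mapped v6
(::ffff:a.b.c.d). Something like the stack's v4tov6() (a sketch; the
kernel has its own version):

```c
#include <stdint.h>
#include <string.h>

/* Convert a 4-byte v4 addr into the 16-byte internal form: 10 zero
 * bytes, 0xff 0xff, then the v4 addr. */
void v4tov6(uint8_t v6[16], const uint8_t v4[4])
{
	static const uint8_t v4prefix[12] = {0, 0, 0, 0, 0, 0,
	                                     0, 0, 0, 0, 0xff, 0xff};

	memcpy(v6, v4prefix, 12);
	memcpy(v6 + 12, v4, 4);
}
```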
From the user's perspective, for a given conversation, it reads and
writes IP packets, just like they are being routed. For virtio-net, it
can do whatever rewrite of addr/port it wants when it talks to the
guest. For user-level TCP, it can run whatever algorithm it wants.
The bypass interface puts the user in the same spot of the #ip stack
where the protocol would take over.
The trick is that the bypass happens on a port-by-port basis. The
protocol stack (e.g. TCP) is only bypassed for specific conversations,
and those conversations have specific ports (the port that we passed to
"bind" earlier, or -1 which gave us any old port). So from a remote
machine's perspective, there is no difference between a normal TCP port
and a bypass port. Among other things, this allows us to move services
between the host and guest, much like qemu's port forwarding. But we
still get the benefits of Akaros's IP stack, such as checksum offload
and whatever else. Plus, we want to use the host's routing tables -
not the guest's.
This approach differs from the nasty hacks we've been doing recently
where the VMM and virtio-net directly hack into #ether, sending all
packets, unfiltered, to the NIC. By doing it at the #ip layer, we can
split the port-space of a protocol.
One thing that wasn't obvious to me until I looked into it was that we
don't need to involve the protocols at all. Initially, I thought we'd
need to check with TCP's data structures to make sure we get a free
port or something. However, #ip handles the ports for all of the
protocols. It might be called "IP", but it knows more than just IP
headers.
Anyway, the kernel change is pretty small, and I can probably do it in
less time than it took to write this email. =) The virtio-net stuff
will take a bit longer. One little thing is that reading from Qdata is
a stream, so the user will need to figure out IP packet boundaries.
It'd require some more invasive changes to get around that (actually
not that bad - we'd need a syscall for block read/write; maybe I'll do
that too).
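Finding those packet boundaries in the stream is straightforward, since
each IP header carries its own length. A userspace framing sketch
(`ip_pkt_len` is a hypothetical helper, not part of any existing API):

```c
#include <stdint.h>
#include <stddef.h>

/* Given `len` bytes read from Qdata, return the size of the first
 * complete IP packet, or 0 if more bytes are needed (or the stream is
 * desynced).  IPv4 carries total length at bytes 2-3; IPv6 carries
 * payload length at bytes 4-5, excluding its fixed 40-byte header. */
size_t ip_pkt_len(const uint8_t *buf, size_t len)
{
	size_t need;

	if (len < 1)
		return 0;
	switch (buf[0] >> 4) {
	case 4:
		if (len < 4)
			return 0;
		need = (buf[2] << 8) | buf[3];
		break;
	case 6:
		if (len < 6)
			return 0;
		need = 40 + ((buf[4] << 8) | buf[5]);
		break;
	default:
		return 0;	/* not an IP packet */
	}
	return len >= need ? need : 0;
}
```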
Oh, one other thing - if the guest sends virtio-net fragmented packets,
I think the VMM might need to reassemble them. Otherwise, they'll get
sent as-is to the remote machine. (The remote could also be a local
port; that's how I picture the guest accessing local services working -
just like someone trying to connect to HOST_IP:PORT.)
Let me know if you spot any issues.
Barret
--
You received this message because you are subscribed to the Google Groups
"Akaros" group.