Branch: refs/heads/master
Home: https://github.com/NixOS/nixops
Commit: 4bbceb294fc50146a4fd47352fba786ee1af6654
https://github.com/NixOS/nixops/commit/4bbceb294fc50146a4fd47352fba786ee1af6654
Author: aszlig <[email protected]>
Date: 2017-03-10 (Fri, 10 Mar 2017)
Changed paths:
M nix/ssh-tunnel.nix
Log Message:
-----------
nix/ssh-tunnel: Don't set addrgenmode to none
This partially reverts 9b382b10d7d861f1dfc9ce88db7bd80d54b7553b.
It turns out that
torvalds/linux@cc9da6cc4f56e05cc9e591459fe0192727ff58b3 itself isn't the
real issue, but just caused the real bug to surface.
After hours of debugging together with Hannes Frederic Sowa, we finally
found what's the culprit here, so the commit I'm partially reverting is
merely a workaround which doesn't fix anything but just causes the bug
to show up less frequently.
As for the issue itself: OpenSSH polls the file descriptors of the tun
devices and as soon as poll returns the descriptor as being writeable,
the data coming from the SSH stream is written to the tun device.
In this particular case the mentioned kernel commit causes router
solicitations to be sent to the tun device, which in theory shouldn't be
a problem either.
However, if one tun device is up earlier than the counterpart, these
solicitation packets are sent to the tun device and get forwarded over
the SSH connection to the other peer, which then tries to write these
packets into the tun device and fails because the interface is down.
Let's illustrate this so it becomes clear:
_ ____
.-------------. Channel A ___( )( )___ Channel B .--------------.
| SSH Client | --------> (__ Internet ___) --------> | SSH Server |
| /|\ | (____)(____) | | |
| | | | \|/ |
| Tun A (UP) | | Tun B (DOWN) |
`-------------' `--------------'
So we have tunnel device A here which is already up while the tunnel
device B isn't yet. Tun A now gets a router solicitation packed which
the SSH client sends over to the SSH server over Channel A and B.
The SSH server writes the packet into Tun B, but it is down, so the
write() fails with errno EIO (Input/output error) and that in turn
causes the SSH server to shut down its channel, resulting in follow-up
traffic to end up in the middle of nowhere.
While trying to write a patch for OpenSSH to either avoid writing to the
tun device when it's down or ignore EIO, I figured out that this really
is a kernel bug and also a very longlasting one.
Since torvalds/linux@1bd4978a88ac2589f3105f599b1d404a312fb7f6 the
following code is at the start of tun_get_user(), which is called when
user space tries to write() to the device:
if (!(tun->dev->flags & IFF_UP))
return -EIO;
Unfortunately tun_chr_poll() doesn't take this into account and
designate the device as being writeable even if it's not UP, so the real
fix here should indicate the file descriptor being able to write when
it's really up.
As an analogy, it's a bit like this:
tun_chr_poll(): "Hey, you can write to the FD, go on :-)"
openssh: write(...);
tun_get_user(): "Haha, I lied, you can't!"
In summary: Setting addrgenmode to none isn't addressing this at all and
it only worked in my tests because the addrgenmode has been set quickly
enough to not cause router solicitations to be sent. It's also a kernel
problem, so it's not the resonsibility of NixOps to work around or try
to fix it.
The reason I'm only doing a partial revert is that using iproute2
instead of nettools makes the Nix expression more readable and along
switching to iproute2 I've also chained the commands via && to catch
failures instead of ignoring them.
I also did more comprehensive tests during the last few days and I could
confirm that the workaround didn't prevent the bug from happening at all
but only caused it to be less frequent.
Here is a VM test I wrote to confirm this:
https://gist.github.com/aszlig/5e8c8462d17f787b7f171c34090fd590
So if all goes well, the real fix should land with the upcoming stable
kernel updates.
Thanks again to Hannes for help on debugging this properly plus getting
this fixed soon in the kernel (I did a first patch, but that one wasn't
sufficient enough and he's obviously the expert in this matter).
Signed-off-by: aszlig <[email protected]>
_______________________________________________
nix-commits mailing list
[email protected]
http://lists.science.uu.nl/mailman/listinfo/nix-commits