Lots of conversation overnight! Let me see if I can answer. But before I do,
here's an observation that just jumped out at me.
We're pushing the network by issuing the command:
dd if=/dev/zero of=~/foo count=1 bs=33554432
which writes a single 32 MiB block from the VPN client to my home directory
(an NFS filesystem whose server is on the other side of the VPN).
netstat -s before:
Ip:
194773 total packets received
0 forwarded
0 incoming packets discarded
194770 incoming packets delivered
199630 requests sent out
4238 outgoing packets dropped
38 dropped because of missing route
Tcp:
569 active connections openings
4 passive connection openings
25 failed connection attempts
0 connection resets received
8 connections established
96254 segments received
117889 segments send out
4057 segments retransmited
0 bad segments received.
358 resets sent
Udp:
97795 packets received
10 packets to unknown port received.
0 packet receive errors
119674 packets sent
0 receive buffer errors
4238 send buffer errors
Then I run the dd command (which takes about 17 seconds to complete, because
writes are so slow).
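(For scale: 33554432 bytes in roughly 17 seconds is about 1.9 MiB/s, or around
16 Mbit/s through the tunnel.)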
After:
Ip:
214798 total packets received
0 forwarded
0 incoming packets discarded
214795 incoming packets delivered
237868 requests sent out
5729 outgoing packets dropped
38 dropped because of missing route
Tcp:
572 active connections openings
4 passive connection openings
25 failed connection attempts
0 connection resets received
8 connections established
106255 segments received
143265 segments send out
5547 segments retransmited
0 bad segments received.
364 resets sent
Udp:
107804 packets received
10 packets to unknown port received.
0 packet receive errors
145061 packets sent
0 receive buffer errors
5729 send buffer errors
The deltas are:
1491 outgoing IP packets dropped (5729 - 4238)
1490 TCP segments retransmitted (5547 - 4057)
1491 UDP send buffer errors (5729 - 4238)
So to me that reads as "1491 outbound UDP packets suffered "send buffer
errors."
That resulted in 1491 IP packets being dropped.
That, in turn yielded 1490 TCP segments (which were inside IPSec UDP packets)
being lost, and thus retransmitted.
This REALLY smells funny.
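If anyone wants to reproduce the comparison, it's just a counter snapshot on
either side of the write:

netstat -s > /tmp/before                       # counters before the test
dd if=/dev/zero of=~/foo count=1 bs=33554432   # the 32 MiB NFS write
netstat -s > /tmp/after                        # counters after the test
diff /tmp/before /tmp/after                    # changed lines are the deltas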
On a lark, I tried setting net.core.wmem_max, net.core.wmem_default,
net.ipv4.udp_wmem_min, etc. to much larger values, with no measurable effect.
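For the record, those attempts looked roughly like this; the values shown are
examples, not the exact numbers I tried:

# Raise the kernel's socket write-buffer limits (example values):
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.wmem_default=4194304
sysctl -w net.ipv4.udp_wmem_min=65536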
Now, as to your questions/observations:
>> But a reconnect/rekey would be clearly visible in OpenConnect output.
>> Tony, presumably you'd have seen that and mentioned it?
The openconnect output is very quiet once the tunnel is established. We reboot
the VMs every 24 hours, and during that whole 24 hours, this is the last thing
in the log:
Tunnel timeout (rekey interval) is 2880 minutes.
(blah blah)
ESP tunnel connected; exiting HTTPS mainloop.
... and that's seen at bootup. So since 2880 minutes (48 hours) exceeds the
life of the VM, there's not going to be a re-key at all.
>> I think Tony said it was the TCP sequence number, no? That part is
>> mystifying. Unless/until the gateway has successfully decrypted the
>> ESP packet, it should have no idea about the TCP seqno, right?
Yes, it is the TCP sequence number seen in the retransmitted, un-VPN'd
traffic. When we tcpdump the iperf flows, we can see that the original packet
carrying a given TCP sequence number is lost, because a second packet with
that same sequence number arrives much later than it should.
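The captures themselves are nothing fancy; something along these lines, where
the interface name and iperf port are placeholders for our setup:

# Capture the cleartext iperf flow on the tun device, so TCP sequence
# numbers are visible before ESP encapsulation:
tcpdump -i tun0 -n -w iperf.pcap 'tcp port 5201'
# On replay, a lost-then-retransmitted segment shows up as the same
# sequence number appearing twice, far apart in time:
tcpdump -r iperf.pcap -n | grep 'seq'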
-----Original Message-----
From: Daniel Lenski [mailto:[email protected]]
Sent: Monday, March 25, 2019 5:14 AM
To: David Woodhouse
Cc: Phillips, Tony; Nikos Mavrogiannopoulos;
[email protected]
Subject: Re: [EXTERNAL] Re: What throughput is reasonable?
On Mon, Mar 25, 2019 at 11:54 AM David Woodhouse <[email protected]> wrote:
>
> On Mon, 2019-03-25 at 11:41 +0200, Daniel Lenski wrote:
> > On Mon, Mar 25, 2019 at 10:29 AM David Woodhouse <[email protected]>
> > wrote:
> > >
> > > On Sun, 2019-03-24 at 19:13 +0200, Daniel Lenski wrote:
> > > >
> > > > Do I have this right? High packet loss from client→VPN, low packet
> > > > loss from VPN→client?
> > > >
> > > > If so, I'm guessing your problems are MTU-related.
> > >
> > > Hm, wouldn't we expect that to be more consistent? If the full-sized
> > > packets are getting lost, that would just stall and not lose the
> > > *occasional* packet?
> >
> > Yeah… should be. My guess is based on a couple of previous
> > less-detailed reports from users of earlier versions with GP.
> >
> > > If it really is a repeatable drop every N packets, I might be inclined
> > > to look at sequence numbers and epoch handling. Are we doing any ESP
> > > rekeying?
> >
> > We are rekeying, but only using the most naïve "tunnel rekey" method.
> > AFAIK, that's all that GP supports.
> > https://gitlab.com/openconnect/openconnect/blob/v8.02/gpst.c#L1153-1157
> >
> > After a certain time has elapsed, we tear down the TLS connection and
> > reconnect (using the same auth cookie), which also invalidates the
> > previous ESP keys and requires us to start using new ones. We should
> > handle late incoming packets using the "old" ESP keys correctly, using
> > the same method as with Juniper.
>
> We might be handling late incoming packets correctly, but we stop actually
> sending them. I wonder if we should continue to send ESP packets on the
> "old" connection even while we're doing the reconnect?
Hmmm… I don't think so. I did a whole lot of testing of the "tap-dance"
required to enable the ESP tunnel early on, and as far as I can tell
there are two main points:
1) As soon as the client config request (POST /ssl-vpn/getconfig.esp)
is received, any pre-existing ESP keys become invalid immediately, and
new ESP keys become valid immediately.
2) If the client ever connects to the TLS tunnel (bogus pseudo-CONNECT
GET request to /ssl-tunnel-connect.sslvpn), the existing ESP keys
immediately become invalid.
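Schematically, using the request paths from gpst.c (this is just a
restatement of the two rules above):

POST /ssl-vpn/getconfig.esp HTTP/1.1      <- old ESP keys invalid, new ones valid
GET /ssl-tunnel-connect.sslvpn HTTP/1.1   <- any existing ESP keys invalid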
Given this, I believe the safe behavior is to disable the UDP
connection entirely before the reconnect starts, and just let the
outgoing packet queue grow. Currently, we're actually *not* disabling
the UDP connection before starting the reconnect
(https://gitlab.com/openconnect/openconnect/blob/v8.02/gpst.c#L1162-1171),
but we probably should be.
Maybe try this patch…?
diff --git a/gpst.c b/gpst.c
index a0dc81f..5cd1aab 100644
--- a/gpst.c
+++ b/gpst.c
@@ -1160,6 +1160,8 @@ int gpst_mainloop(struct openconnect_info *vpninfo, int *timeout)
 		vpn_progress(vpninfo, PRG_ERR,
 			     _("GPST Dead Peer Detection detected dead peer!\n"));
 	do_reconnect:
+		if (vpninfo->proto->udp_close)
+			vpninfo->proto->udp_close(vpninfo);
 		ret = ssl_reconnect(vpninfo);
 		if (ret) {
 			vpn_progress(vpninfo, PRG_ERR, _("Reconnect failed\n"));
> But a reconnect/rekey would be clearly visible in OpenConnect output.
> Tony, presumably you'd have seen that and mentioned it?
Yeah, there should be a PRG_INFO message on initial connection and
reconnection: "Tunnel timeout (rekey interval) is %d minutes."
Anyway, I kind of doubt reconnect/rekey is playing a role here… all
the real GP VPNs I've heard about have rekey intervals of at least 20
minutes.
> Also, you said that you hit this a repeatable 4142 packets into a TCP
> connection? That was regardless of how long the VPN had been up?
I think Tony said it was the TCP sequence number, no? That part is
mystifying. Unless/until the gateway has successfully decrypted the
ESP packet, it should have no idea about the TCP seqno, right?
Dan
_______________________________________________
openconnect-devel mailing list
[email protected]
http://lists.infradead.org/mailman/listinfo/openconnect-devel