On 1/26/2024 1:53 PM, Michael Laß wrote:
I captured the following traces and will comment inline on what I could
find:


Starting with a client running on Linux 6.6.13, trying to access
/afs/desy.de:
fstrace: https://homepages.upb.de/lass/openafs/6.6.13.fstrace
pcapng:  https://homepages.upb.de/lass/openafs/6.6.13.pcapng

The packet trace (pcapng, can be opened with Wireshark) shows that the
reply to fetch-data-64 (i.e., the directory listing) arrives in
fragments (e.g., frames 127+128). Nevertheless, the reception of the
packet is acknowledged in frame 131. In the end, everything works fine.


Running the same scenario on Linux 6.7:
fstrace: 
https://homepages.upb.de/lass/openafs/6.7_default-mtu_default-rxmaxmtu.fstrace
pcapng:  
https://homepages.upb.de/lass/openafs/6.7_default-mtu_default-rxmaxmtu.pcapng

The receiving side looks very similar, we still receive the reply to
fetch-data-64 in fragments (frames 127+128, 129+130, etc.). However,
the reception is never acknowledged by the client. The getdents64
syscall hangs forever.


Reducing the maximum RX MTU via -rxmaxmtu 1400 on Linux 6.7:
fstrace: 
https://homepages.upb.de/lass/openafs/6.7_default-mtu_rxmtu-1400.fstrace
pcapng:  
https://homepages.upb.de/lass/openafs/6.7_default-mtu_rxmaxmtu-1400.pcapng

The reply to fetch-data-64 is not fragmented anymore because the RX
packets are sufficiently small (frames 149-152). The reception is ACK'd
in frame 154.


It could be that the larger UDP packets are fragmented by my provider,
as my IPv4 connection is realized via DS-Lite (a carrier-grade NAT
[1][2]), which may reduce the MTU. This fragmentation may be key to
reproducing this issue.

Still, it worked fine with Linux 6.6, even when receiving fragmented
responses, and it is not working anymore with Linux 6.7. I may start
bisecting the Linux kernel changes between 6.6 and 6.7, but I fear that
this will take weeks...

Best regards,
Michael


[1] https://en.wikipedia.org/wiki/Carrier-grade_NAT
[2] 
https://en.wikipedia.org/wiki/IPv6_transition_mechanism#Dual-Stack_Lite_(DS-Lite)

Dear openafs-devel list,

Michael and I spent some time over the weekend reproducing the behavior with packet captures on both the client host and the fileserver host.

Michael's ISP connection is Vodafone DS-Lite, which transmits IPv4 traffic over a tunnel with a 1460-byte MTU.  His LAN MTU is 1500, which yields a preferred Rx MTU of 1444 (that is, 1444 bytes of data per Rx DATA packet).
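
(The arithmetic behind that preferred value, for reference: 1444 bytes of Rx payload + 28-byte Rx header + 8-byte UDP header + 20-byte IPv4 header = a 1500-byte IP datagram, which matches the LAN MTU but exceeds the 1460-byte tunnel MTU.)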

The OpenAFS cache manager advertises a willingness to accept packets as large as 5692 bytes and up to four Rx packets in each datagram (aka jumbograms).

When a desy.de fileserver replies to a FetchData RPC for the root directory of the root.cell.readonly volume, it must return 6144 bytes.  This requires five DATA packets of sizes (1444, 1444, 1444, 1444, 368).

Of these five DATA packets, only the fifth can be transferred across the tunnel without fragmentation.

The Linux network stack attempts to emulate the behavior of IPv6 with regard to the transmission of fragmented packets.  In IPv6, only the original sender of a packet is permitted to fragment it; routers and switches along the path are not.  Instead, any router that cannot forward a packet because it is too large for the next hop must return an ICMPv6 Packet Too Big message documenting the MTU of that hop.  Upon receipt of the ICMPv6 message the sending host updates its local path MTU cache, and the next time a packet larger than the path MTU is sent along that path, the sender fragments it.

The way that Linux emulates this IPv6 behavior when using IPv4 is to set the IP DONT_FRAGMENT flag on all outgoing packets.  This prevents the packets from being forwarded onto network segments with smaller MTUs and is supposed to trigger an ICMP "Fragmentation Needed" reply (the IPv4 equivalent of Packet Too Big).  This process is referred to as Path MTU Discovery.  A summary can be found at https://erg.abdn.ac.uk/users/gorry/course/inet-pages/pmtud.html.
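
As an illustration of the socket-level mechanism (a minimal sketch, not OpenAFS or kernel code; the destination address and port are placeholders), this is how a UDP sender on Linux opts into the DF-based behavior and later reads back the kernel's cached path MTU:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        /* Set DF on outgoing datagrams; a send larger than the cached
         * path MTU now fails with EMSGSIZE instead of being fragmented
         * locally by the kernel. */
        int pmtud = IP_PMTUDISC_DO;
        setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &pmtud, sizeof(pmtud));

        /* Placeholder destination; IP_MTU below requires a connected socket. */
        struct sockaddr_in dst;
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(7001);
        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);
        connect(fd, (struct sockaddr *)&dst, sizeof(dst));

        /* After the kernel has processed an ICMP Fragmentation Needed for
         * this destination, its cached path MTU can be queried: */
        int mtu = 0;
        socklen_t len = sizeof(mtu);
        if (getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) == 0)
            printf("cached path MTU: %d\n", mtu);

        return 0;
    }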

The AuriStorFS fileservers begin each call with a congestion window of 4 packets, which permits four packets to be placed onto the wire.  In the case of fetching the desy.de root.cell.readonly root directory there are five DATA packets.  The first four are placed onto the wire with the DONT_FRAGMENT flag.  They cannot fit in the tunnel, so they are dropped and ICMP errors are returned.
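
To illustrate the effect of that initial window (illustrative C only, not the actual Rx source; transmit_data_packet is a hypothetical stand-in):

    #include <stdio.h>

    #define INITIAL_CWND 4   /* initial congestion window, per the text above */

    /* hypothetical stand-in for placing one Rx DATA packet on the wire */
    static void transmit_data_packet(int seq)
    {
        printf("send DATA packet %d (DF set)\n", seq);
    }

    int main(void)
    {
        int npackets = 5;    /* the five DATA packets of the FetchData reply */
        int in_flight = 0, next = 0;

        while (next < npackets && in_flight < INITIAL_CWND) {
            transmit_data_packet(next++);
            in_flight++;
        }
        /* The fifth packet stays queued until ACKs for this burst
         * open the congestion window again. */
        printf("%d packet(s) waiting for the window to open\n",
               npackets - next);
        return 0;
    }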

However, it appears that not all ICMP packets are being received from Vodafone, or perhaps there are two layers of tunneling: the first might have an MTU of 1480 and the second an MTU of 1460.

Since no ACK has been received for the transmitted DATA packets, the fileserver retransmits them when the RTO expires and doubles the RTO.  The retransmitted packets have both the Rx REQUEST_ACK flag and the IP DONT_FRAGMENT flag set.  Eventually an ICMP Fragmentation Needed message is delivered which advertises a small enough MTU, and the next RTO retransmission results in the DATA packets being sent fragmented.
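
A rough sketch of that retransmission backoff (illustrative only; the initial and maximum RTO values are made up for the example):

    #include <stdio.h>

    int main(void)
    {
        double rto = 2.0;            /* hypothetical initial RTO, in seconds */
        const double max_rto = 60.0; /* hypothetical cap */

        for (int attempt = 1; attempt <= 6; attempt++) {
            printf("attempt %d: retransmit DATA (REQUEST_ACK + DF), "
                   "wait %.0fs for an ACK\n", attempt, rto);
            /* no ACK arrives, so double the timeout before the next attempt */
            rto = (rto * 2 < max_rto) ? rto * 2 : max_rto;
        }
        return 0;
    }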

Once the fragmented DATA packets are small enough to fit in the tunnel, they are delivered to the client host, where they are reassembled and delivered to the cache manager's socket.  As the DATA packets with Rx REQUEST_ACK set are received, ACK packets are returned to the fileserver.

Once the cache manager ACKs the first DATA packet, the fileserver can send the fifth DATA packet, which does not require fragmentation, and the call completes.  This pattern is observed in the 6.6.13.pcapng trace.

In the 6.7_default-mtu_default-rxmaxmtu.pcapng capture the pattern is the same, except that the cache manager never receives the DATA packets and therefore never sends ACK packets.  There appears to be a regression introduced in the v6.7 kernel which prevents reassembly, or delivery of the reassembled packets, to the cache manager's Rx.  A Linux kernel bisect is required to determine which commit introduced the behavior change.

Since there is no hard dead timeout configured for fileserver connections, there is nothing to time out the FetchData call from the client side.  The cache manager and fileserver will exchange PING/PING_RESPONSE packets, since those fit within the tunnel without fragmentation, but the DATA packets will never make it through.  As a result Michael observes the "ls -l /afs/desy.de/" process blocking in the syscall for what feels like forever.  The AuriStorFS fileserver will eventually kill the call once the DATA packets have been retransmitted too often, but from the client's perspective there are no timeouts that will result in failure of the call.

When Michael configures afsd with -rxmaxmtu 1400, the cache manager informs the fileserver that it will not accept any packets larger than Rx MTU 1400 (IP MTU 1456).  Since that is smaller than the tunnel's MTU, all of the DATA packets can be delivered without fragmentation.
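
(The arithmetic, for reference: 1400 bytes of Rx payload + 28-byte Rx header + 8-byte UDP header + 20-byte IPv4 header = a 1456-byte IP datagram, which fits within the 1460-byte tunnel MTU.)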

As presented at the AFS Tech Workshop in June 2023, the 2023 builds of AuriStor Rx include Path MTU Discovery.  AuriStor's PMTUD implementation starts with an Rx MTU of 1144 and probes upwards from there.  No DATA packets larger than the verified path MTU are constructed, so all packets are delivered without fragmentation.  This permits AFS access from the Linux v6.7 kernel even with the regression.
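
For intuition, a rough sketch of upward MTU probing under that constraint (illustrative only, not AuriStor's implementation; the probe ladder and the probe_acked helper are invented for the example):

    #include <stdio.h>

    /* hypothetical: send a padded probe of 'size' bytes and report whether
     * it was ACKed; here we pretend the path only passes 1400-byte probes */
    static int probe_acked(int size)
    {
        return size <= 1400;
    }

    int main(void)
    {
        int verified = 1144;                     /* starting Rx MTU from the text */
        int candidates[] = { 1300, 1400, 1444 }; /* invented probe ladder */

        for (int i = 0; i < 3; i++) {
            if (probe_acked(candidates[i]))
                verified = candidates[i];        /* raise the verified path MTU */
            else
                break;                           /* stop probing upwards */
        }
        /* DATA packets are never built larger than 'verified'. */
        printf("verified Rx path MTU: %d\n", verified);
        return 0;
    }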

Hopefully a bisect between v6.6 and v6.7 will identify the source of the regression so that it can be fixed for v6.8 and then back-ported to one of the v6.7 stable branches.

Jeffrey Altman


