On 1/26/2024 1:53 PM, Michael Laß wrote:
I captured the following traces and will comment inline on what I could
find:


Starting with a client running on Linux 6.6.13, trying to access
/afs/desy.de:
fstrace: https://homepages.upb.de/lass/openafs/6.6.13.fstrace
pcapng:  https://homepages.upb.de/lass/openafs/6.6.13.pcapng

The packet trace (pcapng, can be opened with Wireshark) shows that the
reply to fetch-data-64 (i.e., the directory listing) arrives in
fragments (e.g., frames 127+128). Nevertheless, the reception of the
packet is acknowledged in frame 131. In the end, everything works fine.


Running the same scenario on Linux 6.7:
fstrace: 
https://homepages.upb.de/lass/openafs/6.7_default-mtu_default-rxmaxmtu.fstrace
pcapng:  
https://homepages.upb.de/lass/openafs/6.7_default-mtu_default-rxmaxmtu.pcapng

The receiving side looks very similar, we still receive the reply to
fetch-data-64 in fragments (frames 127+128, 129+130, etc.). However,
the reception is never acknowledged by the client. The getdents64
syscall hangs forever.


Reducing the maximum RX MTU via -rxmaxmtu 1400 on Linux 6.7:
fstrace: 
https://homepages.upb.de/lass/openafs/6.7_default-mtu_rxmtu-1400.fstrace
pcapng:  
https://homepages.upb.de/lass/openafs/6.7_default-mtu_rxmaxmtu-1400.pcapng

The reply to fetch-data-64 is not fragmented anymore because the RX
packets are sufficiently small (frames 149-152). The reception is ACK'd
in frame 154.


It could be that the larger UDP packets are fragmented by my provider,
as my IPv4 connection is realized via DS-Lite (a carrier-grade NAT
[1][2]), which may reduce the MTU. This fragmentation may be key to
reproducing this issue.

Still, it worked fine with Linux 6.6, even when receiving fragmented
responses, and it is not working anymore with Linux 6.7. I may start
bisecting the Linux kernel changes between 6.6 and 6.7, but I fear that
this will take weeks...

Best regards,
Michael


[1] https://en.wikipedia.org/wiki/Carrier-grade_NAT
[2] 
https://en.wikipedia.org/wiki/IPv6_transition_mechanism#Dual-Stack_Lite_(DS-Lite)

Dear openafs-devel list,

Michael and I spent some time over the weekend reproducing the behavior with packet captures on both the client host and the fileserver host.

Michael's ISP connection is Vodafone DS-Lite, which transmits IPv4 traffic over a tunnel with a 1460-byte MTU.  His LAN MTU is 1500, which yields a preferred Rx MTU of 1444 (that is, 1444 bytes of data per Rx DATA packet).
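
(The arithmetic behind that preferred value, for reference: 1444 bytes of Rx payload + 28-byte Rx header + 8-byte UDP header + 20-byte IPv4 header = a 1500-byte IP datagram, which matches the LAN MTU but exceeds the 1460-byte tunnel MTU.)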

The OpenAFS cache manager advertises a willingness to accept packets as large as 5692 bytes and up to four Rx packets in each datagram (aka jumbograms).

When a desy.de fileserver replies to a FetchData RPC for the root directory of the root.cell.readonly volume, it must return 6144 bytes.  This requires five DATA packets of sizes (1444, 1444, 1444, 1444, 368).

Of these five DATA packets, only the fifth can be transferred across the tunnel without fragmentation.

The Linux network stack attempts to emulate the behavior of IPv6 with regard to the transmission of fragmented packets.  In IPv6, only the original sender of a packet is permitted to fragment it; routers and switches along the path are not.  Instead, any router that cannot forward a packet because it is too large for the next hop must return an ICMPv6 Packet Too Big message documenting the MTU of that hop.  Upon receipt of the ICMPv6 message the sending host updates its local path MTU cache, and the next time a packet larger than the path MTU is sent along that path, the sender fragments it.

The way that Linux emulates this IPv6 behavior when using IPv4 is to set the IP DONT_FRAGMENT flag on all outgoing packets.  This prevents the packets from being forwarded onto network segments with smaller MTUs and is supposed to trigger an ICMP "Fragmentation Needed" reply (the IPv4 equivalent of Packet Too Big).  This process is referred to as Path MTU Discovery.  A summary can be found at https://erg.abdn.ac.uk/users/gorry/course/inet-pages/pmtud.html.
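
As an illustration of the socket-level mechanism (a minimal sketch, not OpenAFS or kernel code; the destination address and port are placeholders), this is how a UDP sender on Linux opts into the DF-based behavior and later reads back the kernel's cached path MTU:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        /* Set DF on outgoing datagrams; a send larger than the cached
         * path MTU now fails with EMSGSIZE instead of being fragmented
         * locally by the kernel. */
        int pmtud = IP_PMTUDISC_DO;
        setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &pmtud, sizeof(pmtud));

        /* Placeholder destination; IP_MTU below requires a connected socket. */
        struct sockaddr_in dst;
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(7001);
        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);
        connect(fd, (struct sockaddr *)&dst, sizeof(dst));

        /* After the kernel has processed an ICMP Fragmentation Needed for
         * this destination, its cached path MTU can be queried: */
        int mtu = 0;
        socklen_t len = sizeof(mtu);
        if (getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) == 0)
            printf("cached path MTU: %d\n", mtu);

        return 0;
    }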

The AuriStorFS fileservers begin each call with a congestion window of 4 packets, which permits four packets to be placed onto the wire.  In the case of fetching the desy.de root.cell.readonly root directory there are five DATA packets.  The first four are placed onto the wire with the DONT_FRAGMENT flag.  They cannot fit in the tunnel, so they are dropped and ICMP errors are returned.
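
To illustrate the effect of that initial window (illustrative C only, not the actual Rx source; transmit_data_packet is a hypothetical stand-in):

    #include <stdio.h>

    #define INITIAL_CWND 4   /* initial congestion window, per the text above */

    /* hypothetical stand-in for placing one Rx DATA packet on the wire */
    static void transmit_data_packet(int seq)
    {
        printf("send DATA packet %d (DF set)\n", seq);
    }

    int main(void)
    {
        int npackets = 5;    /* the five DATA packets of the FetchData reply */
        int in_flight = 0, next = 0;

        while (next < npackets && in_flight < INITIAL_CWND) {
            transmit_data_packet(next++);
            in_flight++;
        }
        /* The fifth packet stays queued until ACKs for this burst
         * open the congestion window again. */
        printf("%d packet(s) waiting for the window to open\n",
               npackets - next);
        return 0;
    }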

However, it appears that not all ICMP packets are being received from Vodafone, or perhaps there are two layers of tunneling: the first might have an MTU of 1480 and the second an MTU of 1460.

Since no ACK has been received for the transmitted DATA packets, the fileserver retransmits them when the RTO expires and doubles the RTO.  The retransmitted packets have both the Rx REQUEST_ACK flag and the IP DONT_FRAGMENT flag set.  Eventually an ICMP Fragmentation Needed message is delivered which advertises a small enough MTU, and the next RTO retransmission results in the DATA packets being sent fragmented.
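
A rough sketch of that retransmission backoff (illustrative only; the initial and maximum RTO values are made up for the example):

    #include <stdio.h>

    int main(void)
    {
        double rto = 2.0;            /* hypothetical initial RTO, in seconds */
        const double max_rto = 60.0; /* hypothetical cap */

        for (int attempt = 1; attempt <= 6; attempt++) {
            printf("attempt %d: retransmit DATA (REQUEST_ACK + DF), "
                   "wait %.0fs for an ACK\n", attempt, rto);
            /* no ACK arrives, so double the timeout before the next attempt */
            rto = (rto * 2 < max_rto) ? rto * 2 : max_rto;
        }
        return 0;
    }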

Once the fragmented DATA packets are small enough to fit in the tunnel, they are delivered to the client host, where they are reassembled and delivered to the cache manager's socket.  As the DATA packets with Rx REQUEST_ACK set are received, ACK packets are returned to the fileserver.

Once the cache manager ACKs the first DATA packet, the fileserver can send the fifth DATA packet, which does not require fragmentation, and the call completes.  This pattern is observed in the 6.6.13.pcapng trace.

In the 6.7_default-mtu_default-rxmaxmtu.pcapng capture the pattern is the same, except that the cache manager never receives the DATA packets and therefore never sends ACK packets.  There appears to be a regression introduced in the v6.7 kernel which prevents reassembly, or delivery of the reassembled packets, to the cache manager's Rx.  A Linux kernel bisect is required to determine which commit introduced the behavior change.

Since there is no hard dead timeout configured for fileserver connections, there is nothing to time out the FetchData call from the client side.  The cache manager and fileserver will exchange PING/PING_RESPONSE packets, since those fit within the tunnel without fragmentation, but the DATA packets will never make it through.  As a result Michael observes the "ls -l /afs/desy.de/" process blocking in the syscall for what feels like forever.  The AuriStorFS fileserver will eventually kill the call once the DATA packets have been retransmitted too often, but from the client's perspective there are no timeouts that will result in failure of the call.

When Michael configures afsd with -rxmaxmtu 1400, the cache manager informs the fileserver that it will not accept any packets larger than Rx MTU 1400 (IP MTU 1456).  Since that is smaller than the tunnel's MTU, all of the DATA packets can be delivered without fragmentation.
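
(The arithmetic, for reference: 1400 bytes of Rx payload + 28-byte Rx header + 8-byte UDP header + 20-byte IPv4 header = a 1456-byte IP datagram, which fits within the 1460-byte tunnel MTU.)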

As presented at the AFS Tech Workshop in June 2023, the 2023 builds of AuriStor Rx include Path MTU Discovery.  AuriStor's PMTUD implementation starts with an Rx MTU of 1144 and probes upwards from there.  No DATA packets larger than the verified path MTU are constructed, so all packets are delivered without fragmentation.  This permits AFS access from the Linux v6.7 kernel even with the regression.
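
For intuition, a rough sketch of upward MTU probing under that constraint (illustrative only, not AuriStor's implementation; the probe ladder and the probe_acked helper are invented for the example):

    #include <stdio.h>

    /* hypothetical: send a padded probe of 'size' bytes and report whether
     * it was ACKed; here we pretend the path only passes 1400-byte probes */
    static int probe_acked(int size)
    {
        return size <= 1400;
    }

    int main(void)
    {
        int verified = 1144;                     /* starting Rx MTU from the text */
        int candidates[] = { 1300, 1400, 1444 }; /* invented probe ladder */

        for (int i = 0; i < 3; i++) {
            if (probe_acked(candidates[i]))
                verified = candidates[i];        /* raise the verified path MTU */
            else
                break;                           /* stop probing upwards */
        }
        /* DATA packets are never built larger than 'verified'. */
        printf("verified Rx path MTU: %d\n", verified);
        return 0;
    }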

Hopefully a bisect between v6.6 and v6.7 will identify the source of the regression so that it can be fixed for v6.8 and then back-ported to one of the v6.7 stable branches.

Jeffrey Altman


