Dear list,

This mail summarises a problem that we have seen with OpenAFS 1.6.0/1.6.1 clients and that may be of interest to others:
Following an upgrade of a remote site's batch cluster to 1.6.0, we noticed a linearly increasing rate of small packets sent to our AFS servers here at CERN. After a couple of days, the rate from these 1'500+ machines became so high that it severely impacted our firewall and our AFS servers (by that point the rate exceeded 1 million packets/sec).

After being informed about the issue, our colleagues at the remote site applied a patch from 1.6.1 that was supposed to mitigate the problem. The rate did drop, but started to grow again shortly afterwards. Finally, they decided to downgrade their cluster to 1.5.78. This "solved" the issue; see the attached plot showing the incoming traffic as seen by one of our servers.

From what we see, the source of these packets is the NAT keep-alive feature, which sends Rx version packets to keep the NAT port mapping alive. In our case, the rate of these pings increased on all clients (in a load-dependent way) and reached >5000 pings/sec for the worst clients. The 1.6.1 patch improved the situation but did not solve it: the slope was simply less steep, and after ~2 weeks of production some nodes had already reached 300-400 pings/sec again. 1.5.78 stays at 5 pings/sec, even after being in production for some time.

As a potential cause, leaking connection reference counters and a subsequent failure of garbage collection come to mind, but we cannot point to anything in the code yet.

Anyway, as 1.6.x is becoming more widespread, we considered it worthwhile to raise awareness of this issue. We will probably now start to monitor this specific traffic more closely and will block sites if their traffic endangers the stability of our cell.

Cheers,
 Arne

--
Arne Wiebalck
CERN IT
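For anyone who wants to watch for this kind of traffic on their own servers, here is a minimal sketch of per-client rate accounting. It is not the tooling used at CERN. It assumes the keep-alive packets have already been extracted from a capture into (timestamp in seconds, source address) pairs. The function names and the 100 pings/sec default threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter


def pings_per_second(records, window_start, window_end):
    """Average packets/sec per source over [window_start, window_end).

    `records` is an iterable of (timestamp_seconds, source_address) pairs.
    This input format is an assumption, not something a capture tool emits.
    """
    duration = window_end - window_start
    if duration <= 0:
        raise ValueError("window_end must be after window_start")
    counts = Counter(
        src for ts, src in records if window_start <= ts < window_end
    )
    return {src: n / duration for src, n in counts.items()}


def noisy_clients(rates, threshold=100.0):
    """Sources whose rate exceeds `threshold` (hypothetical default), worst first."""
    over = [src for src, rate in rates.items() if rate > threshold]
    return sorted(over, key=lambda src: rates[src], reverse=True)
```

Running this over successive windows and comparing the rates per source would show whether a client stays flat (as 1.5.78 does at about 5 pings/sec) or keeps climbing.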
<<attachment: NAT_keep_alive_packet_flood.png>>
