Dear list,

This mail summarises a problem that we have seen with
OpenAFS 1.6.0/1.6.1 clients and that may be of interest
to others:

Following an upgrade of a remote site's batch cluster to
1.6.0, we noticed a linearly increasing rate of small
packets sent to our AFS servers here at CERN.

After a couple of days, the rate from these 1,500+ machines
became so high that it severely impaired our firewall and
our AFS servers (by then the rate had exceeded
1 million packets/sec).

After being informed about the issue, the colleagues at
the remote site applied a patch from 1.6.1 that was
supposed to mitigate this problem. The rate did drop,
but it started to grow again shortly afterwards. Finally,
they decided to downgrade their cluster to 1.5.78. This
"solved" the issue; see the attached plot showing the
incoming traffic as seen by one of our servers.

From what we see, the source of these packets is the NAT
keep-alive feature, which sends Rx version packets to keep
the NAT port mapping alive.

In our case, the rate of these pings was increasing for all
clients (in a load-dependent way), and reached >5000 pings/sec
for the worst clients.

The 1.6.1 patch improved the situation but did not solve it:
the slope was simply less steep. After ~2 weeks of production,
some nodes had already reached 300-400 pings/sec again. 1.5.78
stays at 5 pings/sec, even after being in production for some time.
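
To illustrate how such a slope could be quantified per client,
here is a minimal Python sketch that fits a least-squares line
through (day, pings/sec) samples. The function name and the
sample values are hypothetical and chosen only for illustration;
they are not measurements from our servers.

```python
def growth_per_day(samples):
    """Least-squares slope of (day, pings_per_sec) samples.

    `samples` is a hypothetical list of (day, rate) pairs for one
    client; the result is the rate increase in pings/sec per day.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("all samples were taken at the same time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


# Made-up example: a client observed on day 0, 7 and 14.
example = [(0, 5.0), (7, 180.0), (14, 355.0)]
slope = growth_per_day(example)
```

A client whose slope stays close to zero behaves like the
1.5.78 nodes; a clearly positive slope would indicate the
growth described above.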

As a potential cause, leaking connection reference counters
and subsequently failing garbage collection come to mind,
but we cannot point to anything in the code yet.

Anyway, as 1.6.x is becoming more widespread, we
considered it worthwhile to raise awareness of this issue.

We will probably now start to monitor this specific traffic
more closely and will block sites in case their traffic
endangers the stability of our cell.
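
As a rough idea of what such monitoring could look like, the
Python sketch below counts Rx version packets per source address
and lists the sources above a threshold. It assumes the usual
Rx wire header (28 bytes, with the packet type in the byte at
offset 20) and that version packets carry type 13; please check
these against rx.h before relying on them. The function names,
the record format and the threshold are hypothetical, and how
the UDP payloads are captured is left open.

```python
from collections import Counter

RX_HEADER_LEN = 28           # assumed size of the Rx wire header
RX_TYPE_OFFSET = 20          # assumed offset of the packet type byte
RX_PACKET_TYPE_VERSION = 13  # assumed type value of version packets


def is_rx_version_packet(payload):
    """Return True if a UDP payload looks like an Rx version packet."""
    if len(payload) < RX_HEADER_LEN:
        return False
    return payload[RX_TYPE_OFFSET] == RX_PACKET_TYPE_VERSION


def version_ping_rates(records, window_seconds):
    """Map each source address to its version packets per second.

    `records` is an iterable of (source_address, udp_payload) pairs
    captured during a window of `window_seconds` seconds.
    """
    counts = Counter(
        src for src, payload in records if is_rx_version_packet(payload)
    )
    return {src: n / window_seconds for src, n in counts.items()}


def noisy_clients(rates, threshold_per_sec):
    """Return the sorted sources whose rate exceeds the threshold."""
    return sorted(src for src, rate in rates.items() if rate > threshold_per_sec)
```

Fed with the payloads seen on the file server port during, say,
a 10-second window, noisy_clients() would return the candidates
for closer inspection or blocking.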

Cheers,
 Arne


--
Arne Wiebalck
CERN IT

<<attachment: NAT_keep_alive_packet_flood.png>>
