Responses inline

On 2/10/2025 2:31 PM, tegular...@mail2tor.com wrote:
Dear OpenAFS team

I have been running into a problem.  I have three OpenAFS fileservers in
my cell, which happen to also be the VLDB servers.
What versions of OpenAFS are running on each of the servers and clients having troubles?
Occasionally, due to circumstances unrelated to OpenAFS, one of the
fileservers becomes unreachable on the network for a brief period of time,
say 30 minutes.  During this time, clients cannot access files hosted at
this fileserver, as I would expect.  On one of the other two fileservers,
I see the message 'afs: Lost contact with file server...' in the logs of
its client, as I would expect.  On the third fileserver, I see no such
entry, which I assume simply means that it did not have a client that was
trying to access any of the files on the server that is temporarily
unavailable due to network reasons (note: not because of server downtime,
in case this is important).
The full text of the "Lost contact with ..." message matters as it can contain the error code (aka reason) for the loss of contact.

A client will only contact a fileserver that it is reading volume content from.   If it never has a reason to read from the fileserver, then it will never contact it.

"rxdebug localhost 7001 -noconn -peer" will provide the list of all file and location server peers that have been contacted by the cache manager.

But when the network outage ends, recovery is only partial.  The OpenAFS
client on the fileserver that did not notice the outage continues to work
just fine.  But the OpenAFS client on the fileserver with the 'Lost
contact' message never prints an 'is back up' message, and when I run 'fs
checks' on the fileserver that noticed the outage, the following is
printed: 'These servers unavailable due to network or server problems:'
followed by the name of the server with the outage.

When you say the "client on the fileserver", is the client actually running on the same machine that the fileserver process is running on?

Or do you mean a client on another machine reading from the fileserver?

"fs checkservers" by itself only issues an RPC to fileservers that are known to be down within the default cell. Does "fs wscell" report the cell that the fileservers are located in?

Every time this happens, I try restarting the fileserver that was
unreachable, and I even try restarting all of the fileservers in the whole
cell.
If there was an actual network outage which separated the client from the fileserver, what change in circumstances are you attempting to trigger by restarting the fileserver process?
I'm running these servers on Debian machines with systemd, so I try
shutting them down and bringing them back up with systemd, and I also try
shutting them down and bringing them back up with 'bos', e.g. 'bos
shutdown'.  I try shutting them down one by one, and all at once, and all
at once with long lags of two minutes before restarting them all.

Both bos and systemctl will signal the fileserver process to save all client and callback state to a file and terminate.

When the fileserver process is restarted (regardless of the method), the saved state will be reloaded so that the clients see no change in circumstance.

I try
'fs flush -all' on the client and the server, but to no avail.

"fs flush" instructs the cache manager to invalidate its cache metadata for all files and directories.  It doesn't alter the location or file server state information.

"fs flush" does not invalidate volume location information.   To do that use "fs checkvolumes".

I try
removing the IP address of the unreachable fileserver with vos remaddrs
and then putting it back with vos setaddrs, but to no avail.

Please do not run "vos remaddrs" unless the fileserver is permanently dead.   Removing the fileserver entry for a "UUID" and then creating a new one using "vos setaddrs" creates a new entry with an initial version number.

Each time the fileserver is restarted it registers itself with the location service.  If the list of addresses changes, then the version number of the entry is incremented.  When the cache manager fetches location information for a volume, a list of fileserver uuids and fileserver version numbers is provided.  If the cache manager does not have the address information cached for that (uuid, version) pair, then it fetches it from the location service.   The cache manager will not fetch new fileserver address information if the fileserver location entry's version number is reset.
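
To make the consequence of a reset version number concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not OpenAFS code: the class names, the method names, the example addresses and the initial version value of 1 are all assumptions made for illustration. The model caches addresses per (uuid, version) pair and only asks the location service when that pair is not cached.

```python
INITIAL_VERSION = 1  # assumed value; the real initial version number may differ


class ToyLocationService:
    """Hypothetical stand-in for the location service's fileserver entries."""

    def __init__(self):
        self._entries = {}  # uuid -> (version, tuple of addresses)

    def register(self, uuid, addrs):
        """Fileserver (re)start: bump the version only if the address list changed."""
        addrs = tuple(addrs)
        if uuid not in self._entries:
            self._entries[uuid] = (INITIAL_VERSION, addrs)
            return
        version, old_addrs = self._entries[uuid]
        if old_addrs != addrs:
            self._entries[uuid] = (version + 1, addrs)

    def remove_then_set(self, uuid, addrs):
        """Models 'vos remaddrs' then 'vos setaddrs': a fresh entry, version reset."""
        self._entries.pop(uuid, None)
        self._entries[uuid] = (INITIAL_VERSION, tuple(addrs))

    def version(self, uuid):
        return self._entries[uuid][0]

    def addresses(self, uuid):
        return self._entries[uuid][1]


class ToyCacheManager:
    """Hypothetical client that caches addresses per (uuid, version) pair."""

    def __init__(self, location_service):
        self._vl = location_service
        self._cache = {}
        self.fetches = 0  # number of address fetches from the location service

    def addresses_for(self, uuid):
        key = (uuid, self._vl.version(uuid))
        if key not in self._cache:  # fetch only when this pair is unknown
            self._cache[key] = self._vl.addresses(uuid)
            self.fetches += 1
        return self._cache[key]


vl = ToyLocationService()
cm = ToyCacheManager(vl)

vl.register("uuid-a", ["192.0.2.10"])          # first registration
first = cm.addresses_for("uuid-a")             # fetched and cached

vl.remove_then_set("uuid-a", ["192.0.2.20"])   # version is reset
stale = cm.addresses_for("uuid-a")             # cache hit on the old pair

vl.register("uuid-a", ["192.0.2.30"])          # restart with a changed list
fresh = cm.addresses_for("uuid-a")             # new version, so a new fetch
```

In this model the client keeps using the old address after the remove and set sequence because the (uuid, version) pair it already holds still matches. Only a registration that increments the version causes a new fetch.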

If the fileserver address information changes, that can be corrected by restarting the fileserver.

Nothing
seems to convince the client that noticed the outage to believe in the
fileserver that was briefly unavailable.

There are a few possibilities:

1. the network stack on the system running the client believes that there is no route to the fileserver

2. the probes which are sent from the cache manager every few minutes are in fact not responded to and therefore the fileserver is still unreachable

3. the "afs_checkserver" thread is not sending probes.  Perhaps it is blocked somewhere

ps ax | grep afs_checkserver
cat /proc/<afs_checkserver-pid>/stack
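
If it is more convenient to collect the stack in one step, the two commands above can be approximated with a short Python helper. This is a sketch under assumptions: the function name and its parameters are hypothetical, it assumes a Linux /proc layout in which the thread name appears in /proc/<pid>/comm, and reading /proc/<pid>/stack normally requires root.

```python
from pathlib import Path


def find_thread_stacks(name="afs_checkserver", proc_root="/proc"):
    """Return {pid: stack text} for every task whose comm contains `name`."""
    stacks = {}
    root = Path(proc_root)
    if not root.is_dir():
        return stacks  # no procfs available at this path
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # the task exited or is not readable
        if name not in comm:
            continue
        try:
            stacks[int(entry.name)] = (entry / "stack").read_text()
        except OSError as exc:
            # Typically a permissions error when not running as root.
            stacks[int(entry.name)] = f"<unreadable: {exc}>"
    return stacks
```

Calling find_thread_stacks() as root and printing the returned dictionary gives the same information as the ps and cat commands. An empty result would suggest that no thread with that name exists.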

The only thing that ever works is rebooting the machine with the affected
client.  I hate rebooting, and as far as I am aware, it is not possible to
shut down an OpenAFS client otherwise.

The generic problem is that it's not possible to unmount a filesystem that is in use, and it's not possible to unload a filesystem kernel module if there is an active mount.

I have read suggestions that this could be an issue on the fileserver
side, but 'vos status' shows no transactions.
"vos status" is reporting on the status of volserver transactions created for volume management operations such as create, clone, dump, forward, delete, etc.
Is there a way to force the client and the fileserver to rediscover each
other?

It's not possible to force anything until the root cause of the problem is understood.

Provide more details on what happened to the network and perhaps we can infer something about how the cache manager or the client might have reacted to it.

Jeffrey Altman AuriStor, Inc.
