We're seeing a similar issue. We recently migrated all of our
dafileservers to 1.8.6 (the three db servers are still on 1.6.24). We're
running CentOS 7.9 (kernel 3.10.0-1160.2.2), and these are all VMs on VMware.

The db servers appear to be okay (vos listvldb works, udebug shows recovery
state 1f), and the fileservers still *seem* to be serving content (could be
cached), but a 'vos partinfo localhost -localauth' returns:

Could not fetch the list of partitions from the server
Possible communication failure
Error in vos listpart command.
Possible communication failure


even though the underlying storage is attached, and 'find /vicepa -ls' can
traverse the vice mount and hasn't returned any errors.

I restarted the afs processes on one server, and post restart I'm seeing
the following in FileLog:

> Thu Jan 14 07:17:34 2021 File server has terminated normally at Thu Jan 14
> 07:17:34 2021
> Thu Jan 14 07:17:34 2021 File server starting (/usr/afs/bin/dafileserver
> -L -p 256 -vattachpar 8 -vc 32768 -s 10000 -l 20000 -hr 1 -cb 5000000
> -nobusy -udpsize 524288 -rxpck 800 -b 16000)
> Thu Jan 14 07:19:54 2021 VL_RegisterAddrs rpc failed; will retry
> periodically (code=5377, err=0)
> Thu Jan 14 07:24:35 2021 File server starting (/usr/afs/bin/dafileserver
> -L -p 256 -vattachpar 8 -vc 32768 -s 10000 -l 20000 -hr 1 -cb 5000000
> -nobusy -udpsize 524288 -rxpck 800 -b 16000)
> Thu Jan 14 07:26:55 2021 VL_RegisterAddrs rpc failed; will retry
> periodically (code=-1, err=0)
> Thu Jan 14 07:30:25 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
> Thu Jan 14 07:32:40 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
> Thu Jan 14 07:34:55 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
> Thu Jan 14 07:37:10 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
>
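
For reference, the code= values in these FileLog lines can be decoded. Below is a minimal Python sketch, assuming the constants match the ubik and Rx error tables as I remember them (5376 UNOQUORUM, 5377 UNOTSYNC, -1 RX_CALL_DEAD). The `describe` helper and the lookup table are my own illustration and not part of OpenAFS; running `translate_et` on a server would be the authoritative check.

```python
import re

# Assumed values, taken from memory of the ubik and Rx error tables;
# confirm with `translate_et <code>` on a machine with OpenAFS installed.
KNOWN_CODES = {
    5376: ("UNOQUORUM", "ubik: no quorum elected"),
    5377: ("UNOTSYNC", "ubik: not the synchronization site"),
    -1: ("RX_CALL_DEAD", "rx: call timed out, peer did not respond"),
}

CODE_RE = re.compile(r"code=(-?\d+)")


def describe(line):
    """Return (code, name, meaning) for the first code= field, or None."""
    m = CODE_RE.search(line)
    if m is None:
        return None
    code = int(m.group(1))
    name, meaning = KNOWN_CODES.get(code, ("UNKNOWN", "not in this table"))
    return code, name, meaning


# Log messages copied from the FileLog excerpt (timestamps dropped,
# wrapped lines joined).
log_lines = [
    "VL_RegisterAddrs rpc failed; will retry periodically (code=5377, err=0)",
    "VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)",
    "Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.",
]

for line in log_lines:
    print(describe(line))
```

If that mapping holds, the first failure was an answer from a vlserver that did not consider itself the sync site, and every later failure is the Rx layer giving up on a peer that never replied.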

The dasalvager process keeps exiting (exit code 1), and SalsrvLog shows:

> Thu Jan 14 08:22:57 2021 @(#)OpenAFS 1.8.6 2020-07-15
> [email protected]
> Thu Jan 14 08:22:57 2021 Starting OpenAFS Online Salvage Server 2.4
> (/usr/afs/bin/salvageserver)
> Thu Jan 14 08:23:43 2021 SYNC_connect: temporary failure on circuit
> 'FSSYNC' (will retry)
> Thu Jan 14 08:23:59 2021 SYNC_connect: temporary failure on circuit
> 'FSSYNC' (will retry)
> Thu Jan 14 08:24:23 2021 SYNC_connect: temporary failure on circuit
> 'FSSYNC' (will retry)
> Thu Jan 14 08:24:55 2021 SYNC_connect: temporary failure on circuit
> 'FSSYNC' (will retry)
> Thu Jan 14 08:25:35 2021 SYNC_connect: temporary failure on circuit
> 'FSSYNC' (will retry)
> SYNC_connect failed (giving up!): Connection refused
> Thu Jan 14 08:26:23 2021 Unable to connect to file server; aborted
>
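
One thing that may be worth checking is the spacing of the retries in both logs. Below is a rough Python sketch that measures the gap between consecutive matching log lines; `retry_gaps` is a hypothetical helper of mine, and the sample lines are the excerpts above with the wrapped lines joined.

```python
import re
from datetime import datetime

# ctime-style prefix used by the OpenAFS logs, e.g. "Thu Jan 14 08:23:43 2021".
STAMP = re.compile(r"^(\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) (.*)$")


def retry_gaps(lines, needle):
    """Seconds between consecutive timestamped lines whose message contains needle."""
    times = []
    for line in lines:
        m = STAMP.match(line)
        if m and needle in m.group(2):
            times.append(datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y"))
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


salsrvlog = [
    "Thu Jan 14 08:23:43 2021 SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)",
    "Thu Jan 14 08:23:59 2021 SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)",
    "Thu Jan 14 08:24:23 2021 SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)",
    "Thu Jan 14 08:24:55 2021 SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)",
    "Thu Jan 14 08:25:35 2021 SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)",
]

filelog = [
    "Thu Jan 14 07:30:25 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.",
    "Thu Jan 14 07:32:40 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.",
    "Thu Jan 14 07:34:55 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.",
    "Thu Jan 14 07:37:10 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.",
]

print(retry_gaps(salsrvlog, "SYNC_connect"))     # [16, 24, 32, 40]
print(retry_gaps(filelog, "Couldn't get CPS"))   # [135, 135, 135]
```

The salvageserver gaps grow by 8 seconds each time, which looks like an ordinary backoff. The fileserver says it will retry in 30 seconds but the messages are 135 seconds apart, which would be consistent with each attempt waiting well over a minute before failing rather than being refused right away.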

I'm really at a loss as to what else to look for.

Best regards,
k-


On Thu, Jan 14, 2021 at 7:45 AM Valtteri Vuorikoski <[email protected]>
wrote:

>
> I have a small OpenAFS 1.8.6 setup using the Debian and Ubuntu packages.
> Last night everything was working fine, this morning machines were
> timing out trying to talk to volume servers. Database replication was
> also stuck.
>
> While there is a single backup database and file server, databases and
> volumes are primarily on a single server. I logged in to that server
> ("afs1"), made it the only machine in the cell by editing client and
> server CellServDB and set out trying to restore things.
>
> afs1 is running Debian bullseye. Kernel 5.8 (running at the time when
> things broke) and 5.10 result in an equally non-functional system. There
> are no iptables rules on the system.
>
> OpenAFS is almost 100% dead for no apparent reason:
>
> - "pts listentries" and "vos listvldb localhost" work. udebug shows both
>   servers in recovery state 1f, site is sync site and there are no
>   replicas (as expected at this point).
>
> - After restarting services, vos status -localauth -server localhost
>   prints the following:
>
> Could not access status information about the server
> Possible communication failure
> Error in vos status command.
> Possible communication failure
>
> - After a while, vos status no longer prints anything, just hangs. All
>   AFS client access times out.
>
> - There is mostly nothing in the logs. Starting
>   vlserver/ptserver/dafileserver with -d 125 doesn't lead to any extra
>   output. Nothing out of the ordinary (except AFS client errors) appears
>   in dmesg or journalctl -b. After starting dafileserver -L, the
>   following log appears:
>
> Thu Jan 14 11:59:54 2021 File server starting
> (/usr/lib/openafs/dafileserver -L)
> Thu Jan 14 11:59:54 2021 VL_RegisterAddrs rpc failed; will retry
> periodically (code=5376, err=0)
> Thu Jan 14 12:01:04 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
> Thu Jan 14 12:02:09 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
>  [the last message keeps repeating]
>
> - dasalvager appears to run successfully. I'm currently running a
>   voldump to recover data and it's running fine so far. There is plenty
>   of disk space.
>
> - Kerberos appears to be working. kinit works, aklog works, pts/vos
>   commands without -localauth work when a superuser token is present.
>   KDC (Samba) doesn't show any problems related to the afs principal.
>   Clocks are accurate.
>
> - Rebooting the whole system (a qemu VM) makes no difference.
>
> After four hours of debugging, I'm at the end of my wits. Even
> temporarily removing all databases, restarting ptserver and vlserver and
> touching NoAuth won't make fileserver/volserver happy. It seems like RX
> communication is failing somehow, but I have no idea why.
>
> Any ideas what's going on here?
>
>  -Valtteri


-- 
Kendrick Hernandez
*UNIX Systems Administrator*
Division of Information Technology
University of Maryland, Baltimore County
