Folks,

One of our AFS file servers crashed this afternoon. It runs OpenAFS 1.6.1 on RHEL 6 with kernel 2.6.32-279.9.1.el6.x86_64. It looks like the salvager hung and eventually the dafileserver stopped responding to clients.
We've rebooted, fsck'd the ext4 partitions, and finally ran dasalvager -force by hand to try to salvage the server properly. In every case, once the dafs instance starts up it serves requests and dispatches a volume salvage or four, then all the salvager processes get stuck and we start all over again. We've salvaged the server multiple times at this point -- our next hope is that we can restart the file server with the traditional file server process. (BTW, 2 and 3 GiB cores from dafileserver and dasalvager abound.)

SalsrvLog messages usually look like the following:

10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit 'FSSYNC'; attempting reconnect to server
10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit 'FSSYNC'; attempting reconnect to server
10/18/2012 17:55:11 SYNC_ask: too many / too latent fatal protocol errors on circuit 'FSSYNC'; giving up (tries 1 timeout 1350597266)
10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
10/18/2012 17:55:11 AskOffline: request for fileserver to take volume offline failed; trying again...
10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit 'FSSYNC'; attempting reconnect to server
10/18/2012 17:55:11 SYNC_ask: too many / too latent fatal protocol errors on circuit 'FSSYNC'; giving up (tries 1 timeout 1350597265)
10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
10/18/2012 17:55:11 AskOffline: request for fileserver to take volume offline failed; trying again...
10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'

or

10/18/2012 22:20:49 dispatching child to salvage volume 540007729...
10/18/2012 22:19:33 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 22:19:33 SYNC_ask: protocol communications failure on circuit 'FSSYNC'; attempting reconnect to server

and from FileLog (this looks like I'm restoring from backups)

Thu Oct 18 22:25:30 2012 FSYNC_com: invalid protocol version (2574739029)
Thu Oct 18 22:25:30 2012 FSYNC_com: invalid protocol version (3774863615)
Thu Oct 18 22:25:30 2012 FSYNC_com: invalid protocol version (944130375)
Thu Oct 18 22:25:30 2012 Volume 539458481 now offline, must be salvaged.
Thu Oct 18 22:25:30 2012 Scheduling salvage for volume 539458481 on part /vicepb over SALVSYNC
Thu Oct 18 22:25:31 2012 nUsers == 0, but header not on LRU
Thu Oct 18 22:25:31 2012 SYNC_getCom: error receiving command
Thu Oct 18 22:25:31 2012 Scheduling salvage for volume 539894230 on part /vicepb over SALVSYNC
Thu Oct 18 22:25:31 2012 FSYNC_com: read failed; dropping connection (cnt=103291)
Thu Oct 18 22:25:37 2012 FSYNC_com: invalid protocol version (2023862981)

I've checked: all my binaries are from my 1.6.1 build.

What's going on?

Jack Neely

-- 
Jack Neely <[email protected]>
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4 EA6B 213B 765F 3B6A 5B89
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
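For anyone triaging similar SalsrvLog and FileLog output, here is a minimal Python sketch for summarizing it. It is not an OpenAFS tool, and the helper names (tally_routines, timeout_as_utc, version_as_hex) are made up for illustration. It counts messages per reporting routine, prints the "invalid protocol version" values in hex so any pattern is easier to spot, and renders the "timeout" field as a date on the assumption that it is a Unix epoch timestamp, which nothing in the logs confirms.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Sample lines copied from the log excerpts in the message above.
SAMPLE = """\
10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 17:55:11 SYNC_ask: too many / too latent fatal protocol errors on circuit 'FSSYNC'; giving up (tries 1 timeout 1350597266)
10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
Thu Oct 18 22:25:30 2012 FSYNC_com: invalid protocol version (2574739029)
Thu Oct 18 22:25:37 2012 FSYNC_com: invalid protocol version (2023862981)
Thu Oct 18 22:25:31 2012 SYNC_getCom: error receiving command
"""

# The routine name is the identifier right before the first ": " in a line,
# e.g. "SYNC_ask" or "FSYNC_com". Lines without one are skipped.
ROUTINE = re.compile(r"\b([A-Za-z_]+): ")


def tally_routines(log_text):
    """Count log lines per reporting routine."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ROUTINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def timeout_as_utc(line):
    """Render the 'timeout N' field as UTC (assumes N is epoch seconds)."""
    match = re.search(r"timeout (\d+)\)", line)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


def version_as_hex(line):
    """Return the bogus protocol version as an 8-digit hex string."""
    match = re.search(r"invalid protocol version \((\d+)\)", line)
    if match is None:
        return None
    return f"0x{int(match.group(1)):08x}"


if __name__ == "__main__":
    for routine, count in tally_routines(SAMPLE).most_common():
        print(f"{count:3d}  {routine}")
    for line in SAMPLE.splitlines():
        when = timeout_as_utc(line)
        if when is not None:
            print("timeout field as UTC:", when.isoformat())
        version = version_as_hex(line)
        if version is not None:
            print("protocol version:", version)
```

Feeding it the whole SalsrvLog or FileLog instead of SAMPLE should work the same way, since each line is handled independently. The timestamp prefixes differ between the two logs, which is why the sketch keys on the routine name rather than on the date format.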
