Hi all,

we've got a serious problem here. Whole directories are disappearing. Even a restore from a tape backup is not working properly -- the internal AFS storage structure seems to be corrupted, such that a restore reproduces the same kind of error!

Here are the details: We have 3 servers with 150 users, many of them not very active. Accumulated used space is 200 GB. We have been in production (after migrating from NFS / AMD) for over two months now. We are using Red Hat 7.2 and 7.3 and OpenAFS server 1.2.3 / 1.2.4. During this time, entire user directories became unavailable twice (ls results in "connection timed out"). The FileServer log contains:

Thu Jun 13 09:37:28 2002 ProbeUuid failed for host 172.22.85.135:7001
Thu Jun 13 09:46:05 2002 CopyOnWrite failed: volume 536871014 in partition /vicepa (tried reading 8192, read 0, wrote 0, errno 4) volume needs salvage
Thu Jun 13 10:40:36 2002 VAttachVolume: volume salvage flag is ON for /vicepa//V0536871014.vol; volume needs salvage

We salvaged the volume, and that is where the disaster got worse:

@(#) OpenAFS 1.2.4 built 2002-06-01
06/13/2002 10:43:14 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager /vicepa 536871014 -tmpdir /tmp/ -orphans attach)
06/13/2002 10:43:27 CHECKING CLONED VOLUME 536871016.
06/13/2002 10:43:27 user.goetz.backup (536871016) updated 06/13/2002 01:02
06/13/2002 10:43:27 Vnode 1: length incorrect; (is 616448 should be 0)
06/13/2002 10:43:27 SALVAGING VOLUME 536871014.
06/13/2002 10:43:27 user.goetz (536871014) updated 06/13/2002 09:46
06/13/2002 10:43:27 Vnode 50318: version < inode version; fixed (old status)
06/13/2002 10:43:27 Vnode 50336: version < inode version; fixed (old status)
06/13/2002 10:43:27 Vnode 51128: version < inode version; fixed (old status)
**** etc. ****
06/13/2002 10:43:27 Vnode 1: length incorrect; changed from 616448 to 0
06/13/2002 10:43:27 Vnode 3413: length incorrect; changed from 139264 to 0
06/13/2002 10:43:27 Vnode 4841: length incorrect; changed from 2048 to 0
06/13/2002 10:44:55 First page in directory does not exist.
06/13/2002 10:44:55 Directory bad, vnode 1; salvaging...
06/13/2002 10:44:55 Salvaging directory 1...
06/13/2002 10:44:55 Failed to read first page of fromDir!
06/13/2002 10:44:55 Checking the results of the directory salvage...
06/13/2002 10:44:57 dir vnode 3401: special old unlink-while-referenced file .__afs7B72 is deleted (vnode 110664)
06/13/2002 10:44:57 dir vnode 3401: special old unlink-while-referenced file .__afsF894 is deleted (vnode 92952)
06/13/2002 10:44:57 dir vnode 3401: special old unlink-while-referenced file .__afs3A43 is deleted (vnode 97004)
06/13/2002 10:44:57 dir vnode 3401: special old unlink-while-referenced file .__afs43D9 is deleted (vnode 99872)
06/13/2002 10:44:57 First page in directory does not exist.
06/13/2002 10:44:57 Directory bad, vnode 3413; salvaging...
**** etc. ****

So vnode 1 is incorrect?! The system seems to like this idea and kills all data in the root directory of the volume! Receiving all these hundreds of __ORPHANDIR__ entries and files doesn't help. Reconstructing all the information would have taken days.

So we decided to go back to the tape backup that was made from a backup volume the night before. We restored everything... but when we mounted the volume, no data seemed to be in it. The fileserver said the same thing as for the original volume 2 hours earlier... volume needs salvage! We salvaged again, with the same result!!!

Our rescue came from a backup that was two days old. There was no problem anymore: just vos volrestore ...; fs mkmount ...; and enjoy AFS ;)

This is the second incident of such disastrous dimensions. A third occurred this morning, but only some directories were affected and, strangely, __ORPHANDIR__ entries were created although the originals were still there.

The errors occurred on different servers with different server software versions, 1.2.3 / 1.2.4. The clients that mainly used the volumes were different, too.

Sorry for the cynicism, but people here at my site are giving me a hard time, since I was the one who suggested AFS. Your help and suggestions are very welcome, as many at our institute are very concerned about these issues.
They even suggested moving back to NFS, because AFS seems not to be ready for a production environment!?

Thanks,
Ruby

--
Rubino Geiss, Universitaet Karlsruhe, IPD Goos
Postfach 6980, D-76128 Karlsruhe, GERMANY
Adenauerring 20a, 50.41 (AVG), Zi. 235
[EMAIL PROTECTED]
Tel: (+49) 721 / 608-8352 Fax: (+49) 721 / 30047
_______________________________________________
OpenAFS-info mailing list
[EMAIL PROTECTED]
https://lists.openafs.org/mailman/listinfo/openafs-info
