I guess these problems have nothing to do with your hardware.

Have you checked that the file the fileserver tries to read actually
exists in the /vicepX partition and has the correct size? If so, I
would try running the non-threaded version of the fileserver, which is
built in src/viced. The problem of the processes not going away also
looks to me like a problem in the pthread environment.
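A quick sanity check along those lines might look something like this
(the volume ID is the one from Marco's log message; <server> is a
placeholder, and on a Linux namei server the actual data files live
below /vicepa/AFSIDat rather than directly in /vicepa):

    ls -l /vicepa/V0536872685.vol     # is the volume header there?
    vos examine 536872685             # size/state the volserver reports
    bos salvage -server <server> -partition /vicepa \
        -volume 536872685 -showlog

For the non-threaded fileserver, install the binary built in src/viced
in place of the pthreaded one and restart the fs instance, e.g. with
"bos restart <server> fs".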
Hartmut

On Mon, 4 Mar 2002 [EMAIL PROTECTED] wrote:

> Marco Foglia ([EMAIL PROTECTED]) wrote:
>
> > We are having continuous trouble with CopyOnWrite failures on a
> > Linux file server and, as a result, orphaned files. We have tried
> > several setups and even changed hardware, but the problem is still
> > there. The current setup is Red Hat 6.2 with kernel 2.2.19-6.2.12smp
> > and OpenAFS 1.2.2.
> >
> > It only happens after a full backup (vos dump <volume>.backup
> > -time 0) of a backup volume. You can find the following message in
> > the file server log file:
> >
> >   CopyOnWrite failed: volume 536872685 in partition /vicepa (tried
> >   reading 8192, read 0, wrote 0, errno 4) volume needs salvage
> >
> > Does anybody have the same problems? Solutions?
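(A note for anyone trying to reproduce this: a "full backup of a
backup volume" in the sense above is the usual two-step sequence; the
volume name is a placeholder, and redirecting stdout is just one way
to capture the dump:

    vos backup <volume>                  # (re)create <volume>.backup
    vos dump <volume>.backup -time 0 > vol.dump  # -time 0 = full dump

The VolserLog "Recloning volume" lines in the excerpts below are the
clone step of exactly this cycle.)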
> I have been fighting this problem for almost a year with no solution.
> I didn't post anything to this list before because I wasn't convinced
> that the problem wasn't in my hardware. By now, I've tried enough
> different combinations that I'm convinced it's a software problem.
>
> My hardware: Dual 800 MHz Pentium IIIs on two different motherboards,
>              1 GB SDRAM
>              DPT Century VI RAID controller
>                or
>              Adaptec 3200S RAID controller
>                or
>              Mylex AcceleRAID 352 RAID controller
>                or
>              non-RAID Symbios SCSI controller
>
> My software: RedHat 6.2, kernel 2.2.16-3,
>              Transarc AFS Base configuration afs3.6 2.0
>                or
>              RedHat 6.2, kernel 2.2.16-3,
>              Transarc AFS Base configuration afs3.6 2.3
>                or
>              RedHat 6.2, kernel 2.2.16-3,
>              OpenAFS 1.0.3
>                or
>              [ many different versions in between ]
>                or
>              RedHat 7.2, kernel 2.4.9-21,
>              OpenAFS 1.2.3
>
> As you can see, I've tried many different hardware and software
> combinations and I still get corrupted volumes. It can happen on any
> backup, not just full backups. I have tried running the servers in
> uniprocessor mode, and while that seemed to help, it did not
> eliminate the problem.
>
> Here are some excerpts from the log files on the most recent
> corruption, using the last software configuration above:
>
> FileLog:
>   Thu Feb 28 15:47:19 2002 CopyOnWrite failed: volume 536877621 in
>     partition /vicepc (tried reading 8192, read 0, wrote 0, errno 4)
>     volume needs salvage
>   Thu Feb 28 15:58:46 2002 VAttachVolume: volume salvage flag is ON
>     for /vicepc//V0536877621.vol; volume needs salvage
>
> VolserLog:
>   Thu Feb 28 15:35:09 2002 1 Volser: Clone: Recloning volume
>     536876261 to volume 536876263
>   [...]
>   Thu Feb 28 15:58:46 2002 VAttachVolume: volume salvage flag is ON
>     for /vicepc/V0536877621.vol; volume needs salvage
>   Thu Feb 28 15:58:46 2002 1 Volser: ListVolumes: Could not attach
>     volume 536877621 (V0536877621.vol) error=101
>
> SalvageLog.old:
>   @(#) OpenAFS 1.2.3 built 2002-02-01
>   02/28/2002 16:00:48 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager
>     /vicepc 536877621)
>   02/28/2002 16:00:48 CHECKING CLONED VOLUME 536877622.
>   02/28/2002 16:00:48 cs.usr0.naveen.backup (536877622) updated
>     02/28/2002 15:30
>   02/28/2002 16:00:48 Vnode 1: length incorrect; (is 8192 should be 0)
>   02/28/2002 16:00:48 SALVAGING VOLUME 536877621.
>   02/28/2002 16:00:48 cs.usr0.naveen (536877621) updated 02/28/2002 15:46
>   02/28/2002 16:00:48 Vnode 1198: version < inode version; fixed (old status)
>   02/28/2002 16:00:48 Vnode 1514: version < inode version; fixed (old status)
>   02/28/2002 16:00:48 Vnode 2754: version < inode version; fixed (old status)
>   [ similar lines deleted ]
>   02/28/2002 16:00:48 Vnode 10078: version < inode version; fixed (old status)
>   02/28/2002 16:00:48 Vnode 1: length incorrect; changed from 8192 to 0
>   02/28/2002 16:02:19 First page in directory does not exist.
>   02/28/2002 16:02:19 Directory bad, vnode 1; salvaging...
>   02/28/2002 16:02:19 Salvaging directory 1...
>   02/28/2002 16:02:19 Failed to read first page of fromDir!
>   02/28/2002 16:02:19 Checking the results of the directory salvage...
>   02/28/2002 16:02:20 dir vnode 601: special old unlink-while-referenced
>     file .__afsD8D is deleted (vnode 1120)
>   02/28/2002 16:02:20 dir vnode 601: special old unlink-while-referenced
>     file .__afs8036 is deleted (vnode 780)
>   02/28/2002 16:02:20 dir vnode 623: special old unlink-while-referenced
>     file .__afsDED1 is deleted (vnode 7988)
>   [ similar lines deleted ]
>   02/28/2002 16:02:20 dir vnode 623: special old unlink-while-referenced
>     file .__afs4975 is deleted (vnode 5862)
>   02/28/2002 16:02:20 Vnode 1: link count incorrect (was 45, now 2)
>   02/28/2002 16:02:30 Found 4784 orphaned files and directories
>     (approx. 587713 KB)
>   02/28/2002 16:02:30 Salvaged cs.usr0.naveen (536877621): 4785 files,
>     587714 blocks
>
> SalvageLog:
>   @(#) OpenAFS 1.2.3 built 2002-02-01
>   02/28/2002 16:05:06 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager
>     /vicepc 536877621 -orphans attach)
>   02/28/2002 16:05:07 SALVAGING VOLUME 536877621.
>   02/28/2002 16:05:07 cs.usr0.naveen (536877621) updated 02/28/2002 15:46
>   02/28/2002 16:05:07 Attaching orphaned directory to volume's root
>     dir as __ORPHANDIR__.3.207280
>   02/28/2002 16:05:07 Attaching orphaned directory to volume's root
>     dir as __ORPHANDIR__.11.46709
>   02/28/2002 16:05:07 Attaching orphaned directory to volume's root
>     dir as __ORPHANDIR__.13.7
>   [ similar lines deleted ]
>   02/28/2002 16:05:07 Attaching orphaned file to volume's root dir as
>     __ORPHANFILE__.7846.198967
>   02/28/2002 16:05:07 Attaching orphaned file to volume's root dir as
>     __ORPHANFILE__.9320.219854
>   02/28/2002 16:05:07 Vnode 1: link count incorrect (was 2, now 45)
>   02/28/2002 16:05:07 Vnode 3: link count incorrect (was 2, now 3)
>   02/28/2002 16:05:07 Vnode 11: link count incorrect (was 1, now 2)
>   02/28/2002 16:05:07 Vnode 13: link count incorrect (was 7, now 8)
>   02/28/2002 16:05:07 Vnode 15: link count incorrect (was 2, now 3)
>   [ similar lines deleted ]
>   02/28/2002 16:05:07 Vnode 9320: link count incorrect (was 0, now 1)
>   02/28/2002 16:05:07 Salvaged cs.usr0.naveen (536877621): 4785 files,
>     587715 blocks
>   02/28/2002 16:05:07 The volume header file V0536877622.vol is not
>     associated with any actual data (deleted)
>
> The FileLog says "errno 4", which is EINTR (Interrupted System Call).
> Could this be a clue?
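One hedged observation on that errno: read() only sets errno when it
returns -1. "tried reading 8192, read 0" means read() returned 0,
i.e. end-of-file, and in that case errno 4 may simply be stale from an
earlier call. That would fit the salvager's "Vnode 1: length
incorrect" lines, where the data file is shorter than the vnode
claims. If a genuinely interrupted read is involved (LinuxThreads does
use signals such as SIGUSR1/SIGUSR2 internally, which fits the pthread
suspicion above), the defensive idiom is a retry loop. A sketch only,
not the fileserver's actual CopyOnWrite code:

    /* Illustrative restartable read; the names are made up for this
     * example and do not come from the OpenAFS source. */
    #include <errno.h>
    #include <unistd.h>

    static ssize_t read_full(int fd, char *buf, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = read(fd, buf + done, len - done);
            if (n == 0)              /* EOF: errno is NOT set here */
                break;
            if (n == -1) {
                if (errno == EINTR)  /* interrupted by a signal: retry */
                    continue;
                return -1;           /* genuine error */
            }
            done += (size_t)n;
        }
        return (ssize_t)done;
    }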
> We have a lot of older machines of many architectures here running
> older client software. Could that cause this problem?
>
> Is there anything I can do to instrument the servers to help find
> the root cause of the problem?
>
> Another problem that may or may not be related:
>
> On these same machines, the fileserver processes can get into a
> state such that a bos restart, shutdown, or stop will not kill them,
> and neither will a simple "kill <pid>". You must do a "kill -9" on
> each of the fileserver processes to make them go away. In the case
> of a bos restart, new fileserver processes are started without the
> old ones having been killed; the new processes fail, causing the
> salvager to start. When the salvager finishes, the fileserver
> processes are started again and fail again, leading to an endless
> fileserver-salvager-fileserver-salvager cycle. This has led us to
> disable the 4:00 AM Sunday restarts.
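(If anyone else needs to turn off that scheduled restart, the usual
way is bos setrestart; <server> is a placeholder:

    bos setrestart <server> -time never -general   # weekly restart off
    bos getrestart <server>                        # verify the times

The -general time is the weekly full restart; the restart after new
binaries are installed is set separately with -newbinary.)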
> Thanks in advance for your help.
>
> ---Bob.
> --
> Bob Hoffman, N3CVL  University of Pittsburgh        Tel: +1 412 624 8404
> [EMAIL PROTECTED]   Department of Computer Science  Fax: +1 412 624 8854

-----------------------------------------------------------------
Hartmut Reuter                    e-mail [EMAIL PROTECTED]
                                  phone  +49-89-3299-1328
RZG (Rechenzentrum Garching)      fax    +49-89-3299-1301
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------