David R Boldt wrote:
>
> We use Solaris 10 SPARC exclusively for our AFS servers.
> After upgrading to 1.4.10 from 1.4.8 we had a very few
> volumes that started spontaneously going off-line, recovering,
> and then going off-line again until they needed to be salvaged.
>
> Hearing that this might be related to inode, we moved these
> volumes to a set of little use fileservers that were running
> namei at 1.4.10. It made no discernible difference.
>
> Two volumes in particular accounted for >90% of our off-line
> volume issues.
>
> FileLog:
> Mon Apr 27 10:56:09 2009 Volume 2023867468 now offline, must be salvaged.
> Mon Apr 27 10:56:15 2009 Volume 2023867468 now offline, must be salvaged.
> Mon Apr 27 10:56:15 2009 Volume 2023867468 now offline, must be salvaged.
> Mon Apr 27 10:56:22 2009 fssync: volume 2023867469 restored; breaking
> all call backs
> (restored vol above being R/O for R/W in need of salvage)

That's interesting: I saw similar behavior on some of our volumes,
though on AFS/OSD fileservers. I then made the ViceLog messages
more verbose and found that this always happened when IH_OPEN
failed. IH_OPEN can fail if the handle in the vnode is missing. To
prevent that, I added some lines to VGetVnode_r: when an already
existing vnode structure is found, they check whether the handle is
in place and, if not, do a new IH_INIT (and write a message to the
log).

I found about 100 such cases per day in our cell, but not all of them
would have ended in taking the volume off-line, because in many cases
the handle would never have been used (all the GetStatus RPCs).
Since then I have never again seen volumes going off-line.

Hartmut

>
> Both of the volumes most frequently impacted have content
> completely rewritten roughly every 20 minutes while being on
> an automated replication schedule of 15 minutes. One of them
> 25MB, the other 95MB, both at about 80% quota.
>
> We downgraded just the fileserver binary to 1.4.8 on all of
> our servers and have not seen a single off-line message in
> 36 hours.
>
>
> -- David Boldt
> <[email protected]>

--
-----------------------------------------------------------------
Hartmut Reuter                  e-mail [email protected]
                                phone  +49-89-3299-1328
                                fax    +49-89-3299-1301
RZG (Rechenzentrum Garching)    web    http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
