We use Solaris 10 SPARC exclusively for our AFS servers.
After upgrading from 1.4.8 to 1.4.10, a small number of
volumes started spontaneously going off-line, recovering,
and then going off-line again until they needed to be salvaged.

Hearing that this might be related to the inode fileserver, we
moved these volumes to a set of little-used fileservers that were
running namei at 1.4.10. It made no discernible difference.

Two volumes in particular accounted for >90% of our off-line 
volume issues.

FileLog:
Mon Apr 27 10:56:09 2009 Volume 2023867468 now offline, must be salvaged.
Mon Apr 27 10:56:15 2009 Volume 2023867468 now offline, must be salvaged.
Mon Apr 27 10:56:15 2009 Volume 2023867468 now offline, must be salvaged.
Mon Apr 27 10:56:22 2009 fssync: volume 2023867469 restored; breaking all call backs
(The restored volume above is the R/O for the R/W volume in need of salvage.)
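
For anyone wanting to tally these messages per volume, here is a minimal sketch in Python. It assumes the off-line messages look exactly like the FileLog lines quoted above; the name count_offline is only illustrative and is not part of OpenAFS.

```python
import re
from collections import Counter

# Pattern assumed from the FileLog excerpt above; other releases
# may word the message differently.
OFFLINE_RE = re.compile(r"Volume (\d+) now offline, must be salvaged")


def count_offline(lines):
    """Return a Counter mapping volume ID to its number of off-line messages."""
    counts = Counter()
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The three off-line lines and the fssync line quoted above.
sample = [
    "Mon Apr 27 10:56:09 2009 Volume 2023867468 now offline, must be salvaged.",
    "Mon Apr 27 10:56:15 2009 Volume 2023867468 now offline, must be salvaged.",
    "Mon Apr 27 10:56:15 2009 Volume 2023867468 now offline, must be salvaged.",
    "Mon Apr 27 10:56:22 2009 fssync: volume 2023867469 restored; breaking all call backs",
]

# Most frequently affected volumes first.
for volume, n in count_offline(sample).most_common():
    print(volume, n)
```

Passing an open FileLog file object instead of the sample list should work the same way, since the function only iterates over lines.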

Both of the most frequently affected volumes have their content
completely rewritten roughly every 20 minutes and are on an
automated replication schedule of 15 minutes. One is 25MB,
the other 95MB, and both are at about 80% of quota.

We downgraded just the fileserver binary to 1.4.8 on all of 
our servers and have not seen a single off-line message in 
36 hours.


                                         -- David Boldt
                                         <[email protected]>
