It appears that our nightly vos backupsys run (scheduled via a bos cron job) periodically fails silently for a volume. I'm curious whether others have seen the same issue.
We recently updated some of our AFS data-gathering and internal health scripts to report on volumes that have not had a recent .backup volume created. This has mostly been a win, as it finds volumes that were left in a locked state and so on. However, it has also uncovered another problem: vos backupsys seems to miss a volume occasionally.

Environment: Linux From Scratch, OpenAFS 1.4.12. We have 30 servers with about 275,000 RW volumes (i.e., not counting .backup and .readonly volumes).

The BosConfig entry for backupsys:

    bnode cron snapshot 1
    parm /usr/sbin/vos backupsys -se localhost -localauth
    parm 18:00
    end

Our new volume health checker went into service near the end of May. It runs about 14 hours after backupsys; if it sees a volume whose .backup is older than 36 hours, it reports the stale backup like this:

    g.presoff.smartgl: 1 days, 12 hrs, 32 min

This particular volume first came up on June 7. vos examine -fo verified that the .backup was older than expected. We did a manual vos backup on it and it was fine for a few days. The same volume came up again on June 12, and we did a manual backup again. On June 19 it hit again; this time we deliberately did not do a manual backup, and the next day it reported being an additional 24 hours out of date, so we forced a backup. On June 22 it was again reported out of date; we let it go and it came up once more on the 23rd. At that point we vos moved it from one partition to another on the same server, then back to the original partition. The next day it once again had not gotten a .backup, but the day after that it did (over the weekend). On July 5 it again had an old .backup. We then moved it to another server, and it has not shown the problem since.

We have had several other volumes generate similar errors, though none this persistent. The three volumes involved were on three different file servers at the time the .backup volumes were not created.
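For reference, the core of the 36-hour staleness check can be sketched roughly as below. This is not our actual script, and the field names it assumes in `vos examine -format` output (name, backupDate as a Unix timestamp) should be verified against your own vos build:

```python
#!/usr/bin/env python
# Sketch of a staleness check like the one described above: parse the
# output of `vos examine <vol> -format` and flag a volume whose .backup
# clone is older than a threshold.  The field names (name, backupDate)
# are assumptions -- check them against your own `vos` version.

STALE_AFTER = 36 * 3600  # seconds; the 36-hour rule described above

def parse_examine(text):
    """Turn `vos examine -format` output into a dict of field -> value."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields

def stale_report(fields, now):
    """Return a report line like 'vol: 1 days, 12 hrs, 32 min',
    or None if the .backup is recent enough."""
    age = now - int(fields["backupDate"])
    if age <= STALE_AFTER:
        return None
    days, rest = divmod(age, 86400)
    hrs, rest = divmod(rest, 3600)
    mins = rest // 60
    return "%s: %d days, %d hrs, %d min" % (fields["name"], days, hrs, mins)

# Example with canned output (real use would run
# `vos examine g.presoff.smartgl -format` and feed stdout in):
sample = "name\tg.presoff.smartgl\nbackupDate\t1000000\n"
fields = parse_examine(sample)
print(stale_report(fields, 1000000 + 36 * 3600 + 2 * 60))
# -> g.presoff.smartgl: 1 days, 12 hrs, 2 min
```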
One of them was migrated to a different server and ceased being a problem; one threw only a single error and has not complained again. All three of the servers involved were of the same type (hardware configuration); there are four others of that type active, none of which have shown any errors.

Through all of this, no entries in BosLog, FileLog, or VolserLog showed any reference to the volume aside from normal access activity. The volumes were not locked, and normal backups (vos dump) worked without errors. No kernel or daemon messages were generated about disk problems.

Has anybody seen anything like this on their systems?

Steve

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
