It appears that vos backupsys (run nightly from a bos cron job) periodically
fails silently for a volume. I'm curious whether others have seen the same issue.

We recently updated some of our AFS data-gathering and internal health scripts
to report on volumes that have not had a recent .backup volume created. This has
mostly been a win, as it finds volumes that were left in a locked state and the
like. However, it has also uncovered another problem, where vos backupsys seems
to miss a volume occasionally.

Environment - running Linux From Scratch, OpenAFS 1.4.12. We have 30 servers
with about 275,000 RW volumes, i.e., not counting .backup and .readonly volumes.

Bos config for backupsys:

  bnode cron snapshot 1
  parm /usr/sbin/vos backupsys -se localhost -localauth
  parm 18:00
  end
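
For reference, a cron bnode like that can be created with bos create. A sketch
of the equivalent command, assuming a server named fs1 (a placeholder; the
instance name and parameters mirror the BosConfig entry above):

  # create a cron bnode that runs vos backupsys daily at 18:00
  bos create fs1 snapshot cron \
    -cmd "/usr/sbin/vos backupsys -se localhost -localauth" "18:00" \
    -localauth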

Our new volume health checker was implemented near the end of May. It runs
about 14 hours after the backupsys job. If it sees a volume whose .backup is
older than 36 hours, it reports the stale backup as

   g.presoff.smartgl: 1 days, 12 hrs, 32 min
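
The core of the check is nothing fancy. A minimal sketch of the staleness test,
assuming vos examine -format output where the creationDate field carries the
Unix timestamp of the backup clone (our real script does more bookkeeping and
reporting):

  #!/bin/sh
  # report volumes whose .backup clone is older than 36 hours
  limit=$((36 * 3600))
  now=$(date +%s)
  for vol in "$@"; do
    # -format gives machine-readable output; for a .backup volume,
    # creationDate is when the clone was last (re)created
    created=$(vos examine "$vol.backup" -format -localauth 2>/dev/null \
              | awk '$1 == "creationDate" { print $2 }')
    [ -z "$created" ] && { echo "$vol: no .backup found"; continue; }
    age=$((now - created))
    [ "$age" -gt "$limit" ] && echo "$vol: .backup is $((age / 3600)) hours old"
  done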

This particular volume first came up on June 7; vos examine -fo confirmed that
the .backup was older than expected. We did a manual vos backup on it and it was
fine for a few days. It occurred again on the same volume on June 12, and we did
a manual backup again. On June 19 it hit again. This time we deliberately didn't
do a manual backup, and the next day it reported being an additional 24 hours
out of date, so we forced a backup. On June 22 it again reported being out of
date; we let it go, and it came up again on the 23rd. At that point we vos moved
it from one partition to another on the same server, then back to the original
partition. The next day it once again had not gotten a .backup, but the day
after that it did (over a weekend). On July 5 it again had a stale .backup. We
then moved it to another server, and it has not misbehaved since. We have had
several other volumes generate similar errors, though none this persistent.
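
For completeness, the manual interventions were just the obvious vos commands.
A sketch, with fs1 and /vicepa//vicepb as placeholder server and partition
names:

  # force a fresh .backup clone by hand
  vos backup g.presoff.smartgl -localauth

  # the move-and-back workaround: shuffle the volume to another
  # partition on the same server, then back again
  vos move g.presoff.smartgl fs1 /vicepa fs1 /vicepb -localauth
  vos move g.presoff.smartgl fs1 /vicepb fs1 /vicepa -localauth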

The three volumes involved were on three different file servers at the time the
.backup volumes were not created. One was migrated to a different server and
ceased being a problem; another threw only one error and has not complained
since. All three of the servers involved were of the same type (hardware
configuration); there are four others of that type active, none of which have
shown any errors.

Through all of this, no entries in BosLog, FileLog, or VolserLog showed any
reference to the volume aside from normal access activity. The volumes were not
locked, and normal backups (vos dump) worked without errors. No kernel or
daemon messages were generated about disk problems.
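
A full dump to the null device is a cheap way to confirm the volserver can
still read the volume end to end, e.g.:

  # full dump to /dev/null as a read-everything sanity check
  vos dump g.presoff.smartgl -time 0 -file /dev/null -localauth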

Anybody seen anything like this on their systems?

Steve