-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/11/2011 06:32 PM, Andrew Deason wrote:
> On Mon, 19 Sep 2011 20:22:17 +0200
> Torbjörn Moa <[email protected]> wrote:
>
>>>> : No such device
>>>> Volume does not exist on server sysafs2.physto.se as indicated by the VLDB
>>>
>>> What version? Some things used to have problems with volume IDs over
>>> 2147483648 but I thought we've fixed them all by now.
>>
>> On this particular node we run 1.4.6, but it varies between servers.
>
> I lied, this bug still exists. At least, it does for me on a 32-bit x86
> host. What platform was this? Through a quirk of atol/atoi it doesn't
> seem to be a problem on amd64 for me, which is probably why I thought it
> wasn't a problem. (gerrit 5594, bug 130266)
>
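
One plausible reading of the atol/atoi remark, sketched under stated assumptions: if atol() saturates at LONG_MAX for out-of-range input (as strtol() does; the C standard leaves the atol() case undefined), then a volume ID above 2^31 - 1 survives on a platform with a 64-bit long but gets clamped on one with a 32-bit long. The helper names below are hypothetical and this only models the arithmetic, not the actual OpenAFS code path.

```python
def parse_like_atol(text: str, long_bits: int) -> int:
    """Model atol() for a platform whose long has `long_bits` bits.

    Assumption: out-of-range input saturates at LONG_MIN/LONG_MAX,
    which is what strtol() does.
    """
    lo = -(1 << (long_bits - 1))
    hi = (1 << (long_bits - 1)) - 1
    return max(lo, min(hi, int(text)))


def as_volume_id(value: int) -> int:
    """Store the parsed value in an unsigned 32-bit volume ID."""
    return value & 0xFFFFFFFF


# An example ID above 2^31 - 1 (2147483647).
vol_id = "2267649774"

id_on_32bit = as_volume_id(parse_like_atol(vol_id, 32))  # long is 32 bits
id_on_64bit = as_volume_id(parse_like_atol(vol_id, 64))  # long is 64 bits

print("32-bit long:", id_on_32bit)
print("64-bit long:", id_on_64bit)
```

Under these assumptions the 32-bit case ends up looking for a different volume ID than the one requested, which would be consistent with a "Volume does not exist" error appearing on 32-bit hosts only.
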
Ouff! Good that I didn't panic-update my servers then... All my servers
are 32-bit.

>>> Something bumped the "max volume id" counter in the vldb by a large
>>> number. This could happen in many different ways... unfortunately, if
>>> you don't have the logging level turned up in the vlserver or have
>>> audit logs turned on, it's going to be difficult to determine what
>>> did it. Do you run any kind of periodic checking for consistency of
>>> volumes vs vldb or anything like that?
>>
>> Hmmm, yes we do. We have a nagios check running on all servers that
>> does a "vos syncserv "$server" -d" and "vos syncvldb "$server"
>> -dryrun" periodically. I guess you are implying I shouldn't do that...
>
> No, I don't mean to say that, but it's a possible cause. The -dryrun
> option to these does not currently prevent "vos" from raising the max
> volume id in the database. That's a bug, but it's what they currently
> do. It doesn't even print out anything when it does this, so you
> wouldn't know when it happened. (bug 130267)
>

OK, the nagios checks are still running, and the problem is back again.
The max volume id is now 2267649774. Stupidly, I didn't keep a constant
watch on it after we reset it manually.

So, mainly as a test, I will disable the nagios checks, manually reset
the maxvolid again, and then keep watching it. If it hasn't moved after
a couple of days or so, I may run the syncvldb and syncserv checks
manually, one by one, server by server, and see what happens. Unless you
have some other suggestion.

For me the top priority is to find out what causes this. The problem is
not really that nothing _reports_ that it is bumping maxvolid, but
rather that it _gets_ bumped in the first place.
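
Watching maxvolid could be automated along these lines. This is only a sketch: it assumes the summary line that "vldb_check -vheader" prints has the wording shown in the output in this message, and it takes that text as a string instead of running any command. The function name and the returned fields are hypothetical.

```python
import re

INT32_MAX = 2**31 - 1  # 2147483647, the limit that matters on 32-bit hosts

# Assumed wording of the vldb_check summary line.
SUMMARY = re.compile(
    r"Header's maximum volume id is (\d+) "
    r"and largest id found in VLDB is (\d+)"
)


def check_max_volume_id(report: str) -> dict:
    """Compare the header's max volume ID with the largest ID in use."""
    match = SUMMARY.search(report)
    if match is None:
        raise ValueError("summary line not found in vldb_check output")
    header_max = int(match.group(1))
    largest_found = int(match.group(2))
    return {
        "header_max": header_max,
        "largest_found": largest_found,
        # How far the counter has run ahead of any real volume ID.
        "gap": header_max - largest_found,
        # True when the counter has passed the signed 32-bit range.
        "above_int32": header_max > INT32_MAX,
    }


sample = (
    "Header's maximum volume id is 2267649774 "
    "and largest id found in VLDB is 536936453"
)
result = check_max_volume_id(sample)
print(result)
```

Recording the result with a timestamp on each run would show when the counter jumps, which can then be matched against whatever ran at that time.
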
Here's the output from "vldb_check -database vldb.DB0 -vheader" on one
of the vldb servers:

- --
Ubik header size is 0 (should be 64)
vldb header
   vldbversion      = 4
   headersize       = 132120 [actual=132120]
   freePtr          = 0x889ec
   eofPtr           = 559744
   allocblock calls = 3055026176
   freeblock calls  = 2769092608
   MaxVolumeId      = 2267649774
   rw vol entries   = 0
   ro vol entries   = 0
   bk vol entries   = 0
   multihome info   = 0x20418 (132120)
   server ip addr table: size = 255 entries
   volume name hash table: size = 8191 buckets
   volume id hash table: 3 tables with 8191 buckets each
Header's maximum volume id is 2267649774 and largest id found in VLDB is 536936453
Scanning 3783 entries for possible repairs
- --

Running "vos listvol" on all file servers and sorting the output, I find
the largest volume ID existing on any server is actually 536936451 (a RW
volume), which is consistent with what's in VLDB. So there wouldn't be a
reason for syncvldb (or anyone else for that matter) to bump maxvolid at
all, would there?

Cheers,
Torbjörn
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk6Wt7UACgkQ0PwHef/zquApRQCfUKSL2j73aNE8WJecqllzUjL+
1+IAn0dKiZ5jgUTVQllRMEafqxWlsIZl
=i94P
-----END PGP SIGNATURE-----
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
