On Mon, 28 Oct 2013 10:32:44 -0400 Kendrick Hernandez <[email protected]> wrote:
> In order to minimize downtime, I'd like to run vldb_check on an
> offline copy of the vldb, but I'm wondering if the vlserver needs to
> be completely shutdown prior to making the copy, or if read-only mode
> would be good enough?

This "read-only mode" (basically, "not having quorum") is about as safe
as just shutting down the server processes, yes. The sync site will not
go into "read-only mode" until about a minute after the other two
servers are shut down, but I don't think you need to actually wait for
that. If the other two servers are down, the sync site won't be able to
commit any data, so the database should be unchanging and it should be
"safe" for copying.

However, neither this nor shutting down the servers is completely
guaranteed to be safe. If the sync site is in the middle of committing
data to the database when you do this, you will get a vldb.DB0 file that
has some data partially written to it. There are some ways to deal with
that, but I'm not sure if there's a completely robust solution for what
you're talking about:

If you SIGSTOP the process, copy both vldb.DB0 and vldb.DBSYS1 out, and
then SIGCONT the process, that will give you enough information to
reconstruct a valid database. The DBSYS1 file is a journal log, though,
and I don't think we have any tooling to manually replay the log into
the DB0 file. (Maybe we should; that would solve this pretty easily, I
think.)

Without replaying the log, though, you can just check whether the DBSYS1
log is "empty". If it is, the corresponding DB0 file is definitely fine,
and you can just use that. If it's not, you can throw away the copied
files and try again. But if the other two sites are shut down, we may be
in the middle of a write transaction that will not complete until we
time out, so it can be difficult to detect a false positive. The DBSYS1
file is "empty", I think, if it's 64 bytes full of 0s; but there may be
other ways it can appear "empty".

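To make the SIGSTOP / copy / SIGCONT idea concrete, here is a rough
sketch in Python. It is not an existing OpenAFS tool; the function names
are made up for illustration, the pid and directories are whatever they
are at your site, and the "empty" test is only my guess from above
(exactly 64 bytes, all zero), so a log that is empty in some other way
will be reported as not empty.

```python
import os
import shutil
import signal
from pathlib import Path

DB_FILE = "vldb.DB0"
LOG_FILE = "vldb.DBSYS1"
EMPTY_LOG_BYTES = 64  # assumption: an "empty" log is 64 zero bytes


def log_looks_empty(log_path):
    """Return True if the file is exactly 64 bytes and every byte is 0."""
    data = Path(log_path).read_bytes()
    return len(data) == EMPTY_LOG_BYTES and data == bytes(EMPTY_LOG_BYTES)


def snapshot_db(pid, db_dir, dest_dir):
    """Pause process `pid`, copy DB0 and DBSYS1 out, then resume it.

    Returns True if the copied log looks empty, meaning the copied DB0
    can be used as-is; False means discard the copies and try again.
    """
    src = Path(db_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    os.kill(pid, signal.SIGSTOP)  # freeze the server so nothing changes mid-copy
    try:
        for name in (DB_FILE, LOG_FILE):
            shutil.copy2(src / name, dest / name)
    finally:
        os.kill(pid, signal.SIGCONT)  # always resume, even if a copy failed

    return log_looks_empty(dest / LOG_FILE)
```

You would call snapshot_db() with the vlserver pid and retry after a
short pause whenever it returns False. It needs to run as a user allowed
to signal the vlserver process.
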
Another way of getting a robust vldb.DB0 copy is using the proposed
ubik_cp tool <http://gerrit.openafs.org/#change,9700>. I don't think
that will work, though, if two of the sites are shut down and we're in
the middle of a write transaction. I could be wrong about that, though;
I haven't tried.

Also, all of this possible corruption is probably pretty rare. If you
don't care so much about 100% guarantees and such, you can probably just
copy the database twice, waiting a few seconds between each copy, and if
the copies are the same, it's really likely that they're fine. That is
especially true if two of the sites are shut down, since we should have
at most 1 in-progress write.

> Then I'd make a copy of vldb.db0, run 'vldb_check -fix' on the copy,
> hand-propagate that out to the remaining sites, and then bring down
> the vlserver on the lowest ip site just before moving the fixed copy
> into place. I could then bring up the vlserver on the lowest ip site,
> and the remaining sites.

If 'vldb_check -fix' bumped the database version number, you would be
able to just install the new db on one site and let the sites
synchronize themselves. I don't think it does that (yet), so yeah, you
should hand-propagate the files. That's faster, anyway.

-- 
Andrew Deason
[email protected]
