A situation described below prompts a rather important general question: What rules do the OpenAFS tools use to contact one of the DB servers?
They seem different from those used by the cache clients, and I have not found where they are documented. There are (some) notes on how the DB servers handle "down" siblings, and on how the cache clients do.

I have a single-host test OpenAFS cell with 1.6.5.2, and I have added a second IP address to '/etc/openafs/CellServDB', with an existing DNS entry (just to be sure) but not assigned to any machine. Sometimes 'vos vldb' hangs for a while (105 seconds), making 8 attempts to connect to the "down" DB server; sometimes it connects to the "up" server and returns instantly. The wait times after each of the 8 attempts are: 3.6s, 6.8s, 13.2s, 21.4s, 4.6s, 25.4s, 26.2s, 3.8s.

My worry is that the OpenAFS tools handle "down" DB servers less resiliently than the DB servers and the cache clients do, as the situation described below seems to suggest, and this can have dismaying consequences.

Context: a "typical" cell with 3 DB servers, a few fileservers, and a few clients, one of which does backups in various ways, typically with 'vos dump -clone'. Each of > 100 AFS-volumes (usually largish, dozens to hundreds of GiB) is incrementally dumped every day, so there is a 'vos dump -clone' every 10-15 minutes.

We were upgrading the cell servers from 1.4 to 1.6 (Debian), with all 3 DB servers #1, #2, #3 being replaced by servers with a new OS and, importantly, new IP addresses. The plan was to do this incrementally: add a new DB server to the 'CellServDB', start it up, remove an old DB server, and so on until all 3 had been replaced in turn with new DB servers #4, #5, #6.

At some point during this slow incremental plan there were 4 entries in both 'CellServDB' files, and the new server had not been started up yet, and would not be for a couple of days.

The OpenAFS client caches seemed to cope well, as expected in a cell with a "quorum" of 3 "up" DB servers and 1 "down". I think the only consequence I noticed was 'aklog' sometimes taking around 15 seconds.
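As a quick check on the test-cell timings above, here is a small Python sketch that adds up the reported wait times and estimates how often a tool would land on the "down" entry first. The uniform-random choice of entry is purely an assumption for illustration; I have not found it documented, and the helper name is made up.

```python
# Wait times (seconds) reported above for the 8 attempts to reach
# the "down" DB server in the single-host test cell.
waits = [3.6, 6.8, 13.2, 21.4, 4.6, 25.4, 26.2, 3.8]

# Round to one decimal to hide floating-point noise in the sum.
total = round(sum(waits), 1)
print(f"attempts: {len(waits)}, total wait: {total} s")  # total wait: 105.0 s


def chance_of_hitting_down_server(entries: int, down: int) -> float:
    """Hypothetical helper.

    ASSUMPTION (not confirmed by any documentation): the tool picks its
    first CellServDB entry uniformly at random.
    """
    return down / entries


# Test cell: 2 entries in CellServDB, 1 of them "down".
print(chance_of_hitting_down_server(2, 1))  # 0.5
# Production cell mid-migration: 4 entries, 1 of them "down".
print(chance_of_hitting_down_server(4, 1))  # 0.25
```

The individual waits do add up to the 105 seconds observed. If the random-choice assumption holds, roughly a quarter of the 'vos' runs in the 4-entry production cell would start with the "down" server.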
However *some* backups started to hang and some AFS-volumes became inaccessible to all clients. The fairly obvious cause was that the cloning transaction, instead of being very quick, would not end, and cloning locks the AFS-volume.

An 'strace' of the relevant 'vos' instances showed repeated attempts (for a very long time) to contact the 1 "down" DB server. Some of the instances of 'vos dump -clone' seemed to contact one of the 3 "up" DB servers and had no issues.

The backup server regrettably has a 1.4.7 client cache package (soon to be upgraded to 1.6.x). Perhaps newer packages have some different logic, but it seemed as if 'vos' would choose an entry from '/etc/openafs/CellServDB' at random and then stick with it even if it did not respond to a connection attempt. Curiously, it also attempted to open "$HOME/.AFSSERVER" (which did not exist); the 1.6.5.2 'vos' also tries to open "/.AFSSERVER".

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
