Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools
The problem is that you want the client to scan quickly to find a server that is up, but because networks are not perfectly reliable and drop packets all the time, it cannot know that a server is not up until that server has failed to respond to multiple retransmissions of the request. Those retransmissions cannot be sent quickly; in fact, they _must_ be sent with exponentially-increasing backoff times. Otherwise, when your network becomes congested, the retransmission of dropped packets will act as a runaway positive feedback loop, making the congestion worse and saturating the network.

You are completely right if one must talk to that server. But I think that AFS/RX sometimes hangs too long waiting for one server instead of trying the next one, for example for questions that could be answered by any VLDB server. I'm thinking of operations like group membership and volume location.

Harald.
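To make the trade-off concrete, here is a minimal sketch of how an exponential backoff schedule stretches the time needed to decide that a silent server is down. This is not Rx's actual retransmission code; the base timeout, multiplier, and cap are made-up illustrative values.

/* Illustrative only: sums an exponential backoff schedule to show how
 * long a client must wait before it can declare an unresponsive server
 * dead.  The constants are assumptions, not Rx's real parameters. */
#include <stdio.h>

int main(void)
{
    double timeout = 1.0;   /* first retransmission after 1s (assumed) */
    double cap = 30.0;      /* ceiling on any single interval (assumed) */
    double total = 0.0;
    int attempts = 8;       /* e.g. the 8 retransmissions observed for 'vos' */

    for (int i = 0; i < attempts; i++) {
        printf("retry %d after %5.1fs (%.1fs elapsed)\n", i + 1, timeout, total);
        total += timeout;
        timeout *= 2.0;                 /* exponential backoff */
        if (timeout > cap)
            timeout = cap;
    }
    printf("server declared down after ~%.0fs\n", total);
    return 0;
}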
Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools
On 24 Jan 2014, at 07:48, Harald Barth h...@kth.se wrote: You are completely right if one must talk to that server. But I think that AFS/RX sometimes hangs too long waiting for one server instead of trying the next one, for example for questions that could be answered by any VLDB server. I'm thinking of operations like group membership and volume location.

I have long thought that we should be using multi for vldb lookups, specifically to avoid the problems with down database servers. The problem is that doing so may cause issues for sites that have multiple dbservers for scalability, rather than redundancy. Instead of each dbserver seeing a third (or a quarter, or ...) of requests, it will see them all. Even if the client aborts the remaining calls when it receives the first response, the likelihood is that the other servers will already have received, and responded to, the request.

There are ways we could be more intelligent (for example, measuring the normal RTT of an RPC to the current server, and only doing a multi if that is exceeded), but we would have to be very careful that this wouldn't amplify a congestive collapse.

Cheers,

Simon
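A rough sketch of the heuristic Simon describes, assuming a smoothed-RTT estimator of my own choosing (an EWMA, similar in spirit to TCP's SRTT) rather than anything Rx actually keeps; the threshold multiplier is likewise an assumption:

/* Sketch: track a smoothed RTT for the preferred dbserver and fan out
 * to a multi call only when the outstanding request has already waited
 * noticeably longer than that.  Constants and structure are illustrative. */
#include <stdio.h>

struct rtt_estimator {
    double srtt_ms;   /* exponentially weighted moving average of RTT */
};

static void rtt_update(struct rtt_estimator *e, double sample_ms)
{
    const double alpha = 0.125;              /* TCP-style smoothing gain */
    if (e->srtt_ms <= 0.0)
        e->srtt_ms = sample_ms;
    else
        e->srtt_ms += alpha * (sample_ms - e->srtt_ms);
}

/* Fan out only when the call has waited well past the usual RTT. */
static int should_go_multi(const struct rtt_estimator *e, double waited_ms)
{
    const double factor = 4.0;               /* assumed threshold multiplier */
    return e->srtt_ms > 0.0 && waited_ms > factor * e->srtt_ms;
}

int main(void)
{
    struct rtt_estimator e = { 0.0 };
    rtt_update(&e, 2.0);
    rtt_update(&e, 3.0);
    printf("srtt=%.2fms, waited 5ms  -> multi? %d\n", e.srtt_ms, should_go_multi(&e, 5.0));
    printf("srtt=%.2fms, waited 50ms -> multi? %d\n", e.srtt_ms, should_go_multi(&e, 50.0));
    return 0;
}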
Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools
For example in an ideal world putting more or less DB servers in the client 'CellServDB' should not matter, as long as one that belongs to the cell is up; again, if the logic were, for all types of client: scan quickly the list of potential DB servers, find one that is up, belongs to the cell and reckons it is part of the quorum, and if necessary get from it the address of the sync site.

The problem is that you want the client to scan quickly to find a server that is up, but because networks are not perfectly reliable and drop packets all the time, it cannot know that a server is not up until that server has failed to respond to multiple retransmissions of the request.

That has nothing to do with how quickly the probes are sent...

Those retransmissions cannot be sent quickly; in fact, they _must_ be sent with exponentially-increasing backoff times.

That has nothing to do with how quickly they can be sent... The duration of the intervals between the probes is a different matter from what the ratio between successive intervals should be.

Otherwise, when your network becomes congested, the retransmission of dropped packets will act as a runaway positive feedback loop, making the congestion worse and saturating the network.

I am sorry I have not been clear about the topic: I was not meaning to discuss flow control in back-to-back streaming connections; my concern was about the frequency of *probing* servers for accessibility. Discovering the availability of DB servers is not the same thing as streaming data from/to a fileserver, both in nature and in the amount of traffic involved. In TCP congestion control, for example, one could be talking about streams of 100,000x 8192B packets per second; DB server discovery involves nothing like that volume of traffic.

But even if I had meant to discuss back-to-back streaming packet congestion control, the absolute numbers are still vastly different. In the case of *probing* for the liveness of a *single* DB server I have observed the 'vos' command send packets with these intervals: «The wait times after the 8 attempts are: 3.6s, 6.8s, 13.2s, 21.4s, 4.6s, 25.4s, 26.2s, 3.8s.» with randomish variations around that. That's around 5 packets per minute, with intervals between 3,600ms and 26,200ms. Again, to a single DB server, not, say, round-robin to all DB servers in 'CellServDB'. Compare with TCP congestion control, which backs off (the 'RTO' parameter) for 200ms (two hundred milliseconds).

With another, rather different, distributed filesystem, Lustre, I observed some issues with that very long backoff time on high-throughput (600-800MB/s) back-to-back packet streams, and there is a significant amount of research showing that on fast, low-latency links even a 200ms RTO seems way excessive. For example, in a paper that is already 5 years old: http://www.cs.cmu.edu/~dga/papers/incast-sigcomm2009.pdf

«Under severe packet loss, TCP can experience a timeout that lasts a minimum of 200ms, determined by the TCP minimum retransmission timeout (RTOmin). While the default values operating systems use today may suffice for the wide-area, datacenters and SANs have round trip times that are orders of magnitude below the RTOmin defaults (Table 1).

  Scenario     RTT      OS        TCP RTOmin
  WAN          100ms    Linux     200ms
  Datacenter   1ms      BSD       200ms
  SAN          0.1ms    Solaris   400ms

Table 1: Typical round-trip-times and minimum TCP retransmission bounds.»

«FINE-GRAINED RTO: How low must the RTO be to retain high throughput under TCP incast collapse conditions, and to how many servers does this solution scale? We explore this question using real-world measurements and ns-2 simulations [26], finding that to be maximally effective, the timers must operate on a granularity close to the RTT of the network—hundreds of microseconds or less.»

«Figure 3: Experiments on a real cluster validate the simulation result that reducing the RTOmin to microseconds improves goodput.»

«Aggressively lowering both the RTO and RTOmin shows practical benefits for datacenters. In this section, we investigate if reducing the RTOmin value to microseconds and using finer granularity timers is safe for wide area transfers. We find that the impact of spurious timeouts on long, bulk data flows is very low – within the margins of error – allowing RTO to go into the microseconds without impairing wide-area performance.»
[OpenAFS] Re: DB servers quorum and OpenAFS tools
On Thu, 23 Jan 2014 21:55:15 + p...@afs.list.sabi.co.uk (Peter Grandi) wrote: Otherwise, when your network becomes congested, the retransmission of dropped packets will act as a runaway positive feedback loop, making the congestion worse and saturating the network. I am sorry I have not been clear about the topic: I was not meaning to discuss flow control in back-to-back streaming connections; my concern was about the frequency of *probing* servers for accessibility.

That's the only way we have to probe whether a server is up: by creating a streaming connection and issuing an RPC. How else do you propose we contact a server? The retransmissions that jhutz is talking about are for a single probe request. We can possibly run those probes in parallel, which is getting discussed a bit in another part of the thread, but that still involves going through this process.

-- Andrew Deason adea...@sinenomine.net
Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools
I have long thought that we should be using multi for vldb lookups, specifically to avoid the problems with down database servers.

The situation is a little bit different for cache managers, which can remember which servers are down, and command line tools, which normally discover how the world looks on each startup. If the 'we ask everyone' strategy is not used all the time but only on startup, it will not happen that frequently. Probably not frequently enough to cause problems for the scalability folks.

Harald.
Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools
On Fri, 2014-01-24 at 08:01 +, Simon Wilkinson wrote: On 24 Jan 2014, at 07:48, Harald Barth h...@kth.se wrote: You are completely right if one must talk to that server. But I think that AFS/RX sometimes hangs too long waiting for one server instead of trying the next one, for example for questions that could be answered by any VLDB server. I'm thinking of operations like group membership and volume location.

I have long thought that we should be using multi for vldb lookups, specifically to avoid the problems with down database servers. The problem is that doing so may cause issues for sites that have multiple dbservers for scalability, rather than redundancy. Instead of each dbserver seeing a third (or a quarter, or ...) of requests, it will see them all. Even if the client aborts the remaining calls when it receives the first response, the likelihood is that the other servers will already have received, and responded to, the request. There are ways we could be more intelligent (for example, measuring the normal RTT of an RPC to the current server, and only doing a multi if that is exceeded), but we would have to be very careful that this wouldn't amplify a congestive collapse.

The thing is, the OP specifically wasn't complaining about the behavior of the CM, which remembers when a vlserver is down and then doesn't talk to it again until it comes up, except for the occasional probe. The problem is the one-off clients that make _one RPC_ and then exit. They have no opportunity to remember what didn't work last time.

It might help some for these sorts of clients to use multi, if they're doing read-only requests, and probably wouldn't create much load. However, for a call that results in a ubik write transaction, I'm not entirely sure it's desirable to do a multi call. That will require some additional thought.

In the meantime, another thing that might be helpful is for clients about to make such an RPC to query the CM's record of which servers are up, and use that to decide which server to contact. A quick VIOCCKSERV with the fast flag set could make a big difference.

-- Jeff
Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools
On Fri, 2014-01-24 at 11:41 -0500, Jeffrey Hutzelman wrote: The problem is the one-off clients that make _one RPC_ and then exit. They have no opportunity to remember what didn't work last time.

Has it been considered to write a cache file somewhere (even a user dotfile) that could be used if it's not stale?

-- brandon s allbery kf8nh sine nomine associates allber...@gmail.com ballb...@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net
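A minimal sketch of what such a cache might look like, assuming an invented file name and format (a timestamp line followed by the addresses believed down, ignored once older than a chosen staleness limit); nothing here is an existing OpenAFS interface:

/* Sketch of a per-user "down servers" cache for one-shot tools.
 * File name, format, and staleness window are all assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CACHE_FILE   ".afs_down_dbservers"   /* hypothetical dotfile */
#define STALE_AFTER  300                     /* seconds; assumed */

/* Write the current list of servers believed down. */
static int save_down_list(const char *path, char **addrs, int n)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%ld\n", (long)time(NULL));   /* first line: timestamp */
    for (int i = 0; i < n; i++)
        fprintf(f, "%s\n", addrs[i]);
    fclose(f);
    return 0;
}

/* Return 1 if addr was recorded as down recently enough to trust. */
static int recently_down(const char *path, const char *addr)
{
    char line[128];
    long stamp;
    int down = 0;
    FILE *f = fopen(path, "r");

    if (!f || !fgets(line, sizeof(line), f))
        goto out;
    stamp = atol(line);
    if (time(NULL) - stamp > STALE_AFTER)    /* too old: ignore the cache */
        goto out;
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\n")] = '\0';
        if (strcmp(line, addr) == 0) {
            down = 1;
            break;
        }
    }
out:
    if (f)
        fclose(f);
    return down;
}

int main(void)
{
    char *down[] = { "192.0.2.11" };         /* example address */
    save_down_list(CACHE_FILE, down, 1);
    printf("192.0.2.11 recently down? %d\n", recently_down(CACHE_FILE, "192.0.2.11"));
    printf("192.0.2.10 recently down? %d\n", recently_down(CACHE_FILE, "192.0.2.10"));
    return 0;
}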
Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools
On 1/24/2014 11:45 AM, Brandon Allbery wrote: On Fri, 2014-01-24 at 11:41 -0500, Jeffrey Hutzelman wrote: The problem is the one-off clients that make _one RPC_ and then exit. They have no opportunity to remember what didn't work last time. Has it been considered to write a cache file somewhere (even a user dotfile) that could be used if it's not stale?

pts has interactive and source modes which would permit the pts process to cache the up/down information between requests. There is no reason the vos command could not have equivalent functionality, which would significantly improve performance in a number of ways. Besides remembering which servers are up/down, the same rx connections can be reused and repetitive rx security class challenge/responses can be avoided.

I would prefer to see someone work on this approach, combined with jhutz's VIOCCKSERV suggestion if a cache manager is installed on the local machine.

Jeffrey Altman
[OpenAFS] Re: DB servers quorum and OpenAFS tools
On Fri, 24 Jan 2014 11:41:35 -0500 Jeffrey Hutzelman jh...@cmu.edu wrote: The problem is the one-off clients that make _one RPC_ and then exit. They have no opportunity to remember what didn't work last time. It might help some for these sorts of clients to use multi, if they're doing read-only requests, and probably wouldn't create much load. However, for a call that results in a ubik write transaction, I'm not entirely sure it's desirable to do a multi call. That will require some additional thought.

At least we could multi call VOTE_GetSyncSite, perhaps, in situations where we actually use GetSyncSite.

In the meantime, another thing that might be helpful is for clients about to make such an RPC to query the CM's record of which servers are up, and use that to decide which server to contact. A quick VIOCCKSERV with the fast flag set could make a big difference.

We have some code to implement this and to honor the client server preferences; finishing it, I think, just got buried behind higher priorities.

-- Andrew Deason adea...@sinenomine.net
[OpenAFS] Re: DB servers quorum and OpenAFS tools
[ ... ] adding the new machines to the CellServDB before the new server is up. You could bring up e.g. dbserver 4, and only after you're sure it's up and available, then add it to the client CellServDB. Then remove dbserver #3 from the client CellServDB, and then turn off dbserver #3.

For the client 'CellServDB' I simply did not expect any issues: my expectation was that the clients would scan very quickly the list of those addresses, starting with the lowest numbered for example, find a live one that is a member of the quorum, and then if necessary get from it the address of the sync site; which is close to what it seems to do, only very slowly. I would have wished to put all 6 (different) IP addresses (3 up, 3 down) in the client 'CellServDB' and in 'fs newcell', to minimize the number of times I would have to do updates, but I could not because of a local configuration management system that puts the same list in the client and server 'CellServDB'. But done manually on a test client it seemed to work fine, except for the 'vos' clients and their very long search timeouts.

My real issue was 'server/CellServDB', because we could not prepare all 3 new servers ahead of time, but only one at a time. The issue is that with a 'server/CellServDB' update there is potentially a DB daemon (PT, VL) restart (even if the rekeying instructions hint that when the mtime of 'server/CellServDB' changes the DB daemons reread it), and in any case a sync site election. Because each election causes a blip for the clients, I would rather change the 'server/CellServDB' by putting in extra entries ahead of time or leaving in entries for disabled servers, to reduce the number of times elections are triggered. Otherwise I can only update one server per week...

Ideally, if I want to reshape the cell from DB servers 1, 2, 3 to 4, 5, 6, I'd love to be able to do it by first putting in the 'server/CellServDB' all 6, with 4, 5, 6 not yet available, and only at the end remove 1, 2, 3. Which does not play well with the quorum (if one of the 3 live servers fails) :-) so I went halfway.

You would need to keep the server-side CellServDB accurate on the dbservers in order for them to work, but the client CellServDB files can be missing dbservers. [ ... ]

It would be nice to know more about the details here to make planning easier in future updates. For example, in an ideal world, putting more or less DB servers in the client 'CellServDB' should not matter, as long as one that belongs to the cell is up; again, if the logic were, for all types of client: scan quickly the list of potential DB servers, find one that is up, belongs to the cell and reckons it is part of the quorum, and if necessary get from it the address of the sync site. Similarly (within limits) deliberately having non-up DB servers in the 'server/CellServDB' should not matter that much, because non-up DB servers happen anyhow in case of failures.
[OpenAFS] Re: DB servers quorum and OpenAFS tools
[ ... ] At some point during this slow incremental plan there were 4 entries in both 'CellServDB's and the new one had not been started up yet, and would not be for a couple of days. Oh also, I'm not sure why you're adding the new machines to the CellServDB before the new server is up. You could bring up e.g. dbserver 4, and only after you're sure it's up and available, then add it to the client CellServDB. Then remove dbserver #3 from the client CellServDB, and then turn off dbserver #3.

As mentioned at greater length in a just-sent message, to minimize the number of DB daemon restarts/resets. But even without that motivation, if I have a cell with 4 DB servers, the fact that 1 is down should really have no or little noticeable impact, or else the redundancy they provide is not that worthwhile...

You would need to keep the server-side CellServDB accurate on the dbservers in order for them to work,

Well, "accurate" perhaps is a bit too strong: there need to be enough listed to form a quorum, there should be at least one that is common between the client 'CellServDB' and the 'server/CellServDB', and ideally all the 'server/CellServDB' members should also be in the client 'CellServDB' (here I guess everybody understands that 'CellServDB' means that file or an equivalent mechanism in DNS etc.). Because the crucial properties are:

* DB servers for the wrong cell name or without the key don't matter.
* DB servers outside the quorum for a cell name don't matter.
* All quorum members know each other and which of them is the sync site.

and therefore I hope this happens:

* Each DB server knows which cell it belongs to and its key(s).
* Each DB server knows whether it is part of the quorum, and the list of quorum members.
* Each DB server that is part of the quorum knows which one is the sync site.

Then whenever a DB server contacts or is contacted by another DB server it should:

* Check that the cell name is the same.
* Verify the cell is the same using the shared key.
* Ask the other server for a list of quorum members.
* Check whether the other server is in its quorum list:
  - If missing, add to quorum list and trigger an election.
  - If present, check the 'sync site' is the same:
    o If not the same, trigger an election.

but the client CellServDB files can be missing dbservers.

What clients (cache or tools) should probably do (sketched in code below):

* Contact all 'CellServDB' servers quickly.
* For each, or the first, DB server that replies, check whether it has the same cell name.
* If it is the same cell name, try to use a token to get the list of quorum members:
  - If that fails, the DB server does not have the right key, so skip it.
  - If that succeeds:
    o Choose a quorum member at random for a query.
    o Choose the sync-site for an update.

This won't work if a client needs the sync-site, and the sync-site is missing from the CellServDB, but in all other situations, that should work fine.

Then current client libraries could be improved, because any quorum member could be asked for the address of the sync-site.
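A small sketch of the client-side decision Peter describes above, written purely over an in-memory summary of what a probe of each server returned; the struct fields and helper function are inventions for illustration, not OpenAFS data structures or APIs:

/* Sketch: given what a quick probe of each CellServDB entry reported,
 * pick a server for a read (any quorum member) or for a write (the
 * sync site).  All names here are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

struct dbserver_info {
    const char *addr;
    int reachable;      /* answered the probe at all */
    int cell_matches;   /* same cell name and accepted our token */
    int in_quorum;      /* claims to be part of the current quorum */
    int is_sync_site;   /* claims to be the sync site */
};

static const struct dbserver_info *
pick_server(const struct dbserver_info *s, int n, int need_write)
{
    const struct dbserver_info *candidates[16];
    int ncand = 0;

    for (int i = 0; i < n && ncand < 16; i++) {
        if (!s[i].reachable || !s[i].cell_matches || !s[i].in_quorum)
            continue;                       /* wrong cell or outside quorum */
        if (need_write) {
            if (s[i].is_sync_site)
                return &s[i];               /* updates must go to the sync site */
        } else {
            candidates[ncand++] = &s[i];    /* any quorum member can answer a query */
        }
    }
    if (!need_write && ncand > 0)
        return candidates[rand() % ncand];
    return NULL;                            /* nothing usable found */
}

int main(void)
{
    struct dbserver_info cell[] = {
        { "192.0.2.10", 0, 0, 0, 0 },       /* down */
        { "192.0.2.11", 1, 1, 1, 0 },
        { "192.0.2.12", 1, 1, 1, 1 },       /* sync site */
    };
    const struct dbserver_info *r = pick_server(cell, 3, 0);
    const struct dbserver_info *w = pick_server(cell, 3, 1);
    printf("query  -> %s\n", r ? r->addr : "(none)");
    printf("update -> %s\n", w ? w->addr : "(none)");
    return 0;
}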
[OpenAFS] Re: DB servers quorum and OpenAFS tools
On Thu, 23 Jan 2014 14:58:35 + p...@afs.list.sabi.co.uk (Peter Grandi) wrote: The issue is that with 'server/CellServDB' update there is potentially a DB daemon (PT, VL) restart (even if the rekeying instructions hint that when the mtime of 'server/CellServDB' changes the DB daemons reread it) and in any case a sync site election.

The daemons do reread the local configuration if the CellServDB mtime changes. But they don't reinitialize the voting algorithm data and rx connections etc. that would be required to incorporate a new dbserver into the quorum. So, for that you need to restart, yes.

You would need to keep the server-side CellServDB accurate on the dbservers in order for them to work, but the client CellServDB files can be missing dbservers. [ ... ] It would be nice to know more about the details here to make planning easier in future updates.

I'm not sure what additional details you want. You just always make sure the client CellServDB doesn't refer to dbservers that don't exist. So, when you add a new dbserver, don't add it to the client CellServDB until it's up and running. And when you remove a dbserver, remove it from the client CellServDB before decommissioning it.

For example in an ideal world putting more or less DB servers in the client 'CellServDB' should not matter, as long as one that belongs to the cell is up; again, if the logic were, for all types of client: scan quickly the list of potential DB servers, find one that is up, belongs to the cell and reckons it is part of the quorum, and if necessary get from it the address of the sync site.

There is an idea we had pending for performing a VL_ProbeServer multi_rx call on 'vos' startup to see which servers are up before doing anything. The possible argument against this is that it adds a little bit of load and a little bit of delay on every operation, even if all of the servers are up. But maybe it's worth it.

Another possible optimization is that ubik-using utilities could try the lowest-IP dbserver first when doing something that requires db write access (or just randomly pick a site from the lowest half+1 of the quorum), which would speed up the process in a majority of cases. The argument against that, of course, is that the lowest-IP heuristic may not always apply in future implementations of ubik, and in general it can make the minority of cases worse (when the lower IPs are unreachable).

-- Andrew Deason adea...@sinenomine.net
Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools
On Thu, 2014-01-23 at 10:44 -0600, Andrew Deason wrote: For example in an ideal world putting more or less DB servers in the client 'CellServDB' should not matter, as long as one that belongs to the cell is up; again, if the logic were, for all types of client: scan quickly the list of potential DB servers, find one that is up, belongs to the cell and reckons it is part of the quorum, and if necessary get from it the address of the sync site.

The problem is that you want the client to scan quickly to find a server that is up, but because networks are not perfectly reliable and drop packets all the time, it cannot know that a server is not up until that server has failed to respond to multiple retransmissions of the request. Those retransmissions cannot be sent quickly; in fact, they _must_ be sent with exponentially-increasing backoff times. Otherwise, when your network becomes congested, the retransmission of dropped packets will act as a runaway positive feedback loop, making the congestion worse and saturating the network.

-- Jeff
Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools
On Thu, 2014-01-23 at 14:58 +, Peter Grandi wrote: My real issue was 'server/CellServDB' because we could not prepare all 3 new servers ahead of time, but only one at a time. The issue is that with a 'server/CellServDB' update there is potentially a DB daemon (PT, VL) restart (even if the rekeying instructions hint that when the mtime of 'server/CellServDB' changes the DB daemons reread it), and in any case a sync site election. Because each election causes a blip for the clients, I would rather change the 'server/CellServDB' by putting in extra entries ahead of time or leaving in entries for disabled servers, to reduce the number of times elections are triggered. Otherwise I can only update one server per week...

There's not really any such thing as a new election. Elections happen approximately every 15 seconds, all the time. An interruption in service occurs only when an election _fails_; that is, when no one server obtains the votes of more than half of the servers that exist(*). That can happen if not enough servers are up, of course, but it can also happen when one or more servers that are up are unable to vote for the ideal candidate. Generally, the rule is that one cannot vote for two different servers within 75 seconds, or vote for _any_ server within 75 seconds of startup.

As a practical matter, what this means when restarting database servers for config updates is that you must not restart them all at the same time. You _can_ restart even the coordinator without causing an interruption in service longer than the time it takes the server to restart (on the order of milliseconds, probably). Even though the server that just restarted cannot vote for 75 seconds, that doesn't mean it cannot run in _and win_ the election. However, after restarting one server, you need to wait for things to completely stabilize before restarting the next one. This typically takes 75-90 seconds, and can be observed in the output of 'udebug'. What you are looking for is for the recovery state to be f or 1f, and for the coordinator to be getting yes votes from every server you think is supposed to be up.

Of course, you _will_ have an interruption in service when you retire the machine that is the coordinator. At the moment, there is basically no way to avoid that. However, if you plan and execute the transition carefully, you only need to take that outage once.

(*) Special note: the server with the lowest IP address gets an extra one-half vote, but only when voting for itself. This helps to break ties when the CellServDB contains an even number of servers.

Ideally if I want to reshape the cell from DB servers 1, 2, 3 to 4, 5, 6, I'd love to be able to do it by first putting in the 'server/CellServDB' all 6, with 4, 5, 6 not yet available, and only at the end remove 1, 2, 3. Which does not play well with the quorum (if one of the 3 live servers fails) :-) so I went halfway.

This doesn't work because, with 6 servers in the CellServDB, to maintain a quorum you must have four servers running, or three servers if one of them is the one with the lowest address. In fact, you can't even transition safely from three to four servers, because once you have four servers in your CellServDB, if the one with the lowest address goes down before the new server is brought up, you'll have two out of four servers up and no quorum. However, you can safely and cleanly transition to and from larger numbers of servers, one server at a time. Just be sure that before you start up a new server, every existing server has been restarted with a CellServDB naming that server. Similarly, make sure to shut a server down before removing it from the remaining servers' CellServDB files.

At one point, I believe I worked out a sequence involving careful use of out-of-sync CellServDB files and the -ubiknocoord option (gerrit #2287) to allow safely transitioning from 3 servers to 4. However, this is not recommended unless you have a deep understanding of the election code, because it is easy to screw up and create a situation where you can have two sync sites.

I also worked out (but never implemented) a mechanism to allow an administrator to trigger a clean transition of the coordinator role from one server to another _without_ a 75-second interruption. I'm sure at some point we'll revisit that idea.

-- Jeff
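A tiny sketch of the vote arithmetic Jeff describes (quorum needs strictly more than half of the servers listed in CellServDB, with the lowest-IP server worth an extra half vote when voting for itself); the function is illustrative, not ubik code:

/* Illustrative check of whether an election can succeed, per the rules
 * described above.  voters_for_winner counts the winner's own vote. */
#include <stdio.h>

static int quorum_possible(int listed, int voters_for_winner, int winner_is_lowest_ip)
{
    double votes = voters_for_winner + (winner_is_lowest_ip ? 0.5 : 0.0);
    return votes > listed / 2.0;
}

int main(void)
{
    /* 4 listed, 2 up: only works if the lowest-IP server is one of them. */
    printf("4 listed, 2 up, lowest up:   %d\n", quorum_possible(4, 2, 1));
    printf("4 listed, 2 up, lowest down: %d\n", quorum_possible(4, 2, 0));
    /* 6 listed: need 4 servers, or 3 if one of them is the lowest-IP server. */
    printf("6 listed, 3 up, lowest up:   %d\n", quorum_possible(6, 3, 1));
    printf("6 listed, 3 up, lowest down: %d\n", quorum_possible(6, 3, 0));
    return 0;
}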
[OpenAFS] Re: DB servers quorum and OpenAFS tools
On Thu, 23 Jan 2014 14:33:58 -0500 Jeffrey Hutzelman jh...@cmu.edu wrote: The problem is that you want the client to scan quickly to find a server that is up, but because networks are not perfectly reliable and drop packets all the time, it cannot know that a server is not up until that server has failed to respond to multiple retransmissions of the request.

So... what about issuing a multi_rx VL_ProbeServer, like I said in the removed context? You only need to wait for one site to respond, and then you can immediately kill the other calls.

-- Andrew Deason adea...@sinenomine.net
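The real implementation would issue a multi_rx VL_ProbeServer; the sketch below only illustrates the "wait for the first responder, then drop the rest" shape using plain UDP sockets and poll(), with made-up addresses and a placeholder payload that no real dbserver would answer:

/* Sketch: probe several dbserver addresses in parallel and take the
 * first responder, instead of trying them one at a time. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int first_responder(const char **addrs, int n, int port, int timeout_ms)
{
    struct pollfd pfds[16];
    int i, winner = -1;

    if (n > 16)
        n = 16;
    for (i = 0; i < n; i++) {
        struct sockaddr_in sin;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(port);
        inet_pton(AF_INET, addrs[i], &sin.sin_addr);

        pfds[i].fd = socket(AF_INET, SOCK_DGRAM, 0);
        pfds[i].events = POLLIN;
        connect(pfds[i].fd, (struct sockaddr *)&sin, sizeof(sin));
        send(pfds[i].fd, "probe", 5, 0);    /* placeholder payload, not a real Rx packet */
    }
    if (poll(pfds, n, timeout_ms) > 0) {
        for (i = 0; i < n; i++) {
            if (pfds[i].revents & POLLIN) {
                winner = i;                 /* first server to answer wins */
                break;
            }
        }
    }
    for (i = 0; i < n; i++)
        close(pfds[i].fd);                  /* "immediately kill the other calls" */
    return winner;
}

int main(void)
{
    const char *dbservers[] = { "192.0.2.10", "192.0.2.11", "192.0.2.12" };
    int who = first_responder(dbservers, 3, 7003, 2000);
    printf("first responder: %d\n", who);   /* -1 means nobody answered in time */
    return 0;
}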
[OpenAFS] Re: DB servers quorum and OpenAFS tools
On Thu, 23 Jan 2014 15:39:03 + p...@afs.list.sabi.co.uk (Peter Grandi) wrote: Oh also, I'm not sure why you're adding the new machines to the CellServDB before the new server is up. You could bring up e.g. dbserver 4, and only after you're sure it's up and available, then add it to the client CellServDB. Then remove dbserver #3 from the client CellServDB, and then turn off dbserver #3. As mentioned at greater length in a just-sent message, to minimize the number of DB daemon restarts/resets.

I thought you were already migrating one server at a time, so my suggestion doesn't impact the number of dbserver restarts. In fact, what I was suggesting doesn't really touch the servers at all; the only difference I was suggesting was changing when you modify the client-side CellServDB relative to when you perform the dbserver migrations.

But even without that motivation, if I have a cell with 4 DB servers, the fact that 1 is down should really have no or little noticeable impact, or else the redundancy they provide is not that worthwhile...

I think it's quite worthwhile to have the database be available yet slow, as opposed to not being available at all. But yes, I believe there is at least 1 way in which the relevant code might be improved. In the meantime, there are existing procedures with the existing code that existing sites perform to mostly avoid the problems you are seeing. It's up to you if you want to use them.

-- Andrew Deason adea...@sinenomine.net
[OpenAFS] Re: DB servers quorum and OpenAFS tools
On Fri, 17 Jan 2014 18:50:13 + p...@afs.list.sabi.co.uk (Peter Grandi) wrote: What rules do the OpenAFS tools use to contact one of the DB servers?

Most of the time (read requests), we'll pick a random dbserver and use it. If contacting a dbserver fails for network reasons, we will try to avoid that server unless we run out of servers. Possibly the big difference in the behavior you're seeing is that the kernel clients only perform unauthenticated read operations on the dbservers. Userspace tools like vos sometimes need to perform write operations, which changes things somewhat.

For a write operation to the db, things are slightly different because we must pick the sync site; we cannot access just any dbserver to fulfill the request. If there are 3 or fewer dbservers, we pick randomly like we do for read operations. If there are more than 3 sites, we ask one of the sites who the sync site is, and then we contact the sync site in order to perform the request. If the request is successful, we remember who the sync site is, and keep using it until we get a network error (or an "I'm not the sync site" error).

With userspace tools, we have no way of remembering which servers are down between invocations, so each time a tool runs, it picks a random server again. The kernel clients are running for a much longer period of time, so presumably if we contact a downed dbserver, the client will not try to contact that dbserver for quite some time.

That's just for choosing a dbserver site, though; if you want to know how long we take to fail to connect to a specific site:

I have a single-host test OpenAFS cell with 1.6.5.2, and I have added a second IP address to '/etc/openafs/CellServDB' with an existing DNS entry (just to be sure) but not assigned to any machine: sometimes 'vos vldb' hangs for a while (105 seconds), doing 8 attempts to connect to the down DB server;

I'm not sure how you are determining that we're making 8 attempts to contact the down server. Are you just seeing 8 packets go by? We can send many packets for a single attempt to contact the remote site. By default openafs code tends to wait about 50 seconds for a site to respond to a request. vos sets this to 90 seconds for most things (I don't know why), during which period it will retry sending packets. 105 seconds is close enough that that should explain it; the timeouts are not always exact, since we kind of poll outstanding calls to see if they have timed out.

The OpenAFS client caches seemed to cope well as expected, as in a cell with a quorum of 3 up DB servers, and 1 down. I think the only consequence I noticed was sometimes 'aklog' taking around 15 seconds.

The kernel client will not notice changes to the CellServDB until you restart it, or run 'fs newcell'. The client also usually doesn't need to contact the dbservers very often; it could easily take an hour for you to notice even if all of the dbservers were down. If the client hits a downed dbserver, it will hang, too (at least around 50 seconds).

However *some* backups started to hang and some AFS-volumes became inaccessible to all clients. The fairly obvious cause was that the cloning transaction, instead of being very quick, would not end, and cloning locks the AFS-volume.

I _think_ this is because we are hanging on allocating a new volume id for the temporary clone. If you run with -verbose, do you see "Allocating new volume id for clone of volume..." before it hangs? We could possibly do that before we mark the volume as busy, but then we might allocate a vol id we never use, if the volume isn't usable. Maybe that's better, though. Fixing that doesn't eliminate the hanging behavior you're seeing, but it would mean the volume would be accessible to clients while 'vos' is hanging.

May I ask why you are not just dumping .backup volumes? You could create the .backup volumes en masse with 'vos backupsys', and then you could just 'vos dump' them afterwards. Performing big bulk operations like that as much as possible would make the tools more resilient to the errors you are seeing, since then the tool is a single command and can remember which dbserver is down.

With a curious attempt to open $HOME/.AFSSERVER (which did not exist).

The 1.6.5.2 'vos' also tries to open /.AFSSERVER. This is for rmtsys support with certain environments, usually involving the NFS translator. I assume this happens when 'vos' tries to get your tokens in order to authenticate; if that's correct, it'll go away if you run with -noauth or -localauth.

-- Andrew Deason adea...@sinenomine.net
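A toy sketch of the "remember the sync site until something goes wrong" behaviour Andrew describes for write operations; the enum values and functions are placeholders of mine, not vos internals:

/* Sketch: cache the sync site between write requests and forget it on
 * a network error or a "not the sync site" reply, as described above. */
#include <stdio.h>

enum result { OK, NET_ERROR, NOT_SYNC_SITE };

static int cached_sync_site = -1;          /* index into the dbserver list */

static int choose_write_site(void)
{
    if (cached_sync_site >= 0)
        return cached_sync_site;           /* keep using the known sync site */
    /* Placeholder: in the real flow, with more than 3 sites the client
     * asks one server who the sync site is; here we just pretend it is 0. */
    return 0;
}

static void note_result(int site, enum result r)
{
    if (r == OK)
        cached_sync_site = site;           /* remember the working sync site */
    else
        cached_sync_site = -1;             /* forget it and re-discover next time */
}

int main(void)
{
    int site = choose_write_site();
    note_result(site, OK);
    printf("after success, cached site = %d\n", cached_sync_site);
    note_result(site, NOT_SYNC_SITE);
    printf("after error,   cached site = %d\n", cached_sync_site);
    return 0;
}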
[OpenAFS] Re: DB servers quorum and OpenAFS tools
On Fri, 17 Jan 2014 18:50:13 + p...@afs.list.sabi.co.uk (Peter Grandi) wrote: Planned to do this incrementally by adding a new DB server to the 'CellServDB', then starting it up, then removing an old DB server, and so on until all 3 have been replaced in turn with new DB servers #4, #5, #6. At some point during this slow incremental plan there were 4 entries in both 'CellServDB's and the new one had not been started up yet, and would not be for a couple of days.

Oh also, I'm not sure why you're adding the new machines to the CellServDB before the new server is up. You could bring up e.g. dbserver #4, and only after you're sure it's up and available, then add it to the client CellServDB. Then remove dbserver #3 from the client CellServDB, and then turn off dbserver #3.

You would need to keep the server-side CellServDB accurate on the dbservers in order for them to work, but the client CellServDB files can be missing dbservers. This won't work if a client needs the sync-site, and the sync-site is missing from the CellServDB, but in all other situations, that should work fine.

-- Andrew Deason adea...@sinenomine.net
Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools
On Fri, 2014-01-17 at 14:12 -0600, Andrew Deason wrote: time, so presumably if we contact a downed dbserver, the client will not try to contact that dbserver for quite some time.

To elaborate: the cache manager keeps track of every server, and periodically sends a sort of ping to each server to find out which servers are up. So, it will discover a server is down even if you're not using it. And, other than the periodic pings, the cache manager will never direct a request to a server it thinks is down. So, failover for the CM itself is automatic, persistent, and often completely transparent.

The fileserver works a little differently, but also keeps track of which server it is using, fails over when that server stops responding, and generally avoids switching when it doesn't need to.

Ubik database servers all communicate among themselves, which is a necessary part of the database replication mechanism. That happens even when one server is down, but in such a way that you'll never notice a communication failure between dbservers, except in an unusual combination of circumstances which can sometimes happen if a server goes down while you are making a request that requires writing to the database.

I have a single-host test OpenAFS cell with 1.6.5.2, and I have added a second IP address to '/etc/openafs/CellServDB' with an existing DNS entry (just to be sure) but not assigned to any machine: sometimes 'vos vldb' hangs for a while (105 seconds), doing 8 attempts to connect to the down DB server; I'm not sure how you are determining that we're making 8 attempts to contact the down server. Are you just seeing 8 packets go by? We can send many packets for a single attempt to contact the remote site.

Right. Even though AFS communicates over UDP, which itself is connectionless, Rx does have the notion of connections and includes a full transport layer including retransmission, sequencing, flow control, and exponential backoff for congestion control. What you are actually seeing is multiple retransmissions of a request, which may or may not be the first packet in a new connection. The packet is retransmitted because the server did not reply with an acknowledgement, and the intervals get longer because of exponential backoff, which is a key factor in making sure that congested networks eventually get better rather than only getting worse.

-- Jeff
Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools
On Fri, 2014-01-17 at 14:21 -0600, Andrew Deason wrote: On Fri, 17 Jan 2014 18:50:13 + p...@afs.list.sabi.co.uk (Peter Grandi) wrote: Planned to do this incrementally by adding a new DB server to the 'CellServDB', then starting it up, then removing an old DB server, and so on until all 3 have been replaced in turn with new DB servers #4, #5, #6. At some point during this slow incremental plan there were 4 entries in both 'CellServDB's and the new one had not been started up yet, and would not be for a couple of days.

Oh also, I'm not sure why you're adding the new machines to the CellServDB before the new server is up. You could bring up e.g. dbserver #4, and only after you're sure it's up and available, then add it to the client CellServDB. Then remove dbserver #3 from the client CellServDB, and then turn off dbserver #3.

Yup; that's the sane thing to do. New servers should be in service before you publish them in AFSDB or SRV records or in clients' CellServDB files, and old servers should not be removed from service until after they have been unpublished and all the clients you care about have picked up the change.

You would need to keep the server-side CellServDB accurate on the dbservers in order for them to work, but the client CellServDB files can be missing dbservers. This won't work if a client needs the sync-site, and the sync-site is missing from the CellServDB, but in all other situations, that should work fine.

This is what gerrit #2287 is about. It adds a switch that will allow you to configure your dbservers so that they will not be elected coordinator. Unpublished servers should be run with this switch, or configured as non-voting servers, so that they don't become sync site. Unfortunately, progress on getting that merged has been stalled for a while, in no small part because there are changes still needed and a related patch required significant rework, and I haven't had time to touch this stuff in a few months.

So in the meantime, the best you can do is ensure the unpublished server will not become sync site by some combination of careful selection of the IP addresses involved, careful monitoring and management of the election process, and/or marking the unpublished server as nonvoting. Some care is required for nonvoting servers, as in theory all dbservers must agree on who the voting servers are. Some mismatches are possible and even safe, but figuring out which those are and what the behavior will be requires a thorough understanding of what checks are done and how the voting process works.

-- Jeff