Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-24 Thread Harald Barth
 The problem is that you want the client to scan quickly to find a server
 that is up, but because networks are not perfectly reliable and drop
 packets all the time, it cannot know that a server is not up until that
 server has failed to respond to multiple retransmissions of the request.
 Those retransmissions cannot be sent quickly; in fact, they _must_ be
 sent with exponentially-increasing backoff times.  Otherwise, when your
 network becomes congested, the retransmission of dropped packets will
 act as a runaway positive feedback loop, making the congestion worse and
 saturating the network.


You are completely right if one must talk to that server. But I think
that AFS/RX sometimes hangs too long waiting for one server
instead of trying the next one, for example for questions that could
be answered by any VLDB server. I'm thinking of operations like group
membership and volume location.

Harald.


Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-24 Thread Simon Wilkinson
On 24 Jan 2014, at 07:48, Harald Barth h...@kth.se wrote:

 You are completely right if one must talk to that server. But I think
 that AFS/RX sometimes hangs too long waiting for one server
 instead of trying the next one, for example for questions that could
 be answered by any VLDB server. I'm thinking of operations like group
 membership and volume location.

I have long thought that we should be using multi for vldb lookups, 
specifically to avoid the problems with down database servers. The problem is 
that doing so may cause issues for sites that have multiple dbservers for 
scalability, rather than redundancy. Instead of each dbserver seeing a third 
(or a quarter, or ...) of requests it will see them all. Even if the client 
aborts the remaining calls when it receives the first response, the likelihood 
is that the other servers will already have received, and responded to, the 
request.

There are ways we could be more intelligent (for example measuring the normal 
RTT of an RPC to the current server, and only doing a multi if that is 
exceeded), but we would have to be very careful that this wouldn't amplify a 
congestive collapse.
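
As a rough sketch of that RTT-gated idea, in C: the names here (sketch_note_rtt, sketch_should_fan_out) are hypothetical placeholders rather than OpenAFS code, and a real client would use Rx's own peer RTT statistics.

    #include <stdint.h>

    /*
     * Hypothetical helper: decide whether a vldb lookup should fan out
     * to all dbservers.  Remember a smoothed RTT for the currently
     * preferred server, and only go "multi" once the outstanding call
     * has waited noticeably longer than that, so that healthy cells
     * with many dbservers do not see every request on every server.
     */
    struct sketch_server_stats {
        uint32_t smoothed_rtt_ms;   /* EWMA of observed RPC round-trip times */
    };

    /* Fold a new RTT sample into the estimate (classic EWMA, alpha = 1/8). */
    static void
    sketch_note_rtt(struct sketch_server_stats *st, uint32_t sample_ms)
    {
        if (st->smoothed_rtt_ms == 0)
            st->smoothed_rtt_ms = sample_ms;
        else
            st->smoothed_rtt_ms +=
                ((int32_t)sample_ms - (int32_t)st->smoothed_rtt_ms) / 8;
    }

    /* Fan out only if we have already waited 'factor' times the expected RTT. */
    static int
    sketch_should_fan_out(const struct sketch_server_stats *st,
                          uint32_t waited_ms, uint32_t factor)
    {
        uint32_t expected = st->smoothed_rtt_ms ? st->smoothed_rtt_ms : 100;
        return waited_ms > factor * expected;
    }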

Cheers,

Simon


Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-24 Thread Peter Grandi
 For example in an ideal world putting more or fewer DB servers
 in the client 'CellServDB' should not matter, as long as one
 that belongs to the cell is up; again if the logic were for
 all types of client: scan quickly the list of potential DB
 servers, find one that is up, belongs to the cell, and
 reckons it is part of the quorum, and if necessary get from it
 the address of the sync site.

 The problem is that you want the client to scan quickly to find a
 server that is up, but because networks are not perfectly
 reliable and drop packets all the time, it cannot know that a
 server is not up until that server has failed to respond to
 multiple retransmissions of the request.

That has nothing to do with how quickly the probes are sent...

 Those retransmissions cannot be sent quickly; in fact, they
 _must_ be sent with exponentially-increasing backoff times.

That has nothing to do with how quickly they can be sent...  The
duration of the intervals between the probes is a different matter
from what the ratio of successive intervals should be.
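
To make that distinction concrete, here is a tiny C illustration: both schedules below back off with the same 2x ratio, but from very different base intervals. The numbers are made up for illustration and are not OpenAFS's actual timer values.

    #include <stdio.h>

    /*
     * The *ratio* between successive retransmission intervals (here 2x,
     * i.e. exponential backoff) is a separate choice from the *base*
     * interval.  Both schedules back off exponentially; only the
     * starting value differs.
     */
    static void
    print_schedule(const char *name, double first_ms, double ratio, int n)
    {
        double t = first_ms;
        printf("%s:", name);
        for (int i = 0; i < n; i++) {
            printf(" %.0fms", t);
            t *= ratio;
        }
        printf("\n");
    }

    int
    main(void)
    {
        print_schedule("fast base (200ms)", 200.0, 2.0, 6);   /* 200ms .. 6.4s */
        print_schedule("slow base (3600ms)", 3600.0, 2.0, 6); /* 3.6s  .. 115s */
        return 0;
    }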

 Otherwise, when your network becomes congested, the
 retransmission of dropped packets will act as a runaway positive
 feedback loop, making the congestion worse and saturating the
 network.

I am sorry I have not been clear about the topic: I was not
meaning to discuss flow control in back-to-back streaming
connections; my concern was about the frequency of *probing*
servers for accessibility.

Discovering the availability of DB servers is not the same thing
as streaming data from/to a fileserver, both in nature and in the
amount of traffic involved. In TCP congestion control, for example,
one could be talking about streams of 100,000x 8192B packets per
second; DB server discovery involves only a handful of small packets.

But even if I had meant to discuss back-to-back streaming packet
congestion control, the absolute numbers are still vastly
different. In the case of *probing* for the liveness of a *single*
DB server I have observed the 'vos' command send packets at
these intervals:

  «The wait times after the 8 attempts are: 3.6s, 6.8s, 13.2s,
  21.4s, 4.6s, 25.4s, 26.2s, 3.8s.»

with randomish variations around that. That's around 5 packets per
minute, with intervals between 3,600ms and 26,200ms. Again, that is to a
single DB server, not, say, round-robin to all DB servers in
'CellServDB'.

With TCP, congestion control backs off (the 'RTO' parameter) for
200ms (two hundred milliseconds). With another, rather different
distributed filesystem, Lustre, I observed some issues with that
rather long backoff time on high-throughput (600-800MB/s)
back-to-back packet streams, and there is a significant amount of
research suggesting that on fast, low-latency links a 200ms RTO is
way excessive.

For example in a paper that is already 5 years old:

  http://www.cs.cmu.edu/~dga/papers/incast-sigcomm2009.pdf

«Under severe packet loss, TCP can experience a timeout that
lasts a minimum of 200ms, determined by the TCP minimum
retransmission timeout (RTOmin).

While the default values operating systems use today may
suffice for the wide-area, datacenters and SANs have round
trip times that are orders of magnitude below the RTOmin
defaults (Table 1).

Scenario     RTT      OS       TCP RTOmin
WAN          100ms    Linux    200ms
Datacenter   1ms      BSD      200ms
SAN          0.1ms    Solaris  400ms

Table 1: Typical round-trip-times and minimum
TCP retransmission bounds.»

«FINE-GRAINED RTO

How low must the RTO be to retain high throughput under TCP
incast collapse conditions, and to how many servers does this
solution scale? We explore this question using real-world
measurements and ns-2 simulations [26], finding that to be
maximally effective, the timers must operate on a granularity
close to the RTT of the network—hundreds of microseconds or
less.»

«Figure 3: Experiments on a real cluster validate the
simulation result that reducing the RTOmin to microseconds
improves goodput.»

«Aggressively lowering both the RTO and RTOmin shows practical
benefits for datacenters. In this section, we investigate if
reducing the RTOmin value to microseconds and using finer
granularity timers is safe for wide area transfers.

We find that the impact of spurious timeouts on long, bulk
data flows is very low – within the margins of error –
allowing RTO to go into the microseconds without impairing
wide-area performance.»


[OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-24 Thread Andrew Deason
On Thu, 23 Jan 2014 21:55:15 +
p...@afs.list.sabi.co.uk (Peter Grandi) wrote:

  Otherwise, when your network becomes congested, the
  retransmission of dropped packets will act as a runaway positive
  feedback loop, making the congestion worse and saturating the
  network.
 
 I am sorry I have not been clear about the topic: I was not
 meaning to discuss flow control in back-to-back streaming
 connections; my concern was about the frequency of *probing*
 servers for accessibility.

That's the only way we have to probe whether a server is up: by creating
a streaming connection and issuing an RPC. How else do you propose we
contact a server?

Those retransmissions that jhutz is talking about are for a single probe
request. We can possibly run those probes in parallel, which is getting
discussed a bit in another part of the thread, but that still involves
going through this process.
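
A conceptual sketch of "probe them all and take the first responder", in plain C with a raw UDP datagram rather than OpenAFS's rx_multi machinery; the function name and probe payload are hypothetical, and a real Rx probe would be a proper RPC (e.g. a VL probe call).

    #include <netinet/in.h>
    #include <poll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /*
     * Fire a tiny "are you there?" datagram at every candidate dbserver
     * address and return the index of the first one that answers, or -1
     * on timeout.  This only illustrates the "first responder wins"
     * control flow, not the real Rx probe.
     */
    static int
    probe_first_responder(const struct sockaddr_in *addrs, int n, int timeout_ms)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        int winner = -1;
        if (sock < 0)
            return -1;

        const char ping[] = "ping";          /* hypothetical probe payload */
        for (int i = 0; i < n; i++)
            sendto(sock, ping, sizeof(ping), 0,
                   (const struct sockaddr *)&addrs[i], sizeof(addrs[i]));

        struct pollfd pfd = { .fd = sock, .events = POLLIN };
        if (poll(&pfd, 1, timeout_ms) > 0) {
            char buf[64];
            struct sockaddr_in from;
            socklen_t fromlen = sizeof(from);
            if (recvfrom(sock, buf, sizeof(buf), 0,
                         (struct sockaddr *)&from, &fromlen) >= 0) {
                for (int i = 0; i < n; i++)   /* map the reply to a candidate */
                    if (from.sin_addr.s_addr == addrs[i].sin_addr.s_addr &&
                        from.sin_port == addrs[i].sin_port)
                        winner = i;
            }
        }
        close(sock);
        return winner;
    }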

-- 
Andrew Deason
adea...@sinenomine.net



Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-24 Thread Harald Barth

 I have long thought that we should be using multi for vldb lookups, 
 specifically to avoid the problems with down database servers.

The situation is a little bit different for cache managers, which can
remember which servers are down, and command line tools, which normally
discover how the world looks on each startup.

If the 'we ask everyone' strategy is not used all the time but only on
startup, it will not happen that often, and probably not often enough
to cause problems for the scalability folks.

Harald.


Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-24 Thread Jeffrey Hutzelman
On Fri, 2014-01-24 at 08:01 +, Simon Wilkinson wrote:
 On 24 Jan 2014, at 07:48, Harald Barth h...@kth.se wrote:
 
  You are completely right if one must talk to that server. But I think
  that AFS/RX sometimes hangs too long waiting for one server
  instead of trying the next one, for example for questions that could
  be answered by any VLDB server. I'm thinking of operations like group
  membership and volume location.
 
 I have long thought that we should be using multi for vldb lookups,
 specifically to avoid the problems with down database servers. The
 problem is that doing so may cause issues for sites that have multiple
 dbservers for scalability, rather than redundancy. Instead of each
 dbserver seeing a third (or a quarter, or ...) of requests it will see
 them all. Even if the client aborts the remaining calls when it
 receives the first response, the likelihood is that the other servers
 will already have received, and responded to, the request.
 
 There are ways we could be more intelligent (for example measuring the
 normal RTT of an RPC to the current server, and only doing a multi if
 that is exceeded), but we would have to be very careful that this
 wouldn't amplify a congestive collapse.

The thing is, the OP specifically wasn't complaining about the behavior
of the CM, which remembers when a vlserver is down and then doesn't talk
to it again until it comes up, except for the occasional probe.

The problem is the one-off clients that make _one RPC_ and then exit.
They have no opportunity to remember what didn't work last time.  It
might help some for these sorts of clients to use multi, if they're
doing read-only requests, and probably wouldn't create much load.
However, for a call that results in a ubik write transaction, I'm not
entirely sure it's desirable to do a multi call.  That will require some
additional thought.


In the meantime, another thing that might be helpful is for clients
about to make such an RPC to query the CM's record of which servers are
up, and use that to decide which server to contact.  A quick VIOCCKSERV
with the fast flag set could make a big difference.
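
A hedged sketch of that idea in C: pioctl() and VIOCCKSERV are the real cache-manager interface, but the header names, the flag value (1 assumed to mean a fast, cached check) and the output layout are from memory and should be checked against venus.h and fs.c before relying on them.

    #include <stdio.h>
    #include <afs/param.h>
    #include <afs/stds.h>
    #include <afs/vice.h>
    #include <afs/venus.h>

    /* Normally declared in the OpenAFS sys headers. */
    extern int pioctl(char *, afs_int32, struct ViceIoctl *, afs_int32);

    /*
     * Before a one-shot tool picks a dbserver, ask the local cache
     * manager which servers it already believes are down, via the
     * VIOCCKSERV pioctl.  Assumed: flags = 1 means "report cached state
     * without re-probing", and the output is a zero-terminated list of
     * afs_int32 addresses of down servers (as 'fs checkservers -fast'
     * appears to use).
     */
    static void
    sketch_print_down_servers(void)
    {
        char space[2048];
        afs_int32 flags = 1;              /* assumed: 1 = fast / cached check */
        struct ViceIoctl blob;

        blob.in = (char *)&flags;
        blob.in_size = sizeof(flags);
        blob.out = space;
        blob.out_size = sizeof(space);

        if (pioctl(NULL, VIOCCKSERV, &blob, 1) != 0) {
            perror("pioctl(VIOCCKSERV)");
            return;
        }
        for (afs_int32 *addr = (afs_int32 *)space; *addr != 0; addr++)
            printf("cache manager thinks 0x%x is down\n", (unsigned int)*addr);
    }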

-- Jeff



Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-24 Thread Brandon Allbery
On Fri, 2014-01-24 at 11:41 -0500, Jeffrey Hutzelman wrote:
 The problem is the one-off clients that make _one RPC_ and then exit.
 They have no opportunity to remember what didn't work last time.  It

Has it been considered to write a cache file somewhere (even a user
dotfile) that could be used if it's not stale?
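
A minimal sketch of such a cache file, with a hypothetical name, format and staleness limit, purely to illustrate the suggestion.

    #include <stdio.h>
    #include <time.h>

    /*
     * Persist the address of a dbserver a one-shot tool found to be
     * down, with a timestamp, and ignore the hint once it is older than
     * a staleness limit.
     */
    #define DOWN_CACHE_MAX_AGE 300          /* seconds before the hint is stale */

    static void
    down_cache_write(const char *path, const char *down_addr)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return;
        fprintf(f, "%ld %s\n", (long)time(NULL), down_addr);
        fclose(f);
    }

    /* Returns 1 and fills 'addr' if a fresh hint exists, 0 otherwise. */
    static int
    down_cache_read(const char *path, char *addr, size_t addrlen)
    {
        long stamp;
        char buf[128];
        int ok;
        FILE *f = fopen(path, "r");
        if (!f)
            return 0;
        ok = (fscanf(f, "%ld %127s", &stamp, buf) == 2);
        fclose(f);
        if (!ok || time(NULL) - stamp > DOWN_CACHE_MAX_AGE)
            return 0;                        /* missing, unparsable, or stale */
        snprintf(addr, addrlen, "%s", buf);
        return 1;
    }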

-- 
brandon s allbery kf8nh   sine nomine associates
allber...@gmail.com  ballb...@sinenomine.net
unix, openafs, kerberos, infrastructure, xmonad    http://sinenomine.net



Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-24 Thread Jeffrey Altman
On 1/24/2014 11:45 AM, Brandon Allbery wrote:
 On Fri, 2014-01-24 at 11:41 -0500, Jeffrey Hutzelman wrote:
 The problem is the one-off clients that make _one RPC_ and then exit.
 They have no opportunity to remember what didn't work last time.  It
 
 Has it been considered to write a cache file somewhere (even a user
 dotfile) that could be used if it's not stale?

pts has interactive and source modes which would permit the pts process
to cache the up/down information between requests.

There is no reason the vos command could not have equivalent
functionality, which would significantly improve performance in a number
of ways.  Besides remembering which servers are up/down, the same rx
connections can be reused and repetitive rx security-class
challenge/response exchanges can be avoided.

I would prefer to see someone work on this approach combined with
jhutz's VIOCCKSERV suggestion if a cache manager is installed on the
local machine.

Jeffrey Altman





[OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-24 Thread Andrew Deason
On Fri, 24 Jan 2014 11:41:35 -0500
Jeffrey Hutzelman jh...@cmu.edu wrote:

 The problem is the one-off clients that make _one RPC_ and then exit.
 They have no opportunity to remember what didn't work last time.  It
 might help some for these sorts of clients to use multi, if they're
 doing read-only requests, and probably wouldn't create much load.
 However, for a call that results in a ubik write transaction, I'm not
 entirely sure it's desirable to do a multi call.  That will require some
 additional thought.

At least we could multi call the VOTE_GetSyncSite call, perhaps, in
situations where we actually use GetSyncSite.

 In the meantime, another thing that might be helpful is for clients
 about to make such an RPC to query the CM's record of which servers
 are up, and use that to decide which server to contact.  A quick
 VIOCCKSERV with the fast flag set could make a big difference.

We have some code to implement this and to honor the client's server
preferences; finishing it, I think, just got buried behind higher
priorities.

-- 
Andrew Deason
adea...@sinenomine.net



[OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-23 Thread Peter Grandi
 [ ... ] adding the new machines to the CellServDB before the
 new server is up. You could bring up e.g. dbserver 4, and only
 after you're sure it's up and available, then add it to the
 client CellServDB. Then remove dbserver #3 from the client
 CellServDB, and then turn off dbserver #3.

For the client 'CellServDB' I simply did not expect any issues:
my expectation was that the clients would scan the list of those
addresses very quickly, starting with the lowest numbered for
example, finding a live one that is a member of the quorum, and
then if necessary getting from it the address of the sync site;
which is close to what it seems to do, only very slowly.

I would have wished to put all 6 (different) IP addresses (3 up,
3 down) in the client 'CellServDB' and in 'fs newcell' to
minimize the number of times I would do updates, but I could not
because of a local configuration management system that puts the
same list in the client and server 'CellServDB'. But doing it
manually on a test client seemed to work fine, except for the
'vos' clients and their very long search timeouts.

My real issue was 'server/CellServDB' because we could not
prepare ahead of time all 3 new servers, but only one at a time.

The issue is that with 'server/CellServDB' update there is
potentially a DB daemon (PT, VL) restart (even if the rekeying
instructions hint that when the mtime of 'server/CellServDB'
changes the DB daemons reread it) and in any case a sync site
election.

Because each election causes a blip for the clients I would
rather change the 'server/CellServDB' by putting in extra
entries ahead of time or leaving in entries for disabled
servers, to reduce the number of times elections are triggered.
Otherwise I can only update one server per week...

Ideally if I want to reshape the cell from DB servers 1, 2, 3 to
4, 5, 6, I'd love to be able to do it by first putting in the
'server/CellServDB' all 6, with 4, 5, 6 not yet available, and
only at the end removing 1, 2, 3. Which does not play well (if
one of the 3 live servers fails) with the quorum :-) so I went
halfway.

 You would need to keep the server-side CellServDB accurate on
 the dbservers in order for them to work, but the client
 CellServDB files can be missing dbservers. [ ... ]

It would be nice to know more about the details here to make
planning easier in future updates.

For example in an ideal world putting more or fewer DB servers in
the client 'CellServDB' should not matter, as long as one that
belongs to the cell is up; again if the logic were for all types
of client: scan quickly the list of potential DB servers, find
one that is up, belongs to the cell, and reckons it is part of
the quorum, and if necessary get from it the address of the sync
site.

Similarly (within limits) deliberately having non-up DB servers
in the 'server/CellServDB' should not matter that much, because
non-up DB servers happen anyhow in case of failures.


[OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-23 Thread Peter Grandi
[ ... ]

 At some point during this slow incremental plan there were 4
 entries in both 'CellServDB's and the new one had not been
 started up yet, and would not be for a couple days.

 Oh also, I'm not sure why you're adding the new machines to
 the CellServDB before the new server is up. You could bring up
 e.g. dbserver 4, and only after you're sure it's up and
 available, then add it to the client CellServDB. Then remove
 dbserver #3 from the client CellServDB, and then turn off
 dbserver #3.

As mentioned at greater length in a just-sent message: to
minimize the number of DB daemon restarts/resets.

But even without that motivation, if I have a cell with 4 DB
servers, the fact that 1 is down should really have little or no
noticeable impact, or else the redundancy they provide is not
that worthwhile...

 You would need to keep the server-side CellServDB accurate on
 the dbservers in order for them to work,

Well, "accurate" perhaps is a bit too strong: there needs to be
enough listed to form a quorum, there should be at least
one that is common between the client 'CellServDB' and the
'server/CellServDB', and ideally all the 'server/CellServDB'
members should also be in the client 'CellServDB' (here I guess
everybody understands that 'CellServDB' means that file or the
equivalent mechanism in DNS etc.).

Because the crucial properties are:

  * DB servers for the wrong cell name or without the key don't matter.
  * DB servers outside the quorum for a cell name don't matter.
  * All quorum members know each other and which of them is
   the sync site.

and therefore I hope this happens:

  * Each DB server knows which cell it belongs to and its key(s).
  * Each DB server knows whether it is part of the quorum, and
the list of quorum members.
  * Each DB server that is part of the quorum knows which one
is the sync site.

Then whenever a DB server contacts or is contacted by another DB
server, it should:

  * Check that the cell name is the same.
  * Verify the cell is the same using the shared key.
  * Ask the other server for a list of quorum members.
  * Check whether the other server is in its quorum list:
- If missing, add to quorum list and trigger an election.
- If present, check the 'sync site' is the same:
  o If not same, trigger an election.

 CellServDB files can be missing dbservers.

What clients (cache or tools) should probably do (sketched in code below):

  * Contact all 'CellServDB' servers quickly.
  * For each or first DB server that replies, check whether it
    has the same cell name.
  * If it is the same cell name, try to use a token to get the
    list of quorum members:
    - If that fails, the DB server does not have the right key,
      so skip it.
    - If that succeeds:
      o Choose a quorum member at random for a query.
      o Choose the sync-site for an update.
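
Illustrated as a C sketch of the control flow proposed above, not current OpenAFS behaviour; every function called here is a hypothetical placeholder for the corresponding (authenticated) RPC.

    #include <stddef.h>

    /*
     * Probe everything quickly, discard wrong-cell or wrong-key
     * servers, then use any quorum member for reads and the sync site
     * for writes.  A real implementation might also randomise among the
     * usable members instead of taking the first one found.
     */
    struct db_candidate {
        unsigned int addr;          /* IPv4 address from CellServDB */
        int alive;                  /* answered the quick probe */
        int is_sync_site;           /* reported by the quorum listing */
    };

    extern void probe_all(struct db_candidate *c, size_t n);   /* quick parallel probe */
    extern int same_cell_name(const struct db_candidate *c);   /* placeholder RPC */
    extern int keyed_quorum_list(struct db_candidate *c);      /* fails without the key */

    static struct db_candidate *
    choose_dbserver(struct db_candidate *c, size_t n, int want_write)
    {
        probe_all(c, n);
        for (size_t i = 0; i < n; i++) {
            if (!c[i].alive || !same_cell_name(&c[i]) || !keyed_quorum_list(&c[i]))
                continue;                     /* down, wrong cell, or wrong key */
            if (want_write && !c[i].is_sync_site)
                continue;                     /* updates must go to the sync site */
            return &c[i];
        }
        return NULL;                          /* no usable dbserver found */
    }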

 This won't work if a client needs the sync-site, and the
 sync-site is missing from the CellServDB, but in all other
 situations, that should work fine.

Then current client libraries could be improved, because any
quorum member could be asked for the address of the sync-site.


[OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-23 Thread Andrew Deason
On Thu, 23 Jan 2014 14:58:35 +
p...@afs.list.sabi.co.uk (Peter Grandi) wrote:

 The issue is that with 'server/CellServDB' update there is
 potentially a DB daemon (PT, VL) restart (even if the rekeying
 instructions hint that when the mtime of 'server/CellServDB'
 changes the DB daemons reread it) and in any case a sync site
 election.

The daemons do reread the local configuration if the CellServDB mtime
changes. But they don't reinitialize the voting algorithm data and rx
connections etc that would be required to incorporate a new dbserver
into the quorum. So, for that you need to restart, yes.

  You would need to keep the server-side CellServDB accurate on
  the dbservers in order for them to work, but the client
  CellServDB files can be missing dbservers. [ ... ]
 
 It would be nice to know more about the details here to make
 planning easier in future updates.

I'm not sure what additional details you want. You just always make sure
the client CellServDB doesn't refer to dbservers that don't exist. So,
when you add a new dbserver, don't add it to the client CellServDB until
it's up and running. And when you remove a dbserver, remove it from the
client CellServDB before decommissioning it.

 For example in an ideal world putting more or fewer DB servers in
 the client 'CellServDB' should not matter, as long as one that
 belongs to the cell is up; again if the logic were for all types
 of client: scan quickly the list of potential DB servers, find
 one that is up, belongs to the cell, and reckons it is part of
 the quorum, and if necessary get from it the address of the sync
 site.

There is an idea we had pending for performing a VL_ProbeServer multi_rx
call on 'vos' startup to see which servers are up before doing anything.
The possible argument against this is that it adds a little bit of load
and a little bit of delay on every operation, even if all of the servers
are up. But maybe it's worth it.

Another possible optimization that can be made is that ubik-using
utilities could try the lowest-ip dbserver first when doing something
that requires db write access (or just randomly pick a site from the
lowest half+1 of the quorum), which would speed up the process in a
majority of cases. The argument against that, of course, is that the
lowest IP heuristic may not always apply in future implementations of
ubik, and in general it can make the minority of cases worse (when the
lower IPs are unreachable).
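
A sketch of that heuristic, for illustration only (not OpenAFS code), assuming 'addrs' is the dbserver list sorted by ascending address.

    #include <stddef.h>
    #include <stdlib.h>

    /*
     * For a write, pick a random server from the lowest floor(n/2)+1
     * addresses, on the assumption (true of the current election
     * heuristic, not guaranteed for future ubik implementations) that
     * the sync site tends to be among the lower-addressed quorum
     * members.
     */
    static unsigned int
    pick_write_candidate(const unsigned int *addrs, size_t n)
    {
        size_t pool = n / 2 + 1;        /* the lowest half, plus one */
        if (pool > n)
            pool = n;
        return addrs[(size_t)rand() % pool];
    }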

-- 
Andrew Deason
adea...@sinenomine.net



Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-23 Thread Jeffrey Hutzelman
On Thu, 2014-01-23 at 10:44 -0600, Andrew Deason wrote:


  For example in an ideal world putting more or fewer DB servers in
  the client 'CellServDB' should not matter, as long as one that
  belongs to the cell is up; again if the logic were for all types
  of client: scan quickly the list of potential DB servers, find
  one that is up, belongs to the cell, and reckons it is part of
  the quorum, and if necessary get from it the address of the sync
  site.

The problem is that you want the client to scan quickly to find a server
that is up, but because networks are not perfectly reliable and drop
packets all the time, it cannot know that a server is not up until that
server has failed to respond to multiple retransmissions of the request.
Those retransmissions cannot be sent quickly; in fact, they _must_ be
sent with exponentially-increasing backoff times.  Otherwise, when your
network becomes congested, the retransmission of dropped packets will
act as a runaway positive feedback loop, making the congestion worse and
saturating the network.

-- Jeff



Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-23 Thread Jeffrey Hutzelman
On Thu, 2014-01-23 at 14:58 +, Peter Grandi wrote:


 My real issue was 'server/CellServDB' because we could not
 prepare ahead of time all 3 new servers, but only one at a time.

 The issue is that with 'server/CellServDB' update there is
 potentially a DB daemon (PT, VL) restart (even if the rekeying
 instructions hint that when the mtime of 'server/CellServDB'
 changes the DB daemons reread it) and in any case a sync site
 election.

 Because each election causes a blip for the clients I would
 rather change the 'server/CellServDB' by putting in extra
 entries ahead of time or leaving in entries for disabled
 servers, to reduce the number of times elections are triggered.
 Otherwise I can only update one server per week...

There's not really any such thing as a new election.  Elections happen
approximately every 15 seconds, all the time.  An interruption in
service occurs only when an election _fails_; that is, when no one
server obtains the votes of more than half of the servers that exist(*).
That can happen if not enough servers are up, of course, but it can also
happen when one or more servers that are up are unable to vote for the
ideal candidate.  Generally, the rule is that one cannot vote for two
different servers within 75 seconds, or vote for _any_ server within 75
seconds of startup.


As a practical matter, what this means when restarting database
servers for config updates is that you must not restart them all at the
same time.  You _can_ restart even the coordinator without causing an
interruption in service longer than the time it takes the server to
restart (on the order of milliseconds, probably).  Even though the
server that just restarted cannot vote for 75 seconds, that doesn't mean
it cannot run in _and win_ the election.  However, after restarting one
server, you need to wait for things to completely stabilize before
restarting the next one.  This typically takes 75 to 90 seconds, and
can be observed in the output of 'udebug'.  What you are looking for is
for the recovery state to be 'f' or '1f', and for the coordinator to be
getting yes votes from every server you think is supposed to be up.

Of course, you _will_ have an interruption in service when you retire
the machine that is the coordinator.  At the moment, there is basically
no way to avoid that.  However, if you plan and execute the transition
carefully, you only need to take that outage once.



(*) Special note:  The server with the lowest IP address gets an extra
one-half vote, but only when voting for itself.  This helps to break
ties when the CellServDB contains an even number of servers.


 Ideally if I want to reshape the cell from DB servers 1, 2, 3 to
 4, 5, 6, I'd love to be able to do it by first putting in the
 'server/CellServDB' all 6, with 4, 5, 6 not yet available, and
 only at the end removing 1, 2, 3. Which does not play well (if
 one of the 3 live servers fails) with the quorum :-) so I went
 halfway.

This doesn't work because, with 6 servers in the CellServDB, to maintain
a quorum you must have four servers running, or three servers if one of
them is the one with the lowest address.  In fact, you can't even
transition safely from three to four servers, because once you have
four servers in your CellServDB, if the one with the lowest address goes
down before the new server is brought up, you'll have two out of four
servers up and no quorum.  
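
The arithmetic above can be written down directly; this is an illustration of the stated rule (in doubled half-vote units to stay in integers), not the ubik source.

    #include <stdio.h>

    /*
     * Each server voting for the candidate is worth 2 half-votes, and
     * the candidate with the lowest IP address gets 1 extra half-vote
     * when voting for itself.  The candidate wins when it holds strictly
     * more than half of the total weight, i.e. when half_votes >
     * n_servers.
     */
    static int
    can_win_election(int n_servers, int votes_for_candidate,
                     int candidate_is_lowest_address)
    {
        int half_votes = 2 * votes_for_candidate
                       + (candidate_is_lowest_address ? 1 : 0);
        return half_votes > n_servers;
    }

    int
    main(void)
    {
        /* 6-server CellServDB: 4 plain votes needed, or 3 including the lowest. */
        printf("%d %d %d\n",
               can_win_election(6, 4, 0),   /* 1: quorum */
               can_win_election(6, 3, 0),   /* 0: no quorum */
               can_win_election(6, 3, 1));  /* 1: quorum thanks to the half vote */
        return 0;
    }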

However, you can safely and cleanly transition to and from larger
numbers of servers, one server at a time.  Just be sure that before you
start up a new server, every existing server has been restarted with a
CellServDB naming that server.  Similarly, make sure to shut a server
down before removing it from remaining servers' CellServDB files.

At one point, I believe I worked out a sequence involving careful use of
out-of-sync CellServDB files and the -ubiknocoord option (gerrit #2287)
to allow safely transitioning from 3 servers to 4.  However, this is not
recommended unless you have a deep understanding of the election code,
because it is easy to screw up and create a situation where you can have
two sync sites.


I also worked out (but never implemented) a mechanism to allow an
administrator to trigger a clean transition of the coordinator role from
one server to another _without_ a 75-second interruption.  I'm sure at
some point that we'll revisit that idea.

-- Jeff



[OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-23 Thread Andrew Deason
On Thu, 23 Jan 2014 14:33:58 -0500
Jeffrey Hutzelman jh...@cmu.edu wrote:

 The problem is that you want the client to scan quickly to find a server
 that is up, but because networks are not perfectly reliable and drop
 packets all the time, it cannot know that a server is not up until that
 server has failed to respond to multiple retransmissions of the request.

So... what about issuing a multi_rx VL_ProbeServer, like I said in the
removed context? You only need to wait for one site to respond, and then
you can immediately kill the other calls.

-- 
Andrew Deason
adea...@sinenomine.net



[OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-23 Thread Andrew Deason
On Thu, 23 Jan 2014 15:39:03 +
p...@afs.list.sabi.co.uk (Peter Grandi) wrote:

  Oh also, I'm not sure why you're adding the new machines to
  the CellServDB before the new server is up. You could bring up
  e.g. dbserver 4, and only after you're sure it's up and
  available, then add it to the client CellServDB. Then remove
  dbserver #3 from the client CellServDB, and then turn off
  dbserver #3.
 
 As mentioned at greater length in a just-sent message: to minimize the
 number of DB daemon restarts/resets.

I thought you were already migrating one server at a time, so my
suggestion doesn't impact the number of dbserver restarts. In fact, what
I was suggesting doesn't really touch the servers at all; the only
difference I was suggesting was changing when you modify the client-side
CellServDB relative to when you perform the dbserver migrations.

 But even without that motivation, if I have a cell with 4 DB servers,
 the fact that 1 is down should really have little or no noticeable
 impact, or else the redundancy they provide is not that worthwhile...

I think it's quite worthwhile to have the database be available yet
slow, as opposed to not being available at all.

But yes, I believe there is at least 1 way in which the relevant code
might be improved. In the meantime, there are existing procedures with
the existing code that existing sites perform to mostly avoid the
problems you are seeing. It's up to you if you want to use them.

-- 
Andrew Deason
adea...@sinenomine.net



[OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-17 Thread Andrew Deason
On Fri, 17 Jan 2014 18:50:13 +
p...@afs.list.sabi.co.uk (Peter Grandi) wrote:

   What rules do the OpenAFS tools use to contact one of
   the DB servers?

Most of the time (read requests), we'll pick a random dbserver, and
use it. If contacting a dbserver fails for network reasons, we will try
to avoid that server unless we run out of servers.

Possibly the big difference in the behavior you're seeing is that the
kernel clients only perform unauthenticated read operations on the
dbservers. Userspace tools like vos sometimes need to perform write
operations, which changes things somewhat.

For a write operation to the db, things are slightly different because
we must pick the sync site; we cannot access just any dbserver to
fulfill the request. If there are 3 or fewer dbservers, we pick randomly
like we do for read operations. If there are more than 3 sites, we ask
one of the sites who the sync site is, and then we contact the sync site
in order to perform the request. If the request is successful, we
remember who the sync site is, and keep using it until we get a network
error (or an "I'm not the sync site" error).
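
Paraphrased as a C sketch (not the actual ubik client code); ask_for_sync_site stands in for the real sync-site query (a GetSyncSite-style call).

    #include <stdlib.h>

    /*
     * Reads go to a random dbserver.  Writes must reach the sync site:
     * with 3 or fewer sites just try one at random; with more, first
     * ask any site who the sync site is and remember the answer until
     * it stops working (the caller would clear cached_sync_site on a
     * network or "not the sync site" error).
     */
    struct cell_sites {
        unsigned int addrs[16];          /* dbserver addresses from CellServDB */
        int n;
        unsigned int cached_sync_site;   /* last known sync site, 0 = unknown */
    };

    extern unsigned int ask_for_sync_site(unsigned int any_dbserver);  /* placeholder */

    static unsigned int
    pick_site(struct cell_sites *c, int is_write)
    {
        /* Reads, and writes in small (<= 3 site) cells: pick at random. */
        if (!is_write || c->n <= 3)
            return c->addrs[rand() % c->n];

        /* Larger cells: locate and remember the sync site for writes. */
        if (c->cached_sync_site == 0)
            c->cached_sync_site = ask_for_sync_site(c->addrs[rand() % c->n]);
        return c->cached_sync_site;
    }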

With userspace tools, we have no way of remembering which servers are
down between invocations, so each time a tool runs, it picks a random
server again. The kernel clients are running for a much longer period of
time, so presumably if we contact a downed dbserver, the client will not
try to contact that dbserver for quite some time.

That's just for choosing a dbserver site, though; if you want to know
how long we take to fail to connect to a specific site:

   I have a single-host test OpenAFS cell with 1.6.5.2, and I
   have added a second IP address to '/etc/openafs/CellServDB'
   with an existing DNS entry (just to be sure) but not assigned
   to any machine: sometimes 'vos vldb' hangs for a while (105
   seconds), doing 8 attempts to connect to the down DB server;

I'm not sure how you are determining that we're making 8 attempts to
contact the down server. Are you just seeing 8 packets go by? We can
send many packets for a single attempt to contact the remote site. By
default OpenAFS code tends to wait about 50 seconds for a site to
respond to a request; vos sets this to 90 seconds for most things (I
don't know why), during which period it will retry sending packets. 105
seconds is close enough that that should explain it; the timeouts are
not always exact, since we kind of poll outstanding calls to see if they
have timed out.
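
For reference, those 50/90 second figures are Rx's connection "dead time"; a sketch of a tool raising the process-wide default, assuming the rx_SetRxDeadTime entry point behaves as remembered (check rx.h).

    #include <afs/param.h>
    #include <rx/rx.h>

    /*
     * Set the process-wide Rx dead time before creating connections.
     * 90 is simply the value vos is said to use above, not a
     * recommendation.
     */
    static void
    sketch_set_timeouts(void)
    {
        rx_SetRxDeadTime(90);   /* wait up to ~90s for a silent server before failing */
    }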

 The OpenAFS client caches seemed to cope well as expected, as in
 a cell with a quorum of 3 up DB servers, and 1 down. I
 think the only consequence I noticed was sometimes 'aklog'
 taking around 15 seconds.

The kernel client will not notice changes to the CellServDB until you
restart it, or run 'fs newcell'. The client also usually doesn't need to
contact the dbservers very often; it could easily take an hour for you
to notice even if all of the dbservers were down. If the client hits a
downed dbserver, it will hang, too (at least around 50 seconds).

 However *some* backups started to hang and some AFS-volumes
 became inaccessible to all clients. The fairly obvious cause was
 that the cloning transaction instead of being very quick would
 not end, and cloning locks the AFS-volume.

I _think_ this is because we are hanging on allocating a new volume id
for the temporary clone. If you run with -verbose, do you see
"Allocating new volume id for clone of volume..." before it hangs?

We could possibly do that before we mark the volume as busy, but then
we might allocate a vol id we never use, if the volume isn't usable.
Maybe that's better, though. Fixing that doesn't eliminate the hanging
behavior you're seeing, but it would mean the volume would be accessible
to clients while 'vos' is hanging.

May I ask why you are not just dumping .backup volumes? You could create
the .backup volumes en masse with 'vos backupsys', and then you could
just 'vos dump' them afterwards. Performing big bulk operations like
that as much as possible would make the tools more resilient to the
errors you are seeing, since then the tool is a single command and can
remember which dbserver is down.

 With a curious attempt to open $HOME/.AFSSERVER (which did not
 exist); the 1.6.5.2 'vos' also tries to open /.AFSSERVER.

This is for rmtsys support with certain environments usually involving
the NFS translator. I assume this happens when 'vos' tries to get your
tokens in order to authenticate; if that's correct, it'll go away if you
run with -noauth or -localauth.

-- 
Andrew Deason
adea...@sinenomine.net



[OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-17 Thread Andrew Deason
On Fri, 17 Jan 2014 18:50:13 +
p...@afs.list.sabi.co.uk (Peter Grandi) wrote:

 Planned to do this incrementally by adding a new DB server to the
 'CellServDB', then starting it up, then removing an old DB
 server, and so on until all 3 have been replaced in turn with
 new DB servers #4, #5, #6.
 
 At some point during this slow incremental plan there were 4
 entries in both 'CellServDB's and the new one had not been
 started up yet, and would not be for a couple days.

Oh also, I'm not sure why you're adding the new machines to the
CellServDB before the new server is up. You could bring up e.g. dbserver
#4, and only after you're sure it's up and available, then add it to the
client CellServDB. Then remove dbserver #3 from the client CellServDB,
and then turn off dbserver #3.

You would need to keep the server-side CellServDB accurate on the
dbservers in order for them to work, but the client CellServDB files can
be missing dbservers. This won't work if a client needs the sync-site,
and the sync-site is missing from the CellServDB, but in all other
situations, that should work fine.

-- 
Andrew Deason
adea...@sinenomine.net



Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-17 Thread Jeffrey Hutzelman
On Fri, 2014-01-17 at 14:12 -0600, Andrew Deason wrote:



 time, so presumably if we contact a downed dbserver, the client will not
 try to contact that dbserver for quite some time.

To elaborate: the cache manager keeps track of every server, and
periodically sends a sort of ping to each server to find out which
servers are up.  So, it will discover a server is down even if you're
not using it.  And, other than the periodic pings, the cache manager
will never direct a request to a server it thinks is down.  So, failover
for the CM itself is automatic, persistent, and often completely
transparent.

The fileserver works a little differently, but also keeps track of which
server it is using, fails over when that server stops responding, and
generally avoids switching when it doesn't need to.

Ubik database servers all communicate among themselves, which is a
necessary part of the database replication mechanism.  That happens even
when one server is down, but in such a way that you'll never notice a
communication failure between dbservers except in an unusual combination
of circumstances which can sometimes happen if a server goes down while
you are making a request that requires writing to the database.



I have a single-host test OpenAFS cell with 1.6.5.2, and I
have added a second IP address to '/etc/openafs/CellServDB'
with an existing DNS entry (just to be sure) but not assigned
to any machine: sometimes 'vos vldb' hangs for a while (105
seconds), doing 8 attempts to connect to the down DB server;
 
 I'm not sure how you are determining that we're making 8 attempts to
 contact the down server. Are you just seeing 8 packets go by? We can
 send many packets for a single attempt to contact the remote site.

Right.  Even though AFS communicates over UDP, which itself is
connectionless, Rx does have the notion of connections and includes a
full transport layer including retransmission, sequencing, flow control,
and exponential backoff for congestion control.  What you are actually
seeing is multiple retransmissions of a request, which may or may not be
the first packet in a new connection.  The packet is retransmitted
because the server did not reply with an acknowledgement, and the
intervals get longer because of exponential backoff, which is a key
factor in making sure that congested networks eventually get better
rather than only getting worse.
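
A sketch of that retransmission pattern in C: send_probe and wait_for_ack are hypothetical placeholders and the caller supplies the constants; this is not Rx's actual timer code.

    /*
     * One logical request, retransmitted with exponentially growing
     * (and capped) gaps until the peer acknowledges it or we give up.
     */
    extern void send_probe(void);                /* transmit the request packet */
    extern int wait_for_ack(int timeout_ms);     /* 1 if acknowledged in time */

    static int
    send_with_backoff(int first_timeout_ms, int max_timeout_ms, int max_tries)
    {
        int timeout = first_timeout_ms;
        for (int attempt = 1; attempt <= max_tries; attempt++) {
            send_probe();
            if (wait_for_ack(timeout))
                return 1;                        /* server answered: it is up */
            /* No ack: assume the packet (or its ack) was lost, and back
             * off by doubling the wait so a congested network is not
             * flooded with ever more retransmissions. */
            timeout *= 2;
            if (timeout > max_timeout_ms)
                timeout = max_timeout_ms;
        }
        return 0;                                /* declare the server down */
    }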


-- Jeff



Re: [OpenAFS] Re: DB servers quorum and OpenAFS tools

2014-01-17 Thread Jeffrey Hutzelman
On Fri, 2014-01-17 at 14:21 -0600, Andrew Deason wrote:
 On Fri, 17 Jan 2014 18:50:13 +
 p...@afs.list.sabi.co.uk (Peter Grandi) wrote:
 
  Planned to do this incrementally by adding a new DB server to the
  'CellServDB', then starting it up, then removing an old DB
  server, and so on until all 3 have been replaced in turn with
  new DB servers #4, #5, #6.
  
  At some point during this slow incremental plan there were 4
  entries in both 'CellServDB's and the new one had not been
  started up yet, and would not be for a couple days.
 
 Oh also, I'm not sure why you're adding the new machines to the
 CellServDB before the new server is up. You could bring up e.g. dbserver
 #4, and only after you're sure it's up and available, then add it to the
 client CellServDB. Then remove dbserver #3 from the client CellServDB,
 and then turn off dbserver #3.

Yup; that's the sane thing to do.  New servers should be in service
before you publish them in AFSDB or SRV records or in clients'
CellServDB files, and old servers should not be removed from service
until after they have been unpublished and all the clients you care
about have picked up the change.

 You would need to keep the server-side CellServDB accurate on the
 dbservers in order for them to work, but the client CellServDB files can
 be missing dbservers. This won't work if a client needs the sync-site,
 and the sync-site is missing from the CellServDB, but in all other
 situations, that should work fine.

This is what gerrit #2287 is about.  It adds a switch that will allow
you to configure your dbservers so that they will not be elected
coordinator.  Unpublished servers should be run with this switch, or
configured as non-voting servers, so that they don't become sync site.

Unfortunately, progress on getting that merged has been stalled for a
while, in no small part because there are changes still needed and a
related patch required significant rework, and I haven't had time to
touch this stuff in a few months.  So in the meantime, the best you can
do is ensure the unpublished server will not become sync site by some
combination of careful selection of the IP addresses involved, careful
monitoring and management of the election process, and/or marking the
unpublished server as nonvoting.  Some care is required for nonvoting
servers, as in theory all dbservers must agree on who the voting servers
are.  Some mismatches are possible and even safe, but figuring out
which those are and what the behavior will be requires a thorough
understanding of what checks are done and how the voting process works.


-- Jeff
