Re: [OpenAFS] Weird Quorum Issues

Jeffrey Hutzelman Fri, 07 Nov 2003 03:40:21 -0800

On Friday, November 07, 2003 07:40:49 +0000 Dr A V Le Blanc <[EMAIL PROTECTED]> wrote:

On Wed, 05 Nov 2003 at 21:50:09 -0500, Aaron Stanley
<[EMAIL PROTECTED]>:

I was away from my cluster for two days and when I got back I noticed...

u: No Quorum Elected

...

I ran a udebug on all three of my vl servers for both ports 7002 and
7004. On the primary (largest) fileserver, the output would waffle
between the normal "I am sync site" and the not normal "I am not sync
site".  I thought at first that it might be a network issue, but pings
between servers were great and bandwidth was not an issue.


We currently have three DB servers, two machines running IRIX 6.5 and
openafs 1.2.10, and one Linux box running 2.4.22 and openafs 1.2.10.
They are all on the same subnet and have only the one IP address
each, but often enough there are quorum problems.  For example,
yesterday I checked and found no quorum for the protection server
(7002), although both the volume location server (7003) and kaserver
(7004) had sync sites, for some reason on different machines.  All
three servers had been up and running without restarts for more than
two months.  I stopped the ptserver processes and restarted them
one by one: a quorum was elected in about 5 minutes.

We have seen this kind of problem occasionally, though infrequently,
for several years.  I've always assumed it just happens because of
bugs in ubik; doesn't everyone have this kind of problem?  The main
nuisance occurs when the jobs to create new users fail because they
can't make entries in the ka or pt database, and I end up with a
partially created user.

We certainly don't have this problem. The only time I see lack of quorum or a coordinator change is when there is a network problem or a server that really is down or restarting. Of course, once elected, a server remains coordinator until it goes down or there are not enough votes to sustain it. So if your normal "lowest" machine restarts for some reason, the next one will become and stay coordinator. We see that behaviour on a regular basis, because each of our dbservers restarts on a different day of the week.
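
For anyone who wants to watch for this kind of coordinator change, here is a rough Python sketch that classifies udebug output by the two phrases quoted earlier in the thread ("I am sync site" and "I am not sync site"). The function names, the host names, and the timeout are my own illustrative assumptions, and it presumes a udebug binary on the PATH that accepts -server and -port; treat it as a starting point rather than a tested monitoring tool.

```python
import subprocess


def classify_udebug(output):
    """Classify udebug text by the phrases quoted in this thread."""
    # Check the negative phrase first so it cannot be mistaken
    # for the positive one.
    if "I am not sync site" in output:
        return "not sync site"
    if "I am sync site" in output:
        return "sync site"
    return "unknown"


def sync_site_status(host, port, timeout=15):
    """Run udebug against one database server and classify the result.

    Assumes udebug is installed and on the PATH; host and port are
    whatever your cell uses (for example 7002 for the ptserver).
    """
    result = subprocess.run(
        ["udebug", "-server", host, "-port", str(port)],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return classify_udebug(result.stdout)


# Hypothetical usage (host names are placeholders, not from the thread):
# for host in ("db1.example.org", "db2.example.org", "db3.example.org"):
#     print(host, sync_site_status(host, 7002))
```

Running this periodically against each dbserver and port, and logging the result with a timestamp, should show whether the sync site is really flapping or has simply moved after a restart.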

It's worth noting that the Ubik election algorithm is very sensitive to proper time synchronization. The maximum permitted clock skew between any two servers is 10 seconds; more than this, and the election algorithm will break down. Thus, running NTP is critical for database servers.
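
As a minimal sketch of what the 10-second limit means in practice, the Python below flags any pair of servers whose clocks differ by more than that. It assumes you have already measured each server's clock offset in seconds against some common reference (how you get those numbers, for example from your NTP tooling, is left out); the function name, host names, and sample offsets are made up for illustration.

```python
from itertools import combinations

MAX_SKEW_SECONDS = 10  # limit mentioned in the message above


def skewed_pairs(offsets, limit=MAX_SKEW_SECONDS):
    """Return (host_a, host_b, skew) for each pair exceeding the limit.

    offsets maps a host name to its clock offset in seconds relative
    to a common reference clock.
    """
    bad = []
    for (a, off_a), (b, off_b) in combinations(sorted(offsets.items()), 2):
        skew = abs(off_a - off_b)
        if skew > limit:
            bad.append((a, b, skew))
    return bad


# Hypothetical offsets; db3 is far enough off to upset the other two.
example = {"db1": 0.2, "db2": -0.4, "db3": 12.5}
for host_a, host_b, skew in skewed_pairs(example):
    print(f"{host_a} and {host_b} differ by {skew:.1f} seconds")
```

Note that the comparison is pairwise: two servers that are each within a few seconds of the reference, but in opposite directions, can still be too far apart from each other.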

-- Jeffrey T. Hutzelman (N3NHS) <[EMAIL PROTECTED]>
  Sr. Research Systems Programmer
  School of Computer Science - Research Computing Facility
  Carnegie Mellon University - Pittsburgh, PA

_______________________________________________
OpenAFS-info mailing list
[EMAIL PROTECTED]
https://lists.openafs.org/mailman/listinfo/openafs-info
