On Friday, November 07, 2003 07:40:49 +0000 Dr A V Le Blanc <[EMAIL PROTECTED]> wrote:
On Wed, 05 Nov 2003 at 21:50:09 -0500, Aaron Stanley <[EMAIL PROTECTED]>:
I was away from my cluster for two days and when I got back I noticed......
u: No Quorum Elected
I ran udebug on all three of my vl servers for both ports 7002 and 7004. On the primary (largest) fileserver, the output would waffle between the normal "I am sync site" and the abnormal "I am not sync site". I thought at first that it might be a network issue, but pings between the servers were fine and bandwidth was not an issue.
We currently have three DB servers: two machines running IRIX 6.5 and OpenAFS 1.2.10, and one Linux box running kernel 2.4.22 and OpenAFS 1.2.10. They are all on the same subnet and have only one IP address each, but we see quorum problems fairly often. For example, yesterday I checked and found no quorum for the protection server (7002), while both the volume location server (7003) and the kaserver (7004) had sync sites, although for some reason on different machines. All three servers had been up and running without restarts for more than two months. I stopped the ptserver processes and restarted them one by one: a quorum was elected in about 5 minutes.
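A rough sketch of the per-port check described above, assuming the udebug output for each server and port has already been captured as text. It only looks for the two phrases quoted in this thread; real udebug output contains more than that, and the names `is_sync_site` and `ports_without_sync_site` and the shape of the `reports` dictionary are made up for illustration.

```python
# Phrases as quoted earlier in this thread; real udebug output has more text.
SYNC = "I am sync site"
NOT_SYNC = "I am not sync site"


def is_sync_site(udebug_text):
    """Return True or False if the text states sync status, None if unknown."""
    if NOT_SYNC in udebug_text:
        return False
    if SYNC in udebug_text:
        return True
    return None


def ports_without_sync_site(reports):
    """List the ports on which no server claims to be the sync site.

    reports maps (server, port) -> captured udebug text. This input shape
    is a hypothetical convenience, not something udebug itself produces.
    """
    ports = {port for (_server, port) in reports}
    with_sync = {
        port for (_server, port), text in reports.items() if is_sync_site(text)
    }
    return sorted(ports - with_sync)
```

A port that shows up in the result for more than a few minutes would match the "no quorum" situation described above.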
We have seen this kind of problem occasionally, though infrequently, for several years. I've always assumed it just happens because of bugs in ubik; doesn't everyone have this kind of problem? The main nuisance occurs when the jobs to create new users fail because they can't make entries in the ka or pt database, and I end up with a partially created user.
We certainly don't have this problem. The only time I see lack of quorum or a coordinator change is when there is a network problem or a server that really is down or restarting. Of course, once elected, a server remains coordinator until it goes down or there are not enough votes to sustain it. So if your normal "lowest" machine restarts for some reason, the next one will become and stay coordinator. We see that behaviour on a regular basis, because each of our dbservers restarts on a different day of the week.
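To illustrate the "sticky" behaviour, here is a toy model in Python. It is not the Ubik algorithm: it assumes a simple majority of all configured servers is needed, and it takes "lowest" to mean whatever sorts first. The function name `next_coordinator` and the server names are invented for the example.

```python
def next_coordinator(current, up_servers, all_servers):
    """Toy model of a sticky coordinator; NOT the real Ubik election.

    Assumptions for illustration only: a simple majority of all configured
    servers must be up, and the "lowest" server is the one that sorts first.
    """
    quorum = len(all_servers) // 2 + 1
    if len(up_servers) < quorum:
        return None  # not enough votes to elect or sustain anyone
    if current in up_servers:
        return current  # the incumbent stays as long as it is up
    return min(up_servers)  # otherwise the lowest available server wins


servers = {"db1", "db2", "db3"}
coord = next_coordinator(None, servers, servers)             # all up
after_restart = next_coordinator(coord, {"db2", "db3"}, servers)  # db1 down
after_return = next_coordinator(after_restart, servers, servers)  # db1 back
```

In this model the coordinator role moves to db2 when db1 restarts and stays there after db1 returns, which is the pattern described above.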
It's worth noting that the Ubik election algorithm is very sensitive to proper time synchronization. The maximum permitted clock skew between any two servers is 10 seconds; more than this, and the election algorithm will break down. Thus, running NTP is critical for database servers.
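As a minimal sketch of that check, assuming you have already measured each server's clock offset in seconds against some common reference (how you obtain the offsets is up to you), the following flags any pair of servers whose clocks differ by more than the 10-second figure given above. The function name and the input dictionary are hypothetical.

```python
from itertools import combinations

MAX_SKEW_SECONDS = 10.0  # the limit stated in this thread


def skewed_pairs(offsets, limit=MAX_SKEW_SECONDS):
    """Return (server_a, server_b, skew) for every pair exceeding the limit.

    offsets maps server name -> clock offset in seconds from a common
    reference; this input format is an assumption made for the example.
    """
    return [
        (a, b, abs(offsets[a] - offsets[b]))
        for a, b in combinations(sorted(offsets), 2)
        if abs(offsets[a] - offsets[b]) > limit
    ]
```

Note that two servers can each be within 10 seconds of the reference and still be more than 10 seconds apart from each other, which is why the comparison is pairwise.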
-- Jeffrey T. Hutzelman (N3NHS) <[EMAIL PROTECTED]> Sr. Research Systems Programmer School of Computer Science - Research Computing Facility Carnegie Mellon University - Pittsburgh, PA
_______________________________________________ OpenAFS-info mailing list [EMAIL PROTECTED] https://lists.openafs.org/mailman/listinfo/openafs-info
