Re: [Linux-HA] question regarding quorumd

Sebastian Reitenbach Tue, 20 Nov 2007 00:05:04 -0800

Hi,
Zhen Huang <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> The DC node should try to connect to the quorumd server periodically.
> If it does not, that should be considered a bug.


I first observed this behavior on a two-node Linux cluster. I have now run some
more tests with a two-node OpenBSD cluster, with the quorumd on a Linux box.

Here is what I observed. Test 1:
- configure usage of quorumd on the two heartbeat nodes
- start quorumd on the Linux node
- start the first cluster node
   - it starts communicating with the quorumd, gets quorum, and I can
     start managing resources
- start the second cluster node, and everything is still working well
- stop the quorumd
   - the DC keeps sending packets to the quorumd for about a minute, then
     stops and never starts again; the other node does not start
     trying to contact the quorumd either
- kill one of the cluster nodes; the remaining node then tries to
  contact the quorumd, fails because it is not running, and is left
  without quorum
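
For contrast, the behavior one would expect instead is a periodic retry that keeps probing the quorumd and notices when it comes back. The following is a minimal Python sketch of that idea, not heartbeat's actual code: `tcp_probe`, `poll_quorumd`, and all of their parameters are hypothetical names chosen for illustration, and the loop is bounded only so that the example terminates.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    A closed port answers with RST, which surfaces as ConnectionRefusedError
    (a subclass of OSError), so it is reported as "not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def poll_quorumd(try_connect, attempts, on_change=None):
    """Call try_connect() once per retry interval, `attempts` times in total.

    Returns the list of observed states (True = quorumd reachable) and calls
    on_change(state) whenever the state differs from the previous probe.
    A real daemon would sleep between probes and never stop retrying.
    """
    states = []
    previous = None
    for _ in range(attempts):
        try:
            reachable = bool(try_connect())
        except OSError:
            reachable = False
        if reachable != previous and on_change is not None:
            on_change(reachable)
        previous = reachable
        states.append(reachable)
    return states


# Simulated quorumd: down for three probes, then back up.
outcomes = iter([False, False, False, True, True])
changes = []
states = poll_quorumd(lambda: next(outcomes), attempts=5,
                      on_change=changes.append)
print(states)
print(changes)
```

With a real server, `try_connect` could be something like `lambda: tcp_probe(host, port)`, where the host and port are whatever the quorumd is configured to listen on.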

Test 2:
- configure usage of quorumd on the two heartbeat nodes
- do NOT start quorumd on the Linux node
- start the first cluster node and see it fail to contact the quorumd;
  it starts up the cluster without quorum (it sends only one packet to
  the quorumd, receives an RST packet, and seems never to try again)
- start the second cluster node; this seems to trigger the DC to retry
  contacting the quorumd (again only one packet, then nothing more)
- both cluster nodes then decide together that the cluster runs without
  quorum. Shouldn't the two cluster nodes be enough to acquire quorum?
- start the quorumd on the Linux box
- wait forever and see that the cluster nodes do not try to contact the
  quorumd again, so the cluster keeps thinking it has no quorum at all.


As mentioned, last week I initially observed this on a two-node Linux test
cluster with a third node running a quorumd, so it does not seem to be OS
related.


kind regards
Sebastian

> 
> Sebastian Reitenbach wrote:
> > Hi,
> > 
> > Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> >> On Nov 13, 2007, at 11:13 AM, Sebastian Reitenbach wrote:
> >>
> >>> Hi,
> >>>
> >>> Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>>> On Nov 9, 2007, at 4:34 PM, Sebastian Reitenbach wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I did some tests with a two node cluster and a third one running a
> >>>>> quorumd.
> >>>>>
> >>>>> I started the quorumd, and then the two cluster nodes.
> >>>>> The one that became DC started to communicate with the remote
> >>>>> quorumd.
> >>>> The CRM (and thus the "DC") doesn't know anything about quorumd
> >>>> I believe this is purely the domain of the CCM and I've no idea how
> >>>> that works :-)
> >>>>
> >>>> We just consume membership data from it...
> >>>>
> >>>> So anyway, my point is that the fact that a node is the DC is
> >>>> irrelevant when it comes to quorumd.
> >>> but somehow the cluster knows, as only the DC is communicating with
> >>> the external quorumd.
> >> I think that it's just a coincidence that it happens to be the DC...
> >> at least I hope it is.
> > I thought I read somewhere, that the DC is the one in charge of 
> > communicating with the remote quorumd, but I may be wrong here.
> > 
> >>> I just do not understand why the cluster does not try to
> >>> re-contact the quorumd after it lost the connection to it. This is
> >>> what I assumed: after a disconnect from the remote quorumd, the
> >>> cluster nodes should try to contact it, and once contact is
> >>> re-established, use it again.
> >> I agree - but I've never seen that code.  You'll have to contact Alan
> >> or file a bug for him.
> > Alan, in case you think this is a bug, I'll go create a bug report for
> > it. Please let me know.
> > 
> >>>>> I killed the DC, saw the other node become DC and start communicating
> >>>>> with the remote quorumd; all fine, cluster still with quorum.
> >>>>> Then I killed the quorumd itself; the DC noticed and started to
> >>>>> stop all resources because of the quorum_policy, as it had lost quorum.
> >>>>>
> >>>>> Then I restarted the quorumd, but the DC, still without quorum,
> >>>>> did not try to communicate with the quorumd again.
> >>>>> I'd expect the still-living DC to try to contact the quorumd in
> >>>>> case it comes back.
> >>>>>
> >>>>> If there is a good reason why the DC is not trying to reconnect to
> >>>>> the remote quorumd, I'd really like to be enlightened by someone who
> >>>>> knows.
> 
> It should be trying to reconnect.  It _does_ communicate w/quorumd from
> a single machine/cluster.  I think that it's coincidence that it's the
> DC.  Huang Zhen wrote the code.  I've CCed him.  I'm at the LISA
> conference this week - if HZ doesn't get back to you by next Monday,
> I'll look into it.
> 

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
