Hi,

On Tue, Jan 18, 2011 at 05:11:13PM +0200, Pavlos Polianidis wrote:
> Dear Andrew
> 
> So is there any solution to make the quorum operate?

Quorum is replaced by stonith in 2-node clusters. Other than
that, there seems to be a problem with the number of expected
votes:

> >>> Last updated: Thu Jan 13 16:00:15 2011
> >>> Stack: Heartbeat
> >>> Current DC: lsc-node02.velti.net (a7e25657-fb85-4cf1-9d9b-5a21484e1583) - 
> >>> partition WITHOUT quorum
> >>> Version: 1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782
> >>> 2 Nodes configured, unknown expected votes
> >>> 0 Resources configured.
> >>> ============

I don't know why that is, but I think somebody else recently reported
the same thing with heartbeat. Not sure though.
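
In the meantime it may be worth checking what the CIB actually holds
for the cluster options. A quick sketch (nothing fancy):

  # dump the cluster options section of the CIB
  cibadmin -Q -o crm_config

  # one-shot cluster summary, including the quorum state
  crm_mon -1

If expected-quorum-votes really is missing there, that narrows it down.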

Thanks,

Dejan

> Thanks in advance
> 
> Pavlos Polianidis
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Andrew Beekhof
> Sent: Tuesday, January 18, 2011 4:19 PM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] no quorum problem
> 
> On Tue, Jan 18, 2011 at 1:27 PM, Pavlos Polianidis
> <[email protected]> wrote:
> > Dear Andrew,
> >
> > I am not sure if I should say quorum daemon; I just mean HA heartbeat 
> > quorum.
> 
> Well, that's the CCM, which does appear to have started.
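> (easy enough to double-check with something like: ps ax | grep -w ccm)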
> 
> >
> >
> > -----Original Message-----
> > From: [email protected] 
> > [mailto:[email protected]] On Behalf Of Andrew Beekhof
> > Sent: Tuesday, January 18, 2011 2:03 PM
> > To: General Linux-HA mailing list
> > Subject: Re: [Linux-HA] no quorum problem
> >
> > On Tue, Jan 18, 2011 at 1:00 PM, Pavlos Polianidis
> > <[email protected]> wrote:
> >> Thank you Andrew,
> >>
> >> Since I am going to manage resources on 2 nodes, where each node is the 
> >> failover of the other but services run on both nodes and switch nodes if 
> >> needed, I cannot use STONITH or suicide methods. The nodes will be 
> >> connected to each other by LAN and also by a parallel cable to prevent 
> >> some connection losses.
> >> It seems that for some reason the Heartbeat quorum daemon fails to start
> >
> > wait a second... "quorum daemon" ?
> >
> >> and if I change the xml configuration manually to have-quorum="1", after a 
> >> few seconds it returns to have-quorum="0" :).
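> >> To watch it flip back I just run something like:
> >>
> >>   cibadmin -Q | grep have-quorum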
> >>
> >> What I found in the logs regarding ccm is below:
> >>
> >> Jan 17 14:12:12 lsc-node02.velti.net cib: [12449]: info: 
> >> crm_cluster_connect: Connecting to Heartbeat
> >> Jan 17 14:12:12 lsc-node02.velti.net cib: [12449]: info: ccm_connect: 
> >> Registering with CCM...
> >> Jan 17 14:12:12 lsc-node02.velti.net cib: [12449]: WARN: ccm_connect: CCM 
> >> Activation failed
> >> Jan 17 14:12:12 lsc-node02.velti.net cib: [12449]: WARN: ccm_connect: CCM 
> >> Connection failed 1 times (30 max)
> >> Jan 17 14:12:12 lsc-node02.velti.net heartbeat: [12414]: info: the send 
> >> queue length from heartbeat to client cib is set to 1024
> >> Jan 17 14:12:12 lsc-node02.velti.net crmd: [12453]: info: do_cib_control: 
> >> Could not connect to the CIB service: connection failed
> >> Jan 17 14:12:12 lsc-node02.velti.net crmd: [12453]: WARN: do_cib_control: 
> >> Couldn't complete CIB registration 1 times... pause and retry
> >> Jan 17 14:12:12 lsc-node02.velti.net crmd: [12453]: info: crmd_init: 
> >> Starting crmd's mainloop
> >> Jan 17 14:12:14 lsc-node02.velti.net crmd: [12453]: info: 
> >> crm_timer_popped: Wait Timer (I_NULL) just popped!
> >> Jan 17 14:12:15 lsc-node02.velti.net cib: [12449]: info: ccm_connect: 
> >> Registering with CCM...
> >> Jan 17 14:12:15 lsc-node02.velti.net cib: [12449]: WARN: ccm_connect: CCM 
> >> Activation failed
> >> Jan 17 14:12:15 lsc-node02.velti.net cib: [12449]: WARN: ccm_connect: CCM 
> >> Connection failed 2 times (30 max)
> >> Jan 17 14:12:15 lsc-node02.velti.net crmd: [12453]: info: do_cib_control: 
> >> Could not connect to the CIB service: connection failed
> >> Jan 17 14:12:15 lsc-node02.velti.net crmd: [12453]: WARN: do_cib_control: 
> >> Couldn't complete CIB registration 2 times... pause and retry
> >> Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: 
> >> attrd_ha_callback: flush message from lsc-node01.velti.net
> >> Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: 
> >> find_hash_entry: Creating hash entry for probe_complete
> >> Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: 
> >> attrd_perform_update: Delaying operation probe_complete=<null>: cib not 
> >> connected
> >> Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: 
> >> attrd_ha_callback: flush message from lsc-node01.velti.net
> >> Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: 
> >> find_hash_entry: Creating hash entry for last-failure-VIP
> >> Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: 
> >> attrd_perform_update: Delaying operation last-failure-VIP=<null>: cib not 
> >> connected
> >> Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: 
> >> attrd_ha_callback: flush message from lsc-node01.velti.net
> >> Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: 
> >> find_hash_entry: Creating hash entry for terminate
> >> Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: 
> >> attrd_perform_update: Delaying operation terminate=<null>: cib not 
> >> connected
> >> Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info: 
> >> attrd_ha_callback: flush message from lsc-node01.velti.net
> >> Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info: 
> >> find_hash_entry: Creating hash entry for shutdown
> >> Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info: 
> >> attrd_perform_update: Delaying operation shutdown=<null>: cib not connected
> >> Jan 17 14:12:16 lsc-node02.velti.net ccm: [12448]: info: 
> >> G_main_add_SignalHandler: Added signal handler for signal 15
> >> Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info: 
> >> attrd_ha_callback: flush message from lsc-node01.velti.net
> >> Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info: 
> >> find_hash_entry: Creating hash entry for fail-count-VIP
> >> Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info: 
> >> attrd_perform_update: Delaying operation fail-count-VIP=<null>: cib not 
> >> connected
> >> Jan 17 14:12:17 lsc-node02.velti.net crmd: [12453]: info: 
> >> crm_timer_popped: Wait Timer (I_NULL) just popped!
> >> Jan 17 14:12:18 lsc-node02.velti.net cib: [12449]: info: ccm_connect: 
> >> Registering with CCM...
> >>
> >>
> >>
> >>
> >> Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: 
> >> mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> >> Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: mem_handle_event: 
> >> Got an event OC_EV_MS_INVALID from ccm
> >> Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: 
> >> mem_handle_event: instance=4, nodes=2, new=2, lost=0, n_idx=0, new_idx=0, 
> >> old_idx=6
> >> Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: 
> >> mem_handle_event: instance=4, nodes=2, new=2, lost=0, n_idx=0, new_idx=0, 
> >> old_idx=6
> >> Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: 
> >> crmd_ccm_msg_callback: Quorum lost after event=INVALID (id=4)
> >> Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: 
> >> cib_ccm_msg_callback: Processing CCM event=INVALID (id=4)
> >> Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: 
> >> ccm_event_detail: INVALID: trans=4, nodes=2, new=2, lost=0 n_idx=0, 
> >> new_idx=0, old_idx=6
> >> Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: crm_get_peer: 
> >> Node lsc-node01.velti.net now has id: 1
> >> Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: 
> >> ccm_event_detail:     CURRENT: lsc-node01.velti.net [nodeid=1, born=2]
> >> Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: crm_update_peer: 
> >> Node lsc-node01.velti.net: id=1 state=member (new) addr=(null) votes=-1 
> >> born=2 seen=4 proc=00000000000000000000000000000100
> >> Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: 
> >> ccm_event_detail:     CURRENT: lsc-node02.velti.net [nodeid=3, born=4]
> >> Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: 
> >> crm_update_peer_proc: lsc-node01.velti.net.ais is now online
> >> Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: 
> >> ccm_event_detail:     NEW:     lsc-node01.velti.net [nodeid=1, born=2]
> >> Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: 
> >> crm_update_peer_proc: lsc-node01.velti.net.crmd is now online
> >> Jan 17 14:12:23 lsc-node02.velti.net crmd: [12453]: info: 
> >> ccm_event_detail:     NEW:     lsc-node02.velti.net [nodeid=3, born=4]
> >> Jan 17 14:12:23 lsc-node02.velti.net cib: [12449]: info: crm_get_peer: 
> >> Node lsc-node02.velti.net now has id: 3
> >> Jan 17 14:12:23 lsc-node02.velti.net crmd: [12453]: info: crm_get_peer: 
> >> Node lsc-node01.velti.net now has id: 1
> >> Jan 17 14:12:23 lsc-node02.velti.net cib: [12449]: info: crm_update_peer: 
> >> Node lsc-node02.velti.net: id=3 state=member (new) addr=(null)
> >>
> >>
> >>
> >> Jan 17 16:15:51 lsc-node02.velti.net cib: [16237]: info: 
> >> crm_update_peer_proc: lsc-node01.velti.net.crmd is now online
> >> Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: info: 
> >> ccm_event_detail:     NEW:     lsc-node01.velti.net [nodeid=1, born=2]
> >> Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: info: crm_get_peer: 
> >> Node lsc-node01.velti.net now has id: 1
> >> Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug: recv msg 
> >> CCM_TYPE_MEM_LIST from lsc-node02.velti.net, status:[null ptr]
> >> Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: info: crm_update_peer: 
> >> Node lsc-node01.velti.net: id=1 state=member (new) addr=(null) votes=-1 
> >> born=2 seen=2 proc=00000000000000000000000000000200
> >> Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: WARN: ccm_state_joined: 
> >> received message with unknown cookie, just dropping
> >> Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: info: 
> >> crm_update_peer_proc: lsc-node01.velti.net.ais is now online
> >> Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug: dump current 
> >> membership 0xf7ed5008
> >> Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: debug: 
> >> post_cache_update: Updated cache after membership event 2.
> >> Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug:       
> >> leader=lsc-node02.velti.net
> >> Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: debug: 
> >> post_cache_update: post_cache_update added action A_ELECTION_CHECK to the 
> >> FSA
> >> Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug:       
> >> transition=2
> >> Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: debug: do_fsa_action: 
> >> actions:trace:        // A_ELECTION_CHECK
> >> Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug:       
> >> status=CCM_STATE_JOINED
> >> Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: debug: 
> >> do_election_check: Ignore election check: we not in an election
> >> Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug:       
> >> has_quorum=0
> >> Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug:       
> >> nodename=lsc-node02.velti.net bornon=1
> >> Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug:       
> >> nodename=lsc-node01.velti.net bornon=2
> >> Jan 17 16:16:47 lsc-node02.velti.net crmd: [16241]: info: 
> >> crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
> >> Jan 17 16:16:47 lsc-node02.velti.net crmd: [16241]: debug: s_crmd_fsa: 
> >> Processing I_DC_TIMEOUT: [ state=S_PENDING cause=C_TIMER_POPPED 
> >> origin=crm_timer_popped ]
> >>
> >>
> >> Jan 17 13:59:34 lsc-node02.velti.net pengine: [23022]: debug: 
> >> unpack_config: Cluster is symmetric - resources can run anywhere by default
> >> Jan 17 13:59:34 lsc-node02.velti.net pengine: [23022]: debug: 
> >> unpack_config: Default stickiness: 0
> >> Jan 17 13:59:34 lsc-node02.velti.net pengine: [23022]: notice: 
> >> unpack_config: On loss of CCM Quorum: Ignore
> >> Jan 17 13:59:34 lsc-node02.velti.net pengine: [23022]: info: 
> >> unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> >> Jan 17 13:59:34 lsc-node02.velti.net pengine: [23022]: info: 
> >> determine_online_status: Node lsc-node02.velti.net is online
> >>
> >>
> >> Jan 17 14:10:52 lsc-node02.velti.net lrmd: [23007]: debug: on_receive_cmd: 
> >> the IPC to client [pid:23010] disconnected.
> >> Jan 17 14:10:52 lsc-node02.velti.net crmd: [23010]: info: do_lrm_control: 
> >> Disconnected from the LRM
> >> Jan 17 14:10:52 lsc-node02.velti.net lrmd: [23007]: debug: 
> >> unregister_client: client crmd [pid:23010] is unregistered
> >> Jan 17 14:10:52 lsc-node02.velti.net crmd: [23010]: debug: do_fsa_action: 
> >> actions:trace:        // A_CCM_DISCONNECT
> >> Jan 17 14:10:52 lsc-node02.velti.net crmd: [23010]: debug: do_fsa_action: 
> >> actions:trace:        // A_HA_DISCONNECT
> >> Jan 17 14:10:52 lsc-node02.velti.net ccm: [23005]: info: client 
> >> (pid=23010) removed from ccm
> >> Jan 17 14:10:52 lsc-node02.velti.net heartbeat: [22969]: debug: Signing 
> >> client 23010 off
> >> Jan 17 14:10:52 lsc-node02.velti.net crmd: [23010]: info: do_ha_control: 
> >> Disconnected from Heartbeat
> >> Jan 17 14:10:52 lsc-node02.velti.net heartbeat: [22969]: debug: 
> >> G_remove_client(pid=23010, reason='signoff' gsource=0x8836a00) {
> >>
> >>
> >> I would attach the full logs but they are too large :)
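> >> If it helps, hb_report should be able to trim things down to the
> >> relevant window, something like:
> >>
> >>   hb_report -f "2011/01/17 14:00" /tmp/lsc-report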
> >>
> >>
> >> Thanks in advance
> >>
> >> Pavlos Polianidis
> >>
> >>
> >> -----Original Message-----
> >> From: [email protected] 
> >> [mailto:[email protected]] On Behalf Of Andrew Beekhof
> >> Sent: Tuesday, January 18, 2011 11:02 AM
> >> To: General Linux-HA mailing list
> >> Subject: Re: [Linux-HA] no quorum problem
> >>
> >> On Thu, Jan 13, 2011 at 3:17 PM, Pavlos Polianidis
> >> <[email protected]> wrote:
> >>> Hello,
> >>>
> >>>
> >>> Currently I have installed heartbeat 3.0.2-2.el5 x86_64 and pacemaker 
> >>> 1.0.7-4.el5 x86_64 on a CentOS release 5.3 x86_64 machine using yum 
> >>> repositories.
> >>>
> >>> My configuration is below:
> >>> ha.cf
> >>>
> >>> debugfile /var/log/ha-debug
> >>> logfile /var/log/ha-log
> >>> logfacility     local0
> >>> compression_threshold 2
> >>> node    lsc-node01
> >>> node    lsc-node02
> >>> debug                          1
> >>> use_logd                       false
> >>> logfacility                    daemon
> >>> traditional_compression        off
> >>> compression                    bz2
> >>> coredumps                      true
> >>> udpport                        694
> >>> bcast                          eth0
> >>> autojoin                       any
> >>> keepalive                      1
> >>> warntime                       10
> >>> deadtime                       35
> >>> initdead                       40
> >>> max_rexmit_delay               10000
> >>> crm respawn
> >>>
> >>> but the output of the crm_mon command is below:
> >>>
> >>> Last updated: Thu Jan 13 16:00:15 2011
> >>> Stack: Heartbeat
> >>> Current DC: lsc-node02.velti.net (a7e25657-fb85-4cf1-9d9b-5a21484e1583) - 
> >>> partition WITHOUT quorum
> >>> Version: 1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782
> >>> 2 Nodes configured, unknown expected votes
> >>> 0 Resources configured.
> >>> ============
> >>>
> >>> Online: [ lsc-node02.velti.net lsc-node01.velti.net ]
> >>>
> >>>
> >>> Previously I experimented with the latest versions of heartbeat and 
> >>> pacemaker, but I downgraded to the current versions because I had the 
> >>> same problem there and read in the forums that some versions might have 
> >>> bugs.
> >>>
> >>> In the debug log I see the entry below:
> >>>
> >>> WARN: cluster_status: We do not have quorum - fencing and resource 
> >>> management disabled
> >>>
> >>> In the log:
> >>>
> >>> Jan 13 15:53:34 lsc-node02.velti.net crmd: [30853]: info: 
> >>> populate_cib_nodes_ha: Requesting the list of configured nodes
> >>> Jan 13 15:53:37 lsc-node02.velti.net crmd: [30853]: WARN: get_uuid: Could 
> >>> not calculate UUID for lsc-node02
> >>> Jan 13 15:53:37 lsc-node02.velti.net crmd: [30853]: WARN: 
> >>> populate_cib_nodes_ha: Node lsc-node02: no uuid found
> >>> Jan 13 15:53:38 lsc-node02.velti.net crmd: [30853]: WARN: get_uuid: Could 
> >>> not calculate UUID for lsc-node01
> >>> Jan 13 15:53:38 lsc-node02.velti.net crmd: [30853]: WARN: 
> >>> populate_cib_nodes_ha: Node lsc-node01: no uuid found
> >>> Jan 13 15:53:38 lsc-node02.velti.net crmd: [30853]: info: 
> >>> do_state_transition: All 1 cluster nodes are eligible to run resources.
> >>> Jan 13 15:53:38 lsc-node02.velti.net crmd: [30853]: info: 
> >>> do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
> >>> Jan 13 15:53:38 lsc-node02.velti.net crmd: [30853]: info: 
> >>> crm_update_quorum: Updating quorum status to false (call=22)
> >>> Jan 13 15:53:38 lsc-node02.velti.net attrd: [30852]: info: 
> >>> attrd_local_callback: Sending full refresh (origin=crmd)
> >>> Jan 13 15:53:38 lsc-node02.velti.net cib: [30849]: info: 
> >>> cib_process_request: Operation complete: op cib_modify for section nodes 
> >>> (origin=local/crmd/20, version=0.18.1): ok
> >>>  (rc=0)
> >>>
> >>> I did not have the same issue when I tried this on CentOS 5.3 i386.
> >>>
> >>> Can anyone advise?
> >>>
> >>> What might the consequences be if no-quorum-policy is set to ignore?
> >>
> >> Well you'll be in trouble if you get a split-brain - but no more so
> >> than usual, since heartbeat will normally always claim it has quorum in
> >> a two-node cluster.
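> >> (For the record, setting it is a one-liner, something like:
> >>
> >>   crm configure property no-quorum-policy=ignore
> >>
> >> but then fencing is all that protects your data.)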
> >>
> >> What do the heartbeat/ccm logs say?
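> >> Something like this should pull the interesting bits out of them:
> >>
> >>   grep -iE 'ccm|quorum' /var/log/ha-log | grep -iE 'warn|error|fail'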
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
