Thank you Andrew,
Since what I am going to do is manage resources on two nodes, with each node
acting as the failover for the other, services running on both nodes, and
resources switching nodes when needed, I cannot use STONITH or suicide
methods. The nodes will be connected to each other over the LAN and also by a
parallel cable to guard against some connection losses.
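(Side note: heartbeat declares redundant links in ha.cf with the serial
directive rather than a parallel-port one; a minimal sketch, with the device
name and baud rate as assumptions:)

```
# ha.cf fragment - redundant heartbeat paths (device names are assumptions)
bcast eth0          # primary heartbeat over the LAN
serial /dev/ttyS0   # serial crossover cable as a second, non-network path
baud 19200          # line speed for the serial link
```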
It seems that for some reason the Heartbeat quorum daemon fails to start, and
if I manually change the XML configuration to have-quorum="1", after a few
seconds it reverts to have-quorum="0" :).
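(For checking that flag without hand-editing the XML: on a live cluster the
effective value can be read from the CIB; a small sketch, shown here on a
sample <cib> header line so it runs standalone:)

```shell
# On a running stack the real query would be:
#   cibadmin -Q | grep -o 'have-quorum="[01]"'
#
# The same extraction, demonstrated on a sample CIB header:
cib_header='<cib have-quorum="0" admin_epoch="0" epoch="18" num_updates="1">'
printf '%s\n' "$cib_header" | grep -o 'have-quorum="[01]"'
# prints: have-quorum="0"
```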
What I found in the logs regarding ccm is below:
Jan 17 14:12:12 lsc-node02.velti.net cib: [12449]: info: crm_cluster_connect:
Connecting to Heartbeat
Jan 17 14:12:12 lsc-node02.velti.net cib: [12449]: info: ccm_connect:
Registering with CCM...
Jan 17 14:12:12 lsc-node02.velti.net cib: [12449]: WARN: ccm_connect: CCM
Activation failed
Jan 17 14:12:12 lsc-node02.velti.net cib: [12449]: WARN: ccm_connect: CCM
Connection failed 1 times (30 max)
Jan 17 14:12:12 lsc-node02.velti.net heartbeat: [12414]: info: the send queue
length from heartbeat to client cib is set to 1024
Jan 17 14:12:12 lsc-node02.velti.net crmd: [12453]: info: do_cib_control: Could
not connect to the CIB service: connection failed
Jan 17 14:12:12 lsc-node02.velti.net crmd: [12453]: WARN: do_cib_control:
Couldn't complete CIB registration 1 times... pause and retry
Jan 17 14:12:12 lsc-node02.velti.net crmd: [12453]: info: crmd_init: Starting
crmd's mainloop
Jan 17 14:12:14 lsc-node02.velti.net crmd: [12453]: info: crm_timer_popped:
Wait Timer (I_NULL) just popped!
Jan 17 14:12:15 lsc-node02.velti.net cib: [12449]: info: ccm_connect:
Registering with CCM...
Jan 17 14:12:15 lsc-node02.velti.net cib: [12449]: WARN: ccm_connect: CCM
Activation failed
Jan 17 14:12:15 lsc-node02.velti.net cib: [12449]: WARN: ccm_connect: CCM
Connection failed 2 times (30 max)
Jan 17 14:12:15 lsc-node02.velti.net crmd: [12453]: info: do_cib_control: Could
not connect to the CIB service: connection failed
Jan 17 14:12:15 lsc-node02.velti.net crmd: [12453]: WARN: do_cib_control:
Couldn't complete CIB registration 2 times... pause and retry
Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: attrd_ha_callback:
flush message from lsc-node01.velti.net
Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: find_hash_entry:
Creating hash entry for probe_complete
Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info:
attrd_perform_update: Delaying operation probe_complete=<null>: cib not
connected
Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: attrd_ha_callback:
flush message from lsc-node01.velti.net
Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: find_hash_entry:
Creating hash entry for last-failure-VIP
Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info:
attrd_perform_update: Delaying operation last-failure-VIP=<null>: cib not
connected
Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: attrd_ha_callback:
flush message from lsc-node01.velti.net
Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info: find_hash_entry:
Creating hash entry for terminate
Jan 17 14:12:15 lsc-node02.velti.net attrd: [12452]: info:
attrd_perform_update: Delaying operation terminate=<null>: cib not connected
Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info: attrd_ha_callback:
flush message from lsc-node01.velti.net
Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info: find_hash_entry:
Creating hash entry for shutdown
Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info:
attrd_perform_update: Delaying operation shutdown=<null>: cib not connected
Jan 17 14:12:16 lsc-node02.velti.net ccm: [12448]: info:
G_main_add_SignalHandler: Added signal handler for signal 15
Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info: attrd_ha_callback:
flush message from lsc-node01.velti.net
Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info: find_hash_entry:
Creating hash entry for fail-count-VIP
Jan 17 14:12:16 lsc-node02.velti.net attrd: [12452]: info:
attrd_perform_update: Delaying operation fail-count-VIP=<null>: cib not
connected
Jan 17 14:12:17 lsc-node02.velti.net crmd: [12453]: info: crm_timer_popped:
Wait Timer (I_NULL) just popped!
Jan 17 14:12:18 lsc-node02.velti.net cib: [12449]: info: ccm_connect:
Registering with CCM...
Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: mem_handle_event: Got
an event OC_EV_MS_INVALID from ccm
Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: mem_handle_event: Got
an event OC_EV_MS_INVALID from ccm
Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: mem_handle_event:
instance=4, nodes=2, new=2, lost=0, n_idx=0, new_idx=0, old_idx=6
Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: mem_handle_event:
instance=4, nodes=2, new=2, lost=0, n_idx=0, new_idx=0, old_idx=6
Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info:
crmd_ccm_msg_callback: Quorum lost after event=INVALID (id=4)
Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: cib_ccm_msg_callback:
Processing CCM event=INVALID (id=4)
Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: ccm_event_detail:
INVALID: trans=4, nodes=2, new=2, lost=0 n_idx=0, new_idx=0, old_idx=6
Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: crm_get_peer: Node
lsc-node01.velti.net now has id: 1
Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: ccm_event_detail:
CURRENT: lsc-node01.velti.net [nodeid=1, born=2]
Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: crm_update_peer: Node
lsc-node01.velti.net: id=1 state=member (new) addr=(null) votes=-1 born=2
seen=4 proc=00000000000000000000000000000100
Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: ccm_event_detail:
CURRENT: lsc-node02.velti.net [nodeid=3, born=4]
Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: crm_update_peer_proc:
lsc-node01.velti.net.ais is now online
Jan 17 14:12:22 lsc-node02.velti.net crmd: [12453]: info: ccm_event_detail:
NEW: lsc-node01.velti.net [nodeid=1, born=2]
Jan 17 14:12:22 lsc-node02.velti.net cib: [12449]: info: crm_update_peer_proc:
lsc-node01.velti.net.crmd is now online
Jan 17 14:12:23 lsc-node02.velti.net crmd: [12453]: info: ccm_event_detail:
NEW: lsc-node02.velti.net [nodeid=3, born=4]
Jan 17 14:12:23 lsc-node02.velti.net cib: [12449]: info: crm_get_peer: Node
lsc-node02.velti.net now has id: 3
Jan 17 14:12:23 lsc-node02.velti.net crmd: [12453]: info: crm_get_peer: Node
lsc-node01.velti.net now has id: 1
Jan 17 14:12:23 lsc-node02.velti.net cib: [12449]: info: crm_update_peer: Node
lsc-node02.velti.net: id=3 state=member (new) addr=(null)
Jan 17 16:15:51 lsc-node02.velti.net cib: [16237]: info: crm_update_peer_proc:
lsc-node01.velti.net.crmd is now online
Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: info: ccm_event_detail:
NEW: lsc-node01.velti.net [nodeid=1, born=2]
Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: info: crm_get_peer: Node
lsc-node01.velti.net now has id: 1
Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug: recv msg
CCM_TYPE_MEM_LIST from lsc-node02.velti.net, status:[null ptr]
Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: info: crm_update_peer: Node
lsc-node01.velti.net: id=1 state=member (new) addr=(null) votes=-1 born=2
seen=2 proc=00000000000000000000000000000200
Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: WARN: ccm_state_joined:
received message with unknown cookie, just dropping
Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: info: crm_update_peer_proc:
lsc-node01.velti.net.ais is now online
Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug: dump current
membership 0xf7ed5008
Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: debug: post_cache_update:
Updated cache after membership event 2.
Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug:
leader=lsc-node02.velti.net
Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: debug: post_cache_update:
post_cache_update added action A_ELECTION_CHECK to the FSA
Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug: transition=2
Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: debug: do_fsa_action:
actions:trace: // A_ELECTION_CHECK
Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug:
status=CCM_STATE_JOINED
Jan 17 16:15:51 lsc-node02.velti.net crmd: [16241]: debug: do_election_check:
Ignore election check: we not in an election
Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug: has_quorum=0
Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug:
nodename=lsc-node02.velti.net bornon=1
Jan 17 16:15:51 lsc-node02.velti.net ccm: [16236]: debug:
nodename=lsc-node01.velti.net bornon=2
Jan 17 16:16:47 lsc-node02.velti.net crmd: [16241]: info: crm_timer_popped:
Election Trigger (I_DC_TIMEOUT) just popped!
Jan 17 16:16:47 lsc-node02.velti.net crmd: [16241]: debug: s_crmd_fsa:
Processing I_DC_TIMEOUT: [ state=S_PENDING cause=C_TIMER_POPPED
origin=crm_timer_popped ]
Jan 17 13:59:34 lsc-node02.velti.net pengine: [23022]: debug: unpack_config:
Cluster is symmetric - resources can run anywhere by default
Jan 17 13:59:34 lsc-node02.velti.net pengine: [23022]: debug: unpack_config:
Default stickiness: 0
Jan 17 13:59:34 lsc-node02.velti.net pengine: [23022]: notice: unpack_config:
On loss of CCM Quorum: Ignore
Jan 17 13:59:34 lsc-node02.velti.net pengine: [23022]: info: unpack_config:
Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jan 17 13:59:34 lsc-node02.velti.net pengine: [23022]: info:
determine_online_status: Node lsc-node02.velti.net is online
Jan 17 14:10:52 lsc-node02.velti.net lrmd: [23007]: debug: on_receive_cmd: the
IPC to client [pid:23010] disconnected.
Jan 17 14:10:52 lsc-node02.velti.net crmd: [23010]: info: do_lrm_control:
Disconnected from the LRM
Jan 17 14:10:52 lsc-node02.velti.net lrmd: [23007]: debug: unregister_client:
client crmd [pid:23010] is unregistered
Jan 17 14:10:52 lsc-node02.velti.net crmd: [23010]: debug: do_fsa_action:
actions:trace: // A_CCM_DISCONNECT
Jan 17 14:10:52 lsc-node02.velti.net crmd: [23010]: debug: do_fsa_action:
actions:trace: // A_HA_DISCONNECT
Jan 17 14:10:52 lsc-node02.velti.net ccm: [23005]: info: client (pid=23010)
removed from ccm
Jan 17 14:10:52 lsc-node02.velti.net heartbeat: [22969]: debug: Signing client
23010 off
Jan 17 14:10:52 lsc-node02.velti.net crmd: [23010]: info: do_ha_control:
Disconnected from Heartbeat
Jan 17 14:10:52 lsc-node02.velti.net heartbeat: [22969]: debug:
G_remove_client(pid=23010, reason='signoff' gsource=0x8836a00) {
I would attach the full logs but they are too large :)
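(Rather than attaching everything, the relevant lines can be pre-filtered
first; a small sketch, demonstrated on an inline two-line sample so it runs
standalone - on a real node the input would be the logfile set in ha.cf, e.g.
/var/log/ha-log, which is an assumption here:)

```shell
# Keep only membership/quorum-related lines from a heartbeat log.
sample='Jan 17 14:12:15 node cib: WARN: ccm_connect: CCM Activation failed
Jan 17 14:12:15 node attrd: info: flush message from lsc-node01'
printf '%s\n' "$sample" | grep -E 'ccm|quorum|mem_handle_event'
# prints only the ccm_connect line
```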
Thanks in advance
Pavlos Polianidis | Technical Support Specialist
Velti
44 Kifisias Ave.
15125 Marousi, Athens, Greece
T +30.210.637.8000
F +30.210.637.8888
M +30.695.506.0133
E [email protected]
www.velti.com
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Andrew Beekhof
Sent: Tuesday, January 18, 2011 11:02 AM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] no quorum problem
On Thu, Jan 13, 2011 at 3:17 PM, Pavlos Polianidis
<[email protected]> wrote:
> Hello,
>
>
> Currently I have installed heartbeat 3.0.2-2.el5 x86_64 and pacemaker
> 1.0.7-4.el5 x86_64 on a CentOS release 5.3 x86_64 machine using yum
> repositories.
>
> My configuration is below:
> Ha.cf
>
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> logfacility local0
> compression_threshold 2
> node lsc-node01
> node lsc-node02
> debug 1
> use_logd false
> logfacility daemon
> traditional_compression off
> compression bz2
> coredumps true
> udpport 694
> bcast eth0
> autojoin any
> keepalive 1
> warntime 10
> deadtime 35
> initdead 40
> max_rexmit_delay 10000
> crm respawn
>
> but the output of the crm_mon command is below:
>
> Last updated: Thu Jan 13 16:00:15 2011
> Stack: Heartbeat
> Current DC: lsc-node02.velti.net (a7e25657-fb85-4cf1-9d9b-5a21484e1583) -
> partition WITHOUT quorum
> Version: 1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782
> 2 Nodes configured, unknown expected votes
> 0 Resources configured.
> ============
>
> Online: [ lsc-node02.velti.net lsc-node01.velti.net ]
>
>
> Previously I experimented with the latest versions of heartbeat and
> pacemaker, and I downgraded to the current versions because I had the same
> problem and had read in the forums that some versions might have bugs.
>
> in the debug log I see the below entry:
>
> WARN: cluster_status: We do not have quorum - fencing and resource management
> disabled
>
> In the log:
>
> Jan 13 15:53:34 lsc-node02.velti.net crmd: [30853]: info:
> populate_cib_nodes_ha: Requesting the list of configured nodes
> Jan 13 15:53:37 lsc-node02.velti.net crmd: [30853]: WARN: get_uuid: Could not
> calculate UUID for lsc-node02
> Jan 13 15:53:37 lsc-node02.velti.net crmd: [30853]: WARN:
> populate_cib_nodes_ha: Node lsc-node02: no uuid found
> Jan 13 15:53:38 lsc-node02.velti.net crmd: [30853]: WARN: get_uuid: Could not
> calculate UUID for lsc-node01
> Jan 13 15:53:38 lsc-node02.velti.net crmd: [30853]: WARN:
> populate_cib_nodes_ha: Node lsc-node01: no uuid found
> Jan 13 15:53:38 lsc-node02.velti.net crmd: [30853]: info:
> do_state_transition: All 1 cluster nodes are eligible to run resources.
> Jan 13 15:53:38 lsc-node02.velti.net crmd: [30853]: info: do_dc_join_final:
> Ensuring DC, quorum and node attributes are up-to-date
> Jan 13 15:53:38 lsc-node02.velti.net crmd: [30853]: info: crm_update_quorum:
> Updating quorum status to false (call=22)
> Jan 13 15:53:38 lsc-node02.velti.net attrd: [30852]: info:
> attrd_local_callback: Sending full refresh (origin=crmd)
> Jan 13 15:53:38 lsc-node02.velti.net cib: [30849]: info: cib_process_request:
> Operation complete: op cib_modify for section nodes (origin=local/crmd/20,
> version=0.18.1): ok
> (rc=0)
>
> I did not have the same issue when tried on Centos 5.3 i386.
>
> Can anyone advise?
>
> What might the consequences be if no-quorum-policy is set to ignore?
Well you'll be in trouble if you get a split-brain - but no more so
than usual, since heartbeat will normally always claim it has quorum in
a two-node cluster.
What do the heartbeat/ccm logs say?
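(For reference, that policy is an ordinary cluster property; a sketch of
setting it with the crm shell on a live cluster, shown with its crm_attribute
equivalent:)

```
# keep managing resources even without quorum (appropriate only when
# split-brain is otherwise handled, e.g. by fencing)
crm configure property no-quorum-policy=ignore

# equivalent low-level form
crm_attribute --type crm_config --name no-quorum-policy --update ignore
```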
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems