[Linux-cluster] SAN with GFS2 on RHEL 6 beta: STONITH right after start

LET Wed, 28 Jul 2010 01:11:53 -0700

Hello

I have two nodes running RHEL 6 beta 2 and have configured corosync as shown 
below. Both nodes have access to a SAN disk. The disk is partitioned into 
/dev/sdb1 for SBD STONITH and /dev/sdb2 for data. /dev/sdb2 holds an LVM 
volume (vg01/lv00) with a GFS2 filesystem on it. For the configuration, I 
followed the Clusters from Scratch PDF from clusterlabs.


As soon as I start the two nodes, one of them is immediately fenced and shut 
down. The logs show that the fenced node is trying to mount the filesystem 
when it gets shot. I have no clue why this happens. Can anyone give me a hint 
on how to fix my cluster?


Configuration:
[root@pcmknode-1 ~]# crm configure show
node pcmknode-1
node pcmknode-2
primitive WebFS ocf:heartbeat:Filesystem \
        params device="/dev/vg01/lv00" directory="/data_1" fstype="gfs2"
primitive dlm ocf:pacemaker:controld \
        params configdir="/config" \
        op monitor interval="120s"
primitive gfs-control ocf:pacemaker:controld \
        params daemon="gfs_controld.pcmk" args="-g 0" \
        op monitor interval="120s"
primitive resSBD stonith:external/sbd \
        params sbd_device="/dev/sdb1"
clone WebFSClone WebFS
clone dlm-clone dlm \
        meta interleave="true" target-role="Started"
clone gfs-clone gfs-control \
        meta interleave="true" target-role="Started"
location cli-prefer-WebFS WebFSClone \
        rule $id="cli-prefer-rule-WebFS" inf: #uname eq pcmknode-1 and date lt "2010-07-27 21:53:10Z"
colocation WebFS-with-gfs-control inf: WebFSClone gfs-clone
colocation gfs-with-dlm inf: gfs-clone dlm-clone
order start-WebFS-after-gfs-control inf: gfs-clone WebFSClone
order start-gfs-after-dlm inf: dlm-clone gfs-clone
property $id="cib-bootstrap-options" \
        dc-version="1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="true" \
        stonith-timeout="30s" \
        no-quorum-policy="ignore"
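
The cli-prefer-WebFS constraint in the configuration above carries a
"date lt" rule, which is the form a resource migration with a lifetime
leaves behind. A minimal Python sketch for spotting such rules whose end
date has already passed is given here. The function name, the regular
expression, and the choice of reference time (the send time from this mail's
Date header, converted to UTC) are illustrative assumptions, not part of any
Pacemaker tool. Whether this constraint has anything to do with the fencing
is not established here.

```python
import re
from datetime import datetime, timezone

# Matches a crm shell rule with an id and a 'date lt "<UTC timestamp>Z"' clause.
# [^"]*? also spans line breaks, so a rule wrapped by a mail client still matches.
RULE = re.compile(
    r'rule \$id="([^"]+)"[^"]*?date lt\s+'
    r'"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z"'
)

def expired_rules(config_text, now):
    """Return (rule_id, end_time) for every 'date lt' rule that ended before now."""
    found = []
    for rule_id, end_text in RULE.findall(config_text):
        end = datetime.strptime(end_text, "%Y-%m-%d %H:%M:%S")
        end = end.replace(tzinfo=timezone.utc)
        if end < now:
            found.append((rule_id, end))
    return found

# The constraint exactly as it appears in the configuration above.
config = r'''location cli-prefer-WebFS WebFSClone \
        rule $id="cli-prefer-rule-WebFS" inf: #uname eq pcmknode-1 and date lt
"2010-07-27 21:53:10Z"
'''

# Assumed reference time: Wed, 28 Jul 2010 01:11:53 -0700 expressed in UTC.
now = datetime(2010, 7, 28, 8, 11, 53, tzinfo=timezone.utc)

for rule_id, end in expired_rules(config, now):
    print(f"{rule_id} expired at {end.isoformat()}")
```

With the reference time assumed above, the rule is reported as expired, so
it no longer expresses any preference for pcmknode-1.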



These are the logs:
pcmknode-1: /var/log/messages
~ snip ~
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: native_color: Resource 
WebFS:0 cannot run anywhere
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: custom_action: Action 
dlm:0_stop_0 on pcmknode-2 is unrunnable (offline)
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: custom_action: Marking node 
pcmknode-2 unclean
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: custom_action: Action 
gfs-control:0_stop_0 on pcmknode-2 is unrunnable (offline)
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: custom_action: Marking node 
pcmknode-2 unclean
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: stage6: Scheduling Node 
pcmknode-2 for STONITH
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: native_stop_constraints: 
dlm:0_stop_0 is implicit after pcmknode-2 is fenced
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: native_stop_constraints: 
gfs-control:0_stop_0 is implicit after pcmknode-2 is fenced
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: find_compatible_child: 
Colocating gfs-control:1 with dlm:1 on pcmknode-1
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: clone_rsc_order_lh: 
Interleaving dlm:1 with gfs-control:1
Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: te_fence_node: Executing reboot 
fencing operation (30) on pcmknode-2 (timeout=30000)
Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: te_rsc_command: Initiating 
action 22: stop WebFS:1_stop_0 on pcmknode-1 (local)
Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: do_lrm_rsc_op: Performing 
key=22:3:0:2419bb70-dce6-4a0e-b649-fae2b0f21b8d op=WebFS:1_stop_0 )
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: find_compatible_child: 
Colocating dlm:0 with gfs-control:0 on pcmknode-2
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: clone_rsc_order_lh: 
Interleaving gfs-control:0 with dlm:0
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: find_compatible_child: 
Colocating dlm:1 with gfs-control:1 on pcmknode-1
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: clone_rsc_order_lh: 
Interleaving gfs-control:1 with dlm:1
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: find_compatible_child: 
Colocating WebFS:1 with gfs-control:1 on pcmknode-1
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: clone_rsc_order_lh: 
Interleaving gfs-control:1 with WebFS:1
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Leave resource 
resSBD   (Started pcmknode-1)
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Stop resource 
dlm:0     (pcmknode-2)
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Leave resource 
dlm:1    (Started pcmknode-1)
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Stop resource 
gfs-control:0     (pcmknode-2)
Jul 28 00:46:32 pcmknode-1 cib: [2900]: info: write_cib_contents: Archived 
previous version as /var/lib/heartbeat/crm/cib-23.raw
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Leave resource 
gfs-control:1    (Started pcmknode-1)
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Leave resource 
WebFS:0  (Stopped)
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Restart 
resource WebFS:1        (Started pcmknode-1)
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: process_pe_message: 
Transition 3: WARNINGs found during PE processing. PEngine Input stored in: 
/var/lib/pengine/pe-warn-4.bz2
Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: process_pe_message: 
Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" 
to identify issues.
Jul 28 00:46:32 pcmknode-1 cib: [2900]: info: write_cib_contents: Wrote version 
0.127.0 of the CIB to disk (digest: af7f98fa70bd2ef644e8e70d6f2ceea9)
Jul 28 00:46:32 pcmknode-1 cib: [2900]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.vx6ONO (digest: 
/var/lib/heartbeat/crm/cib.kGkBp3)
Jul 28 00:46:32 pcmknode-1 Filesystem[2902]: INFO: Running stop for 
/dev/vg01/lv00 on /data_1
Jul 28 00:46:32 pcmknode-1 Filesystem[2902]: INFO: Trying to unmount /data_1
Jul 28 00:46:35 pcmknode-1 stonith-ng: [2624]: ERROR: remote_op_query_timeout: 
Query 8f1eeecf-4832-4430-b8e8-41a645675c58 for pcmknode-2 timed out
Jul 28 00:46:35 pcmknode-1 stonith-ng: [2624]: ERROR: remote_op_timeout: Action 
reboot (8f1eeecf-4832-4430-b8e8-41a645675c58) for pcmknode-2 timed out
Jul 28 00:46:35 pcmknode-1 stonith-ng: [2624]: info: remote_op_done: Notifing 
clients of 8f1eeecf-4832-4430-b8e8-41a645675c58 (reboot of pcmknode-2 from 
540c61a4-d351-40c7-aa60-efd445097180 by (null)): 0, rc=-7
Jul 28 00:46:35 pcmknode-1 stonith-ng: [2624]: info: stonith_notify_client: 
Sending st_fence-notification to client 
2629/cc57856c-5357-4343-95a9-712771f711ae
Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: log_data_element: 
tengine_stonith_callback: StonithOp <remote-op state="0" st_target="pcmknode-2" 
st_op="reboot" />
Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: tengine_stonith_callback: 
Stonith operation 2/30:3:0:2419bb70-dce6-4a0e-b649-fae2b0f21b8d: Operation 
timed out (-7)
Jul 28 00:46:35 pcmknode-1 crmd: [2629]: ERROR: tengine_stonith_callback: 
Stonith of pcmknode-2 failed (-7)... aborting transition.
Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: abort_transition_graph: 
tengine_stonith_callback:402 - Triggered transition abort (complete=0) : 
Stonith failed
Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: update_abort_priority: Abort 
priority upgraded from 0 to 1000000
Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: update_abort_priority: Abort 
action done superceeded by restart
Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: tengine_stonith_notify: Peer 
pcmknode-2 was terminated (reboot) by (null) for pcmknode-1 
(ref=8f1eeecf-4832-4430-b8e8-41a645675c58): Operation timed out
~ snip ~
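
The excerpt above contains both the fencing request (with timeout=30000) and
the stonith-ng timeout message. As a rough sketch, the gap between two such
syslog lines can be measured as follows. The helper names are hypothetical,
and the year is assumed to be 2010 because syslog timestamps do not carry
one.

```python
import re
from datetime import datetime

# Leading syslog timestamp, e.g. "Jul 28 00:46:32".
STAMP = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) ")

def stamp(line, year=2010):
    """Parse the syslog timestamp at the start of a line (year is assumed)."""
    match = STAMP.match(line)
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")

def seconds_between(first_line, second_line):
    """Seconds elapsed from first_line to second_line."""
    return (stamp(second_line) - stamp(first_line)).total_seconds()

# Two lines taken from the pcmknode-1 log above, with the mail wrapping undone.
request = ("Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: te_fence_node: "
           "Executing reboot fencing operation (30) on pcmknode-2 (timeout=30000)")
timed_out = ("Jul 28 00:46:35 pcmknode-1 stonith-ng: [2624]: ERROR: "
             "remote_op_query_timeout: Query 8f1eeecf-4832-4430-b8e8-41a645675c58 "
             "for pcmknode-2 timed out")

elapsed = seconds_between(request, timed_out)
# The timeout in the log line is given in milliseconds.
configured = int(re.search(r"timeout=(\d+)", request).group(1)) / 1000

print(f"query timed out after {elapsed:.0f}s; requested timeout was {configured:.0f}s")
```

For these two lines the query phase gives up after 3 seconds, well short of
the 30 seconds requested for the fencing operation itself.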



pcmknode-2: /var/log/messages
~ snip ~
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Received 
ringid(192.168.1.186:620) seq 91
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering 90 to 91
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering MCAST message 
with seq 91 to pending delivery queue
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Received 
ringid(192.168.1.186:620) seq 92
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering 91 to 92
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering MCAST message 
with seq 92 to pending delivery queue
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] mcasted message added to 
pending queue
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering 92 to 93
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering MCAST message 
with seq 93 to pending delivery queue
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [CPG   ] got procjoin message from 
cluster node -1147763583
Jul 28 00:46:29 pcmknode-2 cib: [2542]: debug: cib_process_xpath: cib_query: 
//nvpair[@name='terminate'] does not exist
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Received 
ringid(192.168.1.186:620) seq 93
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] releasing messages up to 
and including 92
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [CPG   ] got mcast request on 
0x1b072a0
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Received 
ringid(192.168.1.186:620) seq 94
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering 93 to 94
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering MCAST message 
with seq 94 to pending delivery queue
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] mcasted message added to 
pending queue
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] releasing messages up to 
and including 93
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering 94 to 95
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering MCAST message 
with seq 95 to pending delivery queue
Jul 28 00:46:29 pcmknode-2 kernel: : dlm: got connection from -1164540799
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Received 
ringid(192.168.1.186:620) seq 95
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] releasing messages up to 
and including 94
Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] releasing messages up to 
and including 95
Jul 28 00:46:29 pcmknode-2 kernel: GFS2: fsid=pcmknode:data1s.0: Joined 
cluster. Now mounting FS...
-- that was the last message in the log.




So, how can I fix my cluster? What exactly is the problem?


Thanks,
Benedikt

