Hi!

We've got a pretty straightforward configuration:

Corosync 1.2.1 / Pacemaker 1.1.2 on openSUSE 11.3 running DRBD (master/slave), 
ping (clone), and a resource group containing a shared IP, Tomcat and MySQL 
(with the MySQL data files residing on the DRBD device). The cluster consists 
of two virtual machines running on VMware ESXi 4.

Since we moved the cluster to another VMware ESXi host, strange things have 
been happening:

While DRBD and the ping resource come up on both nodes, the resource group 
"appl_grp" (see below) doesn't. No failures are shown in crm_mon and the 
failcount is zero.

Output of crm_mon:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
============
Last updated: Tue Apr 12 23:39:39 2011
Stack: openais
Current DC: cms-appl02 - partition with quorum
Version: 1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10
2 Nodes configured, 2 expected votes
3 Resources configured.
============

Online: [ cms-appl01 cms-appl02 ]

Master/Slave Set: ms_drbd_r0
     Masters: [ cms-appl01 ]
     Slaves: [ cms-appl02 ]
Clone Set: pingy_clone
     Started: [ cms-appl01 cms-appl02 ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Normally I'd at least see the resource group as stopped, but now it doesn't 
even show up in the crm_mon output!
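When a group vanishes from crm_mon without any recorded failure, the policy 
engine usually has a score-based reason. One way to ask it directly is to run 
ptest against a dump of the live CIB (a sketch; the temp path and grep pattern 
are just examples):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Dump the live CIB, then let the policy engine show its allocation scores
cibadmin -Q > /tmp/live-cib.xml
ptest -x /tmp/live-cib.xml -s | grep appl_grp

# Double-check the group's target-role meta attribute
crm_resource -r appl_grp --meta -g target-role
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A -INFINITY allocation score in the ptest output would explain a "silently 
never started" resource better than the logs do.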

The crm tool at least shows that the resources still exist:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
crm(live)# resource
crm(live)resource# show
Resource Group: appl_grp
     fs_r0      (ocf::heartbeat:Filesystem) Stopped
     sharedIP   (ocf::heartbeat:IPaddr2) Stopped
     tomcat_res (ocf::heartbeat:tomcat) Stopped
     database_res       (ocf::heartbeat:mysql) Stopped
Master/Slave Set: ms_drbd_r0
     Masters: [ cms-appl01 ]
     Slaves: [ cms-appl02 ]
Clone Set: pingy_clone
     Started: [ cms-appl01 cms-appl02 ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

And finally, here's our configuration:

~~~~~~~~~~~~~~output of "crm configure show"~~~~~~~~
node cms-appl01
node cms-appl02
primitive database_res ocf:heartbeat:mysql \
        params binary="/usr/bin/mysqld_safe" config="/etc/my.cnf" \
        datadir="/drbd/mysql" user="mysql" \
        log="/var/log/mysql/mysqld.log" pid="/var/run/mysql/mysqld.pid" \
        socket="/drbd/run/mysql/mysql.sock" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="10s" timeout="30s" \
        op notify interval="0" timeout="90s"
primitive drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="15s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive fs_r0 ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/drbd" fstype="ext4" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive pingy_res ocf:pacemaker:ping \
        params dampen="5s" multiplier="1000" host_list="191.224.111.1 191.224.111.78 194.25.2.129" \
        op monitor interval="60s" timeout="60s" \
        op start interval="0" timeout="60s"
primitive sharedIP ocf:heartbeat:IPaddr2 \
        params ip="191.224.111.50" cidr_netmask="255.255.255.0" nic="eth0:0"
primitive tomcat_res ocf:heartbeat:tomcat \
        params java_home="/etc/alternatives/jre" \
        params catalina_home="/usr/share/tomcat6" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="10s" timeout="30s"
group appl_grp fs_r0 sharedIP tomcat_res database_res \
        meta target-role="Started"
ms ms_drbd_r0 drbd_r0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
clone pingy_clone pingy_res
location appl_loc appl_grp 100: cms-appl01
location only-if-connected appl_grp \
        rule $id="only-if-connected-rule" -inf: not_defined pingd or pingd lte 2000
colocation appl_grp-only-on-master inf: appl_grp ms_drbd_r0:Master
order appl_grp-after-drbd inf: ms_drbd_r0:promote appl_grp:start
order mysql-after-fs inf: fs_r0 database_res
property $id="cib-bootstrap-options" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        stonith-action="poweroff" \
        default-resource-stickiness="100" \
        dc-version="1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        last-lrm-refresh="1302643565"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
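For what it's worth, the only-if-connected rule above seems worth re-checking 
after the move: with multiplier="1000" and three ping hosts, pingd ranges from 
0 to 3000, and the rule assigns -INFINITY whenever pingd is undefined or lte 
2000, that is, whenever fewer than all three hosts answer from the new ESXi's 
network. A -INFINITY location score would produce exactly these symptoms: no 
errors, failcount zero, the group simply never scheduled. To inspect the live 
attribute values (the crm_attribute form may vary between versions):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Show the per-node pingd values recorded in the status section
cibadmin -Q | grep 'name="pingd"'

# Alternatively, query one node's transient attribute directly
crm_attribute -N cms-appl01 -n pingd -G -t status
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~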

When I (re)activate the appl_grp, literally nothing happens:

crm(live)resource# start nag_grp

No new entries in /var/log/messages, no visible changes in crm_mon. It is as if 
the resource didn't exist.
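(Incidentally, the transcript above starts "nag_grp" while the group is named 
"appl_grp"; if that is not just a paste slip, crm was asked to start a resource 
that doesn't exist.) After moving the VMs it may also help to clear stale state 
and force a re-probe; a sketch using the standard crm_resource calls:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
crm resource cleanup appl_grp    # clear failcounts and stale operation history
crm_resource -r appl_grp -C      # low-level equivalent of the cleanup
crm_resource -r appl_grp -W      # "locate": where does the cluster think it runs?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~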

Any ideas? You'll find the logs below.

Cheers and good night,

Andreas

I found only one error message in /var/log/messages:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Apr 12 23:56:11 cms-appl01 cib: [3888]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.dtzl4N (digest: 
/var/lib/heartbeat/crm/cib.QPtzfE)
Apr 12 23:56:11 cms-appl01 pengine: [2662]: info: process_pe_message: 
Transition 0: PEngine Input stored in: /var/lib/pengine/pe-input-2971.bz2
Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: match_graph_event: Action 
tomcat_res_monitor_0 (13) confirmed on cms-appl02 (rc=0)
Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: match_graph_event: Action 
database_res_monitor_0 (14) confirmed on cms-appl02 (rc=0)
Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: process_lrm_event: LRM operation 
tomcat_res_monitor_0 (call=4, rc=7, cib-update=31, confirmed=true) not running
Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: match_graph_event: Action 
tomcat_res_monitor_0 (6) confirmed on cms-appl01 (rc=0)
Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: match_graph_event: Action 
fs_r0_monitor_0 (11) confirmed on cms-appl02 (rc=0)
Apr 12 23:56:11 cms-appl01 crmd: [2663]: info: match_graph_event: Action 
sharedIP_monitor_0 (12) confirmed on cms-appl02 (rc=0)
Apr 12 23:56:11 cms-appl01 Filesystem[3889]: [3917]: WARNING: Couldn't find 
device [/dev/drbd0]. Expected /dev/??? to exist
Apr 12 23:56:11 cms-appl01 mysql[3892]: [3932]: ERROR: Datadir /drbd/mysql 
doesn't exist

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are quite a lot of warnings:

Apr 12 23:54:32 cms-appl01 cib: [2651]: WARN: send_ipc_message: IPC Channel to 
2655 is not connected
Apr 12 23:54:32 cms-appl01 cib: [2651]: WARN: send_via_callback_channel: 
Delivery of reply to client 2655/d4c6501f-32cb-49a4-a800-17d5385d71cb failed
Apr 12 23:54:32 cms-appl01 cib: [2651]: WARN: do_local_notify: A-Sync reply to 
crmd failed: reply failed
Apr 12 23:55:04 cms-appl01 rchal: boot with 'CPUFREQ=no' in to avoid this 
warning.
Apr 12 23:55:15 cms-appl01 logd: [2326]: WARN: Core dumps could be lost if 
multiple dumps occur.
Apr 12 23:55:15 cms-appl01 logd: [2326]: WARN: Consider setting non-default 
value in /proc/sys/kernel/core_pattern (or equivalent) for maximum 
supportability
Apr 12 23:55:15 cms-appl01 logd: [2326]: WARN: Consider setting 
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Apr 12 23:55:18 cms-appl01 corosync[2650]:  [pcmk  ] WARN: route_ais_message: 
Sending message to local.crmd failed: ipc delivery failed (rc=-2)
Apr 12 23:55:18 cms-appl01 corosync[2650]:  [pcmk  ] WARN: route_ais_message: 
Sending message to local.crmd failed: ipc delivery failed (rc=-2)
Apr 12 23:55:18 cms-appl01 mgmtd: [2664]: WARN: Core dumps could be lost if 
multiple dumps occur.
Apr 12 23:55:18 cms-appl01 mgmtd: [2664]: WARN: Consider setting non-default 
value in /proc/sys/kernel/core_pattern (or equivalent) for maximum 
supportability
Apr 12 23:55:18 cms-appl01 mgmtd: [2664]: WARN: Consider setting 
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Apr 12 23:55:18 cms-appl01 mgmtd: [2664]: WARN: lrm_signon: can not initiate 
connection
Apr 12 23:55:18 cms-appl01 lrmd: [2660]: WARN: Core dumps could be lost if 
multiple dumps occur.
Apr 12 23:55:18 cms-appl01 lrmd: [2660]: WARN: Consider setting non-default 
value in /proc/sys/kernel/core_pattern (or equivalent) for maximum 
supportability
Apr 12 23:55:18 cms-appl01 lrmd: [2660]: WARN: Consider setting 
/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Apr 12 23:56:11 cms-appl01 crmd: [2663]: WARN: cib_client_add_notify_callback: 
Callback already present
Apr 12 23:56:11 cms-appl01 Filesystem[3889]: [3917]: WARNING: Couldn't find 
device [/dev/drbd0]. Expected /dev/??? to exist
Apr 12 23:56:11 cms-appl01 lrmd: [2660]: info: RA output: 
(sharedIP:probe:stderr) Converted dotted-quad netmask to CIDR as: 24#012eth0:0: 
warning: name may be invalid



--
CONET Solutions GmbH
Andreas Stallmann,
Theodor-Heuss-Allee 19, 53773 Hennef
Tel.: +49 2242 939-677, Fax: +49 2242 939-393
Mobil: +49 172 2455051
Internet: http://www.conet.de, E-Mail: [email protected]

------------------------
CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke 
Höfer
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
