Hi, I am having an issue with Pacemaker on the cman stack where I can reliably make Pacemaker crash and core dump whenever I put the node into standby or take it back out of standby. The commands I use to trigger it are sketched below, followed by my messages log from one such unstandby ("pcs cluster unstandby `uname -n`"):
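(Roughly the following, run on the node itself; I'm assuming nothing else is modifying the CIB at the time, and the node name is just substituted via `uname -n`:)

pcs cluster standby `uname -n`     # putting the node in standby can trigger the same crash
pcs cluster unstandby `uname -n`   # this is the step the log below was captured from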
Oct 18 18:33:11 localhost crmd[19258]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 18 18:33:11 localhost cib[19253]: notice: cib:diff: Diff: --- 0.45.4
Oct 18 18:33:11 localhost cib[19253]: notice: cib:diff: Diff: +++ 0.46.1 01d8e730566358dcb36a7bc5b9ac587c
Oct 18 18:33:11 localhost cib[19253]: notice: cib:diff: -- <nvpair id="nodes-node01-standby" name="standby" value="on" />
Oct 18 18:33:11 localhost cib[19253]: notice: cib:diff: ++ <cib epoch="46" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7" update-origin="node01" update-client="crm_attribute" cib-last-written="Fri Oct 18 18:32:57 2013" have-quorum="0" dc-uuid="node01" />
Oct 18 18:33:11 localhost pengine[19257]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 18 18:33:11 localhost pengine[19257]: notice: LogActions: Start reader_vip_2#011(node01)
Oct 18 18:33:11 localhost pengine[19257]: notice: LogActions: Start writer_vip#011(node01)
Oct 18 18:33:11 localhost pengine[19257]: notice: LogActions: Start reader_vip_1#011(node01)
Oct 18 18:33:11 localhost cib[19253]: error: cib_ipc_closed: Client not found in the hashtable
Oct 18 18:33:11 localhost cib[19253]: error: qb_ipcs_connection_unref: ref:0 state:3 (19253-20281-21)
Oct 18 18:33:11 localhost pengine[19257]: notice: process_pe_message: Calculated Transition 4: /var/lib/pacemaker/pengine/pe-input-77.bz2
Oct 18 18:33:12 localhost lrmd[19255]: notice: operation_finished: reader_vip_2_start_0:20284 [ 2013/10/18_18:33:12 INFO: ip -f inet addr add 10.10.10.9/32 brd 10.10.10.9 dev eth1 ]
Oct 18 18:33:12 localhost lrmd[19255]: notice: operation_finished: reader_vip_2_start_0:20284 [ 2013/10/18_18:33:12 INFO: ip link set eth1 up ]
Oct 18 18:33:12 localhost lrmd[19255]: notice: operation_finished: reader_vip_2_start_0:20284 [ 2013/10/18_18:33:12 INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-10.10.10.9 eth1 10.10.10.9 auto not_used not_used ]
Oct 18 18:33:13 localhost abrtd: Directory 'ccpp-2013-10-18-18:33:11-19253' creation detected
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished: reader_vip_1_start_0:20286 [ 2013/10/18_18:33:12 INFO: ip -f inet addr add 10.10.10.8/32 brd 10.10.10.8 dev eth1 ]
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished: reader_vip_1_start_0:20286 [ 2013/10/18_18:33:12 INFO: ip link set eth1 up ]
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished: reader_vip_1_start_0:20286 [ 2013/10/18_18:33:12 INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-10.10.10.8 eth1 10.10.10.8 auto not_used not_used ]
Oct 18 18:33:13 localhost abrt[20283]: Saved core dump of pid 19253 (/usr/libexec/pacemaker/cib) to /var/spool/abrt/ccpp-2013-10-18-18:33:11-19253 (10153984 bytes)
Oct 18 18:33:13 localhost rsyslogd-2177: imuxsock lost 75 messages from pid 18983 due to rate-limiting
Oct 18 18:33:13 localhost pacemakerd[19247]: notice: pcmk_child_exit: Child process cib terminated with signal 6 (pid=19253, core=128)
Oct 18 18:33:13 localhost pacemakerd[19247]: notice: pcmk_process_exit: Respawning failed child process: cib
Oct 18 18:33:13 localhost cib[20433]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished: writer_vip_start_0:20285 [ 2013/10/18_18:33:12 INFO: ip -f inet addr add 10.10.10.10/32 brd 10.10.10.10 dev eth1 ]
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished: writer_vip_start_0:20285 [ 2013/10/18_18:33:12 INFO: ip link set eth1 up ]
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished: writer_vip_start_0:20285 [ 2013/10/18_18:33:13 INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-10.10.10.10 eth1 10.10.10.10 auto not_used not_used ]
Oct 18 18:33:13 localhost cib[20433]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Oct 18 18:33:13 localhost crmd[19258]: notice: crm_ipc_send: Connection to cib_shm closed
Oct 18 18:33:13 localhost crmd[19258]: error: cib_native_perform_op_delegate: Couldn't perform cib_modify operation (timeout=120s): -107: Transport endpoint is not connected (107)
Oct 18 18:33:13 localhost crmd[19258]: error: cib_native_perform_op_delegate: CIB disconnected
Oct 18 18:33:13 localhost crmd[19258]: warning: cib_rsc_callback: Resource update -70 failed: (rc=-70) Communication error on send
Oct 18 18:33:13 localhost crmd[19258]: notice: process_lrm_event: LRM operation reader_vip_2_start_0 (call=60, rc=0, cib-update=-70, confirmed=true) ok
Oct 18 18:33:13 localhost crmd[19258]: warning: cib_rsc_callback: Resource update -107 failed: (rc=-107) Transport endpoint is not connected
Oct 18 18:33:13 localhost crmd[19258]: notice: process_lrm_event: LRM operation reader_vip_1_start_0 (call=64, rc=0, cib-update=-107, confirmed=true) ok
Oct 18 18:33:13 localhost crmd[19258]: warning: cib_rsc_callback: Resource update -107 failed: (rc=-107) Transport endpoint is not connected
Oct 18 18:33:13 localhost crmd[19258]: notice: process_lrm_event: LRM operation writer_vip_start_0 (call=62, rc=0, cib-update=-107, confirmed=true) ok
Oct 18 18:33:13 localhost crmd[19258]: error: crm_ipc_read: Connection to cib_shm failed
Oct 18 18:33:13 localhost crmd[19258]: error: mainloop_gio_callback: Connection to cib_shm[0x26c7080] closed (I/O condition=17)
Oct 18 18:33:13 localhost crmd[19258]: error: crmd_cib_connection_destroy: Connection to the CIB terminated...
Oct 18 18:33:13 localhost crmd[19258]: error: do_log: FSA: Input I_ERROR from crmd_cib_connection_destroy() received in state S_TRANSITION_ENGINE
Oct 18 18:33:13 localhost crmd[19258]: warning: do_state_transition: State transition S_TRANSITION_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
Oct 18 18:33:13 localhost crmd[19258]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
Oct 18 18:33:13 localhost crmd[19258]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Oct 18 18:33:13 localhost crmd[19258]: warning: destroy_action: Cancelling timer for action 4 (src=71)
Oct 18 18:33:13 localhost crmd[19258]: warning: destroy_action: Cancelling timer for action 6 (src=72)
Oct 18 18:33:13 localhost crmd[19258]: warning: destroy_action: Cancelling timer for action 8 (src=73)
Oct 18 18:33:13 localhost crmd[19258]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Oct 18 18:33:13 localhost crmd[19258]: error: verify_stopped: Resource reader_vip_1 was active at shutdown. You may ignore this error if it is unmanaged.
Oct 18 18:33:13 localhost crmd[19258]: error: verify_stopped: Resource reader_vip_2 was active at shutdown. You may ignore this error if it is unmanaged.
Oct 18 18:33:13 localhost crmd[19258]: error: verify_stopped: Resource writer_vip was active at shutdown. You may ignore this error if it is unmanaged.
Oct 18 18:33:13 localhost crmd[19258]: notice: terminate_cs_connection: Disconnecting from Corosync
Oct 18 18:33:13 localhost crmd[19258]: error: do_exit: Could not recover from internal error
Oct 18 18:33:13 localhost pacemakerd[19247]: error: qb_ipcs_connection_unref: ref:0 state:3 (19247-19258-19)
Oct 18 18:33:14 localhost abrtd: Directory 'ccpp-2013-10-18-18:33:14-19247' creation detected
Oct 18 18:33:14 localhost abrt[20454]: Saved core dump of pid 19247 (/usr/sbin/pacemakerd) to /var/spool/abrt/ccpp-2013-10-18-18:33:14-19247 (9646080 bytes)
Oct 18 18:33:24 localhost abrtd: Sending an email...
Oct 18 18:33:24 localhost abrtd: Email was sent to: root@localhost
Oct 18 18:33:25 localhost abrtd: New problem directory /var/spool/abrt/ccpp-2013-10-18-18:33:11-19253, processing

I get "[abrt] full crash report" emails as well. Some basic info from the emails:

Subject: [abrt] full crash report
User-Agent: Heirloom mailx 12.4 7/29/08
Content-Type: text/plain; charset=us-ascii
Status: RO

abrt_version: 2.0.8
cgroup:
cmdline: /usr/libexec/pacemaker/cib
executable: /usr/libexec/pacemaker/cib
kernel: 2.6.32-358.18.1.el6.x86_64
pid: 19253
pwd: /var/lib/pacemaker/cores
time: Fri 18 Oct 2013 06:33:11 PM PDT
uid: 495
username: hacluster
sosreport.tar.xz: Binary file, 401160 bytes

and

Subject: [abrt] full crash report
User-Agent: Heirloom mailx 12.4 7/29/08
Content-Type: text/plain; charset=us-ascii
Status: R

abrt_version: 2.0.8
cgroup:
cmdline: pacemakerd
executable: /usr/sbin/pacemakerd
kernel: 2.6.32-358.18.1.el6.x86_64
pid: 19247
pwd: /var/lib/pacemaker/cores/root
time: Fri 18 Oct 2013 06:33:14 PM PDT
uid: 0
username: root
sosreport.tar.xz: Binary file, 401964 bytes

This is CentOS 6.4 running the following versions:

corosync-1.4.1-15.el6_4.1.x86_64
pacemaker-libs-1.1.8-7.el6.x86_64
corosynclib-1.4.1-15.el6_4.1.x86_64
pacemaker-cli-1.1.8-7.el6.x86_64
pacemaker-cluster-libs-1.1.8-7.el6.x86_64
pacemaker-1.1.8-7.el6.x86_64

Here is my corosync.conf file:

compatibility: whitetank

totem {
    version: 2
    rrp_mode: none
    netmtu: 1500
    secuath: on
    threads: 0
    transport: udp
    token: 1000
    token_retransmit: 238
    hold: 100
    join: 50
    consensus: 1200
    merge: 200
    downcheck: 1000
    fail_recv_const: 2500
    seqno_unchanged_const: 30
    heartbeat_failures_allowed: 0
    max_network_delay: 50
    window_size: 50
    max_messages: 17
    miss_count_const: 5
    rrp_problem_count_timeout: 2000
    rrp_problem_count_threshold: 10
    rrp_problem_count_mcast_threshold:
    rrp_token_expired_timeout: 47
    rrp_autorecovery_check_timeout: 1000

    interface: {
        ringnumber: 0
        bindnetaddr: 10.10.10.0
        broadcast: no
        mcastaddr: 239.0.0.111
        mcastport: 4000
        ttl: 1
    }
}

logging {
    timestamp: off
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/corosync.log
    logfile_priority: info
    syslog_facility: daemon
    syslog_priority: info
    debug: off
}

amf {
    mode: disabled
}

and my cluster.conf:

<?xml version="1.0"?>
<cluster config_version="1" name="nodes">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="node01" nodeid="1"/>
    <clusternode name="node02" nodeid="2"/>
  </clusternodes>
</cluster>

If I'm doing something obviously wrong, please let me know. If you would like the full contents of the coredump and the abrt emails, I can also provide those.

Thanks for your time,
Justin
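P.S. If a backtrace is more useful than the raw core, I believe I can pull one out of the abrt problem directory with something along these lines (assuming abrt's usual "coredump" file name inside that directory, and that the matching debuginfo packages are installed):

gdb -batch -ex 'thread apply all bt full' \
    /usr/libexec/pacemaker/cib \
    /var/spool/abrt/ccpp-2013-10-18-18:33:11-19253/coredump > cib-backtrace.txt

I can do the same for the pacemakerd core if that helps.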