Using pacemaker 1.1.8-7 on EL6, I got the following series of events while trying to shut down pacemaker and then corosync. The corosync shutdown (service corosync stop) ended up spinning/hanging indefinitely (~7hrs now). The events included a:

May 21 23:47:18 node1 crmd[17598]: error: do_exit: Could not recover from internal error

For completeness, I've included the logging from the whole session, from corosync startup until it all went haywire. The badness seems to start at about 23:43:18.

May 21 23:42:51 node1 corosync[17541]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
May 21 23:42:51 node1 corosync[17541]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
May 21 23:42:51 node1 corosync[17541]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transport (UDP/IP Multicast).
May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transport (UDP/IP Multicast).
May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 21 23:42:51 node1 corosync[17541]: [TOTEM ] The network interface [192.168.122.253] is now up.
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: process_ais_conf: Reading configure
May 21 23:42:51 node1 corosync[17541]: [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
May 21 23:42:51 node1 corosync[17541]: [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 5650605097994944513 for logging
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: Processing additional logging options...
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'off' for option: debug
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'no' for option: to_logfile
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 2730409743423111170 for quorum
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: No additional configuration supplied for: quorum
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: No default for option: provider
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 5880381755227111427 for service
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: Processing additional service options...
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found '1' for option: ver
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: process_ais_conf: Enabling MCP mode: Use the Pacemaker init script to complete Pacemaker startup
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: CRM: Initialized
May 21 23:42:51 node1 corosync[17541]: [pcmk ] Logging: Initialized pcmk_startup
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: Service: 9
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: Local hostname: node1
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_update_nodeid: Local node id: 4252674240
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 4252674240 born on 0
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c4640 Node 4252674240 now known as node1 (was: (null))
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Node node1 now has 1 quorum votes (was 0)
May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Node 4252674240/node1 is now: member
May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.8
May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync configuration service
May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync profile loading service
May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
May 21 23:42:51 node1 corosync[17541]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
May 21 23:42:51 node1 corosync[17541]: [TOTEM ] The network interface [10.0.0.253] is now up.
May 21 23:42:52 node1 corosync[17541]: [TOTEM ] Incrementing problem counter for seqid 1 iface 10.0.0.253 to [1 of 10]
May 21 23:42:52 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 4: memb=0, new=0, lost=0
May 21 23:42:52 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 4: memb=1, new=1, lost=0
May 21 23:42:52 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: node1 4252674240
May 21 23:42:52 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
May 21 23:42:52 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 21 23:42:52 node1 corosync[17541]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.253) r(1) ip(10.0.0.253) ; members(old:0 left:0)
May 21 23:42:52 node1 corosync[17541]: [MAIN ] Completed service synchronization, ready to provide service.
May 21 23:42:54 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 8: memb=1, new=0, lost=0
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node1 4252674240
May 21 23:42:54 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 8: memb=2, new=1, lost=0
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 2608507072 born on 8
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Node 2608507072/unknown is now: member
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 2608507072
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 2608507072
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 8 to 0 children
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c4640 Node 4252674240 ((null)) born on: 8
May 21 23:42:54 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c8b50 Node 2608507072 (node2) born on: 8
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c8b50 Node 2608507072 now known as node2 (was: (null))
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Node node2 now has 1 quorum votes (was 0)
May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 8 to 0 children
May 21 23:42:54 node1 corosync[17541]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.155) r(1) ip(10.0.0.155) ; members(old:1 left:0)
May 21 23:42:54 node1 corosync[17541]: [MAIN ] Completed service synchronization, ready to provide service.
May 21 23:42:54 node1 corosync[17541]: [TOTEM ] ring 1 active with no faults
May 21 23:42:56 node1 pacemakerd[17587]: notice: main: Starting Pacemaker 1.1.8-7.wc1.el6 (Build: 394e906): generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc lha-fencing corosync-plugin cman
May 21 23:42:56 node1 pacemakerd[17587]: notice: get_local_node_name: Defaulting to uname(2).nodename for the local classic openais (with plugin) node name
May 21 23:42:56 node1 pacemakerd[17587]: notice: update_node_processes: 0x1ec7cc0 Node 4252674240 now known as node1, was:
May 21 23:42:56 node1 stonith-ng[17594]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 21 23:42:56 node1 cib[17593]: notice: main: Using new config location: /var/lib/pacemaker/cib
May 21 23:42:56 node1 cib[17593]: warning: retrieveCib: Cluster configuration not found: /var/lib/pacemaker/cib/cib.xml
May 21 23:42:56 node1 cib[17593]: warning: readCibXmlFile: Primary configuration corrupt or unusable, trying backup...
May 21 23:42:56 node1 cib[17593]: warning: readCibXmlFile: Continuing with an empty configuration.
May 21 23:42:56 node1 attrd[17596]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12d6560 for stonith-ng/0
May 21 23:42:56 node1 crmd[17598]: notice: main: CRM Git Version: 394e906
May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12da8e0 for attrd/0
May 21 23:42:56 node1 attrd[17596]: notice: main: Starting mainloop...
May 21 23:42:56 node1 cib[17593]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12df4c0 for cib/0
May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Sending membership update 8 to cib
May 21 23:42:56 node1 cib[17593]: notice: ais_dispatch_message: Membership 8: quorum acquired
May 21 23:42:56 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now member
May 21 23:42:56 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now member
May 21 23:42:57 node1 pacemakerd[17587]: notice: update_node_processes: 0x1ec8c20 Node 2608507072 now known as node2, was:
May 21 23:42:57 node1 stonith-ng[17594]: notice: setup_cib: Watching for stonith topology changes
May 21 23:42:57 node1 crmd[17598]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 21 23:42:57 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12e4340 for crmd/0
May 21 23:42:57 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Sending membership update 8 to crmd
May 21 23:42:57 node1 crmd[17598]: notice: ais_dispatch_message: Membership 8: quorum acquired
May 21 23:42:57 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now member
May 21 23:42:57 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now member
May 21 23:42:57 node1 crmd[17598]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: node1[-42293056]
May 21 23:42:57 node1 crmd[17598]: notice: do_started: The local CRM is operational
May 21 23:42:58 node1 cib[17593]: warning: cib_process_diff: Diff 0.0.0 -> 0.0.1 from node2 not applied to 0.0.1: current "num_updates" is greater than required
May 21 23:42:58 node1 crmd[17598]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: node2[-1686460224]
May 21 23:43:01 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 12: memb=2, new=0, lost=0
May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node1 4252674240
May 21 23:43:01 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 12: memb=3, new=1, lost=0
May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 2474289344 born on 12
May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: update_member: Node 2474289344/unknown is now: member
May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 2474289344
May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 2474289344
May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 12 to 2 children
May 21 23:43:01 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 21 23:43:01 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node (null)[2474289344] - state is now member
May 21 23:43:01 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node (null)[2474289344] - state is now member
May 21 23:43:18 node1 crmd[17598]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
May 21 23:45:18 node1 crmd[17598]: error: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_ELECTION! (120000ms)
May 21 23:45:18 node1 crmd[17598]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=crm_timer_popped ]
May 21 23:45:18 node1 cib[17593]: notice: cib:diff: Diff: --- 0.0.3
May 21 23:45:18 node1 cib[17593]: notice: cib:diff: Diff: +++ 0.1.1 2d896a081763a7b2ca54f82877fbbc84
May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="0" num_updates="3" />
May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <cluster_property_set id="cib-bootstrap-options" >
May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.8-7.wc1.el6-394e906" />
May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ </cluster_property_set>
May 21 23:45:18 node1 cib[17593]: notice: log_cib_diff: cib:diff: Local-only Change: 0.2.1
May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="1" num_updates="1" />
May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="classic openais (with plugin)" />
May 21 23:45:18 node1 cib[17593]: notice: log_cib_diff: cib:diff: Local-only Change: 0.3.1
May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="2" num_updates="1" />
May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2" />
May 21 23:46:33 node1 pacemakerd[17587]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
May 21 23:46:33 node1 pacemakerd[17587]: notice: stop_child: Stopping crmd: Sent -15 to process 17598
May 21 23:46:33 node1 crmd[17598]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
May 21 23:47:18 node1 crmd[17598]: error: do_log: FSA: Input I_ERROR from feature_update_callback() received in state S_INTEGRATION
May 21 23:47:18 node1 crmd[17598]: warning: do_state_transition: State transition S_INTEGRATION -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=feature_update_callback ]
May 21 23:47:18 node1 crmd[17598]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
May 21 23:47:18 node1 crmd[17598]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
May 21 23:47:18 node1 crmd[17598]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
May 21 23:47:18 node1 crmd[17598]: notice: terminate_cs_connection: Disconnecting from Corosync
May 21 23:47:18 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x12e4340, async-conn=0x12e4340) left
May 21 23:47:18 node1 crmd[17598]: error: do_exit: Could not recover from internal error
May 21 23:47:18 node1 pacemakerd[17587]: error: pcmk_child_exit: Child process crmd exited (pid=17598, rc=2)
May 21 23:47:33 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
May 21 23:47:33 node1 pacemakerd[17587]: notice: stop_child: Stopping pengine: Sent -15 to process 17597
May 21 23:47:48 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
May 21 23:47:48 node1 pacemakerd[17587]: notice: stop_child: Stopping attrd: Sent -15 to process 17596
May 21 23:47:48 node1 attrd[17596]: notice: main: Exiting...
May 21 23:47:48 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client attrd (conn=0x12da8e0, async-conn=0x12da8e0) left
May 21 23:48:03 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
May 21 23:48:03 node1 pacemakerd[17587]: notice: stop_child: Stopping lrmd: Sent -15 to process 17595
May 21 23:48:18 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
May 21 23:48:18 node1 pacemakerd[17587]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 17594
May 21 23:48:18 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client stonith-ng (conn=0x12d6560, async-conn=0x12d6560) left
May 21 23:48:33 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
May 21 23:48:33 node1 pacemakerd[17587]: notice: stop_child: Stopping cib: Sent -15 to process 17593
May 21 23:48:38 node1 cib[17593]: notice: cib_force_exit: Forcing exit!
May 21 23:48:38 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client cib (conn=0x12df4c0, async-conn=0x12df4c0) left
May 21 23:48:38 node1 pacemakerd[17587]: error: pcmk_child_exit: Child process cib exited (pid=17593, rc=64)
May 21 23:48:53 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
May 21 23:48:53 node1 pacemakerd[17587]: notice: pcmk_shutdown_worker: Shutdown complete

At this point I tried to shut down corosync as described above, but it just hung/spun in shutdown and never exited. Any ideas what went wrong here?

b.
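For reference, the relevant service stanza in corosync.conf on these nodes looks roughly like the sketch below (only "ver: 1" is confirmed by the "Found '1' for option: ver" / "Enabling MCP mode" lines above; the rest is the usual shape of that stanza):

    service {
        # Load the legacy Pacemaker plugin (the one corosync warns is
        # unsupported in the startup ERROR above)
        name: pacemaker
        # ver: 1 = MCP mode: pacemakerd is started separately via the
        # pacemaker init script rather than by the plugin itself
        ver: 1
    }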
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org