Hi,

On Sun, Feb 10, 2013 at 2:24 PM, Viacheslav Biriukov <v.v.biriu...@gmail.com> wrote:
> Hi guys,
>
> Got a tricky issue with Corosync and Pacemaker over DHCP IP address using
> unicast. Corosync crashes periodically.
>
> Packages are from CentOS 6 repos:
> corosync-1.4.1-7.el6_3.1.x86_64
> corosynclib-1.4.1-7.el6_3.1.x86_64
> pacemaker-cluster-libs-1.1.7-6.el6.x86_64
> pacemaker-libs-1.1.7-6.el6.x86_64
> pacemaker-cli-1.1.7-6.el6.x86_64
> pacemaker-1.1.7-6.el6.x86_64
>
>
> Logs
>
> Feb 09 23:24:33 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
> Feb 10 00:24:39 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
> Feb 10 01:24:44 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
> Feb 10 02:24:48 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
> Feb 10 03:24:51 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
> Feb 10 04:24:52 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
> Feb 10 05:24:54 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
> Feb 10 06:25:00 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
> Feb 10 07:25:06 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
> Feb 10 07:56:22 corosync [TOTEM ] A processor failed, forming new
> configuration.
> Feb 10 07:56:22 corosync [TOTEM ] The network interface is down.

This ^^^ is your problem. Corosync doesn't like it, see
https://github.com/corosync/corosync/wiki/Corosync-and-ifdown-on-active-network-interface

Normally DHCP shouldn't take the interface down. Also, since changing
the network configuration in corosync means restarting it, why not go
with static IPs?

HTH,
Dan

> Feb 10 07:56:24 corosync [TOTEM ] The network interface [172.17.0.104] is
> now up.
> Feb 10 07:56:25 [5242] host1 pacemakerd: error: cfg_connection_destroy:
> Connection destroyed
> Feb 10 07:56:25 [5251] host1 crmd: error: ais_dispatch:
> Receiving message body failed: (2) Library error: Resource temporarily
> unavailable (11)
> Feb 10 07:56:25 [5246] host1 cib: error: ais_dispatch:
> Receiving message body failed: (2) Library error: Resource temporarily
> unavailable (11)
> Feb 10 07:56:25 [5249] host1 attrd: error: ais_dispatch:
> Receiving message body failed: (2) Library error: Resource temporarily
> unavailable (11)
> Feb 10 07:56:25 [5251] host1 crmd: error: ais_dispatch: AIS
> connection failed
> Feb 10 07:56:25 [5242] host1 pacemakerd: error: cpg_connection_destroy:
> Connection destroyed
> Feb 10 07:56:25 [5246] host1 cib: error: ais_dispatch: AIS
> connection failed
> Feb 10 07:56:25 [5251] host1 crmd: info: crmd_ais_destroy:
> connection closed
> Feb 10 07:56:25 [5249] host1 attrd: error: ais_dispatch: AIS
> connection failed
> Feb 10 07:56:25 [5247] host1 stonith-ng: error: ais_dispatch:
> Receiving message body failed: (2) Library error: Resource temporarily
> unavailable (11)
> Feb 10 07:56:25 [5246] host1 cib: error: cib_ais_destroy: AIS
> connection terminated
> Feb 10 07:56:25 [5249] host1 attrd: crit: attrd_ais_destroy: Lost
> connection to OpenAIS service!
> Feb 10 07:56:25 [5242] host1 pacemakerd: notice: pcmk_shutdown_worker:
> Shuting down Pacemaker
> Feb 10 07:56:25 [5247] host1 stonith-ng: error: ais_dispatch: AIS
> connection failed
> Feb 10 07:56:25 [5249] host1 attrd: notice: main: Exiting...
> Feb 10 07:56:25 [5247] host1 stonith-ng: error: stonith_peer_ais_destroy:
> AIS connection terminated
> Feb 10 07:56:25 [5242] host1 pacemakerd: notice: stop_child:
> Stopping crmd: Sent -15 to process 5251
> Feb 10 07:56:25 [5249] host1 attrd: error:
> attrd_cib_connection_destroy: Connection to the CIB terminated...
> Feb 10 07:56:25 [5251] host1 crmd: info: crm_signal_dispatch:
> Invoking handler for signal 15: Terminated
> Feb 10 07:56:25 [5251] host1 crmd: notice: crm_shutdown:
> Requesting shutdown, upper limit is 1200000ms
> Feb 10 07:56:25 [5251] host1 crmd: info: do_shutdown_req:
> Sending shutdown request to host2
> Feb 10 07:56:25 [5242] host1 pacemakerd: error: pcmk_child_exit: Child
> process stonith-ng exited (pid=5247, rc=1)
> Feb 10 07:56:25 [5242] host1 pacemakerd: warning: send_ipc_message: IPC
> Channel to 5249 is not connected
> Feb 10 07:56:25 [5242] host1 pacemakerd: warning: send_ipc_message: IPC
> Channel to 5246 is not connected
> Feb 10 07:56:25 [5242] host1 pacemakerd: warning: send_ipc_message: IPC
> Channel to 5247 is not connected
> Feb 10 07:56:25 [5242] host1 pacemakerd: error: send_cpg_message:
> Sending message via cpg FAILED: (rc=9) Bad handle
> Feb 10 07:56:25 [5242] host1 pacemakerd: error: pcmk_child_exit: Child
> process cib exited (pid=5246, rc=1)
> Feb 10 07:56:25 [5242] host1 pacemakerd: error: send_cpg_message:
> Sending message via cpg FAILED: (rc=9) Bad handle
> Feb 10 07:56:25 [5242] host1 pacemakerd: error: pcmk_child_exit: Child
> process attrd exited (pid=5249, rc=1)
> Feb 10 07:56:25 [5242] host1 pacemakerd: error: send_cpg_message:
> Sending message via cpg FAILED: (rc=9) Bad handle
> Feb 10 07:56:27 [5251] host1 crmd: error: send_ais_text:
> Sending message 68 via pcmk: FAILED (rc=2): Library error: Connection timed
> out (110)
> Feb 10 07:56:27 [5251] host1 crmd: error: do_log: FSA: Input
> I_ERROR from do_shutdown_req() received in state S_NOT_DC
> Feb 10 07:56:27 [5251] host1 crmd: notice: do_state_transition:
> State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL
> origin=do_shutdown_req ]
> Feb 10 07:56:27 [5251] host1 crmd: error: do_recover:
> Action A_RECOVER (0000000001000000) not supported
> Feb 10 07:56:27 [5251] host1 crmd: error: do_log: FSA: Input
> I_TERMINATE from do_recover() received in state S_RECOVERY
> Feb 10 07:56:27 [5251] host1 crmd: notice: do_state_transition:
> State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE
> cause=C_FSA_INTERNAL origin=do_recover ]
> Feb 10 07:56:27 [5251] host1 crmd: info: do_shutdown:
> Disconnecting STONITH...
> Feb 10 07:56:27 [5251] host1 crmd: info:
> tengine_stonith_connection_destroy: Fencing daemon disconnected
> Feb 10 07:56:27 host1 lrmd: [5248]: info: cancel_op: operation monitor[25]
> on ocf::OpenStackFloatingIP::P_SESSION_IP for client 5251, its parameters:
> CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_timeout=[20000]
> CRM_meta_interval=[5000] ip=[172.24.0.104] cancelled
> Feb 10 07:56:27 [5251] host1 crmd: error: verify_stopped:
> Resource P_SESSION_IP was active at shutdown. You may ignore this error if
> it is unmanaged.
> Feb 10 07:56:27 [5251] host1 crmd: info: do_lrm_control:
> Disconnected from the LRM
> Feb 10 07:56:27 [5251] host1 crmd: notice: terminate_ais_connection:
> Disconnecting from AIS
> Feb 10 07:56:27 [5251] host1 crmd: info: do_ha_control:
> Disconnected from OpenAIS
> Feb 10 07:56:27 [5251] host1 crmd: info: do_cib_control:
> Disconnecting CIB
> Feb 10 07:56:27 [5251] host1 crmd: error: send_ipc_message: IPC
> Channel to 5246 is not connected
> Feb 10 07:56:27 [5251] host1 crmd: error: send_ipc_message: IPC
> Channel to 5246 is not connected
> Feb 10 07:56:27 [5251] host1 crmd: error:
> cib_native_perform_op_delegate: Sending message to CIB service FAILED
> Feb 10 07:56:27 [5251] host1 crmd: info:
> crmd_cib_connection_destroy: Connection to the CIB terminated...
> Feb 10 07:56:27 [5251] host1 crmd: error: verify_stopped:
> Resource P_SESSION_IP was active at shutdown. You may ignore this error if
> it is unmanaged.
> Feb 10 07:56:27 [5251] host1 crmd: info: do_exit: Performing
> A_EXIT_0 - gracefully exiting the CRMd
> Feb 10 07:56:27 [5251] host1 crmd: error: do_exit: Could not
> recover from internal error
> Feb 10 07:56:27 [5251] host1 crmd: info: free_mem: Dropping
> I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> Feb 10 07:56:27 [5251] host1 crmd: info: crm_xml_cleanup:
> Cleaning up memory from libxml2
> Feb 10 07:56:27 [5251] host1 crmd: info: do_exit: [crmd]
> stopped (2)
> Feb 10 07:56:27 [5242] host1 pacemakerd: error: pcmk_child_exit: Child
> process crmd exited (pid=5251, rc=2)
> Feb 10 07:56:27 [5242] host1 pacemakerd: warning: send_ipc_message: IPC
> Channel to 5251 is not connected
> Feb 10 07:56:27 [5242] host1 pacemakerd: error: send_cpg_message:
> Sending message via cpg FAILED: (rc=9) Bad handle
> Feb 10 07:56:27 [5242] host1 pacemakerd: notice: stop_child:
> Stopping pengine: Sent -15 to process 5250
> Feb 10 07:56:27 [5242] host1 pacemakerd: info: pcmk_child_exit: Child
> process pengine exited (pid=5250, rc=0)
> Feb 10 07:56:27 [5242] host1 pacemakerd: error: send_cpg_message:
> Sending message via cpg FAILED: (rc=9) Bad handle
> Feb 10 07:56:27 [5242] host1 pacemakerd: notice: stop_child:
> Stopping lrmd: Sent -15 to process 5248
> Feb 10 07:56:27 host1 lrmd: [5248]: info: lrmd is shutting down
> Feb 10 07:56:27 [5242] host1 pacemakerd: info: pcmk_child_exit: Child
> process lrmd exited (pid=5248, rc=0)
> Feb 10 07:56:27 [5242] host1 pacemakerd: error: send_cpg_message:
> Sending message via cpg FAILED: (rc=9) Bad handle
> Feb 10 07:56:27 [5242] host1 pacemakerd: notice: pcmk_shutdown_worker:
> Shutdown complete
> Feb 10 07:56:27 [5242] host1 pacemakerd: info: main: Exiting
> pacemakerd
>
>
> corosync.conf:
>
> compatibility: whitetank
>
> totem {
>     version: 2
>     secauth: off
>     nodeid: 104
>     interface {
>         member {
>             memberaddr: 172.17.0.104
>         }
>         member {
>             memberaddr: 172.17.0.105
>         }
>         ringnumber: 0
>         bindnetaddr: 172.17.0.0
>         mcastport: 5426
>         ttl: 1
>     }
>     transport: udpu
> }
>
> logging {
>     fileline: off
>     to_logfile: yes
>     to_syslog: yes
>     debug: on
>     logfile: /var/log/cluster/corosync.log
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>     }
> }
> service {
>     # Load the Pacemaker Cluster Resource Manager
>     ver: 1
>     name: pacemaker
> }
>
> aisexec {
>     user: root
>     group: root
> }
>
>
>
> Thank you!
>
> --
> Viacheslav Biriukov
> BR
> http://biriukov.me
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

--
Dan Frincu
CCNA, RHCE
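
For anyone hitting the same symptom: a minimal sketch of how one might
check whether earlier outages line up with the interface going down. It
scans log lines for the TOTEM "The network interface is down." message
quoted in this thread. The function name find_interface_down and the
regular expression are assumptions made for illustration, not part of
Corosync or Pacemaker, and the pattern is only known to fit the
syslog-style lines shown above.

```python
import re

# Matches corosync TOTEM lines in the format seen in this thread, e.g.
#   Feb 10 07:56:22 corosync [TOTEM ] The network interface is down.
# The pattern is an assumption based only on those sample lines.
TOTEM_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) corosync "
    r"\[TOTEM\s*\] (?P<msg>.*)$"
)


def find_interface_down(lines):
    """Return the timestamps of TOTEM 'network interface is down' events."""
    hits = []
    for line in lines:
        match = TOTEM_LINE.match(line.strip())
        if match and "network interface is down" in match.group("msg"):
            hits.append(match.group("stamp"))
    return hits
```

It can be fed any iterable of lines, for example an open file handle on
the logfile named in the quoted corosync.conf
(/var/log/cluster/corosync.log). If the timestamps it returns recur at a
regular interval, that would be worth comparing against the DHCP lease
renewal time.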