Hi,

I've got a pair of servers running RHEL5 x86_64 with openais-0.80 (an older install) which I want to upgrade to corosync-1.3.0 + pacemaker-1.0.10. Downtime is not an issue, and corosync 1.3.0 is needed for UDPU, so I built it from the sources on the corosync.org website.
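In case the corosync configuration matters, the UDPU-relevant part of my corosync.conf looks roughly like this (a sketch only; the addresses below are placeholders, not the real ones):

totem {
        version: 2
        secauth: off
        # UDPU transport is the reason corosync 1.3.0 is needed
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
                mcastport: 5405
                member {
                        memberaddr: 192.168.1.1
                }
                member {
                        memberaddr: 192.168.1.2
                }
        }
}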
With pacemaker we won't be using the heartbeat stack, so I built the pacemaker package from the clusterlabs.org src.rpm without heartbeat support. To be more precise, I used:

rpmbuild --without heartbeat --with ais --with snmp --with esmtp -ba pacemaker-epel.spec

I've tested the RPM list below on a pair of Xen VMs and it works just fine:

cluster-glue-1.0.6-1.6.el5.x86_64.rpm
cluster-glue-libs-1.0.6-1.6.el5.x86_64.rpm
corosync-1.3.0-1.x86_64.rpm
corosynclib-1.3.0-1.x86_64.rpm
libesmtp-1.0.4-5.el5.x86_64.rpm
libibverbs-1.1.2-1.el5.x86_64.rpm
librdmacm-1.0.8-1.el5.x86_64.rpm
libtool-ltdl-1.5.22-6.1.x86_64.rpm
openais-1.1.4-2.x86_64.rpm
openaislib-1.1.4-2.x86_64.rpm
openhpi-2.10.2-1.el5.x86_64.rpm
openib-1.3.2-0.20080728.0355.3.el5.noarch.rpm
pacemaker-1.0.10-1.4.x86_64.rpm
pacemaker-libs-1.0.10-1.4.x86_64.rpm
perl-TimeDate-1.16-5.el5.noarch.rpm
resource-agents-1.0.3-2.6.el5.x86_64.rpm

However, when performing the upgrade on the servers running openais-0.80, I first removed the heartbeat, heartbeat-libs and PyXML RPMs (conflicting dependencies), then ran rpm -Uvh on the RPM list above. Installation went fine; I removed the existing cib.xml and signatures for a fresh start. Then I configured corosync, started it on both servers, and got nothing.

At first I got an error related to pacemaker-mgmt, an old package installed alongside the old RPMs. I removed it and tried again. Nothing. I then removed all cluster-related RPMs, old and new, plus their dependencies (except DRBD), and installed the list above. Again, nothing.

What "nothing" means:
- corosync starts, but never elects a DC and never sees the other node, or itself for that matter.
- when stopping corosync via the init script, it goes into an endless phase where it just prints dots to the screen, and I have to kill the process to make it stop.

Troubleshooting done so far:
- tested network sockets (nc from side to side) and firewall rules (iptables down); communication is OK.
- searched for the original RPM list, removed all remaining RPMs, ran ldconfig, then removed and reinstalled the new RPMs.

My guess is that there are some leftovers from the old openais-0.80 installation which mess with the current installation, seeing as the same set of RPMs works fine on a pair of Xen VMs with the same OS; however, I cannot put my finger on the culprit on the real servers.

Logs: http://pastebin.com/i0maZM4p

I then removed everything left behind after removing the RPMs, just to be extra paranoid about leftovers:

rpm -qpl *.rpm >> file
for i in `cat file`; do [[ -e "$i" ]] && echo "$i" >> newfile; done
for i in `cat newfile`; do rm -rf "$i"; done

and installed the RPMs again (without openais). Same output: http://pastebin.com/3iPHSXua
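For reference, the leftover hunt so far amounts to roughly the following (a sketch only; the lcrso and /etc/ais paths are my guesses as to where the old openais-0.80 packages may have dropped files, so adjust to the actual build):

# any cluster-related packages still registered with rpm?
rpm -qa | grep -Ei 'openais|corosync|pacemaker|heartbeat|cluster-glue'

# stray plugins or configs possibly left behind by the 0.80 install
ls -l /usr/libexec/lcrso /usr/lib64/lcrso 2>/dev/null
ls -l /etc/ais /etc/corosync 2>/dev/null

# anything still resolving against old libraries after ldconfig?
ldconfig -p | grep -Ei 'ais|coro'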
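And this is roughly how I've been checking whether corosync sees anything at all once started (again just a sketch; corosync-objctl output keys can differ slightly between 1.x versions):

# ring status on each node
corosync-cfgtool -s

# dump the object database and look for member entries
corosync-objctl | grep -i member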
It seems to go into some sort of loop:

Jan 26 12:13:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped!
Jan 26 12:13:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed: 1, Integrated: 0
Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1 cluster nodes failed to respond to the join offer.
Jan 26 12:13:41 cluster1 crmd: [15612]: info: ghash_print_node: Welcome reply not received from: cluster1 7
Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all: join-8: Waiting on 1 outstanding join acks
Jan 26 12:16:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped!
Jan 26 12:16:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed: 1, Integrated: 0
Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1 cluster nodes failed to respond to the join offer.
Jan 26 12:16:41 cluster1 crmd: [15612]: info: ghash_print_node: Welcome reply not received from: cluster1 8
Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all: join-9: Waiting on 1 outstanding join acks
Jan 26 12:19:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped!
Jan 26 12:19:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed: 1, Integrated: 0
Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1 cluster nodes failed to respond to the join offer.
Jan 26 12:19:41 cluster1 crmd: [15612]: info: ghash_print_node: Welcome reply not received from: cluster1 9
Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all: join-10: Waiting on 1 outstanding join acks
Jan 26 12:20:11 cluster1 cib: [15608]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min

Any suggestions? TIA.

Regards,
Dan

--
Dan Frîncu
CCNA, RHCE
