On 01/26/2011 04:35 AM, Dan Frincu wrote:
> Hi,
>
> I've got a pair of servers running on RHEL5 x86_64 with openais-0.80
> (older install) which I want to upgrade to corosync-1.3.0 +
> pacemaker-1.0.10. Downtime is not an issue and corosync 1.3.0 is needed
> for UDPU, so I built it from the corosync.org website.
>
> With pacemaker, we won't be using the heartbeat stack, so I built the
> pacemaker package from the clusterlabs.org src.rpm without heartbeat
> support. To be more precise, I used
>
> rpmbuild --without heartbeat --with ais --with snmp --with esmtp -ba pacemaker-epel.spec
>
> I've tested the rpm list below on a pair of Xen VMs, and it works just
> fine.
>
> cluster-glue-1.0.6-1.6.el5.x86_64.rpm
> cluster-glue-libs-1.0.6-1.6.el5.x86_64.rpm
> corosync-1.3.0-1.x86_64.rpm
> corosynclib-1.3.0-1.x86_64.rpm
> libesmtp-1.0.4-5.el5.x86_64.rpm
> libibverbs-1.1.2-1.el5.x86_64.rpm
> librdmacm-1.0.8-1.el5.x86_64.rpm
> libtool-ltdl-1.5.22-6.1.x86_64.rpm
> openais-1.1.4-2.x86_64.rpm
> openaislib-1.1.4-2.x86_64.rpm
> openhpi-2.10.2-1.el5.x86_64.rpm
> openib-1.3.2-0.20080728.0355.3.el5.noarch.rpm
> pacemaker-1.0.10-1.4.x86_64.rpm
> pacemaker-libs-1.0.10-1.4.x86_64.rpm
> perl-TimeDate-1.16-5.el5.noarch.rpm
> resource-agents-1.0.3-2.6.el5.x86_64.rpm
>
> However, when performing the upgrade on the servers running
> openais-0.80, I first removed the heartbeat, heartbeat-libs and PyXML
> rpms (conflicting dependencies), then ran rpm -Uvh on the rpm list
> above. Installation went fine; I removed the existing cib.xml and
> signatures for a fresh start. Then I configured corosync, started it on
> both servers, and nothing. At first I got an error related to
> pacemaker-mgmt, an old package installed with the old rpms. Removed it,
> tried again. Nothing. Removed all cluster-related rpms, old and new,
> plus dependencies (except for DRBD), then installed the list above, and
> again, nothing. What "nothing" means:
> - corosync starts, never elects a DC, never sees the other node (or
>   itself, for that matter).
> - corosync can be stopped via the init script, but it goes into an
>   endless phase where it just prints dots to the screen; I have to kill
>   the process to make it stop.
>
> Troubleshooting done so far:
> - tested network sockets (nc from side to side) and firewall rules
>   (iptables down); communication is ok
> - searched for the original RPM list, removed all remaining RPMs, ran
>   ldconfig, removed the new RPMs, installed the new RPMs
>
> My guess is that there are some leftovers from the old openais-0.80
> installation which interfere with the current installation, seeing as
> the same set of RPMs works fine on a pair of Xen VMs with the same OS;
> however, I cannot put my finger on the culprit for the real servers'
> issue.
>
> Logs: http://pastebin.com/i0maZM4p
>
> I then removed everything the RPMs had left behind, just to be extra
> paranoid about leftovers:
>
> rpm -qpl *.rpm >> file && for i in `cat file`; do [[ -e "$i" ]] && echo "$i" >> newfile ; done && for i in `cat newfile` ; do rm -rf $i ; done
>
> Installed the RPMs again (without openais).
>
> Same output: http://pastebin.com/3iPHSXua
>
> It seems to go into some sort of loop.
>
> Jan 26 12:13:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped!
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed: 1, Integrated: 0
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1 cluster nodes failed to respond to the join offer.
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: ghash_print_node: Welcome reply not received from: cluster1 7
> Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all: join-8: Waiting on 1 outstanding join acks
> Jan 26 12:16:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped!
> Jan 26 12:16:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed: 1, Integrated: 0
> Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1 cluster nodes failed to respond to the join offer.
> Jan 26 12:16:41 cluster1 crmd: [15612]: info: ghash_print_node: Welcome reply not received from: cluster1 8
> Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
> Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
> Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all: join-9: Waiting on 1 outstanding join acks
> Jan 26 12:19:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped!
> Jan 26 12:19:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed: 1, Integrated: 0
> Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1 cluster nodes failed to respond to the join offer.
> Jan 26 12:19:41 cluster1 crmd: [15612]: info: ghash_print_node: Welcome reply not received from: cluster1 9
> Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
> Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
> Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all: join-10: Waiting on 1 outstanding join acks
> Jan 26 12:20:11 cluster1 cib: [15608]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
>
> Any suggestions?
>
> TIA.
>
> Regards,
> Dan
>
> --
> Dan Frîncu
> CCNA, RHCE
Dan,

Are you using the MCP deployment model? If not, I'd recommend that
instead.

Another thing to check is that there are no older versions of the
libraries in /usr/lib64 (or /usr/lib if you're on a 32-bit system). The
libs are as follows:

/usr/lib64/libvotequorum.so*
/usr/lib64/libtotem_pg.so*
/usr/lib64/libsam.so*
/usr/lib64/libquorum.so*
/usr/lib64/libpload.so*
/usr/lib64/liblogsys.so*
/usr/lib64/libevs.so*
/usr/lib64/libcpg.so*
/usr/lib64/libcoroipc.so*
/usr/lib64/libconfdb.so*
/usr/lib64/libcfg.so*

If you could run corosync-fplay after you have started corosync, that
would provide some helpful information. The output of
ldd /usr/sbin/corosync would also be useful.
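To make that check quick, a rough sketch along these lines should do it
(the glob list just mirrors the libraries above; adjust the path for a
32-bit box). Anything reported as "not owned by any package" is a likely
leftover from the old openais-0.80 install:

# Sketch: list each corosync library and the package that owns it.
# Files "not owned by any package" are probably stale copies.
for f in /usr/lib64/libvotequorum.so* /usr/lib64/libtotem_pg.so* \
         /usr/lib64/libsam.so*        /usr/lib64/libquorum.so* \
         /usr/lib64/libpload.so*      /usr/lib64/liblogsys.so* \
         /usr/lib64/libevs.so*        /usr/lib64/libcpg.so* \
         /usr/lib64/libcoroipc*.so*   /usr/lib64/libconfdb.so* \
         /usr/lib64/libcfg.so*
do
    ls -l "$f" 2>/dev/null && rpm -qf "$f"
done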
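For collecting the diagnostics, roughly:

# Dump the corosync flight recorder (run this after corosync has started)
corosync-fplay > /tmp/corosync-fplay.out

# Show which shared libraries the corosync binary actually resolves to
ldd /usr/sbin/corosync > /tmp/corosync-ldd.out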
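Also, since you mentioned needing UDPU: as a sanity check, a udpu setup
in corosync 1.3.x needs transport: udpu plus an explicit member list in
the totem interface section, roughly like the sketch below (the network
and node addresses are placeholders for your ring0 network):

totem {
        version: 2
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
                mcastport: 5405
                member {
                        memberaddr: 192.168.1.1
                }
                member {
                        memberaddr: 192.168.1.2
                }
        }
}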
