On 01/26/2011 04:35 AM, Dan Frincu wrote:
> Hi, 
> 
> I've got a pair of servers running RHEL5 x86_64 with openais-0.80 (an
> older install) that I want to upgrade to corosync-1.3.0 +
> pacemaker-1.0.10. Downtime is not an issue, and corosync 1.3.0 is
> needed for UDPU, so I built it from the corosync.org website.
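> 
> For reference, the totem section I ended up with looks roughly like
> the sketch below (the 10.0.0.x addresses are placeholders for the real
> ones; with UDPU each node is listed explicitly instead of relying on
> multicast):
> 
> totem {
>         version: 2
>         secauth: off
>         transport: udpu
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 10.0.0.0
>                 mcastport: 5405
>                 member {
>                         memberaddr: 10.0.0.1
>                 }
>                 member {
>                         memberaddr: 10.0.0.2
>                 }
>         }
> }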
> 
> With pacemaker we won't be using the heartbeat stack, so I built the
> pacemaker package from the clusterlabs.org src.rpm without heartbeat
> support. To be more precise, I used:
> 
> rpmbuild --without heartbeat --with ais --with snmp --with esmtp -ba
> pacemaker-epel.spec 
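> 
> To confirm the build really dropped heartbeat support, the resulting
> package's dependencies can be inspected; no output here means no
> heartbeat requirement (file name as built above):
> 
> rpm -qp --requires pacemaker-1.0.10-1.4.x86_64.rpm | grep -i heartbeat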
> 
> Now, I've tested the RPM list below on a pair of Xen VMs, and it
> works just fine.
> 
> cluster-glue-1.0.6-1.6.el5.x86_64.rpm 
> cluster-glue-libs-1.0.6-1.6.el5.x86_64.rpm 
> corosync-1.3.0-1.x86_64.rpm 
> corosynclib-1.3.0-1.x86_64.rpm 
> libesmtp-1.0.4-5.el5.x86_64.rpm 
> libibverbs-1.1.2-1.el5.x86_64.rpm 
> librdmacm-1.0.8-1.el5.x86_64.rpm 
> libtool-ltdl-1.5.22-6.1.x86_64.rpm 
> openais-1.1.4-2.x86_64.rpm 
> openaislib-1.1.4-2.x86_64.rpm 
> openhpi-2.10.2-1.el5.x86_64.rpm 
> openib-1.3.2-0.20080728.0355.3.el5.noarch.rpm 
> pacemaker-1.0.10-1.4.x86_64.rpm 
> pacemaker-libs-1.0.10-1.4.x86_64.rpm 
> perl-TimeDate-1.16-5.el5.noarch.rpm 
> resource-agents-1.0.3-2.6.el5.x86_64.rpm 
> 
> However, when performing the upgrade on the servers running
> openais-0.80, I first removed the heartbeat, heartbeat-libs and PyXML
> RPMs (conflicting dependencies), then ran rpm -Uvh on the RPM list
> above. Installation went fine; I removed the existing cib.xml and
> signatures for a fresh start. I then configured corosync, started it
> on both servers, and got nothing. At first I got an error related to
> pacemaker-mgmt, an old package installed with the old RPMs; I removed
> it and tried again. Nothing. I then removed all cluster-related RPMs,
> old and new, plus dependencies (except for DRBD), reinstalled the
> list above, and again, nothing. What "nothing" means:
> - corosync starts but never elects a DC and never sees the other
> node, or itself for that matter (see the quick checks below).
> - corosync can be stopped via the init script, but it goes into an
> endless phase where it just prints dots to the screen; I have to
> kill the process to make it stop.
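> 
> The quick checks I use to confirm that membership never forms (both
> tools ship with the packages above):
> 
> corosync-cfgtool -s    # ring status as corosync sees it
> crm_mon -1             # one-shot cluster status from pacemaker's view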
> 
> Troubleshooting done so far: 
> - tested network sockets (nc from side to side) and firewall rules
> (iptables down); communication is OK
> - searched for the original RPM list, removed all remaining RPMs, ran
> ldconfig, then removed and reinstalled the new RPMs
> 
> My guess is that there are some leftovers from the old openais-0.80
> installation that mess with the current one, seeing as the same set
> of RPMs works fine on a pair of Xen VMs with the same OS; however, I
> cannot put my finger on the culprit on the real servers.
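> 
> The leftover locations I know to look at are below; paths are from
> memory for openais 0.80 and pacemaker 1.0, so adjust if your layout
> differs:
> 
> ls -ld /etc/ais 2>/dev/null       # old openais 0.80 config dir
> ls -l /var/lib/heartbeat/crm/     # pacemaker CIB and signature files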
> 
> Logs: http://pastebin.com/i0maZM4p
> 
> Removed every file left behind after removing the RPMs, just to be
> extra paranoid about leftovers: list what the packages install, keep
> whatever still exists on disk, then delete it.
> 
> rpm -qpl *.rpm > file
> while read -r i; do [[ -e "$i" ]] && echo "$i" >> newfile; done < file
> while read -r i; do rm -rf "$i"; done < newfile
> 
> Installed the RPMs again, this time without openais. Same output:
> 
> http://pastebin.com/3iPHSXua
> 
> It seems to go into some sort of loop:
> 
> Jan 26 12:13:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped:
> Integration Timer (I_INTEGRATED) just popped!
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: crm_timer_popped:
> Welcomed: 1, Integrated: 0
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_state_transition:
> Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1
> cluster nodes failed to respond to the join offer.
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: ghash_print_node:  
> Welcome reply not received from: cluster1 7
> Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input
> I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC
> cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all:
> join-8: Waiting on 1 outstanding join acks
> [ the same sequence then repeats at 12:16:41 (join-9) and again at
> 12:19:41 (join-10); only the timestamps and join counters change ]
> Jan 26 12:20:11 cluster1 cib: [15608]: info: cib_stats: Processed 1
> operations (0.00us average, 0% utilization) in the last 10min
> 
> Any suggestions?
> 
> TIA.
> 
> Regards,
> Dan
> 
> -- 
> Dan Frîncu
> CCNA, RHCE
> 
> 

Dan,

Are you using the MCP deployment model?  If not, I'd recommend that instead.
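
Roughly, the difference is in how the Pacemaker processes are launched.
With the plugin model, corosync.conf carries a service block like the
sketch below and corosync itself spawns the Pacemaker daemons; with the
MCP you set ver: 1 and start pacemakerd separately once corosync is up
(init script names may vary by distro):

service {
        name: pacemaker
        ver: 0    # plugin model; use ver: 1 for the MCP
}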

Another thing to check is that there are no older versions of the
libraries in /usr/lib64 (or /usr/lib if you're on a 32-bit system).
The libs are as follows:

/usr/lib64/libvotequorum.so*
/usr/lib64/libtotem_pg.so*
/usr/lib64/libsam.so*
/usr/lib64/libquorum.so*
/usr/lib64/libpload.so*
/usr/lib64/liblogsys.so*
/usr/lib64/libevs.so*
/usr/lib64/libcpg.so*
/usr/lib64/libcoroipcc.so*
/usr/lib64/libcoroipcs.so*
/usr/lib64/libconfdb.so*
/usr/lib64/libcfg.so*
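
A quick way to spot a stale copy is to ask rpm who owns each file; a
"not owned by any package" answer, or ownership by an old openais
build, points at the culprit (a glob that matches nothing just makes
rpm complain about a missing file, which is harmless):

for lib in /usr/lib64/lib{votequorum,totem_pg,sam,quorum,pload,logsys,evs,cpg,coroipcc,coroipcs,confdb,cfg}.so*; do
    rpm -qf "$lib"
done | sort -u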

If you could run corosync-fplay after you have started Corosync, that
would provide some helpful information.

The output of ldd /usr/sbin/corosync would also be useful.
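
Something along these lines captures both on the affected node once
corosync is running:

corosync-fplay > /tmp/fplay.out          # dump the flight recorder
ldd /usr/sbin/corosync > /tmp/ldd.out    # libraries actually resolved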


_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
