Hi,

I've got a pair of servers running RHEL5 x86_64 with an older openais-0.80
install, which I want to upgrade to corosync-1.3.0 + pacemaker-1.0.10.
Downtime is not an issue, and corosync 1.3.0 is needed for UDPU, so I built
it from the sources available on corosync.org.
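
For reference, the UDPU side of the corosync.conf looks roughly like this on
both the VMs and the real servers (trimmed down, addresses replaced with
placeholders, and the syntax quoted from memory, so treat it as a sketch):

# /etc/corosync/corosync.conf (sketch, placeholder addresses)
totem {
        version: 2
        secauth: off
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 10.0.0.0
                mcastport: 5405
                member {
                        memberaddr: 10.0.0.1
                }
                member {
                        memberaddr: 10.0.0.2
                }
        }
}

logging {
        fileline: off
        to_syslog: yes
        syslog_facility: daemon
}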

With pacemaker, we won't be using the heartbeat stack, so I built the
pacemaker package from the clusterlabs.org src.rpm without heartbeat
support. To be more precise, I used:

rpmbuild --without heartbeat --with ais --with snmp --with esmtp -ba pacemaker-epel.spec

Now I've tested the RPM list below on a pair of Xen VMs, and it works just
fine.
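
For completeness, pacemaker is loaded as a corosync plugin through the usual
service block added to corosync.conf on both setups; mine is along these
lines (ver: 0 being the plugin mode the 1.0.x series uses):

service {
        name: pacemaker
        ver: 0
}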

cluster-glue-1.0.6-1.6.el5.x86_64.rpm
cluster-glue-libs-1.0.6-1.6.el5.x86_64.rpm
corosync-1.3.0-1.x86_64.rpm
corosynclib-1.3.0-1.x86_64.rpm
libesmtp-1.0.4-5.el5.x86_64.rpm
libibverbs-1.1.2-1.el5.x86_64.rpm
librdmacm-1.0.8-1.el5.x86_64.rpm
libtool-ltdl-1.5.22-6.1.x86_64.rpm
openais-1.1.4-2.x86_64.rpm
openaislib-1.1.4-2.x86_64.rpm
openhpi-2.10.2-1.el5.x86_64.rpm
openib-1.3.2-0.20080728.0355.3.el5.noarch.rpm
pacemaker-1.0.10-1.4.x86_64.rpm
pacemaker-libs-1.0.10-1.4.x86_64.rpm
perl-TimeDate-1.16-5.el5.noarch.rpm
resource-agents-1.0.3-2.6.el5.x86_64.rpm

However, when performing the upgrade on the servers running openais-0.80,
things went differently. First I removed the heartbeat, heartbeat-libs and
PyXML RPMs (conflicting dependencies), then ran rpm -Uvh on the RPM list
above. Installation went fine; I removed the existing cib.xml and its
signature for a fresh start. Then I configured corosync and started it on
both servers, and nothing. At first I got an error related to the old
pacemaker mgmt package, which had been installed with the old RPMs; I
removed it and tried again. Nothing. I then removed all cluster-related
RPMs, old and new, plus their dependencies (except for DRBD), reinstalled
the list above, and again: nothing. What "nothing" means:
- corosync starts, but never elects a DC and never sees the other node, or
itself for that matter (the membership checks I'm using are listed right
after this)
- corosync can be stopped via the init script, but it goes into an endless
phase where it just prints dots to the screen; I have to kill the process
to make it stop
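
For reference, these are roughly the checks behind the "never sees itself"
statement (corosync 1.3 / pacemaker 1.0 tools; the objctl key path is typed
from memory):

corosync-cfgtool -s                                # ring status as corosync sees it
corosync-objctl runtime.totem.pg.mrp.srp.members   # current membership in the object database
crm_mon -1                                         # pacemaker's one-shot view of DC, nodes and resources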

Troubleshooting done so far:
- tested the network sockets (nc from one side to the other, sketch below)
and the firewall rules (iptables down); communication is OK
- tracked down the original RPM list, removed all remaining old RPMs, ran
ldconfig, then removed and reinstalled the new RPMs
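
The socket test itself was roughly this, with 10.0.0.2 standing in for the
peer's real address and 5405 being the corosync port (the nc flags may need
a -p depending on the netcat variant):

# on node2: listen on the corosync UDP port
nc -u -l 5405
# on node1: send a test datagram to node2
echo ping | nc -u 10.0.0.2 5405
# firewall taken out of the picture on both nodes
service iptables stop
iptables -L -n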

My guess is that there are some leftovers from the old openais-0.80
installation interfering with the current one, seeing as the same set of
RPMs works fine on a pair of Xen VMs with the same OS; however, I cannot put
my finger on the culprit on the real servers.
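
In case it helps someone spot the culprit, this is roughly how I've been
hunting for leftovers (the lcrso path is where the upstream corosync spec
installs its plugins, quoted from memory):

rpm -qa | egrep -i 'openais|corosync|pacemaker|heartbeat|cluster-glue|resource-agents'
rpm -V corosync corosynclib pacemaker pacemaker-libs   # check for overwritten or stale files
ls /usr/libexec/lcrso/                                 # stale plugins from the old install would show up here
ldconfig -p | egrep 'ais|coro|crm'                     # stale library paths
find /etc /var/lib -name '*openais*' -o -name '*heartbeat*' 2>/dev/null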

Logs: http://pastebin.com/i0maZM4p

To be extra paranoid about leftovers, after removing the RPMs I also deleted
every file they had owned:

# list every file shipped in the packages, keep the ones still present on
# disk, then delete them
rpm -qpl *.rpm >> file
for i in `cat file`; do [[ -e "$i" ]] && echo "$i" >> newfile; done
for i in `cat newfile`; do rm -rf "$i"; done

Then I installed the RPMs again, this time without the openais packages.
Same output:

http://pastebin.com/3iPHSXua

It seems to go into some sort of loop:

Jan 26 12:13:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped: Integration
Timer (I_INTEGRATED) just popped!
Jan 26 12:13:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed: 1,
Integrated: 0
Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_state_transition: State
transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_state_transition:
Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1 cluster
nodes failed to respond to the join offer.
Jan 26 12:13:41 cluster1 crmd: [15612]: info: ghash_print_node:   Welcome
reply not received from: cluster1 7
Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input
I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_state_transition: State
transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all: join-8:
Waiting on 1 outstanding join acks
Jan 26 12:16:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped: Integration
Timer (I_INTEGRATED) just popped!
Jan 26 12:16:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed: 1,
Integrated: 0
Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_state_transition: State
transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_state_transition:
Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1 cluster
nodes failed to respond to the join offer.
Jan 26 12:16:41 cluster1 crmd: [15612]: info: ghash_print_node:   Welcome
reply not received from: cluster1 8
Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input
I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_state_transition: State
transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all: join-9:
Waiting on 1 outstanding join acks
Jan 26 12:19:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped: Integration
Timer (I_INTEGRATED) just popped!
Jan 26 12:19:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed: 1,
Integrated: 0
Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_state_transition: State
transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_state_transition:
Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1 cluster
nodes failed to respond to the join offer.
Jan 26 12:19:41 cluster1 crmd: [15612]: info: ghash_print_node:   Welcome
reply not received from: cluster1 9
Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input
I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_state_transition: State
transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all: join-10:
Waiting on 1 outstanding join acks
Jan 26 12:20:11 cluster1 cib: [15608]: info: cib_stats: Processed 1
operations (0.00us average, 0% utilization) in the last 10min
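
If more detail would help, I can re-run with corosync debug logging enabled
and post that as well; roughly this in the logging section of corosync.conf,
then restart corosync on both nodes:

logging {
        fileline: off
        to_syslog: yes
        debug: on
        timestamp: on
}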

Any suggestions?

TIA.

Regards,
Dan

-- 
Dan Frîncu
CCNA, RHCE