Update: I increased the verbosity to debug.
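For reference, the logging stanza this corresponds to (reconstructed from the
blackbox output further down, so a sketch rather than a verbatim copy of my
corosync.conf) is roughly:

    logging {
            debug: on
            to_logfile: yes
            logfile: /var/log/cluster/corosync.log
            to_syslog: no
    }

With debug on, the log now shows: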
Jan 26 16:36:45 corosync [TOTEM ] The consensus timeout expired.
Jan 26 16:36:45 corosync [TOTEM ] entering GATHER state from 3.
Jan 26 16:36:53 corosync [TOTEM ] The consensus timeout expired.
Jan 26 16:36:53 corosync [TOTEM ] entering GATHER state from 3.
Jan 26 16:37:00 corosync [TOTEM ] The consensus timeout expired.
Jan 26 16:37:00 corosync [TOTEM ] entering GATHER state from 3.
Jan 26 16:37:08 corosync [TOTEM ] The consensus timeout expired.
Jan 26 16:37:08 corosync [TOTEM ] entering GATHER state from 3.
Jan 26 16:37:12 cluster1 crmd: [16266]: ERROR: crm_timer_popped: Integration
Timer (I_INTEGRATED) just popped!
Jan 26 16:37:12 cluster1 crmd: [16266]: info: crm_timer_popped: Welcomed: 1,
Integrated: 0
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: s_crmd_fsa: Processing
I_INTEGRATED: [ state=S_INTEGRATION cause=C_TIMER_POPPED
origin=crm_timer_popped ]
Jan 26 16:37:12 cluster1 crmd: [16266]: info: do_state_transition: State
transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jan 26 16:37:12 cluster1 crmd: [16266]: WARN: do_state_transition:
Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
Jan 26 16:37:12 cluster1 crmd: [16266]: WARN: do_state_transition: 1 cluster
nodes failed to respond to the join offer.
Jan 26 16:37:12 cluster1 crmd: [16266]: info: ghash_print_node: Welcome
reply not received from: cluster1 42
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_fsa_action: actions:trace:
// A_DC_TIMER_STOP
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_fsa_action: actions:trace:
// A_INTEGRATE_TIMER_STOP
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_fsa_action: actions:trace:
// A_FINALIZE_TIMER_START
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: crm_timer_start: Started
Finalization Timer (I_ELECTION:1800000ms), src=102
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_fsa_action: actions:trace:
// A_DC_JOIN_FINALIZE
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_dc_join_finalize:
Finializing join-42 for 0 clients
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: s_crmd_fsa: Processing
I_ELECTION_DC: [ state=S_FINALIZE_JOIN cause=C_FSA_INTERNAL
origin=do_dc_join_finalize ]
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_fsa_action: actions:trace:
// A_WARN
Jan 26 16:37:12 cluster1 crmd: [16266]: WARN: do_log: FSA: Input
I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
Jan 26 16:37:12 cluster1 crmd: [16266]: info: do_state_transition: State
transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_fsa_action: actions:trace:
// A_DC_TIMER_STOP
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_fsa_action: actions:trace:
// A_INTEGRATE_TIMER_START
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: crm_timer_start: Started
Integration Timer (I_INTEGRATED:180000ms), src=103
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_fsa_action: actions:trace:
// A_FINALIZE_TIMER_STOP
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_fsa_action: actions:trace:
// A_ELECTION_VOTE
Jan 26 16:37:12 corosync [TOTEM ] mcasted message added to pending queue
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_election_vote: Started
election 44
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: do_fsa_action: actions:trace:
// A_DC_JOIN_OFFER_ALL
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: initialize_join: join-43:
Initializing join data (flag=true)
Jan 26 16:37:12 cluster1 crmd: [16266]: debug: join_make_offer: join-43:
Sending offer to cluster1
Jan 26 16:37:12 cluster1 crmd: [16266]: info: do_dc_join_offer_all: join-43:
Waiting on 1 outstanding join acks
Jan 26 16:37:15 corosync [TOTEM ] The consensus timeout expired.
Jan 26 16:37:16 corosync [TOTEM ] entering GATHER state from 3.
Jan 26 16:37:23 corosync [TOTEM ] The consensus timeout expired.
Jan 26 16:37:23 corosync [TOTEM ] entering GATHER state from 3.
Jan 26 16:37:31 corosync [TOTEM ] The consensus timeout expired.
Running corosync-blackbox gives me:
# corosync-blackbox
Starting replay: head [67420] tail [0]
rec=[1] Log Message=Corosync Cluster Engine ('1.3.0'): started and ready to
provide service.
rec=[2] Log Message=Corosync built-in features: nss rdma
rec=[3] Log Message=Successfully read main configuration file
'/etc/corosync/corosync.conf'.
rec=[4] Log Message=Token Timeout (5000 ms) retransmit timeout (490 ms)
rec=[5] Log Message=token hold (382 ms) retransmits before loss (10 retrans)
rec=[6] Log Message=join (1000 ms) send_join (45 ms) consensus (2500 ms)
merge (200 ms)
rec=[7] Log Message=downcheck (1000 ms) fail to recv const (50 msgs)
rec=[8] Log Message=seqno unchanged const (30 rotations) Maximum network MTU
1402
rec=[9] Log Message=window size per rotation (50 messages) maximum messages
per rotation (25 messages)
rec=[10] Log Message=send threads (0 threads)
rec=[11] Log Message=RRP token expired timeout (490 ms)
rec=[12] Log Message=RRP token problem counter (2000 ms)
rec=[13] Log Message=RRP threshold (10 problem count)
rec=[14] Log Message=RRP mode set to active.
rec=[15] Log Message=heartbeat_failures_allowed (0)
rec=[16] Log Message=max_network_delay (50 ms)
rec=[17] Log Message=HeartBeat is Disabled. To enable set
heartbeat_failures_allowed > 0
rec=[18] Log Message=Initializing transport (UDP/IP Unicast).
rec=[19] Log Message=Initializing transmit/receive security: libtomcrypt
SOBER128/SHA1HMAC (mode 0).
rec=[20] Log Message=Initializing transport (UDP/IP Unicast).
rec=[21] Log Message=Initializing transmit/receive security: libtomcrypt
SOBER128/SHA1HMAC (mode 0).
rec=[22] Log Message=you are using ipc api v2
rec=[23] Log Message=The network interface [10.0.2.11] is now up.
rec=[24] Log Message=Created or loaded sequence id 0.10.0.2.11 for this
ring.
rec=[25] Log Message=debug: pcmk_user_lookup: Cluster user root has uid=0
gid=0
rec=[26] Log Message=info: process_ais_conf: Reading configure
rec=[27] Log Message=info: config_find_init: Local handle:
4552499517957603332 for logging
rec=[28] Log Message=info: config_find_next: Processing additional logging
options...
rec=[29] Log Message=info: get_config_opt: Found 'on' for option: debug
rec=[30] Log Message=info: get_config_opt: Found 'yes' for option:
to_logfile
rec=[31] Log Message=info: get_config_opt: Found
'/var/log/cluster/corosync.log' for option: logfile
rec=[32] Log Message=info: get_config_opt: Found 'no' for option: to_syslog
rec=[33] Log Message=info: process_ais_conf: User configured file based
logging and explicitly disabled syslog.
rec=[34] Log Message=info: config_find_init: Local handle:
8972265949260414981 for service
rec=[35] Log Message=info: config_find_next: Processing additional service
options...
rec=[36] Log Message=info: get_config_opt: Defaulting to 'pcmk' for option:
clustername
rec=[37] Log Message=info: get_config_opt: Found 'no' for option: use_logd
rec=[38] Log Message=info: get_config_opt: Found 'yes' for option: use_mgmtd
rec=[39] Log Message=info: pcmk_startup: CRM: Initialized
rec=[40] Log Message=Logging: Initialized pcmk_startup
rec=[41] Log Message=info: pcmk_startup: Maximum core file size is:
18446744073709551615
rec=[42] Log Message=debug: pcmk_user_lookup: Cluster user hacluster has
uid=101 gid=102
rec=[43] Log Message=info: pcmk_startup: Service: 9
rec=[44] Log Message=info: pcmk_startup: Local hostname: cluster1
rec=[45] Log Message=info: pcmk_update_nodeid: Local node id: 184680458
rec=[46] Log Message=info: update_member: Creating entry for node 184680458
born on 0
rec=[47] Log Message=info: update_member: 0x5fe73e0 Node 184680458 now known
as cluster1 (was: (null))
rec=[48] Log Message=info: update_member: Node cluster1 now has 1 quorum
votes (was 0)
rec=[49] Log Message=info: update_member: Node 184680458/cluster1 is now:
member
rec=[50] Log Message=info: spawn_child: Forked child 16261 for process
stonithd
rec=[51] Log Message=debug: pcmk_user_lookup: Cluster user hacluster has
uid=101 gid=102
rec=[52] Log Message=info: spawn_child: Forked child 16262 for process cib
rec=[53] Log Message=info: spawn_child: Forked child 16263 for process lrmd
rec=[54] Log Message=debug: pcmk_user_lookup: Cluster user hacluster has
uid=101 gid=102
rec=[55] Log Message=info: spawn_child: Forked child 16264 for process attrd
rec=[56] Log Message=debug: pcmk_user_lookup: Cluster user hacluster has
uid=101 gid=102
rec=[57] Log Message=info: spawn_child: Forked child 16265 for process
pengine
rec=[58] Log Message=debug: pcmk_user_lookup: Cluster user hacluster has
uid=101 gid=102
rec=[59] Log Message=info: spawn_child: Forked child 16266 for process crmd
*rec=[60] Log Message=spawn_child: FATAL: Cannot exec
/usr/lib64/heartbeat/mgmtd: (2) No such file or directory*
*** buffer overflow detected ***: corosync-fplay terminated
======= Backtrace: =========
/lib64/libc.so.6(__chk_fail+0x2f)[0x37a72e6c2f]
corosync-fplay[0x400c0b]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x37a721d974]
corosync-fplay[0x4008d9]
======= Memory map: ========
00400000-00402000 r-xp 00000000 08:05 1041788
/usr/sbin/corosync-fplay
00602000-00603000 rw-p 00002000 08:05 1041788
/usr/sbin/corosync-fplay
00603000-0060d000 rw-p 00603000 00:00 0
0af45000-0af66000 rw-p 0af45000 00:00 0
[heap]
3190400000-319040d000 r-xp 00000000 08:02 911558
/lib64/libgcc_s-4.1.2-20080825.so.1
319040d000-319060d000 ---p 0000d000 08:02 911558
/lib64/libgcc_s-4.1.2-20080825.so.1
319060d000-319060e000 rw-p 0000d000 08:02 911558
/lib64/libgcc_s-4.1.2-20080825.so.1
37a6e00000-37a6e1c000 r-xp 00000000 08:02 911525
/lib64/ld-2.5.so
37a701b000-37a701c000 r--p 0001b000 08:02 911525
/lib64/ld-2.5.so
37a701c000-37a701d000 rw-p 0001c000 08:02 911525
/lib64/ld-2.5.so
37a7200000-37a734c000 r-xp 00000000 08:02 911526
/lib64/libc-2.5.so
37a734c000-37a754c000 ---p 0014c000 08:02 911526
/lib64/libc-2.5.so
37a754c000-37a7550000 r--p 0014c000 08:02 911526
/lib64/libc-2.5.so
37a7550000-37a7551000 rw-p 00150000 08:02 911526
/lib64/libc-2.5.so
37a7551000-37a7556000 rw-p 37a7551000 00:00 0
37a7600000-37a7602000 r-xp 00000000 08:02 911527
/lib64/libdl-2.5.so
37a7602000-37a7802000 ---p 00002000 08:02 911527
/lib64/libdl-2.5.so
37a7802000-37a7803000 r--p 00002000 08:02 911527
/lib64/libdl-2.5.so
37a7803000-37a7804000 rw-p 00003000 08:02 911527
/lib64/libdl-2.5.so
37a7e00000-37a7e16000 r-xp 00000000 08:02 911531
/lib64/libpthread-2.5.so
37a7e16000-37a8015000 ---p 00016000 08:02 911531
/lib64/libpthread-2.5.so
37a8015000-37a8016000 r--p 00015000 08:02 911531
/lib64/libpthread-2.5.so
37a8016000-37a8017000 rw-p 00016000 08:02 911531
/lib64/libpthread-2.5.so
37a8017000-37a801b000 rw-p 37a8017000 00:00 0
37a8e00000-37a8e07000 r-xp 00000000 08:02 911532
/lib64/librt-2.5.so
37a8e07000-37a9007000 ---p 00007000 08:02 911532
/lib64/librt-2.5.so
37a9007000-37a9008000 r--p 00007000 08:02 911532
/lib64/librt-2.5.so
37a9008000-37a9009000 rw-p 00008000 08:02 911532
/lib64/librt-2.5.so
2b3bae55d000-2b3bae55e000 rw-p 2b3bae55d000 00:00 0
2b3bae56a000-2b3bae943000 rw-p 2b3bae56a000 00:00 0
7ffffc537000-7ffffc54c000 rw-p 7ffffffea000 00:00 0
[stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0
[vdso]
/usr/bin/corosync-blackbox: line 34: 16676 Aborted
corosync-fplay
I see the error message about mgmtd; however, I've performed the same test on
a pair of Xen VMs with the exact same packages (a clean install, not an
upgrade from openais-0.80 like the real hardware), and mgmtd doesn't exist
there either, yet the blackbox there says:
rec=[55] Log Message=info: spawn_child: Forked child 4459 for process mgmtd
# ll /usr/lib64/heartbeat/mgmtd
ls: /usr/lib64/heartbeat/mgmtd: No such file or directory
# rpm -ql pacemaker-1.0.10-1.4 | grep mgm
/usr/lib64/python2.4/site-packages/crm/idmgmt.py
/usr/lib64/python2.4/site-packages/crm/idmgmt.pyc
/usr/lib64/python2.4/site-packages/crm/idmgmt.pyo
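For completeness, the service stanza these pcmk options come from
(reconstructed from the blackbox output above, so a sketch rather than a copy
of the actual config file) looks roughly like:

    service {
            name: pacemaker
            ver: 0
            use_mgmtd: yes
            use_logd: no
    }

So the plugin is told to spawn mgmtd even though nothing installed provides
/usr/lib64/heartbeat/mgmtd.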
Anyone?
Regards,
Dan
On Wed, Jan 26, 2011 at 1:35 PM, Dan Frincu <[email protected]> wrote:
> Hi,
>
> I've got a pair of servers running on RHEL5 x86_64 with openais-0.80 (an
> older install) which I want to upgrade to corosync-1.3.0 + pacemaker-1.0.10.
> Downtime is not an issue, and corosync 1.3.0 is needed for UDPU, so I built
> it from the sources on corosync.org.
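> For reference, the UDPU part of corosync.conf looks roughly like the sketch
> below (10.0.2.11 is this node's address; the second member address is a
> placeholder, and the real config also has a second ring since RRP runs in
> active mode):
>
>     totem {
>             version: 2
>             transport: udpu
>             interface {
>                     ringnumber: 0
>                     bindnetaddr: 10.0.2.0
>                     mcastport: 5405
>                     member {
>                             memberaddr: 10.0.2.11
>                     }
>                     member {
>                             memberaddr: 10.0.2.12
>                     }
>             }
>     }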
>
> With pacemaker, we won't be using the heartbeat stack, so I built the
> pacemaker package from the clusterlabs.org src.rpm without heartbeat
> support. To be more precise, I used:
>
> rpmbuild --without heartbeat --with ais --with snmp --with esmtp -ba
> pacemaker-epel.spec
>
> Now, I've tested the RPM list below on a pair of Xen VMs, and it works just
> fine.
>
> cluster-glue-1.0.6-1.6.el5.x86_64.rpm
> cluster-glue-libs-1.0.6-1.6.el5.x86_64.rpm
> corosync-1.3.0-1.x86_64.rpm
> corosynclib-1.3.0-1.x86_64.rpm
> libesmtp-1.0.4-5.el5.x86_64.rpm
> libibverbs-1.1.2-1.el5.x86_64.rpm
> librdmacm-1.0.8-1.el5.x86_64.rpm
> libtool-ltdl-1.5.22-6.1.x86_64.rpm
> openais-1.1.4-2.x86_64.rpm
> openaislib-1.1.4-2.x86_64.rpm
> openhpi-2.10.2-1.el5.x86_64.rpm
> openib-1.3.2-0.20080728.0355.3.el5.noarch.rpm
> pacemaker-1.0.10-1.4.x86_64.rpm
> pacemaker-libs-1.0.10-1.4.x86_64.rpm
> perl-TimeDate-1.16-5.el5.noarch.rpm
> resource-agents-1.0.3-2.6.el5.x86_64.rpm
>
> However, when performing the upgrade on the servers running openais-0.80,
> I first removed the heartbeat, heartbeat-libs and PyXML rpms (conflicting
> dependencies) and then ran rpm -Uvh on the rpm list above. Installation went
> fine; I removed the existing cib.xml and signatures for a fresh start. Then
> I configured corosync, started it on both servers, and got nothing. At first
> I got an error related to pacemaker-mgmt, which was an old package installed
> with the old rpms. I removed it and tried again. Nothing. I then removed all
> cluster-related rpms, old and new, plus their dependencies (except for
> DRBD), installed the list above, and again, nothing. What "nothing" means:
> - corosync starts, never elects a DC, and never sees the other node, or even
> itself for that matter.
> - I cannot cleanly stop corosync via the init script; it goes into an
> endless phase where it just prints dots to the screen, and I have to kill
> the process to make it stop.
>
> Troubleshooting done so far:
> - tested network sockets with nc from each side to the other (rough commands
> below, after this list) and with the firewall rules down (iptables flushed);
> communication is OK
> - searched for the original RPMs list, removed all remaining RPMs, ran
> ldconfig, removed the new RPMs, and reinstalled the new RPMs
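> The socket test was roughly the following, assuming 5405 is the port
> corosync is configured with and cluster2 is the peer (some nc builds want
> "-l -p 5405" instead of "-l 5405"):
>
>     # on cluster2, listen on the corosync port (UDP)
>     nc -u -l 5405
>     # on cluster1, send a test datagram to the peer
>     echo test | nc -u cluster2 5405
>
> and the same in the opposite direction.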
>
> My guess is that there are some leftovers from the old openais-0.80
> installation interfering with the current one, seeing as the same set of
> RPMs works fine on a pair of Xen VMs with the same OS; however, I cannot put
> my finger on the culprit for the real servers' issue.
>
> Logs: http://pastebin.com/i0maZM4p
>
> Removed every remaining file after removing the RPMs, just to be extra
> paranoid about leftovers (rpm -qpl *.rpm >> file && for i in `cat file`; do
> [[ -e "$i" ]] && echo "$i" >> newfile ; done && for i in `cat newfile` ; do
> rm -rf $i ; done)
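> Spelled out over multiple lines, that cleanup amounts to roughly:
>
>     # list every file owned by the packages just removed
>     rpm -qpl *.rpm >> file
>     # keep only the paths that still exist on disk
>     for i in `cat file`; do
>         [[ -e "$i" ]] && echo "$i" >> newfile
>     done
>     # and delete them
>     for i in `cat newfile`; do
>         rm -rf $i
>     done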
>
> Installed the RPMs again (this time without openais).
>
> Same output
>
> http://pastebin.com/3iPHSXua
>
> It seems to go into some sort of loop.
>
> Jan 26 12:13:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped:
> Integration Timer (I_INTEGRATED) just popped!
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed:
> 1, Integrated: 0
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_state_transition:
> Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1
> cluster nodes failed to respond to the join offer.
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: ghash_print_node: Welcome
> reply not received from: cluster1 7
> Jan 26 12:13:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input
> I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC
> cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
> Jan 26 12:13:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all: join-8:
> Waiting on 1 outstanding join acks
> Jan 26 12:16:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped:
> Integration Timer (I_INTEGRATED) just popped!
> Jan 26 12:16:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed:
> 1, Integrated: 0
> Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_state_transition:
> Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1
> cluster nodes failed to respond to the join offer.
> Jan 26 12:16:41 cluster1 crmd: [15612]: info: ghash_print_node: Welcome
> reply not received from: cluster1 8
> Jan 26 12:16:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input
> I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
> Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC
> cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
> Jan 26 12:16:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all: join-9:
> Waiting on 1 outstanding join acks
> Jan 26 12:19:41 cluster1 crmd: [15612]: ERROR: crm_timer_popped:
> Integration Timer (I_INTEGRATED) just popped!
> Jan 26 12:19:41 cluster1 crmd: [15612]: info: crm_timer_popped: Welcomed:
> 1, Integrated: 0
> Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_state_transition:
> Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_state_transition: 1
> cluster nodes failed to respond to the join offer.
> Jan 26 12:19:41 cluster1 crmd: [15612]: info: ghash_print_node: Welcome
> reply not received from: cluster1 9
> Jan 26 12:19:41 cluster1 crmd: [15612]: WARN: do_log: FSA: Input
> I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
> Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_ELECTION_DC
> cause=C_FSA_INTERNAL origin=do_dc_join_finalize ]
> Jan 26 12:19:41 cluster1 crmd: [15612]: info: do_dc_join_offer_all:
> join-10: Waiting on 1 outstanding join acks
> Jan 26 12:20:11 cluster1 cib: [15608]: info: cib_stats: Processed 1
> operations (0.00us average, 0% utilization) in the last 10min
>
> Any suggestions?
>
> TIA.
>
> Regards,
> Dan
>
> --
> Dan Frîncu
> CCNA, RHCE
>
>
--
Dan Frîncu
CCNA, RHCE