On 11/23/2010 03:53 AM, Vladislav Bogdanov wrote:
> Hi Steven, hi all.
>
> I often see this assert on one of the nodes after I stop corosync on
> another node in a newly set up 4-node cluster.
>
> #0 0x00007f51953e49a5 in raise () from /lib64/libc.so.6
> #1 0x00007f51953e6185 in abort () from /lib64/libc.so.6
> #2 0x00007f51953dd935 in __assert_fail () from /lib64/libc.so.6
> #3 0x00007f5196176406 in memb_consensus_agreed
> (instance=0x7f5196554010) at totemsrp.c:1194
> #4 0x00007f519617b2f3 in memb_join_process (instance=0x7f5196554010,
> memb_join=0x262f628) at totemsrp.c:3918
> #5 0x00007f519617b619 in message_handler_memb_join
> (instance=0x7f5196554010, msg=<value optimized out>, msg_len=<value
> optimized out>, endian_conversion_needed=<value optimized out>)
> at totemsrp.c:4161
> #6 0x00007f5196173ba7 in passive_mcast_recv (rrp_instance=0x2603030,
> iface_no=0, context=<value optimized out>, msg=<value optimized out>,
> msg_len=<value optimized out>) at totemrrp.c:720
> #7 0x00007f5196172b44 in rrp_deliver_fn (context=<value optimized out>,
> msg=0x262f628, msg_len=420) at totemrrp.c:1404
> #8 0x00007f5196171a76 in net_deliver_fn (handle=<value optimized out>,
> fd=<value optimized out>, revents=<value optimized out>, data=0x262ef80)
> at totemudp.c:1244
> #9 0x00007f519616d7f2 in poll_run (handle=4858364909567606784) at
> coropoll.c:510
> #10 0x0000000000406add in main (argc=<value optimized out>, argv=<value
> optimized out>, envp=<value optimized out>) at main.c:1680
>
> Last fplay lines are:
>
> rec=[36124] Log Message=Delivering MCAST message with seq 1366 to
> pending delivery queue
> rec=[36125] Log Message=Delivering MCAST message with seq 1367 to
> pending delivery queue
> rec=[36126] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> rec=[36127] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> rec=[36128] Log Message=Received ringid(10.5.4.52:12660) seq 1366
> rec=[36129] Log Message=Received ringid(10.5.4.52:12660) seq 1367
> rec=[36130] Log Message=releasing messages up to and including 1367
> rec=[36131] Log Message=FAILED TO RECEIVE
> rec=[36132] Log Message=entering GATHER state from 6.
> rec=[36133] Log Message=entering GATHER state from 0.
> Finishing replay: records found [33993]
>
> What could be the reason for this? A bug, the switches, memory errors?
>
The FAILED TO RECEIVE message indicates the node did not receive any
multicast packets for a long period, which usually points to a switch
problem. Given the complexity of your setup, I'm not certain why the
multicast messages are not being received. You might try the pending
udpu code when it is released, which avoids the need for multicast
entirely; a rough sketch of the expected configuration is below.
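To give you an idea, the udpu configuration should look roughly like
this with your addresses. Treat the option names (transport, member,
memberaddr) as my assumption of the pending syntax rather than the
final form, since the code has not been released yet:

totem {
        version: 2
        secauth: off
        rrp_mode: passive
        transport: udpu

        interface {
                ringnumber: 0
                bindnetaddr: 10.5.4.48
                mcastport: 5405
                member {
                        memberaddr: 10.5.4.52
                }
                member {
                        memberaddr: 10.5.4.53
                }
                member {
                        memberaddr: 10.5.4.54
                }
                member {
                        memberaddr: 10.5.4.55
                }
        }
}

udpu sends each totem message as unicast to every configured member, so
it trades some extra bandwidth for not depending on multicast behaving
correctly in the switch stack.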
It is a holiday in the US. I will look into the assert (which is a bug)
on Monday.
Regards
-steve
> Setup is:
> corosync-1.2.8
> openais-1.1.4
> pacemaker-1.1.4
>
> corosync.conf is:
> =============
> compatibility: none
>
> totem {
> version: 2
> secauth: off
> # 9192-18
> net_mtu: 9174
> window_size: 300
> max_messages: 25
> rrp_mode: passive
>
> interface {
> ringnumber: 0
> bindnetaddr: 10.5.4.48
> mcastaddr: 239.94.1.3
> mcastport: 5405
> }
> }
> logging {
> fileline: off
> to_stderr: no
> to_logfile: no
> to_syslog: yes
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
>
> amf {
> mode: disabled
> }
>
> aisexec {
> user: root
> group: root
> }
> ========
>
> Pacemaker is run with MCP:
> service {
> name: pacemaker
> ver: 1
> }
>
> The current nodes have addresses 10.5.4.52 to 10.5.4.55.
>
> I also use dlm, gfs2 and clvm as openais clients from pacemaker and they
> are the only services configured in pacemaker right now (except fencing).
>
> I verified that iptables does not block anything - I log all denied
> packets, and the logs are clean.
>
> I also use bonding in 802.3ad mode with a Cisco 3750X stack (the
> physical interfaces are connected to different switches in the stack).
> A bridge is set up on top of the bonding.
>
> What more do I need to provide to help with resolving this issue?
>
> Best,
> Vladislav
>
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais