Hi,
I ran into a problem with whitetank.
#ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
inet 127.0.0.2/8 brd 127.255.255.255 scope host secondary lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP
qlen 1000
link/ether 00:1c:23:00:5a:8a brd ff:ff:ff:ff:ff:ff
inet 147.2.207.210/24 brd 147.2.207.255 scope global eth0
inet6 fe80::21c:23ff:fe00:5a8a/64 scope link
valid_lft forever preferred_lft forever
Note: there are multiple IPs on lo.
If I leave all the timeout options in the totem directive at their defaults,
everything works fine.
However, if I set "consensus" to a value greater than "downcheck", for example:
totem {
version: 2
secauth: off
threads: 0
consensus: 2500
interface {
ringnumber: 0
bindnetaddr: 147.2.207.0
mcastaddr: 226.94.1.1
mcastport: 5495
}
}
logging {
to_stderr: yes
to_file: yes
logfile: /tmp/ais
debug: on
timestamp: on
}
amf {
mode: disabled
}
and then delete the listening IP:
# ip addr del 147.2.207.210/24 brd 147.2.207.255 dev eth0
aisexec crashes with a segmentation fault:
#0 0xb7e4a200 in strcpy () from /lib/libc.so.6
#1 0xb74dbdf7 in my_cluster_node_load () at clm.c:280
#2 0xb74dc879 in clm_confchg_fn
(configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0xbf815c84,
member_list_entries=1, left_list=0xbf815084,
left_list_entries=0, joined_list=0x0, joined_list_entries=0,
ring_id=0xb7529664) at clm.c:538
#3 0x08063d71 in confchg_fn
(configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0xbf815c84,
member_list_entries=1, left_list=0xbf815084,
left_list_entries=0, joined_list=0x0, joined_list_entries=0,
ring_id=0xb7529664) at main.c:213
#4 0x0805df4e in app_confchg_fn
(configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0xbf815c84,
member_list_entries=1, left_list=0xbf815084,
left_list_entries=0, joined_list=0x0, joined_list_entries=0,
ring_id=0xb7529664) at totempg.c:327
#5 0x0805de6a in totempg_confchg_fn
(configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0xbf815c84,
member_list_entries=1,
left_list=0xbf815084, left_list_entries=0, joined_list=0x0,
joined_list_entries=0, ring_id=0xb7529664) at totempg.c:480
#6 0x0805d964 in totemmrp_confchg_fn
(configuration_type=TOTEM_CONFIGURATION_TRANSITIONAL, member_list=0xbf815c84,
member_list_entries=1,
left_list=0xbf815084, left_list_entries=0, joined_list=0x0,
joined_list_entries=0, ring_id=0xb7529664) at totemmrp.c:92
#7 0x08056b30 in memb_state_operational_enter (instance=0xb7508008) at
totemsrp.c:1635
#8 0x0805aeeb in message_handler_orf_token (instance=0xb7508008,
msg=0x8193574, msg_len=70, endian_conversion_needed=0) at totemsrp.c:3402
#9 0x0805d757 in main_deliver_fn (context=0xb7508008, msg=0x8193574,
msg_len=70) at totemsrp.c:4131
#10 0x08051ac6 in none_token_recv (rrp_instance=0x8192ae0, iface_no=0,
context=0xb7508008, msg=0x8193574, msg_len=70, token_seq=3) at totemrrp.c:506
#11 0x080533e6 in rrp_deliver_fn (context=0x8185d18, msg=0x8193574, msg_len=70)
at totemrrp.c:1308
#12 0x0804fa68 in net_deliver_fn (handle=0, fd=6, revents=1, data=0x8192f48) at
totemnet.c:695
#13 0x0804de7b in poll_run (handle=0) at aispoll.c:402
#14 0x08064db0 in main (argc=2, argv=0xbf81bb24) at main.c:623
The same problem occurs when I add an alias IP on eth0 whose network address
differs from "bindnetaddr".
So I patched clm.c to trace the problem:
--- clm.c.orig 2009-01-26 05:44:55.000000000 +0800
+++ clm.c 2009-08-31 18:28:10.000000000 +0800
@@ -268,13 +268,21 @@
unsigned int iface_count;
char **status;
const char *iface_string;
+ int my_nodeid;
+ int res;
- totempg_ifaces_get (
- totempg_my_nodeid_get (),
+ my_nodeid = totempg_my_nodeid_get ();
+ res = totempg_ifaces_get (
+ my_nodeid,
interfaces,
&status,
&iface_count);
+ if (res != 0) {
+ log_printf (LOG_LEVEL_ERROR, "Cannot get interfaces for my_nodeid: %x", my_nodeid);
+ assert (0) ;
+ }
+
iface_string = totemip_print (&interfaces[0]);
sprintf ((char *)my_cluster_node.node_address.value, "%s",
The output looked like this:
Aug 31 18:29:08.802934 [MAIN ] AIS Executive Service RELEASE 'subrev 1152
version 0.80'
Aug 31 18:29:08.803180 [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc
and contributors.
Aug 31 18:29:08.803232 [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Aug 31 18:29:08.803274 [MAIN ] AIS Executive Service: started and ready to
provide service.
Aug 31 18:29:08.803315 [print.c:0361] log setup
Aug 31 18:29:08.823250 [TOTEM] Token Timeout (1000 ms) retransmit timeout (238
ms)
Aug 31 18:29:08.823342 [TOTEM] token hold (180 ms) retransmits before loss (4
retrans)
Aug 31 18:29:08.823366 [TOTEM] join (50 ms) send_join (0 ms) consensus (2500
ms) merge (200 ms)
Aug 31 18:29:08.823386 [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs)
Aug 31 18:29:08.823404 [TOTEM] seqno unchanged const (30 rotations) Maximum
network MTU 1500
Aug 31 18:29:08.823423 [TOTEM] window size per rotation (50 messages) maximum
messages per rotation (17 messages)
Aug 31 18:29:08.823441 [TOTEM] send threads (0 threads)
Aug 31 18:29:08.823457 [TOTEM] RRP token expired timeout (238 ms)
Aug 31 18:29:08.823474 [TOTEM] RRP token problem counter (2000 ms)
Aug 31 18:29:08.823491 [TOTEM] RRP threshold (10 problem count)
Aug 31 18:29:08.823507 [TOTEM] RRP mode set to none.
Aug 31 18:29:08.823524 [TOTEM] heartbeat_failures_allowed (0)
Aug 31 18:29:08.823540 [TOTEM] max_network_delay (50 ms)
Aug 31 18:29:08.823594 [TOTEM] HeartBeat is Disabled. To enable set
heartbeat_failures_allowed > 0
Aug 31 18:29:08.824018 [TOTEM] Receive multicast socket recv buffer size
(262142 bytes).
Aug 31 18:29:08.824049 [TOTEM] Transmit multicast socket send buffer size
(262142 bytes).
Aug 31 18:29:08.824439 [TOTEM] The network interface [147.2.207.210] is now up.
Aug 31 18:29:08.824503 [TOTEM] Created or loaded sequence id
42284.147.2.207.210 for this ring.
Aug 31 18:29:08.824652 [TOTEM] entering GATHER state from 15.
Aug 31 18:29:08.826367 [SERV ] Service initialized 'openais extended virtual
synchrony service'
Aug 31 18:29:08.827577 [SERV ] Service initialized 'openais cluster membership
service B.01.01'
Aug 31 18:29:08.827903 [SERV ] Service initialized 'openais availability
management framework B.01.01'
Aug 31 18:29:08.828319 [SERV ] Service initialized 'openais checkpoint service
B.01.01'
Aug 31 18:29:08.829046 [SERV ] Service initialized 'openais event service
B.01.01'
Aug 31 18:29:08.829744 [SERV ] Service initialized 'openais distributed locking
service B.01.01'
Aug 31 18:29:08.830559 [SERV ] Service initialized 'openais message service
B.01.01'
Aug 31 18:29:08.830832 [SERV ] Service initialized 'openais configuration
service'
Aug 31 18:29:08.831267 [SERV ] Service initialized 'openais cluster closed
process group service v1.01'
Aug 31 18:29:08.831540 [SERV ] Service initialized 'openais cluster config
database access v1.01'
Aug 31 18:29:08.831574 [SYNC ] Not using a virtual synchrony filter.
Aug 31 18:29:08.831684 [TOTEM] Creating commit token because I am the rep.
Aug 31 18:29:08.831726 [TOTEM] Saving state aru 0 high seq received 0
Aug 31 18:29:08.831773 [TOTEM] Storing new sequence id for ring a530
Aug 31 18:29:08.831903 [TOTEM] entering COMMIT state.
Aug 31 18:29:08.831946 [TOTEM] entering RECOVERY state.
Aug 31 18:29:08.832026 [TOTEM] position [0] member 147.2.207.210:
Aug 31 18:29:08.832050 [TOTEM] previous ring seq 42284 rep 147.2.207.210
Aug 31 18:29:08.832069 [TOTEM] aru 0 high delivered 0 received flag 1
Aug 31 18:29:08.832087 [TOTEM] Did not need to originate any messages in
recovery.
Aug 31 18:29:08.832127 [TOTEM] Sending initial ORF token
Aug 31 18:29:08.832363 [CLM ] CLM CONFIGURATION CHANGE
Aug 31 18:29:08.832389 [CLM ] New Configuration:
Aug 31 18:29:08.832406 [CLM ] Members Left:
Aug 31 18:29:08.832423 [CLM ] Members Joined:
Aug 31 18:29:08.832489 [CLM ] CLM CONFIGURATION CHANGE
Aug 31 18:29:08.832511 [CLM ] New Configuration:
Aug 31 18:29:08.832535 [CLM ] r(0) ip(147.2.207.210)
Aug 31 18:29:08.832554 [CLM ] Members Left:
Aug 31 18:29:08.832570 [CLM ] Members Joined:
Aug 31 18:29:08.832591 [CLM ] r(0) ip(147.2.207.210)
Aug 31 18:29:08.832623 [SYNC ] This node is within the primary component and
will provide service.
Aug 31 18:29:08.832662 [TOTEM] entering OPERATIONAL state.
Aug 31 18:29:08.834725 [CLM ] got nodejoin message 147.2.207.210
Aug 31 18:29:12.890629 [TOTEM] Receive multicast socket recv buffer size
(262142 bytes).
Aug 31 18:29:12.890714 [TOTEM] Transmit multicast socket send buffer size
(262142 bytes).
Aug 31 18:29:12.891060 [TOTEM] The network interface is down.
Aug 31 18:29:12.891180 [TOTEM] entering GATHER state from 15.
aisexec: clm.c:283: my_cluster_node_load: Assertion `0' failed.
Aug 31 18:29:15.405300 [TOTEM] entering GATHER state from 0.
Aug 31 18:29:15.405381 [TOTEM] Creating commit token because I am the rep.
Aug 31 18:29:15.405410 [TOTEM] Saving state aru c high seq received c
Aug 31 18:29:15.405452 [TOTEM] Storing new sequence id for ring a534
Aug 31 18:29:15.405574 [TOTEM] entering COMMIT state.
Aug 31 18:29:15.405614 [TOTEM] entering RECOVERY state.
Aug 31 18:29:15.405684 [TOTEM] position [0] member 127.0.0.1:
Aug 31 18:29:15.405708 [TOTEM] previous ring seq 42288 rep 147.2.207.210
Aug 31 18:29:15.405726 [TOTEM] aru c high delivered c received flag 1
Aug 31 18:29:15.405744 [TOTEM] Did not need to originate any messages in
recovery.
Aug 31 18:29:15.405782 [TOTEM] Sending initial ORF token
Aug 31 18:29:15.406020 [CLM ] CLM CONFIGURATION CHANGE
Aug 31 18:29:15.406045 [CLM ] New Configuration:
Aug 31 18:29:15.406071 [CLM ] r(0) ip(127.0.0.1)
Aug 31 18:29:15.406089 [CLM ] Members Left:
Aug 31 18:29:15.406104 [CLM ] Members Joined:
Aug 31 18:29:15.406124 [CLM ] Cannot get interfaces for my_nodeid: 200007f
So boundto.nodeid became "127.0.0.2", but that node ID could not be found in
my_memb_list.
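For reference, the node ID 200007f printed in the last log line decodes to 127.0.0.2 if one assumes the ID is just the IPv4 address bytes read back as a 32-bit integer on a little-endian host. The helper names below are hypothetical (they are not openais functions); this is only a sketch of that decoding under the stated assumption:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Assumption: nodeid holds the IPv4 address in network byte order,
 * read as a 32-bit integer on a little-endian host, so the lowest
 * byte of the integer is the first octet of the address.
 */
static void nodeid_to_dotted(uint32_t nodeid, char *out, size_t out_len)
{
    /* "255.255.255.255" plus the terminating NUL needs 16 bytes. */
    assert(out != NULL && out_len >= 16);
    snprintf(out, out_len, "%u.%u.%u.%u",
             (unsigned)(nodeid & 0xff),
             (unsigned)((nodeid >> 8) & 0xff),
             (unsigned)((nodeid >> 16) & 0xff),
             (unsigned)((nodeid >> 24) & 0xff));
}

/* Returns 1 if nodeid decodes to the given dotted-quad string, else 0. */
static int nodeid_matches(uint32_t nodeid, const char *dotted)
{
    char buf[16];

    nodeid_to_dotted(nodeid, buf, sizeof(buf));
    return strcmp(buf, dotted) == 0;
}
```

Under that assumption, nodeid_matches(0x200007f, "127.0.0.2") is true, while the new configuration in the log lists only 127.0.0.1, which would explain why the lookup fails.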
Eventually I found the cause in totemip_iface_check() in totemip.c.
In that function, the "boundto" argument is always updated at the end, whether
or not an appropriate interface was found.
totemip.c:587:
totemip_copy (boundto, &ipaddr);
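The behaviour I would expect instead is that the output argument is written only when a matching interface was actually found. The sketch below illustrates that pattern with hypothetical stand-in types and a made-up function name; it is not the real totemip code, and it stores addresses in host byte order purely to keep the example short:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in, not the real totem_ip_address structure. */
struct addr4 {
    uint32_t ip;      /* IPv4 address, host byte order for simplicity */
    uint32_t netmask; /* netmask of the interface that owns the address */
};

/*
 * Look for an interface address on the same network as bindnet.
 * On success, copy it into *boundto and return 0.
 * On failure, return -1 and leave *boundto untouched.
 */
static int iface_check(const struct addr4 *ifaces, size_t n,
                       uint32_t bindnet, struct addr4 *boundto)
{
    size_t i;

    assert(ifaces != NULL || n == 0);
    assert(boundto != NULL);

    for (i = 0; i < n; i++) {
        uint32_t mask = ifaces[i].netmask;

        if ((ifaces[i].ip & mask) == (bindnet & mask)) {
            memcpy(boundto, &ifaces[i], sizeof(*boundto));
            return 0;
        }
    }
    return -1; /* no match: the caller keeps its previous binding */
}
```

With only the two loopback addresses left and bindnet set to 147.2.207.0, this returns -1 and the caller's previous boundto value stays as it was, instead of being replaced by whatever address was examined last.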
I applied the following patch:
--- openais.orig/exec/totemip.c 2009-08-31 19:30:43.000000000 +0800
+++ openais/exec/totemip.c 2009-08-31 19:03:23.000000000 +0800
@@ -497,7 +497,7 @@
h = (struct nlmsghdr *)rcvbuf;
if (h->nlmsg_type == NLMSG_DONE)
- break;
+ return -1;
if (h->nlmsg_type == NLMSG_ERROR) {
close(fd);
--- openais.org/exec/clm.c 2009-01-26 05:44:55.000000000 +0800
+++ openais/exec/clm.c 2009-08-31 20:57:18.000000000 +0800
@@ -268,13 +268,21 @@
unsigned int iface_count;
char **status;
const char *iface_string;
+ int my_nodeid;
+ int res;
- totempg_ifaces_get (
- totempg_my_nodeid_get (),
+ my_nodeid = totempg_my_nodeid_get ();
+ res = totempg_ifaces_get (
+ my_nodeid,
interfaces,
&status,
&iface_count);
+ if (res != 0) {
+ log_printf (LOG_LEVEL_DEBUG, "Cannot get interfaces for my_nodeid: %x", my_nodeid);
+ return ;
+ }
+
iface_string = totemip_print (&interfaces[0]);
sprintf ((char *)my_cluster_node.node_address.value, "%s",
It seems to resolve the problem. Are there any other considerations I have missed?
I haven't tried corosync + openais 1.0, so I don't know whether it has the same issue.
Thanks,
Yan
--
Software Engineer
China Server Team, OPS Engineering
Novell, Inc.
Making IT Work As One™