Bad news...
2010/6/30 Andrew Beekhof <[email protected]>:
> On Wed, Jun 30, 2010 at 12:06 PM, Keisuke MORI
> <[email protected]> wrote:
>> 2010/6/29 Andrew Beekhof <[email protected]>:
>>> On Mon, Jun 28, 2010 at 2:20 PM, Keisuke MORI <[email protected]>
>>> wrote:
>>>> I've upgraded to pacemaker-1.0.9.1 / corosync-1.2.5 from clusterlabs on
>>>> CentOS 5.5 using yum, but it still sometimes hangs on startup.
>>>>
>>>> The symptom is exactly the same as this:
>>>> https://lists.linux-foundation.org/pipermail/openais/2010-June/014854.html
>>>
>>> Arrgghhh!!!
>>>
>>> Can you try the following patch?
>>
>> With the patch the problem disappeared!
>> I've not been able to reproduce the hang after rebooting the node more
>> than 10 times (which was enough to reproduce it previously).

It didn't happen yesterday, but the same hang occurred again today.
I also tried corosync-1.2.6, but things did not get better.

Here are the stack trace and the corosync.conf from reproducing it with
corosync-1.2.6.
According to the core, the entry with fileno=10 looks broken (its _lock is 0x0
and its buffer pointers are out of bounds), while fileno=0,1,2,3 seem sane.
----8<--------8<--------8<--------8<--------8<--------8<--------8<----
[r...@pm01 20100701-memo]# gdb /usr/sbin/corosync core.2674
(...)
Core was generated by `corosync'.
#0 0x000000377a607b35 in pthread_join (threadid=1085937984, thread_return=0x0)
at pthread_join.c:89
89 lll_wait_tid (pd->tid);
(gdb) where
#0 0x000000377a607b35 in pthread_join (threadid=1085937984, thread_return=0x0)
at pthread_join.c:89
#1 0x00002b48a7899ad9 in logsys_atexit () at logsys.c:1687
#2 0x0000000000405b05 in sigsegv_handler (num=<value optimized out>)
at main.c:222
#3 <signal handler called>
#4 fresetlockfiles () at ../nptl/sysdeps/unix/sysv/linux/fork.c:48
#5 __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:155
#6 0x00002aaaaaba84de in spawn_child ()
from /usr/libexec/lcrso/pacemaker.lcrso
#7 0x00002aaaaabacb9b in pcmk_startup ()
from /usr/libexec/lcrso/pacemaker.lcrso
#8 0x0000000000408339 in corosync_service_link_and_init (
corosync_api=0x613920, service_name=0xe3ef850 "pacemaker", service_ver=0)
at service.c:201
#9 0x00000000004086e3 in corosync_service_defaults_link_and_init (
corosync_api=0x613920) at service.c:534
#10 0x0000000000405106 in main_service_ready () at main.c:1206
#11 0x00002b48a7679425 in main_iface_change_fn (context=0x2aaaaaaae010,
iface_addr=<value optimized out>, iface_no=<value optimized out>)
at totemsrp.c:4363
#12 0x00002b48a76701a7 in timer_function_netif_check_timeout (data=0xe416520)
at totemudp.c:1380
#13 0x00002b48a766d459 in timerlist_expire (handle=150346236434579456)
at tlist.h:309
#14 poll_run (handle=150346236434579456) at coropoll.c:448
#15 0x000000000040670c in main (argc=<value optimized out>,
argv=<value optimized out>) at main.c:1558
(gdb)
(gdb) up 4
#4 fresetlockfiles () at ../nptl/sysdeps/unix/sysv/linux/fork.c:48
48 _IO_lock_init (*((_IO_lock_t *) _IO_iter_file(i)->_lock));
(gdb) print *_IO_list_all
$2 = {file = {_flags = -72503612,
_IO_read_ptr = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
_IO_read_end = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
_IO_read_base = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
_IO_write_base = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
_IO_write_ptr = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
_IO_write_end = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
_IO_buf_base = 0x0, _IO_buf_end = 0x0, _IO_save_base = 0x0,
_IO_backup_base = 0x0, _IO_save_end = 0x0, _markers = 0x0,
_chain = 0x3779d51780, _fileno = 10, _flags2 = 0, _old_offset = 600,
_cur_column = 0, _vtable_offset = 0 '\000', _shortbuf = "\254",
_lock = 0x0, _offset = -1, _codecvt = 0x2aaaac002b20,
_wide_data = 0x40b9f640, _freeres_list = 0x0, _freeres_buf = 0x2010,
_freeres_size = 46912518490896, _mode = -1,
_unused2 = "\000\000\000\000)D\247y7\000\000\000+\v\000\254\252*\000"},
vtable = 0x3779d50520}
(gdb) print *_IO_list_all->file->_chain
$3 = {_flags = -72534908, _IO_read_ptr = 0x2b48a764f000 "",
_IO_read_end = 0x2b48a764f000 "", _IO_read_base = 0x2b48a764f000 "",
_IO_write_base = 0x2b48a764f000 "", _IO_write_ptr = 0x2b48a764f000 "",
_IO_write_end = 0x2b48a764f000 "", _IO_buf_base = 0x2b48a764f000 "",
_IO_buf_end = 0x2b48a7650000 <Address 0x2b48a7650000 out of bounds>,
_IO_save_base = 0x0, _IO_backup_base = 0x0, _IO_save_end = 0x0,
_markers = 0x0, _chain = 0x3779d51860, _fileno = 1, _flags2 = 0,
_old_offset = -1, _cur_column = 0, _vtable_offset = 0 '\000',
_shortbuf = "", _lock = 0x3779d52980, _offset = 0, _codecvt = 0x0,
_wide_data = 0x3779d51ac0, _freeres_list = 0x0, _freeres_buf = 0x0,
_freeres_size = 0, _mode = 0, _unused2 = '\000' <repeats 19 times>}
(gdb) print *_IO_list_all->file->_chain->_chain
$4 = {_flags = -72534908, _IO_read_ptr = 0x2b48a764e000 "",
_IO_read_end = 0x2b48a764e000 "", _IO_read_base = 0x2b48a764e000 "",
_IO_write_base = 0x2b48a764e000 "", _IO_write_ptr = 0x2b48a764e000 "",
_IO_write_end = 0x2b48a764e000 "", _IO_buf_base = 0x2b48a764e000 "",
_IO_buf_end = 0x2b48a764f000 "", _IO_save_base = 0x0, _IO_backup_base = 0x0,
_IO_save_end = 0x0, _markers = 0x0, _chain = 0x3779d516a0, _fileno = 2,
_flags2 = 0, _old_offset = -1, _cur_column = 0, _vtable_offset = 0 '\000',
_shortbuf = "", _lock = 0x3779d52990, _offset = 0, _codecvt = 0x0,
_wide_data = 0x3779d51c20, _freeres_list = 0x0, _freeres_buf = 0x0,
_freeres_size = 0, _mode = 0, _unused2 = '\000' <repeats 19 times>}
(gdb) print *_IO_list_all->file->_chain->_chain->_chain
$5 = {_flags = -72539000, _IO_read_ptr = 0x0, _IO_read_end = 0x0,
_IO_read_base = 0x0, _IO_write_base = 0x0, _IO_write_ptr = 0x0,
_IO_write_end = 0x0, _IO_buf_base = 0x0, _IO_buf_end = 0x0,
_IO_save_base = 0x0, _IO_backup_base = 0x0, _IO_save_end = 0x0,
_markers = 0x0, _chain = 0xe3ef500, _fileno = 0, _flags2 = 0,
_old_offset = -1, _cur_column = 0, _vtable_offset = 0 '\000',
_shortbuf = "", _lock = 0x3779d52970, _offset = -1, _codecvt = 0x0,
_wide_data = 0x3779d51960, _freeres_list = 0x0, _freeres_buf = 0x0,
_freeres_size = 0, _mode = 0, _unused2 = '\000' <repeats 19 times>}
(gdb) print *_IO_list_all->file->_chain->_chain->_chain->_chain
$6 = {_flags = -72532864,
_IO_read_ptr = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
_IO_read_end = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
_IO_read_base = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
_IO_write_base = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
_IO_write_ptr = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
_IO_write_end = 0x2aaaaaaae000 "",
_IO_buf_base = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
_IO_buf_end = 0x2aaaaaaae000 "", _IO_save_base = 0x0, _IO_backup_base = 0x0,
_IO_save_end = 0x0, _markers = 0x0, _chain = 0x0, _fileno = 3, _flags2 = 0,
_old_offset = 0, _cur_column = 0, _vtable_offset = 0 '\000', _shortbuf = "",
_lock = 0xe3ef5e0, _offset = -1, _codecvt = 0x0, _wide_data = 0xe3ef5f0,
_freeres_list = 0x0, _freeres_buf = 0x0, _freeres_size = 0, _mode = -1,
_unused2 = '\000' <repeats 19 times>}
(gdb)
----8<--------8<--------8<--------8<--------8<--------8<--------8<----
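
For anyone comparing their own core against this one: the check I did by eye
on the dump above (looking for a FILE entry whose _lock is NULL, which would
be consistent with the SIGSEGV at fork.c:48 where _lock is dereferenced) can
be sketched as a small script. This is only a rough sketch; the function name
find_null_locks and the sample text are made up for illustration, and the
sample is an abbreviated excerpt of the gdb output, not the full dump.

```python
import re


def find_null_locks(gdb_output):
    """Return the _fileno of every printed FILE struct whose _lock is NULL.

    Assumes the text is the result of gdb `print` commands, where each
    value starts on its own line as "$N = {".
    """
    broken = []
    # Split on the "$N = {" value headers so each chunk holds one struct.
    chunks = re.split(r"^\$\d+ = \{", gdb_output, flags=re.M)[1:]
    for chunk in chunks:
        fileno = re.search(r"\b_fileno = (-?\d+)", chunk)
        lock = re.search(r"\b_lock = (0x[0-9a-fA-F]+)", chunk)
        if fileno and lock and int(lock.group(1), 16) == 0:
            broken.append(int(fileno.group(1)))
    return broken


# Abbreviated excerpt of the dump (hypothetical sample, trimmed by hand).
sample = """\
$2 = {file = {_flags = -72503612,
_chain = 0x3779d51780, _fileno = 10, _flags2 = 0, _old_offset = 600,
_lock = 0x0, _offset = -1, _codecvt = 0x2aaaac002b20,
vtable = 0x3779d50520}
$3 = {_flags = -72534908, _IO_read_ptr = 0x2b48a764f000 "",
_markers = 0x0, _chain = 0x3779d51860, _fileno = 1, _flags2 = 0,
_shortbuf = "", _lock = 0x3779d52980, _offset = 0, _codecvt = 0x0,
"""

print(find_null_locks(sample))  # prints [10]
```

A NULL _lock alone does not prove where the corruption came from; it only
narrows down which entry in the list to look at.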
corosync.conf
----8<--------8<--------8<--------8<--------8<--------8<--------8<----
# Please read the corosync.conf.5 manual page
compatibility: whitetank
aisexec {
user: root
group: root
}
service {
name: pacemaker
ver : 0
}
totem {
version: 2
secauth: off
threads: 0
rrp_mode: active
token: 24000
consensus: 29000
clear_node_high_bit: yes
rrp_problem_count_timeout: 30000
interface {
ringnumber: 0
bindnetaddr: 192.168.1.0
mcastaddr: 226.94.1.1
mcastport: 5405
}
interface {
ringnumber: 1
bindnetaddr: 192.168.2.0
mcastaddr: 226.94.1.1
mcastport: 5405
}
}
logging {
fileline: off
to_stderr: yes
to_logfile: yes
to_syslog: yes
logfile: /tmp/corosync.log
debug: off
timestamp: on
}
----8<--------8<--------8<--------8<--------8<--------8<--------8<----
--
Keisuke MORI