On 07/01/2010 06:09 AM, Keisuke MORI wrote:
> Bad news...
>
> 2010/6/30 Andrew Beekhof <[email protected]>:
>> On Wed, Jun 30, 2010 at 12:06 PM, Keisuke MORI
>> <[email protected]> wrote:
>>> 2010/6/29 Andrew Beekhof <[email protected]>:
>>>> On Mon, Jun 28, 2010 at 2:20 PM, Keisuke MORI <[email protected]>
>>>> wrote:
>>>>> I've upgraded to pacemaker-1.0.9.1 / corosync-1.2.5 from clusterlabs on
>>>>> CentOS 5.5 using yum, but it still sometimes hangs on startup.
>>>>>
>>>>> The symptom is exactly the same as this:
>>>>>
>>>>> https://lists.linux-foundation.org/pipermail/openais/2010-June/014854.html
>>>>
>>>> Arrgghhh!!!
>>>>
>>>> Can you try the following patch?
>>>
>>> With the patch the problem disappeared!
>>> I've not been able to reproduce the hang after rebooting the node more
>>> than 10 times (which was enough to reproduce it previously).
>
> It didn't happen yesterday, but the same hang occurred again today.
>
> I also tried corosync-1.2.6, but things didn't get better.
>
> Here are the stack trace and the corosync.conf from when I reproduced it
> with corosync-1.2.6.
> According to the core, fileno=10 looks broken, while fileno=0,1,2,3 seem sane.
>
> ----8<--------8<--------8<--------8<--------8<--------8<--------8<----
> [r...@pm01 20100701-memo]# gdb /usr/sbin/corosync core.2674
> (...)
> Core was generated by `corosync'.
> #0  0x000000377a607b35 in pthread_join (threadid=1085937984,
>     thread_return=0x0)
>     at pthread_join.c:89
> 89          lll_wait_tid (pd->tid);
> (gdb) where
> #0  0x000000377a607b35 in pthread_join (threadid=1085937984,
>     thread_return=0x0)
>     at pthread_join.c:89
> #1  0x00002b48a7899ad9 in logsys_atexit () at logsys.c:1687
> #2  0x0000000000405b05 in sigsegv_handler (num=<value optimized out>)
>     at main.c:222
> #3  <signal handler called>
> #4  fresetlockfiles () at ../nptl/sysdeps/unix/sysv/linux/fork.c:48
> #5  __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:155
> #6  0x00002aaaaaba84de in spawn_child ()
>     from /usr/libexec/lcrso/pacemaker.lcrso
> #7  0x00002aaaaabacb9b in pcmk_startup ()
>     from /usr/libexec/lcrso/pacemaker.lcrso
> #8  0x0000000000408339 in corosync_service_link_and_init (
>     corosync_api=0x613920, service_name=0xe3ef850 "pacemaker", service_ver=0)
>     at service.c:201
> #9  0x00000000004086e3 in corosync_service_defaults_link_and_init (
>     corosync_api=0x613920) at service.c:534
> #10 0x0000000000405106 in main_service_ready () at main.c:1206
> #11 0x00002b48a7679425 in main_iface_change_fn (context=0x2aaaaaaae010,
>     iface_addr=<value optimized out>, iface_no=<value optimized out>)
>     at totemsrp.c:4363
> #12 0x00002b48a76701a7 in timer_function_netif_check_timeout (data=0xe416520)
>     at totemudp.c:1380
> ---Type <return> to continue, or q <return> to quit---
> #13 0x00002b48a766d459 in timerlist_expire (handle=150346236434579456)
>     at tlist.h:309
> #14 poll_run (handle=150346236434579456) at coropoll.c:448
> #15 0x000000000040670c in main (argc=<value optimized out>,
>     argv=<value optimized out>) at main.c:1558
> (gdb)
> (gdb) up 4
> #4  fresetlockfiles () at ../nptl/sysdeps/unix/sysv/linux/fork.c:48
> 48          _IO_lock_init (*((_IO_lock_t *) _IO_iter_file(i)->_lock));
> (gdb) print *_IO_list_all
> $2 = {file = {_flags = -72503612,
>     _IO_read_ptr = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>     _IO_read_end = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>     _IO_read_base = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>     _IO_write_base = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>     _IO_write_ptr = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>     _IO_write_end = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>     _IO_buf_base = 0x0, _IO_buf_end = 0x0, _IO_save_base = 0x0,
>     _IO_backup_base = 0x0, _IO_save_end = 0x0, _markers = 0x0,
>     _chain = 0x3779d51780, _fileno = 10, _flags2 = 0, _old_offset = 600,
>     _cur_column = 0, _vtable_offset = 0 '\000', _shortbuf = "\254",
>     _lock = 0x0, _offset = -1, _codecvt = 0x2aaaac002b20,
>     _wide_data = 0x40b9f640, _freeres_list = 0x0, _freeres_buf = 0x2010,
>     _freeres_size = 46912518490896, _mode = -1,
>     _unused2 = "\000\000\000\000)D\247y7\000\000\000+\v\000\254\252*\000"},
>   vtable = 0x3779d50520}
> (gdb) print *_IO_list_all->file->_chain
> $3 = {_flags = -72534908, _IO_read_ptr = 0x2b48a764f000 "",
>   _IO_read_end = 0x2b48a764f000 "", _IO_read_base = 0x2b48a764f000 "",
>   _IO_write_base = 0x2b48a764f000 "", _IO_write_ptr = 0x2b48a764f000 "",
>   _IO_write_end = 0x2b48a764f000 "", _IO_buf_base = 0x2b48a764f000 "",
>   _IO_buf_end = 0x2b48a7650000 <Address 0x2b48a7650000 out of bounds>,
>   _IO_save_base = 0x0, _IO_backup_base = 0x0, _IO_save_end = 0x0,
>   _markers = 0x0, _chain = 0x3779d51860, _fileno = 1, _flags2 = 0,
>   _old_offset = -1, _cur_column = 0, _vtable_offset = 0 '\000',
>   _shortbuf = "", _lock = 0x3779d52980, _offset = 0, _codecvt = 0x0,
>   _wide_data = 0x3779d51ac0, _freeres_list = 0x0, _freeres_buf = 0x0,
>   _freeres_size = 0, _mode = 0, _unused2 = '\000' <repeats 19 times>}
> (gdb) print *_IO_list_all->file->_chain->_chain
> $4 = {_flags = -72534908, _IO_read_ptr = 0x2b48a764e000 "",
>   _IO_read_end = 0x2b48a764e000 "", _IO_read_base = 0x2b48a764e000 "",
>   _IO_write_base = 0x2b48a764e000 "", _IO_write_ptr = 0x2b48a764e000 "",
>   _IO_write_end = 0x2b48a764e000 "", _IO_buf_base = 0x2b48a764e000 "",
>   _IO_buf_end = 0x2b48a764f000 "", _IO_save_base = 0x0, _IO_backup_base =
> 0x0,
>   _IO_save_end = 0x0, _markers = 0x0, _chain = 0x3779d516a0, _fileno = 2,
>   _flags2 = 0, _old_offset = -1, _cur_column = 0, _vtable_offset = 0 '\000',
>   _shortbuf = "", _lock = 0x3779d52990, _offset = 0, _codecvt = 0x0,
>   _wide_data = 0x3779d51c20, _freeres_list = 0x0, _freeres_buf = 0x0,
>   _freeres_size = 0, _mode = 0, _unused2 = '\000' <repeats 19 times>}
> (gdb) print *_IO_list_all->file->_chain->_chain->_chain
> $5 = {_flags = -72539000, _IO_read_ptr = 0x0, _IO_read_end = 0x0,
>   _IO_read_base = 0x0, _IO_write_base = 0x0, _IO_write_ptr = 0x0,
>   _IO_write_end = 0x0, _IO_buf_base = 0x0, _IO_buf_end = 0x0,
>   _IO_save_base = 0x0, _IO_backup_base = 0x0, _IO_save_end = 0x0,
>   _markers = 0x0, _chain = 0xe3ef500, _fileno = 0, _flags2 = 0,
>   _old_offset = -1, _cur_column = 0, _vtable_offset = 0 '\000',
>   _shortbuf = "", _lock = 0x3779d52970, _offset = -1, _codecvt = 0x0,
>   _wide_data = 0x3779d51960, _freeres_list = 0x0, _freeres_buf = 0x0,
>   _freeres_size = 0, _mode = 0, _unused2 = '\000' <repeats 19 times>}
> (gdb) print *_IO_list_all->file->_chain->_chain->_chain->_chain
> $6 = {_flags = -72532864,
>   _IO_read_ptr = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ]
> info: get_config_opt: Defaulting to 'pcmk' for option:
> clustername\nce\nty\nde 0).\n",
>   _IO_read_end = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ]
> info: get_config_opt: Defaulting to 'pcmk' for option:
> clustername\nce\nty\nde 0).\n",
>   _IO_read_base = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ]
> info: get_config_opt: Defaulting to 'pcmk' for option:
> clustername\nce\nty\nde 0).\n",
>   _IO_write_base = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ]
> info: get_config_opt: Defaulting to 'pcmk' for option:
> clustername\nce\nty\nde 0).\n",
>   _IO_write_ptr = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ]
> info: get_config_opt: Defaulting to 'pcmk' for option:
> clustername\nce\nty\nde 0).\n",
>   _IO_write_end = 0x2aaaaaaae000 "",
>   _IO_buf_base = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk ]
> info: get_config_opt: Defaulting to 'pcmk' for option:
> clustername\nce\nty\nde 0).\n",
>   _IO_buf_end = 0x2aaaaaaae000 "", _IO_save_base = 0x0, _IO_backup_base =
> 0x0,
>   _IO_save_end = 0x0, _markers = 0x0, _chain = 0x0, _fileno = 3, _flags2 = 0,
>   _old_offset = 0, _cur_column = 0, _vtable_offset = 0 '\000', _shortbuf =
> "",
>   _lock = 0xe3ef5e0, _offset = -1, _codecvt = 0x0, _wide_data = 0xe3ef5f0,
>   _freeres_list = 0x0, _freeres_buf = 0x0, _freeres_size = 0, _mode = -1,
>   _unused2 = '\000' <repeats 19 times>}
> (gdb)
>
> ----8<--------8<--------8<--------8<--------8<--------8<--------8<----
>
> corosync.conf
> ----8<--------8<--------8<--------8<--------8<--------8<--------8<----
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
>
> aisexec {
>         user: root
>         group: root
> }
>
> service {
>         name: pacemaker
>         ver : 0
> }
>
> totem {
>         version: 2
>         secauth: off
>         threads: 0
>         rrp_mode: active
>         token: 24000
>         consensus: 29000
>         clear_node_high_bit: yes
>         rrp_problem_count_timeout: 30000
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 192.168.1.0
>                 mcastaddr: 226.94.1.1
>                 mcastport: 5405
>         }
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 192.168.2.0
>                 mcastaddr: 226.94.1.1
>                 mcastport: 5405
>         }
> }
>
> logging {
>         fileline: off
>         to_stderr: yes
>         to_logfile: yes
>         to_syslog: yes
>         logfile: /tmp/corosync.log
>         debug: off
>         timestamp: on
> }
> ----8<--------8<--------8<--------8<--------8<--------8<--------8<----
>

Thank you for the detailed bug report.
Would you mind also posting the corosync-fplay output?

You mentioned that the segv occurred again. Was it during startup, or later at runtime when pacemaker forked a process?

Thanks
-steve
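
For readers following the dump: frame #4 stops at `_IO_lock_init (*((_IO_lock_t *) _IO_iter_file(i)->_lock))`, and the first entry printed from `_IO_list_all` (fileno 10) shows `_lock = 0x0`, while the entries for fileno 1, 2, 0 and 3 carry non-null locks. The Python sketch below only re-encodes those five (fileno, _lock) pairs from the gdb output above and flags the entries whose lock pointer is null. The names `StreamRecord`, `CHAIN` and `streams_without_lock` are hypothetical, and this is a reading of the posted core, not a confirmed diagnosis.

```python
from typing import List, NamedTuple


class StreamRecord(NamedTuple):
    """One FILE entry as printed by gdb (hypothetical helper type)."""
    fileno: int
    lock: int  # value of the _lock pointer; 0 stands for NULL


# (fileno, _lock) pairs copied from the gdb output, in chain order ($2..$6).
CHAIN = [
    StreamRecord(10, 0x0),
    StreamRecord(1, 0x3779D52980),
    StreamRecord(2, 0x3779D52990),
    StreamRecord(0, 0x3779D52970),
    StreamRecord(3, 0xE3EF5E0),
]


def streams_without_lock(chain: List[StreamRecord]) -> List[int]:
    """Return the fileno of every entry whose _lock pointer is NULL."""
    return [s.fileno for s in chain if s.lock == 0]


if __name__ == "__main__":
    print(streams_without_lock(CHAIN))  # prints [10]
```

Only the fileno=10 entry is flagged, which matches the observation in the report that fileno=10 looks broken while 0, 1, 2 and 3 look sane.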
