Re: [Openais] Corosync 1.2.5 still hangs on startup

Steven Dake Thu, 01 Jul 2010 12:58:50 -0700

On 07/01/2010 06:09 AM, Keisuke MORI wrote:
> Bad news...
>
> 2010/6/30 Andrew Beekhof <[email protected]>:
>> On Wed, Jun 30, 2010 at 12:06 PM, Keisuke MORI <[email protected]> wrote:
>>> 2010/6/29 Andrew Beekhof <[email protected]>:
>>>> On Mon, Jun 28, 2010 at 2:20 PM, Keisuke MORI <[email protected]> wrote:
>>>>> I've upgraded to pacemaker-1.0.9.1 / corosync-1.2.5 from clusterlabs on
>>>>> CentOS 5.5 using yum, but it still sometimes hangs on startup.
>>>>>
>>>>> The symptom is exactly the same as this:
>>>>> https://lists.linux-foundation.org/pipermail/openais/2010-June/014854.html
>>>>
>>>> Arrgghhh!!!
>>>>
>>>> Can you try the following patch?
>>>
>>> With the patch the problem disappeared!
>>> I've not been able to reproduce the hang after rebooting the node more
>>> than 10 times (which was enough to reproduce it previously).
>
> It didn't happen yesterday, but the same hang occurred again today.
>
> I also tried corosync-1.2.6, but things didn't get better.
>
> Here is the stack trace and the corosync.conf from when I reproduced it
> with corosync-1.2.6.
> According to the core, fileno=10 looks broken, while fileno=0,1,2,3 seem sane.
>
> ----8<--------8<--------8<--------8<--------8<--------8<--------8<----
> [r...@pm01 20100701-memo]# gdb /usr/sbin/corosync core.2674
> (...)
> Core was generated by `corosync'.
> #0  0x000000377a607b35 in pthread_join (threadid=1085937984, thread_return=0x0)
>      at pthread_join.c:89
> 89          lll_wait_tid (pd->tid);
> (gdb) where
> #0  0x000000377a607b35 in pthread_join (threadid=1085937984, thread_return=0x0)
>      at pthread_join.c:89
> #1  0x00002b48a7899ad9 in logsys_atexit () at logsys.c:1687
> #2  0x0000000000405b05 in sigsegv_handler (num=<value optimized out>)
>      at main.c:222
> #3  <signal handler called>
> #4  fresetlockfiles () at ../nptl/sysdeps/unix/sysv/linux/fork.c:48
> #5  __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:155
> #6  0x00002aaaaaba84de in spawn_child ()
>     from /usr/libexec/lcrso/pacemaker.lcrso
> #7  0x00002aaaaabacb9b in pcmk_startup ()
>     from /usr/libexec/lcrso/pacemaker.lcrso
> #8  0x0000000000408339 in corosync_service_link_and_init (
>      corosync_api=0x613920, service_name=0xe3ef850 "pacemaker", service_ver=0)
>      at service.c:201
> #9  0x00000000004086e3 in corosync_service_defaults_link_and_init (
>      corosync_api=0x613920) at service.c:534
> #10 0x0000000000405106 in main_service_ready () at main.c:1206
> #11 0x00002b48a7679425 in main_iface_change_fn (context=0x2aaaaaaae010,
>      iface_addr=<value optimized out>, iface_no=<value optimized out>)
>      at totemsrp.c:4363
> #12 0x00002b48a76701a7 in timer_function_netif_check_timeout (data=0xe416520)
>      at totemudp.c:1380
> ---Type <return> to continue, or q <return> to quit---
> #13 0x00002b48a766d459 in timerlist_expire (handle=150346236434579456)
>      at tlist.h:309
> #14 poll_run (handle=150346236434579456) at coropoll.c:448
> #15 0x000000000040670c in main (argc=<value optimized out>,
>      argv=<value optimized out>) at main.c:1558
> (gdb)
> (gdb) up 4
> #4  fresetlockfiles () at ../nptl/sysdeps/unix/sysv/linux/fork.c:48
> 48          _IO_lock_init (*((_IO_lock_t *) _IO_iter_file(i)->_lock));
> (gdb) print *_IO_list_all
> $2 = {file = {_flags = -72503612,
>      _IO_read_ptr = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>      _IO_read_end = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>      _IO_read_base = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>      _IO_write_base = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>      _IO_write_ptr = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>      _IO_write_end = 0x2aaaab278000 <Address 0x2aaaab278000 out of bounds>,
>      _IO_buf_base = 0x0, _IO_buf_end = 0x0, _IO_save_base = 0x0,
>      _IO_backup_base = 0x0, _IO_save_end = 0x0, _markers = 0x0,
>      _chain = 0x3779d51780, _fileno = 10, _flags2 = 0, _old_offset = 600,
>      _cur_column = 0, _vtable_offset = 0 '\000', _shortbuf = "\254",
>      _lock = 0x0, _offset = -1, _codecvt = 0x2aaaac002b20,
>      _wide_data = 0x40b9f640, _freeres_list = 0x0, _freeres_buf = 0x2010,
>      _freeres_size = 46912518490896, _mode = -1,
>      _unused2 = "\000\000\000\000)D\247y7\000\000\000+\v\000\254\252*\000"},
>    vtable = 0x3779d50520}
> (gdb) print *_IO_list_all->file->_chain
> $3 = {_flags = -72534908, _IO_read_ptr = 0x2b48a764f000 "",
>    _IO_read_end = 0x2b48a764f000 "", _IO_read_base = 0x2b48a764f000 "",
>    _IO_write_base = 0x2b48a764f000 "", _IO_write_ptr = 0x2b48a764f000 "",
>    _IO_write_end = 0x2b48a764f000 "", _IO_buf_base = 0x2b48a764f000 "",
>    _IO_buf_end = 0x2b48a7650000 <Address 0x2b48a7650000 out of bounds>,
>    _IO_save_base = 0x0, _IO_backup_base = 0x0, _IO_save_end = 0x0,
>    _markers = 0x0, _chain = 0x3779d51860, _fileno = 1, _flags2 = 0,
>    _old_offset = -1, _cur_column = 0, _vtable_offset = 0 '\000',
>    _shortbuf = "", _lock = 0x3779d52980, _offset = 0, _codecvt = 0x0,
>    _wide_data = 0x3779d51ac0, _freeres_list = 0x0, _freeres_buf = 0x0,
>    _freeres_size = 0, _mode = 0, _unused2 = '\000' <repeats 19 times>}
> (gdb) print *_IO_list_all->file->_chain->_chain
> $4 = {_flags = -72534908, _IO_read_ptr = 0x2b48a764e000 "",
>    _IO_read_end = 0x2b48a764e000 "", _IO_read_base = 0x2b48a764e000 "",
>    _IO_write_base = 0x2b48a764e000 "", _IO_write_ptr = 0x2b48a764e000 "",
>    _IO_write_end = 0x2b48a764e000 "", _IO_buf_base = 0x2b48a764e000 "",
>    _IO_buf_end = 0x2b48a764f000 "", _IO_save_base = 0x0, _IO_backup_base = 0x0,
>    _IO_save_end = 0x0, _markers = 0x0, _chain = 0x3779d516a0, _fileno = 2,
>    _flags2 = 0, _old_offset = -1, _cur_column = 0, _vtable_offset = 0 '\000',
>    _shortbuf = "", _lock = 0x3779d52990, _offset = 0, _codecvt = 0x0,
>    _wide_data = 0x3779d51c20, _freeres_list = 0x0, _freeres_buf = 0x0,
>    _freeres_size = 0, _mode = 0, _unused2 = '\000' <repeats 19 times>}
> (gdb) print *_IO_list_all->file->_chain->_chain->_chain
> $5 = {_flags = -72539000, _IO_read_ptr = 0x0, _IO_read_end = 0x0,
>    _IO_read_base = 0x0, _IO_write_base = 0x0, _IO_write_ptr = 0x0,
>    _IO_write_end = 0x0, _IO_buf_base = 0x0, _IO_buf_end = 0x0,
>    _IO_save_base = 0x0, _IO_backup_base = 0x0, _IO_save_end = 0x0,
>    _markers = 0x0, _chain = 0xe3ef500, _fileno = 0, _flags2 = 0,
>    _old_offset = -1, _cur_column = 0, _vtable_offset = 0 '\000',
>    _shortbuf = "", _lock = 0x3779d52970, _offset = -1, _codecvt = 0x0,
>    _wide_data = 0x3779d51960, _freeres_list = 0x0, _freeres_buf = 0x0,
>    _freeres_size = 0, _mode = 0, _unused2 = '\000' <repeats 19 times>}
> (gdb) print *_IO_list_all->file->_chain->_chain->_chain->_chain
> $6 = {_flags = -72532864,
>    _IO_read_ptr = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk  ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
>    _IO_read_end = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk  ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
>    _IO_read_base = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk  ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
>    _IO_write_base = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk  ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
>    _IO_write_ptr = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk  ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
>    _IO_write_end = 0x2aaaaaaae000 "",
>    _IO_buf_base = 0x2aaaaaaad000 "Jul 01 18:09:27 corosync [pcmk  ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername\nce\nty\nde 0).\n",
>    _IO_buf_end = 0x2aaaaaaae000 "", _IO_save_base = 0x0, _IO_backup_base = 0x0,
>    _IO_save_end = 0x0, _markers = 0x0, _chain = 0x0, _fileno = 3, _flags2 = 0,
>    _old_offset = 0, _cur_column = 0, _vtable_offset = 0 '\000', _shortbuf = "",
>    _lock = 0xe3ef5e0, _offset = -1, _codecvt = 0x0, _wide_data = 0xe3ef5f0,
>    _freeres_list = 0x0, _freeres_buf = 0x0, _freeres_size = 0, _mode = -1,
>    _unused2 = '\000' <repeats 19 times>}
> (gdb)
>
> ----8<--------8<--------8<--------8<--------8<--------8<--------8<----
>
> corosync.conf
> ----8<--------8<--------8<--------8<--------8<--------8<--------8<----
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
>
> aisexec {
>           user: root
>           group: root
> }
>
> service {
>           name: pacemaker
>           ver : 0
> }
>
> totem {
>          version: 2
>          secauth: off
>          threads: 0
>          rrp_mode: active
>          token: 24000
>          consensus: 29000
>          clear_node_high_bit: yes
>          rrp_problem_count_timeout: 30000
>          interface {
>                  ringnumber: 0
>                  bindnetaddr: 192.168.1.0
>                  mcastaddr: 226.94.1.1
>                  mcastport: 5405
>          }
>          interface {
>                  ringnumber: 1
>                  bindnetaddr: 192.168.2.0
>                  mcastaddr: 226.94.1.1
>                  mcastport: 5405
>          }
> }
>
> logging {
>          fileline: off
>          to_stderr: yes
>          to_logfile: yes
>          to_syslog: yes
>          logfile: /tmp/corosync.log
>          debug: off
>          timestamp: on
> }
> ----8<--------8<--------8<--------8<--------8<--------8<--------8<----
>
Thank you for the detailed bug report.


Would you mind also posting the corosync-fplay output?

You mentioned that the segv occurred again.  Was it during startup,
or later at runtime when pacemaker forked a process?
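
As an aside on the quoted gdb session: frame #4 (fork.c:48) dereferences
each stream's _lock while walking the stream list, and the entry with
_fileno = 10 shows _lock = 0x0, which would be consistent with the
"fileno=10 looks broken" observation. Below is a minimal sketch for
scanning such a dump for streams with a null _lock. The function name
find_null_lock_streams and the shortened sample text are hypothetical and
only for illustration; it assumes the dump is the output of gdb "print"
commands, optionally still carrying mail quote markers.

```python
import re


def find_null_lock_streams(gdb_text):
    """Return the _fileno of every dumped FILE whose _lock pointer is 0x0."""
    # Strip mail quoting ("> " prefixes) and join wrapped lines.
    lines = [re.sub(r'^\s*(>\s?)+', '', line) for line in gdb_text.splitlines()]
    text = ' '.join(lines)
    suspects = []
    # Each gdb value starts with "$N = {"; split the dump on those markers.
    for chunk in re.split(r'\$\d+ = \{', text)[1:]:
        fileno = re.search(r'\b_fileno = (-?\d+)', chunk)
        lock = re.search(r'\b_lock = (0x[0-9a-fA-F]+)', chunk)
        if fileno and lock and int(lock.group(1), 16) == 0:
            suspects.append(int(fileno.group(1)))
    return suspects


# Shortened excerpt of the dump quoted above (most fields omitted).
sample = """
> $2 = {file = {_flags = -72503612,
>      _chain = 0x3779d51780, _fileno = 10, _flags2 = 0,
>      _lock = 0x0, _offset = -1},
>    vtable = 0x3779d50520}
> $3 = {_flags = -72534908,
>    _markers = 0x0, _chain = 0x3779d51860, _fileno = 1, _flags2 = 0,
>    _shortbuf = "", _lock = 0x3779d52980, _offset = 0}
"""

print(find_null_lock_streams(sample))
```

This only flags candidates; it says nothing about which code created the
stream or why its lock is unset.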

Thanks
-steve
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
