Hi Lars,
Hi Dejan,
I captured ltrace output when the problem occurred.
I have attached the ltrace file.
I am continuing the investigation with gdb.
If you have any suggestions for improvement, please let me know.
Best Regards,
Hideo Yamauchi.
--- On Tue, 2012/1/10, renayama19661...@ybb.ne.jp <renayama19661...@ybb.ne.jp>
wrote:
> Hi Lars,
>
> I attach the strace file from when the problem reappeared at the end of
> last year.
> I used glue with your patch applied, for confirmation.
>
> I captured the file from attrd with the strace -p command right before I
> stopped Heartbeat.
>
> In the end attrd received SIGTERM, but it did not stop.
> attrd stopped only afterwards, when I sent SIGKILL.
>
> * I will collect further information, such as ltrace output, from now on.
>
> Best Regards,
> Hideo Yamauchi.
>
>
> --- On Thu, 2012/1/5, renayama19661...@ybb.ne.jp <renayama19661...@ybb.ne.jp>
> wrote:
>
> > Hi Lars,
> >
> > > If you are able to reproduce,
> > > you could try to find out what exactly attrd is doing.
> > >
> > > various ways to try to do that:
> > > cat /proc/<pid-of-attrd>/stack # if your platform supports that
> > > strace it,
> > > ltrace it,
> > > attach with gdb and provide a stack trace, or even start to single step
> > > it,
> > > cause attrd to core dump, and analyse the core.
> >
> > All right.
> > I will investigate the cause a little more.
> >
> > Please give me a little more time for the investigation.
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> > --- On Fri, 2011/12/30, Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
> >
> > > On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp
> > > wrote:
> > > > Hi Dejan,
> > > > Hi Lars,
> > > >
> > > > In our environment, the problem recurred even with Lars's patch applied.
> > > > After the problem occurred, I sent a TERM signal, but attrd does not
> > > > seem to receive TERM at all.
> > >
> > > If you are able to reproduce,
> > > you could try to find out what exactly attrd is doing.
> > >
> > > various ways to try to do that:
> > > cat /proc/<pid-of-attrd>/stack # if your platform supports that
> > > strace it,
> > > ltrace it,
> > > attach with gdb and provide a stack trace, or even start to single step
> > > it,
> > > cause attrd to core dump, and analyse the core.
> > >
> > > > The patch needs to be reconsidered to solve the problem.
> > > >
> > > >
> > > > Best Regards,
> > > > Hideo Yamauchi.
> > > >
> > > >
> > > > --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp
> > > > <renayama19661...@ybb.ne.jp> wrote:
> > > >
> > > > > Hi Dejan,
> > > > > Hi Lars,
> > > > >
> > > > > I understand.
> > > > > I will test the patch in our environment.
> > > > >
> > > > > To Alan: Will you try the patch?
> > > > >
> > > > > Best Regards,
> > > > > Hideo Yamauchi.
> > > > >
> > > > > --- On Tue, 2011/11/15, Dejan Muhamedagic <deja...@fastmail.fm> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> > > > > > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > > > > > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
> > > > > > > > <lars.ellenb...@linbit.com> wrote:
> > > > > > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof
> > > > > > > > > wrote:
> > > > > > > > >> On Tue, Oct 18, 2011 at 12:19 PM,
> > > > > > > > >> <renayama19661...@ybb.ne.jp> wrote:
> > > > > > > > >> > Hi,
> > > > > > > > >> >
> > > > > > > > > >> > attrd sometimes fails to stop.
> > > > > > > > > >> >
> > > > > > > > > >> > Step1. Start a cluster with 2 nodes.
> > > > > > > > > >> > Step2. Stop the first node (/etc/init.d/heartbeat stop).
> > > > > > > > > >> > Step3. Stop the second node after a little time has passed
> > > > > > > > > >> > (/etc/init.d/heartbeat stop).
> > > > > > > > > >> >
> > > > > > > > > >> > attrd catches the TERM signal, but does not stop.
> > > > > > > > >>
> > > > > > > > >> There's no evidence that it actually catches it, only that
> > > > > > > > >> it is sent.
> > > > > > > > >> I've seen it before but never figured out why it occurs.
> > > > > > > > >
> > > > > > > > > I had it once tracked down almost to where it occurs, but
> > > > > > > > > then got distracted.
> > > > > > > > > Yes the signal was delivered.
> > > > > > > > >
> > > > > > > > > I *think* it had to do with attrd doing a blocking read,
> > > > > > > > > or looping in some internal message delivery function too
> > > > > > > > > often.
> > > > > > > > >
> > > > > > > > > I had a quick look at the code again now, to try and remember,
> > > > > > > > > but I'm not sure.
> > > > > > > > >
> > > > > > > > > > It *may* be that, because
> > > > > > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > > > > > > > > msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout,
> > > > > > > > > &ipc_rc);
> > > > > > > > >
> > > > > > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > > > > > > > > IPC_INTR:
> > > > > > > > > if ( allow_intr){
> > > > > > > > > goto startwait;
> > > > > > > > >
> > > > > > > > > > Depending on the frequency of delivered signals, it may cause
> > > > > > > > > > this goto startwait loop to never exit, because the timeout
> > > > > > > > > > always starts again from the full passed-in timeout.
> > > > > > > > > >
> > > > > > > > > > If only one signal is delivered, it may still take 120 seconds
> > > > > > > > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the
> > > > > > > > > > signal handler only raises a flag for the next mainloop
> > > > > > > > > > iteration.
> > > > > > > > > >
> > > > > > > > > > If a (non-fatal) signal is delivered every few seconds,
> > > > > > > > > > then the goto loop will never time out.
> > > > > > > > >
> > > > > > > > > Please someone check this for plausibility ;-)
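
To make the reasoning above easier to check, here is a minimal model in C. It is only a sketch and not the cluster-glue code: the function names (wait_restarting, wait_with_remaining), the one-second tick and the give_up_after bound are assumptions made for illustration. It counts how many seconds a wait lasts when a signal interrupts it every signal_period seconds.

```c
#include <assert.h>

/* Hypothetical model, not cluster-glue code. Time advances in whole
 * seconds; a signal interrupts the wait every signal_period seconds
 * (0 means no signals). give_up_after bounds the simulation. */

/* Each interruption restarts the wait with the full timeout, as the
 * "goto startwait" path is suspected to do. Returns the elapsed
 * seconds, or -1 if the wait has not timed out after give_up_after. */
int wait_restarting(int timeout, int signal_period, int give_up_after)
{
    assert(timeout > 0 && signal_period >= 0);
    int waited = 0;                     /* seconds in the current attempt */
    for (int now = 1; now <= give_up_after; now++) {
        waited++;
        if (waited >= timeout)
            return now;                 /* the timeout finally expired */
        if (signal_period > 0 && now % signal_period == 0)
            waited = 0;                 /* interrupted: start over */
    }
    return -1;
}

/* Each interruption resumes the wait with only the remaining time. */
int wait_with_remaining(int timeout, int signal_period, int give_up_after)
{
    assert(timeout > 0 && signal_period >= 0);
    int remaining = timeout;
    for (int now = 1; now <= give_up_after; now++) {
        remaining--;
        if (remaining <= 0)
            return now;                 /* total wait reached the timeout */
        if (signal_period > 0 && now % signal_period == 0)
            continue;                   /* interrupted: keep the remainder */
    }
    return -1;
}
```

With a timeout of 120 and a signal every 5 seconds, wait_restarting never times out within an hour of simulated time, while wait_with_remaining returns after 120 seconds. Without signals both return after 120 seconds.
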
> > > > > > > >
> > > > > > > > > Most plausible explanation I've heard so far... still odd
> > > > > > > > > that only attrd is affected.
> > > > > > > > So what do we do about it?
> > > > > > >
> > > > > > > Reproduce, and confirm that this is what people are seeing.
> > > > > > >
> > > > > > > Make attrd non-blocking?
> > > > > > >
> > > > > > > Fix the ipc layer to not restart the full timeout,
> > > > > > > but only the remaining partial time?
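
The "remaining partial time" idea could look roughly like the sketch below. This is not the attached cluster-glue patch and not the IPC layer itself; wait_readable and now_ms are hypothetical names, and plain poll() on a file descriptor stands in for the IPC wait. The deadline is computed once, and after every EINTR the wait continues only for the time that is still left.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <time.h>

/* Milliseconds on the monotonic clock. */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical helper, not part of cluster-glue: wait until fd is
 * readable or timeout_ms has elapsed in total. Returns >0 if readable,
 * 0 on timeout, -1 on an error other than EINTR. */
int wait_readable(int fd, int timeout_ms)
{
    assert(timeout_ms >= 0);
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    long long deadline = now_ms() + timeout_ms;   /* fixed once */

    for (;;) {
        long long left = deadline - now_ms();
        if (left < 0)
            left = 0;
        int rc = poll(&pfd, 1, (int)left);
        if (rc >= 0 || errno != EINTR)
            return rc;
        /* Interrupted by a signal: give up if the deadline has passed,
         * otherwise loop and wait only for the time that is left. */
        if (now_ms() >= deadline)
            return 0;
    }
}
```

Because the deadline does not move, a signal arriving every few seconds can no longer keep the caller waiting forever; the worst case is bounded by the timeout that was passed in.
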
> > > > > >
> > > > > > Lars and I made a quick patch for cluster-glue (attached).
> > > > > > Hideo-san, is there a way for you to verify if it helps? The
> > > > > > patch is not perfect and under unfavourable circumstances it may
> > > > > > still take a long time for the caller to exit, but it'd be good
> > > > > > to know if this is the right spot.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Dejan
> > > > > >
> > > > > > > --
> > > > > > > : Lars Ellenberg
> > > > > > > : LINBIT | Your Way to High Availability
> > > > > > > : DRBD/HA support and consulting http://www.linbit.com
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > > > > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > > > > >
> > > > > > > Project Home: http://www.clusterlabs.org
> > > > > > > Getting started:
> > > > > > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > > > > Bugs:
> > > > > > > http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> > > > > >
> > > > >
> > > >
> > >
> > > --
> > > : Lars Ellenberg
> > > : LINBIT | Your Way to High Availability
> > > : DRBD/HA support and consulting http://www.linbit.com
> > >
> > > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> > >
> > >
> >
> >
31986 --- SIGUSR1 (User defined signal 1) ---
31986 cl_log(7, 0x804bfb7, 2, 0x804caaa, 0) = 0
31986 xmlfromIPC(0x901c230, 120, 2, 0x804caaa, 0) = 0x901c3b0
31986 cl_log(7, 0x804bfd0, 2, 0x804caaa, 0) = 0
31986 cl_log(6, 0x804bff4, 0x804caaa, 0, 0) = 0
31986 print_xml_formatted(9, 0x804caaa, 0x901c3b0, 0x804caaa, 0) = 0
31986 crm_element_value(0x901c3b0, 0x804c3e6, 0, 0x59b6f799, 0xa261c0) = 0x901f558
31986 crm_element_value(0x901c3b0, 0x804c3ea, 0, 0x59b6f799, 0xa261c0) = 0x901d548
31986 crm_element_value(0x901c3b0, 0x804c280, 0, 0x59b6f799, 0xa261c0) = 0x901d5e0
31986 crm_element_value(0x901c3b0, 0x804c843, 0, 0x59b6f799, 0xa261c0) = 0x901f9a8
31986 crm_element_value(0x901c3b0, 0x804c3ef, 0, 0x59b6f799, 0xa261c0) = 0x901fa88
31986 crm_str_eq(0x901d548, 0x804c84e, 0, 0x59b6f799, 0xa261c0) = 0
31986 safe_str_neq(0x901fa88, 0x90192a0, 0, 0x59b6f799, 0xa261c0) = 0
31986 cl_log(7, 0x804c87d, 0x804c9a9, 0x901d548, 0x901f558) = 0
31986 crm_element_value(0x901c3b0, 0x804c280, 0xbfd72898, 0xa05ece, 0x901fa88) = 0x901d5e0
31986 g_hash_table_lookup(0x9018600, 0x901d5e0, 0xbfd72898, 0xa05ece, 0x901fa88 <unfinished ...>
31986 g_str_hash(0x901d5e0, 0x804c280, 88, 0xa261c0, 868) = 0xf5e80656
31986 g_str_equal(0x901e8f0, 0x901d5e0, 88, 0xa261c0, 868) = 1
31986 <... g_hash_table_lookup resumed> ) = 0x901ae70
31986 crm_element_value(0x901c3b0, 0x804c2ea, 0xbfd72898, 0xa05ece, 0x901fa88) = 0
31986 crm_element_value(0x901c3b0, 0x804c304, 0xbfd72898, 0xa05ece, 0x901fa88) = 0x901fa30
31986 free(0x901e988) = <void>
31986 crm_strdup_fn(0x901fa30, 0x804c16a, 0x804ca64, 302, 0x901fa88) = 0x901e988
31986 cl_log(7, 0x804c318, 2, 0x804ca64, 0x901d5e0) = 0
31986 crm_element_value(0x901c3b0, 0x804c336, 2, 0x804ca64, 0x901d5e0) = 0
31986 cl_log(7, 0x804c1cc, 2, 0x804ca74, 0x804c363) = 0
31986 cl_log(7, 0x804c1ec, 2, 0x804ca74, 0x901e988) = 0
31986 cl_log(7, 0x804c215, 2, 0x804ca74, 0x901e8f0) = 0
31986 cl_log(7, 0x804c23e, 2, 0x804ca74, 0) = 0
31986 cl_log(7, 0x804c267, 2, 0x804ca74, 0) = 0
31986 crm_element_value(0x901c3b0, 0x804c89b, 0x804c9a9, 0x901d548, 0x901f558) = 0
31986 cl_log(7, 0x804c8a4, 0x804c9a9, 0x901f9a8, 0) = 0
31986 crm_str_eq(0x901f9a8, 0, 0, 0x901f9a8, 0) = 0
31986 strlen("1326278312") = 10
31986 crm_str_eq(0x901f9a8, 0, 0, 0x901f9a8, 0) = 0
31986 free(NULL) = <void>
31986 crm_strdup_fn(0x901f9a8, 0x804c16a, 0x804c9a9, 810, 0) = 0x9017dc0
31986 cl_log(7, 0x804c90a, 0x804c9a9, 0x901d5e0, 0x901f9a8) = 0
31986 cl_log(6, 0x804c924, 0x804c97f, 0x901e8f0, 0x9017dc0) = 0
31986 cl_log(7, 0x804c1cc, 2, 0x804ca74, 0x804c954) = 0
31986 cl_log(7, 0x804c1ec, 2, 0x804ca74, 0x901e988) = 0
31986 cl_log(7, 0x804c215, 2, 0x804ca74, 0x901e8f0) = 0
31986 cl_log(7, 0x804c23e, 2, 0x804ca74, 0x9017dc0) = 0
31986 cl_log(7, 0x804c267, 2, 0x804ca74, 0) = 0
31986 create_xml_node(0, 0x804c994, 0x804c954, 0x901e8f0, 0x9017dc0) = 0x901e900
31986 crm_xml_add(0x901e900, 0x804c977, 0x804c4bf, 0x901e8f0, 0x9017dc0) = 0x901f298
31986 crm_xml_add(0x901e900, 0x804c3e6, 0x90192a0, 0x901e8f0, 0x9017dc0) = 0x901e9d8
31986 crm_xml_add(0x901e900, 0x804c3ea, 0x804c979, 0x901e8f0, 0x9017dc0) = 0x901f878
31986 crm_xml_add(0x901e900, 0x804c280, 0x901e8f0, 0x901e8f0, 0x9017dc0) = 0x901f8d8
31986 crm_xml_add(0x901e900, 0x804c2ea, 0, 0x901e8f0, 0x9017dc0) = 0
31986 crm_xml_add(0x901e900, 0x804c304, 0x901e988, 0x901e8f0, 0x9017dc0) = 0x901eb40
31986 crm_xml_add(0x901e900, 0x804c336, 0, 0x901e8f0, 0x9017dc0) = 0
31986 crm_xml_add(0x901e900, 0x804c843, 0x9017dc0, 0x901e8f0, 0x9017dc0) = 0x901ebd8
31986 crm_xml_add(0x901e900, 0x804c3f9, 0x9017dc0, 0x901e8f0, 0x9017dc0) = 0x901ec78
31986 update_attr(0x901a550, 0, 0x901e988, 0x9019500, 0) = 16
31986 safe_str_neq(0x9017dc0, 0, 0x901e988, 0x9019500, 0) = 1
31986 cl_log(6, 0x804c7e5, 0x804c9be, 16, 0x901e8f0) = 0
31986 malloc(8) = 0x90191d0
31986 memset(0x90191d0, '\000', 8) = 0x90191d0
31986 crm_strdup_fn(0x901e8f0, 0x804c16a, 0x804c9be, 723, 0x901e8f0) = 0x90191e0
31986 crm_strdup_fn(0x9017dc0, 0x804c16a, 0x804c9be, 725, 0x901e8f0) = 0x901a5a8
31986 send_cluster_message(0, 5, 0x901e900, 0, 0x9017dc0) = 1
31986 xmlDocGetRootElement(0x901f150, 5, 0x901e900, 0, 0x9017dc0) = 0x901e900
31986 xmlFreeDoc(0x901f150, 5, 0x901e900, 0, 0x9017dc0) = 105
31986 xmlDocGetRootElement(0x901ea78, 0x804caaa, 0x901c3b0, 0x804caaa, 0) = 0x901c3b0
31986 xmlFreeDoc(0x901ea78, 0x804caaa, 0x901c3b0, 0x804caaa, 0) = 105
31986 cl_log(7, 0x804c01c, 2, 0x804caaa, 1) = 0
31986 cl_log(7, 0x804c658, 0x804c9d3, 16, 0x90191e0) = 0
31986 g_hash_table_lookup(0x9018600, 0x90191e0, 0x804c9d3, 16, 0x90191e0 <unfinished ...>
31986 g_str_hash(0x90191e0, 0xbfd72a50, 0xbfd727f8, 0xa15de8, 0x90186c0) = 0xf5e80656
31986 g_str_equal(0x901e8f0, 0x90191e0, 0xbfd727f8, 0xa15de8, 0x90186c0) = 1
31986 <... g_hash_table_lookup resumed> ) = 0x901ae70
31986 free(NULL) = <void>
31986 crm_strdup_fn(0x901a5a8, 0x804c16a, 0x804c9d3, 655, 0x90191e0) = 0x901a5b8
31986 free(0x901a5a8) = <void>
31986 free(0x90191e0) = <void>
31986 free(0x90191d0) = <void>
31986 convert_ha_message(0, 0x90191d0, 0x804ca24, 0x90191d0, 0x90191d0) = 0x901d508
31986 crm_element_value(0x901d508, 0x804c3e6, 0x804ca24, 0x90191d0, 0x90191d0) = 0x901f8d0
31986 crm_element_value(0x901d508, 0x804c3ea, 0x804ca24, 0x90191d0, 0x90191d0) = 0x901f9a8
31986 crm_element_value(0x901d508, 0x804c3ef, 0x804ca24, 0x90191d0, 0x90191d0) = 0
31986 crm_element_value(0x901d508, 0x804c3f9, 0x804ca24, 0x90191d0, 0x90191d0) = 0x901ec68
31986 safe_str_neq(0x901f8d0, 0x90192a0, 0x804ca24, 0x90191d0, 0x90191d0) = 0
31986 cl_log(6, 0x804c440, 0x804ca36, 0x90191d0, 0x90191d0) = 0
31986 xmlDocGetRootElement(0x901f540, 0x804c440, 0x804ca36, 0x90191d0, 0x90191d0) = 0x901d508
31986 xmlFreeDoc(0x901f540, 0x804c440, 0x804ca36, 0x90191d0, 0x90191d0) = 105
31986 cl_log(7, 0x804bfb7, 2, 0x804caaa, 0) = 0
31986 cl_log(7, 0x804c01c, 2, 0x804caaa, 0) = 0
31986 cl_log(7, 0x804c040, 4, 0x804ca91, 0) = 0
31986 G_main_del_IPC_Channel(0x9018fc8, 0x804c040, 4, 0x804ca91, 0) = 1
31986 cl_log(7, 0x804c06c, 3, 0x804ca91, 0) = 0
31986 free(NULL) = <void>
31986 free(NULL) = <void>
31986 free(0x901c398) = <void>
31986 cl_log(7, 0x804c08c, 4, 0x804ca91, 0) = 0
31986 --- SIGTERM (Terminated) ---