Re: the killer node

Rafael Schloming Tue, 19 Feb 2013 08:43:27 -0800

That's almost the same stack trace I see with send when I comment out the
while (1). The only difference is that it's all under pn_messenger_send
rather than pn_messenger_recv.


This looks to me like the stack is getting corrupted: send is actually
your code, yet the trace appears to claim that proton is calling into
it, which it couldn't possibly do. I'm guessing the whole stack underneath
pn_connector_process (or above it in the trace below) is garbage. Can you
try running under valgrind to see if it spots where the corruption is
happening?
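
One alternative to corruption that may be worth ruling out first: the
arguments in frame #4 (0x6, a heap-looking pointer, 0x51, 0x4000) have the
shape of a socket send(fd, buf, len, flags) call, and 0x4000 is the value of
MSG_NOSIGNAL on Linux. If node.c defines non-static functions named send and
recv, the dynamic linker could be resolving the library's socket calls to
them. This is an assumption; nothing in the thread confirms it. A minimal
sketch of the check, assuming Linux, where looks_like_socket_send is a
hypothetical helper and its thresholds are arbitrary:

```c
#include <sys/socket.h>

/* Hypothetical helper, not part of proton or node.c.
 * Returns 1 if four raw argument values copied from a backtrace frame are
 * plausible as send(fd, buf, len, flags), and 0 otherwise.
 * Assumes Linux flag values (MSG_NOSIGNAL, MSG_DONTWAIT). */
int looks_like_socket_send(unsigned long long a0, unsigned long long a1,
                           unsigned long long a2, unsigned long long a3)
{
    /* File descriptors are small non-negative integers (arbitrary cap). */
    int small_fd = a0 < 1024ULL;
    /* The buffer must be a non-null pointer. */
    int has_buf = a1 != 0ULL;
    /* A single send is rarely empty or huge (arbitrary 1 MiB cap). */
    int sane_len = a2 > 0ULL && a2 < (1ULL << 20);
    /* Only common send flags may be set. */
    unsigned long long allowed =
        (unsigned long long)(MSG_NOSIGNAL | MSG_DONTWAIT);
    int known_flags = (a3 & ~allowed) == 0ULL;

    return small_fd && has_buf && sane_len && known_flags;
}
```

Calling looks_like_socket_send(0x6, 0x149a150, 0x51, 0x4000) with the frame
#4 values returns 1 on Linux, while the frame #10 values, which start with a
stack address, return 0. If that reading is right, renaming the send and recv
functions in node.c would be a cheap experiment.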

As an aside, you should probably also build with debug on, as that will make
it a little clearer what is going on.

--Rafael

On Tue, Feb 19, 2013 at 7:08 AM, Michael Goulish <mgoul...@redhat.com> wrote:

> Sorry, I meant to include that.
>
> Here is the stack trace from node A:
>
>
> #0  0x00007fbb74173de8 in vfprintf () from /lib64/libc.so.6
> #1  0x00007fbb74177abf in buffered_vfprintf () from /lib64/libc.so.6
> #2  0x00007fbb74172c1e in vfprintf () from /lib64/libc.so.6
> #3  0x00007fbb7417cd87 in fprintf () from /lib64/libc.so.6
> #4  0x0000000000400f40 in send (name=0x6 <Address 0x6 out of bounds>,
>     messenger=0x149a150, message=0x51,
>     addr=0x4000 <Address 0x4000 out of bounds>) at node.c:44
> #5  0x00007fbb7450f524 in pn_send () from /lib/libqpid-proton.so.1
> #6  0x00007fbb74510883 in pn_connector_process () from /lib/libqpid-proton.so.1
> #7  0x00007fbb7450d85a in pn_messenger_tsync () from /lib/libqpid-proton.so.1
> #8  0x00007fbb7450d961 in pn_messenger_sync () from /lib/libqpid-proton.so.1
> #9  0x00007fbb7450ef6d in pn_messenger_recv () from /lib/libqpid-proton.so.1
> #10 0x0000000000401079 in recv (name=0x7fff2f9a5363 "A", messenger=0x1493970,
>     message=0x148e010, addr=0x7fff2f9a4360 "amqp://~0.0.0.0:6666") at node.c:88
> #11 0x00000000004014e2 in main (argc=3, argv=0x7fff2f9a4888) at node.c:194
>
>
>
>
> If you like I can give you access to my machine.
>
>
>
>
>
>
> ----- Original Message -----
> From: "Rafael Schloming" <r...@alum.mit.edu>
> To: proton@qpid.apache.org
> Sent: Tuesday, February 19, 2013 9:33:29 AM
> Subject: Re: the killer node
>
> This doesn't happen for me. I see node B loop forever and never send
> anything which is what I would expect given the while (1) { sleep(...); }
> you have in there. What does your debugger say about where node A crashes?
>
> --Rafael
>
> On Tue, Feb 19, 2013 at 4:40 AM, Michael Goulish <mgoul...@redhat.com> wrote:
>
> >
> > Well, it looks like one of my nodes can kill the other one by doing a put.
> > No errors reported by either messenger before the fatality.
> >
> > I'd like to see if someone else can confirm this result,
> > and maybe see something that I am not seeing.
> >
> > Compile and run scripts are provided in the directory called "node".
> >
> >
> > I am testing this against unpatched 0.4 RC1 code.  (But the result was the
> > same with Ken's recent patch for infinite credit.)
> >
> >
> >   1. Two instances of one program are used.  Node A only receives,
> >      Node B only sends to it.
> >
> >   2. Start node A first, with the script "r1".
> >      It will go through its main loop, trying to receive
> >      and timing out, for as long as you like.
> >
> >
> >   3. Start node B, with script r2.
> >      It will pause after formatting its first message, and will
> >      then do a dramatic 5-second countdown.  Then it calls
> >      put  ( not send! )  and node *A* dies horribly, its core
> >      file spattering the hard disk.
> >
> >      Node B is unaware of the carnage it has caused, sedated
> >      by a sleep loop, tragically still expecting to call send
> >      and start talking to its partner, node A.
> >
> >
> > ( see attached -- if you dare. )
> >
> >
> >
> >
>
