Piotr Jasiukajtis wrote:
> Ashutosh Tripathi wrote:

>> You might have to dive in a bit deep here, unfortunately.

> Fortunately ;)

        OK. That is what I was hoping to hear! :-) :-)

Looking at node1 CMM debug buffers, the following is somewhat surprising:

th ffffff00a9aa0040 tm 527231379: node_is_reachable(2,1231584610); current incn = 1231583261
th ffffff00a9aa0040 tm 527231379: node 2 is up; new incn = 1231584610

....

th ffffff00a9aa0040 tm 909605026: node_is_unreachable(2,1231584610,12000) called
th ffffff00a9aa0040 tm 909605026: node 2 is down


[Well, on second thought, it is surprising only because your node1 console
logs did not capture this reconfiguration where node2 was declared dead.
But maybe that did not actually show up on console for whatever reason.]

The node_is_unreachable() call is an explicit call from the Path Manager
into CMM indicating that it can no longer talk to the node in question.

But WHY? Given that it was indeed able to see node2 earlier? What exactly
triggered this?

Let us look at path manager logs on node1.

th ffffff00aaf4c160 tm 590674061: Stale node 2 incarnation 1231584610
th ffffff00aaf4c160 tm 590674068: Stale node 2 incarnation 1231584610

Uh oh!! The logs are overrun with the stale incarnation number message.
[Which is expected after node2 is kicked out of the membership.]
Also, for some reason, the timestamps in this buffer all start with 5
(5XXXXXXXX), not 9 (9XXXXXXXX) as in some of the other buffers. You would
expect to see entries around timestamp 909605026, when node2 is declared
dead.
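
If it helps with the prelim analysis, here is a rough sketch of how one
could check whether a captured buffer actually spans the moment node2 was
declared dead. It assumes every buffer line has the shape
"th <thread> tm <timestamp>: <message>", as in the excerpts above; the
helper names parse_buffer() and covers() are made up for this example and
are not part of any cluster tooling.

```python
import re

# Assumed line shape, taken from the excerpts in this mail:
#   th <thread-id-hex> tm <timestamp>: <message>
LINE_RE = re.compile(r"^th\s+([0-9a-f]+)\s+tm\s+(\d+):\s*(.*)$")


def parse_buffer(text):
    """Return (thread, timestamp, message) tuples; skip non-matching lines."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            entries.append((m.group(1), int(m.group(2)), m.group(3)))
    return entries


def covers(entries, event_tm):
    """True if event_tm lies between the oldest and newest buffer entries."""
    if not entries:
        return False
    stamps = [tm for _, tm, _ in entries]
    return min(stamps) <= event_tm <= max(stamps)


# Sample lines copied from the node1 buffers quoted in this mail.
cmm = parse_buffer("""\
th ffffff00a9aa0040 tm 527231379: node 2 is up; new incn = 1231584610
th ffffff00a9aa0040 tm 909605026: node_is_unreachable(2,1231584610,12000) called
th ffffff00a9aa0040 tm 909605026: node 2 is down
""")
pm = parse_buffer("""\
th ffffff00aaf4c160 tm 590674061: Stale node 2 incarnation 1231584610
th ffffff00aaf4c160 tm 590674068: Stale node 2 incarnation 1231584610
""")

# Timestamp of the node_is_unreachable() call in the CMM buffer.
down_tm = next(tm for _, tm, msg in cmm
               if msg.startswith("node_is_unreachable"))

print("CMM buffer covers the event:", covers(cmm, down_tm))
print("Path manager buffer covers the event:", covers(pm, down_tm))
```

Running the same check over the full buffers from a fresh capture would
show right away whether the path manager entries around the failure are
present.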

I guess you have to repeat this experiment and capture CMM/path
manager/heartbeat buffers AS SOON AS node2 is declared dead by node1, so
that we have all the information in the buffers.

It would help to get the buffers from node2 as well. You would have to forcibly
panic the xVM guest domain and force it to take cores. I don't remember off the
top of my head the xm command line for doing that.

And Piotr: This time around, instead of just sending the buffer contents to the
alias (or in ADDITION to sending the buffer contents to the alias), let us see
if you can take a close look at them yourself and do some prelim analysis to
further narrow the area we need to focus upon?

PS: Feel free to do source code browsing on node_is_unreachable() and other
such key functions in question. That would help in the analysis.

Before you complain... I refer you back to your statement quoted at the
beginning of this e-mail :-)

Happy Hacking!
-ashu

