Piotr Jasiukajtis wrote:
> Ashutosh Tripathi wrote:
>> You might have to dive in a bit deep here, unfortunately.
> Fortunately ;)

OK. That is what I was hoping to hear! :-) :-)

Looking at node1 CMM debug buffers, the following is somewhat surprising:

th ffffff00a9aa0040 tm 527231379: node_is_reachable(2,1231584610); current incn = 1231583261
th ffffff00a9aa0040 tm 527231379: node 2 is up; new incn = 1231584610
....
th ffffff00a9aa0040 tm 909605026: node_is_unreachable(2,1231584610,12000) called
th ffffff00a9aa0040 tm 909605026: node 2 is down

[Well, on second thought, it is surprising only because your node1 console logs did not capture this reconfiguration where node2 was declared dead. But maybe that did not actually show up on the console for whatever reason.]

The node_is_unreachable() call is an explicit call from the Path Manager into CMM indicating that it can no longer talk to the node in question. But WHY, given that it was indeed able to see node2 earlier? What exactly triggered this? Let us look at the path manager logs on node1.

th ffffff00aaf4c160 tm 590674061: Stale node 2 incarnation 1231584610
th ffffff00aaf4c160 tm 590674068: Stale node 2 incarnation 1231584610

Uh oh!! The logs are overrun with the stale incarnation number message. [Which is expected after node2 is kicked out of the membership.] Also, for some reason, the timestamps here only contain 5XXXX, not 9XXXX as some of the other buffers do. You would expect to see something around timestamp 909605026, when node2 is declared dead.

I guess you have to repeat this experiment and capture the CMM/path manager/heartbeat buffers AS SOON AS node2 is declared dead by node1, so that we have all the information in the buffers. It would help to get the buffers from node2 as well. You would have to forcibly panic the xVM guest domain and force it to take cores. I don't remember off the top of my head the xm command line for doing that.
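When you recapture the buffers, a quick way to check whether a given buffer actually covers the moment node2 was declared dead is to parse out the "tm" values and compare them against the timestamp of interest. Below is a minimal sketch in Python, not an existing tool. The line layout "th <thread> tm <timestamp>: <message>" is assumed from the excerpts above, the function names are made up for illustration, and the window is in whatever unit the "tm" field uses.

```python
import re

# Assumed entry layout, inferred from the excerpts quoted in this mail:
#   th <thread address> tm <timestamp>: <message>
ENTRY_RE = re.compile(r"th\s+([0-9a-f]+)\s+tm\s+(\d+):\s*(.*)")

# Sample lines copied from the node1 CMM and path manager buffers above.
CMM_SAMPLE = """\
th ffffff00a9aa0040 tm 527231379: node 2 is up; new incn = 1231584610
th ffffff00a9aa0040 tm 909605026: node_is_unreachable(2,1231584610,12000) called
th ffffff00a9aa0040 tm 909605026: node 2 is down
"""
PM_SAMPLE = """\
th ffffff00aaf4c160 tm 590674061: Stale node 2 incarnation 1231584610
th ffffff00aaf4c160 tm 590674068: Stale node 2 incarnation 1231584610
"""


def parse_entries(text):
    """Return (thread, timestamp, message) tuples; non-matching lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match:
            entries.append((match.group(1), int(match.group(2)), match.group(3)))
    return entries


def covers(entries, target_tm):
    """True if target_tm lies between the oldest and newest timestamp in the buffer."""
    if not entries:
        return False
    stamps = [tm for _, tm, _ in entries]
    return min(stamps) <= target_tm <= max(stamps)


def near(entries, target_tm, window):
    """Entries whose timestamp is within +/- window of target_tm."""
    return [e for e in entries if abs(e[1] - target_tm) <= window]


if __name__ == "__main__":
    dead_tm = 909605026  # when node1's CMM logged "node 2 is down"
    print("CMM buffer covers it:", covers(parse_entries(CMM_SAMPLE), dead_tm))
    print("PM buffer covers it: ", covers(parse_entries(PM_SAMPLE), dead_tm))
    for thread, tm, msg in near(parse_entries(CMM_SAMPLE), dead_tm, 1000):
        print(tm, msg)
```

On the sample lines, the CMM buffer spans 909605026 while the path manager buffer does not, which is exactly the gap described above.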
And Piotr: This time around, instead of just sending the buffer contents to the alias (or in ADDITION to sending them to the alias), let us see if you can take a close look at them yourself and do some preliminary analysis to further narrow down the area we need to focus on.

PS: Feel free to browse the source code for node_is_unreachable() and other such key functions in question. That would help in the analysis. Before you complain... I refer you back to your statement quoted at the beginning of this e-mail :-)

Happy Hacking!
-ashu