I attach the traffic from the interconnect on node1:

node1 at root# snoop -v -d xnf1 'host 172.16.0.130' | tee nodes.log
Piotr Jasiukajtis writes:
> I tried the same with one interconnect. Still doesn't work.
> Btw, both physical NICs and interconnects are connected to the one
> physical switch (without VLANs).
> Both physical nodes have only one physical NIC connected to the switch.
>
>
> Ashutosh Tripathi writes:
>> Piotr Jasiukajtis wrote:
>>> Ashutosh Tripathi writes:
>>>> You might have to dive in a bit deep here, unfortunately.
>>> Fortunately ;)
>> OK. That is what I was hoping to hear! :-) :-)
>>
>> Looking at node1 CMM debug buffers, the following is somewhat surprising:
>>
>> th ffffff00a9aa0040 tm 527231379: node_is_reachable(2,1231584610);
>> current incn = 1231583261
>> th ffffff00a9aa0040 tm 527231379: node 2 is up; new incn = 1231584610
>>
>> ....
>>
>> th ffffff00a9aa0040 tm 909605026:
>> node_is_unreachable(2,1231584610,12000) called
>> th ffffff00a9aa0040 tm 909605026: node 2 is down
>>
>>
>> [Well, on second thought, it is surprising only because your node1
>> console logs did not capture this reconfiguration where node2 was
>> declared dead. But maybe that did not actually show up on the console
>> for whatever reason.]
>>
>> The node_is_unreachable() call is an explicit call from the Path Manager
>> into the CMM indicating that it can no longer talk to the node in
>> question.
>>
>> But WHY, given that it was indeed able to see node2 earlier? What
>> exactly triggered this?
>>
>> Let us look at the path manager logs on node1.
>>
>> th ffffff00aaf4c160 tm 590674061: Stale node 2 incarnation 1231584610
>> th ffffff00aaf4c160 tm 590674068: Stale node 2 incarnation 1231584610
>>
>> Uh oh!! The logs are overrun with the stale incarnation number message.
>> [Which is expected after node2 is kicked out of the membership.]
>> Also, for some reason, the timestamps only contain 5XXXX, not 9XXX as
>> some of the other buffers do. You would expect to see something
>> around timestamp 909605026, when node2 is declared dead.
>>
>> I guess you have to repeat this experiment and capture CMM/path
>> manager/heartbeat buffers AS SOON AS node2 is declared dead by node1,
>> so that we have all the information in the buffers.
>>
>> It would help to get the buffers from node2 as well. You would have to
>> forcibly panic the xVM guest domain and force it to take cores. I don't
>> remember off the top of my head the xm command line for doing that.
>>
>> And Piotr: this time around, instead of just sending the buffer
>> contents to the alias (or in ADDITION to sending the buffer contents
>> to the alias), let us see if you can take a close look at them
>> yourself and do some preliminary analysis to further narrow the area
>> we need to focus upon?
>>
>> PS: Feel free to do source code browsing on node_is_unreachable() and
>> other such key functions in question. That would help in the analysis.
>>
>> Before you complain... I refer you back to your statement quoted at the
>> beginning of this e-mail :-)
>>
>> Happy Hacking!
>> -ashu
>>

-- 
Regards,
Piotr Jasiukajtis | estibi | SCA OS0072
http://estseg.blogspot.com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nodes.log.bz2
Type: application/x-bzip
Size: 54179 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/ha-clusters-discuss/attachments/20090120/0624c3b1/attachment.bin>
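For readers following the "Stale node 2 incarnation" flood in the path manager
logs above, here is a toy Python model of how incarnation-based staleness
filtering generally works. This is NOT the Sun Cluster source: the class and
method names (PathManager, accept_message) are invented for illustration, and
the logic is a simplified guess at the mechanism. The idea it sketches: each
node boot gets an incarnation number; once the CMM declares a node dead, that
incarnation is dropped from the membership, and any further messages stamped
with it are rejected as stale until the node rejoins with a new incarnation.

```python
class PathManager:
    """Toy model of incarnation-based message filtering (hypothetical names)."""

    def __init__(self):
        # node id -> incarnation number currently in the membership
        self.valid_incn = {}

    def node_is_reachable(self, node, incn):
        # Node joins (or rejoins) the membership with this incarnation.
        self.valid_incn[node] = incn

    def node_is_unreachable(self, node, incn, timeout_ms):
        # Node declared down after timeout_ms without heartbeats:
        # its incarnation is no longer part of the membership.
        if self.valid_incn.get(node) == incn:
            del self.valid_incn[node]

    def accept_message(self, node, incn):
        # A message from an incarnation outside the membership is stale;
        # the real code would log "Stale node N incarnation INCN" here.
        return self.valid_incn.get(node) == incn


pm = PathManager()
pm.node_is_reachable(2, 1231584610)        # node 2 up, as in the CMM buffer
print(pm.accept_message(2, 1231584610))    # True: incarnation is current
pm.node_is_unreachable(2, 1231584610, 12000)   # node 2 declared down
print(pm.accept_message(2, 1231584610))    # False: now logged as stale
```

This matches the pattern in the buffers: the stale messages on node1 carry the
same incarnation (1231584610) that was valid earlier, and they only start
after node 2 is kicked out of the membership.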