I attach the traffic from the interconnect on node1:

node1 at root# snoop -v -d xnf1 'host 172.16.0.130' | tee nodes.log
Piotr Jasiukajtis writes:
> I tried the same with one interconnect. Still doesn't work.
> Btw, both physical NICs and interconnects are connected to the one
> physical switch (without VLANs).
> Both physical nodes have only one physical NIC connected to the switch.
>
>
> Ashutosh Tripathi writes:
>> Piotr Jasiukajtis wrote:
>>> Ashutosh Tripathi writes:
>>>> You might have to dive in a bit deep here, unfortunately.
>>> Fortunately ;)
>> OK. That is what I was hoping to hear! :-) :-)
>>
>> Looking at node1 CMM debug buffers, the following is somewhat surprising:
>>
>> th ffffff00a9aa0040 tm 527231379: node_is_reachable(2,1231584610);
>> current incn = 1231583261
>> th ffffff00a9aa0040 tm 527231379: node 2 is up; new incn = 1231584610
>>
>> ....
>>
>> th ffffff00a9aa0040 tm 909605026:
>> node_is_unreachable(2,1231584610,12000) called
>> th ffffff00a9aa0040 tm 909605026: node 2 is down
>>
>>
>> [Well, on second thought, it is surprising only because your node1
>> console logs did not capture this reconfiguration where node2 was
>> declared dead. But maybe that did not actually show up on the console
>> for whatever reason.]
>>
>> The node_is_unreachable() call is an explicit call from the Path Manager
>> into the CMM indicating that it can no longer talk to the node in
>> question.
>>
>> But WHY, given that it was indeed able to see node2 earlier? What
>> exactly triggered this?
>>
>> Let us look at the path manager logs on node1.
>>
>> th ffffff00aaf4c160 tm 590674061: Stale node 2 incarnation 1231584610
>> th ffffff00aaf4c160 tm 590674068: Stale node 2 incarnation 1231584610
>>
>> Uh oh!! The logs are overrun with the stale incarnation number message.
>> [Which is expected after node2 is kicked out of the membership.]
>> Also, for some reason, the timestamps only contain 5XXXX, not 9XXX as
>> some of the other buffers do. You would expect to see something
>> around timestamp 909605026, when node2 is declared dead.
>>
>> I guess you have to repeat this experiment and capture CMM/path
>> manager/heartbeat buffers AS SOON AS node2 is declared dead by node1,
>> so that we have all the information in the buffers.
>>
>> It would help to get the buffers from node2 as well. You would have to
>> forcibly panic the xVM guest domain and force it to take cores. I don't
>> remember off the top of my head the xm command line for doing that.
>>
>> And Piotr: this time around, instead of just sending the buffer
>> contents to the alias (or in ADDITION to sending the buffer contents
>> to the alias), let us see if you can take a close look at them
>> yourself and do some preliminary analysis to further narrow the area
>> we need to focus upon?
>>
>> PS: Feel free to do source code browsing on node_is_unreachable() and
>> other such key functions in question. That would help in the analysis.
>>
>> Before you complain... I refer you back to your statement quoted at the
>> beginning of this e-mail :-)
>>
>> Happy Hacking!
>> -ashu
>>

-- 
Regards,
Piotr Jasiukajtis | estibi | SCA OS0072
http://estseg.blogspot.com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nodes.log.bz2
Type: application/x-bzip
Size: 54179 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/ha-clusters-discuss/attachments/20090120/0624c3b1/attachment.bin>
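For readers following the "Stale node 2 incarnation" flood in the path manager
logs above, here is a toy Python model of how incarnation-based staleness
filtering generally works. This is NOT the Sun Cluster source: the class and
method names (PathManager, accept_message) are invented for illustration, and
the logic is a simplified guess at the mechanism. The idea it sketches: each
node boot gets an incarnation number; once the CMM declares a node dead, that
incarnation is dropped from the membership, and any further messages stamped
with it are rejected as stale until the node rejoins with a new incarnation.

```python
class PathManager:
    """Toy model of incarnation-based message filtering (hypothetical names)."""

    def __init__(self):
        # node id -> incarnation number currently in the membership
        self.valid_incn = {}

    def node_is_reachable(self, node, incn):
        # Node joins (or rejoins) the membership with this incarnation.
        self.valid_incn[node] = incn

    def node_is_unreachable(self, node, incn, timeout_ms):
        # Node declared down after timeout_ms without heartbeats:
        # its incarnation is no longer part of the membership.
        if self.valid_incn.get(node) == incn:
            del self.valid_incn[node]

    def accept_message(self, node, incn):
        # A message from an incarnation outside the membership is stale;
        # the real code would log "Stale node N incarnation INCN" here.
        return self.valid_incn.get(node) == incn


pm = PathManager()
pm.node_is_reachable(2, 1231584610)        # node 2 up, as in the CMM buffer
print(pm.accept_message(2, 1231584610))    # True: incarnation is current
pm.node_is_unreachable(2, 1231584610, 12000)   # node 2 declared down
print(pm.accept_message(2, 1231584610))    # False: now logged as stale
```

This matches the pattern in the buffers: the stale messages on node1 carry the
same incarnation (1231584610) that was valid earlier, and they only start
after node 2 is kicked out of the membership.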