I am assuming that the other node was alive and that you are using a private interface. That leaves only one case: o2net is timing out for no apparent reason.

If that is the case, then before we take the more intrusive route of updating the
module, it would be better to start with a tcpdump. Run the following on both nodes:

# tcpdump -i <DEVICE> -C 50 -W 3 -s 10000 -Sw /tmp/tcpdump.log -ttt 'port 7777' &

This will create three 50M files and use them as a rotating buffer. When the problem
happens next, email me the location of the last log file for both nodes.
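
To pick out the last log file once the timeout has happened, something like the following may help. It is only a sketch: latest_capture is a made-up helper name, and it assumes tcpdump appends a number to the -w name when -C and -W are used (so /tmp/tcpdump.log0, /tmp/tcpdump.log1, ...), which is worth checking on your version.

```shell
# Hypothetical helper: print the most recently modified file whose name
# starts with the given prefix (the path that was passed to tcpdump -w).
latest_capture() {
    ls -t "$1"* 2>/dev/null | head -n 1
}

# Example (run as root on each node after the next timeout):
#   f=$(latest_capture /tmp/tcpdump.log)
#   tcpdump -nn -ttt -r "$f" | tail -n 50
```

The file with the newest modification time is the one tcpdump was writing to when the connection was shut down, so that is the one to look at first.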

Andy Phillips wrote:
Hello,

     Well, we had the same problem again:

o2net: connection to node barney (num 0) at 172.16.6.10:7777
has been idle for 10 seconds, shutting it down.

kernel: (0,0):o2net_idle_timer:1309 here are some times that might help
debug the situation: (tmr 1154932284.14757 now 1154932294.13147 dr
1154932284.14717 adv 1154932284.14767:1154932284.14768 func (06aac8a1:1)
1154932279.15062:1154932279.15068)

    We upgraded to 1.2.3, and it died again almost immediately with the
same error. Our cron job, which touches a file every 3 seconds, did not
seem to make much difference. This is now quite a serious problem for
us.

Any suggestions as to how to take this forward? Sunil, what do you need from us to roll a custom debugging build? Can we run the custom build on node 2 and leave the existing build on
node 1, which is now production?

    Andy


Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
Aug  2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
times that might help debug the situation: (tmr 1154545576.798263 now
1154545586.796978 dr 1154545576.798238 adv
1154545576.798291:1154545576.798293 func (06aac8a1:1)
1154545566.800782:1154545566.800787)
Aug  2 19:06:27 fred kernel: o2net: no longer connected to node barney
(num 0) at 172.16.6.10:7777
Aug  2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
fencing this node because it is connected to
a half-quorum of 1 out of 2 nodes which doesn't include the lowest
active node 0
Aug  2 19:08:33 fred kernel: (25,7):o2hb_stop_all_regions:1908 ERROR:
stopping heartbeat on all active regions.
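
For reading the o2net_idle_timer lines above, a rough sketch that prints the gap between the "tmr" and "now" stamps. The name idle_gap is made up for illustration, and it assumes each stamp is printed as seconds.microseconds with the microseconds not zero-padded (so .14757 would mean 14757 microseconds); that is an assumption about the kernel's output format, not something confirmed here.

```shell
# Hypothetical helper: read one o2net_idle_timer message on stdin
# (it may be wrapped over several lines) and print now - tmr in seconds.
# Assumption: stamps are "<seconds>.<microseconds>", microseconds unpadded.
idle_gap() {
    tr '\n' ' ' | awk '{
        for (i = 1; i < NF; i++) {
            if ($i ~ /tmr$/)  split($(i + 1), t, "[.]")   # token is "(tmr"
            if ($i == "now")  split($(i + 1), n, "[.]")
        }
        printf "%.6f\n", (n[1] - t[1]) + (n[2] - t[2]) / 1000000
    }'
}

# Example: paste the message, then press Ctrl-D
#   idle_gap
```

Under that assumption both messages above come out at just under 10 seconds, which matches the "idle for 10 seconds" line.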
