> I have an Oracle RAC test environment that consists of 2 nodes. The
> nodes are running Redhat ES4 update 2.
> These two nodes are using a firewire disk with OCFS2 filesystem as shared
> disk.
>
> Oracle Clusterware is installed perfectly fine on these two nodes.
>
> The problem now is, it seems the hosts always kill (fence) each other.
> For example, say currently node 1 is hung (fenced), and node 2 is
> active. If I cold restart node 1, node 2 in a few minutes will hang
> (fenced) with caps lock and scroll lock blinking continuously.
> Now if I cold restart node 2, then node 1 will hang (fenced) with caps
> lock and scroll lock blinking continuously.
>
> This is the output of syslog from one of the nodes:
>
> Dec 19 13:44:35 testdb02 kernel: (2398,0):o2net_set_nn_state:421 no longer connected to node testdb01 at 172.16.1.1:7777
> Dec 19 13:44:35 testdb02 kernel: (4468,0):ocfs2_replay_journal:1125 Recovering node 0 from slot 1 on device (8,17)
> Dec 19 13:44:35 testdb02 kernel: (4469,0):ocfs2_replay_journal:1125 Recovering node 0 from slot 1 on device (8,18)
>
> This is the one from testdb01:
>
> Dec 18 17:47:08 testdb01 kernel: (0,0):o2net_idle_timer:1330 connection to node testdb02 num 1 at 172.16.1.2:7777 has been idle for 10 seconds, shutting it down.
> Dec 18 17:47:08 testdb01 kernel: (0,0):o2net_idle_timer:1341 here are some times that might help debug the situation: (tmr 1134953218.565854 now 1134953228.564548 dr 1134953218.565842 adv 1134953218.565855:1134953218.565856 func (df59be0e:505) 1134953138.728170:1134953138.728179)
> Dec 18 17:47:08 testdb01 kernel: (2342,0):o2net_set_nn_state:421 no longer connected to node testdb02 at 172.16.1.2:7777
> Dec 18 17:47:17 testdb01 kernel: (5061,1):ocfs2_replay_journal:1125 Recovering node 1 from slot 0 on device (8,17)
> Dec 18 17:47:17 testdb01 kernel: (5062,0):ocfs2_replay_journal:1125 Recovering node 1 from slot 0 on device (8,18)
> Dec 18 17:47:18 testdb01 kernel: kjournald starting. Commit interval 5 seconds
> Dec 18 17:47:18 testdb01 kernel: kjournald starting. Commit interval 5 seconds
>
> Does anybody know what is going on?
>
> Thank You

I have had the same problem with nodes fencing each other, even with
faster disks (over fibre from a SAN).

As suggested by other people on this list, I have increased the heartbeat
threshold from the default of 7 to 30. This leads to an effective timeout
of (30 - 1) x 2 = 58 seconds. On SLES, this is set in /etc/sysconfig/o2cb
(not sure about RedHat):

    O2CB_HEARTBEAT_THRESHOLD=30

I have seen someone set this timeout as high as 10 minutes with external
disks (USB 2 and Firewire). Maybe someone closer to the development of
OCFS2 can shed some more light on this and what the caveats are.

HTH
-- 
mike
_______________________________________________
Ocfs2-users mailing list
http://oss.oracle.com/mailman/listinfo/ocfs2-users
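
For anyone picking a value for O2CB_HEARTBEAT_THRESHOLD, the arithmetic in the reply above, (threshold - 1) x 2 seconds, can be sketched in a few lines of Python. This is only an illustration of that formula: the function names are made up for this example, and the 2-second heartbeat interval is an assumption read off the "x 2" in the reply, not something checked against the OCFS2 source.

```python
# Hypothetical helpers illustrating the formula quoted in the reply:
#   effective timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds
# The 2-second interval is an assumption taken from that formula.

def effective_timeout_seconds(threshold: int, interval_seconds: int = 2) -> int:
    """Timeout in seconds implied by a given heartbeat threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * interval_seconds


def threshold_for_timeout(timeout_seconds: int, interval_seconds: int = 2) -> int:
    """Smallest threshold whose effective timeout is at least timeout_seconds."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    # Ceiling division without floats, then add the 1 that the formula subtracts.
    return -(-timeout_seconds // interval_seconds) + 1


if __name__ == "__main__":
    print(effective_timeout_seconds(30))  # the value used in the reply: 58
    print(effective_timeout_seconds(7))   # the default mentioned in the reply: 12
    print(threshold_for_timeout(600))     # a 10-minute timeout: 301
```

By the same formula, the default threshold of 7 works out to 12 seconds, and the 10-minute timeout mentioned for USB and Firewire disks would correspond to a threshold of 301.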
