OK, so it is normal behaviour that a process has no access to the files on OCFS2, at least for a few seconds, when there is a problem with some node in the cluster... Before I simulated the situation with network problems, I had a "real" problem: one node (let's say it was node5) lost its connection to all other 7 nodes; about 7 seconds later that node reconnected to 5 nodes but couldn't connect to the two remaining ones. All 7 nodes noticed a problem with only one node, node5. And I had this in my logs:
Jun 24 10:26:45 node5 kernel: [845664.767625] o2net: connection to node node0 (num 0) at 172.28.4.48:7777 has been idle for 30.0 seconds, shutting it down.
Jun 24 10:26:45 node5 kernel: [845664.767634] (0,5):o2net_idle_timer:1468 here are some times that might help debug the situation: (tmr 1245831975.367358 now 1245832005.363626 dr 1245831975.367342 adv 1245831975.367364:1245831975.367364 func (cd6c8a07:502) 1245765277.343131:1245765277.343132)
Jun 24 10:26:45 node5 kernel: [845664.767625] o2net: no longer connected to node node0 (num 0) at 172.28.4.48:7777
Jun 24 10:26:45 node5 kernel: [845665.013442] o2net: connection to node node2 (num 2) at 172.28.4.64:7777 has been idle for 30.0 seconds, shutting it down.
Jun 24 10:26:45 node5 kernel: [845665.013449] (0,1):o2net_idle_timer:1468 here are some times that might help debug the situation: (tmr 1245831975.603754 now 1245832005.603626 dr 1245831975.603749 adv 1245831975.603755:1245831975.603755 func (b0b3255d:500) 1245831909.670314:1245831909.670317)
Jun 24 10:26:45 node5 kernel: [845665.013463] o2net: no longer connected to node node2 (num 2) at 172.28.4.64:7777
Jun 24 10:26:45 node5 kernel: [845665.403172] o2net: connection to node node3 (num 3) at 172.28.4.65:7777 has been idle for 30.0 seconds, shutting it down.
Jun 24 10:26:45 node5 kernel: [845665.403181] (0,7):o2net_idle_timer:1468 here are some times that might help debug the situation: (tmr 1245831975.787772 now 1245832005.787627 dr 1245831975.787759 adv 1245831975.787773:1245831975.787773 func (b0b3255d:500) 1245831925.837397:1245831925.837399)
Jun 24 10:26:45 node5 kernel: [845665.415165] o2net: no longer connected to node node3 (num 3) at 172.28.4.65:7777
Jun 24 10:26:45 node5 kernel: [845666.094914] o2net: connection to node node7 (num 7) at 172.28.4.78:7777 has been idle for 30.0 seconds, shutting it down.
Jun 24 10:26:45 node5 kernel: [845666.094922] (0,6):o2net_idle_timer:1468 here are some times that might help debug the situation: (tmr 1245831975.915755 now 1245832005.915626 dr 1245831975.915743 adv 1245831975.915756:1245831975.915756 func (b0b3255d:502) 1245831943.949686:1245831943.949687)
Jun 24 10:26:45 node5 kernel: [845666.094930] o2net: connection to node node4 (num 4) at 172.28.4.66:7777 has been idle for 30.0 seconds, shutting it down.
Jun 24 10:26:45 node5 kernel: [845666.094935] (0,6):o2net_idle_timer:1468 here are some times that might help debug the situation: (tmr 1245831975.915758 now 1245832005.915642 dr 1245831975.915748 adv 1245831975.915759:1245831975.915759 func (b0b3255d:500) 1245831943.948920:1245831943.948923)
Jun 24 10:26:45 node5 kernel: [845666.108161] o2net: no longer connected to node node7 (num 7) at 172.28.4.78:7777
Jun 24 10:26:45 node5 kernel: [845666.108161] o2net: no longer connected to node node4 (num 4) at 172.28.4.66:7777
Jun 24 10:26:46 node5 kernel: [845666.348213] o2net: connection to node node6 (num 6) at 172.28.4.81:7777 has been idle for 30.0 seconds, shutting it down.
Jun 24 10:26:46 node5 kernel: [845666.348220] (0,7):o2net_idle_timer:1468 here are some times that might help debug the situation: (tmr 1245831976.123740 now 1245832006.123625 dr 1245831976.123732 adv 1245831976.123740:1245831976.123741 func (b0b3255d:505) 1245822335.389460:1245822335.389465)
Jun 24 10:26:46 node5 kernel: [845666.368210] o2net: no longer connected to node node6 (num 6) at 172.28.4.81:7777
Jun 24 10:26:47 node5 kernel: [845668.151772] o2net: connection to node node1 (num 1) at 172.28.4.56:7777 has been idle for 30.0 seconds, shutting it down.
Jun 24 10:26:47 node5 kernel: [845668.151780] (0,1):o2net_idle_timer:1468 here are some times that might help debug the situation: (tmr 1245831977.195745 now 1245832007.195625 dr 1245831977.195734 adv 1245831977.195746:1245831977.195746 func (cd6c8a07:502) 1245829979.42551:1245829979.42555)
Jun 24 10:26:47 node5 kernel: [845668.151795] o2net: no longer connected to node node1 (num 1) at 172.28.4.56:7777
Jun 24 10:26:54 node5 kernel: [845682.375877] o2net: connected to node node0 (num 0) at 172.28.4.48:7777
Jun 24 10:26:54 node5 kernel: [845683.287292] o2net: connected to node node4 (num 4) at 172.28.4.66:7777
Jun 24 10:26:56 node5 kernel: [845685.463900] o2net: connected to node node1 (num 1) at 172.28.4.56:7777
Jun 24 10:27:05 node5 kernel: [845701.592536] o2net: connected to node node2 (num 2) at 172.28.4.64:7777
Jun 24 10:27:05 node5 kernel: [845701.987283] o2net: connected to node node3 (num 3) at 172.28.4.65:7777
Jun 24 10:27:15 node5 kernel: [845722.566864] (5697,1):o2net_connect_expired:1629 ERROR: no connection established with node 7 after 30.0 seconds, giving up and returning errors.
Jun 24 10:27:16 node5 kernel: [845722.787106] (5697,1):o2net_connect_expired:1629 ERROR: no connection established with node 6 after 30.0 seconds, giving up and returning errors.

These messages saying that node5 couldn't connect to nodes 6 and 7 repeated in an endless loop until I tried to umount the devices and restart o2cb. One device I umounted successfully, the second one I couldn't umount, and my logs show:

Jun 24 11:43:01 node5 kernel: [855031.567140] (20663,7):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jun 24 11:43:01 node5 kernel: [855031.567140] (20663,7):dlm_send_remote_unlock_request:359 ERROR: status = -107

The problem is that I couldn't access the files on OCFS2 from any of the nodes until I rebooted them.

On Wed, 2009-06-24 at 13:25 -0700, Sunil Mushran wrote:
> The nodes are not frozen. The processes that are attempting to talk
> to the "disconnected" node are waiting for that node to reply, failing
> which, to die. The default timeout for the disk heartbeat is 60 secs.
>
> If that node simply died, the other nodes would have deemed the node
> dead after 60 secs, recovered it and carried on with life.
>
> But in this case, you disconnected that node from the cluster. Meaning
> that node has to first decide on its course of action. It kills itself
> but after the 60 sec timeout. After that, the other nodes have to wait
> another 60 secs before deeming it dead. That's your 120 secs.
>
> You can always cut the disk heartbeat by a half. But it depends a lot
> on your shared disk / io path. Some multipaths require the timeouts to
> be as high as 2 mins.
>
> Many users multipath the network setup too by setting up net bonding.
>
> Sunil
>
> Piotr Teodorowski wrote:
> > Hi,
> >
> > I've an OCFS2 cluster with 8 nodes and 2 devices mapped from Disk Storage
> > to these nodes (the disks are formatted, OCFS2 file systems created).
> >
> > I can start the cluster on each node and mount the devices - this works fine.
> >
> > Let's say my first node's name is host1, its node number is 0 and its IP
> > address 172.28.4.1; my second node's name is host2, node number 1 and IP
> > address 172.28.4.2; and I do nothing on the other nodes (but the device is
> > mounted on every node).
> >
> > When I run find /mount_point -type f on host1, it searches and displays
> > files.
> > Before the find ends, on host2 I remove the IP address from the interface
> > (the network connection is broken) and the find on host1 freezes.
> > This is the log on host1:
> >
> > Jun 24 12:36:33 host1 kernel: [ 1816.861233] o2net: connection to node
> > host2 (num 1) at 172.28.4.2:7777 has been idle for 30.0 seconds,
> > shutting it down.
> > Jun 24 12:36:33 host1 kernel: [ 1816.861242]
> > (0,5):o2net_idle_timer:1468 here are some times that might help debug
> > the situation: (tmr 1245839763.115691 now 1245839793.115494 dr
> > 1245839763.115676 adv 1245839763.115691:1245839763.115691 func
> > (cd6c8a07:500) 1245839758.695001:1245839758.695003)
> > Jun 24 12:36:33 host1 kernel: [ 1816.861260] o2net: no longer
> > connected to node host2 (num 1) at 172.28.4.2:7777
> >
> > A few minutes later the find can search again (I did not kill the process)
> > and I have this in my logs:
> > Jun 24 12:38:41 host1 kernel: [ 2011.612478]
> > (5935,0):o2dlm_eviction_cb:258 o2dlm has evicted node 1 from group
> > C9113043842642AD9694FDF0E9BE6E29
> > Jun 24 12:38:42 host1 kernel: [ 2013.370655]
> > (5950,5):dlm_get_lock_resource:839
> > C9113043842642AD9694FDF0E9BE6E29:$RECOVERY: at least one node (1) to
> > recover before lock mastery can begin
> > Jun 24 12:38:42 host1 kernel: [ 2013.370661]
> > (5950,5):dlm_get_lock_resource:873 C9113043842642AD9694FDF0E9BE6E29:
> > recovery map is not empty, but must master $RECOVERY lock now
> > Jun 24 12:38:42 host1 kernel: [ 2013.378061]
> > (5950,5):dlm_do_recovery:524 (5950) Node 0 is the Recovery Master for
> > the Dead Node 1 for Domain C9113043842642AD9694FDF0E9BE6E29
> >
> > Is it normal that I can't access the OCFS2 volume (from any of the
> > healthy nodes) for these few minutes? I don't need to write for 2 minutes,
> > but this kind of break for reads is unacceptable.
> >
> > I have the default settings for HB:
> > O2CB_ENABLED=true
> > O2CB_BOOTCLUSTER=ocfs2
> > O2CB_HEARTBEAT_THRESHOLD=31
> > O2CB_IDLE_TIMEOUT_MS=30000
> > O2CB_KEEPALIVE_DELAY_MS=2000
> > O2CB_RECONNECT_DELAY_MS=2000
> >
> > ocfs2-tools 1.4.1 (debian lenny)
> > kernel 2.6.26-2-amd64
> > +multipath
> > +bonding
> >
> > modinfo ocfs2
> > filename: /lib/modules/2.6.26-2-amd64/kernel/fs/ocfs2/ocfs2.ko
> > license: GPL
> > author: Oracle
> > version: 1.5.0
> > description: OCFS2 1.5.0
> > srcversion: B19D847BA86E871E41B7A64
> > depends: jbd,ocfs2_stackglue,ocfs2_nodemanager
> > vermagic: 2.6.26-2-amd64 SMP mod_unload modversions
> >
> > Any advice?
> >
> > Peter
>
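
For reference, a minimal sketch of what Sunil's "cut the disk heartbeat by a half" could look like with the Debian /etc/default/o2cb settings quoted above. The threshold-to-seconds mapping and the value 16 are my own reading of the o2cb defaults, not a tested recommendation; whatever you choose must be identical on every node and large enough for your multipath / IO path, and the exact restart sequence may differ on your setup:

# /etc/default/o2cb (sketch only -- adjust to your environment)
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
# Disk heartbeat timeout is roughly (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds,
# so the default 31 gives ~60 s and 16 would give ~30 s.
O2CB_HEARTBEAT_THRESHOLD=16
# Network idle timeout; this is the "idle for 30.0 seconds" seen in the logs above.
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000

# Apply on every node, with the ocfs2 volumes unmounted:
umount /mount_point
/etc/init.d/o2cb offline
/etc/init.d/o2cb unload
# edit /etc/default/o2cb (or run the interactive "/etc/init.d/o2cb configure"), then:
/etc/init.d/o2cb load
/etc/init.d/o2cb online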
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users