Kernel version, ocfs2 version?

$ uname -a
$ modinfo ocfs2
$ rpm -qa | grep ocfs2
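If you want to grab everything in one pass, a sketch like the one below should cover it. It assumes an RPM-based distro with a stock ocfs2-tools install, so the /etc/sysconfig/o2cb path, the variable names in it, and the mounted.ocfs2 utility are assumptions that may differ on your setup. The cluster.conf and O2CB timeout settings are worth including because of the 10-second idle shutdown in your log.

$ uname -a                        # running kernel
$ modinfo ocfs2 | grep -i version # ocfs2 module version
$ rpm -qa | grep -i ocfs2         # ocfs2-tools and related packages
$ cat /etc/ocfs2/cluster.conf     # cluster/node layout
$ cat /etc/sysconfig/o2cb         # O2CB_HEARTBEAT_THRESHOLD, O2CB_IDLE_TIMEOUT_MS, ... (names vary by version)
$ mounted.ocfs2 -f                # which nodes currently have the volume mounted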
Kees Hoekzema wrote:
> Hello List,
>
> At the moment I'm running a 7-node ocfs2 cluster on a Dell MD3000i (iSCSI)
> NAS. This cluster has run fine for well over a year now, but recently one of
> the older and more unstable servers in the cluster has started to fail
> occasionally.
>
> While it is not a big problem that this particular server reboots, it is,
> however, a problem that when it does, the whole cluster becomes unusable
> until that node reboots and returns.
>
> Today we had another crash on the server. The other nodes logged it like
> this in their dmesg output:
>
> May 27 16:45:03 aphaea kernel:
> o2net: connection to node achelois (num 5) at 10.0.1.24:7777 has been idle
> for 10.0 seconds, shutting it down.
> (0,3):o2net_idle_timer:1468 here are some times that might help debug the
> situation: (tmr 1243435493.522086 now 1243435503.520354 dr 1243435493.522080
> adv 1243435493.522090:1243435493.522091 func (6169a8d1:502)
> 1243435148.2972:1243435148.2999)
> o2net: no longer connected to node achelois (num 5) at 10.0.1.24:7777
> (3762,1):dlm_do_master_request:1335 ERROR: link to 5 went down!
> (3762,1):dlm_get_lock_resource:912 ERROR: status = -112
> (5196,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> (5196,3):dlm_get_lock_resource:912 ERROR: status = -107
> (735,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> (735,3):dlm_get_lock_resource:912 ERROR: status = -107
> (21573,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> (21573,3):dlm_get_lock_resource:912 ERROR: status = -107
> (2825,3):o2net_connect_expired:1629 ERROR: no connection established with
> node 5 after 10.0 seconds, giving up and returning errors.
> (1916,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> (1916,3):dlm_get_lock_resource:912 ERROR: status = -107
> ..
> [and a lot more similar errors]
> ..
> May 27 17:14:45 aphaea kernel: (2825,3):o2dlm_eviction_cb:258 o2dlm has
> evicted node 5 from group 20AB0E216A25479A986F8FDFE574C640
>
> The faulty node was completely frozen, so it most likely never even got the
> kernel panic from ocfs2 that would have rebooted it.
>
> After we rebooted the node, the cluster became available again. However, the
> failure still blocked the other six servers from accessing the shared
> storage for almost 30 minutes.
>
> Is there a way to 'evict' a node faster and continue normal read/write
> operations without it?
> Or is it at least possible for read operations to continue without being
> locked out as well?
>
> Tia,
> Kees Hoekzema
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
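On the question of evicting a dead node faster: the "idle for 10.0 seconds" message is o2net's network idle timeout, and how long the surviving nodes wait before declaring a node dead is governed by the O2CB cluster timeouts (network idle, keepalive, reconnect delay, and the disk heartbeat dead threshold). The commands below are only a sketch of where to inspect them; they assume a reasonably recent ocfs2-tools, a cluster actually named "ocfs2", and configfs mounted at /sys/kernel/config. Prompts and attribute names vary by version, and any change has to be applied identically on every node with the cluster stack offline.

$ service o2cb configure
  # interactive; newer ocfs2-tools prompt for heartbeat dead threshold,
  # idle timeout, keepalive delay and reconnect delay
$ cat /sys/kernel/config/cluster/ocfs2/idle_timeout_ms
  # current o2net idle timeout (the 10.0-second value in your log)
$ cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold
  # disk heartbeat iterations before a node is declared dead

Whether shorter timeouts would also shrink the 30-minute lockout you saw depends on why DLM recovery took that long, which is where the version information requested above should help.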