Hi,

On Sun, Jul 05, 2015 at 09:13:56PM +0500, Muhammad Sharfuddin wrote:
> SLES 11 SP3 + online updates (pacemaker-1.1.11-0.8.11.70,
> openais-1.1.4-5.22.1.7)
>
> It is a dual-primary DRBD cluster which mounts a filesystem resource
> on both cluster nodes simultaneously (the filesystem type is OCFS2).
>
> Whenever one of the nodes goes down, the filesystem (/sharedata)
> becomes inaccessible for exactly 35 seconds on the other
> (surviving/online) node, and then becomes available again on the
> online node.
>
> Please help me understand why the node which survives or remains
> online is unable to access the filesystem resource (/sharedata) for 35
> seconds, and how I can fix the cluster so that the filesystem remains
> accessible on the surviving node without any interruption/delay (in
> my case about 35 seconds).
>
> By "inaccessible" I mean that running "ls -l /sharedata" or
> "df /sharedata" produces no output and does not return the
> prompt on the online node for exactly 35 seconds once the other
> node goes offline.
>
> E.g. "node1" went offline at around 01:37:15, and the /sharedata
> filesystem was then inaccessible between 01:37:35 and 01:38:18
> on the online node, i.e. "node2".
Before the failing node gets fenced you won't be able to use the OCFS2
filesystem. In this case, the fencing operation takes 40 seconds:

> [...]
> Jul  5 01:37:35 node2 sbd: [6197]: info: Writing reset to node slot node1
> Jul  5 01:37:35 node2 sbd: [6197]: info: Messaging delay: 40
> Jul  5 01:38:15 node2 sbd: [6197]: info: reset successfully delivered to node1
> Jul  5 01:38:15 node2 sbd: [6196]: info: Message successfully delivered.
> [...]

You may want to reduce that sbd timeout.

Thanks,

Dejan

_______________________________________________
Linux-HA mailing list is closing down.
Please subscribe to us...@clusterlabs.org instead.
http://clusterlabs.org/mailman/listinfo/users
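For reference, roughly how that would look (a sketch only; /dev/sdX1 stands in
for your actual sbd partition, and the timeout values are examples you should
pick to suit your watchdog and storage, not recommendations):

```shell
# Show the timeouts currently stored in the sbd device header.
# "Timeout (msgwait)" is the "Messaging delay: 40" you see in the log.
sbd -d /dev/sdX1 dump

# Re-create the device header with smaller timeouts
# (-1 = watchdog timeout, -4 = msgwait; msgwait should be at least
# twice the watchdog timeout). WARNING: this rewrites the sbd header,
# so do it with the cluster stopped on all nodes.
sbd -d /dev/sdX1 -1 10 -4 20 create

# Verify the new values took effect.
sbd -d /dev/sdX1 dump
```

After changing msgwait, remember to adjust stonith-timeout in the CIB
accordingly (it must stay larger than msgwait), and restart the cluster
stack on all nodes so sbd re-reads the header.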