It went pretty well now. Thanks a lot for your help. I have set the interval to 10s and the timeout to 20s. It may also be worth mentioning that in resource-agents 1.0-32.2, the depth levels of the OCF Filesystem agent were not implemented, but they are implemented in the latest 1.0.2 resource agents. It might be a legacy version problem.
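The settings above (together with the OCF_CHECK_LEVEL fix Tim gives further down) would look roughly like this in the crm shell. This is a sketch only, reusing the resource ID, device, and mount point from this thread; adjust for your own setup and verify against your crmsh version:

```shell
# Hypothetical crm shell snippet; resource ID, device, and mount point
# are the ones used in this thread.  on-fail=fence requires working
# STONITH; OCF_CHECK_LEVEL=20 enables the write/read status-file check.
crm configure primitive fs_res_sdd1 ocf:heartbeat:Filesystem \
    params device="/dev/sdd1" directory="/temp/tmp1" fstype="ext3" \
    op monitor interval="10s" timeout="20s" on-fail="fence" \
       OCF_CHECK_LEVEL="20"
```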
Regards

On Mon, Mar 22, 2010 at 10:20 PM, Tim Serong <[email protected]> wrote:
> On 3/23/2010 at 11:20 AM, Tony Gan <[email protected]> wrote:
> > Hi Tim,
> > Thanks for your advice.
> >
> > I have now modified my Filesystem resource and have something like
> > this in my CIB config:
> > params device="/dev/sdd1" directory="/temp/tmp1" fstype="ext3" \
> >   op monitor interval="3s" timeout="10s" on_fail="fence" depth="20"
> >
> > The file system is started on node2. Then I physically unplug the
> > only Fibre Channel cable on this node.
> > My expectation is that once the file system fails, this node will
> > get STONITHed, because I unplugged the FC cable on node2.
> >
> > However, in the output of crm_mon -1
> > I still see this Filesystem started on node2:
> > fs_res_sdd1 (ocf::heartbeat:Filesystem): Started node2
> >
> > It looks like the OCF monitoring script for the file system is not
> > running at my assigned interval (every 3 seconds) at all, and I did
> > not find any error log in ha-log; the file system is just mounted.
> > But /var/log/messages is full of I/O errors for my mounted volume.
> > Do you have any ideas?
>
> Sorry, my bad, that should have been:
>
>   op monitor ... OCF_CHECK_LEVEL="20"
>
> The Filesystem RA doesn't log anything for successful monitor ops, so
> you won't actually see any noise in the logs during normal operation.
>
> Once it's working to your satisfaction you may also want to experiment
> with putting the filesystem under heavy read/write load - it's
> possible that if the filesystem is under severe enough load, 3 seconds
> may be too frequent a monitor interval, and/or 10 seconds too short a
> timeout, because the I/O will be blocking due to the heavy load. The
> last thing you want is fencing due to a monitor op timeout when
> nothing is actually broken, but the only way to be sure is to test.
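For readers following along: the OCF_CHECK_LEVEL="20" check essentially writes a status file on the mounted filesystem and reads it back (level 10 instead reads blocks from the device, bypassing the cache). A minimal standalone sketch of that write-then-read idea, not the agent's actual code, using a temporary directory as a stand-in for the real mount point:

```shell
# Sketch of a depth-20 style filesystem health check: write a status
# file, then read it back.  On a real cluster the agent runs a check
# like this against the resource's mount point (e.g. /temp/tmp1); if
# the write or read hangs on a dead FC path, the monitor op times out
# and the cluster can fence or fail over.  A temp dir is used here as
# a stand-in so the sketch is runnable anywhere.
MOUNTPOINT=$(mktemp -d)
STATUSFILE="$MOUNTPOINT/.fs_monitor_status"

if echo "ok $$" > "$STATUSFILE" && grep -q '^ok ' "$STATUSFILE"; then
    RESULT="OK"
else
    RESULT="FAILED"
fi
echo "monitor: $RESULT"
rm -rf "$MOUNTPOINT"
```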
> Regards,
> Tim
>
> > Thanks
> >
> > On Thu, Mar 18, 2010 at 2:51 AM, Tim Serong <[email protected]> wrote:
> > > On 3/17/2010 at 10:20 AM, Tony Gan <[email protected]> wrote:
> > > > Hi,
> > > > I'm using heartbeat-3.0.0-33.2 and pacemaker-1.0.5-4.6 to create
> > > > a two-node cluster. Both nodes are connected to a shared storage
> > > > device via Fibre Channel through an FC switch. I am going to use
> > > > the shared storage as my file system resource in the cluster; I
> > > > can mount the file system successfully on both nodes.
> > > >
> > > > Now I am trying to trigger a fail-over after I unplug the FC
> > > > cable from my active node.
> > > > My expectation is that the file-system resource should fail and,
> > > > after a failure count, it should fail over and let my passive
> > > > node take the resource.
> > > >
> > > > However, it looks like the OCF script for Filesystem does not
> > > > handle this kind of situation. It is located in
> > > > /usr/lib/ocf/resource.d/heartbeat/Filesystem
> > > > After I unplugged my FC cable, all file-system resources were
> > > > still started and running fine. There were no additional logs in
> > > > ha-log or ha-debug.
> > > >
> > > > I can only find logs in the system message log, which I believe
> > > > are kernel error logs about an I/O error on the file system
> > > > (device sde is my shared storage):
> > > > Mar 15 22:17:24 node2 kernel: end_request: I/O error, dev sde, sector 12727
> > > > Mar 15 22:17:24 node2 kernel: end_request: I/O error, dev sde, sector 12743
> > > >
> > > > My question is: is there a way I can monitor the connectivity of
> > > > my shared storage through heartbeat?
> > > > I'm not familiar with storage networks; what's the way to check
> > > > the connectivity? I was thinking I could do this in a way
> > > > similar to pingd.
> > > The only way to be sure you've still got physical connectivity is
> > > to actually read and/or write data from/to the underlying block
> > > device, in direct mode, so that whatever you're reading won't be
> > > provided from some cache. This will necessarily have some
> > > performance impact during any monitor op (in particular, if your
> > > filesystem is otherwise heavily loaded).
> > >
> > > Anyway... Have a look at setting monitor depth=10 or depth=20 for
> > > your filesystem resource. The default monitor op just checks if
> > > the filesystem is mounted. Depth=10 will try to read 16 blocks off
> > > the target device, which will either fail or time out if you're
> > > disconnected. Depth=20 will actually try to write and then read a
> > > status file with each monitor op.
> > >
> > > HTH,
> > >
> > > Tim
> > >
> > > --
> > > Tim Serong <[email protected]>
> > > Senior Clustering Engineer, OPS Engineering, Novell Inc.
> > >
> > > _______________________________________________
> > > Linux-HA mailing list
> > > [email protected]
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems
