On 3/23/2010 at 11:20 AM, Tony Gan <tonygan1...@gmail.com> wrote: 
> Hi Tim 
> Thanks for your advice. 
>  
> Now I have modified it and have something like this for my Filesystem resource 
> in the cib config: 
> params device="/dev/sdd1" directory="/temp/tmp1" fstype="ext3" \ 
>         op monitor interval="3s" timeout="10s" on_fail="fence" depth="20" 
>  
> The file system is started on node2. Then I physically unplug the only Fibre 
> Channel cable on this node. 
> My expectation is that once the file-system fails, this node will get 
> STONITHed, because I unplugged the FC cable on node2. 
>  
> However, 
> in the output of crm_mon -1 
> I still see the Filesystem started on node2: 
>     fs_res_sdd1      (ocf::heartbeat:Filesystem):    Started node2 
>  
> It looks like the OCF monitor script for the file-system is not running at 
> my assigned interval (every 3 seconds) at all. And I did not find any errors 
> in ha-log; the file-system is just shown as mounted. 
> But /var/log/messages is full of I/O errors for my mounted volume. 
> Do you have any ideas? 

Sorry, my bad, that should have been:

  op monitor ... OCF_CHECK_LEVEL="20"

The Filesystem RA doesn't log anything for successful monitor ops, so you
won't actually see any noise in the logs during normal operation.
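
Spelled out against the parameters you posted, the full monitor op would read
something like this (a sketch only - it keeps your interval and timeout, and
note that crm shell expects the failure action as "on-fail", with a hyphen):

  op monitor interval="3s" timeout="10s" on-fail="fence" OCF_CHECK_LEVEL="20"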

Once it's working to your satisfaction you may also want to experiment with
putting the filesystem under heavy read/write load - if the filesystem is
under severe enough load, 3 seconds may be too frequent a monitor interval,
and/or 10 seconds too short a timeout, because the monitor's I/O will block
behind the heavy load.  The last thing you want is fencing due to a monitor
op timeout when nothing is actually broken, but the only way to be sure is
to test.
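
A simple way to generate that kind of load for the test (just an illustration;
the path is the mount point from your config, and the write size is worth
adjusting to suit your disk) is to run dd against the mounted filesystem on
the active node and watch the cluster while it runs:

  # sustained write load on the mounted filesystem (path from your config)
  dd if=/dev/zero of=/temp/tmp1/loadtest.img bs=1M count=4096 oflag=direct &

  # meanwhile, check whether the monitor op is timing out
  crm_mon -1

If the monitor starts timing out (and fencing the node) purely because of the
load, lengthen the interval and/or timeout until it's stable.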

Regards,

Tim

>  
> Thanks 
>  
>  
> On Thu, Mar 18, 2010 at 2:51 AM, Tim Serong <tser...@novell.com> wrote: 
>  
> > On 3/17/2010 at 10:20 AM, Tony Gan <tonygan1...@gmail.com> wrote: 
> > > Hi, 
> > > I'm using heartbeat-3.0.0-33.2 and pacemaker-1.0.5-4.6 to create a 
> > > two-node cluster. Both nodes are connected to a shared storage device 
> > > over Fibre Channel through an FC switch. I am going to use the shared 
> > > storage as my file system resource in the cluster; I can mount the file 
> > > system successfully on both nodes. 
> > > 
> > > Now I am trying to trigger a fail-over after I unplug the FC cable from 
> > > my active node. 
> > > My expectation is that the file-system resource should fail and, after 
> > > the fail-count is reached, it should fail over and let my passive node 
> > > take the resource. 
> > > 
> > > However, 
> > > It looks like the OCF script for Filesystem, which is located in 
> > > /usr/lib/ocf/resource.d/heartbeat/Filesystem, did not handle this kind of 
> > > situation. After I unplugged my FC cable, all file-system resources were 
> > > still started and running fine. There are no additional logs in ha-log 
> > > or ha-debug. 
> > > 
> > > I can only find entries in the system message log, which I believe are 
> > > kernel errors about I/O failures on the file system (device sde is my 
> > > shared storage): 
> > > Mar 15 22:17:24 node2 kernel: end_request: I/O error, dev sde, sector 12727 
> > > Mar 15 22:17:24 node2 kernel: end_request: I/O error, dev sde, sector 12743 
> > > 
> > > My question is: is there a way I can monitor the connectivity of my 
> > > shared storage through heartbeat? 
> > > I'm not familiar with storage networks; what's the way to check the 
> > > connectivity? I was thinking I could do this in a way similar to pingd. 
> > 
> > The only way to be sure you've still got physical connectivity is to 
> > actually read and/or write data from/to the underlying block device, in 
> > direct mode, so that whatever you're reading won't be provided from some 
> > cache.  This will necessarily have some performance impact during any 
> > monitor op (in particular, if your filesystem is otherwise heavily loaded). 
> > 
> > Anyway...  Have a look at setting monitor depth=10 or depth=20 for your 
> > filesystem resource.  The default monitor op just checks if the filesystem 
> > is mounted.  Depth=10 will try to read 16 blocks off the target device, 
> > which will either fail or timeout if you're disconnected.  Depth=20 will 
> > actually try to write then read a status file with each monitor op. 
> > 
> > HTH, 
> > 
> > Tim 
> > 
> > 
> > -- 
> > Tim Serong <tser...@novell.com> 
> > Senior Clustering Engineer, OPS Engineering, Novell Inc. 
> > 
>  


_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
