On Tue, Apr 01, 2008 at 02:52:29PM +0200, Danny Sternkopf wrote:
> Hi,
> 
> Dejan Muhamedagic wrote:
> >Hi,
> >
> >On Wed, Mar 26, 2008 at 02:38:00PM +0100, Danny Sternkopf wrote:
> >>Hi,
> >>
> >>we have a simple config: (hav2)
> >>- 2 nodes <active-active>
> >>- 6 FILESYSTEM resources as one group + 1 Stonith resource on each node
> >>  (Use constraints to score them)
> >>- Default Resource stickiness is 0
> >>- Default Resource failover stickiness is 0
> >>
> >>As we have seen in the past, if a filesystem fails the whole
> >>group is moved to the other node and the failing node is stonith'ed
> >>because the filesystem could not be unmounted properly.
> >>
> >>But this filesystem could no longer be mounted on either node. So the
> >>group kept moving from one node to the other and the nodes got reset
> >>all the time.
> >
> >If it was not possible to mount the filesystem then how/why did
> >the cluster try to unmount it? Also, if the filesystem's not
> >mounted then the stop operation should've succeeded. Or did you
> >see different behaviour?
> >
> 
> The Filesystem was mounted fine. But then a problem occurred, let's 
> assume the device is gone due to a HW issue. So the monitor will 
> detect it and initiate a move to the other node. Umount fails on 
> the current node

So far I'm able to follow, and this is a typical situation for
a stonith reset.

> and mount fails on the new node so to say.

But if mount failed on the new node, i.e. if the filesystem
could not be mounted, then the stop operation, if invoked,
should have succeeded. This is the point which is not clear to me.

> In our case the device was still there, but the mount was hanging. The 
> operation timed out. Filesystem stop always failed, even if the 
> filesystem was mounted, maybe due to the hanging mount command.
> 
> I'll have to check the Filesystem script next time.
> 
> >>Start and Stop of the Filesystem always ended up with a Timeout
> >>(of 120 s).
> >>
> >>How can we handle that issue within HA? What happens to a resource
> >>which cannot run anymore? When will HA give up trying to run it?
> >
> >Depends on the start-failure-is-fatal crm_config parameter. If
> >it's set to true, the CRM should give up after the first failed
> >start operation. Of course, in case the machine is rebooted it
> >will try again.
> 
> Yes, exactly, it doesn't help.
> 
> Meanwhile we implemented a function in our stonith script to check when 
> the last stonith operation was. If it was less than 15 minutes ago, 
> we set the failing node to standby and then perform the stonith reset.

Somebody else recently had a similar suggestion, i.e. that the
node's uptime and cooperation should be taken into account on
stonith. It is worth investigating.
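
To make sure I read your stonith script change correctly, here is a
rough Python sketch of the decision as I understand it. The function
name, the action strings and the way the last reset time is passed in
are all hypothetical; how you store that timestamp and how you actually
put the node into standby is not something I know from your mail.

```python
# Sketch only: names and action strings are assumptions, not the real script.
STONITH_WINDOW_SECONDS = 15 * 60  # the 15 minute window mentioned above


def plan_stonith(last_stonith_time, now, window=STONITH_WINDOW_SECONDS):
    """Return the ordered list of actions for one stonith request.

    last_stonith_time: epoch seconds of the previous stonith operation,
                       or None if there was none yet.
    now:               current time in epoch seconds.
    """
    actions = []
    # A reset shortly after the previous one suggests a reset loop, so
    # the node is put into standby first to keep resources off it.
    if last_stonith_time is not None and now - last_stonith_time < window:
        actions.append("standby")
    # The reset itself is performed in either case.
    actions.append("reset")
    return actions
```

With this, a first reset just resets the node, and only a second reset
inside the window also sets the node to standby.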

You could also try to set the crm_config stonith-action option to
poweroff instead of reboot.
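
If you script your cluster setup, a small helper along these lines
could build the command for that. This is only a sketch: the helper name
is made up, and I am assuming the usual crm_attribute short options
(-t for the section, -n for the name, -v for the value), so please check
them against the crm_attribute help on your version before running it.

```python
# Sketch only: builds the argument list, it does not run anything.
def stonith_action_command(action="poweroff"):
    """Return an argv list that would set the stonith-action option."""
    # Only the two values discussed in this thread are accepted here.
    if action not in ("reboot", "poweroff"):
        raise ValueError("unsupported stonith-action: %r" % (action,))
    return ["crm_attribute", "-t", "crm_config",
            "-n", "stonith-action", "-v", action]
```

The returned list can be handed to subprocess.call() on a cluster node
once you are happy with the options.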

Thanks,

Dejan

> Best regards,
> 
> Danny
> -- 
> Danny Sternkopf http://www.nec.de/hpc       [EMAIL PROTECTED]
> HPCE Division  Germany phone: +49-711-68770-35 fax: +49-711-6877145
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> NEC Deutschland GmbH, Hansaallee 101, 40549 Düsseldorf
> Geschäftsführer Makoto Tsukakoshi
> Handelsregister Düsseldorf HRB 57941; VAT ID DE129424743
> 
> 
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

-- 
Dejan