Dejan Muhamedagic wrote:
> Hi,
> 
> On Thu, Feb 21, 2008 at 09:00:35PM +0100, Johan Hoeke wrote:
>> Dejan Muhamedagic wrote:
>>> Hi,
>>>
>>> On Thu, Feb 21, 2008 at 04:09:19PM +0100, Johan Hoeke wrote:
>>>> Dejan Muhamedagic wrote:
>>>>> Hi,
>>>>>
>>>>> On Thu, Feb 21, 2008 at 01:26:12PM +0100, Johan Hoeke wrote:
>>>>>> Dejan Muhamedagic wrote:

<snip>

>> OK, I understand. I'll change from monitor on_fail=fence to stop
>> on_fail=fence and test, test, test.
> 
> on_fail=fence is default for stop operations as those failures
> are dangerous.

OK, good to know. Is this in the DTD? I looked for it just now but
didn't find it.

> 
>> I have to be super careful that the
>> SAN filesystem doesn't get corrupted again. That happened the other day
>> by accident when a wrong ipfilter config was pushed by mistake. The
>> heartbeat interface was filtered out, a split brain situation occurred
>> and the SAN filesystem was corrupted. Stonith didn't save us for
>> whatever reason.
> 
> You have to have a reliable stonith device. Do you think that
> on_fail=fence in the monitor op would have made the situation
> better?

No, probably not. That was just ignorance on my part.

> 
>> The application managers don't have much confidence in
>> heartbeat since then. :(
> 
> That's a shame.

I was overreacting. Tests have gone well since then. Confidence of the
application managers is back on the rise. I'm due to test the cluster
again this afternoon. We're going to pull out the heartbeat cable to
test and make sure the data doesn't get corrupted. Stonith / riloe has
worked well, except that one time apparently. I'll be sure to keep logs
and run hb_report if anything strange happens this time.


regards,
Johan


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
