Hi,

On Thu, Nov 04, 2010 at 11:06:48AM -0300, mike wrote:
> Looking for a more experienced person who can explain this issue we had 
> last night.
> 
> Our backups kicked in during the night at 1AM. At 1:01AM, our mysql 
> cluster had issues. Specifically I can see in crm_mon where the cluster 
> has it as failed due to an "unknown exec error". Looking at the 
> performance of the node, I can see where wait on I/O went through the 
> roof at 1AM when the tsm backups kicked in. I can see where this caused 
> heartbeat issues because mysql was late checking its instances - it 
> generally takes a few seconds but in this case it took 3 minutes. Of 
> course this is all due to the extremely high wait on I/O but I am 
> curious - why didn't the cluster fail over? Why put MySQL in an 
> unmanaged state and simply report an "unknown exec error"?

Can't say without looking at the logs and the PE (policy engine)
files. One possible explanation is that the resource was, for
whatever reason, not allowed to run on the other node: either an
earlier failure there which hadn't expired, or a negative location
constraint. Alternatively, the fail count may have reached the
migration threshold (if one is defined).
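To make those reasons concrete, here is a rough sketch in Python of how they can each rule out a node. It is a hypothetical, heavily simplified model, not the Pacemaker policy engine. The names NodeState, eligible_nodes, migration_threshold and failure_timeout are made up for illustration (loosely modelled on the migration-threshold and failure-timeout resource meta attributes), and treating an INFINITY fail count as a ban, as happens after a failed start by default, is an assumption of the model.

```python
# Hypothetical, simplified model of node eligibility after a failure.
# This is NOT Pacemaker's policy engine; it only illustrates the reasons
# given above for a resource not moving to the other node.
from dataclasses import dataclass
from typing import List, Optional

INFINITY = float("inf")


@dataclass
class NodeState:
    name: str
    fail_count: float = 0          # recorded failures of the resource on this node
    last_failure: float = 0.0      # timestamp of the most recent failure (seconds)
    location_score: float = 0.0    # summed location-constraint score; -INFINITY bans


def eligible_nodes(nodes: List[NodeState], now: float,
                   migration_threshold: Optional[int] = None,
                   failure_timeout: Optional[float] = None) -> List[str]:
    """Return the names of nodes on which the resource may still run."""
    allowed = []
    for node in nodes:
        # A negative (-INFINITY) location constraint bans the node outright.
        if node.location_score == -INFINITY:
            continue
        fails = node.fail_count
        # Old failures are forgotten only if a failure timeout is set and has passed.
        if failure_timeout is not None and now - node.last_failure >= failure_timeout:
            fails = 0
        # Assumption of this model: an unexpired INFINITY fail count bans the node.
        if fails == INFINITY:
            continue
        # Reaching the migration threshold, if one is defined, also bans the node.
        if migration_threshold is not None and fails >= migration_threshold:
            continue
        allowed.append(node.name)
    return allowed


if __name__ == "__main__":
    nodes = [
        NodeState("node1", fail_count=1, last_failure=3600.0),
        # A start failure on node2 long ago, never cleaned up, no timeout set.
        NodeState("node2", fail_count=INFINITY, last_failure=0.0),
    ]
    print(eligible_nodes(nodes, now=3700.0))  # -> ['node1']
```

In this toy scenario node2 is never a failover target because its old failure never expires; setting a failure timeout (or cleaning up the fail count) would make it eligible again.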

Thanks,

Dejan


> Thanks for any comments
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems