Hi,

On Thu, Nov 04, 2010 at 11:06:48AM -0300, mike wrote:
> Looking for a more experienced person who can explain this issue we had
> last night.
>
> Our backups kicked in during the night at 1AM. At 1:01AM, our mysql
> cluster had issues. Specifically I can see in crm_mon where the cluster
> has it as failed due to an "unknown exec error". Looking at the
> performance of the node, I can see where wait on I/O went through the
> roof at 1AM when the tsm backups kicked in. I can see where this caused
> heartbeat issues because mysql was late checking its instances - it
> generally takes a few seconds but in this case it took 3 minutes. Of
> course this is all due to the extremely high wait on I/O but I am
> curious - why didn't the cluster fail over? Why put MySQL in an
> unmanaged state and simply say there was an "unknown exec error"?

Can't say without looking at the logs and the PE files. One possible
explanation is that the resource was, for whatever reason, not allowed to
run on the other node: a past failure which didn't expire, or a negative
location constraint. Or the fail count reached the migration threshold
(if defined).

Thanks,

Dejan

> Thanks for any comments
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems