> But generally I believe this test case is invalid.

I might agree that this test case does not necessarily reproduce what happened on my production system (unfortunately I do not know for sure what happened there; the dev who caused this just tells me he used some stupid SQL statement and even executed it several times in parallel), but I do not think the test case is invalid. If there is an OOM situation on a node and the local pacemaker therefore can't do its job anymore (I base this statement on the various lrmd "cannot allocate memory" logs), this is a case the cluster should be able to recover from.
What I saw while doing this test was that the bad node discovered failures on the running ip and mysql resources, scheduled the recovery, but never managed to recover.

I think it was lmb who suggested "periodic health-checks" on the pacemaker layer. If pacemaker on $good had periodically tried to talk to pacemaker on $bad, then it might have seen that $bad does not respond and might have done something about it. Just my theory though. Opinions?

Regards
Dominik

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker