Hi,

My release is pacemaker-1.1.2-7 (on RHEL6), and I have checked that the patch "High: PE: Bug lf#2433 - No services should be stopped until probes finish" is indeed integrated in this release.
Nevertheless, I seem to hit a similar problem from time to time, and it can affect any primitive: a primitive managed by Pacemaker is flagged "failed" on one node while it is already started on the other node. A simple cleanup on the group erases the failure and all is fine again. It happens within roughly two hours when I run a loop (a robustness test) that migrates the group (which includes the primitive) from one node to the other and back, with a delay of 300s between migrations.

If I compare the logs (syslog) from a run where all is fine with those from a run where I get the error, the first error I find is:

node1 daemon info lrmd [38904]: info: flush_op: process for operation monitor[2973] on ocf:<provider>:<scriptname>::<primitive name> for client 38907 still running, flush delayed
node1 daemon debug crmd [38907]: debug: cancel_op: Op 2973 for <primitive-name> (<primitive-name>:2973): cancelled

It seems that Pacemaker applies the stop to the primitive running on node1 at the very moment a monitor operation is checking it, so the cancellation of that monitor operation is delayed. The stop of the primitive takes effect and the primitive starts on node2. After 20 seconds, the monitor operation runs again on node1, fails, and is reported as an error on node1. As a result, no further switch to node1 is possible until a manual crm cleanup is run on the primitive.

Thanks for your ideas on this problem.

Alain
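For anyone who wants to check their own syslog for the same symptom, here is a minimal Python sketch that pulls out the "flush delayed" messages. The pattern is derived only from the single lrmd line quoted above, so the message layout in other lrmd versions is an assumption, and the function name `find_delayed_flushes` is purely illustrative.

```python
import re

# Pattern built from the lrmd message quoted above. The layout is assumed
# from that single line and may differ in other lrmd/Pacemaker versions.
FLUSH_DELAYED = re.compile(
    r"flush_op: process for operation (?P<op>\w+)\[(?P<id>\d+)\] on "
    r"(?P<resource>.+?) for client (?P<client>\d+) still running, flush delayed"
)


def find_delayed_flushes(lines):
    """Return one dict per 'flush delayed' message found in the given lines."""
    hits = []
    for line in lines:
        match = FLUSH_DELAYED.search(line)
        if match:
            hits.append({
                "op": match.group("op"),              # e.g. "monitor"
                "id": int(match.group("id")),         # lrmd operation id
                "resource": match.group("resource"),  # class:provider:type::name
                "client": int(match.group("client")), # pid of the lrmd client
            })
    return hits
```

Passing an open log file to `find_delayed_flushes` (on a default RHEL6 setup syslog usually goes to /var/log/messages, but that depends on the local syslog configuration) gives the operation ids, which can then be matched against the crmd `cancel_op` lines for the same id.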
