Hi

My release is pacemaker-1.1.2-7 (on RHEL6), and I have checked that the patch
High: PE: Bug lf#2433 - No services should be stopped until probes finish
is effectively integrated in this release.

Nevertheless, it seems that I get a similar problem from time to time, for
whatever primitive: a primitive under Pacemaker is flagged "failed" on one
node even though the primitive is already started on the other node. A
simple cleanup on the group then erases the failure and all is fine again,
but the problem reappears within, say, two hours when I run a loop (a
robustness test) that migrates the group (which includes the primitive)
from one node to the other and back, with a delay of 300 s between
migrations.
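For reference, the migration loop looks roughly like this (the group name, node names and iteration count below are placeholders; it uses the crm shell's "resource migrate" / "resource unmigrate" commands):

```shell
#!/bin/sh
# Rough sketch of the robustness test. Group name, node names and
# iteration count are placeholders, not the real configuration.
CRM=${CRM:-crm}   # path to the crm shell

migrate_loop() {
    group=$1; iters=$2; delay=$3
    i=0
    while [ "$i" -lt "$iters" ]; do
        "$CRM" resource migrate "$group" node2
        sleep "$delay"
        "$CRM" resource migrate "$group" node1
        sleep "$delay"
        i=$((i + 1))
    done
    # drop the location constraints left behind by the migrations
    "$CRM" resource unmigrate "$group"
}

# Example: 12 round trips with 300 s between moves
# migrate_loop mygroup 12 300
```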

If I compare the logs (syslog) generated by the scenario when all is fine
with the logs from a failing run, the first error I find is:
node1 daemon info lrmd [38904]: info: flush_op: process for operation 
monitor[2973] on ocf:<provider>:<scriptname>::<primitive name> for client 
38907 still running, flush delayed 
node1 daemon debug crmd [38907]: debug: cancel_op: Op 2973 for 
<primitive-name> (<primitive-name>:2973): cancelled 

It seems that Pacemaker applies the stop to the primitive running on node1
just at the moment when a monitor operation is checking the primitive, so
the cancellation of the monitor operation is delayed. The stop of the
primitive completes and the primitive starts on node2. After 20 seconds,
the monitor operation on node1 runs again, fails, and is reported as
erroneous on node1. From then on, no switch back to node1 is possible
unless a manual crm cleanup on the primitive is executed.

Thanks for your ideas on this problem.
Alain



_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems