Hi,

I've noticed the same type of behavior, though in a different context. My setup includes 3 drbd devices and a group of resources, all of which have to run on the same node and move together to other nodes. My issue was with the first resource that required access to a drbd device: the ocf:heartbeat:Filesystem RA was trying to do a mount and failing.

The reason was that it was trying to mount the drbd device before the device had finished being promoted to the primary state. Like you, I introduced a start-delay, but on the start action. This proved to be of no use: the behavior persisted, even with an increased start-delay. However, it only happened during fail-back; during fail-over everything was ok.

The fix I made was to remove the start-delay and to add colocation constraints between the group and all of the ms_drbd resources. Before that, I only had one colocation constraint, for the drbd device being promoted last.
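
As a rough sketch of what such constraints could look like with the crm shell: the resource names below (grp_services, ms_drbd0, ms_drbd1, ms_drbd2) and the constraint ids are made up for illustration and are not from my actual configuration. The function only prints the commands, so they can be reviewed before being run on a cluster.

```sh
#!/bin/sh
# Hypothetical resource ids; substitute the ones from your own CIB.
GROUP="grp_services"
DRBD_MS="ms_drbd0 ms_drbd1 ms_drbd2"

# Print one colocation constraint per DRBD master/slave resource, so the
# group is only placed where every DRBD device is in the Master role.
# The commands are echoed, not executed.
print_colocations() {
    for ms in $DRBD_MS; do
        echo "crm configure colocation col_${GROUP}_on_${ms} inf: ${GROUP} ${ms}:Master"
    done
}

print_colocations
```

Each printed line ties the group to one ms_drbd resource, rather than only to the device that is promoted last.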

I hope this helps.

Regards,

Dan

Pavlos Parissis wrote:
Hi,

I noticed a race condition while I was integrating an application with
Pacemaker and thought I would share it with you.

The init script of the application is LSB-compliant and passes the
tests mentioned in the Pacemaker documentation. Moreover, the init
script uses the supplied functions from the system[1] for starting,
stopping and checking the application.

I observed a few times that the monitor action was failing after the
startup of the cluster or the movement of the resource group.
Because it did not always happen and a manual start/status always
worked, it was quite tricky to find the root cause of the failure.
After a few hours of troubleshooting, I found out that the 1st monitor
action after the start action was executed too fast for the
application to create the pid file. As a result, the monitor action was
receiving an error.

I know it sounds a bit strange but it happened on my systems. The fact
that my systems are basically vmware images on a laptop could be
related to the issue.

Nevertheless, I would like to ask if you are thinking of implementing an
"init_wait" on the 1st monitor action. It could be useful.

To solve my issue I put a sleep after the start of the application in
the init script. This gives enough time for the application to create
its pid file and the 1st monitor doesn't fail.
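
A variation on the fixed sleep is to poll for the pid file with a timeout, so that the start action returns as soon as the file appears. This is only a minimal sketch: wait_for_pidfile is a hypothetical helper, not one of the system-supplied init functions, and the path and timeout in the usage comment are placeholders.

```sh
#!/bin/sh
# wait_for_pidfile PIDFILE [TIMEOUT_SECONDS]
# Hypothetical helper: returns 0 once PIDFILE exists and is non-empty,
# or 1 if that has not happened after TIMEOUT_SECONDS (default 10).
wait_for_pidfile() {
    pidfile=$1
    timeout=${2:-10}
    elapsed=0
    while [ ! -s "$pidfile" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Possible use at the end of the init script's start function
# (placeholder path and timeout):
#   wait_for_pidfile /var/run/myapp.pid 10 || exit 1
```

With this, start only reports success once the pid file is in place, so the 1st monitor should not run before the file exists.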


Cheers,
Pavlos


[1] CentOS 5.4

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

