Hi,
I've noticed the same kind of behavior, but in a different context. My
setup includes 3 drbd devices and a group of resources, all of which
have to run on the same node and move together to other nodes. My issue
was with the first resource that required access to a drbd device: the
ocf:heartbeat:Filesystem RA was trying to mount it and failing.
The reason was that it attempted to mount the drbd device before the
device had finished switching to the primary state. Like you, I
introduced a start-delay, but on the start action. This proved to be of
no use: the behavior persisted, even with an increased start-delay.
However, it only happened during fail-back. During fail-over,
everything was OK.
The fix I made was to remove every start-delay and to add colocation
constraints between the group and all of the ms_drbd resources. Before
that I only had one colocation constraint, for the drbd device being
promoted last.
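As a rough sketch of what such constraints might look like in crm shell
syntax (the resource names grp_services and ms_drbd0..ms_drbd2 and the
constraint ids are hypothetical, not taken from my actual configuration):

```
# Hypothetical names: one colocation constraint per drbd master/slave
# resource, so the group may only run where every device is Master.
colocation grp_with_drbd0 inf: grp_services ms_drbd0:Master
colocation grp_with_drbd1 inf: grp_services ms_drbd1:Master
colocation grp_with_drbd2 inf: grp_services ms_drbd2:Master
```

With only one of these lines present, the group is tied to a single
device, which matches the configuration I had before the fix.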
I hope this helps.
Regards,
Dan
Pavlos Parissis wrote:
Hi,
I noticed a race condition while I was integrating an application with
Pacemaker and thought I would share it with you.
The init script of the application is LSB-compliant and passes the
tests mentioned in the Pacemaker documentation. Moreover, the init
script uses the functions supplied by the system[1] for starting,
stopping and checking the application.
I observed a few times that the monitor action was failing after the
startup of the cluster or after the resource group was moved.
Because it did not always happen, and a manual start/status always
worked, it was quite tricky to find the root cause of the failure.
After a few hours of troubleshooting, I found out that the 1st monitor
action after the start action was executed too soon for the
application to have created its pid file. As a result, the monitor
action returned an error.
I know it sounds a bit strange, but it happened on my systems. The fact
that my systems are basically vmware images on a laptop could be
related to the issue.
Nevertheless, I would like to ask whether you are thinking of
implementing an "init_wait" on the 1st monitor action. It could be useful.
To solve my issue I put a sleep after the start of the application in
the init script. This gives the application enough time to create
its pid file, so the 1st monitor doesn't fail.
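A variation on the fixed sleep, sketched here only as an illustration, is
to poll for the pid file and give up after a timeout. The function name
wait_for_pidfile and its arguments are hypothetical; this is not what my
init script currently contains.

```shell
# Hypothetical helper: wait until a pid file exists and is non-empty.
#   $1 = path to the pid file
#   $2 = maximum number of seconds to wait (defaults to 10)
# Returns 0 once the file is present, non-zero if the timeout expires.
wait_for_pidfile() {
    pidfile=$1
    timeout=${2:-10}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # -s is true when the file exists and has a size greater than zero
        if [ -s "$pidfile" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # One final check after the last sleep
    [ -s "$pidfile" ]
}
```

In the start branch of the init script this could be called right after
the daemon is launched, for example
wait_for_pidfile /var/run/myapp.pid 10 || exit 1
where the pid file path is again only a placeholder. Start then returns
as soon as the pid file appears instead of always waiting the full sleep.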
Cheers,
Pavlos
[1] CentOS 5.4
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania