On Wed, Jul 11, 2012 at 11:38:52AM +0200, Nikola Ciprich wrote:
> > Well, I'd expect that to be safer than your current configuration ...
> > discard-zero-changes will never overwrite data automatically .... have
> > you tried adding the start-delay to the DRBD start operation? I'm curious
> > if that is already sufficient for your problem.
> Hi,
> 
> tried 
> <op id="drbd-sas0-start-0" interval="0" name="start" start-delay="10s" 
> timeout="240s"/>
> (I hope that's the setting you meant, although I'm not sure; I haven't
> found any documentation on the start-delay option)
> 
> but didn't help..

Of course not.


You "Problem" is this:

        DRBD config:
               allow-two-primaries,
               but *NO* fencing policy,
               and *NO* fencing handler.

        And, as if that was not bad enough already,
        Pacemaker config:
                no-quorum-policy="ignore" \
                stonith-enabled="false"

D'oh.
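For the record, the missing resource-level fencing would look roughly
like this in drbd.conf (a sketch only, for DRBD 8.x; the resource name
is a placeholder, the handler paths are the ones shipped with the DRBD
packages, adjust to your installation):

        resource <resource> {
                net {
                        allow-two-primaries;
                }
                disk {
                        # on loss of the replication link, suspend I/O
                        # and call the fence-peer handler:
                        fencing resource-and-stonith;
                }
                handlers {
                        # puts a constraint into the Pacemaker CIB so the
                        # disconnected peer cannot be promoted, and removes
                        # it again once the resync has finished:
                        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
                        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
                }
        }

With that in place, a node that cannot reach its peer no longer just
goes ahead on possibly stale data.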

And then, well,
your nodes come up more than a minute apart,
and Pacemaker and DRBD behave exactly as configured:


Jul 10 06:00:12 vmnci20 crmd: [3569]: info: do_state_transition: All 1 cluster 
nodes are eligible to run resources.


Note the *1* ...

So it starts:
Jul 10 06:00:12 vmnci20 pengine: [3568]: notice: LogActions: Start   
drbd-sas0:0        (vmnci20)

But leaves:
Jul 10 06:00:12 vmnci20 pengine: [3568]: notice: LogActions: Leave   
drbd-sas0:1        (Stopped)
as there is no peer node yet.


And on the next iteration, we still have only one node:
Jul 10 06:00:15 vmnci20 crmd: [3569]: info: do_state_transition: All 1 cluster 
nodes are eligible to run resources.

So we promote:
Jul 10 06:00:15 vmnci20 pengine: [3568]: notice: LogActions: Promote 
drbd-sas0:0        (Slave -> Master vmnci20)


And only about a minute later, the peer node joins:
Jul 10 06:01:33 vmnci20 crmd: [3569]: info: do_state_transition: State 
transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED 
cause=C_FSA_INTERNAL origin=check_join_state ]
Jul 10 06:01:33 vmnci20 crmd: [3569]: info: do_state_transition: All 2 cluster 
nodes responded to the join offer.

So now we can start the peer:

Jul 10 06:01:33 vmnci20 pengine: [3568]: notice: LogActions: Leave   
drbd-sas0:0        (Master vmnci20)
Jul 10 06:01:33 vmnci20 pengine: [3568]: notice: LogActions: Start   
drbd-sas0:1        (vmnci21)


And it is even promoted right away:
Jul 10 06:01:36 vmnci20 pengine: [3568]: notice: LogActions: Promote 
drbd-sas0:1        (Slave -> Master vmnci21)

And within those three seconds, DRBD had not yet managed to establish its
connection, so both nodes went Primary without ever talking to each other.


You configured DRBD and Pacemaker to produce data divergence.
Not surprisingly, that is exactly what you get.



Fix your problem.
See above; hint: fencing resource-and-stonith,
crm-fence-peer.sh + stonith_admin,
add stonith, maybe add a third node so you don't need to ignore quorum,
... (see the sketch below)
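To illustrate the Pacemaker side, something along these lines with the
crm shell (the IPMI device and its parameters are just an example with
placeholders; use whatever fencing hardware you actually have, with one
stonith resource per node to be fenced):

        crm configure property stonith-enabled=true
        # example fencing device for one node; external/ipmi is only
        # one of many stonith plugins:
        crm configure primitive st-vmnci21 stonith:external/ipmi \
                params hostname=vmnci21 ipaddr=... userid=... passwd=... \
                op monitor interval=60s
        # and, once you have a third node, stop ignoring quorum:
        crm configure property no-quorum-policy=stop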

And all will be well.



-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
