Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-16 Thread Nikola Ciprich
 It is not.
 
 Pacemaker may just be quicker to promote now,
 or in your setup other things may have changed
 which also changed the timing behaviour.
 
 But what you are trying to do has always been broken,
 and will always be broken.
 

Hello Lars,

You were right, fixing the configuration indeed fixed my issue. I humbly apologize
for my ignorance and will immerse myself further in the documentation :)

have a nice day!

nik



-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:+420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-12 Thread Lars Ellenberg
On Wed, Jul 11, 2012 at 11:38:52AM +0200, Nikola Ciprich wrote:
  Well, I'd expect that to be safer than your current configuration ...
  discard-zero-changes will never overwrite data automatically ... have
  you tried adding the start-delay to the DRBD start operation? I'm curious if
  that is already sufficient for your problem.
 Hi,
 
 tried 
 <op id="drbd-sas0-start-0" interval="0" name="start" start-delay="10s"
 timeout="240s"/>
 (I hope that's the setting you meant; I'm not sure, as I haven't found
 any documentation on the start-delay option)
 
 but didn't help..

Of course not.


Your problem is this:

DRBD config:
   allow-two-primaries,
   but *NO* fencing policy,
   and *NO* fencing handler.

And, as if that was not bad enough already,
Pacemaker config:
no-quorum-policy=ignore \
stonith-enabled=false

D'oh.

And then, well,
your nodes come up some minute+ after each other,
and Pacemaker and DRBD behave exactly as configured:


Jul 10 06:00:12 vmnci20 crmd: [3569]: info: do_state_transition: All 1 cluster 
nodes are eligible to run resources.


Note the *1* ...

So it starts:
Jul 10 06:00:12 vmnci20 pengine: [3568]: notice: LogActions: Start   
drbd-sas0:0(vmnci20)

But leaves:
Jul 10 06:00:12 vmnci20 pengine: [3568]: notice: LogActions: Leave   
drbd-sas0:1(Stopped)
as there is no peer node yet.


And on the next iteration, we still have only one node:
Jul 10 06:00:15 vmnci20 crmd: [3569]: info: do_state_transition: All 1 cluster 
nodes are eligible to run resources.

So we promote:
Jul 10 06:00:15 vmnci20 pengine: [3568]: notice: LogActions: Promote 
drbd-sas0:0(Slave -> Master vmnci20)


And only some minute later, the peer node joins:
Jul 10 06:01:33 vmnci20 crmd: [3569]: info: do_state_transition: State 
transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED 
cause=C_FSA_INTERNAL origin=check_join_state ]
Jul 10 06:01:33 vmnci20 crmd: [3569]: info: do_state_transition: All 2 cluster 
nodes responded to the join offer.

So now we can start the peer:

Jul 10 06:01:33 vmnci20 pengine: [3568]: notice: LogActions: Leave   
drbd-sas0:0(Master vmnci20)
Jul 10 06:01:33 vmnci20 pengine: [3568]: notice: LogActions: Start   
drbd-sas0:1(vmnci21)


And it even is promoted right away:
Jul 10 06:01:36 vmnci20 pengine: [3568]: notice: LogActions: Promote 
drbd-sas0:1(Slave -> Master vmnci21)

And within those 3 seconds, DRBD was not able to establish the connection yet.


You configured DRBD and Pacemaker to produce data divergence.
Not surprisingly, that is exactly what you get.



Fix your problem.
See above; hint: fencing resource-and-stonith,
crm-fence-peer.sh + stonith_admin,
add stonith, maybe add a third node so you don't need to ignore quorum,
...

And all will be well.
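
A rough sketch of what these hints translate to, using the resource and node
names from this thread. The stonith primitive is only a placeholder example
(external/libvirt, since the test nodes are virtual machines; the hostlist and
hypervisor URI are assumptions to adapt), and the handler paths are the ones
usually shipped with DRBD:

    # drbd.conf, resource or common section
    disk {
        fencing resource-and-stonith;       # freeze I/O and call the fence-peer handler
    }
    handlers {
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

    # Pacemaker, crm shell; a real fence device goes here
    primitive st-libvirt stonith:external/libvirt \
            params hostlist="vmnci20,vmnci21" \
                   hypervisor_uri="qemu+ssh://your-vm-host/system"
    property stonith-enabled="true"

The fence-peer handler is only invoked when a fencing policy (resource-only or
resource-and-stonith) is set; resource-and-stonith additionally freezes I/O
until the handler has returned.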



-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-12 Thread Nikola Ciprich
Hello Lars,

thanks for Your reply..

 Your problem is this:
 
   DRBD config:
  allow-two-primaries,
  but *NO* fencing policy,
  and *NO* fencing handler.
 
   And, as if that was not bad enough already,
   Pacemaker config:
   no-quorum-policy=ignore \
   stonith-enabled=false

yes, as I wrote, it's just a test cluster on virtual machines, therefore no
fencing devices.

however, I don't think that's the whole source of the problem. I've tried starting node2
much later after node1 (node1 had actually been running for about 1 day), and got right
into the same situation..
pacemaker just doesn't wait long enough for the DRBDs to connect at all and
seems to promote them both.
it really seems like a regression to me, as this always worked well...

even though I've set no-quorum-policy to freeze, the problem returns as soon as
the cluster becomes quorate..
I have all split-brain and fencing scripts in drbd disabled intentionally so I
had a chance to investigate, otherwise
one of the nodes always committed suicide, but there should be no reason for
split brain..

cheers!

nik




 D'oh.
 
 And then, well,
 your nodes come up some minute+ after each other,
 and Pacemaker and DRBD behave exactly as configured:
 
 
 Jul 10 06:00:12 vmnci20 crmd: [3569]: info: do_state_transition: All 1 
 cluster nodes are eligible to run resources.
 
 
 Note the *1* ...
 
 So it starts:
 Jul 10 06:00:12 vmnci20 pengine: [3568]: notice: LogActions: Start   
 drbd-sas0:0  (vmnci20)
 
 But leaves:
 Jul 10 06:00:12 vmnci20 pengine: [3568]: notice: LogActions: Leave   
 drbd-sas0:1  (Stopped)
 as there is no peer node yet.
 
 
 And on the next iteration, we still have only one node:
 Jul 10 06:00:15 vmnci20 crmd: [3569]: info: do_state_transition: All 1 
 cluster nodes are eligible to run resources.
 
 So we promote:
 Jul 10 06:00:15 vmnci20 pengine: [3568]: notice: LogActions: Promote 
 drbd-sas0:0  (Slave -> Master vmnci20)
 
 
 And only some minute later, the peer node joins:
 Jul 10 06:01:33 vmnci20 crmd: [3569]: info: do_state_transition: State 
 transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED 
 cause=C_FSA_INTERNAL origin=check_join_state ]
 Jul 10 06:01:33 vmnci20 crmd: [3569]: info: do_state_transition: All 2 
 cluster nodes responded to the join offer.
 
 So now we can start the peer:
 
 Jul 10 06:01:33 vmnci20 pengine: [3568]: notice: LogActions: Leave   
 drbd-sas0:0  (Master vmnci20)
 Jul 10 06:01:33 vmnci20 pengine: [3568]: notice: LogActions: Start   
 drbd-sas0:1  (vmnci21)
 
 
 And it even is promoted right away:
 Jul 10 06:01:36 vmnci20 pengine: [3568]: notice: LogActions: Promote 
 drbd-sas0:1  (Slave -> Master vmnci21)
 
 And within those 3 seconds, DRBD was not able to establish the connection yet.
 
 
 You configured DRBD and Pacemaker to produce data divergence.
 Not surprisingly, that is exactly what you get.
 
 
 
 Fix your problem.
 See above; hint: fencing resource-and-stonith,
 crm-fence-peer.sh + stonith_admin,
 add stonith, maybe add a third node so you don't need to ignore quorum,
 ...
 
 And all will be well.
 
 
 
 -- 
 : Lars Ellenberg
 : LINBIT | Your Way to High Availability
 : DRBD/HA support and consulting http://www.linbit.com
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:+420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-




Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-12 Thread Lars Ellenberg
On Thu, Jul 12, 2012 at 04:23:51PM +0200, Nikola Ciprich wrote:
 Hello Lars,
 
 thanks for Your reply..
 
  Your problem is this:
  
  DRBD config:
 allow-two-primaries,
 but *NO* fencing policy,
 and *NO* fencing handler.
  
  And, as if that was not bad enough already,
  Pacemaker config:
  no-quorum-policy=ignore \
  stonith-enabled=false
 
 yes, as I wrote, it's just a test cluster on virtual machines, therefore no
 fencing devices.
 
 however, I don't think that's the whole source of the problem. I've tried starting
 node2 much later after node1 (node1 had actually been running for about 1 day), and got right
 into the same situation..
 pacemaker just doesn't wait long enough for the DRBDs to connect at all
 and seems to promote them both.
 it really seems like a regression to me, as this always worked well...

It is not.

Pacemaker may just be quicker to promote now,
or in your setup other things may have changed
which also changed the timing behaviour.

But what you are trying to do has always been broken,
and will always be broken.

 even though I've set no-quorum-policy to freeze, the problem returns as soon
 as the cluster becomes quorate..
 I have all split-brain and fencing scripts in drbd disabled intentionally so I
 had a chance to investigate, otherwise
 one of the nodes always committed suicide, but there should be no reason for
 split brain..

Right.

That's why shooting, as in stonith, is not by itself a good enough fencing
mechanism in a DRBD dual-Primary cluster. You also need to tell the peer
that it is outdated, or at least that it must not become Primary or Master
until it has synced up (or at least *started* to sync up).

You can do that using crm-fence-peer.sh (it does not actually tell
DRBD that the peer is outdated, but it tells Pacemaker not to promote that
other node, which is even better, provided the rest of the system is properly set up).

crm-fence-peer.sh alone is also not good enough in certain situations.
That's why you need both: the DRBD fence-peer mechanism *and* stonith.
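
For illustration, the constraint that crm-fence-peer.sh places (and that
crm-unfence-peer.sh removes again after the resync) looks roughly like this in
crm shell syntax; the exact constraint id depends on the script version, and
vmnci20 here simply stands for the surviving Primary:

    location drbd-fence-by-handler-ms-drbd-sas0 ms-drbd-sas0 \
            rule $role="Master" -inf: #uname ne vmnci20

It forbids the Master role everywhere except on the node that still has
up-to-date data, which is exactly the "do not promote the outdated peer"
behaviour described above.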

 
 cheers!
 
 nik
 
 
 
 
  D'oh.
  
  And then, well,
  your nodes come up some minute+ after each other,
  and Pacemaker and DRBD behave exactly as configured:
  
  
  Jul 10 06:00:12 vmnci20 crmd: [3569]: info: do_state_transition: All 1 
  cluster nodes are eligible to run resources.
  
  
  Note the *1* ...
  
  So it starts:
  Jul 10 06:00:12 vmnci20 pengine: [3568]: notice: LogActions: Start   
  drbd-sas0:0(vmnci20)
  
  But leaves:
  Jul 10 06:00:12 vmnci20 pengine: [3568]: notice: LogActions: Leave   
  drbd-sas0:1(Stopped)
  as there is no peer node yet.
  
  
  And on the next iteration, we still have only one node:
  Jul 10 06:00:15 vmnci20 crmd: [3569]: info: do_state_transition: All 1 
  cluster nodes are eligible to run resources.
  
  So we promote:
  Jul 10 06:00:15 vmnci20 pengine: [3568]: notice: LogActions: Promote 
  drbd-sas0:0(Slave -> Master vmnci20)
  
  
  And only some minute later, the peer node joins:
  Jul 10 06:01:33 vmnci20 crmd: [3569]: info: do_state_transition: State 
  transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED 
  cause=C_FSA_INTERNAL origin=check_join_state ]
  Jul 10 06:01:33 vmnci20 crmd: [3569]: info: do_state_transition: All 2 
  cluster nodes responded to the join offer.
  
  So now we can start the peer:
  
  Jul 10 06:01:33 vmnci20 pengine: [3568]: notice: LogActions: Leave   
  drbd-sas0:0(Master vmnci20)
  Jul 10 06:01:33 vmnci20 pengine: [3568]: notice: LogActions: Start   
  drbd-sas0:1(vmnci21)
  
  
  And it even is promoted right away:
  Jul 10 06:01:36 vmnci20 pengine: [3568]: notice: LogActions: Promote 
  drbd-sas0:1(Slave -> Master vmnci21)
  
  And within those 3 seconds, DRBD was not able to establish the connection 
  yet.
  
  
  You configured DRBD and Pacemaker to produce data divergence.
  Not surprisingly, that is exactly what you get.
  
  
  
  Fix your problem.
  See above; hint: fencing resource-and-stonith,
  crm-fence-peer.sh + stonith_admin,
  add stonith, maybe add a third node so you don't need to ignore quorum,
  ...
  
  And all will be well.


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-11 Thread Andreas Kurz
On 07/11/2012 04:50 AM, Andrew Beekhof wrote:
 On Wed, Jul 11, 2012 at 8:06 AM, Andreas Kurz andr...@hastexo.com wrote:
 On Tue, Jul 10, 2012 at 8:12 AM, Nikola Ciprich
 nikola.cipr...@linuxbox.cz wrote:
 Hello Andreas,
 Why not using the RA that comes with the resource-agent package?
 well, I've historically used my scripts, haven't even noticed when LVM
 resource appeared.. I switched to it now.., thanks for the hint..
 this become-primary-on was never activated?
 nope.


 Is the drbd init script deactivated on system boot? Cluster logs should
 give more insights 
 yes, it's deactivated. I tried resyncinc drbd by hand, deleted logs,
 rebooted both nodes, checked drbd ain't started and started corosync.
 result is here:
 http://nelide.cz/nik/logs.tar.gz

 It really really looks like Pacemaker is too fast when promoting to
 primary ... before the connection to the already up second node can be
 established.
 
 Do you mean we're violating a constraint?
 Or is it a problem of the RA returning too soon?

It looks like an RA problem ... notifications after the start of the
resource and the following promote are very fast and DRBD is still not
finished with establishing the connection to the peer. I can't remember
seeing this before.

Regards,
Andreas

 
 I see in your logs you have DRBD 8.3.13 userland  but
 8.3.11 DRBD module installed ... can you test with 8.3.13 kernel module
 ... there have been fixes that look like addressing this problem.

 Another quick-fix, that should also do: add a start-delay of some
 seconds to the start operation of DRBD

 ... or fix your after-split-brain policies to automatically solve this
 special type of split-brain (with 0 blocks to sync).

 Best Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now


 thanks for Your time.
 n.



 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now


 thanks a lot in advance

 nik


 On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
 On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
 hello,

 I'm trying to solve quite mysterious problem here..
 I've got new cluster with bunch of SAS disks for testing purposes.
 I've configured DRBDs (in primary/primary configuration)

 when I start drbd using drbdadm, it get's up nicely (both nodes
 are Primary, connected).
 however when I start it using corosync, I always get split-brain, 
 although
 there are no data written, no network disconnection, anything..

 your full drbd and Pacemaker configuration please ... some snippets from
 something are very seldom helpful ...

 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now


 here's drbd resource config:
 primitive drbd-sas0 ocf:linbit:drbd \
 params drbd_resource=drbd-sas0 \
 operations $id=drbd-sas0-operations \
 op start interval=0 timeout=240s \
 op stop interval=0 timeout=200s \
 op promote interval=0 timeout=200s \
 op demote interval=0 timeout=200s \
 op monitor interval=179s role=Master timeout=150s \
 op monitor interval=180s role=Slave timeout=150s

 ms ms-drbd-sas0 drbd-sas0 \
meta clone-max=2 clone-node-max=1 master-max=2 
 master-node-max=1 notify=true globally-unique=false 
 interleave=true target-role=Started


 here's the dmesg output when pacemaker tries to promote drbd, causing 
 the splitbrain:
 [  157.646292] block drbd2: Starting worker thread (from drbdsetup 
 [6892])
 [  157.646539] block drbd2: disk( Diskless -> Attaching )
 [  157.650364] block drbd2: Found 1 transactions (1 active extents) in 
 activity log.
 [  157.650560] block drbd2: Method to ensure write ordering: drain
 [  157.650688] block drbd2: drbd_bm_resize called with capacity == 
 584667688
 [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 
 pages=2231
 [  157.653760] block drbd2: size = 279 GB (292333844 KB)
 [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
 [  157.673722] block drbd2: recounting of set bits took additional 2 
 jiffies
 [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk 
 bit-map.
 [  157.673972] block drbd2: disk( Attaching -> UpToDate )
 [  157.674100] block drbd2: attached to UUIDs 
 0150944D23F16BAE::8C175205284E3262:8C165205284E3263
 [  157.685539] block drbd2: conn( StandAlone -> Unconnected )
 [  157.685704] block drbd2: Starting receiver thread (from drbd2_worker 
 [6893])
 [  157.685928] block drbd2: receiver (re)started
 [  157.686071] block drbd2: conn( Unconnected -> WFConnection )
 [  158.960577] block drbd2: role( Secondary -> Primary )
 [  158.960815] block drbd2: new current UUID 
 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
 [  162.686990] block drbd2: Handshake successful: Agreed network 
 protocol version 96
 [  162.687183] block drbd2: conn( WFConnection -> WFReportParams )
 [  162.687404] block drbd2: Starting asender thread (from 
 drbd2_receiver [6927])
 [  162.687741] block drbd2: data-integrity-alg: 

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-11 Thread Andreas Kurz
On 07/11/2012 09:23 AM, Nikola Ciprich wrote:
 It really really looks like Pacemaker is too fast when promoting to
 primary ... before the connection to the already up second node can be
 established.

 Do you mean we're violating a constraint?
 Or is it a problem of the RA returning too soon?
 dunno, I tried older drbd userspaces to check whether it's a problem
 of the newer RA, to no avail...
 

 I see in your logs you have DRBD 8.3.13 userland  but
 8.3.11 DRBD module installed ... can you test with 8.3.13 kernel module
 ... there have been fixes that look like addressing this problem.
 tried 8.3.13 userspace + 8.3.13 module (on top of 3.0.36 kernel), 
 unfortunately same result..
 

 Another quick-fix, that should also do: add a start-delay of some
 seconds to the start operation of DRBD

 ... or fix your after-split-brain policies to automatically solve this
 special type of split-brain (with 0 blocks to sync).
 I'll try that, although I'd not like to use this for production :)

Well, I'd expect that to be safer than your current configuration ...
discard-zero-changes will never overwrite data automatically ... have
you tried adding the start-delay to the DRBD start operation? I'm curious if
that is already sufficient for your problem.

Regards,
Andreas

 

 Best Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now


 thanks for Your time.
 n.



 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now


 thanks a lot in advance

 nik


 On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
 On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
 hello,

 I'm trying to solve quite mysterious problem here..
 I've got new cluster with bunch of SAS disks for testing purposes.
 I've configured DRBDs (in primary/primary configuration)

 when I start drbd using drbdadm, it get's up nicely (both nodes
 are Primary, connected).
 however when I start it using corosync, I always get split-brain, 
 although
 there are no data written, no network disconnection, anything..

 your full drbd and Pacemaker configuration please ... some snippets from
 something are very seldom helpful ...

 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now


 here's drbd resource config:
 primitive drbd-sas0 ocf:linbit:drbd \
 params drbd_resource=drbd-sas0 \
 operations $id=drbd-sas0-operations \
 op start interval=0 timeout=240s \
 op stop interval=0 timeout=200s \
 op promote interval=0 timeout=200s \
 op demote interval=0 timeout=200s \
 op monitor interval=179s role=Master timeout=150s \
 op monitor interval=180s role=Slave timeout=150s

 ms ms-drbd-sas0 drbd-sas0 \
meta clone-max=2 clone-node-max=1 master-max=2 
 master-node-max=1 notify=true globally-unique=false 
 interleave=true target-role=Started


 here's the dmesg output when pacemaker tries to promote drbd, causing 
 the splitbrain:
 [  157.646292] block drbd2: Starting worker thread (from drbdsetup 
 [6892])
 [  157.646539] block drbd2: disk( Diskless - Attaching )
 [  157.650364] block drbd2: Found 1 transactions (1 active extents) in 
 activity log.
 [  157.650560] block drbd2: Method to ensure write ordering: drain
 [  157.650688] block drbd2: drbd_bm_resize called with capacity == 
 584667688
 [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 
 pages=2231
 [  157.653760] block drbd2: size = 279 GB (292333844 KB)
 [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
 [  157.673722] block drbd2: recounting of set bits took additional 2 
 jiffies
 [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on 
 disk bit-map.
 [  157.673972] block drbd2: disk( Attaching - UpToDate )
 [  157.674100] block drbd2: attached to UUIDs 
 0150944D23F16BAE::8C175205284E3262:8C165205284E3263
 [  157.685539] block drbd2: conn( StandAlone - Unconnected )
 [  157.685704] block drbd2: Starting receiver thread (from 
 drbd2_worker [6893])
 [  157.685928] block drbd2: receiver (re)started
 [  157.686071] block drbd2: conn( Unconnected - WFConnection )
 [  158.960577] block drbd2: role( Secondary - Primary )
 [  158.960815] block drbd2: new current UUID 
 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
 [  162.686990] block drbd2: Handshake successful: Agreed network 
 protocol version 96
 [  162.687183] block drbd2: conn( WFConnection - WFReportParams )
 [  162.687404] block drbd2: Starting asender thread (from 
 drbd2_receiver [6927])
 [  162.687741] block drbd2: data-integrity-alg: not-used
 [  162.687930] block drbd2: drbd_sync_handshake:
 [  162.688057] block drbd2: self 
 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 
 bits:0 flags:0
 [  162.688244] block drbd2: peer 
 7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 
 bits:0 flags:0
 [  162.688428] block drbd2: uuid_compare()=100 by rule 90
 [  162.688544] block drbd2: helper command: /sbin/drbdadm 
 

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-11 Thread Nikola Ciprich
 Well, I'd expect that to be safer than your current configuration ...
 discard-zero-changes will never overwrite data automatically ... have
 you tried adding the start-delay to the DRBD start operation? I'm curious if
 that is already sufficient for your problem.
Hi,

tried
<op id="drbd-sas0-start-0" interval="0" name="start" start-delay="10s"
timeout="240s"/>
(I hope that's the setting you meant; I'm not sure, as I haven't found
any documentation on the start-delay option)

but didn't help..
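
For reference, the crm shell form of the same thing would be roughly the
following (shown only to document where the attribute goes; as Lars points out
elsewhere in the thread, a start delay merely shifts the timing and does not
replace proper fencing):

    primitive drbd-sas0 ocf:linbit:drbd \
            params drbd_resource="drbd-sas0" \
            op start interval="0" timeout="240s" start-delay="10s"
            # remaining ops (stop/promote/demote/monitor) as before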




 
 Regards,
 Andreas
 
  
 
  Best Regards,
  Andreas
 
  --
  Need help with Pacemaker?
  http://www.hastexo.com/now
 
 
  thanks for Your time.
  n.
 
 
 
  Regards,
  Andreas
 
  --
  Need help with Pacemaker?
  http://www.hastexo.com/now
 
 
  thanks a lot in advance
 
  nik
 
 
  On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
  On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
  hello,
 
  I'm trying to solve quite mysterious problem here..
  I've got new cluster with bunch of SAS disks for testing purposes.
  I've configured DRBDs (in primary/primary configuration)
 
  when I start drbd using drbdadm, it get's up nicely (both nodes
  are Primary, connected).
  however when I start it using corosync, I always get split-brain, 
  although
  there are no data written, no network disconnection, anything..
 
  your full drbd and Pacemaker configuration please ... some snippets 
  from
  something are very seldom helpful ...
 
  Regards,
  Andreas
 
  --
  Need help with Pacemaker?
  http://www.hastexo.com/now
 
 
  here's drbd resource config:
  primitive drbd-sas0 ocf:linbit:drbd \
  params drbd_resource=drbd-sas0 \
  operations $id=drbd-sas0-operations \
  op start interval=0 timeout=240s \
  op stop interval=0 timeout=200s \
  op promote interval=0 timeout=200s \
  op demote interval=0 timeout=200s \
  op monitor interval=179s role=Master timeout=150s \
  op monitor interval=180s role=Slave timeout=150s
 
  ms ms-drbd-sas0 drbd-sas0 \
 meta clone-max=2 clone-node-max=1 master-max=2 
  master-node-max=1 notify=true globally-unique=false 
  interleave=true target-role=Started
 
 
  here's the dmesg output when pacemaker tries to promote drbd, 
  causing the splitbrain:
  [  157.646292] block drbd2: Starting worker thread (from drbdsetup 
  [6892])
  [  157.646539] block drbd2: disk( Diskless - Attaching )
  [  157.650364] block drbd2: Found 1 transactions (1 active extents) 
  in activity log.
  [  157.650560] block drbd2: Method to ensure write ordering: drain
  [  157.650688] block drbd2: drbd_bm_resize called with capacity == 
  584667688
  [  157.653442] block drbd2: resync bitmap: bits=73083461 
  words=1141930 pages=2231
  [  157.653760] block drbd2: size = 279 GB (292333844 KB)
  [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
  [  157.673722] block drbd2: recounting of set bits took additional 2 
  jiffies
  [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on 
  disk bit-map.
  [  157.673972] block drbd2: disk( Attaching - UpToDate )
  [  157.674100] block drbd2: attached to UUIDs 
  0150944D23F16BAE::8C175205284E3262:8C165205284E3263
  [  157.685539] block drbd2: conn( StandAlone - Unconnected )
  [  157.685704] block drbd2: Starting receiver thread (from 
  drbd2_worker [6893])
  [  157.685928] block drbd2: receiver (re)started
  [  157.686071] block drbd2: conn( Unconnected - WFConnection )
  [  158.960577] block drbd2: role( Secondary - Primary )
  [  158.960815] block drbd2: new current UUID 
  015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
  [  162.686990] block drbd2: Handshake successful: Agreed network 
  protocol version 96
  [  162.687183] block drbd2: conn( WFConnection - WFReportParams )
  [  162.687404] block drbd2: Starting asender thread (from 
  drbd2_receiver [6927])
  [  162.687741] block drbd2: data-integrity-alg: not-used
  [  162.687930] block drbd2: drbd_sync_handshake:
  [  162.688057] block drbd2: self 
  015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 
  bits:0 flags:0
  [  162.688244] block drbd2: peer 
  7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 
  bits:0 flags:0
  [  162.688428] block drbd2: uuid_compare()=100 by rule 90
  [  162.688544] block drbd2: helper command: /sbin/drbdadm 
  initial-split-brain minor-2
  [  162.691332] block drbd2: helper command: /sbin/drbdadm 
  initial-split-brain minor-2 exit code 0 (0x0)
 
  to me it seems to be that it's promoting it too early, and I also 
  wonder why there is the
  new current UUID stuff?
 
  I'm using centos6, kernel 3.0.36, drbd-8.3.13, pacemaker-1.1.6
 
  could anybody please try to advice me? I'm sure I'm doing something 
  stupid, but can't figure out what...
 
  thanks a lot in advance
 
  with best regards
 
  nik
 
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-10 Thread Nikola Ciprich
Hello Andreas,
 Why not using the RA that comes with the resource-agent package?
well, I've historically used my own scripts and hadn't even noticed when the LVM
resource appeared.. I've switched to it now, thanks for the hint..
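
For reference, the packaged agent mentioned above would be configured roughly
like this (the volume group name is taken from the configuration posted later
in the thread; the timeouts are illustrative):

    primitive lvm ocf:heartbeat:LVM \
            params volgrpname="vgshared" \
            op start interval="0" timeout="30s" \
            op stop interval="0" timeout="30s" \
            op monitor interval="60s" timeout="30s"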
 this become-primary-on was never activated?
nope.


 Is the drbd init script deactivated on system boot? Cluster logs should
 give more insights 
yes, it's deactivated. I resynced drbd by hand, deleted the logs,
rebooted both nodes, checked that drbd wasn't started, and started corosync.
the result is here:
http://nelide.cz/nik/logs.tar.gz

thanks for Your time.
n.


 
 Regards,
 Andreas
 
 -- 
 Need help with Pacemaker?
 http://www.hastexo.com/now
 
  
  thanks a lot in advance
  
  nik
  
  
  On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
  On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
  hello,
 
  I'm trying to solve quite mysterious problem here..
  I've got new cluster with bunch of SAS disks for testing purposes.
  I've configured DRBDs (in primary/primary configuration)
 
  when I start drbd using drbdadm, it get's up nicely (both nodes
  are Primary, connected).
  however when I start it using corosync, I always get split-brain, although
  there are no data written, no network disconnection, anything..
 
  your full drbd and Pacemaker configuration please ... some snippets from
  something are very seldom helpful ...
 
  Regards,
  Andreas
 
  -- 
  Need help with Pacemaker?
  http://www.hastexo.com/now
 
 
  here's drbd resource config:
  primitive drbd-sas0 ocf:linbit:drbd \
  params drbd_resource=drbd-sas0 \
  operations $id=drbd-sas0-operations \
  op start interval=0 timeout=240s \
  op stop interval=0 timeout=200s \
  op promote interval=0 timeout=200s \
  op demote interval=0 timeout=200s \
  op monitor interval=179s role=Master timeout=150s \
  op monitor interval=180s role=Slave timeout=150s
 
  ms ms-drbd-sas0 drbd-sas0 \
 meta clone-max=2 clone-node-max=1 master-max=2 
  master-node-max=1 notify=true globally-unique=false 
  interleave=true target-role=Started
 
 
  here's the dmesg output when pacemaker tries to promote drbd, causing the 
  splitbrain:
  [  157.646292] block drbd2: Starting worker thread (from drbdsetup [6892])
  [  157.646539] block drbd2: disk( Diskless - Attaching ) 
  [  157.650364] block drbd2: Found 1 transactions (1 active extents) in 
  activity log.
  [  157.650560] block drbd2: Method to ensure write ordering: drain
  [  157.650688] block drbd2: drbd_bm_resize called with capacity == 
  584667688
  [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 
  pages=2231
  [  157.653760] block drbd2: size = 279 GB (292333844 KB)
  [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
  [  157.673722] block drbd2: recounting of set bits took additional 2 
  jiffies
  [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk 
  bit-map.
  [  157.673972] block drbd2: disk( Attaching - UpToDate ) 
  [  157.674100] block drbd2: attached to UUIDs 
  0150944D23F16BAE::8C175205284E3262:8C165205284E3263
  [  157.685539] block drbd2: conn( StandAlone - Unconnected ) 
  [  157.685704] block drbd2: Starting receiver thread (from drbd2_worker 
  [6893])
  [  157.685928] block drbd2: receiver (re)started
  [  157.686071] block drbd2: conn( Unconnected - WFConnection ) 
  [  158.960577] block drbd2: role( Secondary - Primary ) 
  [  158.960815] block drbd2: new current UUID 
  015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
  [  162.686990] block drbd2: Handshake successful: Agreed network protocol 
  version 96
  [  162.687183] block drbd2: conn( WFConnection - WFReportParams ) 
  [  162.687404] block drbd2: Starting asender thread (from drbd2_receiver 
  [6927])
  [  162.687741] block drbd2: data-integrity-alg: not-used
  [  162.687930] block drbd2: drbd_sync_handshake:
  [  162.688057] block drbd2: self 
  015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 
  bits:0 flags:0
  [  162.688244] block drbd2: peer 
  7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 
  bits:0 flags:0
  [  162.688428] block drbd2: uuid_compare()=100 by rule 90
  [  162.688544] block drbd2: helper command: /sbin/drbdadm 
  initial-split-brain minor-2
  [  162.691332] block drbd2: helper command: /sbin/drbdadm 
  initial-split-brain minor-2 exit code 0 (0x0)
 
  to me it seems to be that it's promoting it too early, and I also wonder 
  why there is the 
  new current UUID stuff?
 
  I'm using centos6, kernel 3.0.36, drbd-8.3.13, pacemaker-1.1.6
 
  could anybody please try to advice me? I'm sure I'm doing something 
  stupid, but can't figure out what...
 
  thanks a lot in advance
 
  with best regards
 
  nik
 
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-10 Thread Andreas Kurz
On Tue, Jul 10, 2012 at 8:12 AM, Nikola Ciprich
nikola.cipr...@linuxbox.cz wrote:
 Hello Andreas,
 Why not using the RA that comes with the resource-agent package?
 well, I've historically used my scripts, haven't even noticed when LVM
 resource appeared.. I switched to it now.., thanks for the hint..
 this become-primary-on was never activated?
 nope.


 Is the drbd init script deactivated on system boot? Cluster logs should
 give more insights 
 yes, it's deactivated. I tried resyncinc drbd by hand, deleted logs,
 rebooted both nodes, checked drbd ain't started and started corosync.
 result is here:
 http://nelide.cz/nik/logs.tar.gz

It really really looks like Pacemaker is too fast when promoting to
primary ... before the connection to the already-up second node can be
established. I see in your logs that you have the DRBD 8.3.13 userland but the
8.3.11 DRBD module installed ... can you test with the 8.3.13 kernel module?
... there have been fixes that look like they address this problem.

Another quick fix that should also do it: add a start-delay of a few
seconds to the start operation of DRBD

... or fix your after-split-brain policies to automatically solve this
special type of split-brain (with 0 blocks to sync).
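
The policy knobs referred to here live in the net section of drbd.conf; a
commonly used combination for 8.3 looks roughly like the lines below, and the
configuration posted elsewhere in this thread already contains most of it, so
this is only a reminder of where they go:

    net {
        after-sb-0pri discard-zero-changes;   # no Primary at split-brain detection: auto-resolve if one side has no changes
        after-sb-1pri discard-secondary;      # one Primary: discard the Secondary's changes
        after-sb-2pri disconnect;             # both Primary: do not auto-resolve, leave it to the admin
    }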

Best Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 thanks for Your time.
 n.



 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now

 
  thanks a lot in advance
 
  nik
 
 
  On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
  On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
  hello,
 
  I'm trying to solve quite mysterious problem here..
  I've got new cluster with bunch of SAS disks for testing purposes.
  I've configured DRBDs (in primary/primary configuration)
 
  when I start drbd using drbdadm, it get's up nicely (both nodes
  are Primary, connected).
  however when I start it using corosync, I always get split-brain, 
  although
  there are no data written, no network disconnection, anything..
 
  your full drbd and Pacemaker configuration please ... some snippets from
  something are very seldom helpful ...
 
  Regards,
  Andreas
 
  --
  Need help with Pacemaker?
  http://www.hastexo.com/now
 
 
  here's drbd resource config:
  primitive drbd-sas0 ocf:linbit:drbd \
  params drbd_resource=drbd-sas0 \
  operations $id=drbd-sas0-operations \
  op start interval=0 timeout=240s \
  op stop interval=0 timeout=200s \
  op promote interval=0 timeout=200s \
  op demote interval=0 timeout=200s \
  op monitor interval=179s role=Master timeout=150s \
  op monitor interval=180s role=Slave timeout=150s
 
  ms ms-drbd-sas0 drbd-sas0 \
 meta clone-max=2 clone-node-max=1 master-max=2 
  master-node-max=1 notify=true globally-unique=false 
  interleave=true target-role=Started
 
 
  here's the dmesg output when pacemaker tries to promote drbd, causing 
  the splitbrain:
  [  157.646292] block drbd2: Starting worker thread (from drbdsetup 
  [6892])
  [  157.646539] block drbd2: disk( Diskless - Attaching )
  [  157.650364] block drbd2: Found 1 transactions (1 active extents) in 
  activity log.
  [  157.650560] block drbd2: Method to ensure write ordering: drain
  [  157.650688] block drbd2: drbd_bm_resize called with capacity == 
  584667688
  [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 
  pages=2231
  [  157.653760] block drbd2: size = 279 GB (292333844 KB)
  [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
  [  157.673722] block drbd2: recounting of set bits took additional 2 
  jiffies
  [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk 
  bit-map.
  [  157.673972] block drbd2: disk( Attaching - UpToDate )
  [  157.674100] block drbd2: attached to UUIDs 
  0150944D23F16BAE::8C175205284E3262:8C165205284E3263
  [  157.685539] block drbd2: conn( StandAlone - Unconnected )
  [  157.685704] block drbd2: Starting receiver thread (from drbd2_worker 
  [6893])
  [  157.685928] block drbd2: receiver (re)started
  [  157.686071] block drbd2: conn( Unconnected - WFConnection )
  [  158.960577] block drbd2: role( Secondary - Primary )
  [  158.960815] block drbd2: new current UUID 
  015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
  [  162.686990] block drbd2: Handshake successful: Agreed network 
  protocol version 96
  [  162.687183] block drbd2: conn( WFConnection - WFReportParams )
  [  162.687404] block drbd2: Starting asender thread (from drbd2_receiver 
  [6927])
  [  162.687741] block drbd2: data-integrity-alg: not-used
  [  162.687930] block drbd2: drbd_sync_handshake:
  [  162.688057] block drbd2: self 
  015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 
  bits:0 flags:0
  [  162.688244] block drbd2: peer 
  7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 
  bits:0 flags:0
  [  162.688428] block drbd2: uuid_compare()=100 by rule 90
  [  162.688544] block drbd2: 

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-10 Thread Andrew Beekhof
On Wed, Jul 11, 2012 at 8:06 AM, Andreas Kurz andr...@hastexo.com wrote:
 On Tue, Jul 10, 2012 at 8:12 AM, Nikola Ciprich
 nikola.cipr...@linuxbox.cz wrote:
 Hello Andreas,
 Why not using the RA that comes with the resource-agent package?
 well, I've historically used my scripts, haven't even noticed when LVM
 resource appeared.. I switched to it now.., thanks for the hint..
 this become-primary-on was never activated?
 nope.


 Is the drbd init script deactivated on system boot? Cluster logs should
 give more insights 
 yes, it's deactivated. I tried resyncinc drbd by hand, deleted logs,
 rebooted both nodes, checked drbd ain't started and started corosync.
 result is here:
 http://nelide.cz/nik/logs.tar.gz

 It really really looks like Pacemaker is too fast when promoting to
 primary ... before the connection to the already up second node can be
 established.

Do you mean we're violating a constraint?
Or is it a problem of the RA returning too soon?

 I see in your logs you have DRBD 8.3.13 userland  but
 8.3.11 DRBD module installed ... can you test with 8.3.13 kernel module
 ... there have been fixes that look like addressing this problem.

 Another quick-fix, that should also do: add a start-delay of some
 seconds to the start operation of DRBD

 ... or fix your after-split-brain policies to automatically solve this
 special type of split-brain (with 0 blocks to sync).

 Best Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now


 thanks for Your time.
 n.



 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now

 
  thanks a lot in advance
 
  nik
 
 
  On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
  On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
  hello,
 
  I'm trying to solve quite mysterious problem here..
  I've got new cluster with bunch of SAS disks for testing purposes.
  I've configured DRBDs (in primary/primary configuration)
 
  when I start drbd using drbdadm, it get's up nicely (both nodes
  are Primary, connected).
  however when I start it using corosync, I always get split-brain, 
  although
  there are no data written, no network disconnection, anything..
 
  your full drbd and Pacemaker configuration please ... some snippets from
  something are very seldom helpful ...
 
  Regards,
  Andreas
 
  --
  Need help with Pacemaker?
  http://www.hastexo.com/now
 
 
  here's drbd resource config:
  primitive drbd-sas0 ocf:linbit:drbd \
  params drbd_resource=drbd-sas0 \
  operations $id=drbd-sas0-operations \
  op start interval=0 timeout=240s \
  op stop interval=0 timeout=200s \
  op promote interval=0 timeout=200s \
  op demote interval=0 timeout=200s \
  op monitor interval=179s role=Master timeout=150s \
  op monitor interval=180s role=Slave timeout=150s
 
  ms ms-drbd-sas0 drbd-sas0 \
 meta clone-max=2 clone-node-max=1 master-max=2 
  master-node-max=1 notify=true globally-unique=false 
  interleave=true target-role=Started
 
 
  here's the dmesg output when pacemaker tries to promote drbd, causing 
  the splitbrain:
  [  157.646292] block drbd2: Starting worker thread (from drbdsetup 
  [6892])
  [  157.646539] block drbd2: disk( Diskless - Attaching )
  [  157.650364] block drbd2: Found 1 transactions (1 active extents) in 
  activity log.
  [  157.650560] block drbd2: Method to ensure write ordering: drain
  [  157.650688] block drbd2: drbd_bm_resize called with capacity == 
  584667688
  [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 
  pages=2231
  [  157.653760] block drbd2: size = 279 GB (292333844 KB)
  [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
  [  157.673722] block drbd2: recounting of set bits took additional 2 
  jiffies
  [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk 
  bit-map.
  [  157.673972] block drbd2: disk( Attaching - UpToDate )
  [  157.674100] block drbd2: attached to UUIDs 
  0150944D23F16BAE::8C175205284E3262:8C165205284E3263
  [  157.685539] block drbd2: conn( StandAlone - Unconnected )
  [  157.685704] block drbd2: Starting receiver thread (from drbd2_worker 
  [6893])
  [  157.685928] block drbd2: receiver (re)started
  [  157.686071] block drbd2: conn( Unconnected - WFConnection )
  [  158.960577] block drbd2: role( Secondary - Primary )
  [  158.960815] block drbd2: new current UUID 
  015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
  [  162.686990] block drbd2: Handshake successful: Agreed network 
  protocol version 96
  [  162.687183] block drbd2: conn( WFConnection - WFReportParams )
  [  162.687404] block drbd2: Starting asender thread (from 
  drbd2_receiver [6927])
  [  162.687741] block drbd2: data-integrity-alg: not-used
  [  162.687930] block drbd2: drbd_sync_handshake:
  [  162.688057] block drbd2: self 
  015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 
  bits:0 flags:0
  [  162.688244] block drbd2: peer 
  

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-09 Thread Nikola Ciprich
Hello Andreas,

yes, You're right. I should have sent those in the initial post. Sorry about that.
I've created a very simple test configuration with which I'm able to reproduce the
problem.
there's no stonith etc., since it's just two virtual machines for the test.

crm conf:

primitive drbd-sas0 ocf:linbit:drbd \
  params drbd_resource=drbd-sas0 \
  operations $id=drbd-sas0-operations \
  op start interval=0 timeout=240s \
  op stop interval=0 timeout=200s \
  op promote interval=0 timeout=200s \
  op demote interval=0 timeout=200s \
  op monitor interval=179s role=Master timeout=150s \
  op monitor interval=180s role=Slave timeout=150s

primitive lvm ocf:lbox:lvm.ocf \
  op start interval=0 timeout=180 \
  op stop interval=0 timeout=180

ms ms-drbd-sas0 drbd-sas0 \
   meta clone-max=2 clone-node-max=1 master-max=2 master-node-max=1 
notify=true globally-unique=false interleave=true target-role=Started

clone cl-lvm lvm \
  meta globally-unique=false ordered=false interleave=true 
notify=false target-role=Started \
  params lvm-clone-max=2 lvm-clone-node-max=1

colocation col-lvm-drbd-sas0 inf: cl-lvm ms-drbd-sas0:Master

order ord-drbd-sas0-lvm inf: ms-drbd-sas0:promote cl-lvm:start

property $id=cib-bootstrap-options \
 dc-version=1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558 \
 cluster-infrastructure=openais \
 expected-quorum-votes=2 \
 no-quorum-policy=ignore \
 stonith-enabled=false

the lvm resource starts the vgshared volume group on top of drbd (LVM filters are set to
use /dev/drbd* devices only)
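
An LVM filter of the kind described here lives in /etc/lvm/lvm.conf and might
look roughly like this (illustrative; existing filter entries and device names
vary per system):

    # accept DRBD devices, reject everything else
    filter = [ "a|^/dev/drbd.*|", "r|.*|" ]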

drbd configuration:

global {
   usage-count no;
}

common {
   protocol C;

handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; ";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; ";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; ";

        #pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        #pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        #local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
        # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        # after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
}

net {
allow-two-primaries;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri call-pri-lost-after-sb;
#rr-conflict disconnect;
max-buffers 8000;
max-epoch-size 8000;
sndbuf-size 0;
ping-timeout 50;
}

syncer {
rate 100M;
al-extents 3833;
#   al-extents 257;
#   verify-alg sha1;
}

disk {
on-io-error   detach;
no-disk-barrier;
no-disk-flushes;
no-md-flushes;
}

startup {
# wfc-timeout  0;
degr-wfc-timeout 120;# 2 minutes.
# become-primary-on both;

}
}

note that the pri-on-incon-degr etc. handlers are intentionally commented out so I
can see what's going on.. otherwise the machine always got rebooted immediately..

any idea?

thanks a lot in advance

nik


On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
 On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
  hello,
  
  I'm trying to solve quite mysterious problem here..
  I've got new cluster with bunch of SAS disks for testing purposes.
  I've configured DRBDs (in primary/primary configuration)
  
  when I start drbd using drbdadm, it get's up nicely (both nodes
  are Primary, connected).
  however when I start it using corosync, I always get split-brain, although
  there are no data written, no network disconnection, anything..
 
 your full drbd and Pacemaker configuration please ... some snippets from
 something are very seldom helpful ...
 
 Regards,
 Andreas
 
 -- 
 Need help with Pacemaker?
 http://www.hastexo.com/now
 
  
  here's drbd resource config:
  primitive drbd-sas0 ocf:linbit:drbd \
params drbd_resource=drbd-sas0 \
operations $id=drbd-sas0-operations \
op 

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-09 Thread Andreas Kurz
On 07/09/2012 12:58 PM, Nikola Ciprich wrote:
 Hello Andreas,
 
 yes, You're right. I should have sent those in the initial post. Sorry about 
 that.
 I've created very simple test configuration on which I'm able to simulate the 
 problem.
 there's no stonith etc, since it's just two virtual machines for the test.
 
 crm conf:
 
 primitive drbd-sas0 ocf:linbit:drbd \
 params drbd_resource=drbd-sas0 \
 operations $id=drbd-sas0-operations \
 op start interval=0 timeout=240s \
 op stop interval=0 timeout=200s \
 op promote interval=0 timeout=200s \
 op demote interval=0 timeout=200s \
 op monitor interval=179s role=Master timeout=150s \
 op monitor interval=180s role=Slave timeout=150s
 
 primitive lvm ocf:lbox:lvm.ocf \

Why not use the RA that comes with the resource-agents package?

 op start interval=0 timeout=180 \
 op stop interval=0 timeout=180
 
 ms ms-drbd-sas0 drbd-sas0 \
meta clone-max=2 clone-node-max=1 master-max=2 master-node-max=1 
 notify=true globally-unique=false interleave=true target-role=Started
 
 clone cl-lvm lvm \
   meta globally-unique=false ordered=false interleave=true 
 notify=false target-role=Started \
   params lvm-clone-max=2 lvm-clone-node-max=1
 
 colocation col-lvm-drbd-sas0 inf: cl-lvm ms-drbd-sas0:Master
 
 order ord-drbd-sas0-lvm inf: ms-drbd-sas0:promote cl-lvm:start
 
 property $id=cib-bootstrap-options \
dc-version=1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558 \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
no-quorum-policy=ignore \
stonith-enabled=false
 
 lvm resource starts vgshared volume group on top of drbd (LVM filters are set 
 to
 use /dev/drbd* devices only)
 
 drbd configuration:
 
 global {
usage-count no;
 }
 
 common {
protocol C;
 
 handlers {
 pri-on-incon-degr /usr/lib/drbd/notify-pri-on-incon-degr.sh; 
 /usr/lib/drbd/notify-emergency-reboot.sh; ;
 pri-lost-after-sb /usr/lib/drbd/notify-pri-lost-after-sb.sh; 
 /usr/lib/drbd/notify-emergency-reboot.sh; ;
 local-io-error /usr/lib/drbd/notify-io-error.sh; 
 /usr/lib/drbd/notify-emergency-shutdown.sh; ;
 
 #pri-on-incon-degr 
 /usr/lib/drbd/notify-pri-on-incon-degr.sh; 
 /usr/lib/drbd/notify-emergency-reboot.sh; echo b  /proc/sysrq-trigger ; 
 reboot -f;
 #pri-lost-after-sb 
 /usr/lib/drbd/notify-pri-lost-after-sb.sh; 
 /usr/lib/drbd/notify-emergency-reboot.sh; echo b  /proc/sysrq-trigger ; 
 reboot -f;
 #local-io-error /usr/lib/drbd/notify-io-error.sh; 
 /usr/lib/drbd/notify-emergency-shutdown.sh; echo o  /proc/sysrq-trigger ; 
 halt -f;
 # fence-peer /usr/lib/drbd/crm-fence-peer.sh;
 # split-brain /usr/lib/drbd/notify-split-brain.sh root;
 # out-of-sync /usr/lib/drbd/notify-out-of-sync.sh root;
 # before-resync-target 
 /usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k;
 # after-resync-target 
 /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
 }
 
 net {
 allow-two-primaries;
 after-sb-0pri discard-zero-changes;
 after-sb-1pri discard-secondary;
 after-sb-2pri call-pri-lost-after-sb;
 #rr-conflict disconnect;
 max-buffers 8000;
 max-epoch-size 8000;
 sndbuf-size 0;
 ping-timeout 50;
 }
 
 syncer {
 rate 100M;
 al-extents 3833;
 #   al-extents 257;
 #   verify-alg sha1;
 }
 
 disk {
 on-io-error   detach;
 no-disk-barrier;
 no-disk-flushes;
 no-md-flushes;
 }
 
 startup {
 # wfc-timeout  0;
 degr-wfc-timeout 120;# 2 minutes.
 # become-primary-on both;

this become-primary-on was never activated?

 
 }
 }
 
 note that pri-on-incon-degr etc handlers are intentionally commented out so I 
 can
 see what's going on.. otherwise machine always got immediate reboot..
 
 any idea?

Is the drbd init script deactivated on system boot? Cluster logs should
give more insights 
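
On the CentOS 6 systems used in this thread, checking and disabling the init
script would boil down to something like:

    chkconfig --list drbd   # every runlevel should show "off"
    chkconfig drbd off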

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 thanks a lot in advance
 
 nik
 
 
 On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
 On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
 hello,

 I'm trying to solve quite mysterious problem here..
 I've got new cluster with bunch of SAS disks for testing purposes.
 I've configured DRBDs (in primary/primary configuration)

 when I start drbd using drbdadm, it get's up nicely (both nodes
 are Primary, connected).
 however when I start it using corosync, I always get split-brain, although
 there are no 

Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-07 Thread Andreas Kurz
On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
 hello,
 
 I'm trying to solve quite mysterious problem here..
 I've got new cluster with bunch of SAS disks for testing purposes.
 I've configured DRBDs (in primary/primary configuration)
 
 when I start drbd using drbdadm, it gets up nicely (both nodes
 are Primary, connected).
 however when I start it using corosync, I always get a split-brain, although
 there is no data written, no network disconnection, anything..

your full drbd and Pacemaker configuration please ... some snippets from
something are very seldom helpful ...

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 here's drbd resource config:
 primitive drbd-sas0 ocf:linbit:drbd \
 params drbd_resource=drbd-sas0 \
 operations $id=drbd-sas0-operations \
 op start interval=0 timeout=240s \
 op stop interval=0 timeout=200s \
 op promote interval=0 timeout=200s \
 op demote interval=0 timeout=200s \
 op monitor interval=179s role=Master timeout=150s \
 op monitor interval=180s role=Slave timeout=150s
 
 ms ms-drbd-sas0 drbd-sas0 \
meta clone-max=2 clone-node-max=1 master-max=2 master-node-max=1 
 notify=true globally-unique=false interleave=true target-role=Started
 
 
 here's the dmesg output when pacemaker tries to promote drbd, causing the 
 splitbrain:
 [  157.646292] block drbd2: Starting worker thread (from drbdsetup [6892])
 [  157.646539] block drbd2: disk( Diskless -> Attaching ) 
 [  157.650364] block drbd2: Found 1 transactions (1 active extents) in 
 activity log.
 [  157.650560] block drbd2: Method to ensure write ordering: drain
 [  157.650688] block drbd2: drbd_bm_resize called with capacity == 584667688
 [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 
 pages=2231
 [  157.653760] block drbd2: size = 279 GB (292333844 KB)
 [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
 [  157.673722] block drbd2: recounting of set bits took additional 2 jiffies
 [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk 
 bit-map.
 [  157.673972] block drbd2: disk( Attaching -> UpToDate ) 
 [  157.674100] block drbd2: attached to UUIDs 
 0150944D23F16BAE::8C175205284E3262:8C165205284E3263
 [  157.685539] block drbd2: conn( StandAlone -> Unconnected ) 
 [  157.685704] block drbd2: Starting receiver thread (from drbd2_worker 
 [6893])
 [  157.685928] block drbd2: receiver (re)started
 [  157.686071] block drbd2: conn( Unconnected -> WFConnection ) 
 [  158.960577] block drbd2: role( Secondary -> Primary ) 
 [  158.960815] block drbd2: new current UUID 
 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
 [  162.686990] block drbd2: Handshake successful: Agreed network protocol 
 version 96
 [  162.687183] block drbd2: conn( WFConnection -> WFReportParams ) 
 [  162.687404] block drbd2: Starting asender thread (from drbd2_receiver 
 [6927])
 [  162.687741] block drbd2: data-integrity-alg: not-used
 [  162.687930] block drbd2: drbd_sync_handshake:
 [  162.688057] block drbd2: self 
 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 bits:0 
 flags:0
 [  162.688244] block drbd2: peer 
 7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 bits:0 
 flags:0
 [  162.688428] block drbd2: uuid_compare()=100 by rule 90
 [  162.688544] block drbd2: helper command: /sbin/drbdadm initial-split-brain 
 minor-2
 [  162.691332] block drbd2: helper command: /sbin/drbdadm initial-split-brain 
 minor-2 exit code 0 (0x0)
 
 to me it seems that it's promoting it too early, and I also wonder why
 there is the
 new current UUID stuff?
 
 I'm using centos6, kernel 3.0.36, drbd-8.3.13, pacemaker-1.1.6
 
 could anybody please try to advise me? I'm sure I'm doing something stupid,
 but can't figure out what...
 
 thanks a lot in advance
 
 with best regards
 
 nik
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 






