Well, I think the heartbeat communication between the nodes was
interrupted by the network outage on one node, and DRBD then found
both nodes trying to access the disk. Have you configured STONITH?
On 15/02/2008, at 3:40, Doug Lochart wrote:
I was following a few tutorials and had everything ready to test. The
tutorial said to yank the power cord, but my sysadmin did not want me
to do that, so I simply pulled out the ethernet cables (one went to a
switched network, the other was a crossover between the two nodes for
heartbeats). Server 2 did not take over (I had an init script error
that I forgot to fix). I plugged the cables back in and restarted
both machines. A cat of /proc/drbd displayed this:
GIT-hash: bd3e2c922f95c4fa0dca57a4f8c24bf8b249cc02 build by [EMAIL PROTECTED], 2008-02-01 07:33:35
0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r---
ns:0 nr:0 dw:4 dr:249 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/257 hits:1 misses:0 starving:0 dirty:0 changed:0
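For readers new to this output: in DRBD 8.0 the status line carries the connection state (cs), the local/peer roles (st) and the local/peer disk states (ds). The sketch below just pulls those three fields out of the line pasted above. The `field` helper is a made-up name for illustration, not part of DRBD.

```shell
#!/bin/sh
# Status line copied from the /proc/drbd output above.
line='0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r---'

# field NAME LINE: print the value that follows "NAME:" in LINE.
# (Hypothetical helper, written for this example only.)
field() {
    printf '%s\n' "$2" | sed -n "s/.*[[:space:]]$1:\([^[:space:]]*\).*/\1/p"
}

cs=$(field cs "$line")
st=$(field st "$line")
ds=$(field ds "$line")

echo "connection state:   $cs"
echo "roles (local/peer): $st"
echo "disks (local/peer): $ds"

# StandAlone means this node has no connection to its peer and will
# not try to reconnect until it is told to.
if [ "$cs" = "StandAlone" ]; then
    echo "node is StandAlone: not replicating to the peer"
fi
```

On a live node the same helper could be fed from `grep ' cs:' /proc/drbd` instead of the hard-coded line.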
I looked in dmesg and saw this:
-----------------------------------------------------------------------------------------------------
drbd: initialised. Version: 8.0.8 (api:86/proto:86)
drbd: GIT-hash: bd3e2c922f95c4fa0dca57a4f8c24bf8b249cc02 build by [EMAIL PROTECTED], 2008-02-01 07:33:35
drbd: registered as block device major 147
drbd: minor_table @ 0xffff810073c19680
drbd0: disk( Diskless -> Attaching )
drbd0: Found 6 transactions (276 active extents) in activity log.
drbd0: max_segment_size ( = BIO size ) = 32768
drbd0: drbd_bm_resize called with capacity == 1953042632
drbd0: resync bitmap: bits=244130329 words=3814537
drbd0: size = 931 GB (976521316 KB)
drbd0: reading of bitmap took 580 jiffies
drbd0: recounting of set bits took additional 28 jiffies
drbd0: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
drbd0: disk( Attaching -> UpToDate )
drbd0: Writing meta data super block now.
drbd0: conn( StandAlone -> Unconnected )
drbd0: receiver (re)started
drbd0: conn( Unconnected -> WFConnection )
drbd0: Handshake successful: DRBD Network Protocol version 86
drbd0: conn( WFConnection -> WFReportParams )
drbd0: Split-Brain detected, dropping connection!
drbd0: self 6A01F5A6BB510A26:19D4D8C195E805A7:483E5FB7A5527AAD:0000000000000004
drbd0: peer 9C679A93F13525CE:19D4D8C195E805A6:483E5FB7A5527AAC:0000000000000004
drbd0: conn( WFReportParams -> Disconnecting )
drbd0: helper command: /sbin/drbdadm split-brain
drbd0: error receiving ReportState, l: 4!
drbd0: asender terminated
drbd0: tl_clear()
drbd0: Connection closed
drbd0: conn( Disconnecting -> StandAlone )
drbd0: receiver terminated
drbd0: role( Secondary -> Primary )
drbd0: Writing meta data super block now.
kjournald starting. Commit interval 5 seconds
EXT3 FS on drbd0, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
-------------------------------------------------------------------------
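The "self" and "peer" lines in the log hold each node's data generation identifiers. As far as I understand DRBD 8, the first field is the current UUID and the second the bitmap UUID, and the lowest bit is a flag rather than part of the identifier. Under those assumptions, the sketch below shows why this log reads as a split brain: the current UUIDs differ while the bitmap UUIDs agree, so both nodes wrote independently from the same parent generation. This is a simplified illustration, not DRBD's full comparison logic; `norm` and `nth` are made-up helper names.

```shell
#!/bin/sh
# UUID sets copied from the dmesg output above.
self='6A01F5A6BB510A26:19D4D8C195E805A7:483E5FB7A5527AAD:0000000000000004'
peer='9C679A93F13525CE:19D4D8C195E805A6:483E5FB7A5527AAC:0000000000000004'

# norm UUID: clear the lowest bit of a hex UUID (assumed to be a flag).
# Only the last hex digit is touched, so no 64-bit shell arithmetic
# is needed.
norm() {
    prefix=${1%?}
    last=${1#"$prefix"}
    printf '%s%X\n' "$prefix" $(( 0x$last & 0xE ))
}

# nth SET N: print the N-th colon-separated field of a UUID set.
nth() {
    printf '%s\n' "$1" | cut -d: -f"$2"
}

self_cur=$(norm "$(nth "$self" 1)")
peer_cur=$(norm "$(nth "$peer" 1)")
self_bm=$(norm "$(nth "$self" 2)")
peer_bm=$(norm "$(nth "$peer" 2)")

if [ "$self_cur" = "$peer_cur" ]; then
    verdict="same generation"
elif [ "$self_bm" = "$peer_bm" ]; then
    verdict="split brain, common parent"
else
    verdict="different current and bitmap UUIDs"
fi

echo "current: $self_cur vs $peer_cur"
echo "bitmap:  $self_bm vs $peer_bm"
echo "verdict: $verdict"
```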
So I searched the mailing list for split-brain, and many of the posts I
found say that doing what I did (yanking both cables) will cause a
split-brain. WTF??
I am using drbd 8.0.8 and heartbeat 2.1.3_3 with a version 1
(haresources-style) configuration.
I am really confused. I am following a tutorial and I go right into a
split brain. I can't see how it would have been any different if I
yanked the power cord versus yanking the cables. I thought this is
what heartbeat was supposed to handle?
How do I recover? No data was lost as I was just doing an initial
test.
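On the recovery question, a hedged sketch: manual split-brain recovery in DRBD 8.0 comes down to choosing one node whose changes are thrown away, demoting it, and reconnecting it with `--discard-my-data`, which is the same `drbdadm` call that appears in the commented-out split-brain handler in the config further down. The `run`, `recover_victim` and `recover_survivor` helpers and the `DRY_RUN` switch are invented for this example. It assumes the resource is r0, that the filesystem is unmounted on the node being discarded, and that heartbeat is not managing the resource while you do this.

```shell
#!/bin/sh
# Sketch of manual split-brain recovery for DRBD 8.0.x.
# RES is the resource name from drbd.conf (r0 in the posted config).
RES="${RES:-r0}"
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on the node whose changes are to be discarded (the "victim").
recover_victim() {
    run drbdadm secondary "$RES"
    run drbdadm -- --discard-my-data connect "$RES"
}

# Run on the node whose data is kept, if it is also StandAlone.
recover_survivor() {
    run drbdadm connect "$RES"
}
```

Calling `recover_victim` on one node and `recover_survivor` on the other prints the commands. Set `DRY_RUN=0` only after you are sure which node's data you want to lose.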
What is it I need to do to prevent a split-brain from happening again?
Is there a good place to go to read about avoiding this situation?
Much of the info I have found jumps right in as if you are already a
master of this stuff.
In case you need them my configs follow:
regards,
Douglas Lochart
haresources:
capestor1 IPaddr::10.3.120.140/24/eth0 drbddisk::r0 Filesystem::/dev/drbd0::/capestor::ext3 capestor-server
drbd.conf
-------------
global {
    usage-count yes;
}
common {
    syncer { rate 10M; }
}
resource r0 {
    protocol C;
    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        # these were commented out in the examples
        # outdate-peer "/usr/lib/drbd/outdate-peer.sh on amd 192.168.22.11 192.168.23.11 on alf 192.168.22.12 192.168.23.12";
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        #pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root";
        # Notify someone in case DRBD split brained.
        #split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
    }
    startup {
        degr-wfc-timeout 120; # 2 minutes.
        # become-primary-on both;
    }
    disk {
        on-io-error detach;
        # fencing resource-only;
        # size 10G;
    }
    net {
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        rr-conflict disconnect;
    }
    syncer {
        rate 10M;
        # should this be 263168 for one terabyte?
        al-extents 257;
    }
    ##################################################################
    # Setup capestor1
    ##################################################################
    on capestor1 {
        device /dev/drbd0;
        disk /dev/sdb1;
        address 10.3.120.134:7788;
        flexible-meta-disk internal;
    }
    on capestor2 {
        device /dev/drbd0;
        disk /dev/sdb1;
        address 10.3.120.135:7788;
        meta-disk internal;
    }
}
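One detail in the drbd.conf above may matter for prevention: the outdate-peer handler is set, but the `fencing` line in the disk section is commented out. As far as I know, DRBD only calls that handler when the fencing policy is `resource-only` or `resource-and-stonith`, so with the default policy it would never run. The sketch below checks a config file for an active fencing line. `fencing_enabled` is a made-up helper and `/etc/drbd.conf` is an assumed default path.

```shell
#!/bin/sh
# fencing_enabled FILE: succeed if FILE has an uncommented fencing
# policy that triggers the outdate-peer handler.
# (Hypothetical helper, written for this example only.)
fencing_enabled() {
    grep -Eq '^[[:space:]]*fencing[[:space:]]+(resource-only|resource-and-stonith)[[:space:]]*;' "$1"
}

# Assumed default location; override with CONF=/path/to/drbd.conf.
CONF="${CONF:-/etc/drbd.conf}"

if [ -r "$CONF" ]; then
    if fencing_enabled "$CONF"; then
        echo "$CONF: fencing policy is active"
    else
        echo "$CONF: no active fencing policy, outdate-peer will not be called"
    fi
fi
```

Even with fencing enabled, drbd-peer-outdater needs some working path to the peer (I believe through heartbeat's dopd daemon). With both cables pulled no such path exists, which is where STONITH comes in.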
--
What profits a man if he gains the whole world yet loses his soul?
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Luis Martín-Santos García - [EMAIL PROTECTED]
Director Técnico - Chief Technical Officer
Webalianza Consultoría Tecnológica - IT Consulting
------------------------------------------------------------------
Portuetxe Bidea 23 Edificio CEMEI Piso 3 Oficina 4
20018 Donostia-San Sebastian, Gipuzkoa, Spain.
Tfn. +34 902 364 368 ext 02.
Correo de Voz: +34 911 519 754
Fax +34 943 212 920