Hi all,
Digimer, thank you very much for your response. Please see my answers below:
# cat /etc/drbd.d/global_common.conf
global {
usage-count yes;
}
common {
protocol C;
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
}
startup {
wfc-timeout 100;
degr-wfc-timeout 60;
become-primary-on both;
}
disk {
# on-io-error fencing use-bmbv no-disk-barrier
no-disk-flushes
# no-disk-drain no-md-flushes max-bio-bvecs
}
net {
allow-two-primaries;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
ping-timeout 20;
}
syncer {
rate 110M;
}
}
#
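One thing I noticed while re-reading the config: the handlers section has no split-brain hook, so nobody gets notified when this happens. A sketch of what I believe the DRBD 8.3 syntax looks like (the notify script path is the one shipped with the drbd83 package; please correct me if I have this wrong):

```
handlers {
        # Assumed DRBD 8.3 syntax: mail root a notification when a
        # split-brain is detected but cannot be resolved automatically.
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
}
```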
Here are my answers to your questions:
1) This is definitely a split-brain, not a network problem. As I showed
in my previous message, I can ping both cluster members and the firewall
is open. When I use telnet and a sniffer, I can see the nodes trying to
establish a network connection, but they only send reject (RST) packets.
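If it helps, here is the quick TCP probe I used from the shell (a sketch; the peer address 10.10.24.11 and port 7789 come from my r0.res, and it relies on bash's /dev/tcp, so it needs bash, not plain sh). My understanding is that once DRBD drops to StandAlone after the unresolved split-brain it no longer listens on 7789, so "Connection refused" would be expected there rather than a firewall issue:

```shell
#!/bin/bash
# Quick TCP probe (sketch). Prints "open" if something accepts the
# connection, "closed" if it is refused or unreachable.
probe() {
    local host="$1" port="$2"
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
        exec 3>&-       # close the probe connection again
        echo open
    else
        echo closed
    fi
}

probe 127.0.0.1 1       # a port with no listener -> prints "closed"
# For the real check, run on either node:
#   probe 10.10.24.11 7789
```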
2) Here is the relevant info from the /var/log/messages file:
Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: initialized.
Version: 8.3.8 (api:88/proto:86-94)
Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: GIT-hash:
d78846e52224fd00562f7c225bcc25b2d422321d build by
[email protected], 2010-06-04 08:04:09
Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: registered as block
device major 147
Dec 2 10:03:59 infplsm018 <kern.info> kernel: drbd: minor_table @
0xffff8101371471c0
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: Starting
worker thread (from cqueue/0 [213])
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: disk(
Diskless -> Attaching )
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: Found 4
transactions (70 active extents) in activity log.
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: Method to
ensure write ordering: barrier
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1:
max_segment_size ( = BIO size ) = 32768
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1:
drbd_bm_resize called with capacity == 629118192
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: resync
bitmap: bits=78639774 words=1228747
Dec 2 10:03:59 infplsm018 <kern.info> kernel: block drbd1: size = 300
GB (314559096 KB)
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: recounting
of set bits took additional 10 jiffies
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: 0 KB (0
bits) marked out-of-sync by on disk bit-map.
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Marked
additional 252 MB as out-of-sync based on AL.
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: disk(
Attaching -> UpToDate )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn(
StandAlone -> Unconnected )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Starting
receiver thread (from drbd1_worker [3435])
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: receiver
(re)started
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn(
Unconnected -> WFConnection )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Handshake
successful: Agreed network protocol version 94
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn(
WFConnection -> WFReportParams )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Starting
asender thread (from drbd1_receiver [3443])
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1:
data-integrity-alg: <not-used>
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1:
drbd_sync_handshake:
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: self
B3DE46FD85A4C304:D3D8A848BA989089:F5DB2DE79EFEC3E5:AE5C6A69A1F93A43
bits:64512 flags:0
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: peer
CAD3EACF4FCC5066:D3D8A848BA989089:F5DB2DE79EFEC3E4:AE5C6A69A1F93A43
bits:130048 flags:2
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1:
uuid_compare()=100 by rule 90
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper
command: /sbin/drbdadm initial-split-brain minor-1
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper
command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Dec 2 10:04:00 infplsm018 <kern.alert> kernel: block drbd1: Split-Brain
detected but unresolved, dropping connection!
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper
command: /sbin/drbdadm split-brain minor-1
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: helper
command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn(
WFReportParams -> Disconnecting )
Dec 2 10:04:00 infplsm018 <kern.err> kernel: block drbd1: error
receiving ReportState, l: 4!
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: asender
terminated
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Terminating
asender thread
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Connection
closed
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: conn(
Disconnecting -> StandAlone )
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: receiver
terminated
Dec 2 10:04:00 infplsm018 <kern.info> kernel: block drbd1: Terminating
receiver thread
Dec 2 10:04:01 infplsm018 <kern.info> kernel: block drbd1: role(
Secondary -> Primary )
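For what it's worth, the manual recovery I am planning to try, based on my reading of the DRBD 8.3 User's Guide, is to demote the node whose changes I discard and reconnect it with --discard-my-data (resource name r0 from my config; please tell me if I have picked the wrong procedure). A small sketch that only prints the per-role commands rather than running them:

```shell
#!/bin/sh
# Sketch: emit the manual split-brain recovery steps (DRBD 8.3 syntax).
# "victim"   = the node whose divergent writes get thrown away,
# "survivor" = the node whose data is kept.
# This function only prints the commands; nothing is executed.
recovery_cmds() {
    role="$1" res="$2"
    if [ "$role" = "victim" ]; then
        echo "drbdadm secondary $res"
        echo "drbdadm connect --discard-my-data $res"
    else
        # Only needed if the survivor also dropped to StandAlone:
        echo "drbdadm connect $res"
    fi
}

recovery_cmds victim r0
recovery_cmds survivor r0
```

Since we run dual-primary (become-primary-on both) for GFS2, I assume the victim also has to unmount /home and stop the cluster service before it can be demoted to Secondary.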
3) And here is my /etc/cluster/cluster.conf file:
# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="newnfscl" config_version="224" name="newnfscl">
<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="20"/>
<clusternodes>
<clusternode name="infplsm017-clust" nodeid="1" votes="1">
<multicast addr="224.0.0.1" interface="eth2"/>
<fence>
<method name="1">
<device name="manfence" nodename="infplsm017-clust"/>
</method>
</fence>
</clusternode>
<clusternode name="infplsm018-clust" nodeid="2" votes="1">
<multicast addr="224.0.0.1" interface="eth2"/>
<fence>
<method name="1">
<device name="manfence" nodename="infplsm018-clust"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1">
<multicast addr="224.0.0.1"/>
</cman>
<fencedevices>
<fencedevice agent="fence_null" name="nullfence"/>
<fencedevice agent="fence_manual" name="manfence"/>
</fencedevices>
<rm log_facility="syslog" log_level="7">
<failoverdomains>
<failoverdomain name="Test Domain" nofailback="0" ordered="1"
restricted="1">
<failoverdomainnode name="infplsm017-clust" priority="1"/>
<failoverdomainnode name="infplsm018-clust" priority="1"/>
</failoverdomain>
</failoverdomains>
<resources>
<clusterfs device="/dev/Shared/home" force_unmount="0" fsid="58812"
fstype="gfs2" mountpoint="/home" name="homegfs" options="rw,localflocks"
self_fence="0"/>
<nfsexport name="homenfs"/>
<nfsclient allow_recover="1" name="nfsclient" options="rw" target="*"/>
<ip address="10.10.28.15" monitor_link="1"/>
</resources>
<service autostart="1" exclusive="0" name="nfs-over-gfs2" nfslock="1"
recovery="relocate">
<clusterfs ref="homegfs">
<nfsexport ref="homenfs">
<nfsclient ref="nfsclient"/>
</nfsexport>
</clusterfs>
<ip ref="10.10.28.15"/>
</service>
</rm>
<logging debug="on" logfile_priority="debug" syslog_facility="daemon"
syslog_priority="info" to_logfile="yes" to_syslog="yes">
<logging_daemon logfile="/var/log/cluster/qdiskd.log" name="qdiskd"/>
<logging_daemon logfile="/var/log/cluster/fenced.log" name="fenced"/>
<logging_daemon logfile="/var/log/cluster/dlm_controld.log"
name="dlm_controld"/>
<logging_daemon logfile="/var/log/cluster/gfs_controld.log"
name="gfs_controld"/>
<logging_daemon logfile="/var/log/cluster/rgmanager.log" name="rgmanager"/>
<logging_daemon logfile="/var/log/cluster/corosync.log" name="corosync"/>
</logging>
</cluster>
On 12/02/2011 03:05 PM, Digimer wrote:
On 12/01/2011 07:30 PM, Ivan Pavlenko wrote:
Hi ALL,
Could you help me to fix a problem with split brain, please?
I have Red Hat cluster based on RHEL 5.7 and provide nfs-over-gfs2
service. I use DRBD as a storage.
# cat /etc/drbd.conf
#
# please have a a look at the example configuration file in
# /usr/share/doc/drbd83/drbd.conf
#
include "/etc/drbd.d/global_common.conf";
This is a good file to see. Can you share it, please?
include "/etc/drbd.d/r0.res";
# cat /etc/drbd.d/r0.res
resource r0 {
on infplsm017 {
device /dev/drbd1;
disk /dev/sdb1;
address 10.10.24.10:7789;
meta-disk internal;
}
on infplsm018 {
device /dev/drbd1;
disk /dev/sdb1;
address 10.10.24.11:7789;
meta-disk internal;
}
}
As you can see, there is nothing sophisticated here.
I have:
# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by
[email protected], 2010-06-04 08:04:09
1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r----
ns:0 nr:0 dw:0 dr:332 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b
oos:524288
# ping 10.10.24.11
PING 10.10.24.11 (10.10.24.11) 56(84) bytes of data.
64 bytes from 10.10.24.11: icmp_seq=1 ttl=64 time=2.99 ms
64 bytes from 10.10.24.11: icmp_seq=2 ttl=64 time=13.9 ms
But when I try to telnet to port 7789 I get:
# telnet 10.10.24.11 7789
Trying 10.10.24.11...
telnet: connect to address 10.10.24.11: Connection refused
telnet: Unable to connect to remote host: Connection refused
But at the same time:
# service iptables status
Table: filter
Chain INPUT (policy ACCEPT)
num target prot opt source destination
Chain FORWARD (policy ACCEPT)
num target prot opt source destination
Chain OUTPUT (policy ACCEPT)
num target prot opt source destination
I ran this from my first server (INFPLSM017), and I get exactly the same
result from the second one (INFPLSM018). Could you tell me, please, what
the possible reason for this problem is and how I can fix it?
Thank you in advance,
Ivan
Is this a network or split-brain problem?
What happens when you try to connect?
What state is the other node in?
Anything interesting in /var/log/messages?
How does DRBD tie into the cluster? What is the cluster's configuration?
Are you using fencing?
More details are needed to provide assistance.
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user