Dear DRBD community,
We've had a lot of disconnections over 2 days on a DRBD resource, and
one of them led to a full sync that was probably not needed.
We're using DRBD v8.3.11 back-ported to Ubuntu Lucid. We have already
faced this problem with v8.3.7 too.
I have read that there used to be race conditions in the connection code
and that most were fixed.
Maybe this is one? Is there anything I can do to help fix this?
I'm attaching extracts from the logs and the config files.
Lionel Sausin.
Jul 16 23:27:14 StockageBerlioz kernel: [120591.193239] block drbd1: conn(
Ahead -> SyncSource ) pdsk( Consistent -> Inconsistent )
Jul 16 23:27:14 StockageBerlioz kernel: [120591.193250] block drbd1: Began
resync as SyncSource (will sync 2936 KB [734 bits set]).
Jul 16 23:27:14 StockageBerlioz kernel: [120591.259136] block drbd1: updated
sync UUID 19F5DB466A684489:0001000000000000:0001000000000000:0001000000000000
Jul 16 23:27:22 StockageBerlioz kernel: [120598.524421] block drbd1:
Congestion-extents threshold reached
Jul 16 23:27:22 StockageBerlioz kernel: [120598.524432] block drbd1: conn(
SyncSource -> Ahead )
Jul 16 23:27:27 StockageBerlioz kernel: [120603.692343] block drbd1: helper
command: /sbin/drbdadm before-resync-source minor-1
Jul 16 23:27:27 StockageBerlioz kernel: [120603.696592] block drbd1: helper
command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
Jul 16 23:27:27 StockageBerlioz kernel: [120603.696604] block drbd1: conn(
Ahead -> SyncSource )
Jul 16 23:27:27 StockageBerlioz kernel: [120603.696614] block drbd1: Began
resync as SyncSource (will sync 1968 KB [492 bits set]).
Jul 16 23:27:27 StockageBerlioz kernel: [120603.810379] block drbd1: updated
sync UUID 19F5DB466A684489:0002000000000000:0001000000000000:0001000000000000
Jul 16 23:27:31 StockageBerlioz kernel: [120607.622193] block drbd1: Resync
done (total 3 sec; paused 0 sec; 656 K/sec)
Jul 16 23:27:31 StockageBerlioz kernel: [120607.622202] block drbd1: 0 % had
equal checksums, eliminated: 0K; transferred 1968K total 1968K
Jul 16 23:27:31 StockageBerlioz kernel: [120607.622211] block drbd1: updated
UUIDs 19F5DB466A684489:0000000000000000:0002000000000000:0001000000000000
Jul 16 23:27:31 StockageBerlioz kernel: [120607.622222] block drbd1: conn(
SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Jul 16 23:27:31 StockageBerlioz kernel: [120607.676864] block drbd1: bitmap
WRITE of 0 pages took 0 jiffies
Jul 16 23:27:31 StockageBerlioz kernel: [120607.677098] block drbd1: 0 KB (0
bits) marked out-of-sync by on disk bit-map.
Jul 16 23:27:32 StockageBerlioz kernel: [120608.568432] block drbd1: sock was
shut down by peer
Jul 16 23:27:32 StockageBerlioz kernel: [120608.568444] block drbd1: peer(
Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate ->
DUnknown )
Jul 16 23:27:32 StockageBerlioz kernel: [120608.568461] block drbd1: short read
expecting header on sock: r=0
Jul 16 23:27:32 StockageBerlioz kernel: [120608.568536] block drbd1: new
current UUID 547210AA10A51D3F:19F5DB466A684489:0002000000000000:0001000000000000
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583144] block drbd1: asender
terminated
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583159] block drbd1:
Terminating asender thread
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583405] block drbd1: Connection
closed
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583416] block drbd1: conn(
BrokenPipe -> Unconnected )
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583427] block drbd1: receiver
terminated
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583434] block drbd1: Restarting
receiver thread
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583441] block drbd1: receiver
(re)started
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583452] block drbd1: conn(
Unconnected -> WFConnection )
Jul 16 23:27:32 StockageBerlioz kernel: [120609.319689] block drbd1: Handshake
successful: Agreed network protocol version 96
Jul 16 23:27:32 StockageBerlioz kernel: [120609.321064] block drbd1: Peer
authenticated using 16 bytes of 'md5' HMAC
Jul 16 23:27:32 StockageBerlioz kernel: [120609.321076] block drbd1: conn(
WFConnection -> WFReportParams )
Jul 16 23:27:32 StockageBerlioz kernel: [120609.321162] block drbd1: Starting
asender thread (from drbd1_receiver [26168])
Jul 16 23:27:32 StockageBerlioz kernel: [120609.322421] block drbd1:
data-integrity-alg: <not-used>
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499033] block drbd1:
drbd_sync_handshake:
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499042] block drbd1: self
547210AA10A51D3F:19F5DB466A684489:0002000000000000:0001000000000000 bits:99
flags:0
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499050] block drbd1: peer
0002000000000000:0000000000000000:0001000000000000:0001000000000000 bits:0
flags:0
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499057] block drbd1:
uuid_compare()=2 by rule 80
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499061] block drbd1: Becoming
sync source due to disk states.
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499065] block drbd1: Writing
the whole bitmap, full sync required after drbd_sync_handshake.
Jul 16 23:27:35 StockageBerlioz kernel: [120612.393208] block drbd1: bitmap
WRITE of 31328 pages took 285 jiffies
Jul 16 23:27:35 StockageBerlioz kernel: [120612.393660] block drbd1: 3916 GB
(1026524567 bits) marked out-of-sync by on disk bit-map.
Jul 16 23:27:14 StockageAuric kernel: [120934.167573] block drbd1: updated sync
uuid 0001000000000000:0000000000000000:0001000000000000:0001000000000000
Jul 16 23:27:14 StockageAuric kernel: [120934.218082] block drbd1: helper
command: /sbin/drbdadm before-resync-target minor-1
Jul 16 23:27:14 StockageAuric kernel: [120934.222298] block drbd1: helper
command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
Jul 16 23:27:14 StockageAuric kernel: [120934.222312] block drbd1: conn( Behind
-> SyncTarget ) disk( Outdated -> Inconsistent )
Jul 16 23:27:14 StockageAuric kernel: [120934.222333] block drbd1: Began resync
as SyncTarget (will sync 2936 KB [734 bits set]).
Jul 16 23:27:24 StockageAuric kernel: [120943.508206] block drbd1: conn(
SyncTarget -> Behind )
Jul 16 23:27:27 StockageAuric kernel: [120946.833255] block drbd1: updated sync
uuid 0002000000000000:0000000000000000:0001000000000000:0001000000000000
Jul 16 23:27:27 StockageAuric kernel: [120946.902902] block drbd1: helper
command: /sbin/drbdadm before-resync-target minor-1
Jul 16 23:27:27 StockageAuric kernel: [120946.907320] block drbd1: helper
command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
Jul 16 23:27:27 StockageAuric kernel: [120946.907333] block drbd1: conn( Behind
-> SyncTarget )
Jul 16 23:27:27 StockageAuric kernel: [120946.907346] block drbd1: Began resync
as SyncTarget (will sync 1968 KB [492 bits set]).
Jul 16 23:27:32 StockageAuric kernel: [120951.331738] block drbd1: PingAck did
not arrive in time.
Jul 16 23:27:32 StockageAuric kernel: [120951.333512] block drbd1: peer(
Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate ->
DUnknown )
Jul 16 23:27:32 StockageAuric kernel: [120951.334471] block drbd1: bitmap WRITE
of 0 pages took 0 jiffies
Jul 16 23:27:32 StockageAuric kernel: [120951.379697] block drbd1: asender
terminated
Jul 16 23:27:32 StockageAuric kernel: [120951.379711] block drbd1: Terminating
asender thread
Jul 16 23:27:32 StockageAuric kernel: [120951.379794] block drbd1: 0 KB (0
bits) marked out-of-sync by on disk bit-map.
Jul 16 23:27:32 StockageAuric kernel: [120951.379842] block drbd1: Connection
closed
Jul 16 23:27:32 StockageAuric kernel: [120951.379854] block drbd1: conn(
NetworkFailure -> Unconnected )
Jul 16 23:27:32 StockageAuric kernel: [120951.379868] block drbd1: receiver
terminated
Jul 16 23:27:32 StockageAuric kernel: [120951.379873] block drbd1: Restarting
receiver thread
Jul 16 23:27:32 StockageAuric kernel: [120951.379878] block drbd1: receiver
(re)started
Jul 16 23:27:32 StockageAuric kernel: [120951.379886] block drbd1: conn(
Unconnected -> WFConnection )
Jul 16 23:27:32 StockageAuric kernel: [120952.130971] block drbd1: Handshake
successful: Agreed network protocol version 96
Jul 16 23:27:32 StockageAuric kernel: [120952.133305] block drbd1: Peer
authenticated using 16 bytes of 'md5' HMAC
Jul 16 23:27:32 StockageAuric kernel: [120952.133319] block drbd1: conn(
WFConnection -> WFReportParams )
Jul 16 23:27:32 StockageAuric kernel: [120952.133406] block drbd1: Starting
asender thread (from drbd1_receiver [1628])
Jul 16 23:27:32 StockageAuric kernel: [120952.133592] block drbd1:
data-integrity-alg: <not-used>
Jul 16 23:27:32 StockageAuric kernel: [120952.134694] block drbd1:
drbd_sync_handshake:
Jul 16 23:27:32 StockageAuric kernel: [120952.134704] block drbd1: self
0002000000000000:0000000000000000:0001000000000000:0001000000000000 bits:0
flags:0
Jul 16 23:27:32 StockageAuric kernel: [120952.134712] block drbd1: peer
547210AA10A51D3F:19F5DB466A684489:0002000000000000:0001000000000000 bits:99
flags:0
Jul 16 23:27:32 StockageAuric kernel: [120952.134719] block drbd1:
uuid_compare()=-2 by rule 60
Jul 16 23:27:32 StockageAuric kernel: [120952.134723] block drbd1: Becoming
sync target due to disk states.
Jul 16 23:27:32 StockageAuric kernel: [120952.134727] block drbd1: Writing the
whole bitmap, full sync required after drbd_sync_handshake.
Jul 16 23:27:35 StockageAuric kernel: [120954.954310] block drbd1: bitmap WRITE
of 31328 pages took 276 jiffies
Jul 16 23:27:36 StockageAuric kernel: [120955.530826] block drbd1: 3916 GB
(1026524567 bits) marked out-of-sync by on disk bit-map.
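In case it helps others read the two drbd_sync_handshake dumps above, here is a minimal Python sketch (not part of DRBD) that rejoins the wrapped log lines, extracts the "self" and "peer" UUID sets, and lists which slots hold identical values. I am assuming the four fields are printed in the order current, bitmap, then two history slots; the slot names and the helper functions are mine, and the comparison is a plain string match, not a reimplementation of uuid_compare().

```python
import re

# Slot order is an assumption about how DRBD 8.3 prints the UUID set.
SLOTS = ("current", "bitmap", "history1", "history2")

TIMESTAMP = re.compile(r"[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d ")
UUID_LINE = re.compile(
    r"block (?P<dev>drbd\d+): (?P<side>self|peer) "
    r"(?P<uuids>[0-9A-F]{16}(?::[0-9A-F]{16}){3}) bits:(?P<bits>\d+)"
)

# The handshake dump from StockageBerlioz, wrapped as in the mail.
SAMPLE = """\
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499042] block drbd1: self
547210AA10A51D3F:19F5DB466A684489:0002000000000000:0001000000000000 bits:99
flags:0
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499050] block drbd1: peer
0002000000000000:0000000000000000:0001000000000000:0001000000000000 bits:0
flags:0
"""


def unwrap(text):
    """Rejoin syslog entries that were wrapped onto several lines."""
    entries = []
    for line in text.splitlines():
        if TIMESTAMP.match(line) or not entries:
            entries.append(line.strip())
        else:
            entries[-1] += " " + line.strip()
    return entries


def handshake_uuids(text):
    """Return the first 'self' and 'peer' UUID dumps found in the text."""
    found = {}
    for entry in unwrap(text):
        m = UUID_LINE.search(entry)
        if m and m.group("side") not in found:
            uuids = dict(zip(SLOTS, m.group("uuids").split(":")))
            found[m.group("side")] = {"uuids": uuids, "bits": int(m.group("bits"))}
    return found


def matching_slots(found):
    """List (self_slot, peer_slot) pairs whose UUID strings are identical."""
    return [
        (s, p)
        for s, sv in found["self"]["uuids"].items()
        for p, pv in found["peer"]["uuids"].items()
        if sv == pv
    ]


if __name__ == "__main__":
    dump = handshake_uuids(SAMPLE)
    for self_slot, peer_slot in matching_slots(dump):
        print(f"self {self_slot} == peer {peer_slot}")
```

On the StockageBerlioz dump this reports that the peer's first field (0002000000000000) equals our third field, and that neither of our first two fields appears anywhere in the peer's set. Whether that is what pushes the handshake into a full sync is exactly what I would like someone who knows the code to confirm.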
resource berlioz {
    device /dev/drbd_berlioz minor 1;
    disk /dev/data/berlioz;
    meta-disk internal;
    on StockageAuric {
        # StockageAuricSAN
        address 10.100.0.11:7787;
    }
    on StockageBerlioz {
        # StockageBerliozSAN
        address 10.100.0.12:7787;
    }
}
global {
    usage-count yes;
    # minor-count dialog-refresh disable-ip-verification
}
common {
    protocol A;
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        split-brain "/usr/lib/drbd/notify-split-brain.sh [email protected]";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh [email protected]";
        # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }
    startup {
    }
    disk {
        # Those should be safe with our battery backed-up RAID controller
        # TODO: no-disk-flushes
        # TODO: no-md-flushes
    }
    net {
        # Restrict access to the resources with a shared secret
        cram-hmac-alg md5;
        shared-secret d56tuId1bbfa;
        # Congestion management lets writes flow without disconnecting
        on-congestion pull-ahead;
        congestion-fill 1M;
    }
    syncer {
        use-rle;
        # Resync checksumming while verifying used to lead to a deadlock,
        # fixed in v8.3.11
        verify-alg md5;
        csums-alg md5;
        rate 5M;
        # TODO: tune al-extents
    }
}
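
For a sense of scale, here is a rough back-of-the-envelope sketch of what this full sync costs us. It takes the bit count from the "marked out-of-sync" line, the 4 KB per bit that the logs imply (2936 KB for 734 bits), and the configured "rate 5M". I am assuming that 5M means 5 MiB per second and that the resync runs at that rate the whole time, which ignores csums-alg savings and pull-ahead pauses; the function name is mine.

```python
BYTES_PER_BIT = 4 * 1024  # from the logs: 2936 KB synced for 734 bits


def full_sync_seconds(bits_out_of_sync, rate_bytes_per_s):
    """Estimate resync duration at a constant rate (a simplification)."""
    return bits_out_of_sync * BYTES_PER_BIT / rate_bytes_per_s


if __name__ == "__main__":
    # 1026524567 bits out of sync, syncer rate assumed to be 5 MiB/s.
    seconds = full_sync_seconds(1026524567, 5 * 1024 * 1024)
    print(f"{seconds / 86400:.1f} days")  # prints "9.3 days"
```

So under these assumptions a spurious full sync keeps the secondary Inconsistent for more than a week, which is why I would really like to avoid it.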