Dear DRBD community,
We've had a lot of disconnections over 2 days on a DRBD resource, and
one of them has led to a full sync that was probably not needed.
We're using DRBD v8.3.11 back-ported to Ubuntu Lucid. We already
faced this problem with v8.3.7.
I have read that there used to be race conditions in the connection code
and that most were fixed.
Maybe this is one? Is there anything I can do to help fix this?
I'm attaching extracts from the logs and the config files.
Lionel Sausin.
Jul 16 23:27:14 StockageBerlioz kernel: [120591.193239] block drbd1: conn( 
Ahead -> SyncSource ) pdsk( Consistent -> Inconsistent ) 
Jul 16 23:27:14 StockageBerlioz kernel: [120591.193250] block drbd1: Began 
resync as SyncSource (will sync 2936 KB [734 bits set]).
Jul 16 23:27:14 StockageBerlioz kernel: [120591.259136] block drbd1: updated 
sync UUID 19F5DB466A684489:0001000000000000:0001000000000000:0001000000000000
Jul 16 23:27:22 StockageBerlioz kernel: [120598.524421] block drbd1: 
Congestion-extents threshold reached
Jul 16 23:27:22 StockageBerlioz kernel: [120598.524432] block drbd1: conn( 
SyncSource -> Ahead ) 
Jul 16 23:27:27 StockageBerlioz kernel: [120603.692343] block drbd1: helper 
command: /sbin/drbdadm before-resync-source minor-1
Jul 16 23:27:27 StockageBerlioz kernel: [120603.696592] block drbd1: helper 
command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
Jul 16 23:27:27 StockageBerlioz kernel: [120603.696604] block drbd1: conn( 
Ahead -> SyncSource ) 
Jul 16 23:27:27 StockageBerlioz kernel: [120603.696614] block drbd1: Began 
resync as SyncSource (will sync 1968 KB [492 bits set]).
Jul 16 23:27:27 StockageBerlioz kernel: [120603.810379] block drbd1: updated 
sync UUID 19F5DB466A684489:0002000000000000:0001000000000000:0001000000000000
Jul 16 23:27:31 StockageBerlioz kernel: [120607.622193] block drbd1: Resync 
done (total 3 sec; paused 0 sec; 656 K/sec)
Jul 16 23:27:31 StockageBerlioz kernel: [120607.622202] block drbd1: 0 % had 
equal checksums, eliminated: 0K; transferred 1968K total 1968K
Jul 16 23:27:31 StockageBerlioz kernel: [120607.622211] block drbd1: updated 
UUIDs 19F5DB466A684489:0000000000000000:0002000000000000:0001000000000000
Jul 16 23:27:31 StockageBerlioz kernel: [120607.622222] block drbd1: conn( 
SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) 
Jul 16 23:27:31 StockageBerlioz kernel: [120607.676864] block drbd1: bitmap 
WRITE of 0 pages took 0 jiffies
Jul 16 23:27:31 StockageBerlioz kernel: [120607.677098] block drbd1: 0 KB (0 
bits) marked out-of-sync by on disk bit-map.
Jul 16 23:27:32 StockageBerlioz kernel: [120608.568432] block drbd1: sock was 
shut down by peer
Jul 16 23:27:32 StockageBerlioz kernel: [120608.568444] block drbd1: peer( 
Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
Jul 16 23:27:32 StockageBerlioz kernel: [120608.568461] block drbd1: short read 
expecting header on sock: r=0
Jul 16 23:27:32 StockageBerlioz kernel: [120608.568536] block drbd1: new 
current UUID 547210AA10A51D3F:19F5DB466A684489:0002000000000000:0001000000000000
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583144] block drbd1: asender 
terminated
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583159] block drbd1: 
Terminating asender thread
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583405] block drbd1: Connection 
closed
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583416] block drbd1: conn( 
BrokenPipe -> Unconnected ) 
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583427] block drbd1: receiver 
terminated
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583434] block drbd1: Restarting 
receiver thread
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583441] block drbd1: receiver 
(re)started
Jul 16 23:27:32 StockageBerlioz kernel: [120608.583452] block drbd1: conn( 
Unconnected -> WFConnection ) 
Jul 16 23:27:32 StockageBerlioz kernel: [120609.319689] block drbd1: Handshake 
successful: Agreed network protocol version 96
Jul 16 23:27:32 StockageBerlioz kernel: [120609.321064] block drbd1: Peer 
authenticated using 16 bytes of 'md5' HMAC
Jul 16 23:27:32 StockageBerlioz kernel: [120609.321076] block drbd1: conn( 
WFConnection -> WFReportParams ) 
Jul 16 23:27:32 StockageBerlioz kernel: [120609.321162] block drbd1: Starting 
asender thread (from drbd1_receiver [26168])
Jul 16 23:27:32 StockageBerlioz kernel: [120609.322421] block drbd1: 
data-integrity-alg: <not-used>
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499033] block drbd1: 
drbd_sync_handshake:
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499042] block drbd1: self 
547210AA10A51D3F:19F5DB466A684489:0002000000000000:0001000000000000 bits:99 
flags:0
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499050] block drbd1: peer 
0002000000000000:0000000000000000:0001000000000000:0001000000000000 bits:0 
flags:0
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499057] block drbd1: 
uuid_compare()=2 by rule 80
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499061] block drbd1: Becoming 
sync source due to disk states.
Jul 16 23:27:33 StockageBerlioz kernel: [120609.499065] block drbd1: Writing 
the whole bitmap, full sync required after drbd_sync_handshake.
Jul 16 23:27:35 StockageBerlioz kernel: [120612.393208] block drbd1: bitmap 
WRITE of 31328 pages took 285 jiffies
Jul 16 23:27:35 StockageBerlioz kernel: [120612.393660] block drbd1: 3916 GB 
(1026524567 bits) marked out-of-sync by on disk bit-map.
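
As a sanity check on the sizes in this log, here is a minimal Python sketch. It assumes that each bit in DRBD's out-of-sync bitmap covers one 4 KiB block, which is consistent with the "KB [bits set]" pairs logged above. The constant and function names are illustrative only, not part of DRBD.

```python
# Assumption: one bit in the out-of-sync bitmap tracks one 4 KiB block.
KIB_PER_BIT = 4


def bits_to_kib(bits):
    """Convert a count of set bitmap bits to KiB of data to resync."""
    return bits * KIB_PER_BIT


def kib_to_gib(kib):
    """Convert KiB to GiB."""
    return kib / (1024 * 1024)


# Bit counts copied from the log lines above.
partial_1 = bits_to_kib(734)          # first short resync
partial_2 = bits_to_kib(492)          # second short resync
full_kib = bits_to_kib(1026524567)    # after "Writing the whole bitmap"
full_gib = kib_to_gib(full_kib)

print(partial_1, partial_2, round(full_gib))
```

Under that assumption the two short resyncs come out at 2936 KB and 1968 KB, and the last figure at roughly 3916 GB, matching the log: the handshake marked the entire device out of sync although only 99 bits (about 396 KB) were set on this node just before.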

Jul 16 23:27:14 StockageAuric kernel: [120934.167573] block drbd1: updated sync 
uuid 0001000000000000:0000000000000000:0001000000000000:0001000000000000
Jul 16 23:27:14 StockageAuric kernel: [120934.218082] block drbd1: helper 
command: /sbin/drbdadm before-resync-target minor-1
Jul 16 23:27:14 StockageAuric kernel: [120934.222298] block drbd1: helper 
command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
Jul 16 23:27:14 StockageAuric kernel: [120934.222312] block drbd1: conn( Behind 
-> SyncTarget ) disk( Outdated -> Inconsistent ) 
Jul 16 23:27:14 StockageAuric kernel: [120934.222333] block drbd1: Began resync 
as SyncTarget (will sync 2936 KB [734 bits set]).
Jul 16 23:27:24 StockageAuric kernel: [120943.508206] block drbd1: conn( 
SyncTarget -> Behind ) 
Jul 16 23:27:27 StockageAuric kernel: [120946.833255] block drbd1: updated sync 
uuid 0002000000000000:0000000000000000:0001000000000000:0001000000000000
Jul 16 23:27:27 StockageAuric kernel: [120946.902902] block drbd1: helper 
command: /sbin/drbdadm before-resync-target minor-1
Jul 16 23:27:27 StockageAuric kernel: [120946.907320] block drbd1: helper 
command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
Jul 16 23:27:27 StockageAuric kernel: [120946.907333] block drbd1: conn( Behind 
-> SyncTarget ) 
Jul 16 23:27:27 StockageAuric kernel: [120946.907346] block drbd1: Began resync 
as SyncTarget (will sync 1968 KB [492 bits set]).
Jul 16 23:27:32 StockageAuric kernel: [120951.331738] block drbd1: PingAck did 
not arrive in time.
Jul 16 23:27:32 StockageAuric kernel: [120951.333512] block drbd1: peer( 
Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> 
DUnknown ) 
Jul 16 23:27:32 StockageAuric kernel: [120951.334471] block drbd1: bitmap WRITE 
of 0 pages took 0 jiffies
Jul 16 23:27:32 StockageAuric kernel: [120951.379697] block drbd1: asender 
terminated
Jul 16 23:27:32 StockageAuric kernel: [120951.379711] block drbd1: Terminating 
asender thread
Jul 16 23:27:32 StockageAuric kernel: [120951.379794] block drbd1: 0 KB (0 
bits) marked out-of-sync by on disk bit-map.
Jul 16 23:27:32 StockageAuric kernel: [120951.379842] block drbd1: Connection 
closed
Jul 16 23:27:32 StockageAuric kernel: [120951.379854] block drbd1: conn( 
NetworkFailure -> Unconnected ) 
Jul 16 23:27:32 StockageAuric kernel: [120951.379868] block drbd1: receiver 
terminated
Jul 16 23:27:32 StockageAuric kernel: [120951.379873] block drbd1: Restarting 
receiver thread
Jul 16 23:27:32 StockageAuric kernel: [120951.379878] block drbd1: receiver 
(re)started
Jul 16 23:27:32 StockageAuric kernel: [120951.379886] block drbd1: conn( 
Unconnected -> WFConnection ) 
Jul 16 23:27:32 StockageAuric kernel: [120952.130971] block drbd1: Handshake 
successful: Agreed network protocol version 96
Jul 16 23:27:32 StockageAuric kernel: [120952.133305] block drbd1: Peer 
authenticated using 16 bytes of 'md5' HMAC
Jul 16 23:27:32 StockageAuric kernel: [120952.133319] block drbd1: conn( 
WFConnection -> WFReportParams ) 
Jul 16 23:27:32 StockageAuric kernel: [120952.133406] block drbd1: Starting 
asender thread (from drbd1_receiver [1628])
Jul 16 23:27:32 StockageAuric kernel: [120952.133592] block drbd1: 
data-integrity-alg: <not-used>
Jul 16 23:27:32 StockageAuric kernel: [120952.134694] block drbd1: 
drbd_sync_handshake:
Jul 16 23:27:32 StockageAuric kernel: [120952.134704] block drbd1: self 
0002000000000000:0000000000000000:0001000000000000:0001000000000000 bits:0 
flags:0
Jul 16 23:27:32 StockageAuric kernel: [120952.134712] block drbd1: peer 
547210AA10A51D3F:19F5DB466A684489:0002000000000000:0001000000000000 bits:99 
flags:0
Jul 16 23:27:32 StockageAuric kernel: [120952.134719] block drbd1: 
uuid_compare()=-2 by rule 60
Jul 16 23:27:32 StockageAuric kernel: [120952.134723] block drbd1: Becoming 
sync target due to disk states.
Jul 16 23:27:32 StockageAuric kernel: [120952.134727] block drbd1: Writing the 
whole bitmap, full sync required after drbd_sync_handshake.
Jul 16 23:27:35 StockageAuric kernel: [120954.954310] block drbd1: bitmap WRITE 
of 31328 pages took 276 jiffies
Jul 16 23:27:36 StockageAuric kernel: [120955.530826] block drbd1: 3916 GB 
(1026524567 bits) marked out-of-sync by on disk bit-map.
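
To make the handshake easier to read, here is a small Python sketch that looks up where each node's UUIDs appear on the other side. It assumes the four colon-separated fields are ordered current, bitmap, older history, oldest history; the slot names and helper functions are illustrative and are not DRBD's own code.

```python
from typing import List

# Assumed slot order of the four UUID fields printed by drbd_sync_handshake.
SLOT_NAMES = ["current", "bitmap", "history1", "history2"]


def parse_uuids(field: str) -> List[int]:
    """Split 'AAAA:BBBB:CCCC:DDDD' into four integers."""
    return [int(part, 16) for part in field.split(":")]


def find_in_slots(needle: int, slots: List[int]) -> List[str]:
    """Return the names of the slots whose value equals needle."""
    return [SLOT_NAMES[i] for i, value in enumerate(slots) if value == needle]


# UUID sets copied from the handshake lines in the logs above.
berlioz = parse_uuids(
    "547210AA10A51D3F:19F5DB466A684489:0002000000000000:0001000000000000")
auric = parse_uuids(
    "0002000000000000:0000000000000000:0001000000000000:0001000000000000")

auric_current_on_berlioz = find_in_slots(auric[0], berlioz)
berlioz_current_on_auric = find_in_slots(berlioz[0], auric)
berlioz_bitmap_on_auric = find_in_slots(berlioz[1], auric)

print("Auric current found on Berlioz in:", auric_current_on_berlioz)
print("Berlioz current found on Auric in:", berlioz_current_on_auric)
print("Berlioz bitmap UUID found on Auric in:", berlioz_bitmap_on_auric)
```

With the values from the logs, Auric's current UUID (the 0002000000000000 sync UUID it logged at 23:27:27) only shows up in one of Berlioz's history slots, and neither Berlioz's current nor its bitmap UUID shows up on Auric at all. That appears consistent with Auric dropping the connection before it recorded the end of the resync, and with the handshake then falling back to a full sync.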

resource berlioz {
  device      /dev/drbd_berlioz minor 1;
  disk        /dev/data/berlioz;
  meta-disk   internal;
  on StockageAuric {
    # StockageAuricSAN
    address   10.100.0.11:7787;
  }
  on StockageBerlioz {
    # StockageBerliozSAN
    address   10.100.0.12:7787;
  }
}
global {
        usage-count yes;
        # minor-count dialog-refresh disable-ip-verification
}

common {
        protocol A;

        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                split-brain "/usr/lib/drbd/notify-split-brain.sh [email protected]";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh [email protected]";
                # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
                # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
        }

        startup {
        }

        disk {
                # Those should be safe with our battery backed-up RAID controller
                # TODO: no-disk-flushes
                # TODO: no-md-flushes
        }

        net {
                # Restrict access to the resources with a shared secret
                cram-hmac-alg md5;
                shared-secret d56tuId1bbfa;
                
                # Congestion management lets writes flow without disconnecting
                on-congestion pull-ahead;
                congestion-fill 1M;
        }

        syncer {
                use-rle;
                # Resync checksumming while verifying used to lead to a deadlock, fixed in v8.3.11
                verify-alg md5;
                csums-alg md5;
                rate 5M;
                # TODO: tune al-extents
        }
}
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
