Dear DRBD-list, we've used DRBD+Heartbeat for a couple of years in a large LAMP based CMS system with 300+ gigs of MySQL data. We use it to store MySQL data on two physical 64 bit RHEL5 5.7 servers. It has served us very well. There has been no crashes until now, but being a HA setup, it has allowed for sw / hw maintenance with very little associated downtime.

Lately, however, we've been subject to huge number of 'Digest mismatch, buffer modified by upper layers ' events and associated resyncs.This happens many times an hour and makes online verification
impossible. Replacing the network interfaces have not helped.

Very lately, we've had two nasty kernel panics. Syslog data for the latest is included below.

We know that digest verification of all transactions is not the recommended practice in a production environment, but in case that the problems should be related to hardware data corruption we want to play it safe.
Loosing the DRBD data would cause hours of downtime.

Is there any good advice for solving the problems, are we doing something wrong?

  - Michael

-----------------------------------short-info---------------------------------

[root@db1 ~]# rpm -q heartbeat
heartbeat-2.1.3-1

[root@db1 ~]# rpm -qa | grep drbd
drbd-km-2.6.18_274.el5-8.3.11-1
drbd-utils-8.3.11-1
drbd-heartbeat-8.3.11-1
drbd-udev-8.3.11-1

Hardware: 2x Dell R710 with 96G RAM and internal disks.3
[root@db2 ~]# ethtool  --driver eth0
driver: bnx2
version: 2.0.21
firmware-version: 5.0.9 bc 5.0.6 NCSI 2.0.3
bus-info: 0000:01:00.0

See below for detailed info.

-----------------------------------------detailed-info--------------------------

db1: /etc/drbd.conf:

global {
        usage-count no;
}
common {
        #syncer { rate 30M; verify-alg sha1; }
        syncer { rate 20M; verify-alg sha1; }
        net {
                cram-hmac-alg sha1;
                shared-secret "###";
                data-integrity-alg sha1;
        }
        protocol C;
        handlers {
outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater.sh -t 60 -p db1 -r sql";
        }
        disk {
                no-disk-barrier;
                fencing resource-only;
        }
}
resource sql {
        on db1 {
                device    minor 1;
                disk      /dev/plads/WebSqlRaw;
                address   x.y.z.4:7789;
                meta-disk internal;
        }
        on db2 {
                device    minor 1;
                disk      /dev/plads/WebSqlRaw;
                address   x.y.z.5:7789;
                meta-disk internal;
        }
}
////////////////////////
[root@db2 ~]# cat /usr/lib64/heartbeat/drbd-peer-outdater.sh
#!/bin/sh

# FE to drbd-peer-outdater to cope with bug in HeartBeat 2.1.3

/usr/lib64/heartbeat/drbd-peer-outdater "$@"
ex=$?
if [ $ex = 20 ]; then
ex=5
fi
exit $ex
////////////////////////
[root@db2 ~]# tail -3 /etc/ha.d/ha.cf
respawn hacluster /usr/lib64/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster

////////////////////////
[root@db2 ~]# cat /etc/ha.d/haresources
db1 drbddisk::sql Filesystem::/dev/drbd1::/var/lib/mysql::ext3::noatime x.y.z.150 mysqld memcached
////////////////////////

[root@db1 ~]# rpm -q heartbeat
heartbeat-2.1.3-1

[root@db1 ~]# rpm -qa | grep drbd
drbd-km-2.6.18_274.el5-8.3.11-1
drbd-utils-8.3.11-1
drbd-heartbeat-8.3.11-1
drbd-udev-8.3.11-1

[root@db1 ~]# uname -a ; cat /etc/redhat-release
Linux db1 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 5.7 (Tikanga)



Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Digest mismatch, buffer modified by upper layers during write: 84581616s +4096 Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Digest integrity check FAILED: 84581616s +4096
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: error receiving Data, l: 4140!
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: asender terminated
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Terminating asender thread
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Connection closed
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: conn( ProtocolError -> Unconnected )
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: receiver terminated
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: Restarting receiver thread
Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: receiver (re)started
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: sock was shut down by peer
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: short read expecting header on sock: r=0 Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: meta connection shut down by peer.
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: asender terminated
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Terminating asender thread
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: new current UUID 211E1556F8380CC3:CE3F2D613EF98849:58345E642CB68BA5:58335E642CB68BA5
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Connection closed
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: conn( BrokenPipe -> Unconnected )
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: receiver terminated
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: Restarting receiver thread
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: receiver (re)started
Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: helper command: /sbin/drbdadm fence-peer minor-1 Aug 27 20:10:03 x.y.z.4 kernel:: block drbd1: conn( Unconnected -> WFConnection )
Aug 27 20:10:03 x.y.z.4 drbd-peer-outdater: [4120]: debug: drbd peer: db2
Aug 27 20:10:03 x.y.z.4 drbd-peer-outdater: [4120]: debug: drbd resource: sql Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: Connecting channel Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: Client outdater (0x11617150) connected Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: invoked: outdater Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: Processing msg from outdater Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: Got message from (drbd-peer-outdater). (peer: db2, res :sql) Aug 27 20:10:03 x.y.z.4 /usr/lib64/heartbeat/dopd: [17880]: debug: Starting node walk Aug 27 20:10:03 x.y.z.5 kernel:: block drbd1: conn( Unconnected -> WFConnection ) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Handshake successful: Agreed network protocol version 96 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: conn( WFConnection -> WFReportParams ) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Starting asender thread (from drbd1_receiver [29862])
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: data-integrity-alg: sha1
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Handshake successful: Agreed network protocol version 96 Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: conn( WFConnection -> WFReportParams ) Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Starting asender thread (from drbd1_receiver [13639])
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: data-integrity-alg: sha1
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: drbd_sync_handshake:
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: self 211E1556F8380CC3:CE3F2D613EF98849:58345E642CB68BA5:58335E642CB68BA5 bits:40 flags:0 Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: peer CE3F2D613EF98848:0000000000000000:58345E642CB68BA4:58335E642CB68BA5 bits:0 flags:0
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: uuid_compare()=1 by rule 70
Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: drbd_sync_handshake:
ug 27 20:10:04 x.y.z.5 kernel:: block drbd1: self CE3F2D613EF98848:0000000000000000:58345E642CB68BA4:58335E642CB68BA5 bits:0 flags:0 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: peer 211E1556F8380CC3:CE3F2D613EF98849:58345E642CB68BA5:58335E642CB68BA5 bits:40 flags:0
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: uuid_compare()=-1 by rule 50
Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate ) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: conn( WFBitMapT -> WFSyncUUID ) Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0) Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: Began resync as SyncSource (will sync 160 KB [40 bits set]). Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: updated sync UUID 211E1556F8380CC3:CE402D613EF98849:CE3F2D613EF98849:58345E642CB68BA5 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: updated sync uuid CE402D613EF98848:0000000000000000:58345E642CB68BA4:58335E642CB68BA5 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent ) Aug 27 20:10:04 x.y.z.5 kernel:: block drbd1: Began resync as SyncTarget (will sync 160 KB [40 bits set]). Aug 27 20:10:04 x.y.z.4 kernel:: block drbd1: BAD! sector=837397720s enr=25555 rs_left=-1 rs_failed=0 count=1 Aug 27 20:10:04 x.y.z.4 syslog-ng[4325]: Error processing log message: <4>1 2011-08-27T20:10:04+02:00 db1 kernel: - - [meta sequenceId="6082"]
Aug 27 20:10:04 x.y.z.4 kernel:: Call Trace:
Aug 27 20:10:04 x.y.z.4 kernel:: <IRQ> [<ffffffff885e4e69>] :drbd:drbd_try_clear_on_disk_bm+0xbb/0x27b Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885e6ecf>] :drbd:__drbd_set_in_sync+0x27b/0x2aa Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885e1b80>] :drbd:_req_may_be_done+0x311/0x52b Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885e2040>] :drbd:__req_mod+0x293/0xccc Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff885d5476>] :drbd:drbd_endio_pri+0xbe/0xee Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff881495af>] :dm_mod:clone_endio+0xc1/0xd0 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8002cf01>] __end_that_request_first+0x23c/0x4b6 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8807a202>] :scsi_mod:scsi_end_request+0x27/0xcd Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8807a3f6>] :scsi_mod:scsi_io_completion+0x14e/0x324 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff880a7802>] :sd_mod:sd_rw_intr+0x252/0x28c Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8807a68d>] :scsi_mod:scsi_device_unbusy+0x67/0x81 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80037ba5>] blk_done_softirq+0x5f/0x6d Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80012562>] __do_softirq+0x89/0x133
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8006d636>] do_softirq+0x2c/0x7d
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8006d4c6>] do_IRQ+0xee/0xf7
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005d615>] ret_from_intr+0x0/0xa
Aug 27 20:10:04 x.y.z.4 kernel:: <EOI> [<ffffffff800968ea>] ksoftirqd+0x0/0xbf Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8014fc8d>] cfq_queue_empty+0x0/0x16 Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80154f55>] kobject_release+0x0/0x9
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8006d685>] do_softirq+0x7b/0x7d
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80096949>] ksoftirqd+0x5f/0xbf
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80032722>] kthread+0xfe/0x132
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff80032624>] kthread+0x0/0x132
Aug 27 20:10:04 x.y.z.4 kernel:: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Aug 27 20:10:04 x.y.z.4 syslog-ng[4325]: Error processing log message: <4>1 2011-08-27T20:10:04+02:00 db1 kernel: - - [meta sequenceId="6110"]

_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user

Reply via email to