Am 20.05.2011 10:22, schrieb Lars Ellenberg:
According to you, it "had been working" before with the exact same
hardware and configuration.
Now if it does not anymore, then I strongly suspect network problems.
I am currently checking the network switches together with the DELL
support. Just to sort out a possible responsibility of the 10GbE switches.
Did you do some network benchmarks on the replication link recently?
Packet loss, excessive retransmits, checksum errors, bad cabling?
Port stats on the switch, any error counters?
Duplex or other auto-negotiating mismatch?
Use flood ping with both large (>32k) and small packet sizes, iperf,
your favorite network integrity checker or benchmarking tool...
Up to now I did tests with:
1. netio: up to 230 MByte/s
2. iperf: always below 80MBit/s
...both did not affect the CPU usage stats in any bigger kind.
3. ping: 9k and 18k packages, simultaneously from/to both hosts.
...led to high CPU peaks. Services of the virtual guests not available
for that time, dmesg log outputs about timeouts - the log entries that I
posted already.
In the end of this email I attach three logiles of three machines
starting from the moment when I restarted the resync of the
split-brain-DRBD.
host1: The machine where the running virtual guests reside. Runs
Xenserver 5.6.
host2: The machine that lost the DRBD sync. No running virtual machines.
Runs Xenserver 5.6.
guest: The mailserver as an example for all other virtual guests. Runs
CentOS 5.5. Resides on host1.
An explanation for the timeline:
13:41:42: DRBD resync started on host2
13:49:25: DRBD resync aborted by disconnecting on host2
13:51:44: DRBD disconnection still not finished. So I tried stopping the
DRBD service on host2. After that has also been refused, I rebooted host2.
While the DRBD resync attempt was in progress the mailserver was not
accessible in the network, neither its mail services nor the webmail
service or the ssh console. In that time top showed just very few CPU
usage on host1 but around 95% on host2 (drbd_receiver process).
Any further ideas?
I am already thinking about installing a fresh CentOS 5.6 or Debian 6 on
host2, setting up DRBD as Standalone, installing Xen and moving (system
as image, data as copy) onto that system ... then setting up host1
similarly and syncing everything onto its I'd lose fine things like
XenMotion but I possibly gained much more control over the resources.
What's your opinion about that? Possibly Xenserver, my hardware setup
and my virtualization goals simply do not fit? :-/
CU,
Daniel.
---
host1 /var/log messages
May 24 13:41:45 localhost kernel: block drbd0: Handshake successful:
Agreed network protocol version 94
May 24 13:41:45 localhost kernel: block drbd0: conn( WFConnection ->
WFReportParams )
May 24 13:41:45 localhost kernel: block drbd0: Starting asender thread
(from drbd0_receiver [12355])
May 24 13:41:45 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:41:45 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:41:45 localhost kernel: block drbd0: self
06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE
bits:1420811336 flags:0
May 24 13:41:45 localhost kernel: block drbd0: peer
1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5
bits:3874219255 flags:0
May 24 13:41:45 localhost kernel: block drbd0: uuid_compare()=1 by rule 70
May 24 13:41:45 localhost kernel: block drbd0: Becoming sync source due
to disk states.
May 24 13:41:45 localhost kernel: block drbd0: peer( Unknown ->
Secondary ) conn( WFReportParams -> WFBitMapS )
May 24 13:41:47 localhost kernel: block drbd0: peer( Secondary ->
Unknown ) conn( WFBitMapS -> TearDown )
May 24 13:41:47 localhost kernel: block drbd0: asender terminated
May 24 13:41:47 localhost kernel: block drbd0: Terminating asender thread
May 24 13:41:51 localhost kernel: block drbd0: sock_sendmsg returned -32
May 24 13:41:51 localhost kernel: block drbd0: short sent ReportBitMap
size=4096 sent=408
May 24 13:41:51 localhost kernel: block drbd0: Connection closed
May 24 13:41:51 localhost kernel: block drbd0: conn( TearDown ->
Unconnected )
May 24 13:41:51 localhost kernel: block drbd0: receiver terminated
May 24 13:41:51 localhost kernel: block drbd0: Restarting receiver thread
May 24 13:41:51 localhost kernel: block drbd0: receiver (re)started
May 24 13:41:51 localhost kernel: block drbd0: conn( Unconnected ->
WFConnection )
May 24 13:42:11 localhost kernel: block drbd0: Handshake successful:
Agreed network protocol version 94
May 24 13:42:11 localhost kernel: block drbd0: conn( WFConnection ->
WFReportParams )
May 24 13:42:11 localhost kernel: block drbd0: Starting asender thread
(from drbd0_receiver [12355])
May 24 13:42:11 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:42:11 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:42:11 localhost kernel: block drbd0: self
06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE
bits:1420811336 flags:0
May 24 13:42:11 localhost kernel: block drbd0: peer
1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5
bits:3874219261 flags:1
May 24 13:42:11 localhost kernel: block drbd0: uuid_compare()=1 by rule 70
May 24 13:42:11 localhost kernel: block drbd0: Becoming sync source due
to disk states.
May 24 13:42:11 localhost kernel: block drbd0: peer( Unknown ->
Secondary ) conn( WFReportParams -> WFBitMapS )
May 24 13:49:25 localhost kernel: block drbd0: peer( Secondary ->
Unknown ) conn( WFBitMapS -> TearDown )
May 24 13:49:25 localhost kernel: block drbd0: asender terminated
May 24 13:49:25 localhost kernel: block drbd0: Terminating asender thread
May 24 13:52:03 localhost kernel: INFO: task drbd0_receiver:12355
blocked for more than 120 seconds.
May 24 13:52:03 localhost kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 24 13:52:03 localhost kernel: drbd0_receive D c0120d7a 0
12355 2
May 24 13:52:03 localhost kernel: ed769f0c 00000246 eab68900
c0120d7a 000002d0 00000000 ed768000 ed769ec0
May 24 13:52:03 localhost kernel: c0134ecc 64260759 00014afe
ed769edc edd5d938 edd5d7b0 edd5d938 c16bdb00
May 24 13:52:03 localhost kernel: 00000000 ea87d740 ed769ed8
0e2cd9dc 00000000 00000005 004e6cd1 c0346aaf
May 24 13:52:03 localhost kernel: Call Trace:
May 24 13:52:03 localhost kernel: [<c0120d7a>] ? wake_up_state+0xa/0x10
May 24 13:52:03 localhost kernel: [<c0134ecc>] ?
complete_signal+0x17c/0x1b0
May 24 13:52:03 localhost kernel: [<c0346aaf>] ? wait_for_common+0x1f/0x1c0
May 24 13:52:03 localhost kernel: [<c0135b5b>] ? force_sig_info+0x9b/0xb0
May 24 13:52:03 localhost kernel: [<c03473b5>]
__mutex_lock_slowpath+0x55/0xd0
May 24 13:52:03 localhost kernel: [<c03471b7>] mutex_lock+0x17/0x20
May 24 13:52:03 localhost kernel: [<f062306e>] drbd_free_sock+0x1e/0xa0
[drbd]
May 24 13:52:03 localhost kernel: [<f0612b85>]
drbd_disconnect+0x45/0x7d0 [drbd]
May 24 13:52:03 localhost kernel: [<f06114c6>] ?
drbd_recv_header+0x16/0xa0 [drbd]
May 24 13:52:03 localhost kernel: [<c011eae0>] ? __wake_up+0x40/0x50
May 24 13:52:03 localhost kernel: [<f0614c21>] drbdd_init+0xb1/0x120 [drbd]
May 24 13:52:03 localhost kernel: [<f0627d22>]
drbd_thread_setup+0x132/0x1d0 [drbd]
May 24 13:52:03 localhost kernel: [<f0627bf0>] ?
drbd_thread_setup+0x0/0x1d0 [drbd]
May 24 13:52:03 localhost kernel: [<c010569b>]
kernel_thread_helper+0x7/0x10
May 24 13:52:03 localhost kernel: =======================
May 24 13:53:21 localhost kernel: block drbd0: short sent ReportBitMap
size=4096 sent=1392
May 24 13:53:21 localhost kernel: block drbd0: Connection closed
May 24 13:53:21 localhost kernel: block drbd0: conn( TearDown ->
Unconnected )
May 24 13:53:21 localhost kernel: block drbd0: receiver terminated
May 24 13:53:21 localhost kernel: block drbd0: Restarting receiver thread
May 24 13:53:21 localhost kernel: block drbd0: receiver (re)started
May 24 13:53:21 localhost kernel: block drbd0: conn( Unconnected ->
WFConnection )
---
host2 /var/log/messages
May 24 13:41:42 localhost kernel: drbd: initialized. Version: 8.3.8.1
(api:88/proto:86-94)
May 24 13:41:42 localhost kernel: drbd: GIT-hash:
0d8589fcc32c874df57c930ca1691399b55ec893 build by
root@localhost.localdomain, 2010-08-12 07:46:27
May 24 13:41:42 localhost kernel: drbd: registered as block device major 147
May 24 13:41:42 localhost kernel: drbd: minor_table @ 0xeb0d6340
May 24 13:41:42 localhost kernel: block drbd0: Starting worker thread
(from cqueue [6339])
May 24 13:41:42 localhost kernel: block drbd0: disk( Diskless -> Attaching )
May 24 13:41:42 localhost kernel: klogd 1.4.1, ---------- state change
----------
May 24 13:41:43 localhost kernel: block drbd0: Found 10 transactions
(582 active extents) in activity log.
May 24 13:41:43 localhost kernel: block drbd0: Method to ensure write
ordering: drain
May 24 13:41:43 localhost kernel: block drbd0: max_segment_size ( = BIO
size ) = 32768
May 24 13:41:43 localhost kernel: block drbd0: Adjusting my ra_pages to
backing device's (32 -> 256)
May 24 13:41:43 localhost kernel: block drbd0: drbd_bm_resize called
with capacity == 31204648888
May 24 13:41:43 localhost kernel: block drbd0: resync bitmap:
bits=3900581111 words=121893160
May 24 13:41:43 localhost kernel: block drbd0: size = 15 TB (15602324444 KB)
May 24 13:41:45 localhost kernel: block drbd0: recounting of set bits
took additional 98 jiffies
May 24 13:41:45 localhost kernel: block drbd0: 2491 GB (3874219255 bits)
marked out-of-sync by on disk bit-map.
May 24 13:41:45 localhost kernel: block drbd0: disk( Attaching ->
Inconsistent )
May 24 13:41:45 localhost kernel: block drbd0: conn( StandAlone ->
Unconnected )
May 24 13:41:45 localhost kernel: block drbd0: Starting receiver thread
(from drbd0_worker [6355])
May 24 13:41:45 localhost kernel: block drbd0: receiver (re)started
May 24 13:41:45 localhost kernel: block drbd0: conn( Unconnected ->
WFConnection )
May 24 13:41:45 localhost kernel: block drbd0: Handshake successful:
Agreed network protocol version 94
May 24 13:41:45 localhost kernel: block drbd0: conn( WFConnection ->
WFReportParams )
May 24 13:41:45 localhost kernel: block drbd0: Starting asender thread
(from drbd0_receiver [6367])
May 24 13:41:45 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:41:45 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:41:45 localhost kernel: block drbd0: self
1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5
bits:3874219255 flags:0
May 24 13:41:45 localhost kernel: block drbd0: peer
06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE
bits:1420811336 flags:0
May 24 13:41:45 localhost kernel: block drbd0: uuid_compare()=-1 by rule 50
May 24 13:41:45 localhost kernel: block drbd0: Becoming sync target due
to disk states.
May 24 13:41:45 localhost kernel: block drbd0: peer( Unknown -> Primary
) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
peer_isp( 0 -> 1 )
May 24 13:41:47 localhost kernel: block drbd0: peer( Primary -> Unknown
) conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown )
peer_isp( 1 -> 0 )
May 24 13:41:47 localhost kernel: block drbd0: error receiving
ReportBitMap, l: 4088!
May 24 13:41:47 localhost kernel: block drbd0: asender terminated
May 24 13:41:47 localhost kernel: block drbd0: Terminating asender thread
May 24 13:41:47 localhost kernel: block drbd0: Connection closed
May 24 13:41:47 localhost kernel: block drbd0: conn( Disconnecting ->
StandAlone )
May 24 13:41:47 localhost kernel: block drbd0: receiver terminated
May 24 13:41:47 localhost kernel: block drbd0: Terminating receiver thread
May 24 13:42:11 localhost kernel: block drbd0: conn( StandAlone ->
Unconnected )
May 24 13:42:11 localhost kernel: block drbd0: Starting receiver thread
(from drbd0_worker [6355])
May 24 13:42:11 localhost kernel: block drbd0: receiver (re)started
May 24 13:42:11 localhost kernel: block drbd0: conn( Unconnected ->
WFConnection )
May 24 13:42:11 localhost kernel: block drbd0: Handshake successful:
Agreed network protocol version 94
May 24 13:42:11 localhost kernel: block drbd0: conn( WFConnection ->
WFReportParams )
May 24 13:42:11 localhost kernel: block drbd0: Starting asender thread
(from drbd0_receiver [6412])
May 24 13:42:11 localhost kernel: block drbd0: data-integrity-alg: crc32c
May 24 13:42:11 localhost kernel: block drbd0: drbd_sync_handshake:
May 24 13:42:11 localhost kernel: block drbd0: self
1B3487AE74F5DBA0:0000000000000000:6CEB7B1CC746F358:42F748D007A8C8C5
bits:3874219261 flags:0
May 24 13:42:11 localhost kernel: block drbd0: peer
06F5FB89BEF343B1:1B3487AE74F5DBA1:947F913C965D9046:622E658C94927BDE
bits:1420811336 flags:0
May 24 13:42:11 localhost kernel: block drbd0: uuid_compare()=-1 by rule 50
May 24 13:42:11 localhost kernel: block drbd0: Becoming sync target due
to disk states.
May 24 13:42:11 localhost kernel: block drbd0: peer( Unknown -> Primary
) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
peer_isp( 0 -> 1 )
May 24 13:49:25 localhost kernel: block drbd0: peer( Primary -> Unknown
) conn( WFBitMapT -> Disconnecting ) pdsk( UpToDate -> DUnknown )
peer_isp( 1 -> 0 )
May 24 13:49:35 localhost kernel: block drbd0: process_done_ee() = NOT_OK
May 24 13:49:35 localhost kernel: block drbd0: asender terminated
May 24 13:49:35 localhost kernel: block drbd0: Terminating asender thread
Broadcast message from root (Tue May 24 13:51:44 2011):
The system is going down for reboot NOW!
May 24 13:51:44 localhost shutdown[6941]: shutting down for system reboot
May 24 13:52:02 localhost python: PERFMON: Caught signal 15 - exiting
May 24 13:52:06 localhost mountd[10542]: Caught signal 15,
un-registering and exiting.
May 24 13:52:07 localhost kernel: nfsd: last server has exited, flushing
export cache
---
mailserver (virtual guest) /var/log/dmesg
INFO: task mailserver:3114 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mailserver D ffff8800e5e66860 0 3114 1 3117 3105
(NOTLB)
ffff8800d444bc88 0000000000000286 ffff8800ff99e040 ffff880000eec670
000000000000000a ffff8800e5e66860 ffffffff804fdb00 000000000001f341
ffff8800e5e66a48 ffffffff802648f1
Call Trace:
[<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14
[<ffffffff8022937f>] sync_page+0x0/0x43
[<ffffffff8022937f>] sync_page+0x0/0x43
[<ffffffff802635d9>] io_schedule+0x3f/0x67
[<ffffffff802293bd>] sync_page+0x3e/0x43
[<ffffffff8026371d>] __wait_on_bit_lock+0x36/0x66
[<ffffffff80241b63>] __lock_page+0x5e/0x64
[<ffffffff8029e060>] wake_bit_function+0x0/0x23
[<ffffffff80214338>] filemap_nopage+0x268/0x360
[<ffffffff80208e8c>] __handle_mm_fault+0x442/0x1445
[<ffffffff80267d58>] do_page_fault+0xf7b/0x1324
[<ffffffff802648f1>] _spin_lock_irqsave+0x9/0x14
[<ffffffff8023fa13>] lock_timer_base+0x1b/0x3c
[<ffffffff8024d2d6>] try_to_del_timer_sync+0x7f/0x88
[<ffffffff8026082b>] error_exit+0x0/0x6e
INFO: task mailserver:3136 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mailserver D ffff8800d796c0c0 0 3136 1 3262 3128
(NOTLB)
ffff8800d4b83ef8 0000000000000286 ffff8800e4943680 ffff8800ff9914f8
000000000000000a ffff8800d796c0c0 ffffffff804fdb00 000000000000b5e2
ffff8800d796c2a8 ffffffff8021f04a
Call Trace:
[<ffffffff8021f04a>] __dentry_open+0x101/0x1dc
[<ffffffff8020e8da>] vfs_getattr+0x62/0xa9
[<ffffffff802643ea>] __down_write_nested+0x82/0x9a
[<ffffffff802c971b>] sys_mmap_pgoff+0x55/0xac
[<ffffffff8026168d>] ia32_sysret+0x0/0x5
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user