Hi,

I'm experimenting with DRBD for two OpenStreetMap tile servers (Dell
R210, 16G mem, Ubuntu Natty). No cluster manager, just DRBD over md
RAID0 to have something that resembles a RAID 0+1 setup.

One server, the DRBD primary, is working fine and is at the moment
happily importing the 16G planet data into PostgreSQL. However, the
secondary server is not happy. I have lost count of the crashes and hung
tasks it has experienced, but I cannot blame any *hardware* as the
culprit: the memory is ok, the disks are ok, I have swapped the network
cards, and the contents of the root partition (kernel & userland
software) are the same on both servers.

The server is booted with 'delayacct hpet=disable nohz=off' as kernel
parameters.

This is what happened this afternoon, four minutes after I started syncing:

Jun 23 16:24:27 nadir kernel: [167457.306951] block drbd0: Began resync as SyncTarget (will sync 5328613760 KB [1332153440 bits set]).
Jun 23 16:28:41 nadir kernel: [167710.031697] INFO: task kworker/u:1:14 blocked for more than 120 seconds.
Jun 23 16:28:41 nadir kernel: [167710.038500] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 16:28:41 nadir kernel: [167710.046445] kworker/u:1     D 0000000000000000     0    14      2 0x00000000
Jun 23 16:28:41 nadir kernel: [167710.046449]  ffff88041f995c00 0000000000000046 ffff88041f995fd8 ffff88041f994000
Jun 23 16:28:41 nadir kernel: [167710.046452]  0000000000013d00 ffff88041f933178 ffff88041f995fd8 0000000000013d00
Jun 23 16:28:41 nadir kernel: [167710.046455]  ffffffff81a0b020 ffff88041f932dc0 0000000000000010 ffff88041aafa000
Jun 23 16:28:41 nadir kernel: [167710.046457] Call Trace:
Jun 23 16:28:41 nadir kernel: [167710.046469]  [<ffffffffa0241145>] drbd_req_state+0x165/0x400 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046475]  [<ffffffff81087940>] ? autoremove_wake_function+0x0/0x40
Jun 23 16:28:41 nadir kernel: [167710.046480]  [<ffffffffa0244bf0>] ? drbd_nl_disconnect+0x0/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046485]  [<ffffffffa0241412>] _drbd_request_state+0x32/0xe0 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046491]  [<ffffffff8105e71a>] ? load_balance+0xca/0x5a0
Jun 23 16:28:41 nadir kernel: [167710.046495]  [<ffffffff8108e40d>] ? sched_clock_cpu+0xbd/0x110
Jun 23 16:28:41 nadir kernel: [167710.046500]  [<ffffffffa0244bf0>] ? drbd_nl_disconnect+0x0/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046506]  [<ffffffffa0244c1e>] drbd_nl_disconnect+0x2e/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046511]  [<ffffffffa024ab16>] drbd_connector_callback+0x116/0x600 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046516]  [<ffffffff813b57e0>] ? cn_queue_wrapper+0x0/0x50
Jun 23 16:28:41 nadir kernel: [167710.046518]  [<ffffffff813b5808>] cn_queue_wrapper+0x28/0x50
Jun 23 16:28:41 nadir kernel: [167710.046522]  [<ffffffff8108224d>] process_one_work+0x11d/0x420
Jun 23 16:28:41 nadir kernel: [167710.046526]  [<ffffffff81082ce9>] worker_thread+0x169/0x360
Jun 23 16:28:41 nadir kernel: [167710.046529]  [<ffffffff81082b80>] ? worker_thread+0x0/0x360
Jun 23 16:28:41 nadir kernel: [167710.046531]  [<ffffffff810871f6>] kthread+0x96/0xa0
Jun 23 16:28:41 nadir kernel: [167710.046535]  [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
Jun 23 16:28:41 nadir kernel: [167710.046538]  [<ffffffff81087160>] ? kthread+0x0/0xa0
Jun 23 16:28:41 nadir kernel: [167710.046540]  [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
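
For what it's worth, here is a rough sanity check of the numbers in the
log above, as a small Python sketch. The helper names (uptime_seconds,
resync_size) are made up for illustration and are not part of DRBD; the
two input strings are copied from the log. It only does arithmetic on
what the kernel printed: how many KB each bitmap bit covers, and how
long after the resync start the hung-task report appeared. The "blocked
no later than" figure assumes the 120 second threshold in the message
is exact, which is only approximately true since the hung-task check
runs periodically.

```python
import re

# Two lines copied from the kernel log above (unwrapped).
RESYNC = ("Jun 23 16:24:27 nadir kernel: [167457.306951] block drbd0: "
          "Began resync as SyncTarget "
          "(will sync 5328613760 KB [1332153440 bits set]).")
HUNG = ("Jun 23 16:28:41 nadir kernel: [167710.031697] INFO: task "
        "kworker/u:1:14 blocked for more than 120 seconds.")


def uptime_seconds(line):
    """Return the kernel uptime stamp in [brackets] as a float."""
    return float(re.search(r"\[\s*(\d+\.\d+)\]", line).group(1))


def resync_size(line):
    """Return (kilobytes, bits) from a 'Began resync' log line."""
    m = re.search(r"will sync (\d+) KB \[(\d+) bits set\]", line)
    return int(m.group(1)), int(m.group(2))


kb, bits = resync_size(RESYNC)
kb_per_bit = kb // bits  # amount of data covered by one bitmap bit

# Seconds between the resync start and the hung-task report.
elapsed = uptime_seconds(HUNG) - uptime_seconds(RESYNC)

# Threshold quoted in the hung-task message itself.
timeout = int(re.search(r"more than (\d+) seconds", HUNG).group(1))

# Assumption: the task was blocked for at least `timeout` seconds when
# the report was printed, so it blocked at or before this offset.
latest_block_start = elapsed - timeout

print(f"{kb_per_bit} KB per bitmap bit, {kb} KB to sync in total")
print(f"hung-task report {elapsed:.1f} s after resync start")
print(f"task blocked no later than {latest_block_start:.1f} s into the resync")
```

If I read this right, the worker was already stuck a little over two
minutes after the resync began, and the trace shows it sitting in
drbd_nl_disconnect at that point.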

Does anyone recognise this stack trace? What could be going on?

Second question: is it normal for 'drbdadm' to time out, or could it be
related? This happens on both the primary and the secondary, depending
on the change I made in the config.


root@nadir:/etc/drbd.d# drbdadm disconnect r0
Command 'drbdsetup 0 disconnect' did not terminate within 5 seconds


thanks for any input or pointers,
Maarten.



_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
