Hi, I'm experimenting with drbd for two openstreetmap tile servers (dell R210, 16G mem, ubuntu natty). No cluster manager, just drbd over md raid0 to have something that resembles a raid0+1 setup.
One server, the drbd primary, is working ok and is at the moment happily importing the 16G planet data into postgresql. However, the secondary server is not happy. I have lost count of the crashes and hung tasks the secondary has experienced, but I cannot blame any *hardware* as the culprit: mem is ok, disks are ok, I swapped the network cards, and the contents of the root partition (kernel & userland software) are the same on both servers. The server is started with 'delayacct hpet=disable nohz=off' as kernel parameters.

This is what happened this afternoon, four minutes after I started syncing:

Jun 23 16:24:27 nadir kernel: [167457.306951] block drbd0: Began resync as SyncTarget (will sync 5328613760 KB [1332153440 bits set]).
Jun 23 16:28:41 nadir kernel: [167710.031697] INFO: task kworker/u:1:14 blocked for more than 120 seconds.
Jun 23 16:28:41 nadir kernel: [167710.038500] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 16:28:41 nadir kernel: [167710.046445] kworker/u:1 D 0000000000000000 0 14 2 0x00000000
Jun 23 16:28:41 nadir kernel: [167710.046449] ffff88041f995c00 0000000000000046 ffff88041f995fd8 ffff88041f994000
Jun 23 16:28:41 nadir kernel: [167710.046452] 0000000000013d00 ffff88041f933178 ffff88041f995fd8 0000000000013d00
Jun 23 16:28:41 nadir kernel: [167710.046455] ffffffff81a0b020 ffff88041f932dc0 0000000000000010 ffff88041aafa000
Jun 23 16:28:41 nadir kernel: [167710.046457] Call Trace:
Jun 23 16:28:41 nadir kernel: [167710.046469] [<ffffffffa0241145>] drbd_req_state+0x165/0x400 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046475] [<ffffffff81087940>] ? autoremove_wake_function+0x0/0x40
Jun 23 16:28:41 nadir kernel: [167710.046480] [<ffffffffa0244bf0>] ? drbd_nl_disconnect+0x0/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046485] [<ffffffffa0241412>] _drbd_request_state+0x32/0xe0 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046491] [<ffffffff8105e71a>] ? load_balance+0xca/0x5a0
Jun 23 16:28:41 nadir kernel: [167710.046495] [<ffffffff8108e40d>] ? sched_clock_cpu+0xbd/0x110
Jun 23 16:28:41 nadir kernel: [167710.046500] [<ffffffffa0244bf0>] ? drbd_nl_disconnect+0x0/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046506] [<ffffffffa0244c1e>] drbd_nl_disconnect+0x2e/0x190 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046511] [<ffffffffa024ab16>] drbd_connector_callback+0x116/0x600 [drbd]
Jun 23 16:28:41 nadir kernel: [167710.046516] [<ffffffff813b57e0>] ? cn_queue_wrapper+0x0/0x50
Jun 23 16:28:41 nadir kernel: [167710.046518] [<ffffffff813b5808>] cn_queue_wrapper+0x28/0x50
Jun 23 16:28:41 nadir kernel: [167710.046522] [<ffffffff8108224d>] process_one_work+0x11d/0x420
Jun 23 16:28:41 nadir kernel: [167710.046526] [<ffffffff81082ce9>] worker_thread+0x169/0x360
Jun 23 16:28:41 nadir kernel: [167710.046529] [<ffffffff81082b80>] ? worker_thread+0x0/0x360
Jun 23 16:28:41 nadir kernel: [167710.046531] [<ffffffff810871f6>] kthread+0x96/0xa0
Jun 23 16:28:41 nadir kernel: [167710.046535] [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
Jun 23 16:28:41 nadir kernel: [167710.046538] [<ffffffff81087160>] ? kthread+0x0/0xa0
Jun 23 16:28:41 nadir kernel: [167710.046540] [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10

Does anyone recognise this stack trace? What could be going on?

Second question: is it normal for 'drbdadm' to time out, or could it be related? This happens on both the primary and the secondary, depending on the change I made in the config.

root@nadir:/etc/drbd.d# drbdadm disconnect r0
Command 'drbdsetup 0 disconnect' did not terminate within 5 seconds

Thanks for any input or pointers,
Maarten.

_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
