Hi,
We have drbd 8.3.11 or 8.3.13 dual-primary on a pacemaker cluster
running on kernel 3.0.41.
The cluster just does its work, with nothing being stopped or started, and
then, after a week or so, drbdsetup locks up (see the kernel trace below)
when we try to administer a resource.
Usually only one of several resources is affected, sometimes two.
We have seen several such traces, with different drbdsetup sub-commands,
all ending at the same place.
Could this be the problem addressed by the following commit in 8.4?
http://git.drbd.org/gitweb.cgi?p=drbd-8.4.git;a=commit;h=c586d79e49135831dbe0629e2d9a7b3739c615ef
Fix comparison of is_valid_transition()'s return code
We backported that patch to an 8.3.13 build, which is currently running on
a test machine, but since the problem appears only now and then, it is
hard to say whether it is gone.
Does anyone have an idea how one gets into this state?
TIA
Andi
---8<---
<03>2012 Sep 10 17:17:01 cnode1 [609601.848157] INFO: task
drbdsetup:5670 blocked for more than 120 seconds.
<03>2012 Sep 10 17:17:01 cnode1 [609601.848160] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
<06>2012 Sep 10 17:17:01 cnode1 [609601.848162] drbdsetup D
0000000000000000 0 5670 1 0x00000004
<04>2012 Sep 10 17:17:01 cnode1 [609601.848166] ffff88000f423968
0000000000000082 ffff88003ffd7c00 ffff88000f423fd8
<04>2012 Sep 10 17:17:01 cnode1 [609601.848170] ffff88000f423838
0000000000012340 0000000000012340 0000000000012340
<04>2012 Sep 10 17:17:01 cnode1 [609601.848173] 0000000000012340
0000000000012340 ffff88000ee045c0 0000000000012340
<04>2012 Sep 10 17:17:01 cnode1 [609601.848177] Call Trace:
<04>2012 Sep 10 17:17:01 cnode1 [609601.852026] [<ffffffff8103960c>] ?
spin_unlock_irqrestore+0x9/0xb
<04>2012 Sep 10 17:17:01 cnode1 [609601.880322] [<ffffffff810416d6>] ?
__wake_up+0x43/0x50
<04>2012 Sep 10 17:17:01 cnode1 [609601.884293] [<ffffffffa03a745f>] ?
put_ldev+0x85/0x8a [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.916943] [<ffffffffa03a7be5>] ?
is_valid_state+0x73/0x1e3 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.916953] [<ffffffffa03a698f>] ?
spin_unlock_irqrestore+0x9/0xb [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.916969] [<ffffffffa03a7e22>] ?
_req_st_cond+0xcd/0xdf [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919191] [<ffffffff815ad428>]
schedule+0x44/0x46
<04>2012 Sep 10 17:17:01 cnode1 [609601.919208] [<ffffffffa03aadb2>]
drbd_req_state+0x1b6/0x2df [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919224] [<ffffffff8105f3cc>] ?
wake_up_bit+0x23/0x23
<04>2012 Sep 10 17:17:01 cnode1 [609601.919241] [<ffffffffa03aaefd>]
_drbd_request_state+0x22/0xb2 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919252] [<ffffffff810bbcb6>] ?
zone_statistics+0x77/0x7e
<04>2012 Sep 10 17:17:01 cnode1 [609601.920356] [<ffffffff810ab9da>] ?
set_page_refcounted+0xd/0x1a
<04>2012 Sep 10 17:17:01 cnode1 [609601.920401] [<ffffffff810ade41>] ?
get_page_from_freelist+0x58b/0x64d
<04>2012 Sep 10 17:17:01 cnode1 [609601.920446] [<ffffffffa03b1895>]
drbd_nl_invalidate+0xa1/0x133 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.920462] [<ffffffffa03b1c1d>]
drbd_connector_callback+0x104/0x195 [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.924378] [<ffffffffa026446a>]
cn_rx_skb+0xb0/0xd2 [cn]
<04>2012 Sep 10 17:17:01 cnode1 [609601.936338] [<ffffffff81514227>]
netlink_unicast+0xe2/0x14b
<04>2012 Sep 10 17:17:01 cnode1 [609601.963889] [<ffffffff814f1ea6>] ?
memcpy_fromiovec+0x42/0x73
<04>2012 Sep 10 17:17:01 cnode1 [609601.963897] [<ffffffff8151545c>]
netlink_sendmsg+0x230/0x250
<04>2012 Sep 10 17:17:01 cnode1 [609601.963909] [<ffffffff814e71c1>]
__sock_sendmsg_nosec+0x55/0x62
<04>2012 Sep 10 17:17:01 cnode1 [609601.963913] [<ffffffff814e8456>]
__sock_sendmsg+0x39/0x42
<04>2012 Sep 10 17:17:01 cnode1 [609601.963917] [<ffffffff814e8c2e>]
sock_sendmsg+0xa3/0xbc
<04>2012 Sep 10 17:17:01 cnode1 [609601.963920] [<ffffffff810c1137>] ?
handle_pte_fault+0x2ef/0x843
<04>2012 Sep 10 17:17:01 cnode1 [609601.963924] [<ffffffff810c1e32>] ?
handle_mm_fault+0x19c/0x1b3
<04>2012 Sep 10 17:17:01 cnode1 [609601.963936] [<ffffffff810eedbe>] ?
fget_light+0x2f/0x7c
<04>2012 Sep 10 17:17:01 cnode1 [609601.963939] [<ffffffff814e8c71>] ?
sockfd_lookup_light+0x1b/0x53
<04>2012 Sep 10 17:17:01 cnode1 [609601.963943] [<ffffffff814e91b6>]
sys_sendto+0xfa/0x11f
<04>2012 Sep 10 17:17:01 cnode1 [609601.963946] [<ffffffff8151355b>] ?
netlink_table_ungrab+0x2e/0x30
<04>2012 Sep 10 17:17:01 cnode1 [609601.963949] [<ffffffff81515609>] ?
netlink_bind+0x106/0x11c
<04>2012 Sep 10 17:17:01 cnode1 [609601.963952] [<ffffffff814e9c33>] ?
sys_bind+0x7d/0x91
<04>2012 Sep 10 17:17:01 cnode1 [609601.963955] [<ffffffff810ebd14>] ?
spin_lock+0x9/0xb
<04>2012 Sep 10 17:17:01 cnode1 [609601.963960] [<ffffffff815b3a92>]
system_call_fastpath+0x16/0x1b
--->8---
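
A note on reading the trace above: frames marked with "?" are addresses the
kernel found on the stack but could not verify, so only the unmarked frames
form the reliable call chain (here schedule, drbd_req_state,
_drbd_request_state, drbd_nl_invalidate and so on down to
system_call_fastpath). To compare several such traces, a minimal Python
sketch along these lines could pull out just those frames. The function name
reliable_frames and the regular expression are illustrative assumptions, not
part of DRBD or any kernel tooling, and the pattern has only been matched
against the log format shown here.

```python
import re

# Matches one stack frame such as
#   [<ffffffffa03aadb2>] drbd_req_state+0x1b6/0x2df [drbd]
# \s+ also matches newlines, so frames wrapped by the mail client still match.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+"          # return address
    r"(\?\s+)?"                    # optional '?' marks an unreliable frame
    r"([A-Za-z_][\w.]*)"           # symbol name
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"   # offset/size
    r"(?:\s+\[(\w+)\])?"           # optional module name, e.g. [drbd]
)


def reliable_frames(log_text):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    return [
        (m.group(2), m.group(3))
        for m in FRAME_RE.finditer(log_text)
        if m.group(1) is None
    ]


if __name__ == "__main__":
    # Three frames copied from the trace above, including the line wraps.
    sample = """\
<04>2012 Sep 10 17:17:01 cnode1 [609601.916969] [<ffffffffa03a7e22>] ?
_req_st_cond+0xcd/0xdf [drbd]
<04>2012 Sep 10 17:17:01 cnode1 [609601.919191] [<ffffffff815ad428>]
schedule+0x44/0x46
<04>2012 Sep 10 17:17:01 cnode1 [609601.919208] [<ffffffffa03aadb2>]
drbd_req_state+0x1b6/0x2df [drbd]
"""
    for symbol, module in reliable_frames(sample):
        print(symbol, f"[{module}]" if module else "")
```

Running the same function over each captured trace and comparing the
resulting lists makes it easier to confirm that they all end in the same
place regardless of the drbdsetup sub-command.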