On Mon, Aug 14, 2017 at 02:41:56PM +0200, Frank Rust wrote: > Hi all, > after upgrading DRBD-Dkms to 9.0.8+linbit-1 (on kernel 4.4.67-1-pve) I get > those errors: > > Software is: root@virt5:~# dpkg -l |grep drbd > ii drbd-dkms 9.0.8+linbit-1 all > RAID 1 over TCP/IP for Linux module source > ii drbd-utils 9.0.0+linbit-1 > amd64 RAID 1 over TCP/IP for Linux (user utilities) > ii drbdmanage-proxmox 1.0-1 all > DRBD distributed resource management utility > ii python-drbdmanage 0.99.8-1 all > DRBD distributed resource management utility > > > > kernel:[233761.855451] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! > [drbd_s_vm-109-d:65197] > Aug 14 14:21:49 virt2 kernel: [233761.855451] NMI watchdog: BUG: soft lockup > - CPU#1 stuck for 22s! [drbd_s_vm-109-d:65197] > Aug 14 14:21:49 virt2 kernel: [233761.855544] Modules linked in: ipt_REJECT > nf_reject_ipv4 drbd_transport_tcp(O) drbd(O) ip_set ip6table_filter > ip6_tables xt_multiport binfmt_misc iptable_filter ip_tables x_tables softdog > rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc > ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp > libiscsi_tcp libiscsi scsi_transport_iscsi nfnetlink_log nfnetlink zfs(PO) > zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) dm_thin_pool > dm_persistent_data dm_bio_prison dm_bufio libcrc32c ipmi_ssif intel_rapl > x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass > crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel sb_edac > aes_x86_64 snd_pcm lrw snd_timer edac_core gf128mul snd glue_helper shpchp > soundcore joydev pcspkr input_leds ablk_helper cryptd i2c_i801 lpc_ich > ipmi_si 8250_fintek mei_me ipmi_msghandler mei mac_hid acpi_power_meter wmi > vhost_net vhost macvtap macvlan coretemp autofs4 hid_generic ixgbe(O) usbkbd > usbmouse dca be2net usbhid vxlan ahci ip6_udp_tunnel ptp hid libahci > udp_tunnel pps_core megaraid_sas fjes > Aug 14 14:21:49 virt2 kernel: [233761.855592] CPU: 1 PID: 65197 Comm: > drbd_s_vm-109-d Tainted: P O L 4.4.67-1-pve #1 > Aug 14 14:21:49 virt2 kernel: [233761.855593] Hardware name: FUJITSU PRIMERGY > RX2530 M1/D3279-A1, BIOS V5.0.0.9 R1.28.0 for D3279-A1x > 12/09/2015 > Aug 14 14:21:49 virt2 kernel: [233761.855594] task: ffff883003707000 ti: > ffff881dc5100000 task.ti: ffff881dc5100000 > Aug 14 14:21:49 virt2 kernel: [233761.855595] RIP: 0010:[<ffffffffc0936ae2>] > [<ffffffffc0936ae2>] wait_for_sender_todo+0x142/0x270 [drbd] > Aug 14 14:21:49 virt2 kernel: [233761.855606] RSP: 0018:ffff881dc5103db0 > EFLAGS: 00000282 > Aug 14 14:21:49 virt2 kernel: [233761.855607] RAX: 00000000002aef32 RBX: > ffff88300dd94000 RCX: 8000000000000000 > Aug 14 14:21:49 virt2 kernel: [233761.855608] RDX: 0000000000000001 RSI: > 0000000000000000 RDI: ffff881c71ab1998 > Aug 14 14:21:49 virt2 kernel: [233761.855609] RBP: ffff881dc5103e10 R08: > ffff88300dd944d8 R09: 0000000000000001 > Aug 14 14:21:49 virt2 kernel: [233761.855610] R10: 000000002b300018 R11: > 00000000fffffffb R12: ffff88300dd940b8 > Aug 14 14:21:49 virt2 kernel: [233761.855610] R13: ffff88300dd944c0 R14: > ffff883003707000 R15: 0000000000000001 > Aug 14 14:21:49 virt2 kernel: [233761.855611] FS: 0000000000000000(0000) > GS:ffff88181f640000(0000) knlGS:0000000000000000 > Aug 14 14:21:49 virt2 kernel: [233761.855612] CS: 0010 DS: 0000 ES: 0000 > CR0: 0000000080050033 > Aug 14 14:21:49 virt2 kernel: [233761.855613] CR2: 0000000000000030 CR3: > 0000000001e0b000 CR4: 00000000001426e0 > Aug 14 14:21:49 virt2 kernel: [233761.855614] Stack: > Aug 14 14:21:49 virt2 kernel: [233761.855614] ffff88300dd94520 > 0000000000000000 ffff883003707000 ffffffff810c46e0 > Aug 14 14:21:49 virt2 kernel: [233761.855616] ffff881dc5103dd0 > ffff881dc5103dd0 00000000e2e71a4a ffff88300dd94000 > Aug 14 14:21:49 virt2 kernel: [233761.855617] ffff88300dd944d8 > ffff88300dd94528 ffff88300dd94520 ffff88300dd944d8 > Aug 14 14:21:49 virt2 kernel: [233761.855618] Call Trace: > Aug 14 14:21:49 virt2 kernel: [233761.855623] [<ffffffff810c46e0>] ? > wait_woken+0x90/0x90 > Aug 14 14:21:49 virt2 kernel: [233761.855627] [<ffffffffc093c1fe>] > drbd_sender+0x34e/0x400 [drbd] > Aug 14 14:21:49 virt2 kernel: [233761.855633] [<ffffffffc095ad60>] ? > w_complete+0x20/0x20 [drbd] > Aug 14 14:21:49 virt2 kernel: [233761.855637] [<ffffffffc095adc0>] > drbd_thread_setup+0x60/0x110 [drbd] > Aug 14 14:21:49 virt2 kernel: [233761.855641] [<ffffffffc095ad60>] ? > w_complete+0x20/0x20 [drbd] > Aug 14 14:21:49 virt2 kernel: [233761.855643] [<ffffffff810a134a>] > kthread+0xfa/0x110 > Aug 14 14:21:49 virt2 kernel: [233761.855644] [<ffffffff810a1250>] ? > kthread_park+0x60/0x60 > Aug 14 14:21:49 virt2 kernel: [233761.855646] [<ffffffff81864a8f>] > ret_from_fork+0x3f/0x70 > Aug 14 14:21:49 virt2 kernel: [233761.855647] [<ffffffff810a1250>] ? > kthread_park+0x60/0x60 > Aug 14 14:21:49 virt2 kernel: [233761.855648] Code: 8b 7b 10 8b 87 a0 03 00 > 00 84 d2 74 09 3b 83 38 09 00 00 0f 95 c2 48 81 c7 98 01 00 00 c6 07 00 0f 1f > 40 00 fb 66 0f 1f 44 00 00 <84> d2 0f 85 99 00 00 00 f0 0f ba b3 f8 00 00 00 > 0f 72 43 83 bb > Aug 14 14:21:51 virt2 kernel: [233763.964203] drbd_send_and_submit: 546 > callbacks suppressed > Aug 14 14:21:51 virt2 kernel: [233763.964206] drbd vm-109-disk-1/0 drbd176: > IO ERROR: neither local nor remote data, sector 38107320+8 > Aug 14 14:21:51 virt2 kernel: [233763.964327] buffer_io_error: 546 callbacks > suppressed > > > top shows processes using all of one cpu core > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > > > > 65197 root 20 0 0 0 0 R 100.0 0.0 92:01.19 > drbd_s_vm-109-d > > > 65229 root 20 0 0 0 0 R 100.0 0.0 91:54.35 > kworker/u145:2 > >
Any logs you can share before that happend? +/- 50 lines around the time this happened would help. > >drbdadm status > root@virt5:~# drbdadm status > .drbdctrl role:Secondary > volume:0 disk:UpToDate > volume:1 disk:UpToDate > fs1 role:Secondary > volume:0 peer-disk:UpToDate > volume:1 peer-disk:UpToDate > fs2 role:Secondary > volume:0 peer-disk:UpToDate > volume:1 peer-disk:UpToDate > virt1 role:Secondary > volume:0 peer-disk:UpToDate > volume:1 peer-disk:UpToDate > virt2 role:Primary > volume:0 peer-disk:UpToDate > volume:1 peer-disk:UpToDate > virt3 role:Secondary > volume:0 peer-disk:UpToDate > volume:1 peer-disk:UpToDate > virt4 role:Secondary > volume:0 peer-disk:UpToDate > volume:1 peer-disk:UpToDate Maybe a bit unrelated, but why would you spawn the control volume over that many nodes? Especially virtual machines. That does not make sense to me. I guess these should be satellite nodes. > Is there anything I can do about it? (reboot is no option on production > machines…) You saw the trace in the kernel? This is what it is going to be. Regards, rck _______________________________________________ drbd-user mailing list [email protected] http://lists.linbit.com/mailman/listinfo/drbd-user
