On 05-05-2017 9:24, Adi Pircalabu wrote:
On 5/5/17 2:18 AM, Robert Altnoeder wrote:
On 04/26/2017 06:03 AM, Adi Pircalabu wrote:
Just FYI, it crashed again yesterday morning at 7:06am with a similar
backtrace. The crash output for bt, ps, task & vm is attached. I've since
downgraded the drbd module version from 8.4.9-2 to 8.4.9-1 and am waiting
for the crash to recur. And, as expected, the folks at Red Hat closed the
bug as NOTABUG after it was reopened, blaming drbd.
If they really explicitly blamed DRBD, then I suggest reopening the bug
and requesting a copy of their root cause analysis that proves that DRBD
is causing the problem.
I have, along with providing more debug information and asking why
they think DRBD is to blame.
Obviously, their point will be something like "no one knows what that
out-of-tree code might be doing"; granted, it's not an entirely invalid
point.
Agree.
But then, I am quite sure - judging by the frequency and number of
kernel updates that are provided each year - that no one really knows
what the in-tree code might be doing, so one had better look there too
before blaming a piece of out-of-tree code that's pretty small compared
to all the other pieces of code that may have caused the crash.
Here is the additional comment when reopening the bug (email client
wrapping may make it unreadable):
Looking further into the 2 backtraces:
1. First crash, linux-3.10.0-514.10.2.el7.x86_64
[793292.358213] .1BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
[793292.358710] IP: [<ffffffff810c8375>] account_system_time+0x15/0x170
[793292.358966] PGD 0
[793292.359202] Oops: 0000 [#1] SMP
[793292.359444] Modules linked in: binfmt_misc vfat fat drbd(OE) mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu bonding ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support dcdbas pcspkr mxm_wmi sg sb_edac edac_core ipmi_devintf ipmi_si ipmi_msghandler lpc_ich mei_me mei shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc tcp_htcp ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel drm_kms_helper syscopyarea sysfillrect
[793292.362164] sysimgblt fb_sys_fops ttm ixgbe drm ahci uas igb libahci mdio i2c_algo_bit usb_storage ptp libata pps_core i2c_core megaraid_sas dca fjes dm_mirror dm_region_hash dm_log dm_mod
[793292.362978] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G OE ------------ 3.10.0-514.10.2.el7.x86_64 #1
[793292.363448] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.3.4 11/08/2016
[793292.363910] task: ffff8804fa559f60 ti: ffff8804fa680000 task.ti: ffff8804fa680000
[793292.364377] RIP: 0010:[<ffffffff810c8375>] [<ffffffff810c8375>] account_system_time+0x15/0x170
[793292.364850] RSP: 0018:ffff88086de43e00 EFLAGS: 00010086
[793292.365088] RAX: 0000000000000000 RBX: ffff88086de56c40 RCX: 00000000000f4240
[793292.365550] RDX: 00000000000f4240 RSI: 0000000000010000 RDI: 0000000000000000
[793292.366012] RBP: ffff88086de43e28 R08: 0000000000000000 R09: 00000000000c1af5
[793292.366470] R10: 000000003b9aca00 R11: 0000000000000000 R12: 00000000000f4240
[793292.367018] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88086de4f9d8
[793292.367473] FS: 0000000000000000(0000) GS:ffff88086de40000(0000) knlGS:0000000000000000
[793292.367935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[793292.368179] CR2: 0000000000000014 CR3: 00000000019ba000 CR4: 00000000003407e0
[793292.368640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[793292.369106] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[793292.369571] Stack:
[793292.369801] ffff88086de56c40 0000000000016c40 0000000000000000 0000000000000000
[793292.370282] ffff88086de4f9d8 ffff88086de43e60 ffffffff810c8682 0000000000000000
[793292.370764] 0000000000000000 0000000000000003 ffffffff810f3180 ffff88086de4f9d8
[793292.371250] Call Trace:
[793292.371484] <IRQ>
[793292.371494]
[793292.371730] [<ffffffff810c8682>] account_process_tick+0x62/0x170
[793292.371973] [<ffffffff810f3180>] ? tick_sched_handle.isra.13+0x60/0x60
[793292.372218] [<ffffffff8109932d>] update_process_times+0x2d/0x80
[793292.372465] [<ffffffff810f3145>] tick_sched_handle.isra.13+0x25/0x60
[793292.372712] [<ffffffff810f31c1>] tick_sched_timer+0x41/0x70
[793292.372957] [<ffffffff810b4a32>] __hrtimer_run_queues+0xd2/0x260
[793292.373197] [<ffffffff810b4fd0>] hrtimer_interrupt+0xb0/0x1e0
[793292.373445] [<ffffffff81050fd7>] local_apic_timer_interrupt+0x37/0x60
[793292.373692] [<ffffffff8169920f>] smp_apic_timer_interrupt+0x3f/0x60
[793292.373935] [<ffffffff8169775d>] apic_timer_interrupt+0x6d/0x80
[793292.374178] <EOI>
[793292.374187]
[793292.374423] [<ffffffff81514492>] ? cpuidle_enter_state+0x52/0xc0
[793292.374664] [<ffffffff815145d9>] cpuidle_idle_call+0xd9/0x210
[793292.374908] [<ffffffff810350ee>] arch_cpu_idle+0xe/0x30
[793292.375154] [<ffffffff810e7e65>] cpu_startup_entry+0x245/0x290
[793292.375398] [<ffffffff8104f07a>] start_secondary+0x1ba/0x230
[793292.375640] Code: e8 81 63 07 00 5b 41 5c 41 5d 41 5e 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 d4 53 <f6> 47 14 10 48 89 fb 74 1c 65 48 8b 04 25 b8 cd 00 00 8b 80 44
[793292.376666] RIP [<ffffffff810c8375>] account_system_time+0x15/0x170
[793292.376917] RSP <ffff88086de43e00>
[793292.377158] CR2: 0000000000000014
crash> dis -rl ffffffff810c8375
/usr/src/debug/kernel-3.10.0-514.10.2.el7/linux-3.10.0-514.10.2.el7.x86_64/kernel/sched/cputime.c: 213
0xffffffff810c8360 <account_system_time>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff810c8365 <account_system_time+5>: push %rbp
0xffffffff810c8366 <account_system_time+6>: mov %rsp,%rbp
0xffffffff810c8369 <account_system_time+9>: push %r15
0xffffffff810c836b <account_system_time+11>: push %r14
0xffffffff810c836d <account_system_time+13>: push %r13
0xffffffff810c836f <account_system_time+15>: push %r12
0xffffffff810c8371 <account_system_time+17>: mov %rdx,%r12
0xffffffff810c8374 <account_system_time+20>: push %rbx
/usr/src/debug/kernel-3.10.0-514.10.2.el7/linux-3.10.0-514.10.2.el7.x86_64/kernel/sched/cputime.c: 216
0xffffffff810c8375 <account_system_time+21>: testb $0x10,0x14(%rdi)
2. Second crash, linux-3.10.0-514.16.1.el7.x86_64:
[647323.702265] BUG: unable to handle kernel NULL pointer dereference at (null)
[647323.702774] IP: [<ffffffff8168e48f>] _raw_spin_lock_irqsave+0x1f/0x60
[647323.703030] PGD 0
[647323.703274] Oops: 0002 [#1] SMP
[647323.703519] Modules linked in: mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase vfat fat drbd(OE) bonding ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt ipmi_devintf iTCO_vendor_support sb_edac sg pcspkr ipmi_si edac_core mxm_wmi dcdbas ipmi_msghandler mei_me mei lpc_ich shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc tcp_htcp ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic uas usb_storage crct10dif_pclmul crct10dif_common crc32c_intel drm_kms_helper syscopyarea sysfillrect sysimgblt
[647323.706272] fb_sys_fops ttm ixgbe ahci igb drm libahci mdio ptp libata i2c_algo_bit pps_core i2c_core megaraid_sas dca fjes dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dell_rbu]
[647323.707081] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G OE ------------ 3.10.0-514.16.1.el7.x86_64 #1
[647323.707562] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.3.4 11/08/2016
[647323.708033] task: ffff8804fa61edd0 ti: ffff8804fa620000 task.ti: ffff8804fa620000
[647323.708506] RIP: 0010:[<ffffffff8168e48f>] [<ffffffff8168e48f>] _raw_spin_lock_irqsave+0x1f/0x60
[647323.708986] RSP: 0018:ffff8804fa623e10 EFLAGS: 00010082
[647323.709227] RAX: 0000000000000082 RBX: ffff88086de4f8e0 RCX: 000000000c0d81e5
[647323.709698] RDX: 0000000000020000 RSI: ffff8804fa623e48 RDI: 0000000000000000
[647323.710163] RBP: ffff8804fa623e10 R08: 0000000000000082 R09: 0000000000000000
[647323.710631] R10: 0000000000000004 R11: 0000000000000005 R12: ffff88086de4fe10
[647323.711100] R13: ffff8804fa623e48 R14: ffff8804fa620000 R15: 0000000000000000
[647323.711578] FS: 0000000000000000(0000) GS:ffff88086de40000(0000) knlGS:0000000000000000
[647323.712051] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[647323.712295] CR2: 0000000000000000 CR3: 00000000019ba000 CR4: 00000000003407e0
[647323.712851] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[647323.713311] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[647323.713771] Stack:
[647323.714001] ffff8804fa623e38 ffffffff810b4735 ffff88086de4fde0 00000000ffffffff
[647323.714490] ffff8804fa620000 ffff8804fa623e70 ffffffff810b4f37 ffffffff81514a2a
[647323.714983] a88acbe2089d5b89 ffff88086de4fde0 00024cbb3a6654d9 ffff8804fa620000
[647323.715467] Call Trace:
[647323.715707] [<ffffffff810b4735>] lock_hrtimer_base.isra.20+0x25/0x50
[647323.715949] [<ffffffff810b4f37>] hrtimer_try_to_cancel.part.25+0x37/0x100
[647323.716202] [<ffffffff81514a2a>] ? cpuidle_enter_state+0x5a/0xc0
[647323.716445] [<ffffffff810b5048>] hrtimer_cancel+0x28/0x40
[647323.716691] [<ffffffff810f36d7>] tick_nohz_restart+0x17/0x70
[647323.716935] [<ffffffff810f417f>] tick_nohz_idle_exit+0x8f/0x150
[647323.717182] [<ffffffff810e81d1>] cpu_startup_entry+0x171/0x290
[647323.717434] [<ffffffff8104f07a>] start_secondary+0x1ba/0x230
[647323.717676] Code: df 0f 1f 80 00 00 00 00 eb e0 66 90 0f 1f 44 00 00 55 48 89 e5 9c 58 0f 1f 44 00 00 49 89 c0 fa 66 0f 1f 44 00 00 ba 00 00 02 00 <f0> 0f c1 17 89 d1 c1 e9 10 66 39 d1 75 05 4c 89 c0 5d c3 83 e1
[647323.718706] RIP [<ffffffff8168e48f>] _raw_spin_lock_irqsave+0x1f/0x60
[647323.718952] RSP <ffff8804fa623e10>
[647323.719191] CR2: 0000000000000000
crash> dis -rl ffffffff8168e48f
/usr/src/debug/kernel-3.10.0-514.16.1.el7/linux-3.10.0-514.16.1.el7.x86_64/kernel/spinlock.c: 144
0xffffffff8168e470 <_raw_spin_lock_irqsave>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff8168e475 <_raw_spin_lock_irqsave+5>: push %rbp
0xffffffff8168e476 <_raw_spin_lock_irqsave+6>: mov %rsp,%rbp
/usr/src/debug/kernel-3.10.0-514.16.1.el7/linux-3.10.0-514.16.1.el7.x86_64/arch/x86/include/asm/paravirt.h: 775
0xffffffff8168e479 <_raw_spin_lock_irqsave+9>: pushfq
0xffffffff8168e47a <_raw_spin_lock_irqsave+10>: pop %rax
0xffffffff8168e47b <_raw_spin_lock_irqsave+11>: nopl 0x0(%rax,%rax,1)
0xffffffff8168e480 <_raw_spin_lock_irqsave+16>: mov %rax,%r8
/usr/src/debug/kernel-3.10.0-514.16.1.el7/linux-3.10.0-514.16.1.el7.x86_64/arch/x86/include/asm/paravirt.h: 785
0xffffffff8168e483 <_raw_spin_lock_irqsave+19>: cli
0xffffffff8168e484 <_raw_spin_lock_irqsave+20>: nopw 0x0(%rax,%rax,1)
/usr/src/debug/kernel-3.10.0-514.16.1.el7/linux-3.10.0-514.16.1.el7.x86_64/arch/x86/include/asm/spinlock.h: 86
0xffffffff8168e48a <_raw_spin_lock_irqsave+26>: mov $0x20000,%edx
0xffffffff8168e48f <_raw_spin_lock_irqsave+31>: lock xadd %edx,(%rdi)
In both cases RDI was NULL. *And* there's no evidence in either of the
two stack traces of DRBD causing the crash.
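
For anyone who wants to re-check those two observations, here is a minimal Python sketch. It assumes only the register values, disassembly and call trace quoted above; the helper names fault_address and module_frames are hypothetical and not part of crash or the kernel. It recomputes the faulting address as RDI plus the instruction's displacement and compares it with CR2, then looks for call trace frames carrying the trailing "[module]" tag that the kernel prints for symbols from loadable modules.

```python
import re

# Values copied from the two oopses quoted above. The helper names
# fault_address() and module_frames() are hypothetical, for illustration.

def fault_address(base_reg, displacement):
    """Effective address of a base+displacement operand, wrapped to 64 bits."""
    return (base_reg + displacement) & 0xFFFFFFFFFFFFFFFF

CRASHES = [
    # (faulting instruction, RDI, displacement, CR2 reported by the oops)
    ("account_system_time+0x15: testb $0x10,0x14(%rdi)", 0x0, 0x14, 0x14),
    ("_raw_spin_lock_irqsave+0x1f: lock xadd %edx,(%rdi)", 0x0, 0x0, 0x0),
]

# A frame from a loadable module ends in "[modname]"; built-in code does not.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\][ \t]+(?:\?[ \t]+)?(?P<sym>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:[ \t]+\[(?P<mod>\w+)\])?"
)

def module_frames(trace):
    """Return (symbol, module) for every frame that belongs to a module."""
    return [(m["sym"], m["mod"]) for m in FRAME_RE.finditer(trace) if m["mod"]]

# Call trace of the second crash, timestamps dropped and lines unwrapped.
SECOND_TRACE = """\
[<ffffffff810b4735>] lock_hrtimer_base.isra.20+0x25/0x50
[<ffffffff810b4f37>] hrtimer_try_to_cancel.part.25+0x37/0x100
[<ffffffff81514a2a>] ? cpuidle_enter_state+0x5a/0xc0
[<ffffffff810b5048>] hrtimer_cancel+0x28/0x40
[<ffffffff810f36d7>] tick_nohz_restart+0x17/0x70
[<ffffffff810f417f>] tick_nohz_idle_exit+0x8f/0x150
[<ffffffff810e81d1>] cpu_startup_entry+0x171/0x290
[<ffffffff8104f07a>] start_secondary+0x1ba/0x230
"""

for desc, rdi, disp, cr2 in CRASHES:
    addr = fault_address(rdi, disp)
    verdict = "matches" if addr == cr2 else "does NOT match"
    print(f"{desc}: RDI+disp = {addr:#x}, {verdict} CR2 = {cr2:#x}")

print("module frames in second trace:", module_frames(SECOND_TRACE))
```

An empty list only says that no module code was on the stack when the oops fired; on its own it cannot rule out memory corruption that happened earlier.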
Just a short update on this. After upgrading to
kernel-3.10.0-514.21.1.el7.x86_64 I haven't seen any more crashes. The
same drbd module has been in use all along and still is:
modinfo drbd
filename: /lib/modules/3.10.0-514.21.1.el7.x86_64/weak-updates/drbd84/drbd.ko
alias: block-major-147-*
license: GPL
version: 8.4.9-1
description: drbd - Distributed Replicated Block Device v8.4.9-1
author: Philipp Reisner <[email protected]>, Lars Ellenberg <[email protected]>
rhelversion: 7.3
srcversion: D502CE1D6329A5626F8A7CD
depends: libcrc32c
vermagic: 3.10.0-514.el7.x86_64 SMP mod_unload modversions
signer: The ELRepo Project (http://elrepo.org): ELRepo.org Secure Boot Key
sig_key: F3:65:AD:34:81:A7:B2:0E:34:27:B6:1B:2A:26:63:5B:83:FE:42:7B
sig_hashalgo: sha256
parm: minor_count:Approximate number of drbd devices (1-255) (uint)
parm: disable_sendpage:bool
parm: allow_oos:DONT USE! (bool)
parm: proc_details:int
parm: enable_faults:int
parm: fault_rate:int
parm: fault_count:int
parm: fault_devs:int
parm: usermode_helper:string
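
In case it helps anyone comparing module builds across kernel updates, here is a rough Python sketch that pulls the relevant fields out of modinfo-style text. The helper name parse_modinfo is hypothetical, the sample is three lines from the output above, and the parser assumes each field sits on one unwrapped "key: value" line.

```python
# Sketch only: parse_modinfo() is a hypothetical helper, and the sample
# text is three fields copied from the modinfo output above.

def parse_modinfo(text):
    """Map each field name to the list of values seen for it.

    Assumes unwrapped "key: value" lines; repeated keys such as
    "parm" are collected in order.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and " " not in key:
            fields.setdefault(key, []).append(value.strip())
    return fields

SAMPLE = """\
filename:       /lib/modules/3.10.0-514.21.1.el7.x86_64/weak-updates/drbd84/drbd.ko
version:        8.4.9-1
vermagic:       3.10.0-514.el7.x86_64 SMP mod_unload modversions
"""

info = parse_modinfo(SAMPLE)
# The first token of vermagic is the kernel release the module was built for.
built_for = info["vermagic"][0].split()[0]

print("drbd version:", info["version"][0])
print("built against kernel:", built_for)
print("found under weak-updates:", "/weak-updates/" in info["filename"][0])
```

On a live system the "built against" value can be compared with the output of uname -r; here it differs from the running 3.10.0-514.21.1 kernel, which fits the module being picked up from weak-updates rather than rebuilt.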
Cheers,
---
Adi Pircalabu