Hi everyone,

After updating our storage servers to kernel 4.9.33, we are experiencing
repeated crashes on some machines.

A typical kernel stacktrace looks like this:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000304
[136234.369489] IP: [<ffffffff814f1717>] flush_unmaps_timeout+0xa7/0x1c0
[136234.382388] PGD 0 [136234.386238]
[136234.389393] Oops: 0000 [#1] SMP
[136234.395844] Modules linked in: rbd libceph deadline_iosched nf_log_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv6 nf_log_common xt_LOG xt_limit nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32c_intel acpi_cpufreq nf_conntrack_ftp nf_conntrack dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_round_robin dm_multipath xts aesni_intel glue_helper lrw ablk_helper cryptd aes_x86_64 fuse dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log
[136234.491272] CPU: 6 PID: 40 Comm: ksoftirqd/6 Not tainted 4.9.33 #1
[136234.503785] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
[136234.518733] task: ffff88105b86b900 task.stack: ffffc90006518000
[136234.530735] RIP: 0010:[<ffffffff814f1717>]  [<ffffffff814f1717>] flush_unmaps_timeout+0xa7/0x1c0
[136234.548498] RSP: 0018:ffffc9000651bd18  EFLAGS: 00010006
[136234.559283] RAX: ffff880858a01f48 RBX: 0000000000000000 RCX: ffff88085c246000
[136234.573708] RDX: 000000000000003f RSI: 0000000000000086 RDI: ffff88085d12c5c0
[136234.588133] RBP: ffffc9000651bd68 R08: 0000000000000000 R09: ffff88085d12c5c0
[136234.602558] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88084f8e8a80
[136234.616986] R13: ffff880858a01f48 R14: ffff88085c25a3c0 R15: 00000000000fddc9
[136234.631410] FS:  0000000000000000(0000) GS:ffff88107fa00000(0000) knlGS:0000000000000000
[136234.647743] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[136234.659393] CR2: 0000000000000304 CR3: 000000073a8a1000 CR4: 00000000000406e0
[136234.673815] Stack:
[136234.678013]  0000000000000286 ffff88107fa137c0 000000018108f5c4 ffffc9000651bd88
[136234.693095]  ffffffff8108ed23 0000000000000001 ffff88107fa137c8 0000000000000100
[136234.708178]  ffffffff814f1670 0000000000000006 ffffc9000651bda0 ffffffff810c0915
[136234.723263] Call Trace:
[136234.728337]  [<ffffffff8108ed23>] ? put_prev_entity+0x83/0x850
[136234.740197]  [<ffffffff814f1670>] ? iommu_flush_iotlb_psi+0x120/0x120
[136234.753243]  [<ffffffff810c0915>] call_timer_fn+0x35/0x120
[136234.764378]  [<ffffffff810c189e>] run_timer_softirq+0x1fe/0x460
[136234.776397]  [<ffffffff818865d7>] __do_softirq+0xe7/0x256
[136234.787369]  [<ffffffff8107ec80>] ? sort_range+0x30/0x30
[136234.798152]  [<ffffffff810635bc>] run_ksoftirqd+0x1c/0x30
[136234.809111]  [<ffffffff8107ed8a>] smpboot_thread_fn+0x10a/0x160
[136234.821117]  [<ffffffff8107b377>] kthread+0xd7/0xf0
[136234.831044]  [<ffffffff8107b2a0>] ? kthread_park+0x60/0x60
[136234.842183]  [<ffffffff81884112>] ret_from_fork+0x22/0x30
[136234.853141] Code: 8b 45 00 85 c0 0f 84 d3 00 00 00 41 f6 46 18 80 0f 84 02 01 00 00 85 c0 0f 8e b8 00 00 00 31 db eb 3b ba ff ff ff ff 49 0f bd d4 <41> 80 bb 04 03 00 00 00 75 7d 49 8d bb 18 03 00 00 4c 89 e2 4c
[136234.893223] RIP  [<ffffffff814f1717>] flush_unmaps_timeout+0xa7/0x1c0
[136234.906296]  RSP <ffffc9000651bd18>
[136234.913446] CR2: 0000000000000304
[136234.920624] ---[ end trace c3bd71ceb3b717b2 ]---
[136234.930021] Kernel panic - not syncing: Fatal exception in interrupt
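
For what it is worth, the faulting instruction can be read off the Code: line
by hand. The following is only a minimal sketch: the helper names
(faulting_bytes, decode_cmpb_disp32) are made up for illustration, and the
decoder handles just the single encoding that appears at the <..> marker, not
x86 in general. Assuming the decoding is right, the instruction is a byte
compare at offset 0x304 from R11, and R11 is zero in the register dump, which
would explain the CR2 value.

```python
# Values copied from the oops above.
CODE_LINE = (
    "8b 45 00 85 c0 0f 84 d3 00 00 00 41 f6 46 18 80 0f 84 02 "
    "01 00 00 85 c0 0f 8e b8 00 00 00 31 db eb 3b ba ff ff ff ff "
    "49 0f bd d4 <41> 80 bb 04 03 00 00 00 75 7d 49 8d bb 18 03 "
    "00 00 4c 89 e2 4c"
)
CR2 = 0x0000000000000304
R11 = 0x0000000000000000


def faulting_bytes(code_line):
    """Return the code bytes starting at the <..>-marked byte (the byte at RIP)."""
    tokens = code_line.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return bytes(int(tok.strip("<>"), 16) for tok in tokens[start:])


def decode_cmpb_disp32(insn):
    """Decode only 'REX prefix, opcode 80 /7, mod=10 (disp32), imm8'.

    Returns (base_register_number, displacement, immediate).
    """
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    assert rex & 0xF0 == 0x40 and opcode == 0x80
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    # mod=10: 32-bit displacement; reg=7: CMP; rm=4 would mean a SIB byte follows.
    assert mod == 0b10 and reg == 7 and rm != 4
    base = rm | ((rex & 1) << 3)  # REX.B extends the base register number
    disp = int.from_bytes(insn[3:7], "little", signed=True)
    return base, disp, insn[7]


base, disp, imm = decode_cmpb_disp32(faulting_bytes(CODE_LINE))
print(f"cmpb ${imm:#x}, {disp:#x}(%r{base})")
print("matches CR2:", R11 + disp == CR2)
```

If this is correct, some structure pointer held in R11 is NULL at that point
in flush_unmaps_timeout, and a one-byte field at offset 0x304 of it is being
read. Which structure that is cannot be told from the trace alone.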

As far as I can see, this bug occurs only on machines with a Xeon E5-2620.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Stepping:              7
CPU MHz:               1200.000
CPU max MHz:           2001.0000
CPU min MHz:           1200.0000
BogoMIPS:              4007.87
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23

Does anyone have an idea how to work around this problem? The crashes are
frequent enough to affect production in our data centers.

Many thanks in advance!

Best regards

Christian

-- 
Dipl-Inf. Christian Kauhaus <>< · [email protected] · +49 345 219401-0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick


_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu
