[Kernel-packages] [Bug 1733662] Re: System hang with Linux kernel 4.13, not with 4.10

Rod Smith Fri, 05 Jan 2018 14:51:12 -0800

That one completed two runs, but on the second run, dmesg included the
following message at one point:


[  240.841694] kernel BUG at /home/jsalisbury/bugs/lp1733662/ubuntu-artful/mm/slub.c:3878!
[  240.842765] invalid opcode: 0000 [#1] SMP
[  240.843718] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate intel_rapl_perf 
ipmi_ssif joydev input_leds ipmi_si ipmi_devintf ipmi_msghandler 
acpi_power_meter lpc_ich shpchp acpi_pad mac_hid mei_me mei ib_iser rdma_cm 
iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 
autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure 
scsi_transport_sas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc fnic 
mgag200 ttm hid_generic drm_kms_helper syscopyarea igb sysfillrect aesni_intel 
sysimgblt usbhid libfcoe fb_sys_fops aes_x86_64 dca hid crypto_simd 
i2c_algo_bit mxm_wmi glue_helper ptp cryptd ahci libfc libahci
[  240.851457]  drm pps_core megaraid_sas scsi_transport_fc enic wmi
[  240.852693] CPU: 8 PID: 2724 Comm: irqbalance Not tainted 4.13.0-13-generic #14~lp1733662Commitac2fc5adab0f4
[  240.853965] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  240.855281] task: ffff9b62a76645c0 task.stack: ffffb973cf6fc000
[  240.856603] RIP: 0010:kfree+0x11c/0x160
[  240.857937] RSP: 0018:ffffb973cf6ffa08 EFLAGS: 00010246
[  240.859280] RAX: fffff8803cff0020 RBX: ffff9b6200000000 RCX: 0000000000000000
[  240.860632] RDX: 0000000000000000 RSI: ffff9b62b0eb5348 RDI: 000064dcc0000000
[  240.861995] RBP: ffffb973cf6ffa20 R08: ffff9b62b22f70f0 R09: 0000000180220021
[  240.863367] R10: fffff8803d000000 R11: 0000000000000001 R12: ffff9b62b1648780
[  240.864756] R13: ffffffffb65dd4e0 R14: ffff9b62a872f0d8 R15: ffff9b62a872fac0
[  240.866145] FS:  00007ff8c4d06740(0000) GS:ffff9b62bf200000(0000) knlGS:0000000000000000
[  240.867562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  240.868986] CR2: 00007fff9ef860f8 CR3: 0000003fe7876000 CR4: 00000000001406e0
[  240.870438] Call Trace:
[  240.871882]  kfree_const+0x20/0x30
[  240.873328]  kernfs_put+0x71/0x180
[  240.874778]  kernfs_dop_release+0x12/0x20
[  240.876218]  __dentry_kill+0xe5/0x150
[  240.877644]  shrink_dentry_list+0x11f/0x2e0
[  240.879078]  d_invalidate+0x67/0x110
[  240.880526]  lookup_fast+0x2b9/0x310
[  240.881968]  ? dput.part.23+0x2d/0x1e0
[  240.883393]  walk_component+0x49/0x340
[  240.884811]  ? kernfs_iop_permission+0x4f/0x60
[  240.886253]  link_path_walk+0x1bc/0x590
[  240.887690]  ? path_init+0x177/0x2f0
[  240.889105]  path_lookupat+0x56/0x1f0
[  240.890529]  filename_lookup+0xb6/0x190
[  240.891964]  ? sprintf+0x51/0x70
[  240.893387]  ? __check_object_size+0xaf/0x1b0
[  240.894822]  ? strncpy_from_user+0x4d/0x170
[  240.896240]  user_path_at_empty+0x36/0x40
[  240.897673]  ? user_path_at_empty+0x36/0x40
[  240.899101]  vfs_statx+0x76/0xe0
[  240.900517]  SYSC_newstat+0x3d/0x70
[  240.901934]  ? ____fput+0xe/0x10
[  240.903365]  ? task_work_run+0x7b/0x90
[  240.904783]  ? exit_to_usermode_loop+0x9b/0xd0
[  240.906181]  SyS_newstat+0xe/0x10
[  240.907559]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  240.908900] RIP: 0033:0x7ff8c3df6bb5
[  240.910196] RSP: 002b:00007ffe6cf8a928 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  240.911496] RAX: ffffffffffffffda RBX: 0000000000fe9a40 RCX: 00007ff8c3df6bb5
[  240.912763] RDX: 00007ffe6cf8a980 RSI: 00007ffe6cf8a980 RDI: 00007ffe6cf8c210
[  240.913985] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000039
[  240.915181] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  240.916320] R13: 00007ffe6cf8b22b R14: 0000000000fe9a40 R15: 0000000000fe92f0
[  240.917447] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 
c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 
49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 2c 
[  240.919769] RIP: kfree+0x11c/0x160 RSP: ffffb973cf6ffa08
[  240.920909] ---[ end trace 67fe147f4dd931eb ]---

A third run produced a hang when offlining CPU 8, with the following
dmesg output:

[  352.776303] EDAC MC1: Giving out device to module sb_edac.c controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[  352.776572] EDAC sbridge: Some needed devices are missing
[  352.801614] EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
[  352.825588] EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
[  352.826090] EDAC sbridge: Couldn't find mci handler
[  352.826457] EDAC sbridge: Couldn't find mci handler
[  352.826826] EDAC sbridge: Failed to register device with error -19.
[  353.286163] BUG: unable to handle kernel paging request at 0000317865646e69
[  353.286790] IP: __kmalloc_node+0x135/0x2a0
[  353.287303] PGD 0 
[  353.287304] P4D 0 

[  353.288695] Oops: 0000 [#2] SMP
[  353.289158] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate intel_rapl_perf 
ipmi_ssif joydev input_leds ipmi_si ipmi_devintf ipmi_msghandler 
acpi_power_meter lpc_ich shpchp acpi_pad mac_hid mei_me mei ib_iser rdma_cm 
iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 
autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure 
scsi_transport_sas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc fnic 
mgag200 ttm hid_generic drm_kms_helper syscopyarea igb sysfillrect aesni_intel 
sysimgblt usbhid libfcoe fb_sys_fops aes_x86_64 dca hid crypto_simd 
i2c_algo_bit mxm_wmi glue_helper ptp cryptd ahci libfc libahci
[  353.294318]  drm pps_core megaraid_sas scsi_transport_fc enic wmi
[  353.295246] CPU: 8 PID: 56 Comm: cpuhp/8 Tainted: G      D         4.13.0-13-generic #14~lp1733662Commitac2fc5adab0f4
[  353.296231] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  353.297274] task: ffff9b62b8fc0000 task.stack: ffffb973cc780000
[  353.298341] RIP: 0010:__kmalloc_node+0x135/0x2a0
[  353.299416] RSP: 0018:ffffb973cc783bb0 EFLAGS: 00010246
[  353.300511] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000008a2
[  353.301652] RDX: 00000000000008a1 RSI: 0000000000000000 RDI: 000000000001f3e0
[  353.302793] RBP: ffffb973cc783bf0 R08: ffff9b62bf21f3e0 R09: ffff9b42bf807c00
[  353.303960] R10: 000000000000024c R11: 0000000000020dd1 R12: 00000000014080c0
[  353.305155] R13: 0000000000000008 R14: 0000317865646e69 R15: ffff9b42bf807c00
[  353.306379] FS:  0000000000000000(0000) GS:ffff9b62bf200000(0000) knlGS:0000000000000000
[  353.307637] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  353.308901] CR2: 0000317865646e69 CR3: 0000002343409000 CR4: 00000000001406e0
[  353.310205] Call Trace:
[  353.311531]  ? alloc_cpumask_var_node+0x1f/0x30
[  353.312881]  alloc_cpumask_var_node+0x1f/0x30
[  353.314245]  zalloc_cpumask_var+0x14/0x20
[  353.315616]  cpudl_init+0x6a/0xe0
[  353.316992]  init_rootdomain+0x7a/0xd0
[  353.318393]  build_sched_domains+0x26a/0xdd0
[  353.319817]  ? call_rcu_sched+0x17/0x20
[  353.321249]  ? cpu_attach_domain+0x1af/0x6a0
[  353.322698]  ? kfree+0x14a/0x160
[  353.324146]  partition_sched_domains+0x1c6/0x2f0
[  353.325623]  ? sched_cpu_activate+0xd0/0xd0
[  353.327122]  cpuset_update_active_cpus+0x17/0x40
[  353.328583]  sched_cpu_deactivate+0x94/0xd0
[  353.330052]  ? call_rcu_bh+0x20/0x20
[  353.331495]  ? call_rcu_bh+0x20/0x20
[  353.332894]  ? trace_raw_output_rcu_utilization+0x50/0x50
[  353.334320]  ? pick_next_task_fair+0x48e/0x560
[  353.335736]  cpuhp_invoke_callback+0x84/0x3b0
[  353.337164]  cpuhp_down_callbacks+0x42/0x80
[  353.338579]  cpuhp_thread_fun+0x88/0xe0
[  353.339971]  smpboot_thread_fn+0xec/0x160
[  353.341346]  kthread+0x125/0x140
[  353.342723]  ? sort_range+0x30/0x30
[  353.344106]  ? kthread_create_on_node+0x70/0x70
[  353.345521]  ret_from_fork+0x25/0x30
[  353.346928] Code: 89 cf 4c 89 4d c0 e8 0b 7f 01 00 49 89 c7 4c 8b 4d c0 4d 
85 ff 0f 85 47 ff ff ff 45 31 f6 eb 3c 49 63 47 20 49 8b 3f 48 8d 4a 01 <49> 8b 
1c 06 4c 89 f0 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84 20 ff 
[  353.349833] RIP: __kmalloc_node+0x135/0x2a0 RSP: ffffb973cc783bb0
[  353.351218] CR2: 0000317865646e69
[  353.352559] ---[ end trace 67fe147f4dd931ec ]---
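
A side note on the faulting address in this second trace: the value in
CR2 and R14, 0000317865646e69, looks more like text than a pointer. Read
as little-endian bytes (as stored in memory on x86-64), it is printable
ASCII. A minimal Python sketch to check this; the helper name is my own
and purely illustrative:

```python
def address_as_ascii(addr, width=8):
    """Render an integer as the bytes it would occupy in little-endian memory.

    Printable ASCII bytes are shown as characters, everything else as '.'.
    """
    raw = addr.to_bytes(width, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Faulting address taken from the CR2/R14 lines of the oops above.
print(address_as_ascii(0x0000317865646E69))  # index1..
```

That gives "index1" followed by two zero bytes. This is only an
observation, not a diagnosis; the trace alone does not show what wrote
that string where the allocator expected a pointer.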

Although the test script hung, I was able to continue using my other
terminal normally: I could run other programs, log out, log back in,
etc. An attempt to shut down ("sudo shutdown -h now") did not succeed;
the system hung with "[ OK ] Stopped target Multi-User System" on the
console. After forcing a restart via the BMC, I ran the test script
again. It completed one run but hung on the second, with limited
functionality thereafter. The dmesg output from the second run included
the following:

[  103.752641] ------------[ cut here ]------------
[  103.752643] kernel BUG at /home/jsalisbury/bugs/lp1733662/ubuntu-artful/mm/slub.c:3878!
[  103.753548] invalid opcode: 0000 [#1] SMP
[  103.754440] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal 
intel_powerclamp ipmi_ssif coretemp joydev input_leds intel_cstate ipmi_si 
intel_rapl_perf mei_me ipmi_devintf ipmi_msghandler kvm_intel kvm irqbypass mei 
mac_hid shpchp acpi_power_meter lpc_ich acpi_pad ib_iser rdma_cm iw_cm ib_cm 
ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs 
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor 
raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure 
scsi_transport_sas crct10dif_pclmul mgag200 crc32_pclmul igb ttm hid_generic 
ghash_clmulni_intel drm_kms_helper fnic pcbc usbhid dca syscopyarea aesni_intel 
sysfillrect i2c_algo_bit sysimgblt fb_sys_fops hid libfcoe aes_x86_64 ahci ptp 
crypto_simd libfc glue_helper mxm_wmi cryptd drm
[  103.762134]  libahci pps_core enic scsi_transport_fc megaraid_sas wmi
[  103.763369] CPU: 0 PID: 3649 Comm: python3 Not tainted 4.13.0-13-generic #14~lp1733662Commitac2fc5adab0f4
[  103.764641] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  103.765948] task: ffff8e90a5999740 task.stack: ffff9dbb4e320000
[  103.767263] RIP: 0010:kfree+0x11c/0x160
[  103.768601] RSP: 0018:ffff9dbb4e323cb0 EFLAGS: 00010246
[  103.769941] RAX: fffffa5b3cff0020 RBX: ffff8eb000000000 RCX: 0000000000000000
[  103.771301] RDX: 0000000000000000 RSI: 0000000000000028 RDI: 0000718ec0000000
[  103.772663] RBP: ffff9dbb4e323cc8 R08: dead000000000100 R09: ffffffff985ed7a8
[  103.774049] R10: fffffa5b3d000000 R11: 0000000000000000 R12: 0000000000000028
[  103.775426] R13: ffffffff97eead09 R14: 000000000000000a R15: ffffffff977143f0
[  103.776809] FS:  00007f1e1c29f700(0000) GS:ffff8e90bfc00000(0000) knlGS:0000000000000000
[  103.778214] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  103.779645] CR2: 000055be9d7243a8 CR3: 0000003ff74a3000 CR4: 00000000001406f0
[  103.781094] Call Trace:
[  103.782527]  free_cpumask_var+0x9/0x10
[  103.783961]  smpcfd_dead_cpu+0x24/0x40
[  103.785415]  cpuhp_invoke_callback+0x84/0x3b0
[  103.786859]  ? flow_cache_lookup+0x4c0/0x4c0
[  103.788303]  cpuhp_down_callbacks+0x42/0x80
[  103.789745]  _cpu_down+0xc2/0x100
[  103.791191]  do_cpu_down+0x33/0x50
[  103.792624]  cpu_down+0x10/0x20
[  103.794056]  cpu_subsys_offline+0x14/0x20
[  103.795492]  device_offline+0x73/0xc0
[  103.796926]  online_store+0x4c/0xa0
[  103.798351]  dev_attr_store+0x18/0x30
[  103.799779]  sysfs_kf_write+0x37/0x40
[  103.801201]  kernfs_fop_write+0x11c/0x1a0
[  103.802634]  __vfs_write+0x18/0x40
[  103.804065]  vfs_write+0xb1/0x1a0
[  103.805485]  SyS_write+0x55/0xc0
[  103.806888]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  103.808310] RIP: 0033:0x7f1e1be7f4a0
[  103.809730] RSP: 002b:00007ffc4ead2768 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  103.811181] RAX: ffffffffffffffda RBX: 0000000001d8b410 RCX: 00007f1e1be7f4a0
[  103.812648] RDX: 0000000000000002 RSI: 0000000001ea1060 RDI: 0000000000000003
[  103.814122] RBP: 0000000000a3e020 R08: 0000000000000000 R09: 0000000000000001
[  103.815600] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
[  103.817048] R13: 0000000000501520 R14: 00007ffc4ead2bd0 R15: 00007f1e1ad98240
[  103.818475] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 
c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 
49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 2c 
[  103.821390] RIP: kfree+0x11c/0x160 RSP: ffff9dbb4e323cb0
[  103.822826] ---[ end trace 7c1d545f713a5ad1 ]---

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1733662

Title:
  System hang with Linux kernel 4.13, not with 4.10

Status in linux package in Ubuntu:
  In Progress
Status in linux-hwe package in Ubuntu:
  New
Status in linux source package in Artful:
  In Progress
Status in linux-hwe source package in Artful:
  New
Status in linux source package in Bionic:
  In Progress
Status in linux-hwe source package in Bionic:
  New

Bug description:
  In doing Ubuntu 17.10 regression testing, we've encountered one
  computer (boldore, a Cisco UCS C240 M4 [VIC]) that hangs about one in
  four times when running our cpu_offlining test. This test takes all
  the CPU cores offline except one, then brings them back online. It
  ran successfully on boldore with previous releases, but with 17.10
  the system hangs in about one in four runs. Reverting to Ubuntu
  16.04.3, I found no problems; but when I upgraded the 16.04.3
  installation to linux-image-4.13.0-16-generic, the problem appeared
  again, so I'm confident this is a problem with the kernel. I'm
  attaching two files, dmesg-output-4.10.txt and dmesg-output-4.13.txt,
  which show the dmesg output that appears when running the
  cpu_offlining test with the 4.10.0-38 and 4.13.0-16 kernels,
  respectively; the system hung on the 4.13 run. (I was running
  "dmesg -w" in a second SSH login; the files are cut-and-pasted from
  that.)
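
  For readers unfamiliar with CPU hotplug, the offline/online cycle
  described above can be sketched roughly as follows. This is not the
  actual cpu_offlining test script; it is a minimal illustration that
  assumes the standard Linux sysfs interface
  (/sys/devices/system/cpu/cpuN/online). The function names and the
  "keep" parameter are hypothetical, and the sysfs root is a parameter
  only so the logic can be tried against a scratch directory.

```python
import os

SYSFS_CPU = "/sys/devices/system/cpu"


def cpu_online_path(cpu, sysfs_root=SYSFS_CPU):
    """Path of the control file that takes one CPU offline or online."""
    return os.path.join(sysfs_root, f"cpu{cpu}", "online")


def hotpluggable_cpus(sysfs_root=SYSFS_CPU):
    """Sorted CPU numbers that expose an 'online' control file."""
    cpus = []
    for entry in os.listdir(sysfs_root):
        suffix = entry[3:]
        # Skip non-CPU entries such as 'cpufreq' and 'cpuidle'.
        if entry.startswith("cpu") and suffix.isdigit():
            if os.path.exists(cpu_online_path(int(suffix), sysfs_root)):
                cpus.append(int(suffix))
    return sorted(cpus)


def set_online(cpu, online, sysfs_root=SYSFS_CPU):
    """Write '1' (online) or '0' (offline); needs root on a real system."""
    with open(cpu_online_path(cpu, sysfs_root), "w") as f:
        f.write("1" if online else "0")


def cycle_cpus(keep=0, sysfs_root=SYSFS_CPU):
    """Offline every hotpluggable CPU except 'keep', then online them again.

    Returns the list of CPUs that were cycled.
    """
    cpus = [c for c in hotpluggable_cpus(sysfs_root) if c != keep]
    for cpu in cpus:
        set_online(cpu, False, sysfs_root)
    for cpu in cpus:
        set_online(cpu, True, sysfs_root)
    return cpus
```

  On a real machine this would be run as root, for example
  cycle_cpus(keep=0). Writing to the "online" file is consistent with
  the online_store and cpu_down frames in the python3 call trace above.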

  I initiated this bug report from an Ubuntu 16.04.3 installation
  running a 4.10 kernel; but as I said, this applies to the 4.13 kernel.

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-image-4.10.0-38-generic 4.10.0-38.42~16.04.1
  ProcVersionSignature: Ubuntu 4.10.0-38.42~16.04.1-generic 4.10.17
  Uname: Linux 4.10.0-38-generic x86_64
  ApportVersion: 2.20.1-0ubuntu2.10
  Architecture: amd64
  Date: Tue Nov 21 17:36:06 2017
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=<set>
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  SourcePackage: linux-hwe
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1733662/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
