Hey Philip,
Thank you for the response. We think we've isolated an eBPF program
we're running that may be triggering this; I'll see on Monday if I can
get you some more information to help debug.
> 1) Can you please run the command:
apport-collect 2089318
Will aim to get you this on Monday when the team resumes investigation.
> 2) Is there anything I can do to increase the likelihood of
reproducing this?
I'll also see on Monday if I can get you a better picture of the data
involved, which could help with repro.
> 3) The bug title states you hit this on kernel version
5.15.0-1072-aws. Did you hit this on previous kernels, or is this a new
regression that has appeared in the 5.15.0-1072-aws kernel?
We were definitely able to reproduce this as well on 5.15.0-1070-aws,
and we think this has been a latent bug for a while which a recent
deploy may have exposed.
> 4) You state that the same issue occurs on Azure, and GCP. Is that
using the AWS kernel, or the Azure and GCP kernels (respectively)?
These are using the respective cloud kernels:
Azure - 5.15.0-1075-azure
GCP - 5.15.0-1071-gcp
Thanks again for taking a look - will aim to share more info on Monday.
Happy Thanksgiving!
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2089318
Title:
kernel hard lockup 5.15.0-1072-aws
Status in linux package in Ubuntu:
Triaged
Status in linux-aws-5.15 package in Ubuntu:
Triaged
Status in linux source package in Focal:
New
Status in linux-aws-5.15 source package in Focal:
New
Bug description:
Hi friends,
We hit a kernel hard lockup where all CPUs are stuck acquiring an
already-locked spinlock (css_set_lock) within the cgroup subsystem.
Below are the call stacks from a memory dump of a two-core system
taken on Ubuntu 20.04 (5.15 kernel) on AWS, but the same issue occurs
on Azure and GCP too. This happens non-deterministically (in less than
1% of cases) and can occur at any point during VM execution. We suspect
it’s a deadlock triggered by some race condition, but we don’t know for
sure.
```
PID: 21079 TASK: ffff91fdcd1dc000 CPU: 0 COMMAND: "sh"
#0 [fffffe7127850cb8] machine_kexec at ffffffffadc92680
#1 [fffffe7127850d18] __crash_kexec at ffffffffadda0b9f
#2 [fffffe7127850de0] panic at ffffffffae8f56be
#3 [fffffe7127850e70] unknown_nmi_error.cold at ffffffffae8eb4c8
#4 [fffffe7127850e90] default_do_nmi at ffffffffae99c639
#5 [fffffe7127850eb8] exc_nmi at ffffffffae99c7db
#6 [fffffe7127850ef0] end_repeat_nmi at ffffffffaea017f3
[exception RIP: native_queued_spin_lock_slowpath+63]
RIP: ffffffffadd40eff RSP: ffffa1f68589fc60 RFLAGS: 00000002 (interrupts disabled!)
RAX: 0000000000000001 RBX: ffffffffb0ea5804 RCX: ffff91fb597c8980
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffffb0ea5804
RBP: ffffa1f68589fc88 R8: 0000000000005259 R9: 00000000597c8980
R10: 0000000000000000 R11: 0000000000000000 R12: ffffa1f68589fdf8
R13: ffff91fdcd1d8000 R14: 0000000000004100 R15: ffff91fdcd1d8000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#7 [ffffa1f68589fc60] native_queued_spin_lock_slowpath at ffffffffadd40eff
#8 [ffffa1f68589fc90] _raw_spin_lock_irq at ffffffffae9af19a
#9 [ffffa1f68589fca0] cgroup_can_fork at ffffffffaddb0de8
#10 [ffffa1f68589fce8] copy_process at ffffffffadcc1938
#11 [ffffa1f68589fcf0] filemap_map_pages at ffffffffadeb68db
#12 [ffffa1f68589fdf0] __x64_sys_vfork at ffffffffadcc2a20
#13 [ffffa1f68589fe70] x64_sys_call at ffffffffadc068a9
#14 [ffffa1f68589fe80] do_syscall_64 at ffffffffae99a9e4
#15 [ffffa1f68589fec0] exit_to_user_mode_prepare at ffffffffadd725ad
#16 [ffffa1f68589ff00] irqentry_exit_to_user_mode at ffffffffae99f43e
#17 [ffffa1f68589ff10] irqentry_exit at ffffffffae99f46d
#18 [ffffa1f68589ff18] clear_bhb_loop at ffffffffaea018c5
#19 [ffffa1f68589ff28] clear_bhb_loop at ffffffffaea018c5
#20 [ffffa1f68589ff38] clear_bhb_loop at ffffffffaea018c5
#21 [ffffa1f68589ff50] entry_SYSCALL_64_after_hwframe at ffffffffaea00124
RIP: 00007fddfa4cebcc RSP: 00007fffaa741990 RFLAGS: 00000202
RAX: ffffffffffffffda RBX: 000055ea66750428 RCX: 00007fddfa4cebcc
RDX: 0000000000000000 RSI: 00007fffaa7419c0 RDI: 000055ea663c8866
RBP: 0000000000000003 R8: 00007fffaa7419c0 R9: 000055ea667505f0
R10: 0000000000000008 R11: 0000000000000202 R12: 00007fffaa7419c0
R13: 00007fffaa741ae0 R14: 0000000000000000 R15: 000055ea663de810
ORIG_RAX: 000000000000003a CS: 0033 SS: 002b
PID: 20304 TASK: ffff91fb05440000 CPU: 1 COMMAND: "Writer:Driver>C"
#0 [fffffe6c293d3e10] crash_nmi_callback at ffffffffadc81ec0
#1 [fffffe6c293d3e48] nmi_handle at ffffffffadc49b03
#2 [fffffe6c293d3e90] default_do_nmi at ffffffffae99c5a5
#3 [fffffe6c293d3eb8] exc_nmi at ffffffffae99c7db
#4 [fffffe6c293d3ef0] end_repeat_nmi at ffffffffaea017f3
[exception RIP: native_queued_spin_lock_slowpath+63]
RIP: ffffffffadd40eff RSP: ffffa1f6853afd00 RFLAGS: 00000002 (interrupts disabled!)
RAX: 0000000000000001 RBX: ffffffffb0ea5804 RCX: ffff91fa1d0aee00
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffffb0ea5804
RBP: ffffa1f6853afd28 R8: 000000000000525a R9: 000000001d0aee00
R10: 0000000000000000 R11: 0000000000000000 R12: ffffa1f6853afe98
R13: ffff91fd8eeea000 R14: 00000000003d0f00 R15: ffff91fd8eeea000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#5 [ffffa1f6853afd00] native_queued_spin_lock_slowpath at ffffffffadd40eff
#6 [ffffa1f6853afd30] _raw_spin_lock_irq at ffffffffae9af19a
#7 [ffffa1f6853afd40] cgroup_can_fork at ffffffffaddb0de8
#8 [ffffa1f6853afd88] copy_process at ffffffffadcc1938
#9 [ffffa1f6853afe20] kernel_clone at ffffffffadcc262d
#10 [ffffa1f6853afe90] __do_sys_clone at ffffffffadcc2a9d
#11 [ffffa1f6853aff10] __x64_sys_clone at ffffffffadcc2ae5
#12 [ffffa1f6853aff20] x64_sys_call at ffffffffadc05579
#13 [ffffa1f6853aff30] do_syscall_64 at ffffffffae99a9e4
#14 [ffffa1f6853aff50] entry_SYSCALL_64_after_hwframe at ffffffffaea00124
RIP: 00007f0d8bcac9f6 RSP: 00007f0cfabfcc38 RFLAGS: 00000206
RAX: ffffffffffffffda RBX: 00007f0cfabfcc90 RCX: 00007f0d8bcac9f6
RDX: 00007f0ced3ff910 RSI: 00007f0ced3feef0 RDI: 00000000003d0f00
RBP: ffffffffffffff80 R8: 00007f0ced3ff640 R9: 00007f0ced3ff640
R10: 00007f0ced3ff910 R11: 0000000000000206 R12: 00007f0ced3ff640
R13: 0000000000000016 R14: 00007f0d8bc1b7d0 R15: 00007f0cfabfcdf0
ORIG_RAX: 0000000000000038 CS: 0033 SS: 002b
```
Environment
```
$ uname -a
Linux ip-172-31-16-171 5.15.0-1072-aws #78~20.04.1-Ubuntu SMP Wed Oct 9
15:30:47 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
stepping : 6
microcode : 0xd0003e8
cpu MHz : 2900.036
cache size : 55296 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf
tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe
popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm
3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase
tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap
avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec
xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes
vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear
flush_l1d arch_capabilities
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs
mmio_stale_data eibrs_pbrsb gds bhi
bogomips : 5800.07
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
```
We see this very infrequently, but have experienced it on a variety of
instance types: at least r6i.large, r6i.xlarge, and r6i.2xlarge.
Thanks!
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2089318/+subscriptions