Control: tags -1 + moreinfo Hi Christian,
On Fri, Apr 21, 2023 at 02:21:19PM +0200, Christian Barth wrote: > Subject: linux-image-5.10.0-20-rt-amd64: RT-Kernel freeze during IO load > Package: linux-image-5.10.0-20-rt-amd64 > X-Debbugs-Cc: c.ba...@goepel.com > Version: 5.10.158-2 > Severity: normal > > Dear Maintainer, > > I have a problem with an application that leads to a > possible kernel freeze. > We tried to split the problem, but unfortunately we did not succeed. > > We were able to recreate a system freeze after about 3-15h using simple > echo > transfers over 6 PCIe cards. We ran our PCIe driver without the API part > and > with a simple echo test several times with no problem for more than 60h. > On the other hand, we tested the API part by simulating the interface to > the > driver using predefined responses. > > As you can see in the logs the first error seems to be always the same and > has to do with messagequeues (our server application uses mq as ipc). > Sometimes seconds or hours later there are more errors like in the 2nd > log. > > Kind regards, > Christian > > > [42546.217248] WARNING: CPU: 5 PID: 1717 at kernel/sched/core.c:2500 > set_task_cpu+0x1cd/0x1e0 > [42546.217255] Modules linked in: udp_diag tcp_diag inet_diag Pxi62xx(OE) > rfkill binfmt_misc intel_rapl_msr intel_rapl_common joydev > snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp dell_smm_hwmon > snd_hda_codec_realtek coretemp snd_hda_codec_generic ledtrig_audio > kvm_intel kvm irqbypass dell_wmi ghash_clmulni_intel snd_hda_intel > snd_intel_dspcfg aesni_intel soundwire_intel libaes > soundwire_generic_allocation crypto_simd cryptd glue_helper snd_soc_core > rapl snd_compress soundwire_cadence intel_cstate snd_hda_codec dell_smbios > snd_hda_core dcdbas snd_hwdep intel_uncore sparse_keymap soundwire_bus > dell_wmi_descriptor wmi_bmof intel_wmi_thunderbolt mei_hdcp mei_wdt > iTCO_wdt snd_pcm intel_pmc_bxt ee1004 snd_timer iTCO_vendor_support > serio_raw pcspkr sg watchdog snd soundcore mei_me mei intel_pch_thermal > evdev intel_pmc_core acpi_pad parport_pc ppdev lp parport fuse configfs > ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic > raid10 raid456 async_raid6_recov > [42546.217314] async_memcpy async_pq async_xor async_tx xor raid6_pq > libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_microsoft > ff_memless hid_generic sd_mod t10_pi crc_t10dif crct10dif_generic usbhid > hid i915 ahci i2c_algo_bit xhci_pci drm_kms_helper libahci xhci_hcd > crct10dif_pclmul crct10dif_common cec e1000e libata crc32_pclmul > crc32c_intel psmouse i2c_i801 ptp usbcore drm pps_core scsi_mod i2c_smbus > usb_common wmi fan video button > [42546.217346] CPU: 5 PID: 1717 Comm: EchoTest Tainted: G OE > 5.10.0-21-rt-amd64 #1 Debian 5.10.162-1 > [42546.217349] Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS > 2.18.0 06/10/2021 > [42546.217350] RIP: 0010:set_task_cpu+0x1cd/0x1e0 > [42546.217353] Code: c9 0f 84 a0 fe ff ff 8b 43 68 83 f8 02 0f 84 94 fe ff > ff 0f 0b e9 8d fe ff ff 0f 0b 66 83 bb c0 07 00 00 00 0f 84 9c fe ff ff > <0f> 0b e9 95 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 > [42546.217356] RSP: 0018:ffffb96002873d48 EFLAGS: 00010002 > [42546.217358] RAX: 0000000000000000 RBX: ffff9245595d8000 RCX: > 0000000000000000 > [42546.217359] RDX: ffff924c0dcf1278 RSI: 0000000000000002 RDI: > ffff9245595d8000 > [42546.217361] RBP: 0000000000000002 R08: ffff924c0dcf1278 R09: > 0000000000000000 > [42546.217362] R10: 0000000000000001 R11: 0000000000000000 R12: > 0000000000000002 > [42546.217363] R13: ffff9245595d8028 R14: 0000000000000000 R15: > ffff924c0dcf09c0 > [42546.217365] FS: 00007f97fa35b740(0000) GS:ffff924c0dd40000(0000) > knlGS:0000000000000000 > [42546.217367] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [42546.217369] CR2: 00007fcbfc13d000 CR3: 0000000138674005 CR4: > 00000000003706e0 > [42546.217370] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [42546.217371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [42546.217373] Call Trace: > [42546.217377] push_rt_task.part.0+0x1bf/0x420 > [42546.217382] task_woken_rt+0x5d/0x70 > [42546.217385] ttwu_do_wakeup+0x45/0x190 > [42546.217388] try_to_wake_up+0x19b/0x6b0 > [42546.217391] __wake_up_q+0x99/0xd0 > [42546.217395] do_mq_timedsend+0x23b/0x410 > [42546.217400] __x64_sys_mq_timedsend+0x70/0xb0 > [42546.217403] ? exit_to_user_mode_prepare+0x33/0x1a0 > [42546.217407] do_syscall_64+0x30/0x80 > [42546.217410] entry_SYSCALL_64_after_hwframe+0x61/0xc6 > [42546.217414] RIP: 0033:0x7f97fa698040 > [42546.217417] Code: 89 4c 24 14 e8 61 04 00 00 4c 8b 44 24 18 44 8b 54 24 > 14 41 89 c1 48 8b 54 24 08 48 8b 34 24 b8 f2 00 00 00 8b 7c 24 10 0f 05 > <48> 3d 00 f0 ff ff 77 30 44 89 cf 89 44 24 10 e8 8c 04 00 00 8b 44 > [42546.217418] RSP: 002b:00007ffc369eeb90 EFLAGS: 00000293 ORIG_RAX: > 00000000000000f2 > [42546.217421] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: > 00007f97fa698040 > [42546.217422] RDX: 000000000000003f RSI: 000055b54a7a31c8 RDI: > 0000000000000008 > [42546.217424] RBP: 0000000000000000 R08: 0000000000000000 R09: > 0000000000000000 > [42546.217425] R10: 0000000000000000 R11: 0000000000000293 R12: > 0000000000000007 > [42546.217426] R13: 000055b54a7b4230 R14: 000000000000002b R15: > 00007f97fa9167c0 > [42546.217430] ---[ end trace 0000000000000002 ]--- > [42547.207213] sched: RT throttling activated > > > [23686.539555] WARNING: CPU: 6 PID: 1811 at kernel/sched/core.c:2500 > set_task_cpu+0x1cd/0x1e0 > [23686.539563] Modules linked in: Pxi62xx(OE) rfkill binfmt_misc joydev > intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal dell_smm_hwmon > snd_hda_codec_hdmi intel_powerclamp coretemp snd_hda_codec_realtek > snd_hda_codec_generic ledtrig_audio kvm_intel kvm irqbypass dell_wmi > ghash_clmulni_intel mei_hdcp mei_wdt aesni_intel snd_hda_intel libaes > snd_intel_dspcfg crypto_simd cryptd soundwire_intel glue_helper > soundwire_generic_allocation rapl snd_soc_core snd_compress intel_cstate > soundwire_cadence dell_smbios dcdbas snd_hda_codec iTCO_wdt serio_raw > sparse_keymap dell_wmi_descriptor wmi_bmof intel_wmi_thunderbolt > intel_uncore snd_hda_core intel_pmc_bxt snd_hwdep iTCO_vendor_support > ee1004 pcspkr sg watchdog soundwire_bus snd_pcm snd_timer snd mei_me > soundcore mei intel_pch_thermal evdev acpi_pad intel_pmc_core parport_pc > ppdev lp parport fuse configfs ip_tables x_tables autofs4 ext4 crc16 > mbcache jbd2 btrfs blake2b_generic raid10 raid456 async_raid6_recov > async_memcpy async_pq async_xor > [23686.539604] async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 > multipath linear md_mod hid_microsoft ff_memless hid_generic sd_mod t10_pi > crc_t10dif crct10dif_generic usbhid hid i915 i2c_algo_bit xhci_pci > drm_kms_helper ahci xhci_hcd libahci cec libata crct10dif_pclmul > crct10dif_common crc32_pclmul e1000e usbcore crc32c_intel i2c_i801 ptp drm > psmouse scsi_mod pps_core i2c_smbus usb_common wmi fan video button > [23686.539625] CPU: 6 PID: 1811 Comm: EchoTest Tainted: G OE > 5.10.0-21-rt-amd64 #1 Debian 5.10.162-1 > [23686.539627] Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS > 2.18.0 06/10/2021 > [23686.539628] RIP: 0010:set_task_cpu+0x1cd/0x1e0 > [23686.539630] Code: c9 0f 84 a0 fe ff ff 8b 43 68 83 f8 02 0f 84 94 fe ff > ff 0f 0b e9 8d fe ff ff 0f 0b 66 83 bb c0 07 00 00 00 0f 84 9c fe ff ff > <0f> 0b e9 95 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 > [23686.539632] RSP: 0018:ffffa0f3c2467d48 EFLAGS: 00010002 > [23686.539634] RAX: 0000000000000000 RBX: ffff94b28ed6e400 RCX: > 0000000000000000 > [23686.539635] RDX: ffff94b68dcf1278 RSI: 0000000000000000 RDI: > ffff94b28ed6e400 > [23686.539636] RBP: 0000000000000000 R08: ffff94b68dcf1278 R09: > 0000000000000000 > [23686.539637] R10: 0000000000000001 R11: 0000000000000000 R12: > 0000000000000000 > [23686.539638] R13: ffff94b28ed6e428 R14: 0000000000000000 R15: > ffff94b68dcf09c0 > [23686.539639] FS: 00007f4d28db7740(0000) GS:ffff94b68dd80000(0000) > knlGS:0000000000000000 > [23686.539640] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [23686.539641] CR2: 00007f89d826c000 CR3: 0000000189bd2002 CR4: > 00000000003706e0 > [23686.539642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [23686.539643] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [23686.539644] Call Trace: > [23686.539647] push_rt_task.part.0+0x1bf/0x420 > [23686.539651] task_woken_rt+0x5d/0x70 > [23686.539652] ttwu_do_wakeup+0x45/0x190 > [23686.539655] try_to_wake_up+0x19b/0x6b0 > [23686.539657] __wake_up_q+0x99/0xd0 > [23686.539659] do_mq_timedsend+0x23b/0x410 > [23686.539662] ? pick_next_task_fair+0x182/0x3d0 > [23686.539664] ? _raw_spin_unlock_irq+0x1d/0x60 > [23686.539667] ? __schedule+0x824/0x980 > [23686.539670] __x64_sys_mq_timedsend+0x70/0xb0 > [23686.539672] do_syscall_64+0x30/0x80 > [23686.539674] entry_SYSCALL_64_after_hwframe+0x61/0xc6 > [23686.539677] RIP: 0033:0x7f4d290f4040 > [23686.539679] Code: 89 4c 24 14 e8 61 04 00 00 4c 8b 44 24 18 44 8b 54 24 > 14 41 89 c1 48 8b 54 24 08 48 8b 34 24 b8 f2 00 00 00 8b 7c 24 10 0f 05 > <48> 3d 00 f0 ff ff 77 30 44 89 cf 89 44 24 10 e8 8c 04 00 00 8b 44 > [23686.539680] RSP: 002b:00007fffbee9b040 EFLAGS: 00000293 ORIG_RAX: > 00000000000000f2 > [23686.539682] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: > 00007f4d290f4040 > [23686.539683] RDX: 0000000000000021 RSI: 000055e8bffa81c8 RDI: > 0000000000000008 > [23686.539683] RBP: 0000000000000000 R08: 0000000000000000 R09: > 0000000000000000 > [23686.539684] R10: 0000000000000000 R11: 0000000000000293 R12: > 0000000000000007 > [23686.539685] R13: 000055e8bffb9230 R14: 000000000000000d R15: > 00007f4d293727c0 > [23686.539689] ---[ end trace 0000000000000002 ]--- > [23687.529423] sched: RT throttling activated > [23687.529423] sched: RT throttling activated > [23687.529438] ------------[ cut here ]------------ > [23687.529440] DEBUG_LOCKS_WARN_ON(l->nestcnt == 0) > [23687.529448] WARNING: CPU: 0 PID: 1825 at > include/linux/local_lock_internal.h:67 refill_obj_stock+0x1ca/0x260 > [23687.529452] Modules linked in: Pxi62xx(OE) rfkill binfmt_misc joydev > intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal dell_smm_hwmon > snd_hda_codec_hdmi intel_powerclamp coretemp snd_hda_codec_realtek > snd_hda_codec_generic ledtrig_audio kvm_intel kvm irqbypass dell_wmi > ghash_clmulni_intel mei_hdcp mei_wdt aesni_intel snd_hda_intel libaes > snd_intel_dspcfg crypto_simd cryptd soundwire_intel glue_helper > soundwire_generic_allocation rapl snd_soc_core snd_compress intel_cstate > soundwire_cadence dell_smbios dcdbas snd_hda_codec iTCO_wdt serio_raw > sparse_keymap dell_wmi_descriptor wmi_bmof intel_wmi_thunderbolt > intel_uncore snd_hda_core intel_pmc_bxt snd_hwdep iTCO_vendor_support > ee1004 pcspkr sg watchdog soundwire_bus snd_pcm snd_timer snd mei_me > soundcore mei intel_pch_thermal evdev acpi_pad intel_pmc_core parport_pc > ppdev lp parport fuse configfs ip_tables x_tables autofs4 ext4 crc16 > mbcache jbd2 btrfs blake2b_generic raid10 raid456 async_raid6_recov > async_memcpy async_pq async_xor > [23687.529504] async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 > multipath linear md_mod hid_microsoft ff_memless hid_generic sd_mod t10_pi > crc_t10dif crct10dif_generic usbhid hid i915 i2c_algo_bit xhci_pci > drm_kms_helper ahci xhci_hcd libahci cec libata crct10dif_pclmul > crct10dif_common crc32_pclmul e1000e usbcore crc32c_intel i2c_i801 ptp drm > psmouse scsi_mod pps_core i2c_smbus usb_common wmi fan video button > [23687.529538] CPU: 0 PID: 1825 Comm: Srv-RequestT Tainted: G W OE > 5.10.0-21-rt-amd64 #1 Debian 5.10.162-1 > [23687.529540] Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS > 2.18.0 06/10/2021 > [23687.529541] RIP: 0010:refill_obj_stock+0x1ca/0x260 > [23687.529544] Code: 07 ff ff ff e8 77 bd 1b 00 85 c0 74 1f 8b 0d b5 07 61 > 01 85 c9 75 15 48 c7 c6 06 42 ad 89 48 c7 c7 e2 f1 ab 89 e8 eb f0 60 00 > <0f> 0b 8b 15 f6 57 cb 01 85 d2 0f 84 1a ff ff ff e9 28 ff ff ff e8 > [23687.529547] RSP: 0018:ffffa0f3c255fdd0 EFLAGS: 00010286 > [23687.529549] RAX: 0000000000000000 RBX: ffff94b68dcedd60 RCX: > 0000000000000000 > [23687.529550] RDX: 0000000000000001 RSI: ffffffff89b10c0c RDI: > 00000000ffffffff > [23687.529552] RBP: ffff94b68dc2dd60 R08: ffffffff8a063d00 R09: > 000000008a992eeb > [23687.529553] R10: ffffffffffffffff R11: 0000000000000001 R12: > ffff94b68dc2dd60 > [23687.529553] R13: 00000000000017d0 R14: ffff94afa89714c0 R15: > ffff94af80043a00 > [23687.529555] FS: 00007f2cba7fc700(0000) GS:ffff94b68dc00000(0000) > knlGS:0000000000000000 > [23687.529556] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [23687.529557] CR2: 00007fcbd7797000 CR3: 00000001aa754004 CR4: > 00000000003706f0 > [23687.529560] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [23687.529561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [23687.529562] Call Trace: > [23687.529564] ? free_msg+0x1b/0x40 > [23687.529566] kfree+0x382/0x500 > [23687.529572] free_msg+0x1b/0x40 > [23687.529574] do_mq_timedreceive+0x370/0x550 > [23687.529578] __x64_sys_mq_timedreceive+0x70/0xb0 > [23687.529581] do_syscall_64+0x30/0x80 > [23687.529584] entry_SYSCALL_64_after_hwframe+0x61/0xc6 > [23687.529588] RIP: 0033:0x7f2cd31ce0f3 > [23687.529592] Code: 4c 24 20 e8 af 03 00 00 4c 8b 44 24 28 4c 8b 54 24 20 > 41 89 c1 48 8b 54 24 18 48 8b 74 24 10 b8 f3 00 00 00 8b 7c 24 08 0f 05 > <48> 3d 00 f0 ff ff 77 35 44 89 cf 48 89 44 24 08 e8 d8 03 00 00 48 > [23687.529593] RSP: 002b:00007f2cba7fbdd0 EFLAGS: 00000293 ORIG_RAX: > 00000000000000f3 > [23687.529595] RAX: ffffffffffffffda RBX: 000000000000000a RCX: > 00007f2cd31ce0f3 > [23687.529596] RDX: 000000000000800c RSI: 00007f2cc400a530 RDI: > 000000000000001c > [23687.529597] RBP: 00007f2cc400a530 R08: 00007f2cba7fbe10 R09: > 0000000000000000 > [23687.529598] R10: 0000000000000000 R11: 0000000000000293 R12: > 000000000000800c > [23687.529599] R13: 000000000000001c R14: 0000000000000000 R15: > 0000000000000000 > [23687.529603] ---[ end trace 0000000000000003 ]--- If I see it correctly you have as well an out of tree module loaded, involved. Can you reproduce the issue without that beeing loaded in a repsective test setup? 5.10.158-2 is not the newest kernel in bullseye, can you reproduce the issue with 5.10.162-1? Regards, Salvatore The next weekend's point release aims to include as well a rebase to 5.10.178-1.