On 2024/04/10 22:47, Aureau, Georges (Kernel Tools ERT) wrote: > Hello Lianbo, > > In my case, the fix you suggested works properly as I've got a single CPU > with an IRQ exception in a SYSVEC handler. > But it we have multiple CPUs with different IRQ exceptions, then your > suggestion does not work as > x86_64_irq_eframe_link() is changing irq_eframe_link, for example: > > x86_64_irq_eframe_link_init(): > set default irq_eframe_link=-24 since we have sysvec_apic_timer_interrupt > > Then if we do "bt -c cpu" on several cpu X, Y, X, being of their irq_stack: > > x86_64_irq_eframe_link()/cpu X/rip:sysvec_apic_timer_interrupt: > default irq_eframe_link=-24 is matching > -framesize(rip:sysvec_apic_timer_interrupt) > so we are ok > > x86_64_irq_eframe_link()/cpu Y/rip:common_interrupt: > -24 does not work as framesize(common_interrupt)=32 > we try -24-8 = -32, it works, and we set irq_eframe_link=-32 > > x86_64_irq_eframe_link()/cpu Z/rip:sysvec_apic_timer_interrupt: > -32 does not work as framesize(rip:sysvec_apic_timer_interrupt)=24 > we try -32-8 = -40, it does not work, and we are back to square one ! > here we should try -32+8 = -24 > > We would need to have "bt -c cpu" doing runtime per cpu selection between -32 > and -24 depending on framesize(*(pcpu_hot.hardirq_stack_ptr-8)). > So we can't really see "irq_eframe_link" as being a value valid for all cpu's > / IRQ. > The "irq_eframe_link" should be per cpu, depending of what each cpu is > executing (more below). > > Currently x86_64_irq_eframe_link() tries "stkref", then it tries > "stkref-irq_eframe_link" (with irq_eframe_link set in irq_eframe_link_init()), > and then it tries "stkref-irq_eframe_link+8": > > static ulong > x86_64_irq_eframe_link(ulong stkref, struct bt_info *bt, FILE *ofp) > { > ulong irq_eframe; > > if (x86_64_exception_frame(EFRAME_VERIFY, stkref, 0, bt, ofp)) > return stkref; > > irq_eframe = stkref - machdep->machspec->irq_eframe_link; > > if (x86_64_exception_frame(EFRAME_VERIFY, irq_eframe, 0, bt, ofp)) > return irq_eframe; > > if (x86_64_exception_frame(EFRAME_VERIFY, irq_eframe+8, 0, bt, ofp)) > { > machdep->machspec->irq_eframe_link -= 8; > return (irq_eframe + 8); > } > > return irq_eframe; > } > > > I don't have the history as why we try "stkref-irq_eframe_link+8" ? Any idea ? > > But this current design with defaulting irq_eframe_link to -32 (or -24 as per > your suggestion) does not always work. > > We would need to handle both irq_stack rip's from IRQ functions (eg. > common_interrupt): > > #define DEFINE_IDTENTRY_IRQ(func) \ > static void __##func(struct pt_regs *regs, u32 vector); \ > \ > __visible noinstr void func(struct pt_regs *regs, \ > unsigned long error_code) \ > { \ > irqentry_state_t state = irqentry_enter(regs); \ > u32 vector = (u32)(u8)error_code; \ > \ > instrumentation_begin(); \ > kvm_set_cpu_l1tf_flush_l1d(); \ > run_irq_on_irqstack_cond(__##func, regs, vector); \ > instrumentation_end(); \ > irqentry_exit(regs, state); \ > } \ > \ > static noinline void __##func(struct pt_regs *regs, u32 vector) > > > And also to handle irq_stack rip's from SYSVEC functions (eg. > sysvec_apic_timer_interrupt): > > #define DEFINE_IDTENTRY_SYSVEC(func) \ > static void __##func(struct pt_regs *regs); \ > \ > __visible noinstr void func(struct pt_regs *regs) \ > { \ > irqentry_state_t state = irqentry_enter(regs); \ > \ > instrumentation_begin(); \ > kvm_set_cpu_l1tf_flush_l1d(); \ > run_sysvec_on_irqstack_cond(__##func, regs); \ > instrumentation_end(); \ > irqentry_exit(regs, state); \ > } \ > \ > static noinline void __##func(struct pt_regs *regs) Other changes are > required. > > > Clearly we should be adjusting "irq_eframe_link" based on framesize for > either DEFINE_IDTENTRY_IRQ() functions or DEFINE_IDTENTRY_SYSVEC() functions. > Maybe we should have "irq_eframe_link" for DEFINE_IDTENTRY_IRQ() functions, > and another "sysvec_eframe_link" for DEFINE_IDTENTRY_SYSVEC() functions, then > try both links. > Anyhow, the current x86_64_irq_eframe_link_init() code is making assumption > on a single common framesize , and actually there is a couple of assumptions: > - assuming framesizes with hardcoded values > - assuming a single common framesize > > To keep backward compatibility with the current code, which is trying +8 (not > sure why), the patch I proposed was keeping the existing +8, but it is also > trying -8, and also +16 and -16, > as to try having more fallbacks in case we didn't guess the proper framesizes.
Thank you for the fix, I also have a vmcore printing "bogus exception frame" but there has been no good idea to fix. The patch fixed this so Acked-by: Kazuhito Hagio <k-hagio...@nec.com> $ diff -u before.log after.log @@ -577,14 +577,21 @@ #9 [ffff954a4683cfe0] __irq_exit_rcu at ffffffff86b0948f #10 [ffff954a4683cff0] sysvec_apic_timer_interrupt at ffffffff875fe8f2 --- <IRQ stack> --- - RIP: 0000000000000010 RSP: 0000000000000018 RFLAGS: ffff954a4437fe80 - RAX: 000000000000001f RBX: 000000000000a3e7 RCX: 000000000000000e - RDX: 000000003a2e8f1a RSI: 0000000000000000 RDI: ffffffffffffffff - RBP: ffffb53a3f580000 R8: ffff8a51bf5b0a80 R9: 0000001a95a9b7b8 - R10: 0000000000000000 R11: 00000000000022ef R12: 0000000000000004 - R13: ffffffff884acd60 R14: 0000001a95a9b7b8 R15: 0000000000000004 - ORIG_RAX: ffffffff872fe528 CS: 0202 SS: 1a92a013c6 -WARNING: possibly bogus exception frame +#11 [ffff954a4437fdd8] asm_sysvec_apic_timer_interrupt at ffffffff87800d86 + [exception RIP: cpuidle_enter_state+216] + RIP: ffffffff872fe528 RSP: ffff954a4437fe80 RFLAGS: 00000202 + RAX: ffff8a51bf5b0a80 RBX: ffffb53a3f580000 RCX: 000000000000001f + RDX: 000000000000000e RSI: 000000003a2e8f1a RDI: 0000000000000000 + RBP: 0000000000000004 R8: 0000001a95a9b7b8 R9: 0000000000000000 + R10: 00000000000022ef R11: 000000000000a3e7 R12: ffffffff884acd60 + R13: 0000001a95a9b7b8 R14: 0000000000000004 R15: 0000000000000000 + ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 +#12 [ffff954a4437feb8] cpuidle_enter at ffffffff872fe8b9 +#13 [ffff954a4437fed8] cpuidle_idle_call at ffffffff86b624dc +#14 [ffff954a4437ff10] do_idle at ffffffff86b625fb +#15 [ffff954a4437ff28] cpu_startup_entry at ffffffff86b62849 +#16 [ffff954a4437ff38] start_secondary at ffffffff86a62a0f +#17 [ffff954a4437ff50] secondary_startup_64_no_verify at ffffffff86a0015a Thanks, Kazu > > Regards, > Georges > > > > -----Original Message----- > From: Lianbo Jiang <liji...@redhat.com> > Sent: Wednesday, April 10, 2024 6:11 AM > To: devel@lists.crash-utility.osci.io > Cc: Aureau, Georges (Kernel Tools ERT) <georges.aur...@hpe.com> > Subject: Re: [PATCH] x86_64: rhel 9.3: "crash" doesn't handle all IRQ > exception properly > > Hi, Aureau > > Thank you for the fix. > > On 4/5/24 08:38, devel-requ...@lists.crash-utility.osci.io wrote: >> Date: Thu, 4 Apr 2024 14:39:06 +0000 >> From: "Aureau, Georges (Kernel Tools ERT)"<georges.aur...@hpe.com> >> Subject: [Crash-utility] [PATCH] x86_64: rhel 9.3: "crash" doesn't >> handle all IRQ exception properly. >> To:"devel@lists.crash-utility.osci.io" >> <devel@lists.crash-utility.osci.io> >> Message-ID: <sj0pr84mb14821a83d9631672b4f0f9449f...@sj0pr84mb1482.namp >> RD84.PROD.OUTLOOK.COM> >> Content-Type: text/plain; charset="us-ascii" >> >> >> Hello, >> >> On x86_64, with rhel 9.3, there are cases where "crash" can't handle IRQ >> exception frames properly. >> >> For example, "crash" fails with "WARNING: possibly bogus exception frame": >> >> crash> bt -c 30 >> PID: 2898241 TASK: ff4cb0ce0da0c680 CPU: 30 COMMAND: "star-ccm+" >> #0 [fffffe4658d88e58] crash_nmi_callback at ffffffffa00675e8 >> #1 [fffffe4658d88e68] nmi_handle at ffffffffa002ebab >> #2 [fffffe4658d88eb0] default_do_nmi at ffffffffa0c56cd0 >> #3 [fffffe4658d88ed0] exc_nmi at ffffffffa0c56ee1 >> #4 [fffffe4658d88ef0] end_repeat_nmi at ffffffffa0e01639 >> [exception RIP: native_queued_spin_lock_slowpath+636] >> RIP: ffffffffa0c6b80c RSP: ff5eba269937ce10 RFLAGS: 00000046 >> RAX: 0000000000000000 RBX: ff4cb12bcfbb2940 RCX: 00000000007c0000 >> RDX: 0000000000000057 RSI: 0000000001600000 RDI: ff4cb12d67205608 >> RBP: ff4cb12d67205608 R8: ff90b9c682475940 R9: 0000000000000000 >> R10: ff4cb0d70ee4e100 R11: 0000000000000000 R12: 0000000000000000 >> R13: 000000000000001e R14: ff90b9c682474918 R15: ff90b9c682477120 >> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 >> --- <NMI exception stack> --- >> #5 [ff5eba269937ce10] native_queued_spin_lock_slowpath at ffffffffa0c6b80c >> #6 [ff5eba269937ce30] _raw_spin_lock_irqsave at ffffffffa0c6ad90 >> #7 [ff5eba269937ce40] free_iova at ffffffffa07df716 >> #8 [ff5eba269937ce68] fq_ring_free at ffffffffa07dba15 >> #9 [ff5eba269937ce98] fq_flush_timeout at ffffffffa07dc8fe >> #10 [ff5eba269937ced0] call_timer_fn at ffffffffa01c00a4 >> #11 [ff5eba269937cef0] __run_timers at ffffffffa01c03ae >> #12 [ff5eba269937cf88] run_timer_softirq at ffffffffa01c0476 >> #13 [ff5eba269937cf90] __do_softirq at ffffffffa0c6c007 >> #14 [ff5eba269937cfe0] __irq_exit_rcu at ffffffffa010ef61 >> #15 [ff5eba269937cff0] sysvec_apic_timer_interrupt at ffffffffa0c58ca2 >> --- <IRQ stack> --- >> RIP: 0000000000000010 RSP: 0000000000000018 RFLAGS: ff5eba26ddc9f7e8 >> RAX: 0000000000000a20 RBX: ff5eba26ddc9f940 RCX: 0000000000001000 >> RDX: ffffffb559980000 RSI: ff4cb12d67207400 RDI: ffffffffffffffff >> RBP: 0000000000001000 R8: ff5eba26ddc9f940 R9: ff5eba26ddc9f8af >> R10: 0000000000000003 R11: 0000000000000a20 R12: ff5eba26ddc9f8b0 >> R13: 000000283c07f000 R14: ff4cb0f5a29a1c00 R15: 0000000000000001 >> ORIG_RAX: ffffffffa07c4e60 CS: 0206 SS: 7000001cf0380001 >> bt: WARNING: possibly bogus exception frame >> >> >> Running "crash" with "--machdep irq_eframe_link=0xffffffffffffffe8" (ie. >> thus irq_eframe_link = -24) works properly: >> >> PID: 2898241 TASK: ff4cb0ce0da0c680 CPU: 30 COMMAND: "star-ccm+" >> #0 [fffffe4658d88e58] crash_nmi_callback at ffffffffa00675e8 >> #1 [fffffe4658d88e68] nmi_handle at ffffffffa002ebab >> #2 [fffffe4658d88eb0] default_do_nmi at ffffffffa0c56cd0 >> #3 [fffffe4658d88ed0] exc_nmi at ffffffffa0c56ee1 >> #4 [fffffe4658d88ef0] end_repeat_nmi at ffffffffa0e01639 >> [exception RIP: native_queued_spin_lock_slowpath+636] >> RIP: ffffffffa0c6b80c RSP: ff5eba269937ce10 RFLAGS: 00000046 >> RAX: 0000000000000000 RBX: ff4cb12bcfbb2940 RCX: 00000000007c0000 >> RDX: 0000000000000057 RSI: 0000000001600000 RDI: ff4cb12d67205608 >> RBP: ff4cb12d67205608 R8: ff90b9c682475940 R9: 0000000000000000 >> R10: ff4cb0d70ee4e100 R11: 0000000000000000 R12: 0000000000000000 >> R13: 000000000000001e R14: ff90b9c682474918 R15: ff90b9c682477120 >> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 >> --- <NMI exception stack> --- >> #5 [ff5eba269937ce10] native_queued_spin_lock_slowpath at ffffffffa0c6b80c >> #6 [ff5eba269937ce30] _raw_spin_lock_irqsave at ffffffffa0c6ad90 >> #7 [ff5eba269937ce40] free_iova at ffffffffa07df716 >> #8 [ff5eba269937ce68] fq_ring_free at ffffffffa07dba15 >> #9 [ff5eba269937ce98] fq_flush_timeout at ffffffffa07dc8fe >> #10 [ff5eba269937ced0] call_timer_fn at ffffffffa01c00a4 >> #11 [ff5eba269937cef0] __run_timers at ffffffffa01c03ae >> #12 [ff5eba269937cf88] run_timer_softirq at ffffffffa01c0476 >> #13 [ff5eba269937cf90] __do_softirq at ffffffffa0c6c007 >> #14 [ff5eba269937cfe0] __irq_exit_rcu at ffffffffa010ef61 >> #15 [ff5eba269937cff0] sysvec_apic_timer_interrupt at ffffffffa0c58ca2 >> --- <IRQ stack> --- >> #16 [ff5eba26ddc9f738] asm_sysvec_apic_timer_interrupt at ffffffffa0e00e06 >> [exception RIP: alloc_pte.constprop.0+32] >> RIP: ffffffffa07c4e60 RSP: ff5eba26ddc9f7e8 RFLAGS: 00000206 >> RAX: ff5eba26ddc9f940 RBX: 0000000000001000 RCX: 0000000000000a20 >> RDX: 0000000000001000 RSI: ffffffb559980000 RDI: ff4cb12d67207400 >> RBP: ff5eba26ddc9f8b0 R8: ff5eba26ddc9f8af R9: 0000000000000003 >> R10: 0000000000000a20 R11: ff5eba26ddc9f940 R12: 000000283c07f000 >> R13: ff4cb0f5a29a1c00 R14: 0000000000000001 R15: ff4cb0f5a29a1bf8 >> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 >> #17 [ff5eba26ddc9f830] iommu_v1_map_pages at ffffffffa07c5648 >> #18 [ff5eba26ddc9f8f8] __iommu_map at ffffffffa07d7803 >> #19 [ff5eba26ddc9f990] iommu_map_sg at ffffffffa07d7b71 >> #20 [ff5eba26ddc9f9f0] iommu_dma_map_sg at ffffffffa07ddcc9 >> #21 [ff5eba26ddc9fa90] __dma_map_sg_attrs at ffffffffa01b5205 >> #22 [ff5eba26ddc9fa98] dma_map_sgtable at ffffffffa01b526d >> #23 [ff5eba26ddc9faa8] ib_umem_get at ffffffffc112f308 [ib_uverbs] >> #24 [ff5eba26ddc9fb00] mlx5_ib_reg_user_mr at ffffffffc11eb991 >> [mlx5_ib] >> #25 [ff5eba26ddc9fb40] ib_uverbs_reg_mr at ffffffffc1123bc4 >> [ib_uverbs] >> #26 [ff5eba26ddc9fbb0] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE at >> ffffffffc112d064 [ib_uverbs] >> #27 [ff5eba26ddc9fbe0] ib_uverbs_run_method at ffffffffc112a121 >> [ib_uverbs] >> #28 [ff5eba26ddc9fc30] ib_uverbs_cmd_verbs at ffffffffc112a332 >> [ib_uverbs] >> #29 [ff5eba26ddc9fe68] ib_uverbs_ioctl at ffffffffc112a494 [ib_uverbs] >> #30 [ff5eba26ddc9fea8] __x64_sys_ioctl at ffffffffa0438f67 >> #31 [ff5eba26ddc9fed8] do_syscall_64 at ffffffffa0c55189 >> #32 [ff5eba26ddc9ff50] entry_SYSCALL_64_after_hwframe at ffffffffa0e000ea >> RIP: 0000146ef3a3ec6b RSP: 00007ffe3b7dc5b8 RFLAGS: 00000246 >> RAX: ffffffffffffffda RBX: 00007ffe3b7dc6a8 RCX: 0000146ef3a3ec6b >> RDX: 00007ffe3b7dc690 RSI: 00000000c0181b01 RDI: 0000000000000011 >> RBP: 00007ffe3b7dc670 R8: 0000146ed9549010 R9: 00007ffe3b7dc7c4 >> R10: 0000000000000009 R11: 0000000000000246 R12: 00007ffe3b7dc7c4 >> R13: 000000000000000c R14: 000000000000000c R15: 0000146ed9549150 >> ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b >> >> >> Some background: >> >> asm_common_interrupt: >> callq error_entry >> movq %rax,%rsp >> movq %rsp,%rdi >> movq 0x78(%rsp),%rsi >> movq $-0x1,0x78(%rsp) >> call common_interrupt # rsp pointing to regs >> >> common_interrupt >> pushq %r12 >> pushq %rbp >> pushq %rbx >> [...] >> movq hardirq_stack_ptr,%r11 >> movq %rsp,(%r11) >> movq %r11,%rsp >> [...] >> call __common_interrupt # rip:common_interrupt >> >> So frame_size(rip:common_interrupt) = 32 (3 push + ret). >> >> Hence "machdep->machspec->irq_eframe_link = -32;" (see >> x86_64_irq_eframe_link_init()). >> >> Now: >> >> asm_sysvec_apic_timer_interrupt: >> pushq $-0x1 >> callq error_entry >> movq %rax,%rsp >> movq %rsp,%rdi >> callq sysvec_apic_timer_interrupt >> >> sysvec_apic_timer_interrupt >> pushq %r12 >> pushq %rbp >> [...] >> movq hardirq_stack_ptr,%r11 >> movq %rsp,(%r11) >> movq %r11,%rsp >> [...] >> call __sysvec_apic_timer_interrupt # >> rip:sysvec_apic_timer_interrupt >> >> Here frame_size(rip:sysvec_apic_timer_interrupt) = 24 (2 push + ret) >> >> We should also notice that: >> >> rip = *(hardirq_stack_ptr - 8) >> rsp = *(hardirq_stack_ptr) >> regs = rsp - frame_size(rip) >> >> But x86_64_get_framesize() does not work with IRQ handlers (returns 0). >> So not many options other than hardcoding the most likely value and looking >> around it. >> Actually x86_64_irq_eframe_link() was trying -32 (default), and then -40, >> but not -24 ! >> >> Signed-off-by: Georges Aureau<georges.aur...@hpe.com> >> -- >> diff --git a/x86_64.c b/x86_64.c >> index aec82b0..90c2a91 100644 >> --- a/x86_64.c >> +++ b/x86_64.c >> @@ -6623,13 +6623,14 @@ x86_64_irq_eframe_link_init(void) >> >> /* >> * Calculate and verify the IRQ exception frame location from the >> - * stack reference at the top of the IRQ stack, possibly adjusting >> - * the ms->irq_eframe_link value. >> + * stack reference at the top of the IRQ stack, keep >> + ms->irq_eframe_link >> + * as the most likely value, and try a few sizes around it. >> */ >> static ulong >> x86_64_irq_eframe_link(ulong stkref, struct bt_info *bt, FILE *ofp) >> { >> ulong irq_eframe; >> + int i, try[] = { 8, -8, 16, -16 }; >> >> if (x86_64_exception_frame(EFRAME_VERIFY, stkref, 0, bt, ofp)) >> return stkref; >> @@ -6639,9 +6640,10 @@ x86_64_irq_eframe_link(ulong stkref, struct bt_info >> *bt, FILE *ofp) >> if (x86_64_exception_frame(EFRAME_VERIFY, irq_eframe, 0, bt, ofp)) >> return irq_eframe; >> >> - if (x86_64_exception_frame(EFRAME_VERIFY, irq_eframe+8, 0, bt, ofp)) >> { >> - machdep->machspec->irq_eframe_link -= 8; >> - return (irq_eframe + 8); >> + for (i=0; i<sizeof(try)/sizeof(int); i++) { >> + if (x86_64_exception_frame(EFRAME_VERIFY, irq_eframe+try[i], >> 0, bt, ofp)) { >> + return (irq_eframe + try[i]); >> + } >> } >> >> return irq_eframe; > > > Can you help to try the following changes? It could be good to check the > relevant symbol, but I'm not sure if this can work well for you and cover > some corner cases. > > diff --git a/x86_64.c b/x86_64.c > index aec82b03b62e..251e224c013b 100644 > --- a/x86_64.c > +++ b/x86_64.c > @@ -6574,6 +6574,8 @@ x86_64_irq_eframe_link_init(void) > if (symbol_exists("asm_common_interrupt")) { > if (symbol_exists("asm_call_on_stack")) > machdep->machspec->irq_eframe_link = -64; > + else if (symbol_exists("sysvec_apic_timer_interrupt")) > + machdep->machspec->irq_eframe_link = -24; > else > machdep->machspec->irq_eframe_link = -32; > return; > > > Thanks > > Lianbo > > -- > Crash-utility mailing list -- devel@lists.crash-utility.osci.io > To unsubscribe send an email to devel-le...@lists.crash-utility.osci.io > https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/ > Contribution Guidelines: https://github.com/crash-utility/crash/wiki -- Crash-utility mailing list -- devel@lists.crash-utility.osci.io To unsubscribe send an email to devel-le...@lists.crash-utility.osci.io https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/ Contribution Guidelines: https://github.com/crash-utility/crash/wiki