On Wed, Sep 16, 2020 at 08:39:58PM +0800, Feilong Lin wrote:
>
>
> On 2020/9/16 19:35, Corey Minyard wrote:
> > On Wed, Sep 16, 2020 at 04:46:50PM +0800, Feilong Lin wrote:
> >>
> >>
> >> On 2020/9/14 20:13, 河合英宏 / KAWAI,HIDEHIRO wrote:
> >>> Hi, Wu
> >>>
> >>>> From: Wu Bo <[email protected]>
> >>>> In the virtual machine, Use mce_inject to inject errors into the system.
> >>>> After mce-inject injects an uncorrectable error, there is a probability
> >>>> that the virtual machine is not reset immediately,
> >>>> but hangs for more than
> >>>> 3000 seconds. And the write_data array is accessed out of bounds.
> >>>>
> >>>> The real reason is that smi_event_handler lack of lock protection in the
> >>>> multi-threaded scenario, which causes write_pos
> >>>> to exceed the size of the write_data array.
> >>>
> >>> Thank you for the fix, but I wonder why this multi-threaded scenario
> >>> happens.
> >>> If my understanding is correct, only one CPU can run panic routines, and
> >>> this means only one CPU calls flush_messages. Did I miss some call path?
> >>
> >> from the log, we can see mce_inject occurs on each CPU, every one excutes
> >> nmi handler, that will prevent the nmi handler of stopping cpus, so
> >> multiple
> >> CPUs are running that time.
> >
> > Yes, MCEs generally occur on all CPUs. But both mce_panic() and panic()
> > prevent more than one CPU from running the panic code. Please read the
> > code.
> >
> > -corey
> >
> Yes, mce_panic and panic prevent more than one CPU from running the panic
> code.
> But one cpu do wait_for_panic, which enable irq again, allows ipmi irq
> occurs, when
> an other CPU do notifier_call_chain(include ipmi panic_event) in panic. if my
> analysis
> is not right, please correct me, thanks
> CPU A CPU B
>
> |-nmi_handle do mce_panic |-nmi_handle do
> do_machine_check
> | |
> |-panic |-wait_for_panic
> | |
> |-stop cpus ------------------------------------>(no response as
> already in nmi)
> | |
> |-notifier call(ipmi panic_event) |ipmi IRQ occurs,
> do smi_event_handler
Interrupts are disabled on all processors, so you can't get an IPMI
interrupt here.
I'm not discounting that you are having a problem, but I don't think
your analysis is correct. Adding a spinlock here can result in a lockup
in a panic. So there's no way you can do that.
I assume you have had some issue, can you describe that, and give a
reproduction method, if possible?
-corey
> | |
> \|/ \|/
> crash> bt -a
> PID: 0 TASK: ffffffff91c12780 CPU: 0 COMMAND: "swapper/0"
> #0 [fffffe0000007ba0] panic at ffffffff90ab2e4b
> #1 [fffffe0000007c28] mce_panic at ffffffff90a37ec6
> #2 [fffffe0000007c70] do_machine_check at ffffffff90a391ad
> #3 [fffffe0000007d80] raise_exception at ffffffffc0642117 [mce_inject]
> #4 [fffffe0000007e48] mce_raise_notify at ffffffffc0642a92 [mce_inject]
> #5 [fffffe0000007e58] nmi_handle at ffffffff90a21c73
> #6 [fffffe0000007eb0] default_do_nmi at ffffffff90a2213e
> #7 [fffffe0000007ed0] do_nmi at ffffffff90a2231c
> #8 [fffffe0000007ef0] end_repeat_nmi at ffffffff914016b4
> [exception RIP: native_safe_halt+14]
> RIP: ffffffff9127223e RSP: ffffffff91c03e90 RFLAGS: 00000246
> RAX: ffffffff91271f30 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: 0000000000000000 R8: 00000031237ac86a R9: 0000000000000001
> R10: ffff9f7c01397b88 R11: 0000000002a3fabf R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #9 [ffffffff91c03e90] native_safe_halt at ffffffff9127223e
> #10 [ffffffff91c03e90] default_idle at ffffffff91271f4a
> #11 [ffffffff91c03eb0] do_idle at ffffffff90ae959a
> #12 [ffffffff91c03ef0] cpu_startup_entry at ffffffff90ae981f
> #13 [ffffffff91c03f10] start_kernel at ffffffff9219b206
> #14 [ffffffff91c03f50] secondary_startup_64 at ffffffff90a000e7
>
> crash> bt -c 2
> PID: 0 TASK: ffff8e12c77ddf00 CPU: 2 COMMAND: "swapper/2"
> [exception RIP: port_outb+17]
> RIP: ffffffffc02e81a1 RSP: ffff8e12fad03e78 RFLAGS: 00000002
> RAX: 0000000000000008 RBX: ffff8e12c71c1c00 RCX: 0000000000000006
> RDX: 0000000000000ca2 RSI: 0000000000000000 RDI: ffff8e12f0b07040
> RBP: 00000000c02e8101 R8: 00000b9ecd91a1bc R9: 0000000036821fbb
> R10: 0000000000000004 R11: 0000000000000006 R12: 0000000000000000
> R13: 00000000c02e8102 R14: 0000000000000081 R15: 0000000000000000
> CS: 0010 SS: 0018
> #0 [ffff8e12fad03e78] kcs_event at ffffffffc02e729f [ipmi_si]
> #1 [ffff8e12fad03ea8] smi_event_handler at ffffffffc02e58f6 [ipmi_si]
> #2 [ffff8e12fad03ee8] ipmi_si_irq_handler at ffffffffc02e5edc [ipmi_si]
> #3 [ffff8e12fad03f08] __handle_irq_event_percpu at ffffffff90b1dfc0
> #4 [ffff8e12fad03f48] handle_irq_event_percpu at ffffffff90b1e140
> #5 [ffff8e12fad03f70] handle_irq_event at ffffffff90b1e1b6
> #6 [ffff8e12fad03f90] handle_fasteoi_irq at ffffffff90b21b42
> #7 [ffff8e12fad03fb0] handle_irq at ffffffff90a209e8
> #8 [ffff8e12fad03fc0] do_IRQ at ffffffff91401ee9
> --- <IRQ stack> ---
> #9 [fffffe000005db98] ret_from_intr at ffffffff91400a8f
> [exception RIP: delay_tsc+32]
> RIP: ffffffff9125fb50 RSP: fffffe000005dc48 RFLAGS: 00000246
> RAX: 00000000cb09ba94 RBX: 00000000003fab28 RCX: 00039776cb09b5cc
> RDX: 0000000000039776 RSI: 0000000000000002 RDI: 000000000000095b
> RBP: fffffe000005dcc0 R8: 0000000000000004 R9: fffffe000005dc5c
> R10: ffffffff91e05ae0 R11: 0000000000000000 R12: fffffe000005dcb0
> R13: 0000000000000001 R14: fffffe000005def8 R15: ffffffff91a6a2f0
> ORIG_RAX: ffffffffffffffdb CS: 0010 SS: 0018
> #10 [fffffe000005dc48] wait_for_panic at ffffffff90a37c6c
> #11 [fffffe000005dc58] mce_timed_out at ffffffff90a37f5d
> #12 [fffffe000005dc70] do_machine_check at ffffffff90a38db4
> #13 [fffffe000005dd80] raise_exception at ffffffffc0642117 [mce_inject]
> #14 [fffffe000005de48] mce_raise_notify at ffffffffc0642a92 [mce_inject]
> #15 [fffffe000005de58] nmi_handle at ffffffff90a21c73
> #16 [fffffe000005deb0] default_do_nmi at ffffffff90a2213e
> #17 [fffffe000005ded0] do_nmi at ffffffff90a2231c
> #18 [fffffe000005def0] end_repeat_nmi at ffffffff914016b4
> [exception RIP: native_safe_halt+14]
> RIP: ffffffff9127223e RSP: ffff9f7c0069beb0 RFLAGS: 00000246
> RAX: ffffffff91271f30 RBX: 0000000000000002 RCX: 0000000000000000
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: 0000000000000002 R8: 0000003123129836 R9: 0000000000000001
> R10: ffff9f7c00957c60 R11: 00000000054773c3 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #19 [ffff9f7c0069beb0] native_safe_halt at ffffffff9127223e
> #20 [ffff9f7c0069beb0] default_idle at ffffffff91271f4a
> #21 [ffff9f7c0069bed0] do_idle at ffffffff90ae959a
> #22 [ffff9f7c0069bf10] cpu_startup_entry at ffffffff90ae981f
> #23 [ffff9f7c0069bf30] start_secondary at ffffffff90a4e697
> #24 [ffff9f7c0069bf50] secondary_startup_64 at ffffffff90a000e7
>
> >>
> >> Best regards
> >>
> >>>
> >>> Best regards,
> >>>
> >>> Hidehiro Kawai
> >>> Hitachi, Ltd. Research & Development Group
> >>>
> >>>>
> >>>> Call Trace:
> >>>> nmi_handle
> >>>> -> mce_raise_notify
> >>>> -> panic_event
> >>>> -> ipmi_panic_request_and_wait
> >>>> -> flush_messages
> >>>> -> smi_event_handler
> >>>>
> >>>> Test steps:
> >>>> 1. # vim /tmp/uncorrected
> >>>> CPU 1 BANK 4
> >>>> STATUS uncorrected 0xc0
> >>>> MCGSTATUS EIPV MCIP
> >>>> ADDR 0x1234
> >>>> RIP 0xdeadbabe
> >>>> RAISINGCPU 0
> >>>> MCGCAP SER CMCI TES 0x6
> >>>>
> >>>> 2. # modprobe mce_inject
> >>>> 3. # cd /tmp
> >>>> 4. # mce-inject uncorrected
> >>>>
> >>>> The log as follows:
> >>>> [ 29.826892] core: [Hardware Error]: CPU 1: Machine Check Exception: 6
> >>>> Bank 4: b4000000000000c0
> >>>> [ 29.826893] core: [Hardware Error]: RIP 00:<00000000deadbabe>
> >>>> [ 29.826894] core: [Hardware Error]: TSC c5b604e3d3a ADDR 1234
> >>>> [ 29.826896] core: [Hardware Error]: PROCESSOR 0:50654 TIME 1598001986
> >>>> SOCKET 0 APIC 1 microcode 1
> >>>> [ 29.826896] core: [Hardware Error]: Run the above through 'mcelog
> >>>> --ascii'
> >>>> [ 29.826898] core: [Hardware Error]: Machine check: In kernel and no
> >>>> restart IP
> >>>> [ 29.826899] Kernel panic - not syncing: Fatal machine check
> >>>> [ 29.826899] kernel fault(0x5) notification starting on CPU 2
> >>>> [ 29.826904] kernel fault(0x5) notification finished on CPU 2
> >>>> [ 29.889538] ------------[ cut here ]------------
> >>>> [ 29.889547] do_IRQ(): swapper/0 has overflown the kernel stack
> >>>> (cur:ffffffffb5400000,sp:fffffe0000007c48,
> >>>> irq stk top-bottom:ffff98067ac00080-ffff98067ac04000,
> >>>> exception stk
> >>>> top-bottom:fffffe0000009080-fffffe000000b000,ip:delay_tsc+0x34/0x50)
> >>>> [ 29.889569] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/irq_64.c:73
> >>>> handle_irq+0x102/0x110
> >>>> [ 29.889569] Modules linked in: mce_inject core_module_frame(OE)
> >>>> kbox(O) sysmonitor(O) kboxdriver(O) sunrpc nfit
> >>>> [ 29.889604] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> >>>> BIOS
> >>>> rel-1.10.2-0-g5f4c7b1-20181220_000000-szxrtosci10000 04/01/2014
> >>>> [ 29.889607] RIP: 0010:handle_irq+0x102/0x110
> >>>> [ 29.889608] Code: 00 00 50 65 4c 8b 14 25 80 5c 01 00 57 49 81 c2 e0
> >>>> 0a 00 00 48 c7 c7 f8 53 26 b5 4c 89 d6 c6 05 4e
> >>>> 08 3a 01 01
> >>>> [ 29.889609] RSP: 0018:ffff98067ac03fa0 EFLAGS: 00010086
> >>>> [ 29.889611] RAX: 0000000000000000 RBX: ffff980675568800 RCX:
> >>>> 0000000000000006
> >>>> [ 29.889611] RDX: 0000000000000007 RSI: 0000000000000000 RDI:
> >>>> ffff98067ac16a00
> >>>> [ 29.889612] RBP: 0000000000000027 R08: 0000000c83bdf5cc R09:
> >>>> 00000000000002cf
> >>>> [ 29.889613] R10: ffffffffb5413260 R11: 0000000000000002 R12:
> >>>> fffffe0000007b98
> >>>> [ 29.889614] R13: 0000000000000027 R14: 0000000000000000 R15:
> >>>> 0000000000000000
> >>>> [ 29.889615] FS: 0000000000000000(0000) GS:ffff98067ac00000(0000)
> >>>> knlGS:0000000000000000
> >>>> [ 29.889616] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> [ 29.889617] CR2: 00007fd391992fb8 CR3: 00000000a0a0a005 CR4:
> >>>> 00000000003606f0
> >>>> [ 29.889624] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>>> 0000000000000000
> >>>> [ 29.889624] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> >>>> 0000000000000400
> >>>> [ 29.889625] Call Trace:
> >>>> [ 29.889629] <IRQ>
> >>>> [ 29.889633] ? delay_tsc+0x34/0x50
> >>>> [ 29.889638] do_IRQ+0x49/0xe0
> >>>> [ 29.889641] common_interrupt+0xf/0xf
> >>>> [ 29.889642] </IRQ>
> >>>> [ 29.889642] <NMI>
> >>>> [ 29.889644] RIP: 0010:delay_tsc+0x34/0x50
> >>>> [ 29.889646] Code: 5a 4b 0f ae e8 0f 31 48 c1 e2 20 48 89 d1 48 09 c1
> >>>> 0f ae e8 0f 31 48 c1 e2 20 48 09 d0 48 89 c2 48
> >>>> 29 ca 48 39 fa 73 15 f3 90 <65>
> >>>> [ 29.889647] RSP: 0018:fffffe0000007c48 EFLAGS: 00000287 ORIG_RAX:
> >>>> ffffffffffffffd8
> >>>> [ 29.889648] RAX: 00000c5b6825d6ec RBX: 00000000004b6c49 RCX:
> >>>> 00000c5b6825cf64
> >>>> [ 29.889649] RDX: 0000000000000788 RSI: 0000000000000000 RDI:
> >>>> 0000000000000835
> >>>> [ 29.889649] RBP: fffffe0000007cc0 R08: 0000000000000004 R09:
> >>>> fffffe0000007c5c
> >>>> [ 29.889650] R10: ffffffffb5605ae0 R11: 0000000000000000 R12:
> >>>> fffffe0000007cb0
> >>>> [ 29.889651] R13: 0000000000000001 R14: fffffe0000007ef8 R15:
> >>>> ffffffffb526a280
> >>>> [ 29.889654] ? __ndelay+0x40/0x40
> >>>> [ 29.889657] wait_for_panic+0x1c/0x60
> >>>> [ 29.889662] mce_timed_out+0x6d/0x90
> >>>> [ 29.889665] do_machine_check+0x8d4/0xdb0
> >>>> [ 29.889673] ? native_safe_halt+0xe/0x10
> >>>> [ 29.889679] raise_exception+0x47/0xb0 [mce_inject]
> >>>> [ 29.889684] ? __intel_pmu_enable_all+0x47/0x80
> >>>> [ 29.889689] ? native_apic_msr_write+0x27/0x30
> >>>> [ 29.889691] ? intel_pmu_handle_irq+0x10d/0x160
> >>>> [ 29.889692] mce_raise_notify+0x62/0x70 [mce_inject]
> >>>> [ 29.889694] ? raise_local+0xc0/0xc0 [mce_inject]
> >>>> [ 29.889696] nmi_handle+0x63/0x110
> >>>> [ 29.889699] default_do_nmi+0x4e/0x100
> >>>> [ 29.889702] do_nmi+0x12c/0x190
> >>>> [ 29.889704] end_repeat_nmi+0x16/0x6a
> >>>> [ 29.889707] RIP: 0010:native_safe_halt+0xe/0x10
> >>>> [ 29.889708] Code: eb bd 90 90 90 90 90 90 90 90 90 90 e9 07 00 00 00
> >>>> 0f 00 2d 46 6b 59 00 f4 c3 66 90
> >>>> [ 29.889709] RSP: 0018:ffffffffb5403e90 EFLAGS: 00000246
> >>>> [ 29.889710] RAX: ffffffffb4a71f30 RBX: 0000000000000000 RCX:
> >>>> 0000000000000000
> >>>> [ 29.889711] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
> >>>> 0000000000000000
> >>>> [ 29.889712] RBP: 0000000000000000 R08: 0000000c7fd07643 R09:
> >>>> 0000000000000001
> >>>> [ 29.889712] R10: 0000000000000401 R11: 00000000000003f4 R12:
> >>>> 0000000000000000
> >>>> [ 29.889715] R13: 0000000000000000 R14: 0000000000000000 R15:
> >>>> 0000000000000000
> >>>> [ 29.889719] ? __cpuidle_text_start+0x8/0x8
> >>>> [ 29.889721] ? first_nmi+0x1e/0x1e
> >>>> [ 29.889723] ? delay_tsc+0x3b/0x50
> >>>> [ 29.889724] </NMI>
> >>>> [ 29.889724] WARNING: stack recursion on stack type 6
> >>>> [ 29.889725] ---[ end trace e03eaf0134c6fcd6 ]---
> >>>> [ 3086.488900] BUG: unable to handle kernel paging request at
> >>>> ffff980680000001 [ 3086.488902] PGD a1801067 P4D a1801067
> >>>> PUD 0 [ 3086.488916] Oops: 0000 [#1] SMP PTI
> >>>> [ 3086.488919] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G
> >>>> M W OE
> >>>> [ 3086.488920] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> >>>> BIOS
> >>>> rel-1.10.2-0-g5f4c7b1-20181220_000000-szxrtosci10000 04/01/2014 [
> >>>> 3086.488927] RIP: 0010:kcs_event+0x3c2/0x890 [ipmi_si]
> >>>> [ 3086.488928] Code: 74 0e 48 8b 7b 08 31 f6 48 8b 07 e8 68 0f 9b f4 83
> >>>> bb 24 01 00 00 01 48 8b 7b 08 [ 3086.488930] RSP:
> >>>> 0018:fffffe000005d658 EFLAGS: 00010046 [ 3086.488931] RAX:
> >>>> 0000000009a24ff1 RBX: ffff9806765db000 RCX: 0000000000000000
> >>>> [ 3086.488931] RDX: 0000000000000ca2 RSI: 0000000000000000 RDI:
> >>>> ffff98067551fc40 [ 3086.488932] RBP: 0000000000000001
> >>>> R08: 0000000000000000 R09: ffffffff00080204 [ 3086.488933] R10:
> >>>> fffffe000005dae8 R11: fffffe000005db00 R12:
> >>>> 0000000000000000 [ 3086.488933] R13: fffffe000005db28 R14:
> >>>> ffff980674870000 R15: 0000000000000000 [ 3086.488934] FS:
> >>>> 0000000000000000(0000) GS:ffff98067ad00000(0000) knlGS:0000000000000000
> >>>> [ 3086.488935] CS: 0010 DS: 0000 ES: 0000 CR0:
> >>>> 0000000080050033 [ 3086.488936] CR2: ffff980680000001 CR3:
> >>>> 00000000a0a0a006 CR4: 00000000003606e0 [ 3086.488942] DR0:
> >>>> 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [
> >>>> 3086.488942] DR3: 0000000000000000 DR6: 00000000fffe0ff0
> >>>> DR7: 0000000000000400 [ 3086.488943] Call Trace:
> >>>> [ 3086.488946] <NMI>
> >>>> [ 3086.488950] smi_event_handler+0x4f/0x690 [ipmi_si] [ 3086.488952]
> >>>> flush_messages+0x10/0x40 [ipmi_si] [ 3086.488957]
> >>>> ipmi_panic_request_and_wait+0xf5/0x100 [ipmi_msghandler] [ 3086.488960]
> >>>> ? dummy_smi_done_handler+0x10/0x10
> >>>> [ipmi_msghandler] [ 3086.488967] ? kvm_sched_clock_read+0xd/0x20 [
> >>>> 3086.488970] ? sched_clock+0x5/0x10
> >>>> [ 3086.488975] ? sched_clock_cpu+0xc/0xa0 [ 3086.488980] ?
> >>>> log_store+0x278/0x2c0 [ 3086.488982] ?
> >>>> ipmi_si_platform_shutdown+0x20/0x20 [ipmi_si] [ 3086.488984] ?
> >>>> kcs_event+0x20/0x890 [ipmi_si] [ 3086.488986] ?
> >>>> ipmi_addr_length+0x40/0x40 [ipmi_msghandler] [ 3086.488988] ?
> >>>> panic_event+0x1d4/0x3e0 [ipmi_msghandler] [ 3086.488990]
> >>>> panic_event+0x1d4/0x3e0 [ipmi_msghandler] [ 3086.488993] ?
> >>>> vprintk_deferred+0x3a/0x40 [ 3086.488994] ?
> >>>> printk+0x52/0x6e [ 3086.488998] ? cpumask_next+0x17/0x20 [ 3086.489000]
> >>>> notifier_call_chain+0x47/0x70 [ 3086.489004]
> >>>> panic+0x17d/0x2c2 [ 3086.489007] mce_panic+0x216/0x240 [ 3086.489009]
> >>>> do_machine_check+0xccd/0xdb0 [ 3086.489012] ?
> >>>> native_safe_halt+0xe/0x10 [ 3086.489018] raise_exception+0x47/0xb0
> >>>> [mce_inject] [ 3086.489021] ?
> >>>> __intel_pmu_enable_all+0x47/0x80 [ 3086.489024] ?
> >>>> native_apic_msr_write+0x27/0x30 [ 3086.489025] ?
> >>>> intel_pmu_handle_irq+0x10d/0x160 [ 3086.489027]
> >>>> mce_raise_notify+0x62/0x70 [mce_inject] [ 3086.489029] ?
> >>>> raise_local+0xc0/0xc0 [mce_inject] [ 3086.489032] nmi_handle+0x63/0x110
> >>>> [ 3086.489035] default_do_nmi+0x4e/0x100
> >>>> [ 3086.489037] do_nmi+0x12c/0x190 [ 3086.489038]
> >>>> end_repeat_nmi+0x16/0x6a [ 3086.489041] RIP:
> >>>> 0010:native_safe_halt+0xe/0x10 [ 3086.489042] Code: eb bd 90 90 90 90 90
> >>>> 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 46 6b
> >>>> [ 3086.489043] RSP: 0018:ffffb545c069beb0 EFLAGS: 00000246 [
> >>>> 3086.489044] RAX: ffffffffb4a71f30 RBX: 0000000000000002
> >>>> RCX: 0000000000000000 [ 3086.489044] RDX: 0000000000000001 RSI:
> >>>> 0000000000000000 RDI: 0000000000000000 [ 3086.489045]
> >>>> RBP: 0000000000000002 R08: 0000000c7fa43cae R09: 0000000000000027 [
> >>>> 3086.489045] R10: ffffb545c067bce0 R11:
> >>>> 00000000000001ba R12: 0000000000000000 [ 3086.489046] R13:
> >>>> 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> >>>> [ 3086.489048] ? __cpuidle_text_start+0x8/0x8 [ 3086.489051] ?
> >>>> native_safe_halt+0xe/0x10 [ 3086.489052] ?
> >>>> native_safe_halt+0xe/0x10 [ 3086.489053] </NMI> [ 3086.489054]
> >>>> default_idle+0x1a/0xd0 [ 3086.489056]
> >>>> do_idle+0x1aa/0x250 [ 3086.489057] cpu_startup_entry+0x6f/0x80 [
> >>>> 3086.489060] start_secondary+0x1a7/0x200
> >>>> [ 3086.489064] secondary_startup_64+0xb7/0xc0 [ 3086.489066] Modules
> >>>> linked in: mce_inject core_module_frame(OE) kbox(O)
> >>>> sysmonitor(O) [ 3086.489089] kernel fault(0x1) notification starting on
> >>>> CPU 2 [ 3086.489094] kernel fault(0x1) notification
> >>>> finished on CPU 2
> >>>>
> >>>> crash> struct si_sm_data 0xffff8b06f08bec00
> >>>> struct si_sm_data {
> >>>> state = KCS_WAIT_WRITE,
> >>>> io = 0xffff8b06f0bd5e40,
> >>>> write_data = "\020\002A\003 Fo\241at\....
> >>>> write_pos = 18331179, <----
> >>>> write_count = -18331372,
> >>>> orig_write_count = 10,
> >>>> read_data = "\034B\000\000\001\...
> >>>> read_pos = 0,
> >>>> truncated = 0,
> >>>> error_retries = 1,
> >>>> ibf_timeout = 5000000,
> >>>> obf_timeout = 5000000,
> >>>> error0_timeout = 4294734898
> >>>> }
> >>>>
> >>>> Fixes: 82802f968bd31 ("ipmi: Don't flush messages in sender() in
> >>>> run-to-completion mode")
> >>>> Signed-off-by: Wu Bo <[email protected]>
> >>>> ---
> >>>> drivers/char/ipmi/ipmi_si_intf.c | 17 ++++++++++-------
> >>>> 1 file changed, 10 insertions(+), 7 deletions(-)
> >>>>
> >>>> diff --git a/drivers/char/ipmi/ipmi_si_intf.c
> >>>> b/drivers/char/ipmi/ipmi_si_intf.c
> >>>> index 77b8d551ae7f..961e7b0b11fe 100644
> >>>> --- a/drivers/char/ipmi/ipmi_si_intf.c
> >>>> +++ b/drivers/char/ipmi/ipmi_si_intf.c
> >>>> @@ -882,15 +882,18 @@ static void flush_messages(void *send_info) {
> >>>> struct smi_info *smi_info = send_info;
> >>>> enum si_sm_result result;
> >>>> + unsigned long flags = 0;
> >>>> + int time = 0;
> >>>>
> >>>> - /*
> >>>> - * Currently, this function is called only in run-to-completion
> >>>> - * mode. This means we are single-threaded, no need for locks.
> >>>> - */
> >>>> - result = smi_event_handler(smi_info, 0);
> >>>> - while (result != SI_SM_IDLE) {
> >>>> +restart:
> >>>> + spin_lock_irqsave(&smi_info->si_lock, flags);
> >>>> + result = smi_event_handler(smi_info, time);
> >>>> + spin_unlock_irqrestore(&smi_info->si_lock, flags);
> >>>> +
> >>>> + if (result != SI_SM_IDLE) {
> >>>> udelay(SI_SHORT_TIMEOUT_USEC);
> >>>> - result = smi_event_handler(smi_info,
> >>>> SI_SHORT_TIMEOUT_USEC);
> >>>> + time = SI_SHORT_TIMEOUT_USEC;
> >>>> + goto restart;
> >>>> }
> >>>> }
> >>>>
> >>>> --
> >>>> 2.23.0
> > .
> >
_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer