On 2020/9/14 20:13, 河合英宏 / KAWAI,HIDEHIRO wrote:
> Hi, Wu
>
>> From: Wu Bo <wub...@huawei.com>
>>
>> In the virtual machine, use mce_inject to inject errors into the system.
>> After mce-inject injects an uncorrectable error, there is a probability
>> that the virtual machine is not reset immediately, but hangs for more than
>> 3000 seconds, and the write_data array is accessed out of bounds.
>>
>> The real reason is that smi_event_handler() lacks lock protection in the
>> multi-threaded scenario, which causes write_pos to exceed the size of the
>> write_data array.
>
> Thank you for the fix, but I wonder why this multi-threaded scenario happens.
> If my understanding is correct, only one CPU can run panic routines, and
> this means only one CPU calls flush_messages. Did I miss some call path?
>
> Best regards,
>
> Hidehiro Kawai
> Hitachi, Ltd. Research & Development Group
Hi,
You're right, only one CPU can run panic routines.
Sorry, I missed another call path. It is taken when a panic and an
interrupt occur at the same time:
CPU0                                  CPU3
-> nmi_handle                         -> handle_irq
-> mce_raise_notify                   -> handle_fasteoi_irq
-> panic_event                        -> handle_irq_event
-> ipmi_panic_request_and_wait        -> handle_irq_event_percpu
-> flush_messages                     -> ipmi_si_irq_handler
-> smi_event_handler                  -> smi_event_handler
-> kcs_event()                        -> kcs_event()
As a result, smi_event_handler() is called on both CPUs simultaneously.
In kcs_event():
    case KCS_WAIT_WRITE:
        ...
        if (kcs->write_count == 1) {
            write_cmd(kcs, KCS_WRITE_END);
            kcs->state = KCS_WAIT_WRITE_END;
        } else {
            write_next_byte(kcs);
        }
        ...
The interrupt call path is protected by a lock, but the panic call path is
not. When both paths run kcs_event() concurrently, the
kcs->write_count == 1 check may never hold, so write_next_byte() is called
repeatedly and the write_data array is accessed out of bounds.
crash> bt -a
PID: 0 TASK: ffffffff96812780 CPU: 0 COMMAND: "swapper/0"
#0 [fffffe0000007bc0] panic at ffffffff956b2d3e
#1 [fffffe0000007c48] wait_for_panic at ffffffff95637ca2
#2 [fffffe0000007c58] mce_timed_out at ffffffff95637f5d
#3 [fffffe0000007c70] do_machine_check at ffffffff95638db4
#4 [fffffe0000007d80] raise_exception at ffffffffc05b6117 [mce_inject]
#5 [fffffe0000007e48] mce_raise_notify at ffffffffc05b6a92 [mce_inject]
#6 [fffffe0000007e58] nmi_handle at ffffffff95621c73
#7 [fffffe0000007eb0] default_do_nmi at ffffffff9562213e
#8 [fffffe0000007ed0] do_nmi at ffffffff9562231c
#9 [fffffe0000007ef0] end_repeat_nmi at ffffffff960016b4
[exception RIP: native_safe_halt+14]
RIP: ffffffff95e7223e RSP: ffffffff96803e90 RFLAGS: 00000246
RAX: ffffffff95e71f30 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R8: 00000018cf7c068a R9: 0000000000000001
R10: ffffa222c0b17b88 R11: 0000000001e2f8fb R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#10 [ffffffff96803e90] native_safe_halt at ffffffff95e7223e
#11 [ffffffff96803e90] default_idle at ffffffff95e71f4a
#12 [ffffffff96803eb0] do_idle at ffffffff956e959a
#13 [ffffffff96803ef0] cpu_startup_entry at ffffffff956e981f
#14 [ffffffff96803f10] start_kernel at ffffffff96d9b206
#15 [ffffffff96803f50] secondary_startup_64 at ffffffff956000e7
PID: 0 TASK: ffff8b06c77dc740 CPU: 3 COMMAND: "swapper/3"
[exception RIP: port_outb+17]
RIP: ffffffffc035f1a1 RSP: ffff8b06fad83e90 RFLAGS: 00000002
RAX: 0000000000000000 RBX: ffff8b06f08bec00 RCX: 0000000000000010
RDX: 0000000000000ca2 RSI: 0000000000000000 RDI: ffff8b06f0bd5e40
RBP: 0000000000000001 R8: ffff8b06fad80080 R9: ffff8b06fad84000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8b06fad83f54 R14: 0000000000000000 R15: 0000000000000000
CS: 0010 SS: 0018
#0 [ffff8b06fad83e90] kcs_event at ffffffffc035c2c7 [ipmi_si]
#1 [ffff8b06fad83eb0] smi_event_handler at ffffffffc035aa3f [ipmi_si]
#2 [ffff8b06fad83ee8] ipmi_si_irq_handler at ffffffffc035b0cc [ipmi_si]
#3 [ffff8b06fad83f08] __handle_irq_event_percpu at ffffffff9571dfc0
#4 [ffff8b06fad83f48] handle_irq_event_percpu at ffffffff9571e140
#5 [ffff8b06fad83f70] handle_irq_event at ffffffff9571e1b6
#6 [ffff8b06fad83f90] handle_fasteoi_irq at ffffffff95721b42
#7 [ffff8b06fad83fb0] handle_irq at ffffffff956209e8
#8 [ffff8b06fad83fc0] do_IRQ at ffffffff96001ee9
--- <IRQ stack> ---
#9 [fffffe0000088b98] ret_from_intr at ffffffff96000a8f
[exception RIP: delay_tsc+52]
RIP: ffffffff95e5fb64 RSP: fffffe0000088c48 RFLAGS: 00000287
RAX: 000023fb5edf4b14 RBX: 00000000003e0451 RCX: 000023fb5edf4798
RDX: 000000000000037c RSI: 0000000000000003 RDI: 000000000000095b
RBP: fffffe0000088cc0 R8: 0000000000000004 R9: fffffe0000088c5c
R10: ffffffff96a05ae0 R11: 0000000000000000 R12: fffffe0000088cb0
R13: 0000000000000001 R14: fffffe0000088ef8 R15: ffffffff9666a2f0
ORIG_RAX: ffffffffffffffd9 CS: 0010 SS: 0018
#10 [fffffe0000088c48] wait_for_panic at ffffffff95637c6c
#11 [fffffe0000088c58] mce_timed_out at ffffffff95637f5d
#12 [fffffe0000088c70] do_machine_check at ffffffff95638db4
#13 [fffffe0000088d80] raise_exception at ffffffffc05b6117 [mce_inject]
#14 [fffffe0000088e48] mce_raise_notify at ffffffffc05b6a92 [mce_inject]
#15 [fffffe0000088e58] nmi_handle at ffffffff95621c73
#16 [fffffe0000088eb0] default_do_nmi at ffffffff9562213e
#17 [fffffe0000088ed0] do_nmi at ffffffff9562231c
#18 [fffffe0000088ef0] end_repeat_nmi at ffffffff960016b4
[exception RIP: native_safe_halt+14]
RIP: ffffffff95e7223e RSP: ffffa222c06a3eb0 RFLAGS: 00000246
RAX: ffffffff95e71f30 RBX: 0000000000000003 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000083 RDI: 0000000000000000
RBP: 0000000000000003 R8: 00000018cf7cd9a0 R9: 0000000000000001
R10: 0000000000000400 R11: 00000000000003fb R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#19 [ffffa222c06a3eb0] native_safe_halt at ffffffff95e7223e
#20 [ffffa222c06a3eb0] default_idle at ffffffff95e71f4a
#21 [ffffa222c06a3ed0] do_idle at ffffffff956e959a
#22 [ffffa222c06a3f10] cpu_startup_entry at ffffffff956e981f
#23 [ffffa222c06a3f30] start_secondary at ffffffff9564e697
#24 [ffffa222c06a3f50] secondary_startup_64 at ffffffff956000e7
_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer