On 2020/9/16 23:38, Corey Minyard wrote:
On Wed, Sep 16, 2020 at 10:52:31PM +0800, Wu Bo wrote:
On 2020/9/14 22:29, Corey Minyard wrote:
On Mon, Sep 14, 2020 at 09:33:07PM +0800, Wu Bo wrote:
On 2020/9/14 20:13, 河合英宏 / KAWAI,HIDEHIRO wrote:
Hi, Wu

From: Wu Bo <wub...@huawei.com>
In the virtual machine, Use mce_inject to inject errors into the system.
After mce-inject injects an uncorrectable error, there is a probability
that the virtual machine is not reset immediately,
but hangs for more than
3000 seconds. And the write_data array is accessed out of bounds.

The real reason is that smi_event_handler lack of lock protection in the
multi-threaded scenario, which causes write_pos
to exceed the size of the write_data array.

Thank you for the fix, but I wonder why this multi-threaded scenario happens.
If my understanding is correct, only one CPU can run panic routines, and
this means only one CPU calls flush_messages.  Did I miss some call path?

Best regards,

Hidehiro Kawai
Hitachi, Ltd. Research & Development Group

Hi,

You're right, only one CPU can run panic routines.
Sorry, I missed another call path. when the panic occurred and interruption
occurred at the same time.

CPU0                                CPU3
   ->nmi_handle                           handle_irq
     -> mce_raise_notify                   -> handle_fasteoi_irq
       -> panic_event                        -> handle_irq_event
         -> ipmi_panic_request_and_wait      -> handle_irq_event_percpu
            ->flush_messages                   -> ipmi_si_irq_handler
                -> smi_event_handler              -> smi_event_handler
                ->kcs_event()                       ->kcs_event()

There is a simultaneous call to the smi_event_handler() function.

With your patch, this will result in spinning waiting for the lock
forever. So that's not what is happening here.

The panic event should stop the other processors before flush_messages()
happens, right?  If so, it is possible that the IPMI state machine is
running, but this should result in the state machine being reset.

-corey


Hi, corey

Thanks for your reply. MCE appears on all CPUs,
Currently all CPUs are in the NMI interrupt context.
cpu0 is the first to seize the opportunity to run panic routines, and panic
event should stop the other processors before flush_messages(). but cpu1,
cpu2 and cpu3 has already in NMI interrupt context, So the Second NMI
interrupt will not be processed again by cpu1, cpu2 and cpu3.
at this time, cpu1,cpu2 and cpu3 did not stoped.

cpu1, cpu2 and cpu3 are waitting for cpu0 to finish the panic process. (do
wait_for_panic()), the irq is enabled in the wait_for_panic() function.

ipmi IRQ occurs on the cpu3, and the cpu0 is doing the panic,
they have the opportunity to call the smi_event_handler() function
concurrently.

Ah, I understand now.  This sounds more like a bug in the x86 exception
handling.  My understanding is that panic() runs with only one
operational CPU in the system.  I think there is other code (like
printk) that also relies on this assumption.  See bust_spinlocks() and
oops_in_progress.

I can't find anything that says this, but it's obvious that both panic()
and mce_panic() make special efforts to ensure this.

The trouble is that if you add the spinlocks in the IPMI panic handler,
if an exception occurs while a CPU holds that spinlock, the system will
hang forever, not just for 3000 seconds.  The trouble is, I don't know
of a good solution besides making the panic handing work like I think it
should.

Thanks,

-corey


Hi, corey

It is guessed that there may be a cpu executing the panic process, and other cpus are not in the state of stopping the core, not just the mce_panic scenario.

So is it possible to use spin_trylock_irqsave in the panic routine ?

Best regards,

Wu Bo



_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer

Reply via email to