[SeaBIOS] Re: [PATCH 1/1] stacks: fix missing NMI_DISABLE_BIT on call32_post()

Kevin O'Connor Thu, 27 May 2021 15:49:42 -0700

On Thu, May 27, 2021 at 03:43:51PM -0300, Heitor Alves de Siqueira wrote:
> On Thu, May 27, 2021 at 11:44 AM Kevin O'Connor <ke...@koconnor.net> wrote:
> >
> > The purpose of this code is to restore the NMI_DISABLE_BIT to what it
> > was prior to call32_prep().  If something calls the bios without the
> > NMI_DISABLE_BIT set, it's this code that makes sure SeaBIOS returns to
> > that calling code with NMI_DISABLE_BIT also not set.
> >
> > If you've run into some bug, I think it would help if you could
> > further describe that bug.
> >
> > Cheers,
> > -Kevin
> 
> Hi Kevin,
> 
> Thank you for the explanation! Sorry, it seems I misunderstood this part of 
> the
> code as I thought every access to PORT_CMOS_DATA should be performed with NMIs
> disabled. Maybe giving some of the background on the issue will help me
> understand this a bit better, indeed!
> 
> This has been originally reported by some Ubuntu users running specific VMs on
> older versions of seabios, where they would occasionally see KVM emulation
> failures and VMs going into "PAUSED" state (and being unable to resume 
> without a
> full VM reboot afterwards). Inspecting the ASM dumps [0] on those VMs revealed
> that the last actions performed were accesses to PORT_CMOS_DATA, and those
> seemed to be caused by rtc_mask(). Since these were on old versions of 
> seabios,
> they looked like a result of our builds missing patch 3156b71a535e (rtc: 
> Disable
> NMI in rtc_mask()) [1], which we tried to address initially.
> 
> After providing new packages of seabios with the rtc_mask() patch, some users
> noticed that a few VMs still continued to present similar symptoms, but with a
> different ASM dump this time. This was also seen on "newer" versions of our
> seabios packages based on upstream 1.10.2, which should already include the
> rtc_mask() patches by default (git describe --contains reports this patch 
> being
> introduced with rel-1.9.0~47). These new failed instances lead us to believe
> that call32_post() was the culprit, since the trapping instruction was still 
> the
> same access to PORT_CMOS_DATA which was "unguarded" by an NMI_DISABLE_BIT.


It's not necessary to disable NMI (non-maskable interrupts) to access
the CMOS bits.  It's just that IO port 0x71 (PORT_CMOS_DATA) is used
for both the CMOS bits and for the NMI disable flag.  It's two totally
separate functions merged into the same IO register (it seems that was
considered great fun back in the '80s).

>We
> then provided another package for testing, implementing the patch I've 
> proposed
> originally in this thread, and our users reported no further KVM emulation
> failures.
> 
> Unfortunately, I'm not entirely sure what originally causes the KVM emulation
> failures, as I've been unable to reproduce these issues in test VMs. Our users
> reported that the commands below can trigger the emulation failures, but I 
> have
> no details on the exact platform those are running on or any details of the
> specific file system in use:
> 
> root@vsfo-2[]:/root> date; fsfreeze --freeze /flash
> root@vsfo-2[]:/root> date; dd if=/dev/zero of=/flash/test bs=1 count=0 seek=1G
> 
> In any case, I'm somewhat puzzled by CMOS port accesses causing KVM emulation
> failures. Could it be that an NMI comes in between outb/inb and we end up 
> trying
> to read from a nonsense CMOS index?

No, there's no invalid values in the CMOS index.  Likely what's
occuring is that an NMI is run in a state that it is not expecting and
it corrupts the cpu/stack.  KVM probably only observes the problem (or
reports the problem) when the NMI returns.

> If I understand it correctly, my proposed patch effectively turns off NMIs
> unconditionally which sounds like it should cause horrible breakage.

Linux, almost assuredly, restores NMIs to their appropriate state when
it regains control of the processor.  So, the temporary disabling of
NMIs likely goes unnoticed.

>Could you
> help me understand why that doesn't happen with the rtc_read/write/mask
> functions in src/hw/rtc.c as well?

SeaBIOS needs to disable NMIs when in 32bit mode.  It's not related to
the CMOS bits - it's just that we can't permit an NMI when in 32bit
mode.  So, any time the code writes to IO port 0x71 (PORT_CMOS_DATA)
when in 32bit mode it needs to make sure it doesn't mistakenly enable
NMIs.  Thus the rtc_xxx() functions make sure the NMI bit is always
set.

The reason that NMIs are disabled in 32bit mode is because old 3rd
party irq handlers are expecting to be invoked in 16bit mode and will
not function correctly if they are invoked when the cpu is in 32bit
mode.  Thus, we can't transition to 32bit mode without first disabling
all irqs (both regular irqs and NMIs).

In contrast, the call32_prep() and call32_post() code is invoked when
the CPU is in 16bit mode.  There, at least in theory, it should be
possible for NMIs to run.

It would seem, though, that an NMI is running and causing havoc.

> Hopefully the above helps contextualize the issue a bit better, Kevin. 
> Apologies
> for asking so many questions, but would you have any suggestions on how we 
> could
> try to get more information on the seabios side of this?

Without being able to reproduce the issue it would be very difficult
to diagnose.  If it can be reproduced, I'd be curious if removing the
call to "ENTRY handle_02" in src/romlayout.S changes the behavior.

-Kevin
_______________________________________________
SeaBIOS mailing list -- seabios@seabios.org
To unsubscribe send an email to seabios-le...@seabios.org

[SeaBIOS] Re: [PATCH 1/1] stacks: fix missing NMI_DISABLE_BIT on call32_post()

Reply via email to