On Thu, Nov 18, 2021 at 11:04 AM Jane Chu <[email protected]> wrote:
>
> On 11/13/2021 12:47 PM, Dan Williams wrote:
> <snip>
> >>> It should know because the MCE that unmapped the page will have
> >>> communicated a "whole_page()" MCE. When dax_recovery_read() goes to
> >>> consult the badblocks list to try to read the remaining good data it
> >>> will see that every single cacheline is covered by badblocks, so
> >>> nothing to read, and no need to establish the UC mapping. So the
> >>> "Tony fix" was incomplete in retrospect. It neglected to update the
> >>> NVDIMM badblocks tracking for the whole page case.
> >>
> >> So the call in nfit_handle_mce():
> >>     nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> >>                   ALIGN(mce->addr, L1_CACHE_BYTES),
> >>                   L1_CACHE_BYTES);
> >> should be replaced by
> >>     nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> >>                   ALIGN(mce->addr, L1_CACHE_BYTES),
> >>                   (1 << MCI_MISC_ADDR_LSB(mce->misc)));
> >> right?
> >
> > Yes.
> >
> >>
> >> And when dax_recovery_read() calls
> >>     badblocks_check(bb, sector, len / 512, &first_bad, &num_bad)
> >> it should always, in the 'NP' case, discover that 'first_bad'
> >> is the first sector in the poisoned page, hence no need
> >> to switch to 'UC', right?
> >
> > Yes.
> >
> >>
> >> In case 'first_bad' is in the middle of the poisoned page,
> >> that is, dax_recovery_read() could potentially read some clean
> >> sectors, is there a problem with
> >>     calling _set_memory_UC(pfn, 1),
> >>     doing the mc_safe read,
> >>     and then calling set_memory_NP(pfn, 1)
> >> ?
> >> Why do we need to call ioremap() or vmap()?
> >
> > I'm worried about concurrent operations and enabling access to threads
> > outside of the one currently in dax_recovery_read(). If a local vmap()
> > / ioremap() is used it effectively makes the access thread local.
> > There might still need to be an rwsem to allow dax_recovery_write() to
> > fixup the pfn access and synchronize with dax_recovery_read()
> > operations.
> >
>
> <snip>
> >> I didn't even know that the guest could clear poison by trapping to
> >> the hypervisor with the ClearError DSM method,
> >
> > The guest can call the Clear Error DSM if the virtual BIOS provides
> > it. Whether that actually clears errors or not is up to the
> > hypervisor.
> >
> >> I thought the guest isn't privileged to do that.
> >
> > The guest does not have access to the bare metal DSM path, but the
> > hypervisor can certainly offer translation service for that operation.
> >
> >> Would you mind elaborating on the mechanism and maybe pointing
> >> out the code, and perhaps sharing a test case if you have one?
> >
> > I don't have a test case because until Tony's fix I did not realize
> > that a virtual #MC would allow the guest to learn of poisoned
> > locations without necessarily allowing the guest to trigger actual
> > poison consumption.
> >
> > In other words I was operating under the assumption that telling
> > guests where poison is located is potentially handing the guest a way
> > to DoS the VMM. However, Tony's fix shows that when the hypervisor
> > unmaps the guest physical page it can prevent the guest from accessing
> > it again. So it follows that it should be ok to inject virtual #MC to
> > the guest, and unmap the guest physical range, but later allow that
> > guest physical range to be repaired if the guest asks the hypervisor
> > to repair the page.
> >
> > Tony, does this match your understanding?
> >
> >>
> >>> but I'm not sure what to do about
> >>> guests that later want to use MOVDIR64B to clear errors.
> >>>
> >>
> >> Yeah, perhaps there is no way to prevent the guest from accidentally
> >> clearing errors via MOVDIR64B, given that some applications rely on
> >> MOVDIR64B for fast data movement (straight to the media). I guess in
> >> that case the consequence is a false alarm, nothing disastrous, right?
> >
> > You'll just continue to get false positive failures because the error
> > tracking will be out-of-sync with reality.
> >
> >> How about allowing the potential bad-block bookkeeping gap, and
> >> managing to close the gap at certain checkpoints? I guess one of
> >> the checkpoints might be when a page fault discovers a poisoned
> >> page?
> >
> > Not sure how that would work... it's already the case that new error
> > entries are appended to the list at #MC time, the problem is knowing
> > when to clear those stale entries. I still think that needs to be at
> > dax_recovery_write() time.
> >
>
> Thanks Dan for taking the time to elaborate in so much detail!
>
> After some amount of digging, I have a feeling that we need to tackle
> dax error handling in phases.
>
> Phase-1: the simplest dax_recovery_write on page granularity, along
>           with a fix to set the poisoned page to 'NP', and
>           serialization of dax_recovery_write threads.

You mean special case PAGE_SIZE overwrites when dax_direct_access()
fails, but leave out the sub-page error handling and
read-around-poison support?

That makes sense to me. Incremental is good.

> Phase-2: provide dax_recovery_read support and hence shrink the error
>           recovery granularity.  One open issue: ioremap returns an
>           __iomem pointer that may only be referenced through helpers
>           like readl(), which have no mc_safe variant, and I'm not
>           sure whether there should be one.  Another: synchronization
>           between dax_recovery_read and dax_recovery_write threads.

You can just use memremap() like the driver does to drop the iomem annotation.

> Phase-3: the hypervisor error-record keeping issue, suppose there is
>           an issue, I'll need to figure out how to setup a test case.
> Phase-4: the how-to-mitigate-MOVDIR64B-false-alarm issue.

My expectation is that CXL supports MOVDIR64B error clearing without
needing to send the Clear Poison command. I think this can be phase-3;
phase-4 is the more difficult question of how / whether to coordinate
with VMM poison tracking. Right now I don't see any choice but to make
it paravirtualized.

>
> Right now, it seems to me that providing the Phase-1 solution is
> urgent, to give customers something they can rely on.
>
> How does this sound to you?

Sounds good.
