On 11/13/2021 12:47 PM, Dan Williams wrote:
<snip>
>>> It should know because the MCE that unmapped the page will have
>>> communicated a "whole_page()" MCE. When dax_recovery_read() goes to
>>> consult the badblocks list to try to read the remaining good data it
>>> will see that every single cacheline is covered by badblocks, so
>>> nothing to read, and no need to establish the UC mapping. So the
>>> "Tony fix" was incomplete in retrospect. It neglected to update the
>>> NVDIMM badblocks tracking for the whole page case.
>>
>> So the call in nfit_handle_mce():
>>	nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
>>			ALIGN(mce->addr, L1_CACHE_BYTES),
>>			L1_CACHE_BYTES);
>> should be replaced by
>>	nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
>>			ALIGN(mce->addr, L1_CACHE_BYTES),
>>			(1 << MCI_MISC_ADDR_LSB(mce->misc)));
>> right?
>
> Yes.
>
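
Got it. Just to spell out the length math for myself (a sketch, keeping
the existing ALIGN of the address for now): MCI_MISC_ADDR_LSB() is the
log2 of the blast radius reported by the #MC, so

	u64 len = 1ULL << MCI_MISC_ADDR_LSB(mce->misc);

	/*
	 * len == 64 for a single poisoned cacheline; in the whole_page()
	 * case MCI_MISC_ADDR_LSB() >= PAGE_SHIFT, so len >= 4096 and all
	 * eight 512-byte sectors of the page land in the badblocks list.
	 */
	nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
			ALIGN(mce->addr, L1_CACHE_BYTES), len);
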
>>
>> And when dax_recovery_read() calls
>>	badblocks_check(bb, sector, len / 512, &first_bad, &num_bad)
>> then in the 'NP' case it should always discover that 'first_bad'
>> is the first sector of the poisoned page, hence there is no need
>> to switch to 'UC', right?
>
> Yes.
>
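
In code this is roughly what I picture inside dax_recovery_read(),
given the pfn of the page being read (a sketch only --
dax_recovery_read() doesn't exist yet and the badblocks plumbing /
sector translation is hand-waved):

	sector_t sector = PFN_PHYS(pfn) >> SECTOR_SHIFT; /* minus namespace offset */
	sector_t first_bad;
	int num_bad;

	if (badblocks_check(bb, sector, PAGE_SIZE >> SECTOR_SHIFT,
			    &first_bad, &num_bad) &&
	    first_bad == sector && num_bad >= (PAGE_SIZE >> SECTOR_SHIFT))
		return -EIO;	/* 'NP' whole-page case: nothing clean to read */

	/* otherwise some sectors are clean, read those via a local UC mapping */
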
>>
>> In case 'first_bad' is in the middle of the poisoned page,
>> that is, dax_recovery_read() could potentially read some clean
>> sectors, is there a problem with
>>	calling _set_memory_UC(pfn, 1),
>>	doing the mc_safe read,
>>	and then calling set_memory_NP(pfn, 1)
>> ?
>> Why do we need to call ioremap() or vmap()?
>
> I'm worried about concurrent operations and enabling access to threads
> outside of the one currently in dax_recovery_read(). If a local vmap()
> / ioremap() is used it effectively makes the access thread local.
> There might still need to be an rwsem to allow dax_recovery_write() to
> fixup the pfn access and synchronize with dax_recovery_read()
> operations.
>
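
To make sure I follow, something like the below is what you mean? All
of this is a sketch -- the signatures and the rwsem are placeholders of
mine, and copy_mc_to_kernel() returns the number of bytes it could not
copy:

	static DECLARE_RWSEM(dax_recovery_lock);

	ssize_t dax_recovery_read(struct page *page, size_t off, void *dst,
				  size_t len)
	{
		void *kaddr;
		unsigned long rem;

		down_read(&dax_recovery_lock);
		/* thread-local UC mapping, so no other thread gains access */
		kaddr = vmap(&page, 1, VM_MAP, pgprot_noncached(PAGE_KERNEL));
		if (!kaddr) {
			up_read(&dax_recovery_lock);
			return -ENOMEM;
		}
		rem = copy_mc_to_kernel(dst, kaddr + off, len);
		vunmap(kaddr);
		up_read(&dax_recovery_lock);
		return len - rem;
	}

	ssize_t dax_recovery_write(struct page *page, size_t off,
				   const void *src, size_t len)
	{
		down_write(&dax_recovery_lock);
		/* clear poison, fix up the pfn access, update badblocks, write */
		up_write(&dax_recovery_lock);
		return len;
	}
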
<snip>
>> I didn't even know that a guest could clear poison by trapping into
>> the hypervisor with the ClearError DSM method,
>
> The guest can call the Clear Error DSM if the virtual BIOS provides
> it. Whether that actually clears errors or not is up to the
> hypervisor.
>
>> I thought the guest wasn't privileged to do that.
>
> The guest does not have access to the bare metal DSM path, but the
> hypervisor can certainly offer translation service for that operation.
>
>> Would you mind elaborating on the mechanism, maybe pointing out the
>> code, and perhaps sharing a test case if you have one?
>
> I don't have a test case because until Tony's fix I did not realize
> that a virtual #MC would allow the guest to learn of poisoned
> locations without necessarily allowing the guest to trigger actual
> poison consumption.
>
> In other words I was operating under the assumption that telling
> guests where poison is located is potentially handing the guest a way
> to DoS the VMM. However, Tony's fix shows that when the hypervisor
> unmaps the guest physical page it can prevent the guest from accessing
> it again. So it follows that it should be ok to inject virtual #MC to
> the guest, and unmap the guest physical range, but later allow that
> guest physical range to be repaired if the guest asks the hypervisor
> to repair the page.
>
> Tony, does this match your understanding?
>
>>
>>> but I'm not sure what to do about
>>> guests that later want to use MOVDIR64B to clear errors.
>>>
>>
>> Yeah, perhaps there is no way to prevent a guest from accidentally
>> clearing errors via MOVDIR64B, given that some applications rely on
>> MOVDIR64B for fast data movement (straight to the media). I guess in
>> that case the consequence is a false alarm, nothing disastrous, right?
>
> You'll just continue to get false positive failures because the error
> tracking will be out-of-sync with reality.
>
>> How about allowing a potential bad-block bookkeeping gap, and
>> managing to close the gap at certain checkpoints? I guess one of
>> the checkpoints might be when a page fault discovers a poisoned
>> page?
>
> Not sure how that would work... it's already the case that new error
> entries are appended to the list at #MC time; the problem is knowing
> when to clear those stale entries. I still think that needs to be at
> dax_recovery_write() time.
>
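
Understood. So at dax_recovery_write() time the clear gets paired with
the bookkeeping, roughly what pmem_clear_poison() already does for the
bio path (a sketch; translation of phys into the badblocks sector space
is elided):

	long cleared = nvdimm_clear_poison(dev, phys, len);

	if (cleared > 0 && cleared / 512)
		badblocks_clear(bb, phys >> SECTOR_SHIFT, cleared / 512);
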
Thanks, Dan, for taking the time to elaborate in such detail!
After some digging, my feeling is that we need to tackle dax error
handling in phases.
Phase-1: the simplest dax_recovery_write() at page granularity, along
         with the fix to set the poisoned page to 'NP' and
         serialization of dax_recovery_write() threads (rough sketch
         after the phase list).
Phase-2: provide dax_recovery_read() support and hence shrink the error
         recovery granularity. One open question: ioremap() returns an
         __iomem pointer that is only supposed to be dereferenced via
         helpers like readl(), which have no mc_safe variant, and I'm
         not sure whether there should be one. This phase also needs
         the synchronization between dax_recovery_read() and
         dax_recovery_write() threads.
Phase-3: the hypervisor error-record-keeping issue; assuming there is
         an issue, I'll need to figure out how to set up a test case.
Phase-4: the how-to-mitigate-MOVDIR64B-false-alarm issue.
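
For Phase-1, here is roughly the shape I have in mind, reusing the
clear/bookkeeping pairing above and a plain mutex for now (the rwsem
would come with Phase-2's dax_recovery_read()). A sketch only: apart
from nvdimm_clear_poison(), badblocks_clear() and set_memory_wb() the
names are placeholders of mine, and the badblocks sector translation is
again hand-waved:

	static DEFINE_MUTEX(dax_recovery_mutex);

	/* page-granularity recovery write, serialized against other writers */
	int dax_recovery_write_page(struct device *dev, struct badblocks *bb,
				    unsigned long pfn, const void *buf)
	{
		void *kaddr = page_to_virt(pfn_to_page(pfn));
		int rc = -EIO;

		mutex_lock(&dax_recovery_mutex);
		if (nvdimm_clear_poison(dev, PFN_PHYS(pfn), PAGE_SIZE) == PAGE_SIZE) {
			/* undo the 'NP' that the #MC handler applied */
			set_memory_wb((unsigned long)kaddr, 1);
			badblocks_clear(bb, PFN_PHYS(pfn) >> SECTOR_SHIFT,
					PAGE_SIZE >> SECTOR_SHIFT);
			memcpy(kaddr, buf, PAGE_SIZE);
			rc = 0;
		}
		mutex_unlock(&dax_recovery_mutex);
		return rc;
	}
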
Right now, it seems to me that providing the Phase-1 solution is the
urgent part, to give customers something they can rely on.
How does this sound to you?
thanks,
-jane