On Thu, Nov 18, 2021 at 11:04 AM Jane Chu <[email protected]> wrote:
>
> On 11/13/2021 12:47 PM, Dan Williams wrote:
> <snip>
> >>> It should know because the MCE that unmapped the page will have
> >>> communicated a "whole_page()" MCE. When dax_recovery_read() goes to
> >>> consult the badblocks list to try to read the remaining good data it
> >>> will see that every single cacheline is covered by badblocks, so
> >>> nothing to read, and no need to establish the UC mapping. So the
> >>> "Tony fix" was incomplete in retrospect. It neglected to update the
> >>> NVDIMM badblocks tracking for the whole page case.
> >>
> >> So the call in nfit_handle_mce():
> >>     nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> >>             ALIGN(mce->addr, L1_CACHE_BYTES),
> >>             L1_CACHE_BYTES);
> >> should be replaced by
> >>     nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> >>             ALIGN(mce->addr, L1_CACHE_BYTES),
> >>             (1 << MCI_MISC_ADDR_LSB(m->misc)));
> >> right?
> >
> > Yes.
> >
> >>
> >> And when dax_recovery_read() calls
> >>     badblocks_check(bb, sector, len / 512, &first_bad, &num_bad)
> >> it should always, in case of 'NP', discover that 'first_bad'
> >> is the first sector in the poisoned page, hence no need
> >> to switch to 'UC', right?
> >
> > Yes.
> >
> >>
> >> In case the 'first_bad' is in the middle of the poisoned page,
> >> that is, dax_recovery_read() could potentially read some clean
> >> sectors, is there a problem with
> >>     calling _set_memory_UC(pfn, 1),
> >>     doing the mc_safe read,
> >>     and then calling set_memory_NP(pfn, 1)
> >> ?
> >> Why do we need to call ioremap() or vmap()?
> >
> > I'm worried about concurrent operations and enabling access to threads
> > outside of the one currently in dax_recovery_read(). If a local vmap()
> > / ioremap() is used it effectively makes the access thread local.
> > There might still need to be an rwsem to allow dax_recovery_write() to
> > fix up the pfn access and synchronize with dax_recovery_read()
> > operations.
> >
> > <snip>
> >> I didn't even know that the guest could clear poison by trapping into
> >> the hypervisor with the ClearError DSM method,
> >
> > The guest can call the Clear Error DSM if the virtual BIOS provides
> > it. Whether that actually clears errors or not is up to the
> > hypervisor.
> >
> >> I thought the guest isn't privileged to do that.
> >
> > The guest does not have access to the bare metal DSM path, but the
> > hypervisor can certainly offer a translation service for that operation.
> >
> >> Would you mind elaborating on the mechanism, maybe pointing
> >> out the code, and perhaps sharing a test case if you have one?
> >
> > I don't have a test case because until Tony's fix I did not realize
> > that a virtual #MC would allow the guest to learn of poisoned
> > locations without necessarily allowing the guest to trigger actual
> > poison consumption.
> >
> > In other words, I was operating under the assumption that telling
> > guests where poison is located is potentially handing the guest a way
> > to DoS the VMM. However, Tony's fix shows that when the hypervisor
> > unmaps the guest physical page it can prevent the guest from accessing
> > it again. So it follows that it should be ok to inject a virtual #MC to
> > the guest, and unmap the guest physical range, but later allow that
> > guest physical range to be repaired if the guest asks the hypervisor
> > to repair the page.
> >
> > Tony, does this match your understanding?
> >
> >>
> >> but I'm not sure what to do about
> >>> guests that later want to use MOVDIR64B to clear errors.
> >>>
> >>
> >> Yeah, perhaps there is no way to prevent the guest from accidentally
> >> clearing errors via MOVDIR64B, given that some applications rely on
> >> MOVDIR64B for fast data movement (straight to the media). I guess in
> >> that case the consequence is a false alarm, nothing disastrous, right?
> >
> > You'll just continue to get false positive failures because the error
> > tracking will be out-of-sync with reality.
> >
> >> How about allowing the potential bad-block bookkeeping gap, and
> >> managing to close the gap at certain checkpoints? I guess one of
> >> the checkpoints might be when a page fault discovers a poisoned
> >> page?
> >
> > Not sure how that would work... it's already the case that new error
> > entries are appended to the list at #MC time; the problem is knowing
> > when to clear those stale entries. I still think that needs to be at
> > dax_recovery_write() time.
> >
>
> Thanks Dan for taking the time to elaborate in so much detail!
>
> After some amount of digging, I have a feeling that we need to take
> dax error handling in phases.
>
> Phase-1: the simplest dax_recovery_write on page granularity, along
>          with the fix to set the poisoned page to 'NP', and serialized
>          dax_recovery_write threads.
You mean special-case PAGE_SIZE overwrites when dax_direct_access()
fails, but leave out the sub-page error handling and read-around-poison
support? That makes sense to me. Incremental is good.

> Phase-2: provide dax_recovery_read support and hence shrink the error
>          recovery granularity. ioremap returns an __iomem pointer that
>          is only allowed to be referenced with helpers like readl(),
>          which do not have a mc_safe variant, and I'm not sure whether
>          there should be one. Also the synchronization between
>          dax_recovery_read and dax_recovery_write threads.

You can just use memremap() like the driver does to drop the iomem
annotation (a rough sketch appears at the end of this mail).

> Phase-3: the hypervisor error-record keeping issue; supposing there is
>          an issue, I'll need to figure out how to set up a test case.
> Phase-4: the how-to-mitigate-MOVDIR64B-false-alarm issue.

My expectation is that CXL supports MOVDIR64B error clearing without
needing to send the Clear Poison command. I think this can be phase 3;
phase 4 is the more difficult question of how / whether to coordinate
with VMM poison tracking. Right now I don't see a choice but to make it
paravirtualized.

>
> Right now, it seems to me providing the Phase-1 solution is urgent, to
> give something that customers can rely on.
>
> How does this sound to you?

Sounds good.
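
Two small nits on the nfit_handle_mce() snippet quoted above, while I'm
at it: the local in that function is 'mce' rather than 'm', and the shift
wants to be 64-bit since MCI_MISC_ADDR_LSB() can return large values. So,
purely as an untested sketch (not a posted patch) of the size change we
agreed on, deriving the bad range length from the #MC recoverable-address
LSB instead of assuming a single cacheline:

        /* drivers/acpi/nfit/mce.c, nfit_handle_mce() */
        nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
                        ALIGN(mce->addr, L1_CACHE_BYTES),
                        1ULL << MCI_MISC_ADDR_LSB(mce->misc));

With that, a whole-page poison report ends up marking every sector in the
page bad rather than just one cacheline.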

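For illustration only (again untested, not from any posted patch, and
pmem_recovery_read() plus its 'bb' / 'sector' / 'phys' parameters are
names made up for this mail), here is roughly the read-around-poison
shape I have in mind: consult the badblocks list first and bail out when
the whole page is covered (the whole_page() #MC case above), otherwise
take a private memremap() mapping and use copy_mc_to_kernel() instead of
flipping the pfn to UC:

#include <linux/badblocks.h>
#include <linux/blkdev.h>
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/uaccess.h>

/*
 * Sketch: 'sector' is assumed to already be relative to this namespace's
 * badblocks instance, and the caller is assumed to hold whatever lock
 * (e.g. the rwsem mentioned above) excludes a concurrent
 * dax_recovery_write().
 */
static size_t pmem_recovery_read(struct badblocks *bb, sector_t sector,
                                 phys_addr_t phys, void *dst, size_t len)
{
        sector_t first_bad;
        int num_bad;
        void *src;
        size_t rem;

        /* Whole page poisoned (whole_page() #MC): nothing left to read */
        if (badblocks_check(bb, sector, PAGE_SIZE >> SECTOR_SHIFT,
                            &first_bad, &num_bad) &&
            first_bad <= sector &&
            first_bad + num_bad >= sector + (PAGE_SIZE >> SECTOR_SHIFT))
                return 0;

        /* private mapping: no __iomem accessors, no set_memory_uc() on the pfn */
        src = memremap(phys, len, MEMREMAP_WB);
        if (!src)
                return 0;

        /* copy_mc_to_kernel() returns the number of bytes NOT copied */
        rem = copy_mc_to_kernel(dst, src, len);
        memunmap(src);

        return len - rem;
}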