On 11/24/2021 4:16 PM, Dan Williams wrote:
> On Thu, Nov 18, 2021 at 11:04 AM Jane Chu <[email protected]> wrote:
>>
>> On 11/13/2021 12:47 PM, Dan Williams wrote:
>> <snip>
<snip>
>>
>> Thanks Dan for taking the time elaborating so much details!
>>
>> After some amount of digging, I have a feel that we need to take
>> dax error handling in phases.
>>
>> Phase-1: the simplest dax_recovery_write on page granularity, along
>>          with fix to set poisoned page to 'NP', serialize
>>          dax_recovery_write threads.
>
> You mean special case PAGE_SIZE overwrites when dax_direct_access()
> fails, but leave out the sub-page error handling and
> read-around-poison support?

Yes.

>
> That makes sense to me. Incremental is good.

Thanks!

>
>> Phase-2: provide dax_recovery_read support and hence shrink the error
>>          recovery granularity. As ioremap returns __iomem pointer
>>          that is only allowed to be referenced with helpers like
>>          readl() which do not have a mc_safe variant, and I'm
>>          not sure whether there should be. Also the synchronization
>>          between dax_recovery_read and dax_recovery_write threads.
>
> You can just use memremap() like the driver does to drop the iomem annotation.

Okay, will investigate in phase 2.

>
>> Phase-3: the hypervisor error-record keeping issue, suppose there is
>>          an issue, I'll need to figure out how to setup a test case.
>> Phase-4: the how-to-mitigate-MOVDIR64B-false-alarm issue.
>
> My expectation is that CXL supports MOVDIR64B error clearing without
> needing to send the Clear Poison command. I think this can be phase3,
> phase4 is the more difficult question about how / if to coordinate
> with VMM poison tracking. Right now I don't see a choice but to make
> it paravirtualized.
>
>>
>> Right now, it seems to me providing Phase-1 solution is urgent, to give
>> something that customers can rely on.
>>
>> How does this sound to you?
>
> Sounds good.
>

Thanks!
-jane
