On 11/24/2021 4:16 PM, Dan Williams wrote:
> On Thu, Nov 18, 2021 at 11:04 AM Jane Chu <[email protected]> wrote:
>>
>> On 11/13/2021 12:47 PM, Dan Williams wrote:
>> <snip>
<snip>
>>
>> Thanks Dan for taking the time to elaborate in so much detail!
>>
>> After some amount of digging, I have a feeling that we need to take
>> dax error handling in phases.
>>
>> Phase-1: the simplest dax_recovery_write on page granularity, along
>>            with a fix to set the poisoned page to 'NP' and to
>>            serialize dax_recovery_write threads.
> 
> You mean special case PAGE_SIZE overwrites when dax_direct_access()
> fails, but leave out the sub-page error handling and
> read-around-poison support?

Yes.
> 
> That makes sense to me. Incremental is good.

Thanks!
> 
>> Phase-2: provide dax_recovery_read support and hence shrink the error
>>            recovery granularity.  One obstacle is that ioremap returns
>>            an __iomem pointer, which may only be accessed through
>>            helpers like readl() that have no mc_safe variant, and I'm
>>            not sure whether there should be one.  Also to be worked out
>>            is the synchronization between dax_recovery_read and
>>            dax_recovery_write threads.
> 
> You can just use memremap() like the driver does to drop the iomem annotation.

Okay, will investigate in phase 2.

> 
>> Phase-3: the hypervisor error-record keeping issue.  Supposing there is
>>            an issue, I'll need to figure out how to set up a test case.
>> Phase-4: the how-to-mitigate-MOVDIR64B-false-alarm issue.
> 
> My expectation is that CXL supports MOVDIR64B error clearing without
> needing to send the Clear Poison command. I think this can be phase 3;
> phase 4 is the more difficult question about how / if to coordinate
> with VMM poison tracking. Right now I don't see a choice but to make
> it paravirtualized.
> 
>>
>> Right now, it seems to me that providing a Phase-1 solution is urgent,
>> to give customers something they can rely on.
>>
>> How does this sound to you?
> 
> Sounds good.
> 

Thanks!
-jane
