Hi, Vishal,

Vishal Verma <[email protected]> writes:

> The mce handler for 'nfit' devices is called for memory errors on a
> Non-Volatile DIMM, and adds the error location to a 'badblocks' list.
> This list is used by the various NVDIMM drivers to avoid consuming known
> poison locations during IO.
>
> The mce handler gets called for both corrected and uncorrectable errors.

Sorry for necroposting.  I thought the point of the CEC was to make sure
that the other registered decoders only ever saw uncorrected errors.
How do we end up getting called with a correctable error?

Thanks,
Jeff

> Until now, both kinds of errors have been added to the badblocks list.
> However, corrected memory errors indicate that the problem has already
> been fixed by hardware, and the resulting interrupt is merely a
> notification to Linux. As far as future accesses to that location are
> concerned, it is perfectly fine to use, and thus doesn't need to be
> included in the above badblocks list.
>
> Add a check in the nfit mce handler to filter out corrected mce events,
> and only process uncorrectable errors.
>
> Reported-by: Omar Avelar <[email protected]>
> Fixes: 6839a6d96f4e ("nfit: do an ARS scrub on hitting a latent media error")
> Cc: [email protected]
> Cc: Dan Williams <[email protected]>
> Cc: Tony Luck <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Signed-off-by: Vishal Verma <[email protected]>
> ---
>  arch/x86/include/asm/mce.h       | 1 +
>  arch/x86/kernel/cpu/mcheck/mce.c | 3 ++-
>  drivers/acpi/nfit/mce.c          | 4 ++--
>  3 files changed, 5 insertions(+), 3 deletions(-)
>
> v3: Unchanged from v2
>
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 3a17107594c8..3111b3cee2ee 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -216,6 +216,7 @@ static inline int umc_normaddr_to_sysaddr(u64 norm_addr, 
> u16 nid, u8 umc, u64 *s
>  
>  int mce_available(struct cpuinfo_x86 *c);
>  bool mce_is_memory_error(struct mce *m);
> +bool mce_is_correctable(struct mce *m);
>  
>  DECLARE_PER_CPU(unsigned, mce_exception_count);
>  DECLARE_PER_CPU(unsigned, mce_poll_count);
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c 
> b/arch/x86/kernel/cpu/mcheck/mce.c
> index 953b3ce92dcc..27015948bc41 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -534,7 +534,7 @@ bool mce_is_memory_error(struct mce *m)
>  }
>  EXPORT_SYMBOL_GPL(mce_is_memory_error);
>  
> -static bool mce_is_correctable(struct mce *m)
> +bool mce_is_correctable(struct mce *m)
>  {
>       if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
>               return false;
> @@ -544,6 +544,7 @@ static bool mce_is_correctable(struct mce *m)
>  
>       return true;
>  }
> +EXPORT_SYMBOL_GPL(mce_is_correctable);
>  
>  static bool cec_add_mce(struct mce *m)
>  {
> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> index e9626bf6ca29..7a51707f87e9 100644
> --- a/drivers/acpi/nfit/mce.c
> +++ b/drivers/acpi/nfit/mce.c
> @@ -25,8 +25,8 @@ static int nfit_handle_mce(struct notifier_block *nb, 
> unsigned long val,
>       struct acpi_nfit_desc *acpi_desc;
>       struct nfit_spa *nfit_spa;
>  
> -     /* We only care about memory errors */
> -     if (!mce_is_memory_error(mce))
> +     /* We only care about uncorrectable memory errors */
> +     if (!mce_is_memory_error(mce) || mce_is_correctable(mce))
>               return NOTIFY_DONE;
>  
>       /*
_______________________________________________
Linux-nvdimm mailing list
[email protected]
https://lists.01.org/mailman/listinfo/linux-nvdimm

Reply via email to