Re: [PATCH v5 0/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-06-26 Thread Jane Chu

On 6/24/2023 11:25 PM, Markus Elfring wrote:

Change from v4:

…

I suggest to omit the cover letter for a single patch.

Will any patch series evolve for your proposed changes?



No. The thought was to put descriptions unsuitable for the commit
message into the cover letter.


thanks,
jane


Regards,
Markus




[PATCH v5 1/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-06-15 Thread Jane Chu
When multiple processes mmap() a dax file and, at some point, one
process issues a 'load' and consumes a hwpoison, that process receives
a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb set to the
poison scope. Soon after, when any other process issues a 'load' to
the poisoned page (which has been unmapped from the kernel side by
memory_failure), it receives a SIGBUS with si_code = BUS_ADRERR and
without a valid si_lsb.

This is confusing to users, differs from a page fault due to poison
in regular RAM, and loses some helpful information.

Channel the dax backend driver's poison detection to the filesystem
so that, instead of reporting VM_FAULT_SIGBUS, it can report
VM_FAULT_HWPOISON.

If user-level block I/O syscalls fail due to poison, the errno will
be converted to EIO to maintain block API consistency.
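
For reference, the conversion lives in a small helper added to
include/linux/dax.h; a minimal sketch, matching the version reviewed
later in this thread (the actual patch also carries a comment
explaining where the helper is deliberately *not* used):

static inline int dax_mem2blk_err(int err)
{
	/* file I/O paths report -EIO; fault paths keep -EHWPOISON */
	return (err == -EHWPOISON) ? -EIO : err;
}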

Signed-off-by: Jane Chu 
---
 drivers/dax/super.c  |  5 -
 drivers/nvdimm/pmem.c|  2 +-
 drivers/s390/block/dcssblk.c |  3 ++-
 fs/dax.c | 11 ++-
 fs/fuse/virtio_fs.c  |  3 ++-
 include/linux/dax.h  | 13 +
 include/linux/mm.h   |  2 ++
 7 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4c4728a36e4..0da9232ea175 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -203,6 +203,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t 
pgoff, void *addr,
 int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
size_t nr_pages)
 {
+   int ret;
+
if (!dax_alive(dax_dev))
return -ENXIO;
/*
@@ -213,7 +215,8 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t 
pgoff,
if (nr_pages != 1)
return -EIO;
 
-   return dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages);
+   ret = dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages);
+   return dax_mem2blk_err(ret);
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ceea55f621cc..46e094e56159 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
long actual_nr;
 
if (mode != DAX_RECOVERY_WRITE)
-   return -EIO;
+   return -EHWPOISON;
 
/*
 * Set the recovery stride is set to kernel page size because
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index c09f2e053bf8..ee47ac520cd4 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -54,7 +54,8 @@ static int dcssblk_dax_zero_page_range(struct dax_device 
*dax_dev,
rc = dax_direct_access(dax_dev, pgoff, nr_pages, DAX_ACCESS,
&kaddr, NULL);
if (rc < 0)
-   return rc;
+   return dax_mem2blk_err(rc);
+
memset(kaddr, 0, nr_pages << PAGE_SHIFT);
dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT);
return 0;
diff --git a/fs/dax.c b/fs/dax.c
index 2ababb89918d..a26eb5abfdc0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1148,7 +1148,7 @@ static int dax_iomap_copy_around(loff_t pos, uint64_t 
length, size_t align_size,
if (!zero_edge) {
ret = dax_iomap_direct_access(srcmap, pos, size, &kaddr, NULL);
if (ret)
-   return ret;
+   return dax_mem2blk_err(ret);
}
 
if (copy_all) {
@@ -1310,7 +1310,7 @@ static s64 dax_unshare_iter(struct iomap_iter *iter)
 
 out_unlock:
dax_read_unlock(id);
-   return ret;
+   return dax_mem2blk_err(ret);
 }
 
 int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
@@ -1342,7 +1342,8 @@ static int dax_memzero(struct iomap_iter *iter, loff_t 
pos, size_t size)
ret = dax_direct_access(iomap->dax_dev, pgoff, 1, DAX_ACCESS, &kaddr,
NULL);
if (ret < 0)
-   return ret;
+   return dax_mem2blk_err(ret);
+
memset(kaddr + offset, 0, size);
if (iomap->flags & IOMAP_F_SHARED)
ret = dax_iomap_copy_around(pos, size, PAGE_SIZE, srcmap,
@@ -1498,7 +1499,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
 
map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
DAX_ACCESS, &kaddr, NULL);
-   if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+   if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
map_len = dax_direct_access(dax_dev, pgoff,
PHYS_PFN(size), DAX_RECOVERY_WRITE,
&kaddr, NULL);
@@ -1506,7 +1507,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter
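
(The last hunk is truncated by the archive. For readability, here is a
sketch of the retry logic the fs/dax.c hunks above implement,
reconstructed from their context - simplified, not the verbatim patch:)

	map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
			DAX_ACCESS, &kaddr, NULL);
	/* a poisoned range now fails DAX_ACCESS with -EHWPOISON;
	 * only writes retry in DAX_RECOVERY_WRITE mode */
	if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
		map_len = dax_direct_access(dax_dev, pgoff,
				PHYS_PFN(size), DAX_RECOVERY_WRITE,
				&kaddr, NULL);
		if (map_len > 0)
			recovery = true;
	}
	if (map_len < 0) {
		/* keep -EHWPOISON out of read(2)/write(2) */
		ret = dax_mem2blk_err(map_len);
		break;
	}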

[PATCH v5 0/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-06-15 Thread Jane Chu
Change from v4:
Add comments describing when and why dax_mem2blk_err() is used.
Suggested by Dan.

Change from v3:
Prevent leaking EHWPOISON to user level block IO calls such as
zero_page_range, and truncate.  Suggested by Dan.

Change from v2:
Convert EHWPOISON to EIO to prevent EHWPOISON errno from leaking
out to block read(2). Suggested by Matthew.

Jane Chu (1):
  dax: enable dax fault handler to report VM_FAULT_HWPOISON

 drivers/dax/super.c  |  5 -
 drivers/nvdimm/pmem.c|  2 +-
 drivers/s390/block/dcssblk.c |  3 ++-
 fs/dax.c | 11 ++-
 fs/fuse/virtio_fs.c  |  3 ++-
 include/linux/dax.h  | 13 +
 include/linux/mm.h   |  2 ++
 7 files changed, 30 insertions(+), 9 deletions(-)

-- 
2.18.4




Re: [PATCH v4 1/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-06-14 Thread Jane Chu

On 6/8/2023 8:16 PM, Dan Williams wrote:
[..]
  
+static inline int dax_mem2blk_err(int err)

+{
+   return (err == -EHWPOISON) ? -EIO : err;
+}


I think it is worth a comment on this function to indicate where this
helper is *not* used. I.e. it's easy to grep for where the error code is
converted for file I/O errors, but the subtlety of when the
dax_direct_acces() return value is not translated for fault handling
deserves a comment when we wonder why this function exists in a few
years time.


Good point!  v5 coming up next.

thanks!
-jane



Re: [PATCH v4 1/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-05-30 Thread Jane Chu

Ping... Is there any further concern?

-jane

On 5/8/2023 10:47 PM, Jane Chu wrote:

When multiple processes mmap() a dax file and, at some point, one
process issues a 'load' and consumes a hwpoison, that process receives
a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb set to the
poison scope. Soon after, when any other process issues a 'load' to
the poisoned page (which has been unmapped from the kernel side by
memory_failure), it receives a SIGBUS with si_code = BUS_ADRERR and
without a valid si_lsb.

This is confusing to users, differs from a page fault due to poison
in regular RAM, and loses some helpful information.

Channel the dax backend driver's poison detection to the filesystem
so that, instead of reporting VM_FAULT_SIGBUS, it can report
VM_FAULT_HWPOISON.

If user-level block I/O syscalls fail due to poison, the errno will
be converted to EIO to maintain block API consistency.

Signed-off-by: Jane Chu 
---
  drivers/dax/super.c  |  5 -
  drivers/nvdimm/pmem.c|  2 +-
  drivers/s390/block/dcssblk.c |  3 ++-
  fs/dax.c | 11 ++-
  fs/fuse/virtio_fs.c  |  3 ++-
  include/linux/dax.h  |  5 +
  include/linux/mm.h   |  2 ++
  7 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4c4728a36e4..0da9232ea175 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -203,6 +203,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t 
pgoff, void *addr,
  int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
size_t nr_pages)
  {
+   int ret;
+
if (!dax_alive(dax_dev))
return -ENXIO;
/*
@@ -213,7 +215,8 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t 
pgoff,
if (nr_pages != 1)
return -EIO;
  
-	return dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages);

+   ret = dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages);
+   return dax_mem2blk_err(ret);
  }
  EXPORT_SYMBOL_GPL(dax_zero_page_range);
  
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c

index ceea55f621cc..46e094e56159 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
long actual_nr;
  
  		if (mode != DAX_RECOVERY_WRITE)

-   return -EIO;
+   return -EHWPOISON;
  
  		/*

 * Set the recovery stride is set to kernel page size because
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index c09f2e053bf8..ee47ac520cd4 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -54,7 +54,8 @@ static int dcssblk_dax_zero_page_range(struct dax_device 
*dax_dev,
rc = dax_direct_access(dax_dev, pgoff, nr_pages, DAX_ACCESS,
&kaddr, NULL);
if (rc < 0)
-   return rc;
+   return dax_mem2blk_err(rc);
+
memset(kaddr, 0, nr_pages << PAGE_SHIFT);
dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT);
return 0;
diff --git a/fs/dax.c b/fs/dax.c
index 2ababb89918d..a26eb5abfdc0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1148,7 +1148,7 @@ static int dax_iomap_copy_around(loff_t pos, uint64_t 
length, size_t align_size,
if (!zero_edge) {
ret = dax_iomap_direct_access(srcmap, pos, size, &kaddr, NULL);
if (ret)
-   return ret;
+   return dax_mem2blk_err(ret);
}
  
  	if (copy_all) {

@@ -1310,7 +1310,7 @@ static s64 dax_unshare_iter(struct iomap_iter *iter)
  
  out_unlock:

dax_read_unlock(id);
-   return ret;
+   return dax_mem2blk_err(ret);
  }
  
  int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,

@@ -1342,7 +1342,8 @@ static int dax_memzero(struct iomap_iter *iter, loff_t 
pos, size_t size)
ret = dax_direct_access(iomap->dax_dev, pgoff, 1, DAX_ACCESS, &kaddr,
NULL);
if (ret < 0)
-   return ret;
+   return dax_mem2blk_err(ret);
+
memset(kaddr + offset, 0, size);
if (iomap->flags & IOMAP_F_SHARED)
ret = dax_iomap_copy_around(pos, size, PAGE_SIZE, srcmap,
@@ -1498,7 +1499,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
  
  		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),

DAX_ACCESS, &kaddr, NULL);
-   if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+   if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
map_len = dax_direct_access(dax_dev, pgoff,
PHYS_PFN(size), DAX_RECOVERY_WRITE,
&kaddr, NULL);
@@ -1506,7 +1507,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter

[PATCH v4 1/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-05-08 Thread Jane Chu
When multiple processes mmap() a dax file and, at some point, one
process issues a 'load' and consumes a hwpoison, that process receives
a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb set to the
poison scope. Soon after, when any other process issues a 'load' to
the poisoned page (which has been unmapped from the kernel side by
memory_failure), it receives a SIGBUS with si_code = BUS_ADRERR and
without a valid si_lsb.

This is confusing to users, differs from a page fault due to poison
in regular RAM, and loses some helpful information.

Channel the dax backend driver's poison detection to the filesystem
so that, instead of reporting VM_FAULT_SIGBUS, it can report
VM_FAULT_HWPOISON.

If user-level block I/O syscalls fail due to poison, the errno will
be converted to EIO to maintain block API consistency.

Signed-off-by: Jane Chu 
---
 drivers/dax/super.c  |  5 -
 drivers/nvdimm/pmem.c|  2 +-
 drivers/s390/block/dcssblk.c |  3 ++-
 fs/dax.c | 11 ++-
 fs/fuse/virtio_fs.c  |  3 ++-
 include/linux/dax.h  |  5 +
 include/linux/mm.h   |  2 ++
 7 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4c4728a36e4..0da9232ea175 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -203,6 +203,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t 
pgoff, void *addr,
 int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
size_t nr_pages)
 {
+   int ret;
+
if (!dax_alive(dax_dev))
return -ENXIO;
/*
@@ -213,7 +215,8 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t 
pgoff,
if (nr_pages != 1)
return -EIO;
 
-   return dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages);
+   ret = dax_dev->ops->zero_page_range(dax_dev, pgoff, nr_pages);
+   return dax_mem2blk_err(ret);
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ceea55f621cc..46e094e56159 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
long actual_nr;
 
if (mode != DAX_RECOVERY_WRITE)
-   return -EIO;
+   return -EHWPOISON;
 
/*
 * Set the recovery stride is set to kernel page size because
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index c09f2e053bf8..ee47ac520cd4 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -54,7 +54,8 @@ static int dcssblk_dax_zero_page_range(struct dax_device 
*dax_dev,
rc = dax_direct_access(dax_dev, pgoff, nr_pages, DAX_ACCESS,
&kaddr, NULL);
if (rc < 0)
-   return rc;
+   return dax_mem2blk_err(rc);
+
memset(kaddr, 0, nr_pages << PAGE_SHIFT);
dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT);
return 0;
diff --git a/fs/dax.c b/fs/dax.c
index 2ababb89918d..a26eb5abfdc0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1148,7 +1148,7 @@ static int dax_iomap_copy_around(loff_t pos, uint64_t 
length, size_t align_size,
if (!zero_edge) {
ret = dax_iomap_direct_access(srcmap, pos, size, &kaddr, NULL);
if (ret)
-   return ret;
+   return dax_mem2blk_err(ret);
}
 
if (copy_all) {
@@ -1310,7 +1310,7 @@ static s64 dax_unshare_iter(struct iomap_iter *iter)
 
 out_unlock:
dax_read_unlock(id);
-   return ret;
+   return dax_mem2blk_err(ret);
 }
 
 int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
@@ -1342,7 +1342,8 @@ static int dax_memzero(struct iomap_iter *iter, loff_t 
pos, size_t size)
ret = dax_direct_access(iomap->dax_dev, pgoff, 1, DAX_ACCESS, &kaddr,
NULL);
if (ret < 0)
-   return ret;
+   return dax_mem2blk_err(ret);
+
memset(kaddr + offset, 0, size);
if (iomap->flags & IOMAP_F_SHARED)
ret = dax_iomap_copy_around(pos, size, PAGE_SIZE, srcmap,
@@ -1498,7 +1499,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
 
map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
DAX_ACCESS, &kaddr, NULL);
-   if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+   if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
map_len = dax_direct_access(dax_dev, pgoff,
PHYS_PFN(size), DAX_RECOVERY_WRITE,
&kaddr, NULL);
@@ -1506,7 +1507,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter

[PATCH v4 0/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-05-08 Thread Jane Chu
Change from v3:
Prevent leaking EHWPOISON to user level block IO calls such as
zero_page_range, and truncate.  Suggested by Dan.

Change from v2:
Convert EHWPOISON to EIO to prevent EHWPOISON errno from leaking
out to block read(2). Suggested by Matthew.

Jane Chu (1):
  dax: enable dax fault handler to report VM_FAULT_HWPOISON

 drivers/dax/super.c  |  5 -
 drivers/nvdimm/pmem.c|  2 +-
 drivers/s390/block/dcssblk.c |  3 ++-
 fs/dax.c | 11 ++-
 fs/fuse/virtio_fs.c  |  3 ++-
 include/linux/dax.h  |  5 +
 include/linux/mm.h   |  2 ++
 7 files changed, 22 insertions(+), 9 deletions(-)

-- 
2.18.4




Re: [PATCH v3] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-05-08 Thread Jane Chu

On 5/4/2023 7:32 PM, Dan Williams wrote:

Jane Chu wrote:

When multiple processes mmap() a dax file and, at some point, one
process issues a 'load' and consumes a hwpoison, that process receives
a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb set to the
poison scope. Soon after, when any other process issues a 'load' to
the poisoned page (which has been unmapped from the kernel side by
memory_failure), it receives a SIGBUS with si_code = BUS_ADRERR and
without a valid si_lsb.

This is confusing to users, differs from a page fault due to poison
in regular RAM, and loses some helpful information.

Channel the dax backend driver's poison detection to the filesystem
so that, instead of reporting VM_FAULT_SIGBUS, it can report
VM_FAULT_HWPOISON.


I do think it is interesting that this will start returning SIGBUS with
BUS_MCEERR_AR for stores whereas it is only signalled for loads in the
direct consumption path, but I can't think of a scenario where that
would confuse existing software.


Yes indeed, nice catch, thank you!




Change from v2:
   Convert -EHWPOISON to -EIO to prevent EHWPOISON errno from leaking
out to block read(2) - suggested by Matthew.


For next time these kind of changelog notes belong...


Signed-off-by: Jane Chu 
---


...here after the "---".


I'll move the change log to a cover letter.




  drivers/nvdimm/pmem.c | 2 +-
  fs/dax.c  | 4 ++--
  include/linux/mm.h| 2 ++
  3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ceea55f621cc..46e094e56159 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
long actual_nr;
  
  		if (mode != DAX_RECOVERY_WRITE)

-   return -EIO;
+   return -EHWPOISON;
  
  		/*

 * Set the recovery stride is set to kernel page size because
diff --git a/fs/dax.c b/fs/dax.c
index 2ababb89918d..18f9598951f1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1498,7 +1498,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
  
  		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),

DAX_ACCESS, &kaddr, NULL);
-   if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+   if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
map_len = dax_direct_access(dax_dev, pgoff,
PHYS_PFN(size), DAX_RECOVERY_WRITE,
&kaddr, NULL);
@@ -1506,7 +1506,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
recovery = true;
}
if (map_len < 0) {
-   ret = map_len;
+   ret = (map_len == -EHWPOISON) ? -EIO : map_len;


This fixup leaves out several other places where EHWPOISON could leak as
the errno for read(2)/write(2). I think I want to see something like
this:

diff --git a/fs/dax.c b/fs/dax.c
index 2ababb89918d..ec17f9977bcb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1077,14 +1077,13 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
  }
  EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
  
-static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,

-   size_t size, void **kaddr, pfn_t *pfnp)
+static int __dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
+size_t size, void **kaddr, pfn_t *pfnp)
  {
 pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
-   int id, rc = 0;
 long length;
+   int rc = 0;
  
-   id = dax_read_lock();

 length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
DAX_ACCESS, kaddr, pfnp);
 if (length < 0) {
@@ -1113,6 +1112,36 @@ static int dax_iomap_direct_access(const struct iomap 
*iomap, loff_t pos,
 return rc;
  }
  
+static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,

+  size_t size, void **kaddr, pfn_t *pfnp)
+{
+   int id, rc;
+
+   id = dax_read_lock();
+   rc = __dax_iomap_direct_access(iomap, pos, size, kaddr, pfnp);
+   dax_read_unlock(id);
+
+   /* don't leak a memory access error code to I/O syscalls */
+   if (rc == -EHWPOISON)
+   return -EIO;
+   return rc;
+}
+
+static int dax_fault_direct_access(const struct iomap *iomap, loff_t pos,
+  size_t size, void **kaddr, pfn_t *pfnp)
+{
+   int id, rc;
+
+   id = dax_read_lock();
+   rc = __dax_iomap_direct_access(iomap, pos, size, kaddr, pfnp);
+   dax_read_unlock(id);
+
+   /* -EHWPOISON return ok */
+   return rc;
+}
+
  /**
   * dax_iomap_copy_around - Prepare for an unaligned write to a shared/cow page
   * b

[PATCH v3] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-05-04 Thread Jane Chu
When multiple processes mmap() a dax file and, at some point, one
process issues a 'load' and consumes a hwpoison, that process receives
a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb set to the
poison scope. Soon after, when any other process issues a 'load' to
the poisoned page (which has been unmapped from the kernel side by
memory_failure), it receives a SIGBUS with si_code = BUS_ADRERR and
without a valid si_lsb.

This is confusing to users, differs from a page fault due to poison
in regular RAM, and loses some helpful information.

Channel the dax backend driver's poison detection to the filesystem
so that, instead of reporting VM_FAULT_SIGBUS, it can report
VM_FAULT_HWPOISON.

Change from v2:
  Convert -EHWPOISON to -EIO to prevent EHWPOISON errno from leaking
out to block read(2) - suggested by Matthew.

Signed-off-by: Jane Chu 
---
 drivers/nvdimm/pmem.c | 2 +-
 fs/dax.c  | 4 ++--
 include/linux/mm.h| 2 ++
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ceea55f621cc..46e094e56159 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
long actual_nr;
 
if (mode != DAX_RECOVERY_WRITE)
-   return -EIO;
+   return -EHWPOISON;
 
/*
 * Set the recovery stride is set to kernel page size because
diff --git a/fs/dax.c b/fs/dax.c
index 2ababb89918d..18f9598951f1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1498,7 +1498,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
 
map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
DAX_ACCESS, &kaddr, NULL);
-   if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+   if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
map_len = dax_direct_access(dax_dev, pgoff,
PHYS_PFN(size), DAX_RECOVERY_WRITE,
&kaddr, NULL);
@@ -1506,7 +1506,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
recovery = true;
}
if (map_len < 0) {
-   ret = map_len;
+   ret = (map_len == -EHWPOISON) ? -EIO : map_len;
break;
}
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..e4c974587659 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3217,6 +3217,8 @@ static inline vm_fault_t vmf_error(int err)
 {
if (err == -ENOMEM)
return VM_FAULT_OOM;
+   else if (err == -EHWPOISON)
+   return VM_FAULT_HWPOISON;
return VM_FAULT_SIGBUS;
 }
 
-- 
2.18.4




Re: [PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-27 Thread Jane Chu

On 4/27/2023 4:48 PM, Matthew Wilcox wrote:

On Thu, Apr 27, 2023 at 04:36:58PM -0700, Jane Chu wrote:

This change results in EHWPOISON leaking to userspace in the case of
read(2), that's not a return code that block I/O applications have ever
had to contend with before. Just as badblocks cause EIO to be returned,
so should poisoned cachelines for pmem.


The read(2) man page (https://man.archlinux.org/man/read.2) says
"On error, -1 is returned, and errno is set to indicate the error. In this
case, it is left unspecified whether the file position (if any) changes."

If read(2) users haven't dealt with EHWPOISON before, they may discover that
with a pmem-backed dax file, it's possible.


I don't think they should.  While syscalls are allowed to return errnos
other than the ones listed in POSIX, I don't think this is a worthwhile
difference.  We should be abstracting from the user that this is pmem
rather than spinning rust or nand.  So we should convert the EHWPOISON
to EIO as Dan suggests.


Got it, I'll add errno conversion in the respin.

thanks,
-jane



Re: [PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-27 Thread Jane Chu

Hi, Dan,

On 4/27/2023 2:36 PM, Dan Williams wrote:

Jane Chu wrote:

When the dax fault handler fails to provision the fault page due to
hwpoison, it returns VM_FAULT_SIGBUS, which leads to a SIGBUS delivered
to userspace with .si_code BUS_ADRERR.  Channel the dax backend
driver's detection of hwpoison to the filesystem to provide the
precise reason for the fault.


It's not yet clear to me by this description why this is an improvement
or will not cause other confusion. In this case the reason for the
SIGBUS is because the driver wants to prevent access to poison, not that
the CPU consumed poison. Can you clarify what is lost by *not* making
this change?


Elsewhere, when hwpoison is detected by the page fault handler and its
helpers as the direct cause of the failure, VM_FAULT_HWPOISON or
VM_FAULT_HWPOISON_LARGE is flagged to ensure an accurate SIGBUS payload
is produced: for example wp_page_copy() in the COW case, do_swap_page()
from handle_pte_fault(), and hugetlb_fault() in the hugetlb page fault
case, where the huge fault size is indicated in the payload.

But dax fault has been an exception in that the SIGBUS payload
indicates neither poison nor fault size.  I don't see why it should be;
I recall an internal user expressing confusion regarding the different
SIGBUS payloads.
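
To make the payload difference concrete, here is a minimal,
hypothetical userspace sketch (not part of the patch) of a SIGBUS
handler that distinguishes the two cases; fprintf() is used for
brevity only and is not async-signal-safe:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	if (si->si_code == BUS_MCEERR_AR)
		/* poison consumed: scope is (1 << si_addr_lsb) bytes */
		fprintf(stderr, "hwpoison at %p, lsb %d\n",
			si->si_addr, si->si_addr_lsb);
	else if (si->si_code == BUS_ADRERR)
		/* plain bus error: no poison scope available */
		fprintf(stderr, "bus error at %p\n", si->si_addr);
	exit(1);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = sigbus_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* ... mmap() a dax file and load from a poisoned page ... */
	return 0;
}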






Signed-off-by: Jane Chu 
---
  drivers/nvdimm/pmem.c | 2 +-
  fs/dax.c  | 2 +-
  include/linux/mm.h| 2 ++
  3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ceea55f621cc..46e094e56159 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
long actual_nr;
  
  		if (mode != DAX_RECOVERY_WRITE)

-   return -EIO;
+   return -EHWPOISON;
  
  		/*

 * Set the recovery stride is set to kernel page size because
diff --git a/fs/dax.c b/fs/dax.c
index 3e457a16c7d1..c93191cd4802 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1456,7 +1456,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
  
  		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),

DAX_ACCESS, &kaddr, NULL);
-   if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+   if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
map_len = dax_direct_access(dax_dev, pgoff,
PHYS_PFN(size), DAX_RECOVERY_WRITE,
&kaddr, NULL);


This change results in EHWPOISON leaking to userspace in the case of
read(2), that's not a return code that block I/O applications have ever
had to contend with before. Just as badblocks cause EIO to be returned,
so should poisoned cachelines for pmem.


The read(2) man page (https://man.archlinux.org/man/read.2) says
"On error, -1 is returned, and errno is set to indicate the error. In 
this case, it is left unspecified whether the file position (if any) 
changes."


If users haven't dealt with EHWPOISON before, they may discover that 
with a pmem-backed dax file, it's possible.


Thanks!
-jane







Re: [PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-18 Thread Jane Chu

Ping, any comment?

thanks,
-jane

On 4/6/2023 4:01 PM, Jane Chu wrote:

When dax fault handler fails to provision the fault page due to
hwpoison, it returns VM_FAULT_SIGBUS which lead to a sigbus delivered
to userspace with .si_code BUS_ADRERR.  Channel dax backend driver's
detection on hwpoison to the filesystem to provide the precise reason
for the fault.

Signed-off-by: Jane Chu 
---
  drivers/nvdimm/pmem.c | 2 +-
  fs/dax.c  | 2 +-
  include/linux/mm.h| 2 ++
  3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ceea55f621cc..46e094e56159 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
long actual_nr;
  
  		if (mode != DAX_RECOVERY_WRITE)

-   return -EIO;
+   return -EHWPOISON;
  
  		/*

 * Set the recovery stride is set to kernel page size because
diff --git a/fs/dax.c b/fs/dax.c
index 3e457a16c7d1..c93191cd4802 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1456,7 +1456,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
  
  		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),

DAX_ACCESS, &kaddr, NULL);
-   if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+   if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
map_len = dax_direct_access(dax_dev, pgoff,
PHYS_PFN(size), DAX_RECOVERY_WRITE,
&kaddr, NULL);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..e4c974587659 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3217,6 +3217,8 @@ static inline vm_fault_t vmf_error(int err)
  {
if (err == -ENOMEM)
return VM_FAULT_OOM;
+   else if (err == -EHWPOISON)
+   return VM_FAULT_HWPOISON;
return VM_FAULT_SIGBUS;
  }
  




[PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-06 Thread Jane Chu
When the dax fault handler fails to provision the fault page due to
hwpoison, it returns VM_FAULT_SIGBUS, which leads to a SIGBUS delivered
to userspace with .si_code BUS_ADRERR.  Channel the dax backend
driver's detection of hwpoison to the filesystem to provide the
precise reason for the fault.

Signed-off-by: Jane Chu 
---
 drivers/nvdimm/pmem.c | 2 +-
 fs/dax.c  | 2 +-
 include/linux/mm.h| 2 ++
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ceea55f621cc..46e094e56159 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
long actual_nr;
 
if (mode != DAX_RECOVERY_WRITE)
-   return -EIO;
+   return -EHWPOISON;
 
/*
 * Set the recovery stride is set to kernel page size because
diff --git a/fs/dax.c b/fs/dax.c
index 3e457a16c7d1..c93191cd4802 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1456,7 +1456,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
 
map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
DAX_ACCESS, &kaddr, NULL);
-   if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+   if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
map_len = dax_direct_access(dax_dev, pgoff,
PHYS_PFN(size), DAX_RECOVERY_WRITE,
&kaddr, NULL);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..e4c974587659 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3217,6 +3217,8 @@ static inline vm_fault_t vmf_error(int err)
 {
if (err == -ENOMEM)
return VM_FAULT_OOM;
+   else if (err == -EHWPOISON)
+   return VM_FAULT_HWPOISON;
return VM_FAULT_SIGBUS;
 }
 
-- 
2.18.4




Re: [PATCH] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-06 Thread Jane Chu

On 4/6/2023 12:32 PM, Matthew Wilcox wrote:

On Thu, Apr 06, 2023 at 11:55:56AM -0600, Jane Chu wrote:

  static vm_fault_t dax_fault_return(int error)
  {
if (error == 0)
return VM_FAULT_NOPAGE;
-   return vmf_error(error);
+   else if (error == -ENOMEM)
+   return VM_FAULT_OOM;
+   else if (error == -EHWPOISON)
+   return VM_FAULT_HWPOISON;
+   return VM_FAULT_SIGBUS;
  }


Why would we want to handle it here instead of changing vmf_error()?


I think it's related to the comment about the corrupted range of a
hwpoison-caused fault - something we no longer need to worry about.

I will move the change to vmf_error() in a respin.

Thanks!
-jane



[PATCH] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-06 Thread Jane Chu
When the dax fault handler fails to provision the fault page due to
hwpoison, it returns VM_FAULT_SIGBUS, which leads to a SIGBUS delivered
to userspace with .si_code BUS_ADRERR.  Channel the dax backend
driver's detection of hwpoison to the filesystem to provide the
precise reason for the fault.

Signed-off-by: Jane Chu 
---
 drivers/nvdimm/pmem.c |  2 +-
 fs/dax.c  | 14 --
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ceea55f621cc..46e094e56159 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
long actual_nr;
 
if (mode != DAX_RECOVERY_WRITE)
-   return -EIO;
+   return -EHWPOISON;
 
/*
 * Set the recovery stride is set to kernel page size because
diff --git a/fs/dax.c b/fs/dax.c
index 3e457a16c7d1..3f22686abc88 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1456,7 +1456,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
 
map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
DAX_ACCESS, &kaddr, NULL);
-   if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+   if (map_len == -EHWPOISON && iov_iter_rw(iter) == WRITE) {
map_len = dax_direct_access(dax_dev, pgoff,
PHYS_PFN(size), DAX_RECOVERY_WRITE,
&kaddr, NULL);
@@ -1550,11 +1550,21 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
+/*
+ * Concerning hwpoison-triggered page faults: dax is THP, so a
+ * pmd-level fault handler will fall back (VM_FAULT_FALLBACK)
+ * before giving up; hence return VM_FAULT_HWPOISON, which implies
+ * the corrupted range is PAGE_SIZE.
+ */
 static vm_fault_t dax_fault_return(int error)
 {
if (error == 0)
return VM_FAULT_NOPAGE;
-   return vmf_error(error);
+   else if (error == -ENOMEM)
+   return VM_FAULT_OOM;
+   else if (error == -EHWPOISON)
+   return VM_FAULT_HWPOISON;
+   return VM_FAULT_SIGBUS;
 }
 
 /*
-- 
2.18.4




[PATCH v8] x86/mce: retrieve poison range from hardware

2022-08-26 Thread Jane Chu
When memory poison consumption machine checks fire,
mce-notifier-handlers like nfit_handle_mce() record the impacted
physical address range which is reported by the hardware in the
MCi_MISC MSR. The error information includes data about blast radius,
i.e. how many cachelines did the hardware determine are impacted.
A recent change, commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to
determine poison granularity"), updated nfit_handle_mce() to stop hard
coding the blast radius value of 1 cacheline, and instead rely on the
blast radius reported in 'struct mce' which can be up to 4K (64 cachelines).

It turns out that apei_mce_report_mem_error() had a similar problem in
that it hard coded a blast radius of 4K rather than reading the blast
radius from the error information. Fix apei_mce_report_mem_error() to
convey the proper poison granularity.
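
As an illustration (not part of this patch), a hypothetical
consumer-side sketch of how the granularity encoded into m.misc is
decoded, where MCI_MISC_ADDR_LSB() extracts the low 6 bits of the
MCi_MISC value:

	/* hypothetical mce-notifier sketch, e.g. in nfit_handle_mce() */
	int lsb = MCI_MISC_ADDR_LSB(mce->misc);	/* granularity as a shift */
	u64 len = 1ULL << lsb;			/* blast radius in bytes */
	u64 start = mce->addr & ~(len - 1);	/* aligned start of badrange */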

Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com

Reviewed-by: Dan Williams 
Reviewed-by: Ingo Molnar 
Signed-off-by: Jane Chu 
---
 arch/x86/kernel/cpu/mce/apei.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 717192915f28..8ed341714686 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -29,15 +29,26 @@
 void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 {
struct mce m;
+   int lsb;
 
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;
 
+   /*
+* Even if the ->validation_bits are set for address mask,
+* to be extra safe, check and reject an error radius '0',
+* and fall back to the default page size.
+*/
+   if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
+   lsb = find_first_bit((void *)&mem_err->physical_addr_mask, PAGE_SHIFT);
+   else
+   lsb = PAGE_SHIFT;
+
mce_setup(&m);
m.bank = -1;
/* Fake a memory read error with unknown channel */
m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
MCI_STATUS_MISCV | 0x9f;
-   m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
+   m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
 
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;
-- 
2.18.4




Re: [PATCH v7] x86/mce: retrieve poison range from hardware

2022-08-26 Thread Jane Chu
On 8/26/2022 11:09 AM, Borislav Petkov wrote:
> On Fri, Aug 26, 2022 at 10:54:31AM -0700, Dan Williams wrote:
>> How about:
>>
>> ---
>>
>> When memory poison consumption machine checks fire,
>> mce-notifier-handlers like nfit_handle_mce() record the impacted
>> physical address range.
> 
> ... which is reported by the hardware in the MCi_MISC MSR.
> 
>> The error information includes data about blast
>> radius, i.e. how many cachelines did the hardware determine are
>> impacted.
> 
> Yap, nice.
> 
>> A recent change, commit 7917f9cdb503 ("acpi/nfit: rely on
>> mce->misc to determine poison granularity"), updated nfit_handle_mce()
>> to stop hard coding the blast radius value of 1 cacheline, and instead
>> rely on the blast radius reported in 'struct mce' which can be up to 4K
>> (64 cachelines).
>>
>> It turns out that apei_mce_report_mem_error() had a similar problem in
>> that it hard coded a blast radius of 4K rather than checking the blast
> 
> s/checking/reading/
> 
>> radius in the error information. Fix apei_mce_report_mem_error() to
> 
> s/in/from/
> 
>> convey the proper poison granularity.
>>
>> ---
> 
> Yap, that's a lot better.
> 
> Thanks!


Got it and points taken.  Thank you both, Boris and Dan.

v8 coming up.

thanks,
-jane




Re: [PATCH v7] x86/mce: retrieve poison range from hardware

2022-08-25 Thread Jane Chu
On 8/23/2022 9:51 AM, Borislav Petkov wrote:
> On Tue, Aug 02, 2022 at 01:50:53PM -0600, Jane Chu wrote:
>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>> poison granularity") that changed nfit_handle_mce() callback to report
>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>> because 0x1000 bytes is 8 512-byte.
> 
> What I'm missing from this text here is, what *is* the mce->misc LSB
> field in human speak? What does that field denote?
> 
> What effect does that field have on error injection
Tony has replied.

> 
> And so on.
> 
>> Dan Williams noticed that apei_mce_report_mem_error() hardcode
>> the LSB field to PAGE_SHIFT instead of consulting the input
>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>> support is available.
> 
> Rely on hardware? You're changing this to rely on what the firmware
> reports.
> 
> That mem_err thing comes from a BIOS table AFAICT.
> 

Would fixing the comment to indicate "relying on firmware" help?
Is there other concern?

thanks!
-jane

> ...
> 
> Thx.
> 



Re: [PATCH v7] x86/mce: retrieve poison range from hardware

2022-08-23 Thread Jane Chu
 >>> I suppose this wants to go upstream via the tree the bug came from (NVDIMM
 >>> tree? ACPI tree?), or should we pick it up into the x86 tree?
 >>
 >> No idea.  Maintainers?
 >
 > There's no real NVDIMM dependency here, just a general cleanup of how
 > APEI error granularities are managed. So I think it is appropriate for
 > this to go through the x86 tree via the typical path for mce related
 > topics.

+ Huang, Ying.

x86 maintainers,

Please let me know if you need another revision.

thanks,
-jane


On 8/8/2022 4:30 PM, Dan Williams wrote:
> Jane Chu wrote:
>> On 8/3/2022 1:53 AM, Ingo Molnar wrote:
>>>
>>> * Jane Chu  wrote:
>>>
>>>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>>>
>>> s/Commit/commit
>>
>> Maintainers,
>> Would you prefer a v8, or take care of the comment upon accepting the patch?
>>
>>>
>>>> poison granularity") that changed nfit_handle_mce() callback to report
>>>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>>>> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
>>>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>>>> because 0x1000 bytes is 8 512-byte.
>>>>
>>>> Dan Williams noticed that apei_mce_report_mem_error() hardcode
>>>> the LSB field to PAGE_SHIFT instead of consulting the input
>>>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>>>> support is available.
>>>>
>>>> Link: 
>>>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
>>>>
>>>> Reviewed-by: Dan Williams 
>>>> Reviewed-by: Ingo Molnar 
>>>> Signed-off-by: Jane Chu 
>>>> ---
>>>>arch/x86/kernel/cpu/mce/apei.c | 13 -
>>>>1 file changed, 12 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/mce/apei.c 
>>>> b/arch/x86/kernel/cpu/mce/apei.c
>>>> index 717192915f28..8ed341714686 100644
>>>> --- a/arch/x86/kernel/cpu/mce/apei.c
>>>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>>>> @@ -29,15 +29,26 @@
>>>>void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
>>>> *mem_err)
>>>>{
>>>>struct mce m;
>>>> +  int lsb;
>>>>
>>>>if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>>>return;
>>>>
>>>> +  /*
>>>> +   * Even if the ->validation_bits are set for address mask,
>>>> +   * to be extra safe, check and reject an error radius '0',
>>>> +   * and fall back to the default page size.
>>>> +   */
>>>> +  if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
>>>> +  lsb = find_first_bit((void *)&mem_err->physical_addr_mask, PAGE_SHIFT);
>>>> +  else
>>>> +  lsb = PAGE_SHIFT;
>>>> +
>>>>mce_setup(&m);
>>>>m.bank = -1;
>>>>/* Fake a memory read error with unknown channel */
>>>>m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
>>>> MCI_STATUS_MISCV | 0x9f;
>>>> -  m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
>>>> +  m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
>>>
>>> LGTM.
>>>
>>> I suppose this wants to go upstream via the tree the bug came from (NVDIMM
>>> tree? ACPI tree?), or should we pick it up into the x86 tree?
>>
>> No idea.  Maintainers?
> 
> There's no real NVDIMM dependency here, just a general cleanup of how
> APEI error granularities are managed. So I think it is appropriate for
> this to go through the x86 tree via the typical path for mce related
> topics.



Re: [PATCH v7] x86/mce: retrieve poison range from hardware

2022-08-08 Thread Jane Chu
On 8/3/2022 1:53 AM, Ingo Molnar wrote:
> 
> * Jane Chu  wrote:
> 
>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
> 
> s/Commit/commit

Maintainers,
Would you prefer a v8, or take care of the comment upon accepting the patch?

> 
>> poison granularity") that changed nfit_handle_mce() callback to report
>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>> because 0x1000 bytes is 8 512-byte.
>>
>> Dan Williams noticed that apei_mce_report_mem_error() hardcode
>> the LSB field to PAGE_SHIFT instead of consulting the input
>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>> support is available.
>>
>> Link: 
>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
>>
>> Reviewed-by: Dan Williams 
>> Reviewed-by: Ingo Molnar 
>> Signed-off-by: Jane Chu 
>> ---
>>   arch/x86/kernel/cpu/mce/apei.c | 13 -
>>   1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
>> index 717192915f28..8ed341714686 100644
>> --- a/arch/x86/kernel/cpu/mce/apei.c
>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>> @@ -29,15 +29,26 @@
>>   void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
>> *mem_err)
>>   {
>>  struct mce m;
>> +int lsb;
>>   
>>  if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>  return;
>>   
>> +/*
>> + * Even if the ->validation_bits are set for address mask,
>> + * to be extra safe, check and reject an error radius '0',
>> + * and fall back to the default page size.
>> + */
>> +if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
>> +lsb = find_first_bit((void *)&mem_err->physical_addr_mask, PAGE_SHIFT);
>> +else
>> +lsb = PAGE_SHIFT;
>> +
>>  mce_setup(&m);
>>  m.bank = -1;
>>  /* Fake a memory read error with unknown channel */
>>  m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
>> MCI_STATUS_MISCV | 0x9f;
>> -m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
>> +m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
> 
> LGTM.
> 
> I suppose this wants to go upstream via the tree the bug came from (NVDIMM
> tree? ACPI tree?), or should we pick it up into the x86 tree?

No idea.  Maintainers?

thanks!
-jane

> 
> Thanks,
> 
>   Ingo



[PATCH v7] x86/mce: retrieve poison range from hardware

2022-08-02 Thread Jane Chu
With commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
poison granularity"), which changed the nfit_handle_mce() callback to
report badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's
been discovered that the mce->misc LSB field is 0x1000 bytes; hence
injecting 2 back-to-back poisons ends up with the driver logging 8
badblocks, because 0x1000 bytes is 8 512-byte sectors.

Dan Williams noticed that apei_mce_report_mem_error() hardcodes
the LSB field to PAGE_SHIFT instead of consulting the input
struct cper_sec_mem_err record.  So change it to rely on the
hardware-reported granularity whenever support is available.

Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com

Reviewed-by: Dan Williams 
Reviewed-by: Ingo Molnar 
Signed-off-by: Jane Chu 
---
 arch/x86/kernel/cpu/mce/apei.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 717192915f28..8ed341714686 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -29,15 +29,26 @@
 void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 {
struct mce m;
+   int lsb;
 
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;
 
+   /*
+* Even if the ->validation_bits are set for address mask,
+* to be extra safe, check and reject an error radius '0',
+* and fall back to the default page size.
+*/
+   if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
+   lsb = find_first_bit((void *)&mem_err->physical_addr_mask, PAGE_SHIFT);
+   else
+   lsb = PAGE_SHIFT;
+
mce_setup(&m);
m.bank = -1;
/* Fake a memory read error with unknown channel */
m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
MCI_STATUS_MISCV | 0x9f;
-   m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
+   m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
 
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;
-- 
2.18.4




Re: [PATCH v6] x86/mce: retrieve poison range from hardware

2022-08-02 Thread Jane Chu
On 8/2/2022 3:59 AM, Ingo Molnar wrote:
> 
> * Jane Chu  wrote:
> 
>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>> poison granularity") that changed nfit_handle_mce() callback to report
>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>> because 0x1000 bytes is 8 512-byte.
>>
>> Dan Williams noticed that apei_mce_report_mem_error() hardcode
>> the LSB field to PAGE_SHIFT instead of consulting the input
>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>> support is available.
>>
>> Link: 
>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
>>
>> Reviewed-by: Dan Williams 
>> Signed-off-by: Jane Chu 
>> ---
>>   arch/x86/kernel/cpu/mce/apei.c | 14 +-
>>   1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
>> index 717192915f28..e2439c7872ba 100644
>> --- a/arch/x86/kernel/cpu/mce/apei.c
>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>> @@ -29,15 +29,27 @@
>>   void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
>> *mem_err)
>>   {
>>  struct mce m;
>> +int lsb;
>>   
>>  if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>  return;
>>   
>> +/*
>> + * Even if the ->validation_bits are set for address mask,
>> + * to be extra safe, check and reject an error radius '0',
>> + * and fallback to the default page size.
>> + */
> 
> speling nit:
> 
>s/fallback/fall back
> 
>> +if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
>> +lsb = find_first_bit((const unsigned long *)
>> +&mem_err->physical_addr_mask, PAGE_SHIFT);
> 
> I think we can write this in a shorter form and in a single line:
> 
>   lsb = find_first_bit((void *)&mem_err->physical_addr_mask, PAGE_SHIFT);
> 
> (Ignore checkpatch if it wants to break the line.)
> 
> Untested.
> 
> Assuming my suggestion is correct and with those addressed:
> 
>Reviewed-by: Ingo Molnar 

Thanks!  Just tested the change, v7 coming next.

-jane

> 
> Thanks,
> 
>   Ingo



[PATCH v6] x86/mce: retrieve poison range from hardware

2022-08-01 Thread Jane Chu
With commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
poison granularity"), which changed the nfit_handle_mce() callback to
report badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's
been discovered that the mce->misc LSB field is 0x1000 bytes; hence
injecting 2 back-to-back poisons ends up with the driver logging 8
badblocks, because 0x1000 bytes is 8 512-byte sectors.

Dan Williams noticed that apei_mce_report_mem_error() hardcodes
the LSB field to PAGE_SHIFT instead of consulting the input
struct cper_sec_mem_err record.  So change it to rely on the
hardware-reported granularity whenever support is available.

Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com

Reviewed-by: Dan Williams 
Signed-off-by: Jane Chu 
---
 arch/x86/kernel/cpu/mce/apei.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 717192915f28..e2439c7872ba 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -29,15 +29,27 @@
 void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 {
struct mce m;
+   int lsb;
 
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;
 
+   /*
+* Even if the ->validation_bits are set for address mask,
+* to be extra safe, check and reject an error radius '0',
+* and fallback to the default page size.
+*/
+   if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
+   lsb = find_first_bit((const unsigned long *)
+   &mem_err->physical_addr_mask, PAGE_SHIFT);
+   else
+   lsb = PAGE_SHIFT;
+
mce_setup(&m);
m.bank = -1;
/* Fake a memory read error with unknown channel */
m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
MCI_STATUS_MISCV | 0x9f;
-   m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
+   m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
 
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;
-- 
2.18.4




Re: [PATCH v5] x86/mce: retrieve poison range from hardware

2022-08-01 Thread Jane Chu
On 8/1/2022 2:20 PM, Dan Williams wrote:
> Jane Chu wrote:
>> On 8/1/2022 9:44 AM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>>>> poison granularity") that changed nfit_handle_mce() callback to report
>>>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>>>> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
>>>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>>>> because 0x1000 bytes is 8 512-byte.
>>>>
>>>> Dan Williams noticed that apei_mce_report_mem_error() hardcode
>>>> the LSB field to PAGE_SHIFT instead of consulting the input
>>>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>>>> support is available.
>>>>
>>>> Link: 
>>>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
>>>>
>>>> Reviewed-by: Dan Williams 
>>>> Signed-off-by: Jane Chu 
>>>> ---
>>>>arch/x86/kernel/cpu/mce/apei.c | 14 +-
>>>>1 file changed, 13 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/mce/apei.c 
>>>> b/arch/x86/kernel/cpu/mce/apei.c
>>>> index 717192915f28..2c7ea0ba9dd7 100644
>>>> --- a/arch/x86/kernel/cpu/mce/apei.c
>>>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>>>> @@ -29,15 +29,27 @@
>>>>void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
>>>> *mem_err)
>>>>{
>>>>struct mce m;
>>>> +  int lsb = PAGE_SHIFT;
>>>>
>>>>if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>>>return;
>>>>
>>>> +  /*
>>>> +   * Even if the ->validation_bits are set for address mask,
>>>> +   * to be extra safe, check and reject an error radius '0',
>>>> +   * and fallback to the default page size.
>>>> +   */
>>>> +  if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
>>>> +  lsb = __ffs64(mem_err->physical_addr_mask);
>>>> +  if (lsb == 1)
>>>
>>> This was the reason I recommended hweight64 and min_not_zero() as
>>> hweight64 does not have the undefined behavior. However, an even better
>>> option is to just do:
>>>
>>>   find_first_bit(&mem_err->physical_addr_mask, PAGE_SHIFT)
>>>
>>> ...as that trims the result to the PAGE_SHIFT max and handles the zero
>>> case.
>>
>> Thanks Dan!  However it looks like find_first_bit() could call into
>> __ffs(x) which has the same limitation as __ffs64(x), as Tony pointed out.
> 
> Not quite, no. __ffs() behavior is *undefined* if the input is zero.
> find_first_bit() is *defined* and returns @size if the input is zero.
> Which is the behavior this wants to default to PAGE_SHIFT in the absence
> of any smaller granularity information.
> 

You're right, because of this line -
return val ? __ffs(val) : size;
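
A hypothetical illustration of that defined zero-input behavior:

	u64 mask = 0;
	/* no bit is set within the first PAGE_SHIFT positions, so
	 * find_first_bit() returns the size argument itself */
	int lsb = find_first_bit((void *)&mask, PAGE_SHIFT);
	/* lsb == PAGE_SHIFT here: fall back to page granularity */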

Thanks!
-jane




Re: [PATCH v5] x86/mce: retrieve poison range from hardware

2022-08-01 Thread Jane Chu
On 8/1/2022 9:44 AM, Dan Williams wrote:
> Jane Chu wrote:
>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>> poison granularity") that changed nfit_handle_mce() callback to report
>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>> because 0x1000 bytes is 8 512-byte.
>>
>> Dan Williams noticed that apei_mce_report_mem_error() hardcode
>> the LSB field to PAGE_SHIFT instead of consulting the input
>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>> support is available.
>>
>> Link: 
>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
>>
>> Reviewed-by: Dan Williams 
>> Signed-off-by: Jane Chu 
>> ---
>>   arch/x86/kernel/cpu/mce/apei.c | 14 +-
>>   1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
>> index 717192915f28..2c7ea0ba9dd7 100644
>> --- a/arch/x86/kernel/cpu/mce/apei.c
>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>> @@ -29,15 +29,27 @@
>>   void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
>> *mem_err)
>>   {
>>  struct mce m;
>> +int lsb = PAGE_SHIFT;
>>   
>>  if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>  return;
>>   
>> +/*
>> + * Even if the ->validation_bits are set for address mask,
>> + * to be extra safe, check and reject an error radius '0',
>> + * and fallback to the default page size.
>> + */
>> +if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
>> +lsb = __ffs64(mem_err->physical_addr_mask);
>> +if (lsb == 1)
> 
> This was the reason I recommended hweight64 and min_not_zero() as
> hweight64 does not have the undefined behavior. However, an even better
> option is to just do:
> 
>  find_first_bit(&mem_err->physical_addr_mask, PAGE_SHIFT)
> 
> ...as that trims the result to the PAGE_SHIFT max and handles the zero
> case.

Thanks Dan!  However it looks like find_first_bit() could call into 
__ffs(x) which has the same limitation as __ffs64(x), as Tony pointed out.

I'll post v6 shortly.

thanks,
-jane


Re: [PATCH v5] x86/mce: retrieve poison range from hardware

2022-08-01 Thread Jane Chu
On 8/1/2022 8:58 AM, Luck, Tony wrote:
>>  struct mce m;
>> +int lsb = PAGE_SHIFT;
> 
> Some maintainers like to order local declaration lines from longest to 
> shortest
>   
>> + /*
>> +  * Even if the ->validation_bits are set for address mask,
>> +  * to be extra safe, check and reject an error radius '0',
>> +  * and fallback to the default page size.
>> +  */
>> + if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
>> + lsb = __ffs64(mem_err->physical_addr_mask);
>> + if (lsb == 1)
>> + lsb = PAGE_SHIFT;
>> + }
> 
> 
> The comment above __ffs64() says:
> 
> * The result is not defined if no bits are set, so check that @word
>   * is non-zero before calling this.
> 
> So if the intent is "extra safe" it should check for that:
> 
>  if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK &&
>  mem_err->physical_addr_mask) {
>  lsb = __ffs64(mem_err->physical_addr_mask);
>  if (lsb == 1)
>  lsb = PAGE_SHIFT;
>  }

Indeed, thanks a lot!

-jane


> 
> -Tony
> 
>   
> 



[PATCH v5] x86/mce: retrieve poison range from hardware

2022-07-30 Thread Jane Chu
With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
poison granularity") that changed nfit_handle_mce() callback to report
badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
2 back-to-back poisons and the driver ends up logging 8 badblocks,
because 0x1000 bytes is 8 512-byte blocks.

Dan Williams noticed that apei_mce_report_mem_error() hardcodes
the LSB field to PAGE_SHIFT instead of consulting the input
struct cper_sec_mem_err record.  So change to rely on hardware whenever
support is available.

Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com

Reviewed-by: Dan Williams 
Signed-off-by: Jane Chu 
---
 arch/x86/kernel/cpu/mce/apei.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 717192915f28..2c7ea0ba9dd7 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -29,15 +29,27 @@
 void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 {
struct mce m;
+   int lsb = PAGE_SHIFT;
 
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;
 
+   /*
+* Even if the ->validation_bits are set for address mask,
+* to be extra safe, check and reject an error radius '0',
+* and fallback to the default page size.
+*/
+   if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
+   lsb = __ffs64(mem_err->physical_addr_mask);
+   if (lsb == 1)
+   lsb = PAGE_SHIFT;
+   }
+
mce_setup(&m);
m.bank = -1;
/* Fake a memory read error with unknown channel */
m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
MCI_STATUS_MISCV | 0x9f;
-   m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
+   m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
 
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;
-- 
2.18.4




Re: [PATCH v4] x86/mce: retrieve poison range from hardware

2022-07-28 Thread Jane Chu
On 7/28/2022 11:46 AM, Dan Williams wrote:
> Jane Chu wrote:
>> On 7/27/2022 1:01 PM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> On 7/27/2022 12:30 PM, Jane Chu wrote:
>>>>> On 7/27/2022 12:24 PM, Jane Chu wrote:
>>>>>> On 7/27/2022 11:56 AM, Dan Williams wrote:
>>>>>>> Jane Chu wrote:
>>>>>>>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>>>>>>>> poison granularity") that changed nfit_handle_mce() callback to report
>>>>>>>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>>>>>>>> discovered that the mce->misc LSB field is 0x1000 bytes, hence
>>>>>>>> injecting
>>>>>>>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>>>>>>>> because 0x1000 bytes is 8 512-byte blocks.
>>>>>>>>
>>>>>>>> Dan Williams noticed that apei_mce_report_mem_error() hardcodes
>>>>>>>> the LSB field to PAGE_SHIFT instead of consulting the input
>>>>>>>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>>>>>>>> support is available.
>>>>>>>>
>>>>>>>> Link:
>>>>>>>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
>>>>>>>>
>>>>>>>>
>>>>>>>> Reviewed-by: Dan Williams 
>>>>>>>> Signed-off-by: Jane Chu 
>>>>>>>> ---
>>>>>>>>     arch/x86/kernel/cpu/mce/apei.c | 14 +-
>>>>>>>>     1 file changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/kernel/cpu/mce/apei.c
>>>>>>>> b/arch/x86/kernel/cpu/mce/apei.c
>>>>>>>> index 717192915f28..26d63818b2de 100644
>>>>>>>> --- a/arch/x86/kernel/cpu/mce/apei.c
>>>>>>>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>>>>>>>> @@ -29,15 +29,27 @@
>>>>>>>>     void apei_mce_report_mem_error(int severity, struct
>>>>>>>> cper_sec_mem_err *mem_err)
>>>>>>>>     {
>>>>>>>>     struct mce m;
>>>>>>>> +    int grain = PAGE_SHIFT;
>>>>>>>>     if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>>>>>>>     return;
>>>>>>>> +    /*
>>>>>>>> + * Even if the ->validation_bits are set for address mask,
>>>>>>>> + * to be extra safe, check and reject an error radius '0',
>>>>>>>> + * and fallback to the default page size.
>>>>>>>> + */
>>>>>>>> +    if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
>>>>>>>> +    grain = ~mem_err->physical_addr_mask + 1;
>>>>>>>> +    if (grain == 1)
>>>>>>>> +    grain = PAGE_SHIFT;
>>>>>>>
>>>>>>> Wait, if @grain is the number of bits to mask off the address, shouldn't
>>>>>>> this be something like:
>>>>>>>
>>>>>>>    grain = min_not_zero(PAGE_SHIFT,
>>>>>>> hweight64(~mem_err->physical_addr_mask));
>>>>>>
>>>>>> I see. I guess what you meant is
>>>>>>       grain = min(PAGE_SHIFT, (1 +
>>>>>> hweight64(~mem_err->physical_addr_mask)));
>>>>>
>>>>> Sorry, take that back, it won't work either.
>>>>
>>>> This will work,
>>>>  grain = min_not_zero(PAGE_SHIFT - 1,
>>>> hweight64(~mem_err->physical_addr_mask));
>>>>  grain++;
>>>> but too sophisticated?  I guess I prefer the simple "if" expression.
>>>
>>> An "if" is fine, I was more pointing out that:
>>>
>>>   hweight64(~mem_err->physical_addr_mask) + 1
>>>
>>> ...and:
>>>
>>>   ~mem_err->physical_addr_mask + 1;
>>>
>>> ...give different results.
>>
>> They are different indeed.  hweight64 returns the count of set bits while
>> ~mem_err->physical_addr_mask returns a negated value.
>>
>> According to the definition of "Physical Address Mask" -
>>
>> https://uefi.org/sites/default/files/resources/UEFI_Spec_2_9_2021_03_18.pdf
>>
>> Table N-31 Memory Error Record
>>
>> Physical Address Mask (byte offset 24, length 8): Defines the valid
>> address bits in the Physical Address field. The mask specifies the
>> granularity of the physical address, which is dependent on the hw
>> implementation factors such as interleaving.
>>
>> It appears that "Physical Address Mask" is defined more like PAGE_MASK
>> rather than in bitops terms; hweight64() is often used to count the set
>> bits as an indication of (e.g.) how many registers are in use.
>>
>> And similar to PAGE_MASK, a valid "Physical Address Mask" should
>> consist of contiguous low 0 bits, not 1's and 0's mixed up.
>>
>> So far, as far as I can see, the v4 patch still looks correct to me.
>> Please let me know if I'm missing anything.
> 
> The v4 patch looks broken to me. If the address mask is
> 0xffffffffffffffc0 to indicate a cacheline error then:
> 
>  ~mem_err->physical_addr_mask + 1;
> 
> ...results in a grain of 64 when it should be 6.

Right, it's the exponent that's needed, so back to __ffs64().
Sorry for the detour.  v5 is coming next.

thanks!
-jane
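
A quick user-space check of the arithmetic above, assuming GCC/Clang
builtins: __builtin_ctzll() mirrors the kernel's __ffs64(), and
__builtin_popcountll() mirrors hweight64().

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t mask = 0xffffffffffffffc0ULL;	/* 64-byte cacheline mask */

		/* byte granularity, what the v4 expression computes: 64 */
		printf("~mask + 1 = %llu\n", (unsigned long long)(~mask + 1));

		/* the exponent the MCI_MISC LSB field wants: 6 */
		printf("ctz = %d, hweight(~mask) = %d\n",
		       __builtin_ctzll(mask), __builtin_popcountll(~mask));
		return 0;
	}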


Re: [PATCH v4] x86/mce: retrieve poison range from hardware

2022-07-28 Thread Jane Chu
On 7/27/2022 1:01 PM, Dan Williams wrote:
> Jane Chu wrote:
>> On 7/27/2022 12:30 PM, Jane Chu wrote:
>>> On 7/27/2022 12:24 PM, Jane Chu wrote:
>>>> On 7/27/2022 11:56 AM, Dan Williams wrote:
>>>>> Jane Chu wrote:
>>>>>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>>>>>> poison granularity") that changed nfit_handle_mce() callback to report
>>>>>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>>>>>> discovered that the mce->misc LSB field is 0x1000 bytes, hence
>>>>>> injecting
>>>>>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>>>>>> because 0x1000 bytes is 8 512-byte blocks.
>>>>>>
>>>>>> Dan Williams noticed that apei_mce_report_mem_error() hardcodes
>>>>>> the LSB field to PAGE_SHIFT instead of consulting the input
>>>>>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>>>>>> support is available.
>>>>>>
>>>>>> Link:
>>>>>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
>>>>>>
>>>>>>
>>>>>> Reviewed-by: Dan Williams 
>>>>>> Signed-off-by: Jane Chu 
>>>>>> ---
>>>>>>    arch/x86/kernel/cpu/mce/apei.c | 14 +-
>>>>>>    1 file changed, 13 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arch/x86/kernel/cpu/mce/apei.c
>>>>>> b/arch/x86/kernel/cpu/mce/apei.c
>>>>>> index 717192915f28..26d63818b2de 100644
>>>>>> --- a/arch/x86/kernel/cpu/mce/apei.c
>>>>>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>>>>>> @@ -29,15 +29,27 @@
>>>>>>    void apei_mce_report_mem_error(int severity, struct
>>>>>> cper_sec_mem_err *mem_err)
>>>>>>    {
>>>>>>    struct mce m;
>>>>>> +    int grain = PAGE_SHIFT;
>>>>>>    if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>>>>>    return;
>>>>>> +    /*
>>>>>> + * Even if the ->validation_bits are set for address mask,
>>>>>> + * to be extra safe, check and reject an error radius '0',
>>>>>> + * and fallback to the default page size.
>>>>>> + */
>>>>>> +    if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
>>>>>> +    grain = ~mem_err->physical_addr_mask + 1;
>>>>>> +    if (grain == 1)
>>>>>> +    grain = PAGE_SHIFT;
>>>>>
>>>>> Wait, if @grain is the number of bits to mask off the address, shouldn't
>>>>> this be something like:
>>>>>
>>>>>   grain = min_not_zero(PAGE_SHIFT,
>>>>> hweight64(~mem_err->physical_addr_mask));
>>>>
>>>> I see. I guess what you meant is
>>>>      grain = min(PAGE_SHIFT, (1 +
>>>> hweight64(~mem_err->physical_addr_mask)));
>>>
>>> Sorry, take that back, it won't work either.
>>
>> This will work,
>> grain = min_not_zero(PAGE_SHIFT - 1,
>> hweight64(~mem_err->physical_addr_mask));
>> grain++;
>> but too sophisticated?  I guess I prefer the simple "if" expression.
> 
> An "if" is fine, I was more pointing out that:
> 
>  hweight64(~mem_err->physical_addr_mask) + 1
> 
> ...and:
> 
>  ~mem_err->physical_addr_mask + 1;
> 
> ...give different results.

They are different indeed.  hweight64 returns the count of set bits while
~mem_err->physical_addr_mask returns a negated value.

According to the definition of "Physical Address Mask" -

https://uefi.org/sites/default/files/resources/UEFI_Spec_2_9_2021_03_18.pdf

Table N-31 Memory Error Record

Physical Address Mask (byte offset 24, length 8): Defines the valid
address bits in the Physical Address field. The mask specifies the
granularity of the physical address, which is dependent on the hw
implementation factors such as interleaving.

It appears that "Physical Address Mask" is defined more like PAGE_MASK
rather than in bitops terms; hweight64() is often used to count the set
bits as an indication of (e.g.) how many registers are in use.

And similar to PAGE_MASK, a valid "Physical Address Mask" should
consist of contiguous low 0 bits, not 1's and 0's mixed up.

So far, as far as I can see, the v4 patch still looks correct to me.
Please let me know if I'm missing anything.

thanks!
-jane






Re: [PATCH v4] x86/mce: retrieve poison range from hardware

2022-07-27 Thread Jane Chu
On 7/27/2022 12:30 PM, Jane Chu wrote:
> On 7/27/2022 12:24 PM, Jane Chu wrote:
>> On 7/27/2022 11:56 AM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>>>> poison granularity") that changed nfit_handle_mce() callback to report
>>>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>>>> discovered that the mce->misc LSB field is 0x1000 bytes, hence 
>>>> injecting
>>>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>>>> because 0x1000 bytes is 8 512-byte blocks.
>>>>
>>>> Dan Williams noticed that apei_mce_report_mem_error() hardcodes
>>>> the LSB field to PAGE_SHIFT instead of consulting the input
>>>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>>>> support is available.
>>>>
>>>> Link: 
>>>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com 
>>>>
>>>>
>>>> Reviewed-by: Dan Williams 
>>>> Signed-off-by: Jane Chu 
>>>> ---
>>>>   arch/x86/kernel/cpu/mce/apei.c | 14 +-
>>>>   1 file changed, 13 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/mce/apei.c 
>>>> b/arch/x86/kernel/cpu/mce/apei.c
>>>> index 717192915f28..26d63818b2de 100644
>>>> --- a/arch/x86/kernel/cpu/mce/apei.c
>>>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>>>> @@ -29,15 +29,27 @@
>>>>   void apei_mce_report_mem_error(int severity, struct 
>>>> cper_sec_mem_err *mem_err)
>>>>   {
>>>>   struct mce m;
>>>> +    int grain = PAGE_SHIFT;
>>>>   if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>>>   return;
>>>> +    /*
>>>> + * Even if the ->validation_bits are set for address mask,
>>>> + * to be extra safe, check and reject an error radius '0',
>>>> + * and fallback to the default page size.
>>>> + */
>>>> +    if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
>>>> +    grain = ~mem_err->physical_addr_mask + 1;
>>>> +    if (grain == 1)
>>>> +    grain = PAGE_SHIFT;
>>>
>>> Wait, if @grain is the number of bits to mask off the address, shouldn't
>>> this be something like:
>>>
>>>  grain = min_not_zero(PAGE_SHIFT, 
>>> hweight64(~mem_err->physical_addr_mask));
>>
>> I see. I guess what you meant is
>>     grain = min(PAGE_SHIFT, (1 + 
>> hweight64(~mem_err->physical_addr_mask)));
> 
> Sorry, take that back, it won't work either.

This will work,
   grain = min_not_zero(PAGE_SHIFT - 1, 
hweight64(~mem_err->physical_addr_mask));
   grain++;
but too sophisticated?  I guess I prefer the simple "if" expression.

thanks,
-jane

> 
> -jane
> 
>> so that in the pmem poison case, 'grain' would be 8, not 7.
>>
>> thanks,
>> -jane
>>
>>>
>>> ...?
>>
> 



Re: [PATCH v4] x86/mce: retrieve poison range from hardware

2022-07-27 Thread Jane Chu
On 7/27/2022 12:24 PM, Jane Chu wrote:
> On 7/27/2022 11:56 AM, Dan Williams wrote:
>> Jane Chu wrote:
>>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>>> poison granularity") that changed nfit_handle_mce() callback to report
>>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>>> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
>>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>>> because 0x1000 bytes is 8 512-byte blocks.
>>>
>>> Dan Williams noticed that apei_mce_report_mem_error() hardcodes
>>> the LSB field to PAGE_SHIFT instead of consulting the input
>>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>>> support is available.
>>>
>>> Link: 
>>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com 
>>>
>>>
>>> Reviewed-by: Dan Williams 
>>> Signed-off-by: Jane Chu 
>>> ---
>>>   arch/x86/kernel/cpu/mce/apei.c | 14 +-
>>>   1 file changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/mce/apei.c 
>>> b/arch/x86/kernel/cpu/mce/apei.c
>>> index 717192915f28..26d63818b2de 100644
>>> --- a/arch/x86/kernel/cpu/mce/apei.c
>>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>>> @@ -29,15 +29,27 @@
>>>   void apei_mce_report_mem_error(int severity, struct 
>>> cper_sec_mem_err *mem_err)
>>>   {
>>>   struct mce m;
>>> +    int grain = PAGE_SHIFT;
>>>   if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>>   return;
>>> +    /*
>>> + * Even if the ->validation_bits are set for address mask,
>>> + * to be extra safe, check and reject an error radius '0',
>>> + * and fallback to the default page size.
>>> + */
>>> +    if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
>>> +    grain = ~mem_err->physical_addr_mask + 1;
>>> +    if (grain == 1)
>>> +    grain = PAGE_SHIFT;
>>
>> Wait, if @grain is the number of bits to mask off the address, shouldn't
>> this be something like:
>>
>>  grain = min_not_zero(PAGE_SHIFT, 
>> hweight64(~mem_err->physical_addr_mask));
> 
> I see. I guess what you meant is
>     grain = min(PAGE_SHIFT, (1 + hweight64(~mem_err->physical_addr_mask)));

Sorry, take that back, it won't work either.

-jane

> so that in the pmem poison case, 'grain' would be 8, not 7.
> 
> thanks,
> -jane
> 
>>
>> ...?
> 



Re: [PATCH v4] x86/mce: retrieve poison range from hardware

2022-07-27 Thread Jane Chu
On 7/27/2022 11:56 AM, Dan Williams wrote:
> Jane Chu wrote:
>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>> poison granularity") that changed nfit_handle_mce() callback to report
>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>> because 0x1000 bytes is 8 512-byte blocks.
>>
>> Dan Williams noticed that apei_mce_report_mem_error() hardcodes
>> the LSB field to PAGE_SHIFT instead of consulting the input
>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>> support is available.
>>
>> Link: 
>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
>>
>> Reviewed-by: Dan Williams 
>> Signed-off-by: Jane Chu 
>> ---
>>   arch/x86/kernel/cpu/mce/apei.c | 14 +-
>>   1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
>> index 717192915f28..26d63818b2de 100644
>> --- a/arch/x86/kernel/cpu/mce/apei.c
>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>> @@ -29,15 +29,27 @@
>>   void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
>> *mem_err)
>>   {
>>  struct mce m;
>> +int grain = PAGE_SHIFT;
>>   
>>  if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>  return;
>>   
>> +/*
>> + * Even if the ->validation_bits are set for address mask,
>> + * to be extra safe, check and reject an error radius '0',
>> + * and fallback to the default page size.
>> + */
>> +if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
>> +grain = ~mem_err->physical_addr_mask + 1;
>> +if (grain == 1)
>> +grain = PAGE_SHIFT;
> 
> Wait, if @grain is the number of bits to mask off the address, shouldn't
> this be something like:
> 
>  grain = min_not_zero(PAGE_SHIFT, 
> hweight64(~mem_err->physical_addr_mask));

I see. I guess what you meant is
grain = min(PAGE_SHIFT, (1 + hweight64(~mem_err->physical_addr_mask)));
so that in the pmem poison case, 'grain' would be 8, not 7.

thanks,
-jane

> 
> ...?



[PATCH v4] x86/mce: retrieve poison range from hardware

2022-07-27 Thread Jane Chu
With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
poison granularity") that changed nfit_handle_mce() callback to report
badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
2 back-to-back poisons and the driver ends up logging 8 badblocks,
because 0x1000 bytes is 8 512-byte blocks.

Dan Williams noticed that apei_mce_report_mem_error() hardcodes
the LSB field to PAGE_SHIFT instead of consulting the input
struct cper_sec_mem_err record.  So change to rely on hardware whenever
support is available.

Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com

Reviewed-by: Dan Williams 
Signed-off-by: Jane Chu 
---
 arch/x86/kernel/cpu/mce/apei.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 717192915f28..26d63818b2de 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -29,15 +29,27 @@
 void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 {
struct mce m;
+   int grain = PAGE_SHIFT;
 
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;
 
+   /*
+* Even if the ->validation_bits are set for address mask,
+* to be extra safe, check and reject an error radius '0',
+* and fallback to the default page size.
+*/
+   if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
+   grain = ~mem_err->physical_addr_mask + 1;
+   if (grain == 1)
+   grain = PAGE_SHIFT;
+   }
+
mce_setup(&m);
m.bank = -1;
/* Fake a memory read error with unknown channel */
m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
MCI_STATUS_MISCV | 0x9f;
-   m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
+   m.misc = (MCI_MISC_ADDR_PHYS << 6) | grain;
 
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;
-- 
2.18.4




Re: [PATCH v3] x86/mce: retrieve poison range from hardware

2022-07-18 Thread Jane Chu
On 7/18/2022 12:22 PM, Luck, Tony wrote:
>> It appears the kernel is trusting that ->physical_addr_mask is non-zero
>> in other paths. So this is at least equally broken in the presence of a
>> broken BIOS. The impact is potentially larger though with this change,
>> so it might be a good follow-on patch to make sure that
>> ->physical_addr_mask gets fixed up to a minimum mask value.
> 
> Agreed. Separate patch to sanitize early, so other kernel code can just use 
> it.
> 

Is it possible that with
   if (mem->validation_bits & CPER_MEM_VALID_PA_MASK)
the ->physical_addr_mask is still untrustworthy?

include/ras/ras_event.h has this
	if (mem->validation_bits & CPER_MEM_VALID_PA_MASK)
		__entry->pa_mask_lsb = (u8)__ffs64(mem->physical_addr_mask);
	else
		__entry->pa_mask_lsb = ~0;
which hints otherwise.

apei_mce_report_mem_error() already checks mem->validation_bits
up front.

thanks!
-jane


> -Tony
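
A sketch of what that "sanitize early" follow-on might look like, purely
hypothetical (no such patch exists in this thread): normalize an absent
or zero mask once, early in the GHES/CPER ingest path, so downstream
consumers can use it unchecked.

	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) ||
	    !mem_err->physical_addr_mask)
		mem_err->physical_addr_mask = ~((1ULL << PAGE_SHIFT) - 1);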



[PATCH v3] x86/mce: retrieve poison range from hardware

2022-07-17 Thread Jane Chu
With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
poison granularity") that changed nfit_handle_mce() callback to report
badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
2 back-to-back poisons and the driver ends up logging 8 badblocks,
because 0x1000 bytes is 8 512-byte blocks.

Dan Williams noticed that apei_mce_report_mem_error() hardcodes
the LSB field to PAGE_SHIFT instead of consulting the input
struct cper_sec_mem_err record.  So change to rely on hardware whenever
support is available.

Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com

Reviewed-by: Dan Williams 
Signed-off-by: Jane Chu 
---
 arch/x86/kernel/cpu/mce/apei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 717192915f28..a8274fd57add 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -37,7 +37,7 @@ void apei_mce_report_mem_error(int severity, struct 
cper_sec_mem_err *mem_err)
m.bank = -1;
/* Fake a memory read error with unknown channel */
m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
MCI_STATUS_MISCV | 0x9f;
-   m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
+   m.misc = (MCI_MISC_ADDR_PHYS << 6) | 
__ffs64(mem_err->physical_addr_mask);
 
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;
-- 
2.18.4




Re: [PATCH v2] x86/mce: retrieve poison range from hardware whenever supported

2022-07-16 Thread Jane Chu
On 7/15/2022 9:50 PM, Dan Williams wrote:
> Jane Chu wrote:
>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>> poison granularity") that changed nfit_handle_mce() callback to report
>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>> because 0x1000 bytes is 8 512-byte blocks.
>>
>> Dan Williams noticed that apei_mce_report_mem_error() hardcodes
>> the LSB field to PAGE_SHIFT instead of consulting the input
>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>> support is available.
>>
>> v1: https://lkml.org/lkml/2022/7/15/1040
> 
> This should be "Link:" and it should reference lore.kernel.org by
> msg-id. Lore is maintained by LF infrastructure and the message-id can
> be used to search lore.kernel.org mirrors even if the LF infra is down.
> 
> Link: 
> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
> 

Got it, thanks!

>> Signed-off-by: Jane Chu 
>> ---
>>   arch/x86/kernel/cpu/mce/apei.c | 6 +-
>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
>> index 717192915f28..a4d5893632e0 100644
>> --- a/arch/x86/kernel/cpu/mce/apei.c
>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>> @@ -29,6 +29,7 @@
>>   void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
>> *mem_err)
>>   {
>>  struct mce m;
>> +int lsb;
>>   
>>  if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>  return;
>> @@ -37,7 +38,10 @@ void apei_mce_report_mem_error(int severity, struct 
>> cper_sec_mem_err *mem_err)
>>  m.bank = -1;
>>  /* Fake a memory read error with unknown channel */
>>  m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
>> MCI_STATUS_MISCV | 0x9f;
>> -m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
>> +lsb = __builtin_ffs(mem_err->physical_addr_mask) - 1;
> 
> Just use the kernel's __ffs64() which does not require the substraction

Will do.

> fixup, and you can assume that physical_address_mask is non-zero just
> like trace_extlog_mem_event() does.

I guess to me, more convincing evidence is
	/* Error grain */
	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
		e->grain = ~mem_err->physical_addr_mask + 1;
in ghes_edac_report_mem_error().

> 
>> +if (lsb <= 0)
>> +lsb = PAGE_SHIFT;
> 
> This fixup can then be removed as well.
> 
> After the above comments are addressed you can add:
> 
> Reviewed-by: Dan Williams 

Thanks!
-jane
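
For reference, the 1-based vs 0-based difference in user-space terms,
assuming GCC/Clang builtins: __builtin_ffsll() is 1-based like
__builtin_ffs() but takes the full 64 bits (__builtin_ffs() only takes
an int), while __builtin_ctzll() is 0-based like the kernel's __ffs64().

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t mask = 0xffffffffffffff00ULL;	/* 256-byte granularity */

		printf("ffsll - 1 = %d\n", __builtin_ffsll(mask) - 1);	/* 8 */
		printf("ctzll     = %d\n", __builtin_ctzll(mask));	/* 8 */
		return 0;
	}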


[PATCH v2] x86/mce: retrieve poison range from hardware whenever supported

2022-07-15 Thread Jane Chu
With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
poison granularity") that changed nfit_handle_mce() callback to report
badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
2 back-to-back poisons and the driver ends up logging 8 badblocks,
because 0x1000 bytes is 8 512-byte blocks.

Dan Williams noticed that apei_mce_report_mem_error() hardcodes
the LSB field to PAGE_SHIFT instead of consulting the input
struct cper_sec_mem_err record.  So change to rely on hardware whenever
support is available.

v1: https://lkml.org/lkml/2022/7/15/1040
Signed-off-by: Jane Chu 
---
 arch/x86/kernel/cpu/mce/apei.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 717192915f28..a4d5893632e0 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -29,6 +29,7 @@
 void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 {
struct mce m;
+   int lsb;
 
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;
@@ -37,7 +38,10 @@ void apei_mce_report_mem_error(int severity, struct 
cper_sec_mem_err *mem_err)
m.bank = -1;
/* Fake a memory read error with unknown channel */
m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
MCI_STATUS_MISCV | 0x9f;
-   m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
+   lsb = __builtin_ffs(mem_err->physical_addr_mask) - 1;
+   if (lsb <= 0)
+   lsb = PAGE_SHIFT;
+   m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
 
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;
-- 
2.18.4




Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-15 Thread Jane Chu
On 7/15/2022 12:17 PM, Dan Williams wrote:
> [ add Tony ]
> 
> Jane Chu wrote:
>> On 7/14/2022 6:19 PM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
>>>> one call for each poison with accurate address.
>>>>
>>>> Also, short ARS would find 2 poisons.
>>>>
>>>> I attached the console output, my annotation is prefixed with "<==".
>>>
>>> [29078.634817] {4}[Hardware Error]:   physical_address: 0x0040a0602600  
>>> <== 2nd poison @ 0x600
>>> [29078.642200] {4}[Hardware Error]:   physical_address_mask: 
>>> 0xffffffffffffff00
>>>
>>> Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
>>> is seeing a 256-byte address mask?
>>
>> Good question!  One would think both GHES reporting and
>> nfit_handle_mce() are consuming the same mce record...
>> Who might know?
> 
> Did some grepping...
> 
> Have a look at: apei_mce_report_mem_error()
> 
> "The call is coming from inside the house!"
> 
> Luckily we do not need to contact a BIOS engineer to get this fixed.

Great, thank you!
Just put together a quick fix for review after I tested it.

> 
>>> Sigh, is this "firmware-first" causing the kernel to get bad information
>>> via the native mechanisms?
>>> I would expect that if this test was truly worried about minimizing BIOS
>>> latency it would disable firmware-first error reporting. I wonder if
>>> that fixes the observed problem?
>>
>> Could you elaborate on firmware-first error handling please?  What are
>> the possible consequences of disabling it, and how would one disable it?
> 
> With my Linux kernel developer hat on, firmware-first error handling is
> really only useful for supporting legacy operating systems that do not
> have native machine check handling, or for platforms that have bugs that
> would otherwise cause OS native error handling to fail. Otherwise, for
> modern Linux, firmware-first error handling is pure overhead and a
> source of bugs.
> 
> In this case the bug is in the Linux code that translates the ACPI event
> back into an MCE record.

Thanks!

-jane



Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-15 Thread Jane Chu
On 7/14/2022 5:58 PM, Dan Williams wrote:
[..]
>>>
> However, the ARS engine likely can return the precise error ranges so I
> think the fix is to just use the address range indicated by 1UL <<
> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
> scrub request to ask the device for the precise error list.

 You mean for nfit_handle_mce() callback to issue a short ARS per each
 poison report over a 4K range
>>>
>>> Over a L1_CACHE_BYTES range...
>>>
[..]
>>>
>>> For the badrange tracking, no. So this would just be a check to say
>>> "Yes, CPU I see you think the whole 4K is gone, but lets double check
>>> with more precise information for what gets placed in the badrange
>>> tracking".
>>
>> Okay, process-wise, this is what I am seeing -
>>
>> - for each poison, nfit_handle_mce() issues a short ARS given (addr,
>> 64bytes)
> 
> Why would the short-ARS be performed over a 64-byte span when the MCE
> reported a 4K aligned event?

Cuz you said so, see above. :)  Yes, 4K range as reported by the MCE 
makes sense.

> 
>> - and short ARS returns to say that's actually (addr, 256bytes),
>> - and then nvdimm_bus_add_badrange() logs the poison in (addr, 512bytes)
>> anyway.
> 
> Right, I am reacting to the fact that the patch is picking 512 as an
> arbtitrary blast radius. It's ok to expand the blast radius from
> hardware when, for example, recording a 64-byte MCE in badrange which
> only understands 512 byte records, but it's not ok to take a 4K MCE and
> trim it to 512 bytes without asking hardware for a more precise report.

Agreed.

> 
> Recall that the NFIT driver supports platforms that may not offer ARS.
> In that case the 4K MCE from the CPU is all that the driver gets and
> there is no data source for a more precise answer.
> 
> So the ask is to avoid trimming the blast radius of MCE reports unless
> and until a short-ARS says otherwise.
> 

What happens to short ARS on a platform that doesn't support ARS?
-EOPNOTSUPP?

>> The precise badrange from short ARS is lost in the process; given the
>> time spent visiting the BIOS, what's the gain?
> 
> Generic support for not under-recording poison on platforms that do not
> support ARS.
> 
>> Could we defer the precise badrange until there is consumer of the
>> information?
> 
> Ideally the consumer is immediate and this precise information can make
> it to the filesystem which might be able to make a better decision about
> what data got clobbered.
> 
> See dax_notify_failure() infrastructure currently in linux-next that can
> convey poison events to filesystems. That might be a path to start
> tracking and reporting precise failure information to address the
> constraints of the badrange implementation.

Yes, I'm aware of dax_notify_failure(), but would appreciate it if you
could elaborate on how the code path could be leveraged for a precise
badrange implementation.
My understanding is that dax_notify_failure() is in the path of 
synchronous fault accompanied by SIGBUS with BUS_MCEERR_AR.
But badrange could be recorded without poison being consumed, even 
without DAX filesystem in the picture.

thanks,
-jane
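
For concreteness, a hedged sketch of the flow Dan describes, with a
hypothetical helper nfit_short_ars_trim() standing in for the short-ARS
query; none of this is code from the thread:

	u64 addr = ALIGN_DOWN(mce->addr, align);
	u64 len = align;

	/*
	 * Best effort: ask the BIOS staging buffer for a tighter range.
	 * Platforms without ARS simply keep the MCE-reported blast radius.
	 */
	nfit_short_ars_trim(acpi_desc, &addr, &len);	/* hypothetical */
	nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus, addr, len);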



Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-15 Thread Jane Chu
On 7/14/2022 6:19 PM, Dan Williams wrote:
> Jane Chu wrote:
>> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
>> one call for each poison with accurate address.
>>
>> Also, short ARS would find 2 poisons.
>>
>> I attached the console output, my annotation is prefixed with "<==".
> 
> [29078.634817] {4}[Hardware Error]:   physical_address: 0x0040a0602600
> <== 2nd poison @ 0x600
> [29078.642200] {4}[Hardware Error]:   physical_address_mask: 
> 0xffffffffffffff00
> 
> Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
> is seeing a 256-byte address mask?

Good question!  One would think both GHES reporting and 
nfit_handle_mce() are consuming the same mce record...
Who might know?

> 
> Sigh, is this "firmware-first" causing the kernel to get bad information
> via the native mechanisms?
> I would expect that if this test was truly worried about minimizing BIOS
> latency it would disable firmware-first error reporting. I wonder if
> that fixes the observed problem?

Could you elaborate on firmware-first error handling please?  What are
the possible consequences of disabling it, and how would one disable it?

thanks!
-jane




Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-14 Thread Jane Chu
On 7/13/2022 5:24 PM, Dan Williams wrote:
> Jane Chu wrote:
>> On 7/12/2022 5:48 PM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
>>>> granularity") changed nfit_handle_mce() callback to report badrange for
>>>> each poison at an alignment indicated by 1ULL << 
>>>> MCI_MISC_ADDR_LSB(mce->misc)
>>>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
>>>> populated with Intel DCPMEM v2 dimms, it appears that
>>>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out to be 4KiB, or 8
>>>> 512-byte blocks.
>>>> Consequently, after injecting 2 back-to-back poisons via ndctl, it
>>>> reports 8 poisons.
>>>>
>>>> [29076.590281] {3}[Hardware Error]:   physical_address: 0x0040a0602400
>>>> [..]
>>>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: 
>>>> Recovered
>>>> [29076.627519] mce: [Hardware Error]: Machine check events logged
>>>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x408000, 0x1f8000)
>>>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 
>>>> 0x1000)
>>>> [..]
>>>> [29078.634817] {4}[Hardware Error]:   physical_address: 0x0040a0602600
>>>> [..]
>>>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x408000, 0x1f8000)
>>>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 
>>>> 0x1000)
>>>> [..]
>>>> {
>>>> "dev":"namespace0.0",
>>>> "mode":"fsdax",
>>>> "map":"dev",
>>>> "size":33820770304,
>>>> "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
>>>> "sector_size":512,
>>>> "align":2097152,
>>>> "blockdev":"pmem0",
>>>> "badblock_count":8,
>>>> "badblocks":[
>>>>   {
>>>> "offset":8208,
>>>> "length":8,
>>>> "dimms":[
>>>>   "nmem0"
>>>> ]
>>>>   }
>>>> ]
>>>> }
>>>>
>>>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for 
>>>> poison
>>>> radius and shouldn't be used.  Moreover, as each injected poison is being
>>>> reported independently, any alignment under 512 bytes appears to work:
>>>> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
>>>> or 512-byte.
>>>>
>>>> To get around this issue, 512 bytes is chosen as the alignment because
>>>> a. it happens to be the badblock granularity,
>>>> b. ndctl inject-error cannot inject more than one poison to a 512-byte 
>>>> block,
>>>> c. architecture agnostic
>>>
>>> I am failing to see the kernel bug? Yes, you injected less than 8
>>> "badblocks" of poison and the hardware reported 8 blocks of poison, but
>>> that's not the kernel's fault, that's the hardware. What happens when
>>> hardware really does detect 8 blocks of consective poison and this
>>> implementation decides to only record 1 at a time?
>>
>> In that case, there will be 8 reports of the poisons by APEI GHES,
> 
> Why would there be 8 reports for just one poison consumption event?

I meant to say there would be 8 calls to the nfit_handle_mce() callback,
one call for each poison with accurate address.

Also, short ARS would find 2 poisons.

I attached the console output, my annotation is prefixed with "<==".

It is from this information that I concluded no poison will be missed
in terms of reporting.

> 
>> ARS scan will also report 8 poisons, each will get to be added to the
>> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
> 
> Right, that's what I'm saying about the proposed change, trim the
> reported poison by what is return from a "short" ARS. Recall that
> short-ARS just reads from a staging buffer that the BIOS knows about, it
> need not go all the way to hardware.

Okay, that confirms my understanding of your proposal. More below.

> 
>> In the above 2 poison example, the poison in 0x0040a0602400 and in
>> 0x0040a0602600 were separately reported.

Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-13 Thread Jane Chu
On 7/12/2022 5:48 PM, Dan Williams wrote:
> Jane Chu wrote:
>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
>> granularity") changed nfit_handle_mce() callback to report badrange for
>> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
>> populated with Intel DCPMEM v2 dimms, it appears that
>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out to be 4KiB, or 8 512-byte blocks.
>> Consequently, after injecting 2 back-to-back poisons via ndctl, it reports
>> 8 poisons.
>>
>> [29076.590281] {3}[Hardware Error]:   physical_address: 0x0040a0602400
>> [..]
>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: 
>> Recovered
>> [29076.627519] mce: [Hardware Error]: Machine check events logged
>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x408000, 0x1f8000)
>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 
>> 0x1000)
>> [..]
>> [29078.634817] {4}[Hardware Error]:   physical_address: 0x0040a0602600
>> [..]
>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x408000, 0x1f8000)
>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 
>> 0x1000)
>> [..]
>> {
>>"dev":"namespace0.0",
>>"mode":"fsdax",
>>"map":"dev",
>>"size":33820770304,
>>"uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
>>"sector_size":512,
>>"align":2097152,
>>"blockdev":"pmem0",
>>"badblock_count":8,
>>"badblocks":[
>>  {
>>"offset":8208,
>>"length":8,
>>"dimms":[
>>  "nmem0"
>>]
>>  }
>>]
>> }
>>
>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
>> radius and shouldn't be used.  Moreover, as each injected poison is being
>> reported independently, any alignment under 512 bytes appears to work:
>> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
>> or 512-byte.
>>
>> To get around this issue, 512 bytes is chosen as the alignment because
>>a. it happens to be the badblock granularity,
>>b. ndctl inject-error cannot inject more than one poison to a 512-byte 
>> block,
>>c. architecture agnostic
> 
> I am failing to see the kernel bug? Yes, you injected less than 8
> "badblocks" of poison and the hardware reported 8 blocks of poison, but
> that's not the kernel's fault, that's the hardware. What happens when
> hardware really does detect 8 blocks of consective poison and this
> implementation decides to only record 1 at a time?

In that case, there will be 8 reports of the poisons by APEI GHES, and
ARS scan will also report 8 poisons, each will get to be added to the
bad range via nvdimm_bus_add_badrange(), so none of them will be missed.

In the above 2 poison example, the poison in 0x0040a0602400 and in 
0x0040a0602600 were separately reported.

> 
> It seems the fix you want is for the hardware to report the precise
> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
> that precision in this case.

That field describes a 4K range even for a single poison; it confuses
people unnecessarily.

> 
> However, the ARS engine likely can return the precise error ranges so I
> think the fix is to just use the address range indicated by 1UL <<
> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
> scrub request to ask the device for the precise error list.

You mean for nfit_handle_mce() callback to issue a short ARS per each 
poison report over a 4K range in order to decide the precise range as a 
workaround of the hardware issue?  If there are 8 poisons detected,
there will be 8 short ARS runs, are we sure we want to do that?  Also,
for now, is it possible to log more than 1 poison per 512-byte block?

thanks!
-jane
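
The arithmetic behind the 8-badblock report, as a quick user-space check
(the LSB value 12 is the observed field value, hardcoded here for
illustration only):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long lsb = 12;		/* MCI_MISC_ADDR_LSB(mce->misc) */
		unsigned long long radius = 1ULL << lsb;

		/* 0x1000 bytes is 8 512-byte badblocks */
		printf("radius = 0x%llx bytes = %llu badblocks\n",
		       radius, radius / 512ULL);
		return 0;
	}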



[PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-11 Thread Jane Chu
Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
granularity") changed nfit_handle_mce() callback to report badrange for
each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
instead of the hardcoded L1_CACHE_BYTES. However recently on a server
populated with Intel DCPMEM v2 dimms, it appears that
1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out to be 4KiB, or 8 512-byte blocks.
Consequently, after injecting 2 back-to-back poisons via ndctl, it reports
8 poisons.

[29076.590281] {3}[Hardware Error]:   physical_address: 0x0040a0602400
[..]
[29076.619447] Memory failure: 0x40a0602: recovery action for dax page: 
Recovered
[29076.627519] mce: [Hardware Error]: Machine check events logged
[29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x408000, 0x1f8000)
[29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 
0x1000)
[..]
[29078.634817] {4}[Hardware Error]:   physical_address: 0x0040a0602600
[..]
[29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x408000, 0x1f8000)
[29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 
0x1000)
[..]
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":33820770304,
  "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0",
  "badblock_count":8,
  "badblocks":[
{
  "offset":8208,
  "length":8,
  "dimms":[
"nmem0"
  ]
}
  ]
}

So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
radius and shouldn't be used.  Moreover, as each injected poison is being
reported independently, any alignment under 512 bytes appears to work:
L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
or 512-byte.

To get around this issue, 512 bytes is chosen as the alignment because
  a. it happens to be the badblock granularity,
  b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
  c. architecture agnostic

Fixes: 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison 
granularity")
Signed-off-by: Jane Chu 
---
 drivers/acpi/nfit/mce.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index d48a388b796e..eeacc8eb807f 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -32,7 +32,6 @@ static int nfit_handle_mce(struct notifier_block *nb, 
unsigned long val,
 */
mutex_lock(&acpi_desc_lock);
list_for_each_entry(acpi_desc, &acpi_descs, list) {
-   unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
struct device *dev = acpi_desc->dev;
int found_match = 0;
 
@@ -64,7 +63,8 @@ static int nfit_handle_mce(struct notifier_block *nb, 
unsigned long val,
 
/* If this fails due to an -ENOMEM, there is little we can do */
nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
-   ALIGN_DOWN(mce->addr, align), align);
+   ALIGN(mce->addr, SECTOR_SIZE),
+   SECTOR_SIZE);
nvdimm_region_notify(nfit_spa->nd_region,
NVDIMM_REVALIDATE_POISON);
 

base-commit: e35e5b6f695d241ffb1d223207da58a1fbcdff4b
-- 
2.18.4




Re: [PATCH v2] pmem: fix a name collision

2022-06-30 Thread Jane Chu
On 6/30/2022 11:29 AM, Christoph Hellwig wrote:
> Looks good:
> 
> Reviewed-by: Christoph Hellwig 

Thank you!
-jane


[PATCH v2] pmem: fix a name collision

2022-06-30 Thread Jane Chu
The kernel test robot detected a name collision when compiling on the 'um'
architecture.  Rename "to_phys()" to "pmem_to_phys()".

>> drivers/nvdimm/pmem.c:48:20: error: conflicting types for 'to_phys'; have 
>> 'phys_addr_t(struct pmem_device *, phys_addr_t)' {aka 'long long unsigned 
>> int(struct pmem_device *, long long unsigned int)'}
  48 | static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t 
offset)
 |^~~
   In file included from arch/um/include/asm/page.h:98,
from arch/um/include/asm/thread_info.h:15,
from include/linux/thread_info.h:60,
from include/asm-generic/preempt.h:5,
from ./arch/um/include/generated/asm/preempt.h:1,

   arch/um/include/shared/mem.h:12:29: note: previous definition of 'to_phys' 
with type 'long unsigned int(void *)'
  12 | static inline unsigned long to_phys(void *virt)
 | ^~~

vim +48 drivers/nvdimm/pmem.c
47
  > 48  static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
49  {
50  return pmem->phys_addr + offset;
51  }
52

Fixes: 9409c9b6709e ("pmem: refactor pmem_clear_poison()")
Reported-by: kernel test robot 
Signed-off-by: Jane Chu 
---
 drivers/nvdimm/pmem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 629d10fcf53b..b9f1a8e9f88c 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -45,7 +45,7 @@ static struct nd_region *to_region(struct pmem_device *pmem)
return to_nd_region(to_dev(pmem)->parent);
 }
 
-static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
+static phys_addr_t pmem_to_phys(struct pmem_device *pmem, phys_addr_t offset)
 {
return pmem->phys_addr + offset;
 }
@@ -63,7 +63,7 @@ static phys_addr_t to_offset(struct pmem_device *pmem, 
sector_t sector)
 static void pmem_mkpage_present(struct pmem_device *pmem, phys_addr_t offset,
unsigned int len)
 {
-   phys_addr_t phys = to_phys(pmem, offset);
+   phys_addr_t phys = pmem_to_phys(pmem, offset);
unsigned long pfn_start, pfn_end, pfn;
 
/* only pmem in the linear map supports HWPoison */
@@ -97,7 +97,7 @@ static void pmem_clear_bb(struct pmem_device *pmem, sector_t 
sector, long blks)
 static long __pmem_clear_poison(struct pmem_device *pmem,
phys_addr_t offset, unsigned int len)
 {
-   phys_addr_t phys = to_phys(pmem, offset);
+   phys_addr_t phys = pmem_to_phys(pmem, offset);
long cleared = nvdimm_clear_poison(to_dev(pmem), phys, len);
 
if (cleared > 0) {
-- 
2.18.4




Re: [PATCH] pmem: fix a name collision

2022-06-30 Thread Jane Chu
On 6/30/2022 11:04 AM, Christoph Hellwig wrote:
> On Thu, Jun 30, 2022 at 11:51:55AM -0600, Jane Chu wrote:
>> -static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
>> +static phys_addr_t _to_phys(struct pmem_device *pmem, phys_addr_t offset)
> 
> I'd rather call this pmem_to_phys as that is a much nicer name.

Okay, thanks!

-jane


[PATCH] pmem: fix a name collision

2022-06-30 Thread Jane Chu
The kernel test robot detected a name collision when compiling on the 'um'
architecture.  Rename "to_phys()" to "_to_phys()".

>> drivers/nvdimm/pmem.c:48:20: error: conflicting types for 'to_phys'; have 
>> 'phys_addr_t(struct pmem_device *, phys_addr_t)' {aka 'long long unsigned 
>> int(struct pmem_device *, long long unsigned int)'}
  48 | static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t 
offset)
 |^~~
   In file included from arch/um/include/asm/page.h:98,
from arch/um/include/asm/thread_info.h:15,
from include/linux/thread_info.h:60,
from include/asm-generic/preempt.h:5,
from ./arch/um/include/generated/asm/preempt.h:1,

   arch/um/include/shared/mem.h:12:29: note: previous definition of 'to_phys' 
with type 'long unsigned int(void *)'
  12 | static inline unsigned long to_phys(void *virt)
 | ^~~

vim +48 drivers/nvdimm/pmem.c
47
  > 48  static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
49  {
50  return pmem->phys_addr + offset;
51  }
52

Fixes: 9409c9b6709e ("pmem: refactor pmem_clear_poison()")
Reported-by: kernel test robot 
Signed-off-by: Jane Chu 
---
 drivers/nvdimm/pmem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 629d10fcf53b..26c81ed7c0fe 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -45,7 +45,7 @@ static struct nd_region *to_region(struct pmem_device *pmem)
return to_nd_region(to_dev(pmem)->parent);
 }
 
-static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
+static phys_addr_t _to_phys(struct pmem_device *pmem, phys_addr_t offset)
 {
return pmem->phys_addr + offset;
 }
@@ -63,7 +63,7 @@ static phys_addr_t to_offset(struct pmem_device *pmem, 
sector_t sector)
 static void pmem_mkpage_present(struct pmem_device *pmem, phys_addr_t offset,
unsigned int len)
 {
-   phys_addr_t phys = to_phys(pmem, offset);
+   phys_addr_t phys = _to_phys(pmem, offset);
unsigned long pfn_start, pfn_end, pfn;
 
/* only pmem in the linear map supports HWPoison */
@@ -97,7 +97,7 @@ static void pmem_clear_bb(struct pmem_device *pmem, sector_t 
sector, long blks)
 static long __pmem_clear_poison(struct pmem_device *pmem,
phys_addr_t offset, unsigned int len)
 {
-   phys_addr_t phys = to_phys(pmem, offset);
+   phys_addr_t phys = _to_phys(pmem, offset);
long cleared = nvdimm_clear_poison(to_dev(pmem), phys, len);
 
if (cleared > 0) {
-- 
2.18.4




Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-04-05 Thread Jane Chu
On 3/30/2022 9:18 AM, Darrick J. Wong wrote:
> On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
>> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
>>> As the code I pasted before, pmem driver will subtract its ->data_offset,
>>> which is byte-based. And the filesystem who implements ->notify_failure()
>>> will calculate the offset in unit of byte again.
>>>
>>> So, leave its function signature byte-based, to avoid repeated conversions.
>>
>> I'm actually fine either way, so I'll wait for Dan to comment.
> 
> FWIW I'd convinced myself that the reason for using byte units is to
> make it possible to reduce the pmem failure blast radius to subpage
> units... but then I've also been distracted for months. :/
> 

Yes, thanks Darrick!  I recall that.
Maybe just add a comment about why byte unit is used?

thanks!
-jane

> --D



Re: Phantom PMEM poison issue

2022-01-21 Thread Jane Chu
On 1/21/2022 5:51 PM, Tsaur, Erwin wrote:
> Hi Jane,
> 
> Is a phantom error a poison that was injected and then cleared, but somehow
> shows up again?
> How is "daxfs takes action and clears the poison" done, by mailbox commands or by writes?
> Also how are you doing ARS?

The phantom shows up as soon as this console message shows up
[Hardware Error]: Hardware error from APEI Generic Hardware Error 
Source: 1
from 'ghes'.

The poisons were cleared via pmem_clear_poison().

ARS was run as
   "ndctl start-scrub; ndctl wait-scrub -p 30"

thanks,
-jane


> 
> Erwin
> 
> -Original Message-
> From: Luck, Tony 
> Sent: Friday, January 21, 2022 5:27 PM
> To: chu, jane 
> Cc: Williams, Dan J ; b...@alien8.de >> Borislav 
> Petkov ; djw...@kernel.org; wi...@infradead.org; 
> nvd...@lists.linux.dev; linux-kernel@vger.kernel.org
> Subject: Re: Phantom PMEM poison issue
> 
> On Sat, Jan 22, 2022 at 12:40:18AM +, Jane Chu wrote:
>> On 1/21/2022 4:31 PM, Jane Chu wrote:
>>> On baremetal Intel platform with DCPMEM installed and configured to
>>> provision daxfs, say a poison was consumed by a load from a user
>>> thread, and then daxfs takes action and clears the poison, confirmed
>>> by "ndctl -NM".
>>>
>>> Now, depending on luck, after some time (from a few seconds to 5+
>>> hours) the ghost of the previous poison will surface, and it takes an
>>> unload/reload of the libnvdimm drivers in order to drive the phantom
>>> poison away, confirmed by ARS.
>>>
>>> Turns out, the issue is quite reproducible with the latest stable Linux.
>>>
>>> Here is the relevant console message after injected 8 poisons in one
>>> page via
>>>  # ndctl inject-error namespace0.0 -n 2 -B 8210
>>
>> There is a cut-n-paste error, the above line should be
>> "# ndctl inject-error namespace0.0 -n 8 -B 8210"
> 
> You say "in one page" here. What is the page size?
>>
>> -jane
>>
>>> then, cleared them all, and wait for 5+ hours, notice the time stamp.
>>> BTW, the system is idle otherwise.
>>>
>>> [ 2439.742296] mce: Uncorrected hardware memory error in user-access
>>> at
>>> 1850602400
>>> [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
>>> fsdax_poison_v1:8457 due to hardware memory corruption [
>>> 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
>>> Recovered
>>> [ 2439.769949] mce: [Hardware Error]: Machine check events logged
>>> -1850603000 uncached-minus<->write-back [ 2439.769984] x86/PAT:
>>> memtype_reserve failed [mem 0x1850602000-0x1850602fff], track
>>> uncached-minus, req uncached-minus [ 2439.769985] Could not
>>> invalidate pfn=0x1850602 from 1:1 map [ 2440.856351] x86/PAT:
>>> fsdax_poison_v1:8457 freeing invalid memtype [mem
>>> 0x1850602000-0x1850602fff]
> 
> This error is reported in PFN=1850602 (at offset 0x400 = 1K)
> 
>>>
>>> At this point,
>>> # ndctl list -NMu -r 0
>>> {
>>>  "dev":"namespace0.0",
>>>  "mode":"fsdax",
>>>  "map":"dev",
>>>  "size":"15.75 GiB (16.91 GB)",
>>>  "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
>>>  "sector_size":4096,
>>>  "align":2097152,
>>>  "blockdev":"pmem0"
>>> }
>>>
>>> [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
>>> Hardware Error Source: 1 [21352.001528] {2}[Hardware Error]: event
>>> severity: recoverable [21352.007838] {2}[Hardware Error]:  Error 0,
>>> type: recoverable
>>> [21352.014156] {2}[Hardware Error]:   section_type: memory error
>>> [21352.020572] {2}[Hardware Error]:   physical_address: 0x001850603200
> 
> This error is in the following page: PFN=1850603 (at offset 0x200 = 512b)
> 
> Is that what you mean by "phantom error" ... from a different address from 
> those that were injected?
> 
> -Tony
> 



Re: Phantom PMEM poison issue

2022-01-21 Thread Jane Chu
On 1/21/2022 5:27 PM, Luck, Tony wrote:
> On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
>> On 1/21/2022 4:31 PM, Jane Chu wrote:
>>> On baremetal Intel platform with DCPMEM installed and configured to
>>> provision daxfs, say a poison was consumed by a load from a user thread,
>>> and then daxfs takes action and clears the poison, confirmed by "ndctl
>>> -NM".
>>>
>>> Now, depending on luck, after some time (from a few seconds to 5+ hours)
>>> the ghost of the previous poison will surface, and it takes
>>> unloading/reloading the libnvdimm drivers in order to drive the phantom poison
>>> away, confirmed by ARS.
>>>
>>> Turns out, the issue is quite reproducible with the latest stable Linux.
>>>
>>> Here is the relevant console message after injecting 8 poisons in one
>>> page via
>>>  # ndctl inject-error namespace0.0 -n 2 -B 8210
>>
>> There is a cut-n-paste error, the above line should be
>> "# ndctl inject-error namespace0.0 -n 8 -B 8210"
> 
> You say "in one page" here. What is the page size?

The page size is 4K, the size of a base page on x86.
I said "one page" because 8 poisons * 256B = 2KiB is only half a page.

>>
>> -jane
>>
>>> then cleared them all and waited for 5+ hours; note the time stamps.
>>> BTW, the system is idle otherwise.
>>>
>>> [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
>>> 1850602400
>>> [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
>>> fsdax_poison_v1:8457 due to hardware memory corruption
>>> [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
>>> Recovered
>>> [ 2439.769949] mce: [Hardware Error]: Machine check events logged
>>> -1850603000 uncached-minus<->write-back
>>> [ 2439.769984] x86/PAT: memtype_reserve failed [mem
>>> 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
>>> [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
>>> [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
>>> [mem 0x1850602000-0x1850602fff]
> 
> This error is reported in PFN=1850602 (at offset 0x400 = 1K)

yes.
> 
>>>
>>> At this point,
>>> # ndctl list -NMu -r 0
>>> {
>>>  "dev":"namespace0.0",
>>>  "mode":"fsdax",
>>>  "map":"dev",
>>>  "size":"15.75 GiB (16.91 GB)",
>>>  "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
>>>  "sector_size":4096,
>>>  "align":2097152,
>>>  "blockdev":"pmem0"
>>> }
>>>
>>> [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
>>> Hardware Error Source: 1
>>> [21352.001528] {2}[Hardware Error]: event severity: recoverable
>>> [21352.007838] {2}[Hardware Error]:  Error 0, type: recoverable
>>> [21352.014156] {2}[Hardware Error]:   section_type: memory error
>>> [21352.020572] {2}[Hardware Error]:   physical_address: 0x001850603200
> 
> This error is in the following page: PFN=1850603 (at offset 0x200 = 512B)
> 

I see, this is the next page... and the issue is reproducible even with
a single poison injection.

> Is that what you mean by "phantom error" ... from a different
> address from those that were injected?

All 8 poisons were cleared by the driver via DSM, and that was verified
by "ndctl list -NMu -r 0", which covers every page in the 16GiB /dev/pmem.

It's a phantom because unloading/reloading libnvdimm, followed by a full
ARS scan, confirms the poison isn't there; hence "phantom".
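
For the record, a sketch of the verification steps (the exact module
set to unload may vary with the config):

   # ndctl disable-region all
   # modprobe -r dax_pmem nd_pmem nd_btt libnvdimm
   # modprobe nd_pmem
   # ndctl start-scrub; ndctl wait-scrub -p 30
   # ndctl list -NMu -r 0        <- shows no "badblocks" entry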

thanks,
-jane

> 
> -Tony



Re: Phantom PMEM poison issue

2022-01-21 Thread Jane Chu
On 1/21/2022 4:31 PM, Jane Chu wrote:
> On baremetal Intel platform with DCPMEM installed and configured to
> provision daxfs, say a poison was consumed by a load from a user thread,
> and then daxfs takes action and clears the poison, confirmed by "ndctl
> -NM".
> 
> Now, depending on luck, after some time (from a few seconds to 5+ hours)
> the ghost of the previous poison will surface, and it takes
> unloading/reloading the libnvdimm drivers in order to drive the phantom poison
> away, confirmed by ARS.
> 
> Turns out, the issue is quite reproducible with the latest stable Linux.
> 
> Here is the relevant console message after injecting 8 poisons in one
> page via
> # ndctl inject-error namespace0.0 -n 2 -B 8210

There is a cut-n-paste error, the above line should be
   "# ndctl inject-error namespace0.0 -n 8 -B 8210"

-jane

> then cleared them all and waited for 5+ hours; note the time stamps.
> BTW, the system is idle otherwise.
> 
> [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
> 1850602400
> [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
> fsdax_poison_v1:8457 due to hardware memory corruption
> [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
> Recovered
> [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> -1850603000 uncached-minus<->write-back
> [ 2439.769984] x86/PAT: memtype_reserve failed [mem
> 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
> [mem 0x1850602000-0x1850602fff]
> 
> At this point,
> # ndctl list -NMu -r 0
> {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"dev",
> "size":"15.75 GiB (16.91 GB)",
> "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
> "sector_size":4096,
> "align":2097152,
> "blockdev":"pmem0"
> }
> 
> [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [21352.001528] {2}[Hardware Error]: event severity: recoverable
> [21352.007838] {2}[Hardware Error]:  Error 0, type: recoverable
> [21352.014156] {2}[Hardware Error]:   section_type: memory error
> [21352.020572] {2}[Hardware Error]:   physical_address: 0x001850603200
> [21352.027958] {2}[Hardware Error]:   physical_address_mask:
> 0xff00
> [21352.035827] {2}[Hardware Error]:   node: 0 module: 1
> [21352.041466] {2}[Hardware Error]:   DIMM location: /SYS/MB/P0 D6
> [21352.048277] Memory failure: 0x1850603: recovery action for dax page:
> Recovered
> [21352.056346] mce: [Hardware Error]: Machine check events logged
> [21352.056890] EDAC skx MC0: HANDLING MCE MEMORY ERROR
> [21352.056892] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255:
> 0xbc9f
> [21352.056894] EDAC skx MC0: TSC 0x0
> [21352.056895] EDAC skx MC0: ADDR 0x1850603200
> [21352.056897] EDAC skx MC0: MISC 0x8c
> [21352.056898] EDAC skx MC0: PROCESSOR 0:0x50656 TIME 1642758243 SOCKET
> 0 APIC 0x0
> [21352.056909] EDAC MC0: 1 UE memory read error on
> CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x1850603
> offset:0x200 grain:32 -  err_code:0x:0x009f [..]
> 
> And now,
> 
> # ndctl list -NMu -r 0
> {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"dev",
> "size":"15.75 GiB (16.91 GB)",
> "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
> "sector_size":4096,
> "align":2097152,
> "blockdev":"pmem0",
> "badblock_count":1,
> "badblocks":[
>   {
> "offset":8217,
> "length":1,
> "dimms":[
>   "nmem0"
> ]
>   }
> ]
> }
> 
> According to my limited research, ghes_proc_in_irq() is fired to
> report a delayed UE; it calls memory_failure() to take the page out,
> which causes the driver to record a badblock, and that's how the
> phantom poison appeared.
> 
> Note: only 1 phantom poison appeared for the 8 injected poisons, so
> the phantom record isn't even an accurate image of the injections.
> 
> But that aside, it seems that the GHES mechanism and the synchronous MCE
> handling are totally at odds with each other, and that cannot be correct.
> 
> What is the right thing to do to fix the issue? Should the memory_failure
> handler second-guess the GHES report?  Should the synchronous MCE
> handling mechanism tell the firmware that a given memory UE has been
> cleared, and hence clear the record in firmware?  Other ideas?
> 
> 
> Thanks!
> -jane



Phantom PMEM poison issue

2022-01-21 Thread Jane Chu
On baremetal Intel platform with DCPMEM installed and configured to 
provision daxfs, say a poison was consumed by a load from a user thread, 
and then daxfs takes action and clears the poison, confirmed by "ndctl 
-NM".

Now, depending on luck, after some time (from a few seconds to 5+ hours)
the ghost of the previous poison will surface, and it takes
unloading/reloading the libnvdimm drivers in order to drive the phantom
poison away, confirmed by ARS.

Turns out, the issue is quite reproducible with the latest stable Linux.

Here is the relevant console message after injecting 8 poisons in one
page via
   # ndctl inject-error namespace0.0 -n 2 -B 8210
then clearing them all and waiting for 5+ hours; note the time stamps.
BTW, the system is idle otherwise.

[ 2439.742296] mce: Uncorrected hardware memory error in user-access at 
1850602400
[ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to 
fsdax_poison_v1:8457 due to hardware memory corruption
[ 2439.761866] Memory failure: 0x1850602: recovery action for dax page: 
Recovered
[ 2439.769949] mce: [Hardware Error]: Machine check events logged
-1850603000 uncached-minus<->write-back
[ 2439.769984] x86/PAT: memtype_reserve failed [mem 
0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
[ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
[ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype 
[mem 0x1850602000-0x1850602fff]

At this point,
# ndctl list -NMu -r 0
{
   "dev":"namespace0.0",
   "mode":"fsdax",
   "map":"dev",
   "size":"15.75 GiB (16.91 GB)",
   "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
   "sector_size":4096,
   "align":2097152,
   "blockdev":"pmem0"
}

[21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic 
Hardware Error Source: 1
[21352.001528] {2}[Hardware Error]: event severity: recoverable
[21352.007838] {2}[Hardware Error]:  Error 0, type: recoverable
[21352.014156] {2}[Hardware Error]:   section_type: memory error
[21352.020572] {2}[Hardware Error]:   physical_address: 0x001850603200
[21352.027958] {2}[Hardware Error]:   physical_address_mask: 
0xff00
[21352.035827] {2}[Hardware Error]:   node: 0 module: 1
[21352.041466] {2}[Hardware Error]:   DIMM location: /SYS/MB/P0 D6
[21352.048277] Memory failure: 0x1850603: recovery action for dax page: 
Recovered
[21352.056346] mce: [Hardware Error]: Machine check events logged
[21352.056890] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[21352.056892] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 
0xbc9f
[21352.056894] EDAC skx MC0: TSC 0x0
[21352.056895] EDAC skx MC0: ADDR 0x1850603200
[21352.056897] EDAC skx MC0: MISC 0x8c
[21352.056898] EDAC skx MC0: PROCESSOR 0:0x50656 TIME 1642758243 SOCKET 
0 APIC 0x0
[21352.056909] EDAC MC0: 1 UE memory read error on 
CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x1850603 
offset:0x200 grain:32 -  err_code:0x:0x009f [..]

And now,

# ndctl list -NMu -r 0
{
   "dev":"namespace0.0",
   "mode":"fsdax",
   "map":"dev",
   "size":"15.75 GiB (16.91 GB)",
   "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
   "sector_size":4096,
   "align":2097152,
   "blockdev":"pmem0",
   "badblock_count":1,
   "badblocks":[
 {
   "offset":8217,
   "length":1,
   "dimms":[
 "nmem0"
   ]
 }
   ]
}

According to my limited research, ghes_proc_in_irq() is fired to
report a delayed UE; it calls memory_failure() to take the page out,
which causes the driver to record a badblock, and that's how the
phantom poison appeared.
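
For reference, the delayed-report path as I read the current sources
(hedged; details may differ by kernel version):

ghes_proc_in_irq()
  ghes_do_proc()
    ghes_handle_memory_failure()
      ghes_do_memory_failure()
        memory_failure_queue(pfn, flags)
          ... deferred work item ...
            memory_failure(pfn, flags)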

Note: only 1 phantom poison appeared for the 8 injected poisons, so
the phantom record isn't even an accurate image of the injections.

But that aside, it seems that the GHES mechanism and the synchronous MCE
handling are totally at odds with each other, and that cannot be correct.

What is the right thing to do to fix the issue? Should the memory_failure
handler second-guess the GHES report?  Should the synchronous MCE
handling mechanism tell the firmware that a given memory UE has been
cleared, and hence clear the record in firmware?  Other ideas?


Thanks!
-jane


Re: [PATCH 3/3] libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation

2021-11-04 Thread Jane Chu
On 11/4/2021 10:55 AM, Christoph Hellwig wrote:
> On Tue, Sep 14, 2021 at 05:31:32PM -0600, Jane Chu wrote:
>> +static int pmem_dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff,
>> +		size_t nr_pages)
>> +{
>> +	unsigned int len = PFN_PHYS(nr_pages);
>> +	sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
>> +	struct pmem_device *pmem = dax_get_private(dax_dev);
>> +	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
>> +	blk_status_t ret;
>> +
>> +	if (!is_bad_pmem(&pmem->bb, sector, len))
>> +		return 0;
>> +
>> +	ret = pmem_clear_poison(pmem, pmem_off, len);
>> +	return (ret == BLK_STS_OK) ? 0 : -EIO;
> 
> No need for the braces here (and I'd prefer a good old if anyway).
> 
> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig 
> 

Thanks a lot!  I'll keep in mind your comments.

-jane


Re: [PATCH 0/3] dax: clear poison on the fly along pwrite

2021-09-23 Thread Jane Chu

On 9/15/2021 9:15 AM, Darrick J. Wong wrote:

On Wed, Sep 15, 2021 at 12:22:05AM -0700, Jane Chu wrote:

Hi, Dan,

On 9/14/2021 9:44 PM, Dan Williams wrote:

On Tue, Sep 14, 2021 at 4:32 PM Jane Chu  wrote:


If pwrite(2) encounters poison in a pmem range, it fails with EIO.
This is unnecessary if the hardware is capable of clearing the poison.

Though not all dax backend hardware has the capability of clearing
poison on the fly, dax backed by Intel DCPMEM has such capability,
and it's desirable to, first, speed up repairing by means of it;
second, maintain backend continuity instead of fragmenting it in
search of clean blocks.

Jane Chu (3):
dax: introduce dax_operation dax_clear_poison


The problem with new dax operations is that they need to be plumbed
not only through fsdax and pmem, but also through device-mapper.

In this case I think we're already covered by dax_zero_page_range().
That will ultimately trigger pmem_clear_poison() and it is routed
through device-mapper properly.

Can you clarify why the existing dax_zero_page_range() is not sufficient?


fallocate ZERO_RANGE is in itself a functionality that, applied to dax,
should lead to zeroing out the media range.  So one may argue it is part
of the block operations, and not something explicitly aimed at clearing
poison.


Yeah, Christoph suggested that we make the clearing operation explicit
in a related thread a few weeks ago:
https://lore.kernel.org/linux-fsdevel/yrtnlperhfmz2...@infradead.org/

I like Jane's patchset far better than the one that I sent, because it
doesn't require a block device wrapper for the pmem, and it enables us
to tell application writers that they can handle media errors by
pwrite()ing the bad region, just like they do for nvme and spinners.
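
A minimal sketch of that application-side model (hedged; fd, off and
the 4K size are hypothetical):

#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Recover a poisoned 4K page by rewriting it with known-good data. */
static int recover_page(int fd, off_t off, const char *replica)
{
	char buf[4096];

	if (pread(fd, buf, sizeof(buf), off) >= 0)
		return 0;			/* no poison here */
	if (errno != EIO)
		return -1;
	memcpy(buf, replica, sizeof(buf));	/* restore from a backup */
	/* the pwrite clears the poison; reads succeed afterwards */
	return pwrite(fd, buf, sizeof(buf), off) == sizeof(buf) ? 0 : -1;
}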


I'm also thinking about the MOVDIR64B instruction and how it
might be used to clear poison on the fly with a single 'store'.
Of course, that means we need to figure out how to narrow down the
error blast radius first.


That was one of the advantages of Shiyang Ruan's NAKed patchset to
enable byte-granularity media errors to pass upwards through the stack
back to the filesystem, which could then tell applications exactly what
they lost.

I want to get back to that, though if Dan won't withdraw the NAK then I
don't know how to move forward...


With respect to plumbing through device-mapper, I thought about that,
and wasn't sure. I mean, the clear-poison work will eventually fall to
the pmem driver; going thru the DM layers, how does that play out?


Each of the dm drivers has to add their own ->clear_poison operation
that remaps the incoming (sector, len) parameters as appropriate for
that device and then calls the lower device's ->clear_poison with the
translated parameters.

This (AFAICT) has already been done for dax_zero_page_range, so I sense
that Dan is trying to save you a bunch of code plumbing work by nudging
you towards doing s/dax_clear_poison/dax_zero_page_range/ to this series
and then you only need patches 2-3.


Thanks, Darrick, for the explanation!
I don't mind adding DM layer support; it sounds straightforward.
I also like your latest patch and am wondering whether the clear_poison
API is still of value.

thanks,
-jane




BTW, our customer doesn't care about creating dax volumes thru DM, so.


They might not care, but anything going upstream should work in the
general case.

--D


thanks!
-jane





dax: introduce dax_clear_poison to dax pwrite operation
libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation

   drivers/dax/super.c   | 13 +
   drivers/nvdimm/pmem.c | 17 +
   fs/dax.c  |  9 +
   include/linux/dax.h   |  6 ++
   4 files changed, 45 insertions(+)

--
2.18.4





Re: [PATCH 0/3] dax: clear poison on the fly along pwrite

2021-09-23 Thread Jane Chu

On 9/15/2021 1:27 PM, Dan Williams wrote:

I'm also thinking about the MOVDIR64B instruction and how it
might be used to clear poison on the fly with a single 'store'.
Of course, that means we need to figure out how to narrow down the
error blast radius first.

It turns out the MOVDIR64B error-clearing idea runs into a problem with
the device poison tracking. Without explicit notification that
software wanted the error cleared, the device may ghost-report errors
that are not there anymore. I think we should continue explicit error
clearing and notification of the device that the error has been
cleared (by asking the device to clear it).



Sorry for the late response, I was out for several days.

Your concern is understood.  I wasn't thinking of an out-of-band
MOVDIR64B to clear poison; I was thinking about adding a case to
pmem_clear_poison(), such that if CPUID shows MOVDIR64B is
supported, the driver could issue MOVDIR64B instead of calling the
BIOS interface to clear poison. The advantages are: (a) a lot faster;
(b) a smaller blast radius.  And the driver still gets a chance
to update its ->bb record.
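
A hedged sketch of the idea, not actual driver code: it assumes the
kernel's movdir64b() helper from asm/special_insns.h (whose exact
signature varies by version) and a 64-byte clear granularity; fencing,
->bb bookkeeping and notifying the device are deliberately left out.

#include <linux/types.h>
#include <asm/cpufeature.h>
#include <asm/special_insns.h>

/* Sketch: overwrite a poisoned DAX-mapped range with 64B direct stores. */
static bool try_clear_poison_by_store(void *kaddr, size_t len)
{
	static const u8 zeros[64] __aligned(64);
	size_t off;

	if (!boot_cpu_has(X86_FEATURE_MOVDIR64B))
		return false;	/* caller falls back to the BIOS/DSM path */

	for (off = 0; off < len; off += 64)
		movdir64b(kaddr + off, zeros);
	return true;
}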

thanks,
-jane




Re: [PATCH 0/3] dax: clear poison on the fly along pwrite

2021-09-15 Thread Jane Chu

Hi, Dan,

On 9/14/2021 9:44 PM, Dan Williams wrote:

On Tue, Sep 14, 2021 at 4:32 PM Jane Chu  wrote:


If pwrite(2) encounters poison in a pmem range, it fails with EIO.
This is unnecessary if the hardware is capable of clearing the poison.

Though not all dax backend hardware has the capability of clearing
poison on the fly, dax backed by Intel DCPMEM has such capability,
and it's desirable to, first, speed up repairing by means of it;
second, maintain backend continuity instead of fragmenting it in
search of clean blocks.

Jane Chu (3):
   dax: introduce dax_operation dax_clear_poison


The problem with new dax operations is that they need to be plumbed
not only through fsdax and pmem, but also through device-mapper.

In this case I think we're already covered by dax_zero_page_range().
That will ultimately trigger pmem_clear_poison() and it is routed
through device-mapper properly.

Can you clarify why the existing dax_zero_page_range() is not sufficient?


fallocate ZERO_RANGE is in itself a functionality that, applied to dax,
should lead to zeroing out the media range.  So one may argue it is part
of the block operations, and not something explicitly aimed at clearing
poison. I'm also thinking about the MOVDIR64B instruction and how it
might be used to clear poison on the fly with a single 'store'.
Of course, that means we need to figure out how to narrow down the
error blast radius first.

With respect to plumbing through device-mapper, I thought about that,
and wasn't sure. I mean, the clear-poison work will eventually fall to
the pmem driver; going thru the DM layers, how does that play out?
BTW, our customer doesn't care about creating dax volumes thru DM, so.

thanks!
-jane





   dax: introduce dax_clear_poison to dax pwrite operation
   libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation

  drivers/dax/super.c   | 13 +
  drivers/nvdimm/pmem.c | 17 +
  fs/dax.c  |  9 +
  include/linux/dax.h   |  6 ++
  4 files changed, 45 insertions(+)

--
2.18.4





[PATCH 3/3] libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation

2021-09-14 Thread Jane Chu
Provide pmem_dax_clear_poison() to implement the dax_operations
.clear_poison method.

Signed-off-by: Jane Chu 
---
 drivers/nvdimm/pmem.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 1e0615b8565e..307a53aa3432 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -294,6 +294,22 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 					PAGE_SIZE));
 }
 
+static int pmem_dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff,
+		size_t nr_pages)
+{
+	unsigned int len = PFN_PHYS(nr_pages);
+	sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
+	struct pmem_device *pmem = dax_get_private(dax_dev);
+	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+	blk_status_t ret;
+
+	if (!is_bad_pmem(&pmem->bb, sector, len))
+		return 0;
+
+	ret = pmem_clear_poison(pmem, pmem_off, len);
+	return (ret == BLK_STS_OK) ? 0 : -EIO;
+}
+
 static long pmem_dax_direct_access(struct dax_device *dax_dev,
pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
 {
@@ -326,6 +342,7 @@ static const struct dax_operations pmem_dax_ops = {
.copy_from_iter = pmem_copy_from_iter,
.copy_to_iter = pmem_copy_to_iter,
.zero_page_range = pmem_dax_zero_page_range,
+   .clear_poison = pmem_dax_clear_poison,
 };
 
 static const struct attribute_group *pmem_attribute_groups[] = {
-- 
2.18.4




[PATCH 0/3] dax: clear poison on the fly along pwrite

2021-09-14 Thread Jane Chu
If pwrite(2) encounters poison in a pmem range, it fails with EIO.
This is unnecessary if the hardware is capable of clearing the poison.

Though not all dax backend hardware has the capability of clearing
poison on the fly, dax backed by Intel DCPMEM has such capability,
and it's desirable to, first, speed up repairing by means of it;
second, maintain backend continuity instead of fragmenting it in
search of clean blocks.

Jane Chu (3):
  dax: introduce dax_operation dax_clear_poison
  dax: introduce dax_clear_poison to dax pwrite operation
  libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation

 drivers/dax/super.c   | 13 +
 drivers/nvdimm/pmem.c | 17 +
 fs/dax.c  |  9 +
 include/linux/dax.h   |  6 ++
 4 files changed, 45 insertions(+)

-- 
2.18.4




[PATCH 2/3] dax: introduce dax_clear_poison to dax pwrite operation

2021-09-14 Thread Jane Chu
When pwrite(2) encounters poison in a dax range, it fails with EIO.
But if the backend hardware of the dax device is capable of clearing
poison, try that and resume the write.

Signed-off-by: Jane Chu 
---
 fs/dax.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 99b4e78d888f..592a156abbf2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1156,8 +1156,17 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		if (ret)
 			break;
 
+		/*
+		 * If WRITE operation encounters media error in a page aligned
+		 * range, try to clear the error, then resume, for just once.
+		 */
 		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
 				&kaddr, NULL);
+		if ((map_len == -EIO) && (iov_iter_rw(iter) == WRITE)) {
+			if (dax_clear_poison(dax_dev, pgoff, PHYS_PFN(size)) == 0)
+				map_len = dax_direct_access(dax_dev, pgoff,
+						PHYS_PFN(size), &kaddr, NULL);
+		}
 		if (map_len < 0) {
 			ret = map_len;
 			break;
-- 
2.18.4




[PATCH 1/3] dax: introduce dax_operation dax_clear_poison

2021-09-14 Thread Jane Chu
Though not all dax backend hardware has the capability of clearing
poison on the fly, dax backed by Intel DCPMEM has such capability,
and it's desirable to, first, speed up repairing by means of it;
second, maintain backend continuity instead of fragmenting it in
search of clean blocks.

Signed-off-by: Jane Chu 
---
 drivers/dax/super.c | 13 +
 include/linux/dax.h |  6 ++
 2 files changed, 19 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 44736cbd446e..935d496fa7db 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -373,6 +373,19 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
 
+int dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff,
+		size_t nr_pages)
+{
+	if (!dax_alive(dax_dev))
+		return -ENXIO;
+
+	if (!dax_dev->ops->clear_poison)
+		return -EOPNOTSUPP;
+
+	return dax_dev->ops->clear_poison(dax_dev, pgoff, nr_pages);
+}
+EXPORT_SYMBOL_GPL(dax_clear_poison);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..c54c1087ece1 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,6 +36,11 @@ struct dax_operations {
 			struct iov_iter *);
 	/* zero_page_range: required operation. Zero page range   */
 	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
+	/*
+	 * clear_poison: clear media error in the given page aligned range via
+	 * vendor appropriate method. Optional operation.
+	 */
+	int (*clear_poison)(struct dax_device *, pgoff_t, size_t);
 };
 
 extern struct attribute_group dax_attribute_group;
@@ -226,6 +231,7 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
 int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 		size_t nr_pages);
+int dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff, size_t nr_pages);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
-- 
2.18.4




[PATCH 2/3] dax: introduce dax clear poison to page aligned dax pwrite operation

2021-09-14 Thread Jane Chu
Currently, when pwrite(2) is issued to a dax range that contains poison,
the pwrite(2) fails with EIO. Well, if the hardware backend of the
dax device is capable of clearing poison, try that and resume the write.

Signed-off-by: Jane Chu 
---
 fs/dax.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 99b4e78d888f..592a156abbf2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1156,8 +1156,17 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		if (ret)
 			break;
 
+		/*
+		 * If WRITE operation encounters media error in a page aligned
+		 * range, try to clear the error, then resume, for just once.
+		 */
 		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
 				&kaddr, NULL);
+		if ((map_len == -EIO) && (iov_iter_rw(iter) == WRITE)) {
+			if (dax_clear_poison(dax_dev, pgoff, PHYS_PFN(size)) == 0)
+				map_len = dax_direct_access(dax_dev, pgoff,
+						PHYS_PFN(size), &kaddr, NULL);
+		}
 		if (map_len < 0) {
 			ret = map_len;
 			break;
-- 
2.18.4




[PATCH] mm/memory-failure: unnecessary amount of unmapping

2021-04-19 Thread Jane Chu
It appears that unmap_mapping_range() takes a 'size' as its
third argument rather than an end location; the current calling fashion
causes an unnecessary amount of unmapping to occur.
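
For reference, the prototype in include/linux/mm.h:

void unmap_mapping_range(struct address_space *mapping,
			 loff_t const holebegin, loff_t const holelen,
			 int even_cows);

so passing "start + size" as holelen unmaps [start, start + start + size)
rather than the intended [start, start + size).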

Fixes: 6100e34b2526e ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Signed-off-by: Jane Chu 
---
 mm/memory-failure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f7ed9559d494..85ad98c00fd9 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1368,7 +1368,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		 * communicated in siginfo, see kill_proc()
 		 */
 		start = (page->index << PAGE_SHIFT) & ~(size - 1);
-		unmap_mapping_range(page->mapping, start, start + size, 0);
+		unmap_mapping_range(page->mapping, start, size, 0);
 	}
 	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
 	rc = 0;
-- 
2.18.4



Re: [RFC PATCH v3 0/9] fsdax: introduce fs query to support reflink

2021-01-08 Thread Jane Chu

Hi, Shiyang,

On 12/18/2020 1:13 AM, Ruan Shiyang wrote:


So I tried the patchset with pmem error injection, the SIGBUS payload
does not look right -

** SIGBUS(7): **
** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **

I expect the payload looks like

** si_addr(0x7f3672e0), si_lsb(0x15), si_code(0x4, BUS_MCEERR_AR) **


Thanks for testing.  I tested the SIGBUS by writing a program which calls
madvise(..., MADV_HWPOISON) to inject a memory failure.  It just shows
that the program is killed by SIGBUS.  I cannot get any detail from it.
So, could you please show me the right way (test tools) to test it?


I'm assuming that Jane is using a program that calls sigaction to
install a SIGBUS handler, and dumps the entire siginfo_t structure
whenever it receives one...


Yes, thanks Darrick.
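
A minimal sketch of such a handler (hedged; this is not the actual
read_poison source):

#define _GNU_SOURCE	/* si_addr_lsb may need this on older glibc */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void on_sigbus(int sig, siginfo_t *si, void *uc)
{
	/* printf is not async-signal-safe; fine for a throwaway test */
	printf("** SIGBUS(%d): si_addr(%p), si_lsb(0x%x), si_code(0x%x) **\n",
	       sig, si->si_addr, si->si_addr_lsb, si->si_code);
	exit(1);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = on_sigbus,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* ... mmap the dax file and load from the injected offset ... */
	return 0;
}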



OK.  Let me try it and figure out what's wrong in it.


I injected poison via "ndctl inject-error", not expecting that to make
any difference though.


Any luck?

thanks,
-jane



Re: [RFC PATCH v3 0/9] fsdax: introduce fs query to support reflink

2020-12-16 Thread Jane Chu

Hi, Shiyang,

On 12/15/2020 4:14 AM, Shiyang Ruan wrote:

The call trace is like this:
memory_failure()
 pgmap->ops->memory_failure()       => pmem_pgmap_memory_failure()
  gendisk->fops->corrupted_range()  => - pmem_corrupted_range()
                                       - md_blk_corrupted_range()
   sb->s_ops->currupted_range()     => xfs_fs_corrupted_range()
    xfs_rmap_query_range()
     xfs_currupt_helper()
      * corrupted on metadata
          try to recover data, call xfs_force_shutdown()
      * corrupted on file data
          try to recover data, call mf_dax_mapping_kill_procs()

The fsdax & reflink support for XFS is not contained in this patchset.

(Rebased on v5.10)


So I tried the patchset with pmem error injection, the SIGBUS payload
does not look right -

** SIGBUS(7): **
** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **

I expect the payload looks like

** si_addr(0x7f3672e0), si_lsb(0x15), si_code(0x4, BUS_MCEERR_AR) **

thanks,
-jane






Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()

2020-12-15 Thread Jane Chu

On 12/15/2020 4:14 AM, Shiyang Ruan wrote:

  #ifdef CONFIG_SYSFS
+int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
+  size_t len, void *data);
  int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk);
  void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk);
  #else
+int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,


Did you mean
  static inline int bd_disk_holder_corrupted_range(..
?

thanks,
-jane


+  size_t len, void *data)
+{
+   return 0;
+}
  static inline int bd_link_disk_holder(struct block_device *bdev,
  struct gendisk *disk)


Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink

2020-12-15 Thread Jane Chu

On 12/15/2020 3:58 AM, Ruan Shiyang wrote:

Hi Jane

On 2020/12/15 4:58 AM, Jane Chu wrote:

Hi, Shiyang,

On 11/22/2020 4:41 PM, Shiyang Ruan wrote:

This patchset is an attempt to resolve the problem of tracking shared pages
for fsdax.

Change from v1:
   - Introduce ->block_lost() for block device
   - Support mapped device
   - Add 'not available' warning for realtime device in XFS
   - Rebased to v5.10-rc1

This patchset moves owner tracking from dax_associate_entry() to the pmem
device, by introducing an interface ->memory_failure() of struct
pagemap.  The interface is called by memory_failure() in mm, and
implemented by the pmem device.  Then the pmem device calls its ->block_lost()
to find the filesystem in which the damaged page is located, and calls
->storage_lost() to track files or metadata associated with this page.
Finally we are able to try to fix the damaged data in the filesystem and do


Does that mean clearing poison? If so, would you mind elaborating on
specifically which change does that?


Recovering data for filesystem (or pmem device) has not been done in 
this patchset...  I just triggered the handler for the files sharing the 
corrupted page here.


Thanks! That confirms my understanding.

With the framework provided by the patchset, how do you envision it
easing/simplifying poison recovery from the user's perspective?

And how does it help in dealing with page faults upon a poisoned
dax page?

thanks!
-jane



Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink

2020-12-14 Thread Jane Chu

Hi, Shiyang,

On 11/22/2020 4:41 PM, Shiyang Ruan wrote:

This patchset is an attempt to resolve the problem of tracking shared pages
for fsdax.

Change from v1:
   - Introduce ->block_lost() for block device
   - Support mapped device
   - Add 'not available' warning for realtime device in XFS
   - Rebased to v5.10-rc1

This patchset moves owner tracking from dax_associate_entry() to the pmem
device, by introducing an interface ->memory_failure() of struct
pagemap.  The interface is called by memory_failure() in mm, and
implemented by the pmem device.  Then the pmem device calls its ->block_lost()
to find the filesystem in which the damaged page is located, and calls
->storage_lost() to track files or metadata associated with this page.
Finally we are able to try to fix the damaged data in the filesystem and do


Does that mean clearing poison? If so, would you mind elaborating on
specifically which change does that?


Thanks!
-jane


other necessary processing, such as killing the processes that are using
the affected files.


[PATCH v2 2/3] libnvdimm/security: the 'security' attr never show 'overwrite' state

2020-08-03 Thread Jane Chu
The 'security' attribute displays the security state of an nvdimm.
During normal operation, the nvdimm state may be one of 'disabled',
'unlocked' or 'locked'.  When an admin issues
  # ndctl sanitize-dimm nmem0 --overwrite
the attribute is expected to change to 'overwrite' until the overwrite
operation completes.

But tests on our systems show that 'overwrite' is never shown during
the overwrite operation, i.e.
  # cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
  unlocked
the attribute remains 'unlocked' throughout the operation; consequently
the "ndctl wait-overwrite nmem0" command doesn't wait at all.

The driver tracks the state in 'nvdimm->sec.flags': when the operation
starts, it adds an overwrite bit to the flags; and when the operation
completes, it removes the bit. Hence security_show() should check the
'overwrite' bit first, in order to indicate the actual state when multiple
bits are set in the flags.
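
In other words (illustration only, not from the patch):

	/*
	 * While overwrite is in flight the bitmap holds two states at
	 * once, e.g.
	 *     sec.flags == BIT(NVDIMM_SECURITY_UNLOCKED) |
	 *                  BIT(NVDIMM_SECURITY_OVERWRITE)
	 * so the first matching test_bit() wins, and 'overwrite' must
	 * be tested ahead of the steady states.
	 */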

Signed-off-by: Jane Chu 
Reviewed-by: Dave Jiang 
---
 drivers/nvdimm/dimm_devs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index b7b77e8..5d72026 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -363,14 +363,14 @@ __weak ssize_t security_show(struct device *dev,
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
+	if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
+		return sprintf(buf, "overwrite\n");
 	if (test_bit(NVDIMM_SECURITY_DISABLED, &nvdimm->sec.flags))
 		return sprintf(buf, "disabled\n");
 	if (test_bit(NVDIMM_SECURITY_UNLOCKED, &nvdimm->sec.flags))
 		return sprintf(buf, "unlocked\n");
 	if (test_bit(NVDIMM_SECURITY_LOCKED, &nvdimm->sec.flags))
 		return sprintf(buf, "locked\n");
-	if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
-		return sprintf(buf, "overwrite\n");
return -ENOTTY;
 }
 
-- 
1.8.3.1



[PATCH v2 3/3] libnvdimm/security: ensure sysfs poll thread woke up and fetch updated attr

2020-08-03 Thread Jane Chu
commit 7d988097c546 ("acpi/nfit, libnvdimm/security: Add security DSM overwrite support")
adds a sysfs_notify_dirent() to wake up the userspace poll thread when the
"overwrite" operation has completed. But the notification is issued before the
internal dimm security state and flags have been updated, so the userspace poll
thread wakes up, fetches the not-yet-updated attr and falls back to sleep,
forever.  But if a user from another terminal issues "ndctl wait-overwrite
nmemX" again, the command returns instantly.
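
A hedged sketch of the userspace waiter (conceptually what "ndctl
wait-overwrite" does; the sysfs path is abbreviated):

#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

static void wait_overwrite(const char *attr /* .../nmem0/security */)
{
	char buf[64] = "";
	int fd = open(attr, O_RDONLY);
	struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };

	for (;;) {
		pread(fd, buf, sizeof(buf) - 1, 0); /* (re)arm the notification */
		if (strncmp(buf, "overwrite", 9))
			break;		/* state has moved on */
		poll(&pfd, 1, -1);	/* sleeps until sysfs_notify_dirent() */
	}
	close(fd);
}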

Cc: Dave Jiang 
Cc: Dan Williams 
Fixes: 7d988097c546 ("acpi/nfit, libnvdimm/security: Add security DSM overwrite support")
Signed-off-by: Jane Chu 
Reviewed-by: Dave Jiang 
---
 drivers/nvdimm/security.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 8f3971c..4b80150 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -450,14 +450,19 @@ void __nvdimm_security_overwrite_query(struct nvdimm *nvdimm)
 	else
 		dev_dbg(&nvdimm->dev, "overwrite completed\n");
 
-	if (nvdimm->sec.overwrite_state)
-		sysfs_notify_dirent(nvdimm->sec.overwrite_state);
+	/*
+	 * Mark the overwrite work done and update dimm security flags,
+	 * then send a sysfs event notification to wake up userspace
+	 * poll threads to pick up the changed state.
+	 */
 	nvdimm->sec.overwrite_tmo = 0;
 	clear_bit(NDD_SECURITY_OVERWRITE, &nvdimm->flags);
 	clear_bit(NDD_WORK_PENDING, &nvdimm->flags);
-	put_device(&nvdimm->dev);
 	nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_USER);
 	nvdimm->sec.ext_flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
+	if (nvdimm->sec.overwrite_state)
+		sysfs_notify_dirent(nvdimm->sec.overwrite_state);
+	put_device(&nvdimm->dev);
 }
 
 void nvdimm_security_overwrite_query(struct work_struct *work)
-- 
1.8.3.1



[PATCH v2 1/3] libnvdimm/security: fix a typo

2020-08-03 Thread Jane Chu
commit d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute")
introduced a typo, causing a 'nvdimm->sec.flags' update to be overwritten
by the subsequent update meant for 'nvdimm->sec.ext_flags'.

Cc: Dan Williams 
Fixes: d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute")
Signed-off-by: Jane Chu 
Reviewed-by: Dave Jiang 
---
 drivers/nvdimm/security.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 4cef69b..8f3971c 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -457,7 +457,7 @@ void __nvdimm_security_overwrite_query(struct nvdimm *nvdimm)
 	clear_bit(NDD_WORK_PENDING, &nvdimm->flags);
 	put_device(&nvdimm->dev);
 	nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_USER);
-	nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
+	nvdimm->sec.ext_flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
 }
 
 void nvdimm_security_overwrite_query(struct work_struct *work)
-- 
1.8.3.1



Re: [PATCH 1/2] libnvdimm/security: 'security' attr never show 'overwrite' state

2020-08-03 Thread Jane Chu

Hi, Dave,

On 8/3/2020 1:41 PM, Dave Jiang wrote:

On 7/24/2020 9:09 AM, Jane Chu wrote:

Since
commit d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute"),
when issuing
  # ndctl sanitize-dimm nmem0 --overwrite
and then immediately checking the 'security' attribute,
  # cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
  unlocked
the attribute actually stays 'unlocked' throughout the entire overwrite
operation, never changing.  That's because 'nvdimm->sec.flags' is a bitmap
that has both bits set, indicating 'overwrite' and 'unlocked'.
But security_show() checks the mutually exclusive bits before it checks
the 'overwrite' bit at last. The order should be reversed.

The commit also has a typo: in one occasion, 'nvdimm->sec.ext_state'
assignment is replaced with 'nvdimm->sec.flags' assignment for
the NVDIMM_MASTER type.


Maybe best to split this fix into a different patch? Just thinking of git
bisect later on to track issues. Otherwise Reviewed-by: Dave Jiang 



Sure. I take it you meant to separate the typo fix from the change that 
tests the OVERWRITE bit first?


Regards,
-jane


[PATCH 2/2] libnvdimm/security: ensure sysfs poll thread woke up and fetch updated attr

2020-07-24 Thread Jane Chu
commit 7d988097c546 ("acpi/nfit, libnvdimm/security: Add security DSM overwrite support")
adds a sysfs_notify_dirent() to wake up the userspace poll thread when the
"overwrite" operation has completed. But the notification is issued before the
internal dimm security state and flags have been updated, so the userspace poll
thread wakes up, fetches the not-yet-updated attr and falls back to sleep,
forever.  But if a user from another terminal issues "ndctl wait-overwrite
nmemX" again, the command returns instantly.

Cc: Dave Jiang 
Cc: Dan Williams 
Fixes: 7d988097c546 ("acpi/nfit, libnvdimm/security: Add security DSM overwrite support")
Signed-off-by: Jane Chu 
---
 drivers/nvdimm/security.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 8f3971c..4b80150 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -450,14 +450,19 @@ void __nvdimm_security_overwrite_query(struct nvdimm *nvdimm)
 	else
 		dev_dbg(&nvdimm->dev, "overwrite completed\n");
 
-	if (nvdimm->sec.overwrite_state)
-		sysfs_notify_dirent(nvdimm->sec.overwrite_state);
+	/*
+	 * Mark the overwrite work done and update dimm security flags,
+	 * then send a sysfs event notification to wake up userspace
+	 * poll threads to pick up the changed state.
+	 */
 	nvdimm->sec.overwrite_tmo = 0;
 	clear_bit(NDD_SECURITY_OVERWRITE, &nvdimm->flags);
 	clear_bit(NDD_WORK_PENDING, &nvdimm->flags);
-	put_device(&nvdimm->dev);
 	nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_USER);
 	nvdimm->sec.ext_flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
+	if (nvdimm->sec.overwrite_state)
+		sysfs_notify_dirent(nvdimm->sec.overwrite_state);
+	put_device(&nvdimm->dev);
 }
 
 void nvdimm_security_overwrite_query(struct work_struct *work)
-- 
1.8.3.1



[PATCH 1/2] libnvdimm/security: 'security' attr never show 'overwrite' state

2020-07-24 Thread Jane Chu
Since
commit d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute"),
when issuing
 # ndctl sanitize-dimm nmem0 --overwrite
and then immediately checking the 'security' attribute,
 # cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
 unlocked
the attribute actually stays 'unlocked' throughout the entire overwrite
operation, never changing.  That's because 'nvdimm->sec.flags' is a bitmap
that has both bits set, indicating 'overwrite' and 'unlocked'.
But security_show() checks the mutually exclusive bits before it checks
the 'overwrite' bit at last. The order should be reversed.

The commit also has a typo: on one occasion, the 'nvdimm->sec.ext_state'
assignment is replaced with a 'nvdimm->sec.flags' assignment for
the NVDIMM_MASTER type.

Cc: Dan Williams 
Fixes: d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute")
Signed-off-by: Jane Chu 
---
 drivers/nvdimm/dimm_devs.c | 4 ++--
 drivers/nvdimm/security.c  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index b7b77e8..5d72026 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -363,14 +363,14 @@ __weak ssize_t security_show(struct device *dev,
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
+	if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
+		return sprintf(buf, "overwrite\n");
 	if (test_bit(NVDIMM_SECURITY_DISABLED, &nvdimm->sec.flags))
 		return sprintf(buf, "disabled\n");
 	if (test_bit(NVDIMM_SECURITY_UNLOCKED, &nvdimm->sec.flags))
 		return sprintf(buf, "unlocked\n");
 	if (test_bit(NVDIMM_SECURITY_LOCKED, &nvdimm->sec.flags))
 		return sprintf(buf, "locked\n");
-	if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
-		return sprintf(buf, "overwrite\n");
return -ENOTTY;
 }
 
diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 4cef69b..8f3971c 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -457,7 +457,7 @@ void __nvdimm_security_overwrite_query(struct nvdimm *nvdimm)
 	clear_bit(NDD_WORK_PENDING, &nvdimm->flags);
 	put_device(&nvdimm->dev);
 	nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_USER);
-	nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
+	nvdimm->sec.ext_flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
 }
 
 void nvdimm_security_overwrite_query(struct work_struct *work)
-- 
1.8.3.1



[PATCH 2/2] libnvdimm/security: ensure sysfs poll thread woke up and fetch updated attr

2020-07-23 Thread Jane Chu
commit 7d988097c546 ("acpi/nfit, libnvdimm/security: Add security DSM overwrite support")
adds a sysfs_notify_dirent() to wake up the userspace poll thread when the
"overwrite" operation has completed. But the notification is issued before the
internal dimm security state and flags have been updated, so the userspace poll
thread wakes up, fetches the not-yet-updated attr and falls back to sleep,
forever.  But if a user from another terminal issues "ndctl wait-overwrite
nmemX" again, the command returns instantly.

Cc: Dave Jiang 
Cc: Dan Williams 
Fixes: 7d988097c546 ("acpi/nfit, libnvdimm/security: Add security DSM overwrite support")
Signed-off-by: Jane Chu 
---
 drivers/nvdimm/security.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 8f3971c..4b80150 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -450,14 +450,19 @@ void __nvdimm_security_overwrite_query(struct nvdimm *nvdimm)
 	else
 		dev_dbg(&nvdimm->dev, "overwrite completed\n");
 
-	if (nvdimm->sec.overwrite_state)
-		sysfs_notify_dirent(nvdimm->sec.overwrite_state);
+	/*
+	 * Mark the overwrite work done and update dimm security flags,
+	 * then send a sysfs event notification to wake up userspace
+	 * poll threads to pick up the changed state.
+	 */
 	nvdimm->sec.overwrite_tmo = 0;
 	clear_bit(NDD_SECURITY_OVERWRITE, &nvdimm->flags);
 	clear_bit(NDD_WORK_PENDING, &nvdimm->flags);
-	put_device(&nvdimm->dev);
 	nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_USER);
 	nvdimm->sec.ext_flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
+	if (nvdimm->sec.overwrite_state)
+		sysfs_notify_dirent(nvdimm->sec.overwrite_state);
+	put_device(&nvdimm->dev);
 }
 
 void nvdimm_security_overwrite_query(struct work_struct *work)
-- 
1.8.3.1



[PATCH 1/2] libnvdimm/security: 'security' attr never show 'overwrite' state

2020-07-23 Thread Jane Chu
Since
commit d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute"),
when issuing
 # ndctl sanitize-dimm nmem0 --overwrite
and then immediately checking the 'security' attribute,
 # cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
 unlocked
the attribute actually stays 'unlocked' throughout the entire overwrite
operation, never changing.  That's because 'nvdimm->sec.flags' is a bitmap
that has both bits set, indicating 'overwrite' and 'unlocked'.
But security_show() checks the mutually exclusive bits before it checks
the 'overwrite' bit at last. The order should be reversed.

The commit also has a typo: on one occasion, the 'nvdimm->sec.ext_state'
assignment is replaced with a 'nvdimm->sec.flags' assignment for
the NVDIMM_MASTER type.

Cc: Dan Williams 
Fixes: d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute")
Signed-off-by: Jane Chu 
---
 drivers/nvdimm/dimm_devs.c | 4 ++--
 drivers/nvdimm/security.c  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index b7b77e8..5d72026 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -363,14 +363,14 @@ __weak ssize_t security_show(struct device *dev,
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
+	if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
+		return sprintf(buf, "overwrite\n");
 	if (test_bit(NVDIMM_SECURITY_DISABLED, &nvdimm->sec.flags))
 		return sprintf(buf, "disabled\n");
 	if (test_bit(NVDIMM_SECURITY_UNLOCKED, &nvdimm->sec.flags))
 		return sprintf(buf, "unlocked\n");
 	if (test_bit(NVDIMM_SECURITY_LOCKED, &nvdimm->sec.flags))
 		return sprintf(buf, "locked\n");
-	if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
-		return sprintf(buf, "overwrite\n");
return -ENOTTY;
 }
 
diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 4cef69b..8f3971c 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -457,7 +457,7 @@ void __nvdimm_security_overwrite_query(struct nvdimm *nvdimm)
 	clear_bit(NDD_WORK_PENDING, &nvdimm->flags);
 	put_device(&nvdimm->dev);
 	nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_USER);
-	nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
+	nvdimm->sec.ext_flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
 }
 
 void nvdimm_security_overwrite_query(struct work_struct *work)
-- 
1.8.3.1



Re: [PATCH v4 0/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS issue

2019-10-08 Thread Jane Chu

Hi, Naoya,

What is the status of the patches?
Is there anything I need to do from my end?

Regards,
-jane

On 8/6/2019 10:25 AM, Jane Chu wrote:

Change in v4:
  - remove trailing white space

Changes in v3:
  - move **tk cleanup to its own patch

Changes in v2:
  - move 'tk' allocations internal to add_to_kill(), suggested by Dan;
  - ran checkpatch.pl check, pointed out by Matthew;
  - Naoya pointed out that v1 would have missed the SIGKILL
if "tk->addr == -EFAULT", since the code returns early.
Incorporated Naoya's suggestion; also, skip VMAs where
"tk->size_shift == 0" for a zone device page, and deliver SIGBUS
when "tk->size_shift != 0" so the payload is helpful;
  - added Suggested-by: Naoya Horiguchi 

Jane Chu (2):
   mm/memory-failure.c clean up around tk pre-allocation
   mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if
 mmaped more than once

  mm/memory-failure.c | 62 ++---
  1 file changed, 26 insertions(+), 36 deletions(-)



Re: kernel panic in 5.3-rc5, nfsd_reply_cache_stats_show+0x11

2019-08-21 Thread jane . chu

Hi, Bruce, Dan,

This patch took care of the panic issue.

thanks,
-jane

On 8/21/19 7:12 AM, J. Bruce Fields wrote:

Probably just needs the following.

I've been slow to get some bugfixes upstream, sorry--I'll go send a pull
request now

--b.

commit 78e70e780b28
Author: He Zhe 
Date:   Tue Aug 6 17:41:04 2019 +0800

 nfsd4: Fix kernel crash when reading proc file reply_cache_stats
 
 reply_cache_stats uses the wrong parameter as the seq_file private
 structure and thus causes the following kernel crash when users read
 /proc/fs/nfsd/reply_cache_stats
 
 BUG: kernel NULL pointer dereference, address: 01f9

 PGD 0 P4D 0
 Oops:  [#3] SMP PTI
 CPU: 6 PID: 1502 Comm: cat Tainted: G  D   5.3.0-rc3+ #1
 Hardware name: Intel Corporation Broadwell Client platform/Basking Ridge, 
BIOS BDW-E2R1.86C.0118.R01.1503110618 03/11/2015
 RIP: 0010:nfsd_reply_cache_stats_show+0x3b/0x2d0
 Code: 41 54 49 89 f4 48 89 fe 48 c7 c7 b3 10 33 88 53 bb e8 03 00 00 e8 88 82 d1 
ff bf 58 89 41 00 e8 eb c5 85 00 48 83 eb 01 75 f0 <41> 8b 94 24 f8 01 00 00 48 
c7 c6 be 10 33 88 4c 89 ef bb e8 03 00
 RSP: 0018:aa520106fe08 EFLAGS: 00010246
 RAX: 00cfe1a77123 RBX:  RCX: 00291b46
 RDX: 00cf RSI: 0006 RDI: 00291b28
 RBP: aa520106fe20 R08: 0006 R09: 00cfe17e55dd
 R10: a424e47c R11: 030b R12: 0001
 R13: a424e5697000 R14: 0001 R15: a424e5697000
 FS:  7f805735f580() GS:a424f8f8() 
knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 01f9 CR3: 655ce005 CR4: 003606e0
 Call Trace:
  seq_read+0x194/0x3e0
  __vfs_read+0x1b/0x40
  vfs_read+0x95/0x140
  ksys_read+0x61/0xe0
  __x64_sys_read+0x1a/0x20
  do_syscall_64+0x4d/0x120
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f805728b861
 Code: fe ff ff 50 48 8d 3d 86 b4 09 00 e8 79 e0 01 00 66 0f 1f 84 00 00 00 00 00 
48 8d 05 d9 19 0d 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 c3 
66 0f 1f 44 00 00 48 83 ec 28 48 89 54
 RSP: 002b:7ffea1ce3c38 EFLAGS: 0246 ORIG_RAX: 
 RAX: ffda RBX: 0002 RCX: 7f805728b861
 RDX: 0002 RSI: 7f8057183000 RDI: 0003
 RBP: 7f8057183000 R08: 7f8057182010 R09: 
 R10: 0022 R11: 0246 R12: 559a60e8ff10
 R13: 0003 R14: 0002 R15: 0002
 Modules linked in:
 CR2: 01f9
 ---[ end trace 01613595153f0cba ]---
 RIP: 0010:nfsd_reply_cache_stats_show+0x3b/0x2d0
 Code: 41 54 49 89 f4 48 89 fe 48 c7 c7 b3 10 33 88 53 bb e8 03 00 00 e8 88 82 d1 
ff bf 58 89 41 00 e8 eb c5 85 00 48 83 eb 01 75 f0 <41> 8b 94 24 f8 01 00 00 48 
c7 c6 be 10 33 88 4c 89 ef bb e8 03 00
 RSP: 0018:aa52004b3e08 EFLAGS: 00010246
 RAX: 002bab45a7c6 RBX:  RCX: 00291b4c
 RDX: 002b RSI: 0004 RDI: 00291b28
 RBP: aa52004b3e20 R08: 0004 R09: 002bab1c8c7a
 R10: a424e550 R11: 02a9 R12: 0001
 R13: a424e4475000 R14: 0001 R15: a424e4475000
 FS:  7f805735f580() GS:a424f8f8() 
knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 01f9 CR3: 655ce005 CR4: 003606e0
 Killed
 
 Fixes: 3ba75830ce17 ("nfsd4: drc containerization")

 Signed-off-by: He Zhe 
 Signed-off-by: J. Bruce Fields 

diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 26ad75ae2be0..96352ab7bd81 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -571,7 +571,7 @@ nfsd_cache_append(struct svc_rqst *rqstp, struct kvec *data)
   */
  static int nfsd_reply_cache_stats_show(struct seq_file *m, void *v)
  {
-   struct nfsd_net *nn = v;
+   struct nfsd_net *nn = m->private;
  
  	seq_printf(m, "max entries:   %u\n", nn->max_drc_entries);

seq_printf(m, "num entries:   %u\n",



Re: kernel panic in 5.3-rc5, nfsd_reply_cache_stats_show+0x11

2019-08-21 Thread jane . chu

Hi, Dan,

On 8/20/19 8:48 PM, Dan Williams wrote:

On Tue, Aug 20, 2019 at 6:39 PM  wrote:


Hi,

Apologies if there is a better channel for reporting the issue; if so,
please let me know.

I just saw the below regression in the 5.3-rc5 kernel, but not in
5.2-rc7 or earlier kernels.


Is the error stable enough to bisect?



The error is stable. I haven't tried bisecting; I thought to report the
issue first since it's late in the 5.3 process.
Then I saw Bruce's email; I'll report soon whether the patch takes care
of the issue.


thanks!
-jane


kernel panic in 5.3-rc5, nfsd_reply_cache_stats_show+0x11

2019-08-20 Thread jane . chu

Hi,

Apologies if there is a better channel for reporting the issue; if so,
please let me know.

I just saw the below regression in the 5.3-rc5 kernel, but not in
5.2-rc7 or earlier kernels.


[ 3533.659787] mce: Uncorrected hardware memory error in user-access at 
383e202000
[ 3533.659903] Memory failure: 0x383e202: Sending SIGBUS to 
read_poison:14493 due to hardware memory corruption
[ 3533.679041] Memory failure: 0x383e202: recovery action for dax page: 
Recovered
[ 3564.624934] BUG: kernel NULL pointer dereference, address: 
01f9

[ 3564.632707] #PF: supervisor read access in kernel mode
[ 3564.638440] #PF: error_code(0x) - not-present page
[ 3564.644174] PGD acd7b47067 P4D acd7b47067 PUD acd7aba067 PMD 0
[ 3564.650784] Oops:  [#1] SMP NOPTI
[ 3564.654869] CPU: 58 PID: 15026 Comm: sosreport Tainted: G   M 
 5.3.0-rc5.master.20190820.ol7.x86_64 #1
[ 3564.666420] Hardware name: Oracle Corporation ORACLE SERVER 
X8-2L/ASM,MTHRBD,2U, BIOS 52020101 05/07/2019

[ 3564.677112] RIP: 0010:nfsd_reply_cache_stats_show+0x11/0x110 [nfsd]
[ 3564.684106] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 63 
47 ec 48 89 e5 5d c3 90 0f 1f 44 00 00 55 31 c0 48 89 e5 41 54 49 89 f4 
53 <8b> 96 f8 01 00 00 48 c7 c6 d9 b8 74 c0 48 89 fb e8 9a ae bc f2 41

[ 3564.705062] RSP: 0018:aa140f87fe18 EFLAGS: 00010246
[ 3564.710894] RAX:  RBX: 9f7c9b562ca8 RCX: 
5c19
[ 3564.718858] RDX: 1000 RSI: 0001 RDI: 
9f7c9b562c80
[ 3564.726822] RBP: aa140f87fe28 R08: 9f801fab01a0 R09: 
9ed347c06600
[ 3564.734785] R10: 9f801e287000 R11: 9f8012f8d638 R12: 
0001
[ 3564.742749] R13: 9f8012f8d600 R14: 9f7c9b562c80 R15: 
0001
[ 3564.750712] FS:  7f3cfaa92700() GS:9f801fa8() 
knlGS:

[ 3564.759743] CS:  0010 DS:  ES:  CR0: 80050033
[ 3564.766156] CR2: 01f9 CR3: 00add1894004 CR4: 
007606e0
[ 3564.774120] DR0:  DR1:  DR2: 

[ 3564.782084] DR3:  DR6: fffe0ff0 DR7: 
0400

[ 3564.790050] PKRU: 5554
[ 3564.793068] Call Trace:
[ 3564.795800]  seq_read+0x13b/0x390
[ 3564.799502]  __vfs_read+0x1b/0x40
[ 3564.803202]  vfs_read+0x8e/0x140
[ 3564.806794]  ksys_read+0x61/0xd0
[ 3564.810394]  __x64_sys_read+0x1a/0x20
[ 3564.814484]  do_syscall_64+0x60/0x1e0
[ 3564.818572]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3564.824199] RIP: 0033:0x7f3d163a304d


The panic is reproducible every time in 5.3-rc5; here are the steps.

On a system with Intel DC PMEM, configured in AppDirect non-interleaved mode,
# ndctl create-namespace -m devdax
{
  "dev":"namespace1.0",
  "mode":"devdax",
  "map":"dev",
  "size":"124.03 GiB (133.18 GB)",
..
"align":2097152,
"devices":[
  {
"chardev":"dax1.0",

# ndctl inject-error namespace1.0 -B 16 --count=1

# ./read_poison -x dax1.0 -o 8192 -m 1
Read poison location at (16 * 512 = 8192)

A little under 30 sec later, the kernel panics.

thanks,
-jane




[PATCH v4 2/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-08-06 Thread Jane Chu
Mmap /dev/dax more than once, then read the poison location using the address
from one of the mappings. The other mappings, due to not having the page
mapped in, will cause SIGKILLs to be delivered to the process. SIGKILL succeeds
over SIGBUS, so the user process loses the opportunity to handle the UE.

Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
isn't always an option.
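
A hedged sketch of the reproducer's core (read_poison itself isn't shown
in this thread, so names and sizes are illustrative):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;		/* one 2MiB devdax unit */
	int fd = open("/dev/dax6.0", O_RDWR);
	char *a = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	char *b = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	/*
	 * Consume the poison through mapping 'b' only.  VMA 'a' never
	 * faulted the page in, so pre-patch add_to_kill() computes
	 * size_shift == 0 for it and kill_procs() escalates to SIGKILL,
	 * masking the SIGBUS payload.
	 */
	(void)a;
	return b[5120];			/* --block=10 * 512 = offset 5120 */
}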

Details -

ndctl inject-error --block=10 --count=1 namespace6.0

./read_poison -x dax6.0 -o 5120 -m 2
mmaped address 0x7f5bb660
mmaped address 0x7f3cf360
doing local read at address 0x7f3cf3601400
Killed
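
(For illustration, here is a minimal sketch of what a reproducer along the
lines of read_poison does; the device path, mapping length, and error
handling are assumptions, and this is not the actual tool.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAPLEN (1UL << 30)	/* assumed; must fit within the namespace */

int main(void)
{
	int fd = open("/dev/dax6.0", O_RDWR);	/* device name assumed */
	char *a, *b;

	if (fd < 0)
		return 1;
	a = mmap(NULL, MAPLEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	b = mmap(NULL, MAPLEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (a == MAP_FAILED || b == MAP_FAILED)
		return 1;
	printf("mmaped address %p\nmmaped address %p\n", (void *)a, (void *)b);

	/*
	 * Only 'b' gets a page table entry here; 'a' stays unpopulated,
	 * which is the VMA that triggers the SIGKILL path this patch fixes.
	 * Offset 5120 matches "ndctl inject-error --block=10" (10 * 512).
	 */
	return ((volatile char *)b)[5120];
}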

Console messages in instrumented kernel -

mce: Uncorrected hardware memory error in user-access at edbe201400
Memory failure: tk->addr = 7f5bb6601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
dev_pagemap_mapping_shift: page edbe201: no PUD
Memory failure: tk->size_shift == 0
Memory failure: Unable to find user space address edbe201 in read_poison
Memory failure: tk->addr = 7f3cf3601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
Memory failure: tk->size_shift = 21
Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of 
failure to unmap corrupted page
  => to deliver SIGKILL
Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory 
corruption
  => to deliver SIGBUS

Signed-off-by: Jane Chu 
Suggested-by: Naoya Horiguchi 
---
 mm/memory-failure.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 51d5b20..bd4db33 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -199,7 +199,6 @@ struct to_kill {
struct task_struct *tsk;
unsigned long addr;
short size_shift;
-   char addr_valid;
 };
 
 /*
@@ -318,22 +317,27 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
}
 
tk->addr = page_address_in_vma(p, vma);
-   tk->addr_valid = 1;
if (is_zone_device_page(p))
tk->size_shift = dev_pagemap_mapping_shift(p, vma);
else
tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
 
/*
-* In theory we don't have to kill when the page was
-* munmaped. But it could be also a mremap. Since that's
-* likely very rare kill anyways just out of paranoia, but use
-* a SIGKILL because the error is not contained anymore.
+* Send SIGKILL if "tk->addr == -EFAULT". Also, as
+* "tk->size_shift" is always non-zero for !is_zone_device_page(),
+* so "tk->size_shift == 0" effectively checks no mapping on
+* ZONE_DEVICE. Indeed, when a devdax page is mmapped N times
+* to a process' address space, it's possible not all N VMAs
+* contain mappings for the page, but at least one VMA does.
+* Only deliver SIGBUS with payload derived from the VMA that
+* has a mapping for the page.
 */
-   if (tk->addr == -EFAULT || tk->size_shift == 0) {
+   if (tk->addr == -EFAULT) {
pr_info("Memory failure: Unable to find user space address %lx 
in %s\n",
page_to_pfn(p), tsk->comm);
-   tk->addr_valid = 0;
+   } else if (tk->size_shift == 0) {
+   kfree(tk);
+   return;
}
 
get_task_struct(tsk);
@@ -361,7 +365,7 @@ static void kill_procs(struct list_head *to_kill, int 
forcekill, bool fail,
 * make sure the process doesn't catch the
 * signal and then access the memory. Just kill it.
 */
-   if (fail || tk->addr_valid == 0) {
+   if (fail || tk->addr == -EFAULT) {
pr_err("Memory failure: %#lx: forcibly killing 
%s:%d because of failure to unmap corrupted page\n",
   pfn, tk->tsk->comm, tk->tsk->pid);
do_send_sig_info(SIGKILL, SEND_SIG_PRIV,
-- 
1.8.3.1



Re: [PATCH v3 1/2] mm/memory-failure.c clean up around tk pre-allocation

2019-08-06 Thread Jane Chu

Hi, Naoya,

Thanks a lot!  v4 on the way. :)

-jane

On 8/1/2019 2:06 AM, Naoya Horiguchi wrote:

On Thu, Jul 25, 2019 at 04:01:40PM -0600, Jane Chu wrote:

add_to_kill() expects the first 'tk' to be pre-allocated; it makes
subsequent allocations on an as-needed basis. This makes the code a bit
difficult to read. Move all the allocation internal to add_to_kill()
and drop the **tk argument.

Signed-off-by: Jane Chu 


Acked-by: Naoya Horiguchi 

# somehow I sent 2 acks to 2/2, sorry about the noise.



[PATCH v4 0/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS issue

2019-08-06 Thread Jane Chu
Change in v4:
 - remove trailing white space

Changes in v3:
 - move **tk cleanup to its own patch

Changes in v2:
 - move 'tk' allocations internal to add_to_kill(), suggested by Dan;
 - ran checkpatch.pl check, pointed out by Matthew;
 - Naoya pointed out that v1 would have missed the SIGKILL
   if "tk->addr == -EFAULT", since the code returns early.
   Incorporated Naoya's suggestion; also, skip VMAs where
   "tk->size_shift == 0" for zone device page, and deliver SIGBUS
   when "tk->size_shift != 0" so the payload is helpful;
 - added Suggested-by: Naoya Horiguchi 

Jane Chu (2):
  mm/memory-failure.c clean up around tk pre-allocation
  mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if
mmaped more than once

 mm/memory-failure.c | 62 ++---
 1 file changed, 26 insertions(+), 36 deletions(-)

-- 
1.8.3.1



[PATCH v4 1/2] mm/memory-failure.c clean up around tk pre-allocation

2019-08-06 Thread Jane Chu
add_to_kill() expects the first 'tk' to be pre-allocated; it makes
subsequent allocations on an as-needed basis. This makes the code a bit
difficult to read. Move all the allocation internal to add_to_kill()
and drop the **tk argument.

Signed-off-by: Jane Chu 
---
 mm/memory-failure.c | 40 +---
 1 file changed, 13 insertions(+), 27 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d9cc660..51d5b20 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -304,25 +304,19 @@ static unsigned long dev_pagemap_mapping_shift(struct 
page *page,
 /*
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
- * TBD would GFP_NOIO be enough?
  */
 static void add_to_kill(struct task_struct *tsk, struct page *p,
   struct vm_area_struct *vma,
-  struct list_head *to_kill,
-  struct to_kill **tkc)
+  struct list_head *to_kill)
 {
struct to_kill *tk;
 
-   if (*tkc) {
-   tk = *tkc;
-   *tkc = NULL;
-   } else {
-   tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
-   if (!tk) {
-   pr_err("Memory failure: Out of memory while machine 
check handling\n");
-   return;
-   }
+   tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+   if (!tk) {
+   pr_err("Memory failure: Out of memory while machine check 
handling\n");
+   return;
}
+
tk->addr = page_address_in_vma(p, vma);
tk->addr_valid = 1;
if (is_zone_device_page(p))
@@ -341,6 +335,7 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
page_to_pfn(p), tsk->comm);
tk->addr_valid = 0;
}
+
get_task_struct(tsk);
tk->tsk = tsk;
list_add_tail(&tk->nd, to_kill);
@@ -432,7 +427,7 @@ static struct task_struct *task_early_kill(struct 
task_struct *tsk,
  * Collect processes when the error hit an anonymous page.
  */
 static void collect_procs_anon(struct page *page, struct list_head *to_kill,
- struct to_kill **tkc, int force_early)
+   int force_early)
 {
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -457,7 +452,7 @@ static void collect_procs_anon(struct page *page, struct 
list_head *to_kill,
if (!page_mapped_in_vma(page, vma))
continue;
if (vma->vm_mm == t->mm)
-   add_to_kill(t, page, vma, to_kill, tkc);
+   add_to_kill(t, page, vma, to_kill);
}
}
read_unlock(&tasklist_lock);
@@ -468,7 +463,7 @@ static void collect_procs_anon(struct page *page, struct 
list_head *to_kill,
  * Collect processes when the error hit a file mapped page.
  */
 static void collect_procs_file(struct page *page, struct list_head *to_kill,
- struct to_kill **tkc, int force_early)
+   int force_early)
 {
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -492,7 +487,7 @@ static void collect_procs_file(struct page *page, struct 
list_head *to_kill,
 * to be informed of all such data corruptions.
 */
if (vma->vm_mm == t->mm)
-   add_to_kill(t, page, vma, to_kill, tkc);
+   add_to_kill(t, page, vma, to_kill);
}
}
read_unlock(&tasklist_lock);
@@ -501,26 +496,17 @@ static void collect_procs_file(struct page *page, struct 
list_head *to_kill,
 
 /*
  * Collect the processes who have the corrupted page mapped to kill.
- * This is done in two steps for locking reasons.
- * First preallocate one tokill structure outside the spin locks,
- * so that we can kill at least one process reasonably reliable.
  */
 static void collect_procs(struct page *page, struct list_head *tokill,
int force_early)
 {
-   struct to_kill *tk;
-
if (!page->mapping)
return;
 
-   tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);
-   if (!tk)
-   return;
if (PageAnon(page))
-   collect_procs_anon(page, tokill, &tk, force_early);
+   collect_procs_anon(page, tokill, force_early);
else
-   collect_procs_file(page, tokill, &tk, force_early);
-   kfree(tk);
+   collect_procs_file(page, tokill, force_early);
 }
 
 static const char *action_name[] = {
-- 
1.8.3.1



[PATCH v3 1/2] mm/memory-failure.c clean up around tk pre-allocation

2019-07-25 Thread Jane Chu
add_to_kill() expects the first 'tk' to be pre-allocated; it makes
subsequent allocations on an as-needed basis. This makes the code a bit
difficult to read. Move all the allocation internal to add_to_kill()
and drop the **tk argument.

Signed-off-by: Jane Chu 
---
 mm/memory-failure.c | 40 +---
 1 file changed, 13 insertions(+), 27 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d9cc660..51d5b20 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -304,25 +304,19 @@ static unsigned long dev_pagemap_mapping_shift(struct 
page *page,
 /*
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
- * TBD would GFP_NOIO be enough?
  */
 static void add_to_kill(struct task_struct *tsk, struct page *p,
   struct vm_area_struct *vma,
-  struct list_head *to_kill,
-  struct to_kill **tkc)
+  struct list_head *to_kill)
 {
struct to_kill *tk;
 
-   if (*tkc) {
-   tk = *tkc;
-   *tkc = NULL;
-   } else {
-   tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
-   if (!tk) {
-   pr_err("Memory failure: Out of memory while machine 
check handling\n");
-   return;
-   }
+   tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+   if (!tk) {
+   pr_err("Memory failure: Out of memory while machine check 
handling\n");
+   return;
}
+
tk->addr = page_address_in_vma(p, vma);
tk->addr_valid = 1;
if (is_zone_device_page(p))
@@ -341,6 +335,7 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
page_to_pfn(p), tsk->comm);
tk->addr_valid = 0;
}
+
get_task_struct(tsk);
tk->tsk = tsk;
list_add_tail(&tk->nd, to_kill);
@@ -432,7 +427,7 @@ static struct task_struct *task_early_kill(struct 
task_struct *tsk,
  * Collect processes when the error hit an anonymous page.
  */
 static void collect_procs_anon(struct page *page, struct list_head *to_kill,
- struct to_kill **tkc, int force_early)
+   int force_early)
 {
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -457,7 +452,7 @@ static void collect_procs_anon(struct page *page, struct 
list_head *to_kill,
if (!page_mapped_in_vma(page, vma))
continue;
if (vma->vm_mm == t->mm)
-   add_to_kill(t, page, vma, to_kill, tkc);
+   add_to_kill(t, page, vma, to_kill);
}
}
read_unlock(&tasklist_lock);
@@ -468,7 +463,7 @@ static void collect_procs_anon(struct page *page, struct 
list_head *to_kill,
  * Collect processes when the error hit a file mapped page.
  */
 static void collect_procs_file(struct page *page, struct list_head *to_kill,
- struct to_kill **tkc, int force_early)
+   int force_early)
 {
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -492,7 +487,7 @@ static void collect_procs_file(struct page *page, struct 
list_head *to_kill,
 * to be informed of all such data corruptions.
 */
if (vma->vm_mm == t->mm)
-   add_to_kill(t, page, vma, to_kill, tkc);
+   add_to_kill(t, page, vma, to_kill);
}
}
read_unlock(&tasklist_lock);
@@ -501,26 +496,17 @@ static void collect_procs_file(struct page *page, struct 
list_head *to_kill,
 
 /*
  * Collect the processes who have the corrupted page mapped to kill.
- * This is done in two steps for locking reasons.
- * First preallocate one tokill structure outside the spin locks,
- * so that we can kill at least one process reasonably reliable.
  */
 static void collect_procs(struct page *page, struct list_head *tokill,
int force_early)
 {
-   struct to_kill *tk;
-
if (!page->mapping)
return;
 
-   tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);
-   if (!tk)
-   return;
if (PageAnon(page))
-   collect_procs_anon(page, tokill, &tk, force_early);
+   collect_procs_anon(page, tokill, force_early);
else
-   collect_procs_file(page, tokill, &tk, force_early);
-   kfree(tk);
+   collect_procs_file(page, tokill, force_early);
 }
 
 static const char *action_name[] = {
-- 
1.8.3.1



[PATCH v3 0/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS issue

2019-07-25 Thread Jane Chu
Changes in v3:
 - move **tk cleanup to its own patch

Changes in v2:
 - move 'tk' allocations internal to add_to_kill(), suggested by Dan;
 - ran checkpatch.pl check, pointed out by Matthew;
 - Naoya pointed out that v1 would have missed the SIGKILL
   if "tk->addr == -EFAULT", since the code returns early.
   Incorporated Naoya's suggestion; also, skip VMAs where
   "tk->size_shift == 0" for zone device page, and deliver SIGBUS
   when "tk->size_shift != 0" so the payload is helpful;
 - added Suggested-by: Naoya Horiguchi 

Jane Chu (2):
  mm/memory-failure.c clean up around tk pre-allocation
  mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if
mmaped more than once

 mm/memory-failure.c | 62 ++---
 1 file changed, 26 insertions(+), 36 deletions(-)

-- 
1.8.3.1



[PATCH v3 2/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-07-25 Thread Jane Chu
Mmap /dev/dax more than once, then read the poison location using the
address from one of the mappings. The other mappings, due to not having
the page mapped in, will cause SIGKILLs to be delivered to the process.
SIGKILL takes precedence over SIGBUS, so the user process loses the
opportunity to handle the UE.

Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several orders of magnitude
slower, so it isn't always an option.

Details -

ndctl inject-error --block=10 --count=1 namespace6.0

./read_poison -x dax6.0 -o 5120 -m 2
mmaped address 0x7f5bb660
mmaped address 0x7f3cf360
doing local read at address 0x7f3cf3601400
Killed

Console messages in instrumented kernel -

mce: Uncorrected hardware memory error in user-access at edbe201400
Memory failure: tk->addr = 7f5bb6601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
dev_pagemap_mapping_shift: page edbe201: no PUD
Memory failure: tk->size_shift == 0
Memory failure: Unable to find user space address edbe201 in read_poison
Memory failure: tk->addr = 7f3cf3601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
Memory failure: tk->size_shift = 21
Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of 
failure to unmap corrupted page
  => to deliver SIGKILL
Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory 
corruption
  => to deliver SIGBUS

Signed-off-by: Jane Chu 
Suggested-by: Naoya Horiguchi 
---
 mm/memory-failure.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 51d5b20..f668c88 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -199,7 +199,6 @@ struct to_kill {
struct task_struct *tsk;
unsigned long addr;
short size_shift;
-   char addr_valid;
 };
 
 /*
@@ -318,22 +317,27 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
}
 
tk->addr = page_address_in_vma(p, vma);
-   tk->addr_valid = 1;
if (is_zone_device_page(p))
tk->size_shift = dev_pagemap_mapping_shift(p, vma);
else
tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
 
/*
-* In theory we don't have to kill when the page was
-* munmaped. But it could be also a mremap. Since that's
-* likely very rare kill anyways just out of paranoia, but use
-* a SIGKILL because the error is not contained anymore.
+* Send SIGKILL if "tk->addr == -EFAULT". Also, as
+* "tk->size_shift" is always non-zero for !is_zone_device_page(),
+* so "tk->size_shift == 0" effectively checks no mapping on
+* ZONE_DEVICE. Indeed, when a devdax page is mmapped N times
+* to a process' address space, it's possible not all N VMAs
+* contain mappings for the page, but at least one VMA does.
+* Only deliver SIGBUS with payload derived from the VMA that
+* has a mapping for the page.
 */
-   if (tk->addr == -EFAULT || tk->size_shift == 0) {
+   if (tk->addr == -EFAULT) { 
pr_info("Memory failure: Unable to find user space address %lx 
in %s\n",
page_to_pfn(p), tsk->comm);
-   tk->addr_valid = 0;
+   } else if (tk->size_shift == 0) {
+   kfree(tk);
+   return;
}
 
get_task_struct(tsk);
@@ -361,7 +365,7 @@ static void kill_procs(struct list_head *to_kill, int 
forcekill, bool fail,
 * make sure the process doesn't catch the
 * signal and then access the memory. Just kill it.
 */
-   if (fail || tk->addr_valid == 0) {
+   if (fail || tk->addr == -EFAULT) {
pr_err("Memory failure: %#lx: forcibly killing 
%s:%d because of failure to unmap corrupted page\n",
   pfn, tk->tsk->comm, tk->tsk->pid);
do_send_sig_info(SIGKILL, SEND_SIG_PRIV,
-- 
1.8.3.1



Re: [PATCH v2 0/1] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS issue

2019-07-24 Thread Jane Chu

On 7/24/2019 3:52 PM, Dan Williams wrote:

On Wed, Jul 24, 2019 at 3:35 PM Jane Chu  wrote:


Changes in v2:
  - move 'tk' allocations internal to add_to_kill(), suggested by Dan;


Oh, sorry if it wasn't clear, this should move to its own patch that
only does the cleanup, and then the follow-on fix patch becomes
smaller and more straightforward.



Makes sense, thanks! I'll split up the patch next.

thanks,
-jane


Re: [PATCH v2 1/1] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-07-24 Thread Jane Chu




On 7/24/2019 4:43 PM, Naoya Horiguchi wrote:

On Wed, Jul 24, 2019 at 04:33:23PM -0600, Jane Chu wrote:

Mmap /dev/dax more than once, then read the poison location using the
address from one of the mappings. The other mappings, due to not having
the page mapped in, will cause SIGKILLs to be delivered to the process.
SIGKILL takes precedence over SIGBUS, so the user process loses the
opportunity to handle the UE.

Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several orders of magnitude
slower, so it isn't always an option.

Details -

ndctl inject-error --block=10 --count=1 namespace6.0

./read_poison -x dax6.0 -o 5120 -m 2
mmaped address 0x7f5bb660
mmaped address 0x7f3cf360
doing local read at address 0x7f3cf3601400
Killed

Console messages in instrumented kernel -

mce: Uncorrected hardware memory error in user-access at edbe201400
Memory failure: tk->addr = 7f5bb6601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
dev_pagemap_mapping_shift: page edbe201: no PUD
Memory failure: tk->size_shift == 0
Memory failure: Unable to find user space address edbe201 in read_poison
Memory failure: tk->addr = 7f3cf3601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
Memory failure: tk->size_shift = 21
Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of 
failure to unmap corrupted page
   => to deliver SIGKILL
Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory 
corruption
   => to deliver SIGBUS

Signed-off-by: Jane Chu 
Suggested-by: Naoya Horiguchi 
---
  mm/memory-failure.c | 62 ++---
  1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d9cc660..bd4db33 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -199,7 +199,6 @@ struct to_kill {
struct task_struct *tsk;
unsigned long addr;
short size_shift;
-   char addr_valid;
  };
  
  /*

@@ -304,43 +303,43 @@ static unsigned long dev_pagemap_mapping_shift(struct 
page *page,
  /*
   * Schedule a process for later kill.
   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
- * TBD would GFP_NOIO be enough?
   */
  static void add_to_kill(struct task_struct *tsk, struct page *p,
   struct vm_area_struct *vma,
-  struct list_head *to_kill,
-  struct to_kill **tkc)
+  struct list_head *to_kill)
  {
struct to_kill *tk;
  
-	if (*tkc) {

-   tk = *tkc;
-   *tkc = NULL;
-   } else {
-   tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
-   if (!tk) {
-   pr_err("Memory failure: Out of memory while machine check 
handling\n");
-   return;
-   }
+   tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+   if (!tk) {
+   pr_err("Memory failure: Out of memory while machine check 
handling\n");
+   return;


As Dan pointed out, the cleanup part can be delivered as a separate patch.


My bad, I will take care of splitting up the patch.




}
+
tk->addr = page_address_in_vma(p, vma);
-   tk->addr_valid = 1;
if (is_zone_device_page(p))
tk->size_shift = dev_pagemap_mapping_shift(p, vma);
else
tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
  
  	/*

-* In theory we don't have to kill when the page was
-* munmaped. But it could be also a mremap. Since that's
-* likely very rare kill anyways just out of paranoia, but use
-* a SIGKILL because the error is not contained anymore.
+* Send SIGKILL if "tk->addr == -EFAULT". Also, as
+* "tk->size_shift" is always non-zero for !is_zone_device_page(),
+* so "tk->size_shift == 0" effectively checks no mapping on
+* ZONE_DEVICE. Indeed, when a devdax page is mmapped N times
+* to a process' address space, it's possible not all N VMAs
+* contain mappings for the page, but at least one VMA does.
+* Only deliver SIGBUS with payload derived from the VMA that
+* has a mapping for the page.


OK, so SIGBUS is sent M times (where M is the number of mappings
for the page). Then I'm convinced that we need the "else if" block below.


Yes. I ran read_poison, which mmaps /dev/dax 4 times with the MAP_POPULATE
flag set, so the kernel attempted to send SIGBUS 4 times.
One time, while the poison was consumed at uaddr[1] (2nd mmap), the
SIGBUS payload indicated that si_addr was uaddr[3] (4th mmap).

thanks!
-jane




Thanks,
Naoya Horiguchi


 */
-   if (tk->addr == -EFAULT || tk->size_shift == 0) {
+   if (tk->addr == -EFAULT) {
pr_info("M

[PATCH v2 1/1] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-07-24 Thread Jane Chu
Mmap /dev/dax more than once, then read the poison location using the
address from one of the mappings. The other mappings, due to not having
the page mapped in, will cause SIGKILLs to be delivered to the process.
SIGKILL takes precedence over SIGBUS, so the user process loses the
opportunity to handle the UE.

Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several orders of magnitude
slower, so it isn't always an option.

Details -

ndctl inject-error --block=10 --count=1 namespace6.0

./read_poison -x dax6.0 -o 5120 -m 2
mmaped address 0x7f5bb660
mmaped address 0x7f3cf360
doing local read at address 0x7f3cf3601400
Killed

Console messages in instrumented kernel -

mce: Uncorrected hardware memory error in user-access at edbe201400
Memory failure: tk->addr = 7f5bb6601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
dev_pagemap_mapping_shift: page edbe201: no PUD
Memory failure: tk->size_shift == 0
Memory failure: Unable to find user space address edbe201 in read_poison
Memory failure: tk->addr = 7f3cf3601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
Memory failure: tk->size_shift = 21
Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of 
failure to unmap corrupted page
  => to deliver SIGKILL
Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory 
corruption
  => to deliver SIGBUS

Signed-off-by: Jane Chu 
Suggested-by: Naoya Horiguchi 
---
 mm/memory-failure.c | 62 ++---
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d9cc660..bd4db33 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -199,7 +199,6 @@ struct to_kill {
struct task_struct *tsk;
unsigned long addr;
short size_shift;
-   char addr_valid;
 };
 
 /*
@@ -304,43 +303,43 @@ static unsigned long dev_pagemap_mapping_shift(struct 
page *page,
 /*
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
- * TBD would GFP_NOIO be enough?
  */
 static void add_to_kill(struct task_struct *tsk, struct page *p,
   struct vm_area_struct *vma,
-  struct list_head *to_kill,
-  struct to_kill **tkc)
+  struct list_head *to_kill)
 {
struct to_kill *tk;
 
-   if (*tkc) {
-   tk = *tkc;
-   *tkc = NULL;
-   } else {
-   tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
-   if (!tk) {
-   pr_err("Memory failure: Out of memory while machine 
check handling\n");
-   return;
-   }
+   tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+   if (!tk) {
+   pr_err("Memory failure: Out of memory while machine check 
handling\n");
+   return;
}
+
tk->addr = page_address_in_vma(p, vma);
-   tk->addr_valid = 1;
if (is_zone_device_page(p))
tk->size_shift = dev_pagemap_mapping_shift(p, vma);
else
tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
 
/*
-* In theory we don't have to kill when the page was
-* munmaped. But it could be also a mremap. Since that's
-* likely very rare kill anyways just out of paranoia, but use
-* a SIGKILL because the error is not contained anymore.
+* Send SIGKILL if "tk->addr == -EFAULT". Also, as
+* "tk->size_shift" is always non-zero for !is_zone_device_page(),
+* so "tk->size_shift == 0" effectively checks no mapping on
+* ZONE_DEVICE. Indeed, when a devdax page is mmapped N times
+* to a process' address space, it's possible not all N VMAs
+* contain mappings for the page, but at least one VMA does.
+* Only deliver SIGBUS with payload derived from the VMA that
+* has a mapping for the page.
 */
-   if (tk->addr == -EFAULT || tk->size_shift == 0) {
+   if (tk->addr == -EFAULT) {
pr_info("Memory failure: Unable to find user space address %lx 
in %s\n",
page_to_pfn(p), tsk->comm);
-   tk->addr_valid = 0;
+   } else if (tk->size_shift == 0) {
+   kfree(tk);
+   return;
}
+
get_task_struct(tsk);
tk->tsk = tsk;
list_add_tail(&tk->nd, to_kill);
@@ -366,7 +365,7 @@ static void kill_procs(struct list_head *to_kill, int 
forcekill, bool fail,
 * make sure the process doesn't catch the
 * signal and then access the memory. Just kill it.
 */
-   if (fail || tk->addr_valid == 0) {
+  

[PATCH v2 0/1] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS issue

2019-07-24 Thread Jane Chu
Changes in v2:
 - move 'tk' allocations internal to add_to_kill(), suggested by Dan;
 - ran checkpatch.pl check, pointed out by Matthew;
 - Naoya pointed out that v1 would have missed the SIGKILL
   if "tk->addr == -EFAULT", since the code returns early.
   Incorporated Naoya's suggestion; also, skip VMAs where
   "tk->size_shift == 0" for zone device page, and deliver SIGBUS
   when "tk->size_shift != 0" so the payload is helpful;
 - added Suggested-by: Naoya Horiguchi 


Jane Chu (1):
  mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if
mmaped more than once

 mm/memory-failure.c | 62 ++---
 1 file changed, 26 insertions(+), 36 deletions(-)

-- 
1.8.3.1



Re: [PATCH] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-07-24 Thread jane . chu

Thank you all for your comments!
I've incorporated them, tested, and have a v2 ready for review.

Thanks!
-jane

On 7/23/19 11:48 PM, Naoya Horiguchi wrote:

Hi Jane, Dan,

On Tue, Jul 23, 2019 at 06:34:35PM -0700, Dan Williams wrote:

On Tue, Jul 23, 2019 at 4:49 PM Jane Chu  wrote:


Mmap /dev/dax more than once, then read the poison location using the
address from one of the mappings. The other mappings, due to not having
the page mapped in, will cause SIGKILLs to be delivered to the process.
SIGKILL takes precedence over SIGBUS, so the user process loses the
opportunity to handle the UE.

Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several orders of magnitude
slower, so it isn't always an option.

Details -

ndctl inject-error --block=10 --count=1 namespace6.0

./read_poison -x dax6.0 -o 5120 -m 2
mmaped address 0x7f5bb660
mmaped address 0x7f3cf360
doing local read at address 0x7f3cf3601400
Killed

Console messages in instrumented kernel -

mce: Uncorrected hardware memory error in user-access at edbe201400
Memory failure: tk->addr = 7f5bb6601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
dev_pagemap_mapping_shift: page edbe201: no PUD
Memory failure: tk->size_shift == 0
Memory failure: Unable to find user space address edbe201 in read_poison
Memory failure: tk->addr = 7f3cf3601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
Memory failure: tk->size_shift = 21
Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of 
failure to unmap corrupted page
   => to deliver SIGKILL
Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory 
corruption
   => to deliver SIGBUS

Signed-off-by: Jane Chu 
---
  mm/memory-failure.c | 16 ++--
  1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d9cc660..7038abd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -315,7 +315,6 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,

 if (*tkc) {
 tk = *tkc;
-   *tkc = NULL;
 } else {
 tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
 if (!tk) {
@@ -331,16 +330,21 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
 tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;

 /*
-* In theory we don't have to kill when the page was
-* munmaped. But it could be also a mremap. Since that's
-* likely very rare kill anyways just out of paranoia, but use
-* a SIGKILL because the error is not contained anymore.
+* Indeed a page could be mmapped N times within a process. And it's 
possible
+* that not all of those N VMAs contain valid mapping for the page. In 
which
+* case we don't want to send SIGKILL to the process on behalf of the 
VMAs
+* that don't have the valid mapping, because doing so will eclipse the 
SIGBUS
+* delivered on behalf of the active VMA.
  */
 if (tk->addr == -EFAULT || tk->size_shift == 0) {
 pr_info("Memory failure: Unable to find user space address %lx in 
%s\n",
 page_to_pfn(p), tsk->comm);
-   tk->addr_valid = 0;
+   if (tk != *tkc)
+   kfree(tk);
+   return;


The immediate return bypasses list_add_tail() below, so we might lose
the chance of sending SIGBUS to the process.

tk->size_shift is always non-zero for !is_zone_device_page(), so
"tk->size_shift == 0" effectively checks "no mapping on ZONE_DEVICE" now.
As you mention above, "no mapping" doesn't mean "invalid address",
so we can drop the "tk->size_shift == 0" check from this if-statement.
Going forward in this direction, "tk->addr_valid == 0" is equivalent to
"tk->addr == -EFAULT", so it seems we can remove ->addr_valid.
This observation leads me to the following change; does it work for you?

   --- a/mm/memory-failure.c
   +++ b/mm/memory-failure.c
   @@ -199,7 +199,6 @@ struct to_kill {
struct task_struct *tsk;
unsigned long addr;
short size_shift;
   -char addr_valid;
};

/*

   @@ -324,7 +323,6 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
}
}
tk->addr = page_address_in_vma(p, vma);
   -tk->addr_valid = 1;
if (is_zone_device_page(p))
tk->size_shift = dev_pagemap_mapping_shift(p, vma);
else
   @@ -336,11 +334,9 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
 * likely very rare kill anyways just out of paranoia, but use
 * a SIGKILL because the error is not contained anymore.
 */
   -if (tk->addr == -EFAULT |

[PATCH] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-07-23 Thread Jane Chu
Mmap /dev/dax more than once, then read the poison location using the
address from one of the mappings. The other mappings, due to not having
the page mapped in, will cause SIGKILLs to be delivered to the process.
SIGKILL takes precedence over SIGBUS, so the user process loses the
opportunity to handle the UE.

Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several orders of magnitude
slower, so it isn't always an option.

Details -

ndctl inject-error --block=10 --count=1 namespace6.0

./read_poison -x dax6.0 -o 5120 -m 2
mmaped address 0x7f5bb660
mmaped address 0x7f3cf360
doing local read at address 0x7f3cf3601400
Killed

Console messages in instrumented kernel -

mce: Uncorrected hardware memory error in user-access at edbe201400
Memory failure: tk->addr = 7f5bb6601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
dev_pagemap_mapping_shift: page edbe201: no PUD
Memory failure: tk->size_shift == 0
Memory failure: Unable to find user space address edbe201 in read_poison
Memory failure: tk->addr = 7f3cf3601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
Memory failure: tk->size_shift = 21
Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of 
failure to unmap corrupted page
  => to deliver SIGKILL
Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory 
corruption
  => to deliver SIGBUS

Signed-off-by: Jane Chu 
---
 mm/memory-failure.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d9cc660..7038abd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -315,7 +315,6 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
 
if (*tkc) {
tk = *tkc;
-   *tkc = NULL;
} else {
tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
if (!tk) {
@@ -331,16 +330,21 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
 
/*
-* In theory we don't have to kill when the page was
-* munmaped. But it could be also a mremap. Since that's
-* likely very rare kill anyways just out of paranoia, but use
-* a SIGKILL because the error is not contained anymore.
+* Indeed a page could be mmapped N times within a process. And it's 
possible
+* that not all of those N VMAs contain valid mapping for the page. In 
which
+* case we don't want to send SIGKILL to the process on behalf of the 
VMAs
+* that don't have the valid mapping, because doing so will eclipse the 
SIGBUS
+* delivered on behalf of the active VMA.
 */
if (tk->addr == -EFAULT || tk->size_shift == 0) {
pr_info("Memory failure: Unable to find user space address %lx 
in %s\n",
page_to_pfn(p), tsk->comm);
-   tk->addr_valid = 0;
+   if (tk != *tkc)
+   kfree(tk);
+   return;
}
+   if (tk == *tkc)
+   *tkc = NULL;
get_task_struct(tsk);
tk->tsk = tsk;
list_add_tail(&tk->nd, to_kill);
-- 
1.8.3.1



Re: [PATCH 0/6] libnvdimm: Fix async operations and locking

2019-06-18 Thread Jane Chu

On 6/11/2019 4:25 PM, Dan Williams wrote:

The libnvdimm subsystem uses async operations to parallelize device
probing operations and to allow sysfs to trigger device_unregister() on
deleted namepsaces. A multithreaded stress test of the libnvdimm sysfs
interface uncovered a case where device_unregister() is triggered
multiple times, and the subsequent investigation uncovered a broken
locking scenario.

The lack of lockdep coverage for device_lock() stymied the debug. That
is, until patch6 "driver-core, libnvdimm: Let device subsystems add
local lockdep coverage" solved that with a shadow lock, with lockdep
coverage, to mirror device_lock() operations. Given the time saved with
the shadow-lock debug hack, patch6 attempts to generalize a device_lock()
debug facility that might be able to be carried upstream. Patch6 is
staged at the end of this fix series in case it is contentious and needs
to be dropped.
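
(To make the shadow-lock idea above concrete, here is a rough sketch under
assumed names; the actual patch adds the shadow mutex, dev->lockdep_mutex,
to 'struct device' itself.)

/*
 * Take a lockdep-tracked mutex alongside device_lock() so that lockdep
 * can observe lock ordering that device_lock() itself does not report.
 */
#include <linux/device.h>
#include <linux/mutex.h>

struct shadow_lock {
	struct mutex lockdep_mutex;	/* exists only for lockdep coverage */
};

static void device_lock_covered(struct device *dev, struct shadow_lock *s)
{
	device_lock(dev);		/* the real lock */
	mutex_lock(&s->lockdep_mutex);	/* mirror it for lockdep's benefit */
}

static void device_unlock_covered(struct device *dev, struct shadow_lock *s)
{
	mutex_unlock(&s->lockdep_mutex);
	device_unlock(dev);
}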

Patch1 "drivers/base: Introduce kill_device()" could be achieved with
local libnvdimm infrastructure. However, the existing 'dead' flag in
'struct device_private' aims to solve similar async register/unregister
races so the fix in patch2 "libnvdimm/bus: Prevent duplicate
device_unregister() calls" can be implemented with existing driver-core
infrastructure.

Patch3 is a rare lockdep warning that is intermittent based on
namespaces racing ahead of the completion of probe of their parent
region. It is not related to the other fixes, it just happened to
trigger as a result of the async stress test.

Patch4 and patch5 address an ABBA deadlock tripped by the stress test.

These patches pass the failing stress test and the existing libnvdimm
unit tests with CONFIG_PROVE_LOCKING=y and the new "dev->lockdep_mutex"
shadow lock with no lockdep warnings.

---

Dan Williams (6):
   drivers/base: Introduce kill_device()
   libnvdimm/bus: Prevent duplicate device_unregister() calls
   libnvdimm/region: Register badblocks before namespaces
   libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl()
   libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
   driver-core, libnvdimm: Let device subsystems add local lockdep coverage


  drivers/acpi/nfit/core.c|   28 ---
  drivers/acpi/nfit/nfit.h|   24 ++
  drivers/base/core.c |   30 ++--
  drivers/nvdimm/btt_devs.c   |   16 ++--
  drivers/nvdimm/bus.c|  154 +++
  drivers/nvdimm/core.c   |   10 +--
  drivers/nvdimm/dimm_devs.c  |4 +
  drivers/nvdimm/namespace_devs.c |   36 +
  drivers/nvdimm/nd-core.h|   71 ++
  drivers/nvdimm/pfn_devs.c   |   24 +++---
  drivers/nvdimm/pmem.c   |4 +
  drivers/nvdimm/region.c |   24 +++---
  drivers/nvdimm/region_devs.c|   12 ++-
  include/linux/device.h  |6 ++
  14 files changed, 308 insertions(+), 135 deletions(-)



Tested-by: Jane Chu 

Specifically, running parallel ndctl invocations creating/destroying
namespaces in multiple processes concurrently led to a system panic;
that has been verified as fixed by this patch series.


Thanks!
-jane


Re: [PATCH] mm, memory-failure: clarify error message

2019-05-20 Thread Jane Chu

On 5/16/2019 9:48 PM, Anshuman Khandual wrote:


On 05/17/2019 09:38 AM, Jane Chu wrote:

Some users who install a SIGBUS handler that does a longjmp out

What is the longjmp about? Are you referring to the mechanism of catching
the signal which was registered?


Yes.
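
(For readers unfamiliar with the pattern, here is a minimal sketch of such
a handler, assuming SA_SIGINFO and sigsetjmp-based recovery; it is
illustrative only, not any particular application's code.)

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static sigjmp_buf recover;

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	/* BUS_MCEERR_AR/_AO carry the poisoned address in si->si_addr */
	if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO)
		siglongjmp(recover, 1);	/* longjmp out, keep the process alive */
	_exit(1);			/* anything else is fatal */
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = sigbus_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	if (sigsetjmp(recover, 1)) {
		fprintf(stderr, "recovered from memory error\n");
		return 0;
	}
	/* ... access pmem/dax mappings here; a UE lands in the handler ... */
	return 0;
}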

thanks,
-jane



[PATCH v2] mm, memory-failure: clarify error message

2019-05-20 Thread Jane Chu
Some users who install a SIGBUS handler that does a longjmp out,
thereby keeping the process alive, are confused by the error
message
  "[188988.765862] Memory failure: 0x1840200: Killing
   cellsrv:33395 due to hardware memory corruption"
Slightly modify the error message to improve clarity.

Signed-off-by: Jane Chu 
---
 mm/memory-failure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fc8b517..c4f4bcd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -216,7 +216,7 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, 
int flags)
short addr_lsb = tk->size_shift;
int ret;
 
-   pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory 
corruption\n",
+   pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware 
memory corruption\n",
pfn, t->comm, t->pid);
 
if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
-- 
1.8.3.1



Re: [PATCH] mm, memory-failure: clarify error message

2019-05-20 Thread Jane Chu

Thanks Vishal and Naoya!

-jane

On 5/20/2019 3:21 AM, Naoya Horiguchi wrote:

On Fri, May 17, 2019 at 10:18:02AM +0530, Anshuman Khandual wrote:


On 05/17/2019 09:38 AM, Jane Chu wrote:

Some users who install a SIGBUS handler that does a longjmp out

What is the longjmp about? Are you referring to the mechanism of catching
the signal which was registered?

AFAIK, longjmp() might be useful for signal-based retrying, so highly
optimized applications like Oracle DB might want to utilize it to handle
memory errors at the application level, I guess.


therefore keeping the process alive is confused by the error
message
   "[188988.765862] Memory failure: 0x1840200: Killing
cellsrv:33395 due to hardware memory corruption"

It's a valid point because those are two distinct actions.


Slightly modify the error message to improve clarity.

Signed-off-by: Jane Chu 
---
  mm/memory-failure.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fc8b517..14de5e2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long 
pfn, int flags)
short addr_lsb = tk->size_shift;
int ret;
  
-	pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",

-   pfn, t->comm, t->pid);
-
if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
+   pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory 
"
+   "corruption\n", pfn, t->comm, t->pid);
ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
   addr_lsb, current);
} else {
@@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, 
int flags)
 * This could cause a loop when the user sets SIGBUS
 * to SIG_IGN, but hopefully no one will do that?
 */
+   pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to 
hardware "
+   "memory corruption\n", pfn, t->comm, t->pid);
ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
  addr_lsb, t);  /* synchronous? */

As both the pr_err() messages are very similar, could we not just switch
between "Killing" and "Sending SIGBUS to" based on a variable, e.g.
action_[kill|sigbus], evaluated previously with
((flags & MF_ACTION_REQUIRED) && t->mm == current->mm)?

That might need an additional if statement, which I'm not sure is worth
doing. I think that the simplest fix for the reported problem (a confusing
message) is like below:

-   pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory 
corruption\n",
+   pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to 
hardware memory corruption\n",
pfn, t->comm, t->pid);

Or, if we have a good reason to separate the message for MF_ACTION_REQUIRED and
MF_ACTION_OPTIONAL, that might be OK.

Thanks,
Naoya Horiguchi


[PATCH] mm, memory-failure: clarify error message

2019-05-16 Thread Jane Chu
Some users who install a SIGBUS handler that does a longjmp out,
thereby keeping the process alive, are confused by the error
message
  "[188988.765862] Memory failure: 0x1840200: Killing
   cellsrv:33395 due to hardware memory corruption"
Slightly modify the error message to improve clarity.

Signed-off-by: Jane Chu 
---
 mm/memory-failure.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fc8b517..14de5e2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long 
pfn, int flags)
short addr_lsb = tk->size_shift;
int ret;
 
-   pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory 
corruption\n",
-   pfn, t->comm, t->pid);
-
if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
+   pr_err("Memory failure: %#lx: Killing %s:%d due to hardware 
memory "
+   "corruption\n", pfn, t->comm, t->pid);
ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
   addr_lsb, current);
} else {
@@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, 
int flags)
 * This could cause a loop when the user sets SIGBUS
 * to SIG_IGN, but hopefully no one will do that?
 */
+   pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to 
hardware "
+   "memory corruption\n", pfn, t->comm, t->pid);
ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
  addr_lsb, t);  /* synchronous? */
}
-- 
1.8.3.1



Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race

2019-05-16 Thread Jane Chu

On 5/16/2019 2:51 PM, Dan Williams wrote:


On Thu, May 16, 2019 at 9:45 AM Jane Chu  wrote:

Hi,

I'm able to reproduce the panic below by running two sets of ndctl
commands that actually serve a legitimate purpose in parallel (unlike
the brute force experiment earlier), each set in an indefinite loop.
This time it takes about an hour to panic.  But I gather the cause
is probably the same: I've overlapped ndctl commands on the same
region.

Could we add a check in nd_ioctl(), such that if there is
an ongoing ndctl command on a region, a subsequent ndctl request
will fail immediately with something to the effect of EAGAIN?
The rationale being that the kernel should protect itself against
user mistakes.

We do already have locking in the driver to prevent configuration
collisions. The problem looks to be broken assumptions about running
the device unregistration path in a separate thread outside the lock.
I suspect it may be incorrect assumptions about the userspace
visibility of the device relative to teardown actions. To be clear
this isn't the nd_ioctl() path this is the sysfs path.


I see, thanks!




Also, since the subject fix is for a different problem and has been
verified, I'm happy to see it upstream, so we have a better
code base to dig deeper into how the destructive ndctl
commands interact with typical mission-critical applications, including
but not limited to RDMA.

Right, the crash signature you are seeing looks unrelated to the issue
being address in these patches which is device-teardown racing active
page pins. I'll start the investigation on the crash signature, but
again I don't think it reads on this fix series.


Agreed on investigating the crash as separate issue, looking forward
to see this patchset in upstream.

Thanks!
-jane



Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race

2019-05-16 Thread jane . chu

Apologies for resending in plain text.
-jane

On 5/16/19 9:45 AM, Jane Chu wrote:

Hi,

I'm able to reproduce the panic below by running two sets of ndctl
commands that actually serve a legitimate purpose in parallel (unlike
the brute force experiment earlier), each set in an indefinite loop.
This time it takes about an hour to panic.  But I gather the cause
is probably the same: I've overlapped ndctl commands on the same
region.

Could we add a check in nd_ioctl(), such that if there is
an ongoing ndctl command on a region, a subsequent ndctl request
will fail immediately with something to the effect of EAGAIN?
The rationale being that the kernel should protect itself against
user mistakes.
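
(A hypothetical sketch of the kind of guard being suggested here; this is
not actual libnvdimm code, and as Dan notes in his reply the real
collision turned out to be in the sysfs path rather than nd_ioctl().)

#include <linux/bitops.h>
#include <linux/errno.h>

/* assumed state: one "command in flight" bit per region, up to 64 regions */
static unsigned long region_busy[BITS_TO_LONGS(64)];

static int region_cmd_enter(unsigned int region_id)
{
	if (test_and_set_bit(region_id, region_busy))
		return -EAGAIN;	/* another command is already in progress */
	return 0;
}

static void region_cmd_exit(unsigned int region_id)
{
	clear_bit(region_id, region_busy);
}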

Also, since the subject fix is for a different problem and has been
verified, I'm happy to see it upstream, so we have a better
code base to dig deeper into how the destructive ndctl
commands interact with typical mission-critical applications, including
but not limited to RDMA.

thanks,
-jane

On 5/14/2019 2:18 PM, Jane Chu wrote:

On 5/14/2019 12:04 PM, Dan Williams wrote:


On Tue, May 14, 2019 at 11:53 AM Jane Chu  wrote:

On 5/13/2019 12:22 PM, Logan Gunthorpe wrote:

On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:

On 2019-05-07 5:55 p.m., Dan Williams wrote:

Changes since v1 [1]:
- Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)

- Refresh the p2pdma patch headers to match the format of other p2pdma
    patches (Bjorn)

- Collect Ira's reviewed-by

[1]: 
https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.st...@dwillia2-desk3.amr.corp.intel.com/ 



This series looks good to me:

Reviewed-by: Logan Gunthorpe 

However, I haven't tested it yet but I intend to later this week.

I've tested libnvdimm-pending, which includes this series, on my setup and
everything works great.

Just wondering: in a different scenario where pmem pages are exported to
a KVM guest, and then by mistake the user issues "ndctl
destroy-namespace -f", will the kernel wait indefinitely until the user
figures out to kill the guest and release the pmem pages?

It depends on whether the pages are pinned. Typically DAX memory
mappings assigned to a guest are not pinned in the host and can be
invalidated at any time. The pinning only occurs with VFIO and
device-assignment which isn't the common case, especially since that
configuration is blocked by fsdax. However, with devdax, yes you can
arrange for the system to go into an indefinite wait.

This somewhat ties back to the get_user_pages() vs DAX debate. The
indefinite stall issue with device-assignment could be addressed with
a requirement to hold a lease and expect that a lease revocation event
may escalate to SIGKILL in response to 'ndctl destroy-namespace'. The
expectation with device-dax is that it is already a raw interface with
pointy edges and caveats, but I would not be opposed to introducing a
lease semantic.


Thanks for the quick response Dan.

I am not convinced that the get_user_pages() vs FS-DAX dilemma is a
perfect comparison to the "ndctl destroy-namespace -f" vs
namespace-is-busy dilemma.

Others might disagree with me, but I thought there was no risk of panic
if we fail "ndctl destroy-namespace -f" in order to honor a clean shutdown
of the user application. Also, both actions are on the same host, so in
theory the admin could shut down the application before attempting a
destructive action.

By allowing 'opposite' actions to compete with each other at fine
granularity, there is potential for panic in general, not necessarily
with pinned pages, I guess.  I just ran an experiment and panicked the
system.

So, as Optane DCPMEM is generally for server/cloud deployment, and as
RAS is a higher priority for servers than administrative commands,
allowing a namespace management command to panic the kernel is not an
option?

Here is my stress experiment -
  Start out with ./create_nm.sh to create as many 48G devdax namespaces
as possible. Once that's completed, fire up 6 actions in quick
succession in the order below:
  -> ndctl destroy-namespace all -f
  -> ./create_nm.sh
  -> ndctl destroy-namespace all -f
  -> ./create_nm.sh
  -> ndctl destroy-namespace all -f
  -> ./create_nm.sh

==  console message ===
Kernel 5.1.0-rc7-next-20190501-libnvdimm-pending on an x86_64

ban25uut130 login: [ 1620.866813] BUG: kernel NULL pointer 
dereference, address: 0020

[ 1620.874585] #PF: supervisor read access in kernel mode
[ 1620.880319] #PF: error_code(0x) - not-present page
[ 1620.886052] PGD 0 P4D 0
[ 1620.79] Oops:  [#1] SMP NOPTI
[ 1620.892964] CPU: 19 PID: 5611 Comm: kworker/u130:3 Tainted: 
G    W 5.1.0-rc7-next-20190501-libnvdimm-pending #5
[ 1620.905389] Hardware name: Oracle Corporation ORACLE SERVER 
X8-2L/ASM,MTHRBD,2U, BIOS 52020101 05/07/2019

[ 1620.916069] Workqueue: events_unbound async_run_entry_fn
[ 1620.921997] RIP: 0010:klist_put+0x1b/0x6c
[ 1620.926471] Code: 48 8b 
