Re: [RFC PATCH] xfs: check shared state of when CoW, update reflink flag when io ends

2023-03-22 Thread Shiyang Ruan




On 2023/3/21 23:13, Darrick J. Wong wrote:

On Mon, Mar 20, 2023 at 06:02:05PM +0800, Shiyang Ruan wrote:



On 2023/3/18 4:35, Darrick J. Wong wrote:

On Fri, Mar 17, 2023 at 03:59:48AM +, Shiyang Ruan wrote:

As mentioned[1] before, generic/388 will randomly fail with a dmesg
warning.  This case uses fsstress with a lot of random operations, so it is
hard to reproduce.  Finally I found a condition that reproduces it 100%:
setting the seed to 1677104360.  So I changed the generic/388 code: removed
the loop and used the code below instead:
```
($FSSTRESS_PROG $FSSTRESS_AVOID -d $SCRATCH_MNT -v -s 1677104360 -n 221 -p 1 >> 
$seqres.full) > /dev/null 2>&1
($FSSTRESS_PROG $FSSTRESS_AVOID -d $SCRATCH_MNT -v -s 1677104360 -n 221 -p 1 >> 
$seqres.full) > /dev/null 2>&1
_check_dmesg_for dax_insert_entry
```

According to the operations log and the kernel debug log I added, I found
that the reflink flag of one inode won't be unset even when there are no
shared extents any more.
Then write to this file again.  Because of the reflink flag, xfs thinks it
  needs CoW, and an extent (call it extA) will be CoWed to a new
  extent (call it extB) incorrectly.  extA is not used any more,
  but was never unmapped (dax_disassociate_entry() was not done).


IOWs, dax_iomap_copy_around (or something very near it) should be
calling dax_disassociate_entry on the source range after copying extA's
contents to extB to drop its page->shared count?


If extA is a shared extent, its pages will be disassociated correctly by
invalidate_inode_pages2_range() in dax_iomap_iter().

But the problem is that extA is not shared yet is now being CoWed,


Aha!  Ok, I hadn't realized that extA is not shared...


invalidate_inode_pages2_range() is also called, but it can't disassociate
the old page (the page is marked dirty, so it can't be invalidated).


...so what marked the old page dirty?   Was it the case that the
unshared extA got marked dirty, then later someone created a cow
reservation (extB, I guess) that covered the already dirty extA?

Should we be transferring the dirty state from A to B here before the
invalidate_inode_pages2_range ?


Is doing CoW on a non-shared extent allowed?


In general, yes, XFS allows COW on non-shared extents.  The (cow) extent
size hint provides for cowing the unshared blocks adjacent to a shared
block to try to combat fragmentation.
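
For readers unfamiliar with the hint: it can be set per-file from userspace
through the standard fsxattr ioctls.  A minimal sketch (hypothetical fd and
hint size, not part of this thread):
```
#include <sys/ioctl.h>
#include <linux/fs.h>

/* set a 1 MiB CoW extent size hint on an open XFS file */
static int set_cowextsize(int fd)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	fsx.fsx_xflags |= FS_XFLAG_COWEXTSIZE;
	fsx.fsx_cowextsize = 1024 * 1024;	/* hint, in bytes */
	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```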


Ok, I didn't realize its benefit.  Thanks a lot.

Now I've fixed it based on your suggestion and it works.  The failed
cases all passed.  Now I'm running generic/388 many, many times to make
sure it doesn't fail again.  I'll send the patch once generic/388 passes.







The next time we mapwrite to another file, xfs will allocate extA for it,
  and the page fault handler does dax_associate_entry().  BUT because extA
  was never unmapped, it still stores the old file's info in page->mapping
  and page->index.  Then a dmesg warning is reported when it tries to store
  the new file's info.

So, I think:
1. the reflink flag should be updated after CoW operations.
2. xfs_reflink_allocate_cow() should add an "is the extent shared" check to
   determine whether xfs should do CoW or not.

I made the fix patch; it resolves the generic/388 failure, but it causes
other cases to fail: generic/127, generic/263, generic/616, xfs/315,
xfs/421.  I'm not sure whether the fix is right or I have missed something
somewhere.  Please give me some advice.

Thank you very much!!

[1]: 
https://lore.kernel.org/linux-xfs/1669908538-55-1-git-send-email-ruansy.f...@fujitsu.com/

Signed-off-by: Shiyang Ruan 
---
   fs/xfs/xfs_reflink.c | 44 
   fs/xfs/xfs_reflink.h |  2 ++
   2 files changed, 46 insertions(+)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f5dc46ce9803..a6b07f5c1db2 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -154,6 +154,40 @@ xfs_reflink_find_shared(
return error;
   }
+int xfs_reflink_extent_is_shared(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*irec,
+	bool			*shared)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		aglen;
+	xfs_agblock_t		fbno;
+	xfs_extlen_t		flen;
+	int			error = 0;
+
+   *shared = false;
+
+   /* Holes, unwritten, and delalloc extents cannot be shared */
+   if (!xfs_bmap_is_written_extent(irec))
+   return 0;
+
+   pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, irec->br_startblock));
+   agbno = XFS_FSB_TO_AGBNO(mp, irec->br_startblock);
+   aglen = irec->br_blockcount;
+	error = xfs_reflink_find_shared(pag, NULL, agbno, aglen, &fbno, &flen,
+			true);
+   xfs_perag_put(pag);
+   if (error)
+   return error;
+

Re: [RFC PATCH] xfs: check shared state of when CoW, update reflink flag when io ends

2023-03-20 Thread Shiyang Ruan




On 2023/3/18 4:35, Darrick J. Wong wrote:

On Fri, Mar 17, 2023 at 03:59:48AM +, Shiyang Ruan wrote:

As mentioned[1] before, generic/388 will randomly fail with a dmesg
warning.  This case uses fsstress with a lot of random operations, so it is
hard to reproduce.  Finally I found a condition that reproduces it 100%:
setting the seed to 1677104360.  So I changed the generic/388 code: removed
the loop and used the code below instead:
```
($FSSTRESS_PROG $FSSTRESS_AVOID -d $SCRATCH_MNT -v -s 1677104360 -n 221 -p 1 >> 
$seqres.full) > /dev/null 2>&1
($FSSTRESS_PROG $FSSTRESS_AVOID -d $SCRATCH_MNT -v -s 1677104360 -n 221 -p 1 >> 
$seqres.full) > /dev/null 2>&1
_check_dmesg_for dax_insert_entry
```

According to the operations log and the kernel debug log I added, I found
that the reflink flag of one inode won't be unset even when there are no
shared extents any more.
   Then write to this file again.  Because of the reflink flag, xfs thinks it
 needs CoW, and an extent (call it extA) will be CoWed to a new
 extent (call it extB) incorrectly.  extA is not used any more,
 but was never unmapped (dax_disassociate_entry() was not done).


IOWs, dax_iomap_copy_around (or something very near it) should be
calling dax_disassociate_entry on the source range after copying extA's
contents to extB to drop its page->shared count?


If extA is a shared extent, its pages will be disassociated correctly by 
invalidate_inode_pages2_range() in dax_iomap_iter().


But the problem is that extA is not shared yet is now being CoWed;
invalidate_inode_pages2_range() is also called, but it can't disassociate
the old page (the page is marked dirty, so it can't be invalidated).


Is doing CoW on a non-shared extent allowed?




   The next time we mapwrite to another file, xfs will allocate extA for it,
 and the page fault handler does dax_associate_entry().  BUT because extA
 was never unmapped, it still stores the old file's info in page->mapping
 and page->index.  Then a dmesg warning is reported when it tries to store
 the new file's info.

So, I think:
   1. the reflink flag should be updated after CoW operations.
   2. xfs_reflink_allocate_cow() should add an "is the extent shared" check
      to determine whether xfs should do CoW or not.

I made the fix patch; it resolves the generic/388 failure, but it causes
other cases to fail: generic/127, generic/263, generic/616, xfs/315,
xfs/421.  I'm not sure whether the fix is right or I have missed something
somewhere.  Please give me some advice.

Thank you very much!!

[1]: 
https://lore.kernel.org/linux-xfs/1669908538-55-1-git-send-email-ruansy.f...@fujitsu.com/

Signed-off-by: Shiyang Ruan 
---
  fs/xfs/xfs_reflink.c | 44 
  fs/xfs/xfs_reflink.h |  2 ++
  2 files changed, 46 insertions(+)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f5dc46ce9803..a6b07f5c1db2 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -154,6 +154,40 @@ xfs_reflink_find_shared(
return error;
  }
  
+int xfs_reflink_extent_is_shared(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*irec,
+	bool			*shared)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		aglen;
+	xfs_agblock_t		fbno;
+	xfs_extlen_t		flen;
+	int			error = 0;
+
+   *shared = false;
+
+   /* Holes, unwritten, and delalloc extents cannot be shared */
+   if (!xfs_bmap_is_written_extent(irec))
+   return 0;
+
+   pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, irec->br_startblock));
+   agbno = XFS_FSB_TO_AGBNO(mp, irec->br_startblock);
+   aglen = irec->br_blockcount;
+	error = xfs_reflink_find_shared(pag, NULL, agbno, aglen, &fbno, &flen,
+			true);
+   xfs_perag_put(pag);
+   if (error)
+   return error;
+
+   if (fbno != NULLAGBLOCK)
+   *shared = true;
+
+   return 0;
+}
+
  /*
   * Trim the mapping to the next block where there's a change in the
   * shared/unshared status.  More specifically, this means that we
@@ -533,6 +567,12 @@ xfs_reflink_allocate_cow(
xfs_ifork_init_cow(ip);
}
  
+	error = xfs_reflink_extent_is_shared(ip, imap, shared);
+	if (error)
+		return error;
+	if (!*shared)
+		return 0;
+
	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
if (error || !*shared)
return error;
@@ -834,6 +874,10 @@ xfs_reflink_end_cow_extent(
/* Remove the mapping from the CoW fork. */
	xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
  
+	error = xfs_reflink_clear_inode_flag(ip, &tp);

This will disable COW on /all/ blocks in the entire file, including the
shared ones.  At a bare minimum you'd have to scan the entire da

[RFC PATCH] xfs: check shared state of when CoW, update reflink flag when io ends

2023-03-16 Thread Shiyang Ruan
As mentioned[1] before, generic/388 will randomly fail with a dmesg
warning.  This case uses fsstress with a lot of random operations, so it is
hard to reproduce.  Finally I found a condition that reproduces it 100%:
setting the seed to 1677104360.  So I changed the generic/388 code: removed
the loop and used the code below instead:
```
($FSSTRESS_PROG $FSSTRESS_AVOID -d $SCRATCH_MNT -v -s 1677104360 -n 221 -p 1 >> 
$seqres.full) > /dev/null 2>&1
($FSSTRESS_PROG $FSSTRESS_AVOID -d $SCRATCH_MNT -v -s 1677104360 -n 221 -p 1 >> 
$seqres.full) > /dev/null 2>&1
_check_dmesg_for dax_insert_entry
```

According to the operations log and the kernel debug log I added, I found
that the reflink flag of one inode won't be unset even when there are no
shared extents any more.
  Then write to this file again.  Because of the reflink flag, xfs thinks it
needs CoW, and an extent (call it extA) will be CoWed to a new
extent (call it extB) incorrectly.  extA is not used any more,
but was never unmapped (dax_disassociate_entry() was not done).
  The next time we mapwrite to another file, xfs will allocate extA for it,
and the page fault handler does dax_associate_entry().  BUT because extA
was never unmapped, it still stores the old file's info in page->mapping
and page->index.  Then a dmesg warning is reported when it tries to store
the new file's info.

So, I think:
  1. the reflink flag should be updated after CoW operations.
  2. xfs_reflink_allocate_cow() should add an "is the extent shared" check
     to determine whether xfs should do CoW or not.

I made the fix patch; it resolves the generic/388 failure, but it causes
other cases to fail: generic/127, generic/263, generic/616, xfs/315,
xfs/421.  I'm not sure whether the fix is right or I have missed something
somewhere.  Please give me some advice.

Thank you very much!!

[1]: 
https://lore.kernel.org/linux-xfs/1669908538-55-1-git-send-email-ruansy.f...@fujitsu.com/

Signed-off-by: Shiyang Ruan 
---
 fs/xfs/xfs_reflink.c | 44 
 fs/xfs/xfs_reflink.h |  2 ++
 2 files changed, 46 insertions(+)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f5dc46ce9803..a6b07f5c1db2 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -154,6 +154,40 @@ xfs_reflink_find_shared(
return error;
 }
 
+int xfs_reflink_extent_is_shared(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*irec,
+	bool			*shared)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		aglen;
+	xfs_agblock_t		fbno;
+	xfs_extlen_t		flen;
+	int			error = 0;
+
+   *shared = false;
+
+   /* Holes, unwritten, and delalloc extents cannot be shared */
+   if (!xfs_bmap_is_written_extent(irec))
+   return 0;
+
+   pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, irec->br_startblock));
+   agbno = XFS_FSB_TO_AGBNO(mp, irec->br_startblock);
+   aglen = irec->br_blockcount;
+	error = xfs_reflink_find_shared(pag, NULL, agbno, aglen, &fbno, &flen,
+			true);
+   xfs_perag_put(pag);
+   if (error)
+   return error;
+
+   if (fbno != NULLAGBLOCK)
+   *shared = true;
+
+   return 0;
+}
+
 /*
  * Trim the mapping to the next block where there's a change in the
  * shared/unshared status.  More specifically, this means that we
@@ -533,6 +567,12 @@ xfs_reflink_allocate_cow(
xfs_ifork_init_cow(ip);
}
 
+   error = xfs_reflink_extent_is_shared(ip, imap, shared);
+   if (error)
+   return error;
+   if (!*shared)
+   return 0;
+
	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
if (error || !*shared)
return error;
@@ -834,6 +874,10 @@ xfs_reflink_end_cow_extent(
/* Remove the mapping from the CoW fork. */
	xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
 
+	error = xfs_reflink_clear_inode_flag(ip, &tp);
+   if (error)
+   goto out_cancel;
+
error = xfs_trans_commit(tp);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
if (error)
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 65c5dfe17ecf..d5835814bce6 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -16,6 +16,8 @@ static inline bool xfs_is_cow_inode(struct xfs_inode *ip)
return xfs_is_reflink_inode(ip) || xfs_is_always_cow_inode(ip);
 }
 
+int xfs_reflink_extent_is_shared(struct xfs_inode *ip,
+   struct xfs_bmbt_irec *irec, bool *shared);
 extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
struct xfs_bmbt_irec *irec, bool *shared);
 int xfs_bmap_trim_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
-- 
2.39.2




Re: [PATCH v9 1/3] xfs: fix the calculation of length and end

2023-02-05 Thread Shiyang Ruan




On 2023/2/5 19:42, Matthew Wilcox wrote:

On Sat, Feb 04, 2023 at 02:58:36PM +, Shiyang Ruan wrote:

@@ -222,8 +222,8 @@ xfs_dax_notify_failure(
len -= ddev_start - offset;
offset = 0;
}
-   if (offset + len > ddev_end)
-   len -= ddev_end - offset;
+   if (offset + len - 1 > ddev_end)
+   len -= offset + len - 1 - ddev_end;


This _looks_ wrong.  Are you sure it shouldn't be:

len = ddev_end - offset + 1;



It is to make sure the range won't go beyond the end of the device.

But actually, both of us are right.
  Mine: len -= offset + len - 1 - ddev_end;
     => len = len - (offset + len - 1 - ddev_end);
     => len = len - offset - len + 1 + ddev_end;
     => len = ddev_end - offset + 1;  --> yours

I forgot to simplify it.  Will fix.
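
A quick check of the equivalence with made-up numbers:
```
/* hypothetical values: offset = 100, len = 50, ddev_end = 120 */
u64 offset = 100, len = 50, ddev_end = 120;
u64 mine  = len - (offset + len - 1 - ddev_end);	/* 50 - 29 = 21 */
u64 yours = ddev_end - offset + 1;			/* 120 - 100 + 1 = 21 */
```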


--
Thanks,
Ruan.



[PATCH v9 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2023-02-04 Thread Shiyang Ruan
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem (or mapped device) on it to unmap all files in use and notify
the processes that are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()   # was pmem driver ->remove() in v1
-> kill_dax()
 -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
  -> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event, so it does not shut down the filesystem directly if something is
not supported, or if the failure range includes a metadata area.  Make
sure all files and processes are handled correctly.
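
For context, the receiving side of this notification is the dax holder
registered by the filesystem; a reference sketch of the mainline xfs wiring
(not part of this diff):
```
/* fs/xfs/xfs_super.c: dax_holder_notify_failure() fans out to this */
const struct dax_holder_operations xfs_dax_holder_operations = {
	.notify_failure		= xfs_dax_notify_failure,
};
```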

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan 
---
 drivers/dax/super.c |  3 ++-
 fs/xfs/xfs_notify_failure.c | 28 +++-
 include/linux/mm.h  |  1 +
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index da4438f3188c..40274d19f4f9 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
 
if (dax_dev->holder_data != NULL)
-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+   MF_MEM_PRE_REMOVE);
 
	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 3830f908e215..5c1e678a1285 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 
 #include 
 #include 
+#include 
 
 struct xfs_failure_info {
xfs_agblock_t   startblock;
@@ -77,6 +78,9 @@ xfs_dax_failure_fn(
 
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* The device is about to be removed.  Not really a failure. */
+   if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
notify->want_shutdown = true;
return 0;
}
@@ -168,7 +172,9 @@ xfs_dax_notify_ddev_failure(
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
if (!error)
error = -EFSCORRUPTED;
-   }
+	} else if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
return error;
 }
 
@@ -182,12 +188,24 @@ xfs_dax_notify_failure(
struct xfs_mount*mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+   int error;
 
if (!(mp->m_super->s_flags & SB_BORN)) {
xfs_warn(mp, "filesystem is not ready for notify_failure()!");
return -EIO;
}
 
+   if (mf_flags & MF_MEM_PRE_REMOVE) {
+   xfs_info(mp, "device is about to be removed!");
+		down_write(&mp->m_super->s_umount);
+		error = sync_filesystem(mp->m_super);
+		/* invalidate_inode_pages2() invalidates dax mapping */
+		super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
+		up_write(&mp->m_super->s_umount);
+   if (error)
+   return error;
+   }
+
if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
xfs_debug(mp,
 "notify_failure() not supported on realtime device!");
@@ -196,6 +214,8 @@ xfs_dax_notify_failure(
 
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -209,6 +229,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+   /* Notify failure on the whole device */
+   if (offset == 0 && len == U64_MAX) {
+   offset = ddev_start;
+   len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+   }
+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
diff --git a/incl

[PATCH v9 2/3] fs: move drop_pagecache_sb() for others to use

2023-02-04 Thread Shiyang Ruan
xfs_notify_failure.c requires a method to invalidate all dax mappings.
drop_pagecache_sb() can do this, but it is a static function and is only
built with CONFIG_SYSCTL.  Move it to super.c and make it available to
others, and use its second argument to choose which invalidation method
to use.
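
A condensed view of the two resulting call sites (taken from the hunks
below), showing how the second argument selects the invalidation method:
```
/* sysctl drop_caches: best-effort invalidation of clean page cache */
iterate_supers(super_drop_pagecache, invalidate_inode_pages);

/* pre-remove notification: invalidate_inode_pages2() also drops dax mappings */
super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
```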

Signed-off-by: Shiyang Ruan 
---
 fs/drop_caches.c| 35 ++---
 fs/super.c  | 43 +
 include/linux/fs.h  |  1 +
 include/linux/pagemap.h |  1 +
 mm/truncate.c   | 20 +--
 5 files changed, 65 insertions(+), 35 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index e619c31b6bd9..4c9281885077 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -15,38 +15,6 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb, void *unused)
-{
-	struct inode *inode, *toput_inode = NULL;
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		/*
-		 * We must skip inodes in unusual state. We may also skip
-		 * inodes without pages but we deliberately won't in case
-		 * we need to reschedule to avoid softlockups.
-		 */
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping_empty(inode->i_mapping) && !need_resched())) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-		invalidate_mapping_pages(inode->i_mapping, 0, -1);
-		iput(toput_inode);
-		toput_inode = inode;
-
-		cond_resched();
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(toput_inode);
-}
-
 int drop_caches_sysctl_handler(struct ctl_table *table, int write,
void *buffer, size_t *length, loff_t *ppos)
 {
@@ -59,7 +27,8 @@ int drop_caches_sysctl_handler(struct ctl_table *table, int write,
static int stfu;
 
if (sysctl_drop_caches & 1) {
-   iterate_supers(drop_pagecache_sb, NULL);
+   iterate_supers(super_drop_pagecache,
+  invalidate_inode_pages);
count_vm_event(DROP_PAGECACHE);
}
if (sysctl_drop_caches & 2) {
diff --git a/fs/super.c b/fs/super.c
index 12c08cb20405..d788b73f93f0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
@@ -678,6 +679,48 @@ void drop_super_exclusive(struct super_block *sb)
 }
 EXPORT_SYMBOL(drop_super_exclusive);
 
+/*
+ * super_drop_pagecache - drop all page caches of a filesystem
+ * @sb: superblock to invalidate
+ * @arg: invalidate method, such as invalidate_inode_pages(),
+ * invalidate_inode_pages2()
+ *
+ * Scan the inodes of a filesystem and drop all of their page caches.
+ */
+void super_drop_pagecache(struct super_block *sb, void *arg)
+{
+   struct inode *inode, *toput_inode = NULL;
+   int (*invalidator)(struct address_space *) = arg;
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		spin_lock(&inode->i_lock);
+		/*
+		 * We must skip inodes in unusual state. We may also skip
+		 * inodes without pages but we deliberately won't in case
+		 * we need to reschedule to avoid softlockups.
+		 */
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    (mapping_empty(inode->i_mapping) && !need_resched())) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+
+		invalidator(inode->i_mapping);
+		iput(toput_inode);
+		toput_inode = inode;
+
+		cond_resched();
+		spin_lock(&sb->s_inode_list_lock);
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+	iput(toput_inode);
+}
+EXPORT_SYMBOL(super_drop_pagecache);
+
 static void __iterate_supers(void (*f)(struct super_block *))
 {
struct super_block *sb, *p = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c1769a2c5d70..b853632e76cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3308,6 +3308,7 @@ extern struct super_block *get_super(struct block_device *);
 extern struct super_

[PATCH v9 1/3] xfs: fix the calculation of length and end

2023-02-04 Thread Shiyang Ruan
The end should be start + length - 1.  Also fix the calculation of the
length when seeking the intersection of the notify range and the device.
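
A worked example (numbers made up) of why the -1 matters:
```
/* an 8-sector region starting at daddr 0 covers sectors 0..7 */
xfs_daddr_t daddr = 0, bblen = 8;
xfs_daddr_t end_ok  = daddr + bblen - 1;	/* 7: last sector inside  */
xfs_daddr_t end_bad = daddr + bblen;		/* 8: one past the region */
```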

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
---
 fs/xfs/xfs_notify_failure.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index c4078d0ec108..3830f908e215 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -114,7 +114,7 @@ xfs_dax_notify_ddev_failure(
int error = 0;
xfs_fsblock_t   fsbno = XFS_DADDR_TO_FSB(mp, daddr);
xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
-   xfs_fsblock_t   end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
+	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen - 1);
xfs_agnumber_t  end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
 
	error = xfs_trans_alloc_empty(mp, &tp);
@@ -210,7 +210,7 @@ xfs_dax_notify_failure(
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
/* Ignore the range out of filesystem area */
-   if (offset + len < ddev_start)
+   if (offset + len - 1 < ddev_start)
return -ENXIO;
if (offset > ddev_end)
return -ENXIO;
@@ -222,8 +222,8 @@ xfs_dax_notify_failure(
len -= ddev_start - offset;
offset = 0;
}
-   if (offset + len > ddev_end)
-   len -= ddev_end - offset;
+   if (offset + len - 1 > ddev_end)
+   len -= offset + len - 1 - ddev_end;
 
return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
mf_flags);
-- 
2.39.1




[RESEND PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2023-02-04 Thread Shiyang Ruan
Changes since v9:
  1. Rebase on 6.2-rc6

Changes since v8:
  1. P2: rename drop_pagecache_sb() to super_drop_pagecache().
  2. P2: let super_drop_pagecache() accept invalidate method.
  3. P3: invalidate all dax mappings by invalidate_inode_pages2().
  4. P3: shutdown the filesystem when it is to be removed.
  5. Rebase on 6.0-rc6 + Darrick's patch[1] + Dan's patch[2].

[1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
[2]: 
https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/

Shiyang Ruan (3):
  xfs: fix the calculation of length and end
  fs: move drop_pagecache_sb() for others to use
  mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

 drivers/dax/super.c |  3 ++-
 fs/drop_caches.c| 35 ++
 fs/super.c  | 43 +
 fs/xfs/xfs_notify_failure.c | 36 ++-
 include/linux/fs.h  |  1 +
 include/linux/mm.h  |  1 +
 include/linux/pagemap.h |  1 +
 mm/truncate.c   | 20 +++--
 8 files changed, 99 insertions(+), 41 deletions(-)

-- 
2.39.1




Re: [PATCH v2 0/8] fsdax,xfs: fix warning messages

2022-12-29 Thread Shiyang Ruan




On 2022/12/3 9:21, Dan Williams wrote:

Shiyang Ruan wrote:

Changes since v1:
  1. Added a snippet of the warning message and some of the failed cases
  2. Separated the patch for easily review
  3. Added page->share and its helper functions
  4. Included the patch[1] that removes the restrictions of fsdax and reflink
[1] 
https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.f...@fujitsu.com/


...


This also affects dax+noreflink mode if we run the test after a
dax+reflink test.  So, the most urgent thing is solving the warning
messages.

With these fixes, most warning messages in dax_associate_entry() are
gone.  But honestly, generic/388 will still randomly fail with the warning.
The case shuts down xfs while fsstress is running, and does so many
times.  I think the reason is that dax pages in use cannot be
invalidated in time when the fs is shut down.  The next time a dax page is
to be associated, it still retains the mapping value set last time.  I'll
keep working on it.


This one also sounds like it is going to be relevant for CXL PMEM, and
the improvements to the reference counting. CXL has a facility where the
driver asserts that no more writes are in-flight to the device so that
the device can assert a clean shutdown. Part of that will be making sure
that page access ends at fs shutdown.


I was trying to locate the root cause of the generic/388 failure.  But
since it's an fsstress test, I can't replay the operation sequence to
help me locate the operations.  So, I tried to replace fsstress with
fsx, which can do a replay after the case fails, but it can't reproduce
the failure.  I think another important factor is that fsstress tests with
multiple threads.  So, for now, it's hard for me to locate the cause by
running the test.


Then I updated the kernel to the latest v6.2-rc1 and ran generic/388
many times.  The dmesg warning doesn't show up any more.


How does this case behave in your testing?  Does it still fail on the
latest kernel?  If so, I think I have to keep locating the cause, and I
need your advice.



--
Thanks,
Ruan.




The warning message in dax_writeback_one() can also be fixed because of
the dax unshare.


Shiyang Ruan (8):
   fsdax: introduce page->share for fsdax in reflink mode
   fsdax: invalidate pages when CoW
   fsdax: zero the edges if source is HOLE or UNWRITTEN
   fsdax,xfs: set the shared flag when file extent is shared
   fsdax: dedupe: iter two files at the same time
   xfs: use dax ops for zero and truncate in fsdax mode
   fsdax,xfs: port unshare to fsdax
   xfs: remove restrictions for fsdax and reflink

  fs/dax.c   | 220 +
  fs/xfs/xfs_ioctl.c |   4 -
  fs/xfs/xfs_iomap.c |   6 +-
  fs/xfs/xfs_iops.c  |   4 -
  fs/xfs/xfs_reflink.c   |   8 +-
  include/linux/dax.h|   2 +
  include/linux/mm_types.h   |   5 +-
  include/linux/page-flags.h |   2 +-
  8 files changed, 166 insertions(+), 85 deletions(-)

--
2.38.1









[PATCH v2.2 1/8] fsdax: introduce page->share for fsdax in reflink mode

2022-12-06 Thread Shiyang Ruan
An fsdax page is shared not only for CoW, but also for mapread.  To make
this easier to understand, use 'share' to indicate that the dax page is
shared by more than one extent, and add helper functions to use it.

Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.
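
For intuition, a toy sketch (plain C, not kernel code) of the union
aliasing this relies on: 'share' reuses the storage of 'index', and which
interpretation applies depends on the page->mapping marker:
```
#include <stdio.h>

#define TOY_MAPPING_DAX_SHARED	((void *)0x1)

struct toy_page {
	void *mapping;			/* NULL, a real mapping, or the marker */
	union {
		unsigned long index;	/* file offset, when regularly mapped */
		unsigned long share;	/* refcount, when mapping is the marker */
	};
};

int main(void)
{
	struct toy_page p = { .mapping = TOY_MAPPING_DAX_SHARED, .share = 2 };

	/* same storage, two readings; the marker says which one is valid */
	printf("share=%lu index=%lu\n", p.share, p.index);	/* both 2 */
	return 0;
}
```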

Signed-off-by: Shiyang Ruan 
Reviewed-by: Allison Henderson 
---
 fs/dax.c   | 38 ++
 include/linux/mm_types.h   |  5 -
 include/linux/page-flags.h |  2 +-
 3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1c6867810cbd..84fadea08705 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -334,35 +334,41 @@ static unsigned long dax_end_pfn(void *entry)
for (pfn = dax_to_pfn(entry); \
pfn < dax_end_pfn(entry); pfn++)
 
-static inline bool dax_mapping_is_cow(struct address_space *mapping)
+static inline bool dax_page_is_shared(struct page *page)
 {
-   return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
+   return page->mapping == PAGE_MAPPING_DAX_SHARED;
 }
 
 /*
- * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
+ * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
+ * refcount.
  */
-static inline void dax_mapping_set_cow(struct page *page)
+static inline void dax_page_share_get(struct page *page)
 {
-   if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
+   if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
/*
 * Reset the index if the page was already mapped
 * regularly before.
 */
if (page->mapping)
-   page->index = 1;
-   page->mapping = (void *)PAGE_MAPPING_DAX_COW;
+   page->share = 1;
+   page->mapping = PAGE_MAPPING_DAX_SHARED;
}
-   page->index++;
+   page->share++;
+}
+
+static inline unsigned long dax_page_share_put(struct page *page)
+{
+   return --page->share;
 }
 
 /*
- * When it is called in dax_insert_entry(), the cow flag will indicate that
+ * When it is called in dax_insert_entry(), the shared flag will indicate that
  * whether this entry is shared by multiple files.  If so, set the page->mapping
- * FS_DAX_MAPPING_COW, and use page->index as refcount.
+ * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
  */
 static void dax_associate_entry(void *entry, struct address_space *mapping,
-   struct vm_area_struct *vma, unsigned long address, bool cow)
+   struct vm_area_struct *vma, unsigned long address, bool shared)
 {
unsigned long size = dax_entry_size(entry), pfn, index;
int i = 0;
@@ -374,8 +380,8 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   if (cow) {
-   dax_mapping_set_cow(page);
+   if (shared) {
+   dax_page_share_get(page);
} else {
WARN_ON_ONCE(page->mapping);
page->mapping = mapping;
@@ -396,9 +402,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
struct page *page = pfn_to_page(pfn);
 
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-   if (dax_mapping_is_cow(page->mapping)) {
-   /* keep the CoW flag if this page is still shared */
-   if (page->index-- > 0)
+   if (dax_page_is_shared(page)) {
+   /* keep the shared flag if this page is still shared */
+   if (dax_page_share_put(page) > 0)
continue;
} else
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..f46cac3657ad 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -103,7 +103,10 @@ struct page {
};
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
-   pgoff_t index;  /* Our offset within mapping. */
+	union {
+		pgoff_t index;		/* Our offset within mapping. */
+		unsigned long share;	/* share count for fsdax */
+	};
/**
 * @private: Mapping-private opaque data.
 * Usually used for buffer_heads if PagePrivate.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0b0ae5084e60..d8e94f2f704a 100644
--- a/include/linux/page

Re: [PATCH v2.1 1/8] fsdax: introduce page->share for fsdax in reflink mode

2022-12-04 Thread Shiyang Ruan




On 2022/12/3 10:07, Dan Williams wrote:

Shiyang Ruan wrote:

An fsdax page is shared not only for CoW, but also for mapread.  To make
this easier to understand, use 'share' to indicate that the dax page is
shared by more than one extent, and add helper functions to use it.

Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.

Signed-off-by: Shiyang Ruan 
---
  fs/dax.c   | 38 ++
  include/linux/mm_types.h   |  5 -
  include/linux/page-flags.h |  2 +-
  3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1c6867810cbd..edbacb273ab5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -334,35 +334,41 @@ static unsigned long dax_end_pfn(void *entry)
for (pfn = dax_to_pfn(entry); \
pfn < dax_end_pfn(entry); pfn++)
  
-static inline bool dax_mapping_is_cow(struct address_space *mapping)

+static inline bool dax_page_is_shared(struct page *page)
  {
-   return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
+   return (unsigned long)page->mapping == PAGE_MAPPING_DAX_SHARED;
  }
  
  /*

- * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
+ * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
+ * refcount.
   */
-static inline void dax_mapping_set_cow(struct page *page)
+static inline void dax_page_bump_sharing(struct page *page)


Similar to page_ref naming I would call this page_share_get() and the
corresponding function page_share_put().


  {
-   if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
+   if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_SHARED) {
/*
 * Reset the index if the page was already mapped
 * regularly before.
 */
if (page->mapping)
-   page->index = 1;
-   page->mapping = (void *)PAGE_MAPPING_DAX_COW;
+   page->share = 1;
+   page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;


Small nit: you could save a cast here by defining
PAGE_MAPPING_DAX_SHARED as "((void *) 1)".
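
I.e., roughly (a sketch of the suggestion, not the final patch):
```
/* define the marker as a pointer constant so call sites need no cast */
#define PAGE_MAPPING_DAX_SHARED	((void *)0x1)

page->mapping = PAGE_MAPPING_DAX_SHARED;
```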


Ok.




}
-   page->index++;
+   page->share++;
+}
+
+static inline unsigned long dax_page_drop_sharing(struct page *page)
+{
+   return --page->share;
  }
  
  /*

- * When it is called in dax_insert_entry(), the cow flag will indicate that
+ * When it is called in dax_insert_entry(), the shared flag will indicate that
   * whether this entry is shared by multiple files.  If so, set the page->mapping
- * FS_DAX_MAPPING_COW, and use page->index as refcount.
+ * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
   */
  static void dax_associate_entry(void *entry, struct address_space *mapping,
-   struct vm_area_struct *vma, unsigned long address, bool cow)
+   struct vm_area_struct *vma, unsigned long address, bool shared)
  {
unsigned long size = dax_entry_size(entry), pfn, index;
int i = 0;
@@ -374,8 +380,8 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
  
-		if (cow) {

-   dax_mapping_set_cow(page);
+   if (shared) {
+   dax_page_bump_sharing(page);
} else {
WARN_ON_ONCE(page->mapping);
page->mapping = mapping;
@@ -396,9 +402,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
struct page *page = pfn_to_page(pfn);
  
  		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);

-   if (dax_mapping_is_cow(page->mapping)) {
-   /* keep the CoW flag if this page is still shared */
-   if (page->index-- > 0)
+   if (dax_page_is_shared(page)) {
+   /* keep the shared flag if this page is still shared */
+   if (dax_page_drop_sharing(page) > 0)
continue;


I think part of what makes this hard to read is trying to preserve the
same code paths for shared pages and typical pages.

page_share_put() should, in addition to decrementing the share, clear
out the page->mapping value.


In order to be consistent, how about naming the 3 helper functions like 
this:


bool          dax_page_is_shared(struct page *page);
void          dax_page_share_get(struct page *page);
unsigned long dax_page_share_put(struct page *page);


--
Thanks,
Ruan.




} else
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..f46cac3657ad 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.

[PATCH v2.1 3/8] fsdax: zero the edges if source is HOLE or UNWRITTEN

2022-12-02 Thread Shiyang Ruan
If srcmap contains invalid data, such as HOLE and UNWRITTEN, the dest
page should be zeroed.  Otherwise, since it's a pmem, old data may
remain on the dest page and the result of CoW will be incorrect.

The function name is also not easy to understand; rename it to
"dax_iomap_copy_around()", which means it copies data around the range.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c | 79 +++-
 1 file changed, 49 insertions(+), 30 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a77739f2abe7..f12645d6f3c8 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1092,7 +1092,8 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
 }
 
 /**
- * dax_iomap_cow_copy - Copy the data from source to destination before write
+ * dax_iomap_copy_around - Prepare for an unaligned write to a shared/cow page
+ * by copying the data before and after the range to be written.
  * @pos:   address to do copy from.
  * @length:size of copy operation.
  * @align_size:aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
@@ -1101,35 +1102,50 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
  *
  * This can be called from two places. Either during DAX write fault (page
  * aligned), to copy the length size data to daddr. Or, while doing normal DAX
- * write operation, dax_iomap_actor() might call this to do the copy of either
+ * write operation, dax_iomap_iter() might call this to do the copy of either
  * start or end unaligned address. In the latter case the rest of the copy of
- * aligned ranges is taken care by dax_iomap_actor() itself.
+ * aligned ranges is taken care by dax_iomap_iter() itself.
+ * If the srcmap contains invalid data, such as HOLE and UNWRITTEN, zero the
+ * area to make sure no old data remains.
  */
-static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
+static int dax_iomap_copy_around(loff_t pos, uint64_t length, size_t align_size,
const struct iomap *srcmap, void *daddr)
 {
loff_t head_off = pos & (align_size - 1);
size_t size = ALIGN(head_off + length, align_size);
loff_t end = pos + length;
loff_t pg_end = round_up(end, align_size);
+   /* copy_all is usually in page fault case */
bool copy_all = head_off == 0 && end == pg_end;
+   /* zero the edges if srcmap is a HOLE or IOMAP_UNWRITTEN */
+   bool zero_edge = srcmap->flags & IOMAP_F_SHARED ||
+srcmap->type == IOMAP_UNWRITTEN;
void *saddr = 0;
int ret = 0;
 
-	ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
-   if (ret)
-   return ret;
+   if (!zero_edge) {
+		ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+   if (ret)
+   return ret;
+   }
 
if (copy_all) {
-   ret = copy_mc_to_kernel(daddr, saddr, length);
-   return ret ? -EIO : 0;
+   if (zero_edge)
+   memset(daddr, 0, size);
+   else
+   ret = copy_mc_to_kernel(daddr, saddr, length);
+   goto out;
}
 
/* Copy the head part of the range */
if (head_off) {
-   ret = copy_mc_to_kernel(daddr, saddr, head_off);
-   if (ret)
-   return -EIO;
+   if (zero_edge)
+   memset(daddr, 0, head_off);
+   else {
+   ret = copy_mc_to_kernel(daddr, saddr, head_off);
+   if (ret)
+   return -EIO;
+   }
}
 
/* Copy the tail part of the range */
@@ -1137,12 +1153,19 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
loff_t tail_off = head_off + length;
loff_t tail_len = pg_end - end;
 
-   ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
-   tail_len);
-   if (ret)
-   return -EIO;
+   if (zero_edge)
+   memset(daddr + tail_off, 0, tail_len);
+   else {
+   ret = copy_mc_to_kernel(daddr + tail_off,
+   saddr + tail_off, tail_len);
+   if (ret)
+   return -EIO;
+   }
}
-   return 0;
+out:
+   if (zero_edge)
+   dax_flush(srcmap->dax_dev, daddr, size);
+   return ret ? -EIO : 0;
 }
 
 /*
@@ -1241,13 +1264,10 @@ static int dax_memzero(struct iomap_iter *iter, loff_t pos, size_t size)
if (ret < 0)
return ret;
memset(kaddr + offset, 0, size);
-   if (srcmap->addr != iomap->addr) {
-   ret = dax

[PATCH v2.1 1/8] fsdax: introduce page->share for fsdax in reflink mode

2022-12-02 Thread Shiyang Ruan
An fsdax page is shared not only for CoW, but also for mapread.  To make
this easier to understand, use 'share' to indicate that the dax page is
shared by more than one extent, and add helper functions to use it.

Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.

Signed-off-by: Shiyang Ruan 
---
 fs/dax.c   | 38 ++
 include/linux/mm_types.h   |  5 -
 include/linux/page-flags.h |  2 +-
 3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1c6867810cbd..edbacb273ab5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -334,35 +334,41 @@ static unsigned long dax_end_pfn(void *entry)
for (pfn = dax_to_pfn(entry); \
pfn < dax_end_pfn(entry); pfn++)
 
-static inline bool dax_mapping_is_cow(struct address_space *mapping)
+static inline bool dax_page_is_shared(struct page *page)
 {
-   return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
+   return (unsigned long)page->mapping == PAGE_MAPPING_DAX_SHARED;
 }
 
 /*
- * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
+ * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
+ * refcount.
  */
-static inline void dax_mapping_set_cow(struct page *page)
+static inline void dax_page_bump_sharing(struct page *page)
 {
-   if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
+   if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_SHARED) {
/*
 * Reset the index if the page was already mapped
 * regularly before.
 */
if (page->mapping)
-   page->index = 1;
-   page->mapping = (void *)PAGE_MAPPING_DAX_COW;
+   page->share = 1;
+   page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;
}
-   page->index++;
+   page->share++;
+}
+
+static inline unsigned long dax_page_drop_sharing(struct page *page)
+{
+   return --page->share;
 }
 
 /*
- * When it is called in dax_insert_entry(), the cow flag will indicate that
+ * When it is called in dax_insert_entry(), the shared flag will indicate that
  * whether this entry is shared by multiple files.  If so, set the page->mapping
- * FS_DAX_MAPPING_COW, and use page->index as refcount.
+ * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
  */
 static void dax_associate_entry(void *entry, struct address_space *mapping,
-   struct vm_area_struct *vma, unsigned long address, bool cow)
+   struct vm_area_struct *vma, unsigned long address, bool shared)
 {
unsigned long size = dax_entry_size(entry), pfn, index;
int i = 0;
@@ -374,8 +380,8 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   if (cow) {
-   dax_mapping_set_cow(page);
+   if (shared) {
+   dax_page_bump_sharing(page);
} else {
WARN_ON_ONCE(page->mapping);
page->mapping = mapping;
@@ -396,9 +402,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
struct page *page = pfn_to_page(pfn);
 
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-   if (dax_mapping_is_cow(page->mapping)) {
-   /* keep the CoW flag if this page is still shared */
-   if (page->index-- > 0)
+   if (dax_page_is_shared(page)) {
+   /* keep the shared flag if this page is still shared */
+   if (dax_page_drop_sharing(page) > 0)
continue;
} else
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..f46cac3657ad 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -103,7 +103,10 @@ struct page {
};
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
-   pgoff_t index;  /* Our offset within mapping. */
+	union {
+		pgoff_t index;		/* Our offset within mapping. */
+		unsigned long share;	/* share count for fsdax */
+	};
/**
 * @private: Mapping-private opaque data.
 * Usually used for buffer_heads if PagePrivate.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0b0ae5084e60..c8a3aa02278d 100644
--- a/

Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-12-01 Thread Shiyang Ruan




On 2022/12/1 5:08, Darrick J. Wong wrote:

On Tue, Nov 29, 2022 at 11:05:30PM -0800, Dan Williams wrote:

Darrick J. Wong wrote:

On Tue, Nov 29, 2022 at 07:59:14PM -0800, Dan Williams wrote:

[ add Andrew ]

Shiyang Ruan wrote:

Many testcases failed in dax+reflink mode with warning messages in dmesg.
This also affects dax+noreflink mode if we run the test after a
dax+reflink test.  So, the most urgent thing is solving the warning
messages.

Patch 1 fixes some mistakes and adds handling of CoW cases not
previously considered (srcmap is HOLE or UNWRITTEN).
Patch 2 adds the implementation of unshare for fsdax.

With these fixes, most warning messages in dax_associate_entry() are
gone.  But honestly, generic/388 will still randomly fail with the warning.
The case shuts down xfs while fsstress is running, and does so many
times.  I think the reason is that dax pages in use cannot be
invalidated in time when the fs is shut down.  The next time a dax page is
to be associated, it still retains the mapping value set last time.  I'll
keep working on it.

The warning message in dax_writeback_one() can also be fixed because of
the dax unshare.


Thank you for digging in on this, I had been pinned down on CXL tasks
and worried that we would need to mark FS_DAX broken for a cycle, so
this is timely.

My only concern is that these patches look to have significant collisions with
the fsdax page reference counting reworks pending in linux-next. Although,
those are still sitting in mm-unstable:

http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org

My preference would be to move ahead with both in which case I can help
rebase these fixes on top. In that scenario everything would go through
Andrew.

However, if we are getting too late in the cycle for that path I think
these dax-fixes take precedence, and one more cycle to let the page
reference count reworks sit is ok.


Well now that raises some interesting questions -- dax and reflink are
totally broken on 6.1.  I was thinking about cramming them into 6.2 as a
data corruption fix on the grounds that is not an acceptable state of
affairs.


I agree it's not an acceptable state of affairs, but for 6.1 the answer
may be to just revert to dax+reflink being forbidden again. The fact
that no end user has noticed is probably a good sign that we can disable
that without any one screaming. That may be the easy answer for 6.2 as
well given how late this all is.


OTOH we're past -rc7, which is **really late** to be changing core code.
Then again, there aren't so many fsdax users and nobody's complained
about 6.0/6.1 being busted, so perhaps the risk of regression isn't so
bad?  Then again, that could be a sign that this could wait, if you and
Andrew are really eager to merge the reworks.


The page reference counting has also been languishing for a long time. A
6.2 merge would be nice, it relieves maintenance burden, but they do not
start to have real end user implications until CXL memory hotplug
platforms arrive and the warts in the reference counting start to show
real problems in production.


Hm.  How bad *would* it be to rebase that patchset atop this one?

After overnight testing on -rc7 it looks like Ruan's patchset fixes all
the problems AFAICT.  Most of the remaining regressions are to mask off
fragmentation testing because fsdax cow (like the directio write paths)
doesn't make much use of extent size hints.


Just looking at the stuff that's still broken with dax+reflink -- I
noticed that xfs/550-552 (aka the dax poison tests) are still regressing
on reflink filesystems.


That's worrying because the whole point of reworking dax, xfs, and
mm/memory-failure all at once was to handle the collision of poison and
reflink'd dax files.


I just tried out -rc7 and all three pass, so disregard this please.


So, uh, what would this patchset need to change if the "fsdax page
reference counting reworks" were applied?  Would it be changing the page
refcount instead of stashing that in page->index?


Nah, it's things like switching from pages to folios and shifting how
dax goes from pfns to pages.

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-unstable=cca48ba3196

Ideally fsdax would never deal in pfns at all and do everything in terms
of offsets relative to a 'struct dax_device'.

My gut is saying these patches, the refcount reworks, and the
dax+reflink fixes, are important but not end user critical. One more
status quo release does not hurt, and we can circle back to get this all
straightened early in v6.3.


Being a data corruption fix, I don't see why we shouldn't revisit this
during the 6.2 cycle, even if it comes after merging the refcounting
stuff.

Question for Ruan: Would it be terribly difficult to push out a v2 with
the review comments applied so that we have something we can backport to
6.1; and then rebase the series atop 6.2-rc1 so we can apply it to
upstream (and then apply the 6.1 versi

[PATCH v2 8/8] xfs: remove restrictions for fsdax and reflink

2022-12-01 Thread Shiyang Ruan
Since the basic function for fsdax and reflink has been implemented,
remove their restrictions to allow wider testing.

Signed-off-by: Shiyang Ruan 
---
 fs/xfs/xfs_ioctl.c | 4 
 fs/xfs/xfs_iops.c  | 4 
 2 files changed, 8 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 1f783e979629..13f1b2add390 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1138,10 +1138,6 @@ xfs_ioctl_setattr_xflags(
if ((fa->fsx_xflags & FS_XFLAG_REALTIME) && xfs_is_reflink_inode(ip))
ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
 
-   /* Don't allow us to set DAX mode for a reflinked file for now. */
-   if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
-   return -EINVAL;
-
/* diflags2 only valid for v3 inodes. */
i_flags2 = xfs_flags2diflags2(ip, fa->fsx_xflags);
if (i_flags2 && !xfs_has_v3inodes(mp))
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2e10e1c66ad6..bf0495f7a5e1 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1185,10 +1185,6 @@ xfs_inode_supports_dax(
if (!S_ISREG(VFS_I(ip)->i_mode))
return false;
 
-   /* Only supported on non-reflinked files. */
-   if (xfs_is_reflink_inode(ip))
-   return false;
-
/* Block size must match page size */
if (mp->m_sb.sb_blocksize != PAGE_SIZE)
return false;
-- 
2.38.1




[PATCH v2 7/8] fsdax,xfs: port unshare to fsdax

2022-12-01 Thread Shiyang Ruan
Implement unshare in fsdax mode: copy data from srcmap to iomap.
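
For reference, this path is reached from userspace via fallocate(2); a
minimal sketch (hypothetical fd and range):
```
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* break sharing on [off, off+len); on a DAX file this ends up in
 * xfs_reflink_unshare() -> dax_file_unshare() below */
static int unshare_range(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_UNSHARE_RANGE, off, len);
}
```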

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c | 52 
 fs/xfs/xfs_reflink.c |  8 +--
 include/linux/dax.h  |  2 ++
 3 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 354be56750c2..a57e320e7971 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1244,6 +1244,58 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 }
 #endif /* CONFIG_FS_DAX_PMD */
 
+static s64 dax_unshare_iter(struct iomap_iter *iter)
+{
+   struct iomap *iomap = >iomap;
+   const struct iomap *srcmap = iomap_iter_srcmap(iter);
+   loff_t pos = iter->pos;
+   loff_t length = iomap_length(iter);
+   int id = 0;
+   s64 ret = 0;
+   void *daddr = NULL, *saddr = NULL;
+
+   /* don't bother with blocks that are not shared to start with */
+   if (!(iomap->flags & IOMAP_F_SHARED))
+   return length;
+   /* don't bother with holes or unwritten extents */
+   if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+   return length;
+
+   id = dax_read_lock();
+	ret = dax_iomap_direct_access(iomap, pos, length, &daddr, NULL);
+   if (ret < 0)
+   goto out_unlock;
+
+	ret = dax_iomap_direct_access(srcmap, pos, length, &saddr, NULL);
+   if (ret < 0)
+   goto out_unlock;
+
+   ret = copy_mc_to_kernel(daddr, saddr, length);
+   if (ret)
+   ret = -EIO;
+
+out_unlock:
+   dax_read_unlock(id);
+   return ret;
+}
+
+int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
+   const struct iomap_ops *ops)
+{
+   struct iomap_iter iter = {
+   .inode  = inode,
+   .pos= pos,
+   .len= len,
+   .flags  = IOMAP_WRITE | IOMAP_UNSHARE | IOMAP_DAX,
+   };
+   int ret;
+
+	while ((ret = iomap_iter(&iter, ops)) > 0)
+		iter.processed = dax_unshare_iter(&iter);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(dax_file_unshare);
+
 static int dax_memzero(struct iomap_iter *iter, loff_t pos, size_t size)
 {
const struct iomap *iomap = >iomap;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 93bdd25680bc..fe46bce8cae6 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1693,8 +1693,12 @@ xfs_reflink_unshare(
 
inode_dio_wait(inode);
 
-	error = iomap_file_unshare(inode, offset, len,
-			&xfs_buffered_write_iomap_ops);
+	if (IS_DAX(inode))
+		error = dax_file_unshare(inode, offset, len,
+				&xfs_dax_write_iomap_ops);
+	else
+		error = iomap_file_unshare(inode, offset, len,
+				&xfs_buffered_write_iomap_ops);
if (error)
goto out;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ba985333e26b..2b5ecb591059 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -205,6 +205,8 @@ static inline void dax_unlock_mapping_entry(struct address_space *mapping,
 }
 #endif
 
+int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
+   const struct iomap_ops *ops);
 int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
const struct iomap_ops *ops);
 int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-- 
2.38.1




[PATCH v2 6/8] xfs: use dax ops for zero and truncate in fsdax mode

2022-12-01 Thread Shiyang Ruan
Zero and truncate on a dax file may execute CoW, so use the dax ops,
which contain the end work for CoW.

Signed-off-by: Shiyang Ruan 
---
 fs/xfs/xfs_iomap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 881de99766ca..d9401d0300ad 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1370,7 +1370,7 @@ xfs_zero_range(
 
if (IS_DAX(inode))
return dax_zero_range(inode, pos, len, did_zero,
-				&xfs_direct_write_iomap_ops);
+				&xfs_dax_write_iomap_ops);
return iomap_zero_range(inode, pos, len, did_zero,
				&xfs_buffered_write_iomap_ops);
 }
@@ -1385,7 +1385,7 @@ xfs_truncate_page(
 
if (IS_DAX(inode))
return dax_truncate_page(inode, pos, did_zero,
-				&xfs_direct_write_iomap_ops);
+				&xfs_dax_write_iomap_ops);
return iomap_truncate_page(inode, pos, did_zero,
				   &xfs_buffered_write_iomap_ops);
 }
-- 
2.38.1




[PATCH v2 5/8] fsdax: dedupe: iter two files at the same time

2022-12-01 Thread Shiyang Ruan
The iomap_iter() on a range of one file may loop more than once.  In
this case, the inner dst_iter can update its iomap but the outer
src_iter can't.  This may cause wrong remapping in the filesystem.
Iterate the two files at the same time instead.

Signed-off-by: Shiyang Ruan 
---
 fs/dax.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f1eb59bee0b5..354be56750c2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1964,15 +1964,15 @@ int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
.len= len,
.flags  = IOMAP_DAX,
};
-   int ret;
+   int ret, compared = 0;
 
-	while ((ret = iomap_iter(&src_iter, ops)) > 0) {
-		while ((ret = iomap_iter(&dst_iter, ops)) > 0) {
-			dst_iter.processed = dax_range_compare_iter(&src_iter,
-					&dst_iter, len, same);
-		}
-		if (ret <= 0)
-			src_iter.processed = ret;
+	while ((ret = iomap_iter(&src_iter, ops)) > 0 &&
+	       (ret = iomap_iter(&dst_iter, ops)) > 0) {
+		compared = dax_range_compare_iter(&src_iter, &dst_iter, len,
+						  same);
+   if (compared < 0)
+   return ret;
+   src_iter.processed = dst_iter.processed = compared;
}
return ret;
 }
-- 
2.38.1




[PATCH v2 4/8] fsdax,xfs: set the shared flag when file extent is shared

2022-12-01 Thread Shiyang Ruan
If a dax page is shared, mapread at different offsets can also trigger
a page fault on the same dax page.  So, change the flag from "cow" to
"shared", and get the shared flag from the filesystem on read.

Signed-off-by: Shiyang Ruan 
---
 fs/dax.c   | 19 +++
 fs/xfs/xfs_iomap.c |  2 +-
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6b6e07ad8d80..f1eb59bee0b5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -846,12 +846,6 @@ static bool dax_fault_is_synchronous(const struct 
iomap_iter *iter,
(iter->iomap.flags & IOMAP_F_DIRTY);
 }
 
-static bool dax_fault_is_cow(const struct iomap_iter *iter)
-{
-   return (iter->flags & IOMAP_WRITE) &&
-   (iter->iomap.flags & IOMAP_F_SHARED);
-}
-
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -865,13 +859,14 @@ static void *dax_insert_entry(struct xa_state *xas, 
struct vm_fault *vmf,
 {
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
void *new_entry = dax_make_entry(pfn, flags);
-   bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
-   bool cow = dax_fault_is_cow(iter);
+   bool write = iter->flags & IOMAP_WRITE;
+   bool dirty = write && !dax_fault_is_synchronous(iter, vmf->vma);
+   bool shared = iter->iomap.flags & IOMAP_F_SHARED;
 
if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-   if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
+   if (shared || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
unsigned long index = xas->xa_index;
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
@@ -883,12 +878,12 @@ static void *dax_insert_entry(struct xa_state *xas, 
struct vm_fault *vmf,
 
xas_reset(xas);
xas_lock_irq(xas);
-   if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+   if (shared || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;
 
dax_disassociate_entry(entry, mapping, false);
dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
-   cow);
+   shared);
/*
 * Only swap our new entry into the page cache if the current
 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -908,7 +903,7 @@ static void *dax_insert_entry(struct xa_state *xas, struct 
vm_fault *vmf,
if (dirty)
xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
-   if (cow)
+   if (write && shared)
xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
 
xas_unlock_irq(xas);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 07da03976ec1..881de99766ca 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1215,7 +1215,7 @@ xfs_read_iomap_begin(
return error;
	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
			       &nimaps, 0);
-   if (!error && (flags & IOMAP_REPORT))
+   if (!error && ((flags & IOMAP_REPORT) || IS_DAX(inode)))
	error = xfs_reflink_trim_around_shared(ip, &imap, &shared);
xfs_iunlock(ip, lockmode);
 
-- 
2.38.1




[PATCH v2 3/8] fsdax: zero the edges if source is HOLE or UNWRITTEN

2022-12-01 Thread Shiyang Ruan
If srcmap contains invalid data, such as HOLE or UNWRITTEN, the dest
page should be zeroed.  Otherwise, since it's pmem, old data may remain
on the dest page and the result of CoW will be incorrect.

The function name is also not easy to understand; rename it to
"dax_iomap_copy_around()", which means it copies data around the range.

Signed-off-by: Shiyang Ruan 
---
 fs/dax.c | 78 ++--
 1 file changed, 48 insertions(+), 30 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 482dda85ccaf..6b6e07ad8d80 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1092,7 +1092,7 @@ static int dax_iomap_direct_access(const struct iomap 
*iomap, loff_t pos,
 }
 
 /**
- * dax_iomap_cow_copy - Copy the data from source to destination before write
+ * dax_iomap_copy_around - Copy the data from source to destination before 
write
  * @pos:   address to do copy from.
  * @length:size of copy operation.
  * @align_size:aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
@@ -1101,35 +1101,50 @@ static int dax_iomap_direct_access(const struct iomap 
*iomap, loff_t pos,
  *
  * This can be called from two places. Either during DAX write fault (page
  * aligned), to copy the length size data to daddr. Or, while doing normal DAX
- * write operation, dax_iomap_actor() might call this to do the copy of either
+ * write operation, dax_iomap_iter() might call this to do the copy of either
  * start or end unaligned address. In the latter case the rest of the copy of
- * aligned ranges is taken care by dax_iomap_actor() itself.
+ * aligned ranges is taken care by dax_iomap_iter() itself.
+ * If the srcmap contains invalid data, such as HOLE and UNWRITTEN, zero the
+ * area to make sure no old data remains.
  */
-static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
+static int dax_iomap_copy_around(loff_t pos, uint64_t length, size_t 
align_size,
const struct iomap *srcmap, void *daddr)
 {
loff_t head_off = pos & (align_size - 1);
size_t size = ALIGN(head_off + length, align_size);
loff_t end = pos + length;
loff_t pg_end = round_up(end, align_size);
+   /* copy_all is usually in page fault case */
bool copy_all = head_off == 0 && end == pg_end;
+   /* zero the edges if srcmap is a HOLE or IOMAP_UNWRITTEN */
+   bool zero_edge = srcmap->flags & IOMAP_F_SHARED ||
+srcmap->type == IOMAP_UNWRITTEN;
void *saddr = 0;
int ret = 0;
 
-	ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
-   if (ret)
-   return ret;
+   if (!zero_edge) {
+		ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+   if (ret)
+   return ret;
+   }
 
if (copy_all) {
-   ret = copy_mc_to_kernel(daddr, saddr, length);
-   return ret ? -EIO : 0;
+   if (zero_edge)
+   memset(daddr, 0, size);
+   else
+   ret = copy_mc_to_kernel(daddr, saddr, length);
+   goto out;
}
 
/* Copy the head part of the range */
if (head_off) {
-   ret = copy_mc_to_kernel(daddr, saddr, head_off);
-   if (ret)
-   return -EIO;
+   if (zero_edge)
+   memset(daddr, 0, head_off);
+   else {
+   ret = copy_mc_to_kernel(daddr, saddr, head_off);
+   if (ret)
+   return -EIO;
+   }
}
 
/* Copy the tail part of the range */
@@ -1137,12 +1152,19 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t 
length, size_t align_size,
loff_t tail_off = head_off + length;
loff_t tail_len = pg_end - end;
 
-   ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
-   tail_len);
-   if (ret)
-   return -EIO;
+   if (zero_edge)
+   memset(daddr + tail_off, 0, tail_len);
+   else {
+   ret = copy_mc_to_kernel(daddr + tail_off,
+   saddr + tail_off, tail_len);
+   if (ret)
+   return -EIO;
+   }
}
-   return 0;
+out:
+   if (zero_edge)
+   dax_flush(srcmap->dax_dev, daddr, size);
+   return ret ? -EIO : 0;
 }
 
 /*
@@ -1241,13 +1263,10 @@ static int dax_memzero(struct iomap_iter *iter, loff_t 
pos, size_t size)
if (ret < 0)
return ret;
memset(kaddr + offset, 0, size);
-   if (srcmap->addr != iomap->addr) {
-   ret = dax_iomap_cow_copy(pos, size, PAGE_SIZE, srcmap,
-kaddr);
-  

[PATCH v2 2/8] fsdax: invalidate pages when CoW

2022-12-01 Thread Shiyang Ruan
CoW changes the share state of a dax page, but the share count of the
page isn't updated.  The next time this page is accessed, it should look
newly accessed, but the old association still exists.  So we need to
clear the share state when CoW happens, in both dax_iomap_rw() and
dax_zero_iter().

Signed-off-by: Shiyang Ruan 
---
 fs/dax.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 85b81963ea31..482dda85ccaf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1264,6 +1264,15 @@ static s64 dax_zero_iter(struct iomap_iter *iter, bool 
*did_zero)
if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
return length;
 
+   /*
+* invalidate the pages whose sharing state is to be changed
+* because of CoW.
+*/
+   if (iomap->flags & IOMAP_F_SHARED)
+   invalidate_inode_pages2_range(iter->inode->i_mapping,
+ pos >> PAGE_SHIFT,
+ (pos + length - 1) >> PAGE_SHIFT);
+
do {
unsigned offset = offset_in_page(pos);
unsigned size = min_t(u64, PAGE_SIZE - offset, length);
@@ -1324,12 +1333,13 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
struct iov_iter *iter)
 {
	const struct iomap *iomap = &iomi->iomap;
-	const struct iomap *srcmap = &iomi->srcmap;
+   const struct iomap *srcmap = iomap_iter_srcmap(iomi);
loff_t length = iomap_length(iomi);
loff_t pos = iomi->pos;
struct dax_device *dax_dev = iomap->dax_dev;
loff_t end = pos + length, done = 0;
bool write = iov_iter_rw(iter) == WRITE;
+   bool cow = write && iomap->flags & IOMAP_F_SHARED;
ssize_t ret = 0;
size_t xfer;
int id;
@@ -1356,7 +1366,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
 * into page tables. We have to tear down these mappings so that data
 * written by write(2) is visible in mmap.
 */
-   if (iomap->flags & IOMAP_F_NEW) {
+   if (iomap->flags & IOMAP_F_NEW || cow) {
invalidate_inode_pages2_range(iomi->inode->i_mapping,
  pos >> PAGE_SHIFT,
  (end - 1) >> PAGE_SHIFT);
@@ -1390,8 +1400,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
break;
}
 
-   if (write &&
-   srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+   if (cow) {
ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
 kaddr);
if (ret)
-- 
2.38.1




[PATCH v2 1/8] fsdax: introduce page->share for fsdax in reflink mode

2022-12-01 Thread Shiyang Ruan
An fsdax page is shared not only for CoW, but also for mapread.  To make
this easier to understand, use 'share' to indicate that the dax page is
shared by more than one extent, and add helper functions to manipulate
it.

Also, rename the flag to PAGE_MAPPING_DAX_SHARED.
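
A self-contained model of the union and the set-shared helper described
above (types and the flag value are simplified stand-ins, not the real
struct page):

```
#include <stdio.h>

#define PAGE_MAPPING_DAX_SHARED ((void *)1UL)

struct fake_page {
	void *mapping;
	union {
		unsigned long index;	/* offset within mapping */
		unsigned long share;	/* share count for fsdax */
	};
};

/* mirrors the logic of dax_mapping_set_shared() in the diff below */
static void set_shared(struct fake_page *p)
{
	if (p->mapping != PAGE_MAPPING_DAX_SHARED) {
		if (p->mapping)
			p->share = 1;	/* was mapped regularly before */
		p->mapping = PAGE_MAPPING_DAX_SHARED;
	}
	p->share++;
}

int main(void)
{
	struct fake_page p = { .mapping = NULL, .index = 0 };

	set_shared(&p);		/* first extent takes the page */
	set_shared(&p);		/* second extent shares it */
	printf("share count: %lu\n", p.share);	/* 2 */
	return 0;
}
```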

Signed-off-by: Shiyang Ruan 
---
 fs/dax.c   | 38 ++
 include/linux/mm_types.h   |  5 -
 include/linux/page-flags.h |  2 +-
 3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1c6867810cbd..85b81963ea31 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -334,35 +334,41 @@ static unsigned long dax_end_pfn(void *entry)
for (pfn = dax_to_pfn(entry); \
pfn < dax_end_pfn(entry); pfn++)
 
-static inline bool dax_mapping_is_cow(struct address_space *mapping)
+static inline bool dax_mapping_is_shared(struct page *page)
 {
-   return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
+   return (unsigned long)page->mapping == PAGE_MAPPING_DAX_SHARED;
 }
 
 /*
- * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
+ * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
+ * refcount.
  */
-static inline void dax_mapping_set_cow(struct page *page)
+static inline void dax_mapping_set_shared(struct page *page)
 {
-   if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
+   if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_SHARED) {
/*
 * Reset the index if the page was already mapped
 * regularly before.
 */
if (page->mapping)
-   page->index = 1;
-   page->mapping = (void *)PAGE_MAPPING_DAX_COW;
+   page->share = 1;
+   page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;
}
-   page->index++;
+   page->share++;
+}
+
+static inline unsigned long dax_mapping_decrease_shared(struct page *page)
+{
+   return --page->share;
 }
 
 /*
- * When it is called in dax_insert_entry(), the cow flag will indicate that
+ * When it is called in dax_insert_entry(), the shared flag will indicate that
  * whether this entry is shared by multiple files.  If so, set the 
page->mapping
- * FS_DAX_MAPPING_COW, and use page->index as refcount.
+ * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
  */
 static void dax_associate_entry(void *entry, struct address_space *mapping,
-   struct vm_area_struct *vma, unsigned long address, bool cow)
+   struct vm_area_struct *vma, unsigned long address, bool shared)
 {
unsigned long size = dax_entry_size(entry), pfn, index;
int i = 0;
@@ -374,8 +380,8 @@ static void dax_associate_entry(void *entry, struct 
address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   if (cow) {
-   dax_mapping_set_cow(page);
+   if (shared) {
+   dax_mapping_set_shared(page);
} else {
WARN_ON_ONCE(page->mapping);
page->mapping = mapping;
@@ -396,9 +402,9 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
struct page *page = pfn_to_page(pfn);
 
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-   if (dax_mapping_is_cow(page->mapping)) {
-   /* keep the CoW flag if this page is still shared */
-   if (page->index-- > 0)
+   if (dax_mapping_is_shared(page)) {
+   /* keep the shared flag if this page is still shared */
+   if (dax_mapping_decrease_shared(page) > 0)
continue;
} else
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..f46cac3657ad 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -103,7 +103,10 @@ struct page {
};
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
-   pgoff_t index;  /* Our offset within mapping. */
+   union {
+   pgoff_t index;  /* Our offset within 
mapping. */
+   unsigned long share;/* share count for 
fsdax */
+   };
/**
 * @private: Mapping-private opaque data.
 * Usually used for buffer_heads if PagePrivate.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0b0ae5084e60..c8a3aa02

[PATCH v2 0/8] fsdax,xfs: fix warning messages

2022-12-01 Thread Shiyang Ruan
Changes since v1:
 1. Added a snippet of the warning message and some of the failed cases
 2. Separated the patches for easier review
 3. Added page->share and its helper functions
 4. Included the patch[1] that removes the restrictions of fsdax and reflink
[1] 
https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.f...@fujitsu.com/

Many testcases fail in dax+reflink mode with warning messages in dmesg,
such as generic/051, 075 and 127.  The warning message looks like this:
[  775.509337] [ cut here ]
[  775.509636] WARNING: CPU: 1 PID: 16815 at fs/dax.c:386 
dax_insert_entry.cold+0x2e/0x69
[  775.510151] Modules linked in: auth_rpcgss oid_registry nfsv4 algif_hash 
af_alg af_packet nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject 
nft_ct nft_chain_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 ip_set nf_tables nfnetlink ip6table_filter ip6_tables 
iptable_filter ip_tables x_tables dax_pmem nd_pmem nd_btt sch_fq_codel configfs 
xfs libcrc32c fuse
[  775.524288] CPU: 1 PID: 16815 Comm: fsx Kdump: loaded Tainted: GW
  6.1.0-rc4+ #164 eb34e4ee4200c7cbbb47de2b1892c5a3e027fd6d
[  775.524904] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch 
Linux 1.16.0-3-3 04/01/2014
[  775.525460] RIP: 0010:dax_insert_entry.cold+0x2e/0x69
[  775.525797] Code: c7 c7 18 eb e0 81 48 89 4c 24 20 48 89 54 24 10 e8 73 6d 
ff ff 48 83 7d 18 00 48 8b 54 24 10 48 8b 4c 24 20 0f 84 e3 e9 b9 ff <0f> 0b e9 
dc e9 b9 ff 48 c7 c6 a0 20 c3 81 48 c7 c7 f0 ea e0 81 48
[  775.526708] RSP: :c90001d57b30 EFLAGS: 00010082
[  775.527042] RAX: 002a RBX:  RCX: 0042
[  775.527396] RDX: ea000a0f6c80 RSI: 81dfab1b RDI: 
[  775.527819] RBP: ea000a0f6c40 R08:  R09: 820625e0
[  775.528241] R10: c90001d579d8 R11: 820d2628 R12: 88815fc98320
[  775.528598] R13: c90001d57c18 R14:  R15: 0001
[  775.528997] FS:  7f39fc75d740() GS:88817bc8() 
knlGS:
[  775.529474] CS:  0010 DS:  ES:  CR0: 80050033
[  775.529800] CR2: 7f39fc772040 CR3: 000107eb6001 CR4: 003706e0
[  775.530214] DR0:  DR1:  DR2: 
[  775.530592] DR3:  DR6: fffe0ff0 DR7: 0400
[  775.531002] Call Trace:
[  775.531230]  
[  775.531444]  dax_fault_iter+0x267/0x6c0
[  775.531719]  dax_iomap_pte_fault+0x198/0x3d0
[  775.532002]  __xfs_filemap_fault+0x24a/0x2d0 [xfs 
aa8d25411432b306d9554da38096f4ebb86bdfe7]
[  775.532603]  __do_fault+0x30/0x1e0
[  775.532903]  do_fault+0x314/0x6c0
[  775.533166]  __handle_mm_fault+0x646/0x1250
[  775.533480]  handle_mm_fault+0xc1/0x230
[  775.533810]  do_user_addr_fault+0x1ac/0x610
[  775.534110]  exc_page_fault+0x63/0x140
[  775.534389]  asm_exc_page_fault+0x22/0x30
[  775.534678] RIP: 0033:0x7f39fc55820a
[  775.534950] Code: 00 01 00 00 00 74 99 83 f9 c0 0f 87 7b fe ff ff c5 fe 6f 
4e 20 48 29 fe 48 83 c7 3f 49 8d 0c 10 48 83 e7 c0 48 01 fe 48 29 f9  a4 c4 
c1 7e 7f 00 c4 c1 7e 7f 48 20 c5 f8 77 c3 0f 1f 44 00 00
[  775.535839] RSP: 002b:7ffc66a08118 EFLAGS: 00010202
[  775.536157] RAX: 7f39fc772001 RBX: 00042001 RCX: 63c1
[  775.536537] RDX: 6400 RSI: 7f39fac42050 RDI: 7f39fc772040
[  775.536919] RBP: 6400 R08: 7f39fc772001 R09: 00042000
[  775.537304] R10: 0001 R11: 0246 R12: 0001
[  775.537694] R13: 7f39fc772000 R14: 6401 R15: 0003
[  775.538086]  
[  775.538333] ---[ end trace  ]---

This also affects dax+noreflink mode if we run the test after a
dax+reflink test.  So the most urgent thing is to solve the warning
messages.

With these fixes, most warning messages in dax_associate_entry() are
gone.  But honestly, generic/388 still randomly fails with the warning.
The test shuts down the xfs while fsstress is running, and does so many
times.  I think the reason is that dax pages in use cannot be
invalidated in time when the fs is shut down.  The next time a dax page
is associated, it still holds the mapping value set last time.  I'll
keep working on it.

The warning message in dax_writeback_one() is also fixed thanks to the
dax unshare.


Shiyang Ruan (8):
  fsdax: introduce page->share for fsdax in reflink mode
  fsdax: invalidate pages when CoW
  fsdax: zero the edges if source is HOLE or UNWRITTEN
  fsdax,xfs: set the shared flag when file extent is shared
  fsdax: dedupe: iter two files at the same time
  xfs: use dax ops for zero and truncate in fsdax mode
  fsdax,xfs: port unshare to fsdax
  xfs: remove restrictions for fsdax and reflink

 fs/dax.c   | 220 +
 fs/xfs/xfs_ioctl.c |   4 -
 fs/xfs/xfs_iomap.c |   6 +-
 fs/xfs/xfs_iops.c   

Re: [PATCH 1/2] fsdax,xfs: fix warning messages at dax_[dis]associate_entry()

2022-11-30 Thread Shiyang Ruan




On 2022/11/30 12:08, Darrick J. Wong wrote:

On Thu, Nov 24, 2022 at 02:54:53PM +, Shiyang Ruan wrote:

This patch fixes the warning message reported in dax_associate_entry()
and dax_disassociate_entry().


Hmm, that's quite a bit to put in a single patch, but I'll try to get
through this...


Oh sorry...




1. reset page->mapping and ->index when refcount counting down to 0.
2. set IOMAP_F_SHARED flag when iomap read to allow one dax page to be
associated more than once for not only write but also read.


That makes sense, I think.


3. should zero the edge (when not aligned) if srcmap is HOLE or


When is IOMAP_F_SHARED set on the /source/ mapping?


In fs/xfs/xfs_iomap.c, xfs_direct_write_iomap_begin(), at the
out_found_cow label: srcmap is *not set* when the source extent is a
HOLE; only iomap is set, with the IOMAP_F_SHARED flag.


Now we come to the iomap iter: when we get the srcmap by calling
iomap_iter_srcmap(iter), the iomap itself is returned (because srcmap
isn't set).  So in this case srcmap == iomap, and we can treat the
source extent as a HOLE if srcmap->flags & IOMAP_F_SHARED != 0.
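
A compilable model of that fallback (the struct layout and flag values
here are simplified stand-ins for the iomap ones):

```
#include <stdbool.h>
#include <stdio.h>

#define IOMAP_F_SHARED	0x04
enum { IOMAP_HOLE, IOMAP_UNWRITTEN, IOMAP_MAPPED };

struct iomap { int type; unsigned flags; };
struct iter  { struct iomap iomap, srcmap; };

/* mirrors iomap_iter_srcmap(): fall back to iomap when srcmap is unset */
static const struct iomap *srcmap_of(const struct iter *it)
{
	return it->srcmap.type != IOMAP_HOLE ? &it->srcmap : &it->iomap;
}

int main(void)
{
	struct iter it = {
		.iomap	= { IOMAP_MAPPED, IOMAP_F_SHARED },	/* CoW dest */
		.srcmap	= { IOMAP_HOLE, 0 },			/* never set */
	};
	const struct iomap *src = srcmap_of(&it);
	bool zero_edge = (src->flags & IOMAP_F_SHARED) ||
			 src->type == IOMAP_UNWRITTEN;

	printf("treat source as invalid data: %s\n", zero_edge ? "yes" : "no");
	return 0;
}
```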





UNWRITTEN.
4. iterator of two files in dedupe should be executed side by side, not
nested.


Why?  Also, this seems like a separate change?


Explain below.




5. use xfs_dax_write_iomap_ops for xfs zero and truncate.


Makes sense.


Signed-off-by: Shiyang Ruan 
---
  fs/dax.c   | 114 ++---
  fs/xfs/xfs_iomap.c |   6 +--
  2 files changed, 69 insertions(+), 51 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1c6867810cbd..5ea7c0926b7f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -398,7 +398,7 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
if (dax_mapping_is_cow(page->mapping)) {
/* keep the CoW flag if this page is still shared */
-   if (page->index-- > 0)
+   if (page->index-- > 1)


Hmm.  So if the fsdax "page" sharing factor drops from 2 to 1, we'll now
null out the mapping and index?  Before, we only did that when it
dropped from 1 to 0.

Does this leave the page with no mapping?  And I guess a subsequent
access will now take a fault to map it back in?


I confused it with --page->index; the expression "page->index--"
evaluates to the old value of page->index.


So, assume:
this time, the refcount is 2, which is >1, so it is decremented to 1 and
we continue;
next time, the refcount is 1, not >1, so it is decremented to 0 and we
clear page->mapping.
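
In standalone C, just to pin down the operator semantics being discussed:

```
#include <stdio.h>

int main(void)
{
	unsigned long share = 2;
	unsigned long old = share--;	/* old == 2, share is now 1 */

	printf("share-- yielded %lu, share is now %lu\n", old, share);

	share = 2;
	printf("--share yields %lu\n", --share);	/* 1 */
	return 0;
}
```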






continue;
} else
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
@@ -840,12 +840,6 @@ static bool dax_fault_is_synchronous(const struct 
iomap_iter *iter,
(iter->iomap.flags & IOMAP_F_DIRTY);
  }
  
-static bool dax_fault_is_cow(const struct iomap_iter *iter)

-{
-   return (iter->flags & IOMAP_WRITE) &&
-   (iter->iomap.flags & IOMAP_F_SHARED);
-}
-
  /*
   * By this point grab_mapping_entry() has ensured that we have a locked entry
   * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -859,13 +853,14 @@ static void *dax_insert_entry(struct xa_state *xas, 
struct vm_fault *vmf,
  {
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
void *new_entry = dax_make_entry(pfn, flags);
-   bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
-   bool cow = dax_fault_is_cow(iter);
+   bool write = iter->flags & IOMAP_WRITE;
+   bool dirty = write && !dax_fault_is_synchronous(iter, vmf->vma);
+   bool shared = iter->iomap.flags & IOMAP_F_SHARED;
  
  	if (dirty)

__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
  
-	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {

+   if (shared || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {


Ah, ok, so now we're yanking the mapping if the extent is shared,
presumably so that...


unsigned long index = xas->xa_index;
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
@@ -877,12 +872,12 @@ static void *dax_insert_entry(struct xa_state *xas, 
struct vm_fault *vmf,
  
  	xas_reset(xas);

xas_lock_irq(xas);
-   if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+   if (shared || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;
  
  		dax_disassociate_entry(entry, mapping, false);

dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
-   cow);
+   shared);


...down here we can rebuild the association, bu

Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-27 Thread Shiyang Ruan




On 2022/11/28 2:38, Darrick J. Wong wrote:

On Thu, Nov 24, 2022 at 02:54:52PM +, Shiyang Ruan wrote:

Many testcases fail in dax+reflink mode with warning messages in dmesg.
This also affects dax+noreflink mode if we run the test after a
dax+reflink test.  So the most urgent thing is to solve the warning
messages.

Patch 1 fixes some mistakes and adds handling of CoW cases not
previously considered (srcmap is HOLE or UNWRITTEN).
Patch 2 adds the implementation of unshare for fsdax.

With these fixes, most warning messages in dax_associate_entry() are
gone.  But honestly, generic/388 will randomly failed with the warning.
The case shutdown the xfs when fsstress is running, and do it for many
times.  I think the reason is that dax pages in use are not able to be
invalidated in time when fs is shutdown.  The next time dax page to be
associated, it still remains the mapping value set last time.  I'll keep
on solving it.

The warning message in dax_writeback_one() can also be fixed because of
the dax unshare.


This cuts down the amount of test failures quite a bit, but I think
you're still missing a piece or two -- namely the part that refuses to
enable S_DAX mode on a reflinked file when the inode is being loaded
from disk.  However, thank you for fixing dax.c, because that was the
part I couldn't figure out at all. :)


I didn't include it[1] in this patchset...

[1] 
https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.f...@fujitsu.com/



--
Thanks,
Ruan.



--D



Shiyang Ruan (2):
   fsdax,xfs: fix warning messages at dax_[dis]associate_entry()
   fsdax,xfs: port unshare to fsdax

  fs/dax.c | 166 ++-
  fs/xfs/xfs_iomap.c   |   6 +-
  fs/xfs/xfs_reflink.c |   8 ++-
  include/linux/dax.h  |   2 +
  4 files changed, 129 insertions(+), 53 deletions(-)

--
2.38.1





[PATCH 2/2] fsdax,xfs: port unshare to fsdax

2022-11-24 Thread Shiyang Ruan
Implement unshare in fsdax mode: copy data from srcmap to iomap.

Signed-off-by: Shiyang Ruan 
---
 fs/dax.c | 52 
 fs/xfs/xfs_reflink.c |  8 +--
 include/linux/dax.h  |  2 ++
 3 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5ea7c0926b7f..3d0bf68ab6b0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1235,6 +1235,58 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
*xas, struct vm_fault *vmf,
 }
 #endif /* CONFIG_FS_DAX_PMD */
 
+static s64 dax_unshare_iter(struct iomap_iter *iter)
+{
+	struct iomap *iomap = &iter->iomap;
+   const struct iomap *srcmap = iomap_iter_srcmap(iter);
+   loff_t pos = iter->pos;
+   loff_t length = iomap_length(iter);
+   int id = 0;
+   s64 ret = 0;
+   void *daddr = NULL, *saddr = NULL;
+
+   /* don't bother with blocks that are not shared to start with */
+   if (!(iomap->flags & IOMAP_F_SHARED))
+   return length;
+   /* don't bother with holes or unwritten extents */
+   if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+   return length;
+
+   id = dax_read_lock();
+	ret = dax_iomap_direct_access(iomap, pos, length, &daddr, NULL);
+   if (ret < 0)
+   goto out_unlock;
+
+	ret = dax_iomap_direct_access(srcmap, pos, length, &saddr, NULL);
+   if (ret < 0)
+   goto out_unlock;
+
+   ret = copy_mc_to_kernel(daddr, saddr, length);
+   if (ret)
+   ret = -EIO;
+
+out_unlock:
+   dax_read_unlock(id);
+   return ret;
+}
+
+int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
+   const struct iomap_ops *ops)
+{
+   struct iomap_iter iter = {
+   .inode  = inode,
+   .pos= pos,
+   .len= len,
+   .flags  = IOMAP_WRITE | IOMAP_UNSHARE | IOMAP_DAX,
+   };
+   int ret;
+
+	while ((ret = iomap_iter(&iter, ops)) > 0)
+		iter.processed = dax_unshare_iter(&iter);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(dax_file_unshare);
+
 static int dax_memzero(struct iomap_iter *iter, loff_t pos, size_t size)
 {
	const struct iomap *iomap = &iter->iomap;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 93bdd25680bc..fe46bce8cae6 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1693,8 +1693,12 @@ xfs_reflink_unshare(
 
	inode_dio_wait(inode);

-	error = iomap_file_unshare(inode, offset, len,
-			&xfs_buffered_write_iomap_ops);
+	if (IS_DAX(inode))
+		error = dax_file_unshare(inode, offset, len,
+				&xfs_dax_write_iomap_ops);
+	else
+		error = iomap_file_unshare(inode, offset, len,
+				&xfs_buffered_write_iomap_ops);
if (error)
goto out;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ba985333e26b..2b5ecb591059 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -205,6 +205,8 @@ static inline void dax_unlock_mapping_entry(struct 
address_space *mapping,
 }
 #endif
 
+int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
+   const struct iomap_ops *ops);
 int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
const struct iomap_ops *ops);
 int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-- 
2.38.1




[PATCH 1/2] fsdax,xfs: fix warning messages at dax_[dis]associate_entry()

2022-11-24 Thread Shiyang Ruan
This patch fixes the warning messages reported in dax_associate_entry()
and dax_disassociate_entry().
1. reset page->mapping and ->index when the refcount counts down to 0.
2. set the IOMAP_F_SHARED flag on iomap reads to allow one dax page to
be associated more than once, for not only write but also read.
3. zero the edge (when not aligned) if srcmap is HOLE or
UNWRITTEN.
4. the iterators of the two files in dedupe should be executed side by
side, not nested.
5. use xfs_dax_write_iomap_ops for xfs zero and truncate.

Signed-off-by: Shiyang Ruan 
---
 fs/dax.c   | 114 ++---
 fs/xfs/xfs_iomap.c |   6 +--
 2 files changed, 69 insertions(+), 51 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1c6867810cbd..5ea7c0926b7f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -398,7 +398,7 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
if (dax_mapping_is_cow(page->mapping)) {
/* keep the CoW flag if this page is still shared */
-   if (page->index-- > 0)
+   if (page->index-- > 1)
continue;
} else
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
@@ -840,12 +840,6 @@ static bool dax_fault_is_synchronous(const struct 
iomap_iter *iter,
(iter->iomap.flags & IOMAP_F_DIRTY);
 }
 
-static bool dax_fault_is_cow(const struct iomap_iter *iter)
-{
-   return (iter->flags & IOMAP_WRITE) &&
-   (iter->iomap.flags & IOMAP_F_SHARED);
-}
-
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -859,13 +853,14 @@ static void *dax_insert_entry(struct xa_state *xas, 
struct vm_fault *vmf,
 {
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
void *new_entry = dax_make_entry(pfn, flags);
-   bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
-   bool cow = dax_fault_is_cow(iter);
+   bool write = iter->flags & IOMAP_WRITE;
+   bool dirty = write && !dax_fault_is_synchronous(iter, vmf->vma);
+   bool shared = iter->iomap.flags & IOMAP_F_SHARED;
 
if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-   if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
+   if (shared || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
unsigned long index = xas->xa_index;
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
@@ -877,12 +872,12 @@ static void *dax_insert_entry(struct xa_state *xas, 
struct vm_fault *vmf,
 
xas_reset(xas);
xas_lock_irq(xas);
-   if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+   if (shared || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;
 
dax_disassociate_entry(entry, mapping, false);
dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
-   cow);
+   shared);
/*
 * Only swap our new entry into the page cache if the current
 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -902,7 +897,7 @@ static void *dax_insert_entry(struct xa_state *xas, struct 
vm_fault *vmf,
if (dirty)
xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
-   if (cow)
+   if (write && shared)
xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
 
xas_unlock_irq(xas);
@@ -1107,23 +1102,35 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t 
length, size_t align_size,
loff_t end = pos + length;
loff_t pg_end = round_up(end, align_size);
bool copy_all = head_off == 0 && end == pg_end;
+   /* write zero at edge if srcmap is a HOLE or IOMAP_UNWRITTEN */
+   bool zero_edge = srcmap->flags & IOMAP_F_SHARED ||
+srcmap->type == IOMAP_UNWRITTEN;
void *saddr = 0;
int ret = 0;
 
-	ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
-   if (ret)
-   return ret;
+   if (!zero_edge) {
+		ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+   if (ret)
+   return ret;
+   }
 
if (copy_all) {
-   ret = copy_mc_to_kernel(daddr, saddr, length);
-   return ret ? -EIO : 0;
+   if (zero_edge)
+   memset(daddr, 0, size);
+   else
+   

[PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-24 Thread Shiyang Ruan
Many testcases fail in dax+reflink mode with warning messages in dmesg.
This also affects dax+noreflink mode if we run the test after a
dax+reflink test.  So the most urgent thing is to solve the warning
messages.

Patch 1 fixes some mistakes and adds handling of CoW cases not
previously considered (srcmap is HOLE or UNWRITTEN).
Patch 2 adds the implementation of unshare for fsdax.

With these fixes, most warning messages in dax_associate_entry() are
gone.  But honestly, generic/388 still randomly fails with the warning.
The test shuts down the xfs while fsstress is running, and does so many
times.  I think the reason is that dax pages in use cannot be
invalidated in time when the fs is shut down.  The next time a dax page
is associated, it still holds the mapping value set last time.  I'll
keep working on it.

The warning message in dax_writeback_one() is also fixed thanks to the
dax unshare.


Shiyang Ruan (2):
  fsdax,xfs: fix warning messages at dax_[dis]associate_entry()
  fsdax,xfs: port unshare to fsdax

 fs/dax.c | 166 ++-
 fs/xfs/xfs_iomap.c   |   6 +-
 fs/xfs/xfs_reflink.c |   8 ++-
 include/linux/dax.h  |   2 +
 4 files changed, 129 insertions(+), 53 deletions(-)

-- 
2.38.1




Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-10-19 Thread Shiyang Ruan




On 2022/10/20 8:01, Darrick J. Wong wrote:

On Sun, Oct 16, 2022 at 10:05:17PM +0800, Shiyang Ruan wrote:



On 2022/10/14 23:50, Darrick J. Wong wrote:

On Fri, Oct 14, 2022 at 10:24:29AM +0800, Shiyang Ruan wrote:



On 2022/10/14 2:30, Darrick J. Wong wrote:

On Thu, Sep 29, 2022 at 12:05:14PM -0700, Darrick J. Wong wrote:

On Wed, Sep 28, 2022 at 10:46:17PM +0800, Shiyang Ruan wrote:



...



FWIW I saw dmesg failures in xfs/517 and xfs/013 starting with 6.0-rc5,
and I haven't even turned on reflink yet:

run fstests xfs/517 at 2022-09-26 19:53:34
XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at your own 
risk!
XFS (pmem1): Mounting V5 Filesystem
XFS (pmem1): Ending clean mount
XFS (pmem1): Quotacheck needed: Please wait.
XFS (pmem1): Quotacheck: Done.
XFS (pmem1): Unmounting Filesystem
XFS (pmem0): EXPERIMENTAL online scrub feature in use. Use at your own risk!
XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at your own 
risk!
XFS (pmem1): Mounting V5 Filesystem
XFS (pmem1): Ending clean mount
XFS (pmem1): Quotacheck needed: Please wait.
XFS (pmem1): Quotacheck: Done.
[ cut here ]
WARNING: CPU: 1 PID: 415317 at fs/dax.c:380 dax_insert_entry+0x22d/0x320


Ping?

This time around I replaced the WARN_ON with this:

if (page->mapping)
printk(KERN_ERR "%s:%d ino 0x%lx index 0x%lx page 0x%llx mapping 0x%llx <- 
0x%llx\n", __func__, __LINE__, mapping->host->i_ino, index + i, (unsigned long long)page, 
(unsigned long long)page->mapping, (unsigned long long)mapping);

and promptly started seeing scary things like this:

[   37.576598] dax_associate_entry:381 ino 0x1807870 index 0x370 page 
0xea00133f1480 mapping 0x1 <- 0x888042fbb528
[   37.577570] dax_associate_entry:381 ino 0x1807870 index 0x371 page 
0xea00133f1500 mapping 0x1 <- 0x888042fbb528
[   37.698657] dax_associate_entry:381 ino 0x180044a index 0x5f8 page 
0xea0013244900 mapping 0x888042eaf128 <- 0x888042dda128
[   37.699349] dax_associate_entry:381 ino 0x800808 index 0x136 page 
0xea0013245640 mapping 0x888042eaf128 <- 0x888042d3ce28
[   37.699680] dax_associate_entry:381 ino 0x180044a index 0x5f9 page 
0xea0013245680 mapping 0x888042eaf128 <- 0x888042dda128
[   37.700684] dax_associate_entry:381 ino 0x800808 index 0x137 page 
0xea00132456c0 mapping 0x888042eaf128 <- 0x888042d3ce28
[   37.701611] dax_associate_entry:381 ino 0x180044a index 0x5fa page 
0xea0013245700 mapping 0x888042eaf128 <- 0x888042dda128
[   37.764126] dax_associate_entry:381 ino 0x103c52c index 0x28a page 
0xea001345afc0 mapping 0x1 <- 0x888019c14928
[   37.765078] dax_associate_entry:381 ino 0x103c52c index 0x28b page 
0xea001345b000 mapping 0x1 <- 0x888019c14928
[   39.193523] dax_associate_entry:381 ino 0x184657f index 0x124 page 
0xea000e2a4440 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.194692] dax_associate_entry:381 ino 0x184657f index 0x125 page 
0xea000e2a4480 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.195716] dax_associate_entry:381 ino 0x184657f index 0x126 page 
0xea000e2a44c0 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.196736] dax_associate_entry:381 ino 0x184657f index 0x127 page 
0xea000e2a4500 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.197906] dax_associate_entry:381 ino 0x184657f index 0x128 page 
0xea000e2a5040 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.198924] dax_associate_entry:381 ino 0x184657f index 0x129 page 
0xea000e2a5080 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.247053] dax_associate_entry:381 ino 0x5dd1e index 0x2d page 
0xea0015a0e640 mapping 0x1 <- 0x88804af88828
[   39.248006] dax_associate_entry:381 ino 0x5dd1e index 0x2e page 
0xea0015a0e680 mapping 0x1 <- 0x88804af88828
[   39.490880] dax_associate_entry:381 ino 0x1a9dc index 0x7d page 
0xea000e7012c0 mapping 0x888042fd1728 <- 0x88804afaec28
[   39.492038] dax_associate_entry:381 ino 0x1a9dc index 0x7e page 
0xea000e701300 mapping 0x888042fd1728 <- 0x88804afaec28
[   39.493099] dax_associate_entry:381 ino 0x1a9dc index 0x7f page 
0xea000e701340 mapping 0x888042fd1728 <- 0x88804afaec28
[   40.926247] dax_associate_entry:381 ino 0x182e265 index 0x54c page 
0xea0015da0840 mapping 0x1 <- 0x888019c0dd28
[   41.675459] dax_associate_entry:381 ino 0x15e5d index 0x29 page 
0xea000e4350c0 mapping 0x1 <- 0x888019c05828
[   41.676418] dax_associate_entry:381 ino 0x15e5d index 0x2a page 
0xea000e435100 mapping 0x1 <- 0x888019c05828
[   41.677352] dax_associate_entry:381 ino 0x15e5d index 0x2b page 
0xea000e435180 mapping 0x1 <- 0x888019c05828
[   41.678372] dax_associate_entry:381 ino 0x15e5d index 0x2c page 
0xea000e4351c0 mapping 0x1 <- 0x888019c05

Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-10-16 Thread Shiyang Ruan




On 2022/10/14 23:50, Darrick J. Wong wrote:

On Fri, Oct 14, 2022 at 10:24:29AM +0800, Shiyang Ruan wrote:



On 2022/10/14 2:30, Darrick J. Wong wrote:

On Thu, Sep 29, 2022 at 12:05:14PM -0700, Darrick J. Wong wrote:

On Wed, Sep 28, 2022 at 10:46:17PM +0800, Shiyang Ruan wrote:



...



FWIW I saw dmesg failures in xfs/517 and xfs/013 starting with 6.0-rc5,
and I haven't even turned on reflink yet:

run fstests xfs/517 at 2022-09-26 19:53:34
XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at your own 
risk!
XFS (pmem1): Mounting V5 Filesystem
XFS (pmem1): Ending clean mount
XFS (pmem1): Quotacheck needed: Please wait.
XFS (pmem1): Quotacheck: Done.
XFS (pmem1): Unmounting Filesystem
XFS (pmem0): EXPERIMENTAL online scrub feature in use. Use at your own risk!
XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at your own 
risk!
XFS (pmem1): Mounting V5 Filesystem
XFS (pmem1): Ending clean mount
XFS (pmem1): Quotacheck needed: Please wait.
XFS (pmem1): Quotacheck: Done.
[ cut here ]
WARNING: CPU: 1 PID: 415317 at fs/dax.c:380 dax_insert_entry+0x22d/0x320


Ping?

This time around I replaced the WARN_ON with this:

if (page->mapping)
printk(KERN_ERR "%s:%d ino 0x%lx index 0x%lx page 0x%llx mapping 0x%llx <- 
0x%llx\n", __func__, __LINE__, mapping->host->i_ino, index + i, (unsigned long long)page, 
(unsigned long long)page->mapping, (unsigned long long)mapping);

and promptly started seeing scary things like this:

[   37.576598] dax_associate_entry:381 ino 0x1807870 index 0x370 page 
0xea00133f1480 mapping 0x1 <- 0x888042fbb528
[   37.577570] dax_associate_entry:381 ino 0x1807870 index 0x371 page 
0xea00133f1500 mapping 0x1 <- 0x888042fbb528
[   37.698657] dax_associate_entry:381 ino 0x180044a index 0x5f8 page 
0xea0013244900 mapping 0x888042eaf128 <- 0x888042dda128
[   37.699349] dax_associate_entry:381 ino 0x800808 index 0x136 page 
0xea0013245640 mapping 0x888042eaf128 <- 0x888042d3ce28
[   37.699680] dax_associate_entry:381 ino 0x180044a index 0x5f9 page 
0xea0013245680 mapping 0x888042eaf128 <- 0x888042dda128
[   37.700684] dax_associate_entry:381 ino 0x800808 index 0x137 page 
0xea00132456c0 mapping 0x888042eaf128 <- 0x888042d3ce28
[   37.701611] dax_associate_entry:381 ino 0x180044a index 0x5fa page 
0xea0013245700 mapping 0x888042eaf128 <- 0x888042dda128
[   37.764126] dax_associate_entry:381 ino 0x103c52c index 0x28a page 
0xea001345afc0 mapping 0x1 <- 0x888019c14928
[   37.765078] dax_associate_entry:381 ino 0x103c52c index 0x28b page 
0xea001345b000 mapping 0x1 <- 0x888019c14928
[   39.193523] dax_associate_entry:381 ino 0x184657f index 0x124 page 
0xea000e2a4440 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.194692] dax_associate_entry:381 ino 0x184657f index 0x125 page 
0xea000e2a4480 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.195716] dax_associate_entry:381 ino 0x184657f index 0x126 page 
0xea000e2a44c0 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.196736] dax_associate_entry:381 ino 0x184657f index 0x127 page 
0xea000e2a4500 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.197906] dax_associate_entry:381 ino 0x184657f index 0x128 page 
0xea000e2a5040 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.198924] dax_associate_entry:381 ino 0x184657f index 0x129 page 
0xea000e2a5080 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.247053] dax_associate_entry:381 ino 0x5dd1e index 0x2d page 
0xea0015a0e640 mapping 0x1 <- 0x88804af88828
[   39.248006] dax_associate_entry:381 ino 0x5dd1e index 0x2e page 
0xea0015a0e680 mapping 0x1 <- 0x88804af88828
[   39.490880] dax_associate_entry:381 ino 0x1a9dc index 0x7d page 
0xea000e7012c0 mapping 0x888042fd1728 <- 0x88804afaec28
[   39.492038] dax_associate_entry:381 ino 0x1a9dc index 0x7e page 
0xea000e701300 mapping 0x888042fd1728 <- 0x88804afaec28
[   39.493099] dax_associate_entry:381 ino 0x1a9dc index 0x7f page 
0xea000e701340 mapping 0x888042fd1728 <- 0x88804afaec28
[   40.926247] dax_associate_entry:381 ino 0x182e265 index 0x54c page 
0xea0015da0840 mapping 0x1 <- 0x888019c0dd28
[   41.675459] dax_associate_entry:381 ino 0x15e5d index 0x29 page 
0xea000e4350c0 mapping 0x1 <- 0x888019c05828
[   41.676418] dax_associate_entry:381 ino 0x15e5d index 0x2a page 
0xea000e435100 mapping 0x1 <- 0x888019c05828
[   41.677352] dax_associate_entry:381 ino 0x15e5d index 0x2b page 
0xea000e435180 mapping 0x1 <- 0x888019c05828
[   41.678372] dax_associate_entry:381 ino 0x15e5d index 0x2c page 
0xea000e4351c0 mapping 0x1 <- 0x888019c05828
[   41.965026] dax_associate_entry:381 ino 0x185adb4 index 0x87 page 
0xea000e616d00 ma

Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-10-13 Thread Shiyang Ruan




On 2022/10/14 2:30, Darrick J. Wong wrote:

On Thu, Sep 29, 2022 at 12:05:14PM -0700, Darrick J. Wong wrote:

On Wed, Sep 28, 2022 at 10:46:17PM +0800, Shiyang Ruan wrote:



...



FWIW I saw dmesg failures in xfs/517 and xfs/013 starting with 6.0-rc5,
and I haven't even turned on reflink yet:

run fstests xfs/517 at 2022-09-26 19:53:34
XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at your own 
risk!
XFS (pmem1): Mounting V5 Filesystem
XFS (pmem1): Ending clean mount
XFS (pmem1): Quotacheck needed: Please wait.
XFS (pmem1): Quotacheck: Done.
XFS (pmem1): Unmounting Filesystem
XFS (pmem0): EXPERIMENTAL online scrub feature in use. Use at your own risk!
XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at your own 
risk!
XFS (pmem1): Mounting V5 Filesystem
XFS (pmem1): Ending clean mount
XFS (pmem1): Quotacheck needed: Please wait.
XFS (pmem1): Quotacheck: Done.
[ cut here ]
WARNING: CPU: 1 PID: 415317 at fs/dax.c:380 dax_insert_entry+0x22d/0x320


Ping?

This time around I replaced the WARN_ON with this:

if (page->mapping)
printk(KERN_ERR "%s:%d ino 0x%lx index 0x%lx page 0x%llx mapping 0x%llx <- 
0x%llx\n", __func__, __LINE__, mapping->host->i_ino, index + i, (unsigned long long)page, 
(unsigned long long)page->mapping, (unsigned long long)mapping);

and promptly started seeing scary things like this:

[   37.576598] dax_associate_entry:381 ino 0x1807870 index 0x370 page 
0xea00133f1480 mapping 0x1 <- 0x888042fbb528
[   37.577570] dax_associate_entry:381 ino 0x1807870 index 0x371 page 
0xea00133f1500 mapping 0x1 <- 0x888042fbb528
[   37.698657] dax_associate_entry:381 ino 0x180044a index 0x5f8 page 
0xea0013244900 mapping 0x888042eaf128 <- 0x888042dda128
[   37.699349] dax_associate_entry:381 ino 0x800808 index 0x136 page 
0xea0013245640 mapping 0x888042eaf128 <- 0x888042d3ce28
[   37.699680] dax_associate_entry:381 ino 0x180044a index 0x5f9 page 
0xea0013245680 mapping 0x888042eaf128 <- 0x888042dda128
[   37.700684] dax_associate_entry:381 ino 0x800808 index 0x137 page 
0xea00132456c0 mapping 0x888042eaf128 <- 0x888042d3ce28
[   37.701611] dax_associate_entry:381 ino 0x180044a index 0x5fa page 
0xea0013245700 mapping 0x888042eaf128 <- 0x888042dda128
[   37.764126] dax_associate_entry:381 ino 0x103c52c index 0x28a page 
0xea001345afc0 mapping 0x1 <- 0x888019c14928
[   37.765078] dax_associate_entry:381 ino 0x103c52c index 0x28b page 
0xea001345b000 mapping 0x1 <- 0x888019c14928
[   39.193523] dax_associate_entry:381 ino 0x184657f index 0x124 page 
0xea000e2a4440 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.194692] dax_associate_entry:381 ino 0x184657f index 0x125 page 
0xea000e2a4480 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.195716] dax_associate_entry:381 ino 0x184657f index 0x126 page 
0xea000e2a44c0 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.196736] dax_associate_entry:381 ino 0x184657f index 0x127 page 
0xea000e2a4500 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.197906] dax_associate_entry:381 ino 0x184657f index 0x128 page 
0xea000e2a5040 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.198924] dax_associate_entry:381 ino 0x184657f index 0x129 page 
0xea000e2a5080 mapping 0x8880120d7628 <- 0x888019ca3528
[   39.247053] dax_associate_entry:381 ino 0x5dd1e index 0x2d page 
0xea0015a0e640 mapping 0x1 <- 0x88804af88828
[   39.248006] dax_associate_entry:381 ino 0x5dd1e index 0x2e page 
0xea0015a0e680 mapping 0x1 <- 0x88804af88828
[   39.490880] dax_associate_entry:381 ino 0x1a9dc index 0x7d page 
0xea000e7012c0 mapping 0x888042fd1728 <- 0x88804afaec28
[   39.492038] dax_associate_entry:381 ino 0x1a9dc index 0x7e page 
0xea000e701300 mapping 0x888042fd1728 <- 0x88804afaec28
[   39.493099] dax_associate_entry:381 ino 0x1a9dc index 0x7f page 
0xea000e701340 mapping 0x888042fd1728 <- 0x88804afaec28
[   40.926247] dax_associate_entry:381 ino 0x182e265 index 0x54c page 
0xea0015da0840 mapping 0x1 <- 0x888019c0dd28
[   41.675459] dax_associate_entry:381 ino 0x15e5d index 0x29 page 
0xea000e4350c0 mapping 0x1 <- 0x888019c05828
[   41.676418] dax_associate_entry:381 ino 0x15e5d index 0x2a page 
0xea000e435100 mapping 0x1 <- 0x888019c05828
[   41.677352] dax_associate_entry:381 ino 0x15e5d index 0x2b page 
0xea000e435180 mapping 0x1 <- 0x888019c05828
[   41.678372] dax_associate_entry:381 ino 0x15e5d index 0x2c page 
0xea000e4351c0 mapping 0x1 <- 0x888019c05828
[   41.965026] dax_associate_entry:381 ino 0x185adb4 index 0x87 page 
0xea000e616d00 mapping 0x1 <- 0x88801a83b528
[   41.966065] dax_associate_entry:381 ino 0x185adb4 index 0x88 page 
0xe

Re: [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-10-13 Thread Shiyang Ruan

Ping again~

On 2022/9/30 11:28, Shiyang Ruan wrote:

Hi,

Ping

On 2022/9/25 21:33, Shiyang Ruan wrote:

Changes since v8:
   1. P2: rename drop_pagecache_sb() to super_drop_pagecache().
   2. P2: let super_drop_pagecache() accept invalidate method.
   3. P3: invalidate all dax mappings by invalidate_inode_pages2().
   4. P3: shutdown the filesystem when it is to be removed.
   5. Rebase on 6.0-rc6 + Darrick's patch[1] + Dan's patch[2].

[1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
[2]: 
https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/


Shiyang Ruan (3):
   xfs: fix the calculation of length and end
   fs: move drop_pagecache_sb() for others to use
   mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

  drivers/dax/super.c |  3 ++-
  fs/drop_caches.c    | 35 ++
  fs/super.c  | 43 +
  fs/xfs/xfs_notify_failure.c | 36 ++-
  include/linux/fs.h  |  1 +
  include/linux/mm.h  |  1 +
  include/linux/pagemap.h |  1 +
  mm/truncate.c   | 20 +++--
  8 files changed, 99 insertions(+), 41 deletions(-)





Re: [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-29 Thread Shiyang Ruan

Hi,

Ping

On 2022/9/25 21:33, Shiyang Ruan wrote:

Changes since v8:
   1. P2: rename drop_pagecache_sb() to super_drop_pagecache().
   2. P2: let super_drop_pagecache() accept invalidate method.
   3. P3: invalidate all dax mappings by invalidate_inode_pages2().
   4. P3: shutdown the filesystem when it is to be removed.
   5. Rebase on 6.0-rc6 + Darrick's patch[1] + Dan's patch[2].

[1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
[2]: 
https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/

Shiyang Ruan (3):
   xfs: fix the calculation of length and end
   fs: move drop_pagecache_sb() for others to use
   mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

  drivers/dax/super.c |  3 ++-
  fs/drop_caches.c| 35 ++
  fs/super.c  | 43 +
  fs/xfs/xfs_notify_failure.c | 36 ++-
  include/linux/fs.h  |  1 +
  include/linux/mm.h  |  1 +
  include/linux/pagemap.h |  1 +
  mm/truncate.c   | 20 +++--
  8 files changed, 99 insertions(+), 41 deletions(-)





Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-09-28 Thread Shiyang Ruan




On 2022/9/28 7:51, Dave Chinner wrote:

On Tue, Sep 27, 2022 at 09:02:48AM -0700, Darrick J. Wong wrote:

On Tue, Sep 27, 2022 at 02:53:14PM +0800, Shiyang Ruan wrote:

...


I have tested these two modes many times:

xfs_dax mode did fail many cases.  (If you test with this "drop"
patch, some warnings around "dax_dedupe_file_range_compare()" won't occur
any more.)  I think the warning around "dax_disassociate_entry()" is a
concurrency problem.  Still looking into it.

But xfs_dax_noreflink didn't have so many failures, just 3 in my
environment: generic/471 generic/519 xfs/148.  I am wondering whether you
forgot to reformat the TEST_DEV to be non-reflink before running the
test?  If so, that would make sense.


No, I did not forget to turn off reflink for the test device:

# ./run_check.sh --mkfs-opts "-m reflink=0,rmapbt=1" --run-opts "-s 
xfs_dax_noreflink -g auto"
umount: /mnt/test: not mounted.
umount: /mnt/scratch: not mounted.
wrote 8589934592/8589934592 bytes at offset 0
8.000 GiB, 8192 ops; 0:00:03.99 (2.001 GiB/sec and 2049.0850 ops/sec)
wrote 8589934592/8589934592 bytes at offset 0
8.000 GiB, 8192 ops; 0:00:04.13 (1.936 GiB/sec and 1982.5453 ops/sec)
meta-data=/dev/pmem0 isize=512agcount=4, agsize=524288 blks
  =   sectsz=4096  attr=2, projid32bit=1
  =   crc=1finobt=1, sparse=1, rmapbt=1
  =   reflink=0bigtime=1 inobtcount=1 nrext64=0
data =   bsize=4096   blocks=2097152, imaxpct=25
  =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0, ftype=1
log  =internal log   bsize=4096   blocks=16384, version=2
  =   sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
.
Running: MOUNT_OPTIONS= ./check -R xunit -b -s xfs_dax_noreflink -g auto
SECTION   -- xfs_dax_noreflink
FSTYP -- xfs (debug)
PLATFORM  -- Linux/x86_64 test3 6.0.0-rc6-dgc+ #1543 SMP PREEMPT_DYNAMIC 
Mon Sep 19 07:46:37 AEST 2022
MKFS_OPTIONS  -- -f -m reflink=0,rmapbt=1 /dev/pmem1
MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 
/dev/pmem1 /mnt/scratch

So, yeah, reflink was turned off on both test and scratch devices,
and dax=always on both the test and scratch devices was used to
ensure that DAX was always in use.



FWIW I saw dmesg failures in xfs/517 and xfs/013 starting with 6.0-rc5,
and I haven't even turned on reflink yet:

run fstests xfs/517 at 2022-09-26 19:53:34
XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at your own 
risk!
XFS (pmem1): Mounting V5 Filesystem
XFS (pmem1): Ending clean mount
XFS (pmem1): Quotacheck needed: Please wait.
XFS (pmem1): Quotacheck: Done.
XFS (pmem1): Unmounting Filesystem
XFS (pmem0): EXPERIMENTAL online scrub feature in use. Use at your own risk!
XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at your own 
risk!
XFS (pmem1): Mounting V5 Filesystem
XFS (pmem1): Ending clean mount
XFS (pmem1): Quotacheck needed: Please wait.
XFS (pmem1): Quotacheck: Done.
[ cut here ]
WARNING: CPU: 1 PID: 415317 at fs/dax.c:380 dax_insert_entry+0x22d/0x320
Modules linked in: xfs nft_chain_nat xt_REDIRECT nf_nat nf_conntrack 
nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT 
nf_reject_ipv4 xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat 
ip_set_hash_mac ip_set nf_tables libcrc32c bfq nfnetlink pvpanic_mmio pvpanic 
nd_pmem dax_pmem nd_btt sch_fq_codel fuse configfs ip_tables x_tables overlay 
nfsv4 af_packet [last unloaded: scsi_d

CPU: 1 PID: 415317 Comm: fsstress Tainted: GW  6.0.0-rc7-xfsx 
#rc7 727341edbd0773a36b78b09dab448fa1896eb3a5
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:dax_insert_entry+0x22d/0x320
Code: e0 48 83 c4 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 62 
ff ff ff 48 8b 58 20 48 8d 53 01 e9 4d ff ff ff <0f> 0b e9 6d ff ff ff 31 f6 48 
89 ef e8 72 74 12 00 eb a1 83 e0 02
RSP: :c90004693b28 EFLAGS: 00010002
RAX: ea0010a20480 RBX: 0001 RCX: 0001
RDX: ea00 RSI: 0033 RDI: ea0010a204c0
RBP: c90004693c08 R08:  R09: 
R10: 88800c226228 R11: 0001 R12: 0011
R13: 88800c226228 R14: c90004693e08 R15: 
FS:  7f3aad8db740() GS:88803ed0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f3aad8d1000 CR3: 43104003 CR4: 001706e0
Call Trace:
  
  dax_fault_iter+0x26e/0x670
  dax_iomap_pte_fault+0x1ab/0x3e0
  __xfs_filemap_fault+0x32f/0x5a0 [xfs c617487f99e14abfa5deb24e923415b927df3d4b]
  __do_fault+0x30/0x1e0
  do_fault+0x316/0x6d0
  ? 

Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-09-27 Thread Shiyang Ruan




On 2022/9/20 5:15, Dave Chinner wrote:

On Mon, Sep 19, 2022 at 02:50:03PM +1000, Dave Chinner wrote:

On Thu, Sep 15, 2022 at 09:26:42AM +, Shiyang Ruan wrote:

Since reflink and fsdax can now work together, the last obstacle has
been resolved.  It's time to remove the restrictions and drop this
warning.

Signed-off-by: Shiyang Ruan 


I haven't looked at reflink+DAX for some time, and I haven't tested
it for even longer. So I'm currently running a v6.0-rc6 kernel with
"-o dax=always" fstests run with reflink enabled and it's not
looking very promising.

All of the fsx tests are failing with data corruption, several
reflink/clone tests are failing with -EINVAL (e.g. g/16[45]) and
*lots* of tests are leaving stack traces from WARN() conditions in
DAx operations such as dax_insert_entry(), dax_disassociate_entry(),
dax_writeback_mapping_range(), iomap_iter() (called from
dax_dedupe_file_range_compare()), and so on.

At this point - the tests are still running - I'd guess that there's
going to be at least 50 test failures by the time it completes -
in comparison using "-o dax=never" results in just a single test
failure and a lot more tests actually being run.


The end results with dax+reflink were:

SECTION   -- xfs_dax
=

Failures: generic/051 generic/068 generic/074 generic/075
generic/083 generic/091 generic/112 generic/127 generic/164
generic/165 generic/175 generic/231 generic/232 generic/247
generic/269 generic/270 generic/327 generic/340 generic/388
generic/390 generic/413 generic/447 generic/461 generic/471
generic/476 generic/517 generic/519 generic/560 generic/561
generic/605 generic/617 generic/619 generic/630 generic/649
generic/650 generic/656 generic/670 generic/672 xfs/011 xfs/013
xfs/017 xfs/068 xfs/073 xfs/104 xfs/127 xfs/137 xfs/141 xfs/158
xfs/168 xfs/179 xfs/243 xfs/297 xfs/305 xfs/328 xfs/440 xfs/442
xfs/517 xfs/535 xfs/538 xfs/551 xfs/552
Failed 61 of 1071 tests

Ok, so I did a new no-reflink run as a baseline, because it is a
while since I've tested DAX at all:

SECTION   -- xfs_dax_noreflink
=
Failures: generic/051 generic/068 generic/074 generic/075
generic/083 generic/112 generic/231 generic/232 generic/269
generic/270 generic/340 generic/388 generic/461 generic/471
generic/476 generic/519 generic/560 generic/561 generic/617
generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297
xfs/305 xfs/517 xfs/538
Failed 29 of 1071 tests

Yeah, there's still lots of warnings from dax_insert_entry() and
friends like:

[43262.025815] WARNING: CPU: 9 PID: 1309428 at fs/dax.c:380 
dax_insert_entry+0x2ab/0x320
[43262.028355] Modules linked in:
[43262.029386] CPU: 9 PID: 1309428 Comm: fsstress Tainted: G W  
6.0.0-rc6-dgc+ #1543
[43262.032168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.15.0-1 04/01/2014
[43262.034840] RIP: 0010:dax_insert_entry+0x2ab/0x320
[43262.036358] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 
8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff 
ff ff 31 f6 4c 89 e7 e8 84 b1 5a 00 eb a4 48 81 e6
[43262.042255] RSP: 0018:c9000a0cbb78 EFLAGS: 00010002
[43262.043946] RAX: ea0018cd1fc0 RBX: 0001 RCX: 0001
[43262.046233] RDX: ea00 RSI: 0221 RDI: ea0018cd2000
[43262.048518] RBP: 0011 R08:  R09: 
[43262.050762] R10: 888241a6d318 R11: 0001 R12: c9000a0cbc58
[43262.053020] R13: 888241a6d318 R14: c9000a0cbe20 R15: 
[43262.055309] FS:  7f8ce25e2b80() GS:8885fec8() 
knlGS:
[43262.057859] CS:  0010 DS:  ES:  CR0: 80050033
[43262.059713] CR2: 7f8ce25e1000 CR3: 000152141001 CR4: 00060ee0
[43262.061993] Call Trace:
[43262.062836]  
[43262.063557]  dax_fault_iter+0x243/0x600
[43262.064802]  dax_iomap_pte_fault+0x199/0x360
[43262.066197]  __xfs_filemap_fault+0x1e3/0x2c0
[43262.067602]  __do_fault+0x31/0x1d0
[43262.068719]  __handle_mm_fault+0xd6d/0x1650
[43262.070083]  ? do_mmap+0x348/0x540
[43262.071200]  handle_mm_fault+0x7a/0x1d0
[43262.072449]  ? __kvm_handle_async_pf+0x12/0xb0
[43262.073908]  exc_page_fault+0x1d9/0x810
[43262.075123]  asm_exc_page_fault+0x22/0x30
[43262.076413] RIP: 0033:0x7f8ce268bc23

So it looks to me like DAX is well and truly broken in 6.0-rc6. And,
yes, I'm running the fixes in mm-hotfixes-stable branch that allow
xfs/550 to pass.


I have tested these two modes many times:

xfs_dax mode did fail many cases.  (If you test with this "drop" patch 
applied, some warnings around "dax_dedupe_file_range_compare()" won't 
occur any more.)  I think the warning around "dax_disassociate_entry()" 
is a concurrency problem.  Still looking into it.


But xfs_dax_noreflink didn't have so many failures, just 3 in my 
environment: Failures: generic/471 generic/519 xfs/148.  I a

[PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-25 Thread Shiyang Ruan
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem (or mapped device) on it to unmap all files in use and
notify the processes that are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()   # was pmem driver ->remove() in v1
-> kill_dax()
 -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
  -> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event, so that it does not shut down directly if something is not
supported, or if the failure range includes the metadata area.  Make
sure all files and processes are handled correctly.

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan 
---
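A minimal sketch, for context, of how a filesystem becomes the dax_holder
that kill_dax() notifies.  The fs_dax_get_by_bdev()/dax_holder_operations
hookup matches the kernels discussed in this thread; the helper name is
hypothetical:
```
#include <linux/dax.h>

/* ->notify_failure() is what dax_holder_notify_failure() invokes. */
static const struct dax_holder_operations xfs_dax_holder_operations = {
	.notify_failure		= xfs_dax_notify_failure,
};

/* Hypothetical helper: register mp as holder of the bdev's dax device. */
static struct dax_device *
example_bind_dax_holder(struct xfs_mount *mp, struct block_device *bdev,
		u64 *part_off)
{
	return fs_dax_get_by_bdev(bdev, part_off, mp,
			&xfs_dax_holder_operations);
}
```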
 drivers/dax/super.c |  3 ++-
 fs/xfs/xfs_notify_failure.c | 28 +++-
 include/linux/mm.h  |  1 +
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..cf9a64563fbe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
 
if (dax_dev->holder_data != NULL)
-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+   MF_MEM_PRE_REMOVE);
 
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 3830f908e215..5c1e678a1285 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 
 #include 
 #include 
+#include 
 
 struct xfs_failure_info {
xfs_agblock_t   startblock;
@@ -77,6 +78,9 @@ xfs_dax_failure_fn(
 
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+   /* The device is about to be removed.  Not really a failure. */
+   if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
notify->want_shutdown = true;
return 0;
}
@@ -168,7 +172,9 @@ xfs_dax_notify_ddev_failure(
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
if (!error)
error = -EFSCORRUPTED;
-   }
+   } else if (mf_flags & MF_MEM_PRE_REMOVE)
+   xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
return error;
 }
 
@@ -182,12 +188,24 @@ xfs_dax_notify_failure(
struct xfs_mount*mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+   int error;
 
if (!(mp->m_super->s_flags & SB_BORN)) {
xfs_warn(mp, "filesystem is not ready for notify_failure()!");
return -EIO;
}
 
+   if (mf_flags & MF_MEM_PRE_REMOVE) {
+   xfs_info(mp, "device is about to be removed!");
+   down_write(&mp->m_super->s_umount);
+   error = sync_filesystem(mp->m_super);
+   /* invalidate_inode_pages2() invalidates dax mapping */
+   super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
+   up_write(&mp->m_super->s_umount);
+   if (error)
+   return error;
+   }
+
if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
xfs_debug(mp,
 "notify_failure() not supported on realtime device!");
@@ -196,6 +214,8 @@ xfs_dax_notify_failure(
 
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -209,6 +229,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+   /* Notify failure on the whole device */
+   if (offset == 0 && len == U64_MAX) {
+   offset = ddev_start;
+   len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+   }
+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
diff --git a/incl

[PATCH 2/3] fs: move drop_pagecache_sb() for others to use

2022-09-25 Thread Shiyang Ruan
xfs_notify_failure.c requires a method to invalidate all dax mappings.
drop_pagecache_sb() can do this, but it is a static function and is only
built with CONFIG_SYSCTL.  Now, move it to super.c and make it available
for others, and use its second argument to choose which invalidate
method to use.

Signed-off-by: Shiyang Ruan 
---
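To illustrate the new second argument, a sketch assuming this series is
applied (example_drop_caches() is hypothetical, and invalidate_inode_pages()
is the wrapper the mm/truncate.c hunk adds):
```
/* The invalidator argument picks the invalidation strength. */
static void example_drop_caches(struct super_block *sb)
{
	/* Best-effort drop, as the drop_caches sysctl does. */
	super_drop_pagecache(sb, invalidate_inode_pages);

	/* Forced drop that also unmaps pages, as the fsdax pre-remove
	 * path needs. */
	super_drop_pagecache(sb, invalidate_inode_pages2);
}
```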
 fs/drop_caches.c| 35 ++---
 fs/super.c  | 43 +
 include/linux/fs.h  |  1 +
 include/linux/pagemap.h |  1 +
 mm/truncate.c   | 20 +--
 5 files changed, 65 insertions(+), 35 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index e619c31b6bd9..4c9281885077 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -15,38 +15,6 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb, void *unused)
-{
-   struct inode *inode, *toput_inode = NULL;
-
-   spin_lock(&sb->s_inode_list_lock);
-   list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-   spin_lock(&inode->i_lock);
-   /*
-* We must skip inodes in unusual state. We may also skip
-* inodes without pages but we deliberately won't in case
-* we need to reschedule to avoid softlockups.
-*/
-   if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-   (mapping_empty(inode->i_mapping) && !need_resched())) {
-   spin_unlock(&inode->i_lock);
-   continue;
-   }
-   __iget(inode);
-   spin_unlock(&inode->i_lock);
-   spin_unlock(&sb->s_inode_list_lock);
-
-   invalidate_mapping_pages(inode->i_mapping, 0, -1);
-   iput(toput_inode);
-   toput_inode = inode;
-
-   cond_resched();
-   spin_lock(&sb->s_inode_list_lock);
-   }
-   spin_unlock(&sb->s_inode_list_lock);
-   iput(toput_inode);
-}
-
 int drop_caches_sysctl_handler(struct ctl_table *table, int write,
void *buffer, size_t *length, loff_t *ppos)
 {
@@ -59,7 +27,8 @@ int drop_caches_sysctl_handler(struct ctl_table *table, int write,
static int stfu;
 
if (sysctl_drop_caches & 1) {
-   iterate_supers(drop_pagecache_sb, NULL);
+   iterate_supers(super_drop_pagecache,
+  invalidate_inode_pages);
count_vm_event(DROP_PAGECACHE);
}
if (sysctl_drop_caches & 2) {
diff --git a/fs/super.c b/fs/super.c
index 734ed584a946..7cdbf146bc31 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
@@ -677,6 +678,48 @@ void drop_super_exclusive(struct super_block *sb)
 }
 EXPORT_SYMBOL(drop_super_exclusive);
 
+/*
+ * super_drop_pagecache - drop all page caches of a filesystem
+ * @sb: superblock to invalidate
+ * @arg: invalidate method, such as invalidate_inode_pages(),
+ * invalidate_inode_pages2()
+ *
+ * Scans the inodes of a filesystem and drops all their page caches.
+ */
+void super_drop_pagecache(struct super_block *sb, void *arg)
+{
+   struct inode *inode, *toput_inode = NULL;
+   int (*invalidator)(struct address_space *) = arg;
+
+   spin_lock(&sb->s_inode_list_lock);
+   list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+   spin_lock(&inode->i_lock);
+   /*
+* We must skip inodes in unusual state. We may also skip
+* inodes without pages but we deliberately won't in case
+* we need to reschedule to avoid softlockups.
+*/
+   if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+   (mapping_empty(inode->i_mapping) && !need_resched())) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
+   __iget(inode);
+   spin_unlock(&inode->i_lock);
+   spin_unlock(&sb->s_inode_list_lock);
+
+   invalidator(inode->i_mapping);
+   iput(toput_inode);
+   toput_inode = inode;
+
+   cond_resched();
+   spin_lock(&sb->s_inode_list_lock);
+   }
+   spin_unlock(&sb->s_inode_list_lock);
+   iput(toput_inode);
+}
+EXPORT_SYMBOL(super_drop_pagecache);
+
 static void __iterate_supers(void (*f)(struct super_block *))
 {
struct super_block *sb, *p = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9eced4cc286e..0e60c494688e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3292,6 +3292,7 @@ extern struct super_block *get_super(struct block_device *);
 extern struct super_

[PATCH 1/3] xfs: fix the calculation of length and end

2022-09-25 Thread Shiyang Ruan
The end should be start + length - 1.  Also fix the calculation of the
length when seeking the intersection of the notify range and the device.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
---
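To see the off-by-one concretely (hypothetical numbers, assuming 512-byte
basic blocks and 4096-byte filesystem blocks):
```
/*
 * daddr = 0 and bblen = 8 basic blocks describe exactly fs block 0.
 *
 *   old: end_fsbno = XFS_DADDR_TO_FSB(mp, 0 + 8)      -> fs block 1 (one too far)
 *   new: end_fsbno = XFS_DADDR_TO_FSB(mp, 0 + 8 - 1)  -> fs block 0 (correct)
 *
 * The byte-range trimming below follows the same inclusive-end rule:
 * the last affected byte is offset + len - 1, not offset + len.
 */
```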
 fs/xfs/xfs_notify_failure.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index c4078d0ec108..3830f908e215 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -114,7 +114,7 @@ xfs_dax_notify_ddev_failure(
int error = 0;
xfs_fsblock_t   fsbno = XFS_DADDR_TO_FSB(mp, daddr);
xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
-   xfs_fsblock_t   end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
+   xfs_fsblock_t   end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen - 1);
xfs_agnumber_t  end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
 
error = xfs_trans_alloc_empty(mp, &tp);
@@ -210,7 +210,7 @@ xfs_dax_notify_failure(
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
/* Ignore the range out of filesystem area */
-   if (offset + len < ddev_start)
+   if (offset + len - 1 < ddev_start)
return -ENXIO;
if (offset > ddev_end)
return -ENXIO;
@@ -222,8 +222,8 @@ xfs_dax_notify_failure(
len -= ddev_start - offset;
offset = 0;
}
-   if (offset + len > ddev_end)
-   len -= ddev_end - offset;
+   if (offset + len - 1 > ddev_end)
+   len -= offset + len - 1 - ddev_end;
 
return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
mf_flags);
-- 
2.37.3




[PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-25 Thread Shiyang Ruan
Changes since v8:
  1. P2: rename drop_pagecache_sb() to super_drop_pagecache().
  2. P2: let super_drop_pagecache() accept invalidate method.
  3. P3: invalidate all dax mappings by invalidate_inode_pages2().
  4. P3: shutdown the filesystem when it is to be removed.
  5. Rebase on 6.0-rc6 + Darrick's patch[1] + Dan's patch[2].

[1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
[2]: 
https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/

Shiyang Ruan (3):
  xfs: fix the calculation of length and end
  fs: move drop_pagecache_sb() for others to use
  mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

 drivers/dax/super.c |  3 ++-
 fs/drop_caches.c| 35 ++
 fs/super.c  | 43 +
 fs/xfs/xfs_notify_failure.c | 36 ++-
 include/linux/fs.h  |  1 +
 include/linux/mm.h  |  1 +
 include/linux/pagemap.h |  1 +
 mm/truncate.c   | 20 +++--
 8 files changed, 99 insertions(+), 41 deletions(-)

-- 
2.37.3




Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-09-19 Thread Shiyang Ruan

Hi Dave,

On 2022/9/20 5:15, Dave Chinner wrote:

On Mon, Sep 19, 2022 at 02:50:03PM +1000, Dave Chinner wrote:

On Thu, Sep 15, 2022 at 09:26:42AM +, Shiyang Ruan wrote:

Since reflink and fsdax can now work together, the last obstacle has
been resolved.  It's time to remove the restrictions and drop this warning.

Signed-off-by: Shiyang Ruan 


I haven't looked at reflink+DAX for some time, and I haven't tested
it for even longer. So I'm currently running a v6.0-rc6 kernel with
"-o dax=always" fstests run with reflink enabled and it's not
looking very promising.

All of the fsx tests are failing with data corruption, several
reflink/clone tests are failing with -EINVAL (e.g. g/16[45]) and
*lots* of tests are leaving stack traces from WARN() conditions in
DAX operations such as dax_insert_entry(), dax_disassociate_entry(),
dax_writeback_mapping_range(), iomap_iter() (called from
dax_dedupe_file_range_compare()), and so on.

At this point - the tests are still running - I'd guess that there's
going to be at least 50 test failures by the time it completes -
in comparison using "-o dax=never" results in just a single test
failure and a lot more tests actually being run.


The end results with dax+reflink were:

SECTION   -- xfs_dax
=

Failures: generic/051 generic/068 generic/074 generic/075
generic/083 generic/091 generic/112 generic/127 generic/164
generic/165 generic/175 generic/231 generic/232 generic/247
generic/269 generic/270 generic/327 generic/340 generic/388
generic/390 generic/413 generic/447 generic/461 generic/471
generic/476 generic/517 generic/519 generic/560 generic/561
generic/605 generic/617 generic/619 generic/630 generic/649
generic/650 generic/656 generic/670 generic/672 xfs/011 xfs/013
xfs/017 xfs/068 xfs/073 xfs/104 xfs/127 xfs/137 xfs/141 xfs/158
xfs/168 xfs/179 xfs/243 xfs/297 xfs/305 xfs/328 xfs/440 xfs/442
xfs/517 xfs/535 xfs/538 xfs/551 xfs/552
Failed 61 of 1071 tests

Ok, so I did a new no-reflink run as a baseline, because it is a
while since I've tested DAX at all:

SECTION   -- xfs_dax_noreflink
=
Failures: generic/051 generic/068 generic/074 generic/075
generic/083 generic/112 generic/231 generic/232 generic/269
generic/270 generic/340 generic/388 generic/461 generic/471
generic/476 generic/519 generic/560 generic/561 generic/617
generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297
xfs/305 xfs/517 xfs/538
Failed 29 of 1071 tests

Yeah, there's still lots of warnings from dax_insert_entry() and
friends like:

[43262.025815] WARNING: CPU: 9 PID: 1309428 at fs/dax.c:380 
dax_insert_entry+0x2ab/0x320
[43262.028355] Modules linked in:
[43262.029386] CPU: 9 PID: 1309428 Comm: fsstress Tainted: G W  
6.0.0-rc6-dgc+ #1543
[43262.032168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.15.0-1 04/01/2014
[43262.034840] RIP: 0010:dax_insert_entry+0x2ab/0x320
[43262.036358] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 
8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff 
ff ff 31 f6 4c 89 e7 e8 84 b1 5a 00 eb a4 48 81 e6
[43262.042255] RSP: 0018:c9000a0cbb78 EFLAGS: 00010002
[43262.043946] RAX: ea0018cd1fc0 RBX: 0001 RCX: 0001
[43262.046233] RDX: ea00 RSI: 0221 RDI: ea0018cd2000
[43262.048518] RBP: 0011 R08:  R09: 
[43262.050762] R10: 888241a6d318 R11: 0001 R12: c9000a0cbc58
[43262.053020] R13: 888241a6d318 R14: c9000a0cbe20 R15: 
[43262.055309] FS:  7f8ce25e2b80() GS:8885fec8() 
knlGS:
[43262.057859] CS:  0010 DS:  ES:  CR0: 80050033
[43262.059713] CR2: 7f8ce25e1000 CR3: 000152141001 CR4: 00060ee0
[43262.061993] Call Trace:
[43262.062836]  
[43262.063557]  dax_fault_iter+0x243/0x600
[43262.064802]  dax_iomap_pte_fault+0x199/0x360
[43262.066197]  __xfs_filemap_fault+0x1e3/0x2c0
[43262.067602]  __do_fault+0x31/0x1d0
[43262.068719]  __handle_mm_fault+0xd6d/0x1650
[43262.070083]  ? do_mmap+0x348/0x540
[43262.071200]  handle_mm_fault+0x7a/0x1d0
[43262.072449]  ? __kvm_handle_async_pf+0x12/0xb0
[43262.073908]  exc_page_fault+0x1d9/0x810
[43262.075123]  asm_exc_page_fault+0x22/0x30
[43262.076413] RIP: 0033:0x7f8ce268bc23


Thanks for testing.  I just ran the xfstests and got these failures too. 
 The failure at dax_insert_entry() appeared during my development but 
was fixed before I sent the patchset.  Now I am looking into what's wrong 
with it.


BTW, which groups did you test?  I usually test the quick,clone groups.


--
Thanks,
Ruan.



So it looks to me like DAX is well and truly broken in 6.0-rc6. And,
yes, I'm running the fixes in mm-hotfixes-stable branch that allow
xfs/550 to pass.

Who is actually testing this DAX code, and what are they actually
testing on? These are not random failures - I haven't run DAX
tes

[RFC PATCH] xfs: drop experimental warning for fsdax

2022-09-15 Thread Shiyang Ruan
Since reflink and fsdax can now work together, the last obstacle has
been resolved.  It's time to remove the restrictions and drop this warning.

Signed-off-by: Shiyang Ruan 
---
 fs/xfs/xfs_ioctl.c | 4 
 fs/xfs/xfs_iops.c  | 4 
 fs/xfs/xfs_super.c | 1 -
 3 files changed, 9 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 1f783e979629..13f1b2add390 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1138,10 +1138,6 @@ xfs_ioctl_setattr_xflags(
if ((fa->fsx_xflags & FS_XFLAG_REALTIME) && xfs_is_reflink_inode(ip))
ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
 
-   /* Don't allow us to set DAX mode for a reflinked file for now. */
-   if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
-   return -EINVAL;
-
/* diflags2 only valid for v3 inodes. */
i_flags2 = xfs_flags2diflags2(ip, fa->fsx_xflags);
if (i_flags2 && !xfs_has_v3inodes(mp))
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 45518b8c613c..c2e9d7c74170 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1171,10 +1171,6 @@ xfs_inode_supports_dax(
if (!S_ISREG(VFS_I(ip)->i_mode))
return false;
 
-   /* Only supported on non-reflinked files. */
-   if (xfs_is_reflink_inode(ip))
-   return false;
-
/* Block size must match page size */
if (mp->m_sb.sb_blocksize != PAGE_SIZE)
return false;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9ac59814bbb6..fe7e24c353b9 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -358,7 +358,6 @@ xfs_setup_dax_always(
return -EINVAL;
}
 
-   xfs_warn(mp, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
return 0;
 
 disable_dax:
-- 
2.37.3




Re: [PATCH v8 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-14 Thread Shiyang Ruan




On 2022/9/15 2:15, Darrick J. Wong wrote:

On Wed, Sep 14, 2022 at 11:09:23AM -0700, Darrick J. Wong wrote:

On Wed, Sep 07, 2022 at 05:46:00PM +0800, Shiyang Ruan wrote:

ping

On 2022/9/2 18:35, Shiyang Ruan wrote:

Changes since v7:
1. Add P1 to fix calculation mistake
2. Add P2 to move drop_pagecache_sb() to super.c for xfs to use
3. P3: Add invalidate all mappings after sync.
4. P3: Set offset to be start of device when it is to be removed.
5. Rebase on 6.0-rc3 + Darrick's patch[1] + Dan's patch[2].

Changes since v6:
1. Rebase on 6.0-rc2 and Darrick's patch[1].

[1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
[2]: 
https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/


Just out of curiosity, is it your (or djbw's) intent to send all these
as bugfixes for 6.0 via akpm like all the other dax fixen?


Aha, this is 6.1 stuff, please ignore this question.


Actually I hope these patches can be merged ASAP. (But it seems a bit 
late for 6.0 now.)


And do you know which/whose branch has picked up your patch[1]?  I 
cannot find it.



--
Thanks,
Ruan.



--D


--D



Shiyang Ruan (3):
xfs: fix the calculation of length and end
fs: move drop_pagecache_sb() for others to use
mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

   drivers/dax/super.c |  3 ++-
   fs/drop_caches.c| 33 -
   fs/super.c  | 34 ++
   fs/xfs/xfs_notify_failure.c | 31 +++
   include/linux/fs.h  |  1 +
   include/linux/mm.h  |  1 +
   6 files changed, 65 insertions(+), 38 deletions(-)





Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-14 Thread Shiyang Ruan




On 2022/9/15 2:15, Darrick J. Wong wrote:

On Fri, Sep 02, 2022 at 10:36:01AM +, Shiyang Ruan wrote:

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem (or mapped device) on it to unmap all files in use and
notify the processes that are using those files.

Call trace:
trigger unbind
  -> unbind_store()
   -> ... (skip)
-> devres_release_all()   # was pmem driver ->remove() in v1
 -> kill_dax()
  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
   -> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event, so that it does not shut down directly if something is not
supported, or if the failure range includes the metadata area.  Make
sure all files and processes are handled correctly.

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan 
---
  drivers/dax/super.c |  3 ++-
  fs/xfs/xfs_notify_failure.c | 23 +++
  include/linux/mm.h  |  1 +
  3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..cf9a64563fbe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
  
  	if (dax_dev->holder_data != NULL)

-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+   MF_MEM_PRE_REMOVE);
  
  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);

synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 3830f908e215..5e04ba7fa403 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
  
  #include 

  #include 
+#include 
  
  struct xfs_failure_info {

xfs_agblock_t   startblock;
@@ -77,6 +78,9 @@ xfs_dax_failure_fn(
  
  	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||

(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+   /* The device is about to be removed.  Not really a failure. */
+   if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
notify->want_shutdown = true;
return 0;
}
@@ -182,12 +186,23 @@ xfs_dax_notify_failure(
struct xfs_mount*mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+   int error;
  
  	if (!(mp->m_super->s_flags & SB_BORN)) {

xfs_warn(mp, "filesystem is not ready for notify_failure()!");
return -EIO;
}
  
+	if (mf_flags & MF_MEM_PRE_REMOVE) {

+   xfs_info(mp, "device is about to be removed!");
+   down_write(&mp->m_super->s_umount);
+   error = sync_filesystem(mp->m_super);
+   drop_pagecache_sb(mp->m_super, NULL);
+   up_write(&mp->m_super->s_umount);
+   if (error)
+   return error;
+   }
+
if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
xfs_debug(mp,
 "notify_failure() not supported on realtime device!");
@@ -196,6 +211,8 @@ xfs_dax_notify_failure(
  
  	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&

mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -209,6 +226,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
  
+	/* Notify failure on the whole device */

+   if (offset == 0 && len == U64_MAX) {
+   offset = ddev_start;
+   len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+   }


I wonder, won't the trimming code below take care of this?


The len is U64_MAX, so 'offset + len - 1' will overflow.  That can't be 
handled correctly by the trimming code below.
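To make the wraparound concrete (illustrative values only):
```
u64 len = U64_MAX;
u64 offset = 4096;		/* any offset >= 2 wraps the inclusive end */
u64 end = offset + len - 1;	/* wraps around to 4094, "before" offset */

/* Range checks such as (offset + len - 1 < ddev_start) then misfire,
 * which is why the (0, U64_MAX) whole-device case must be rewritten to
 * the real device range before the trimming code runs. */
```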



--
Thanks,
Ruan.



The rest of the patch looks ok to me.

--D


+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..9122a1c57dd2 100644
--- a/include/l

Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-09-08 Thread Shiyang Ruan




On 2022/8/4 8:51, Darrick J. Wong wrote:

On Wed, Aug 03, 2022 at 06:47:24AM +, ruansy.f...@fujitsu.com wrote:


...



BTW, since these patches (dax + THIS + pmem-unbind) are
waiting to be merged, is it time to think about "removing the
experimental tag" again?  :)


It's probably time to take up that question again.

Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
and at least one of those dm layers no longer allows fsdax pass-through,
so XFS silently turned mount -o dax into -o dax=never. :(


Hi Darrick,

I tried generic/470 but it didn't run:
 [not run] Cannot use thin-pool devices on DAX capable block devices.

Did you modify the _require_dm_target() in common/rc?  I added thin-pool
so that it does not check the dax capability:

   case $target in
   stripe|linear|log-writes|thin-pool)  # add thin-pool here
   ;;

then the case finally ran and it silently turned off dax as you said.

Are the steps for reproduction correct? If so, I will continue to
investigate this problem.


Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
mention that.  I suspect that the removal of dm support for pmem is
going to force us to completely redesign this test.  I can't really
think of how, though, since there's no good way that I know of to gain a
point-in-time snapshot of a pmem device.


Hi Darrick,

  > removal of dm support for pmem
I think here we are talking about xfstests, which removed the support,
not the kernel?

I found some xfstests commits:
fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the restriction.
fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool

So, this case has never been able to run since the second commit?  (I didn't
notice the 'not run' case.  I thought it was expected not to run.)

And according to the first commit, the restriction was added because
some dm devices don't support dax.  So my understanding is: we should
redesign the case to make it work, and first, we should add dax
support for dm devices in the kernel.


dm devices used to have fsdax support; I think Christoph is actively
removing (or already has removed) all that support.


In addition, are there any other testcases with the same problem, so that
we can deal with them together?


The last I checked, there aren't any that require MAP_SYNC or pmem aside
from g/470 and the three poison notification tests that you sent a few
days ago.

--D



Hi Darrick, Brian

I made a little investigation on generic/470.

This case was able to run before thin-pool was introduced[1], but since 
then it has been 'Failed'/'Not Run' because thin-pool does not support 
DAX.  I have checked thin-pool's git log; it has never supported DAX.  
And it's not that someone removed the fsdax support.  So, I think it's 
not correct to bypass the requirement conditions by adding 'thin-pool' 
to _require_dm_target().


As far as I know, thin-pool was introduced (to provide discard zeroing) 
to prevent out-of-order replay of dm-log-writes.  Should we solve the 
'out-of-order replay' issue instead of avoiding it with thin-pool? @Brian


Besides, since it's not a fsdax problem, I think there is nothing that 
needs to be fixed in fsdax.  I'd like to help get it solved, but I'm 
still wondering if we could get back to the original topic ("Remove 
Experimental Tag") first? :)



[1] fc5870da485aec0f9196a0f2bed32f73f6b2c664 generic/470: use thin 
volume for dmlogwrites target device



--
Thanks,
Ruan.





Re: [PATCH v8 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-07 Thread Shiyang Ruan

ping

On 2022/9/2 18:35, Shiyang Ruan wrote:

Changes since v7:
   1. Add P1 to fix calculation mistake
   2. Add P2 to move drop_pagecache_sb() to super.c for xfs to use
   3. P3: Add invalidate all mappings after sync.
   4. P3: Set offset to be start of device when it is to be removed.
   5. Rebase on 6.0-rc3 + Darrick's patch[1] + Dan's patch[2].

Changes since v6:
   1. Rebase on 6.0-rc2 and Darrick's patch[1].

[1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
[2]: 
https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/

Shiyang Ruan (3):
   xfs: fix the calculation of length and end
   fs: move drop_pagecache_sb() for others to use
   mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

  drivers/dax/super.c |  3 ++-
  fs/drop_caches.c| 33 -
  fs/super.c  | 34 ++
  fs/xfs/xfs_notify_failure.c | 31 +++
  include/linux/fs.h  |  1 +
  include/linux/mm.h  |  1 +
  6 files changed, 65 insertions(+), 38 deletions(-)





[PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-02 Thread Shiyang Ruan
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem (or mapped device) on it to unmap all files in use and
notify the processes that are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()   # was pmem driver ->remove() in v1
-> kill_dax()
 -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
  -> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event, so that it does not shut down directly if something is not
supported, or if the failure range includes the metadata area.  Make
sure all files and processes are handled correctly.

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan 
---
 drivers/dax/super.c |  3 ++-
 fs/xfs/xfs_notify_failure.c | 23 +++
 include/linux/mm.h  |  1 +
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..cf9a64563fbe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
 
if (dax_dev->holder_data != NULL)
-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+   MF_MEM_PRE_REMOVE);
 
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 3830f908e215..5e04ba7fa403 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 
 #include 
 #include 
+#include 
 
 struct xfs_failure_info {
xfs_agblock_t   startblock;
@@ -77,6 +78,9 @@ xfs_dax_failure_fn(
 
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+   /* The device is about to be removed.  Not really a failure. */
+   if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
notify->want_shutdown = true;
return 0;
}
@@ -182,12 +186,23 @@ xfs_dax_notify_failure(
struct xfs_mount*mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+   int error;
 
if (!(mp->m_super->s_flags & SB_BORN)) {
xfs_warn(mp, "filesystem is not ready for notify_failure()!");
return -EIO;
}
 
+   if (mf_flags & MF_MEM_PRE_REMOVE) {
+   xfs_info(mp, "device is about to be removed!");
+   down_write(&mp->m_super->s_umount);
+   error = sync_filesystem(mp->m_super);
+   drop_pagecache_sb(mp->m_super, NULL);
+   up_write(&mp->m_super->s_umount);
+   if (error)
+   return error;
+   }
+
if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
xfs_debug(mp,
 "notify_failure() not supported on realtime device!");
@@ -196,6 +211,8 @@ xfs_dax_notify_failure(
 
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -209,6 +226,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+   /* Notify failure on the whole device */
+   if (offset == 0 && len == U64_MAX) {
+   offset = ddev_start;
+   len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+   }
+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..9122a1c57dd2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3183,6 +3183,7 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
MF_NO_RETRY = 1 << 6,
+   MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
  unsigned long count, int mf_flags);
-- 
2.37.2




[PATCH v8 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-02 Thread Shiyang Ruan
Changes since v7:
  1. Add P1 to fix calculation mistake
  2. Add P2 to move drop_pagecache_sb() to super.c for xfs to use
  3. P3: Add invalidate all mappings after sync.
  4. P3: Set offset to be start of device when it is to be removed.
  5. Rebase on 6.0-rc3 + Darrick's patch[1] + Dan's patch[2].

Changes since v6:
  1. Rebase on 6.0-rc2 and Darrick's patch[1].

[1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
[2]: 
https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/

Shiyang Ruan (3):
  xfs: fix the calculation of length and end
  fs: move drop_pagecache_sb() for others to use
  mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

 drivers/dax/super.c |  3 ++-
 fs/drop_caches.c| 33 -
 fs/super.c  | 34 ++
 fs/xfs/xfs_notify_failure.c | 31 +++
 include/linux/fs.h  |  1 +
 include/linux/mm.h  |  1 +
 6 files changed, 65 insertions(+), 38 deletions(-)

-- 
2.37.2




[PATCH 2/3] fs: move drop_pagecache_sb() for others to use

2022-09-02 Thread Shiyang Ruan
xfs_notify_failure requires a method to invalidate all mappings.
drop_pagecache_sb() can do this, but it is a static function and is only
built with CONFIG_SYSCTL.  Now, move it to super.c and make it available
for others.

Signed-off-by: Shiyang Ruan 
---
 fs/drop_caches.c   | 33 -
 fs/super.c | 34 ++
 include/linux/fs.h |  1 +
 3 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index e619c31b6bd9..5c8406076f9b 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -3,7 +3,6 @@
  * Implement the manual drop-all-pagecache function
  */
 
-#include 
 #include 
 #include 
 #include 
@@ -15,38 +14,6 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb, void *unused)
-{
-   struct inode *inode, *toput_inode = NULL;
-
-   spin_lock(&sb->s_inode_list_lock);
-   list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-   spin_lock(&inode->i_lock);
-   /*
-* We must skip inodes in unusual state. We may also skip
-* inodes without pages but we deliberately won't in case
-* we need to reschedule to avoid softlockups.
-*/
-   if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-   (mapping_empty(inode->i_mapping) && !need_resched())) {
-   spin_unlock(&inode->i_lock);
-   continue;
-   }
-   __iget(inode);
-   spin_unlock(&inode->i_lock);
-   spin_unlock(&sb->s_inode_list_lock);
-
-   invalidate_mapping_pages(inode->i_mapping, 0, -1);
-   iput(toput_inode);
-   toput_inode = inode;
-
-   cond_resched();
-   spin_lock(&sb->s_inode_list_lock);
-   }
-   spin_unlock(&sb->s_inode_list_lock);
-   iput(toput_inode);
-}
-
 int drop_caches_sysctl_handler(struct ctl_table *table, int write,
void *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/fs/super.c b/fs/super.c
index 734ed584a946..bdf53dbe834c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
@@ -677,6 +678,39 @@ void drop_super_exclusive(struct super_block *sb)
 }
 EXPORT_SYMBOL(drop_super_exclusive);
 
+void drop_pagecache_sb(struct super_block *sb, void *unused)
+{
+   struct inode *inode, *toput_inode = NULL;
+
+   spin_lock(&sb->s_inode_list_lock);
+   list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+   spin_lock(&inode->i_lock);
+   /*
+* We must skip inodes in unusual state. We may also skip
+* inodes without pages but we deliberately won't in case
+* we need to reschedule to avoid softlockups.
+*/
+   if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+   (mapping_empty(inode->i_mapping) && !need_resched())) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
+   __iget(inode);
+   spin_unlock(&inode->i_lock);
+   spin_unlock(&sb->s_inode_list_lock);
+
+   invalidate_mapping_pages(inode->i_mapping, 0, -1);
+   iput(toput_inode);
+   toput_inode = inode;
+
+   cond_resched();
+   spin_lock(&sb->s_inode_list_lock);
+   }
+   spin_unlock(&sb->s_inode_list_lock);
+   iput(toput_inode);
+}
+EXPORT_SYMBOL(drop_pagecache_sb);
+
 static void __iterate_supers(void (*f)(struct super_block *))
 {
struct super_block *sb, *p = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9eced4cc286e..5ded28c0d2c9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3292,6 +3292,7 @@ extern struct super_block *get_super(struct block_device *);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
+void drop_pagecache_sb(struct super_block *sb, void *unused);
 extern void iterate_supers(void (*)(struct super_block *, void *), void *);
 extern void iterate_supers_type(struct file_system_type *,
void (*)(struct super_block *, void *), void *);
-- 
2.37.2




[PATCH 1/3] xfs: fix the calculation of length and end

2022-09-02 Thread Shiyang Ruan
The end should be start + length - 1.  Also fix the calculation of the
length when seeking the intersection of the notify range and the device.

Signed-off-by: Shiyang Ruan 
---
 fs/xfs/xfs_notify_failure.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index c4078d0ec108..3830f908e215 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -114,7 +114,7 @@ xfs_dax_notify_ddev_failure(
int error = 0;
xfs_fsblock_t   fsbno = XFS_DADDR_TO_FSB(mp, daddr);
xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
-   xfs_fsblock_t   end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
+   xfs_fsblock_t   end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen - 1);
xfs_agnumber_t  end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
 
error = xfs_trans_alloc_empty(mp, &tp);
@@ -210,7 +210,7 @@ xfs_dax_notify_failure(
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
/* Ignore the range out of filesystem area */
-   if (offset + len < ddev_start)
+   if (offset + len - 1 < ddev_start)
return -ENXIO;
if (offset > ddev_end)
return -ENXIO;
@@ -222,8 +222,8 @@ xfs_dax_notify_failure(
len -= ddev_start - offset;
offset = 0;
}
-   if (offset + len > ddev_end)
-   len -= ddev_end - offset;
+   if (offset + len - 1 > ddev_end)
+   len -= offset + len - 1 - ddev_end;
 
return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
mf_flags);
-- 
2.37.2




Re: [PATCH v7] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-08-29 Thread Shiyang Ruan




On 2022/8/27 5:35, Dan Williams wrote:

Shiyang Ruan wrote:

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem (or mapped device) on it to unmap all files in use and
notify the processes that are using those files.

Call trace:
trigger unbind
   -> unbind_store()
-> ... (skip)
 -> devres_release_all()
  -> kill_dax()
   -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
-> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event, so that it does not shut down directly if something is not
supported, or if the failure range includes the metadata area.  Make
sure all files and processes are handled correctly.

==
Changes since v6:
1. Rebase on 6.0-rc2 and Darrick's patch[2].

Changes since v5:
1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
2. hold s_umount before sync_filesystem()
3. do sync_filesystem() after SB_BORN check
4. Rebased on next-20220714

[1]:
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
---
   drivers/dax/super.c |  3 ++-
   fs/xfs/xfs_notify_failure.c | 15 +++
   include/linux/mm.h  |  1 +
   3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..cf9a64563fbe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
if (dax_dev->holder_data != NULL)
-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+   MF_MEM_PRE_REMOVE);
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 65d5eb20878e..a9769f17e998 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -77,6 +77,9 @@ xfs_dax_failure_fn(
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+   /* Do not shut down so early when the device is to be removed */
+   if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
notify->want_shutdown = true;
return 0;
}
@@ -182,12 +185,22 @@ xfs_dax_notify_failure(
struct xfs_mount*mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+   int error;
if (!(mp->m_sb.sb_flags & SB_BORN)) {


How are you testing the SB_BORN interactions? I have a fix for this
pending here:

https://lore.kernel.org/nvdimm/166153428094.2758201.7936572520826540019.st...@dwillia2-xfh.jf.intel.com/


That was my mistake.  Yes, it should be mp->m_super->s_flags.

(I remember my testcase did pass in my dev version, but now that seems 
impossible.  I think something was wrong when I did the test.)





xfs_warn(mp, "filesystem is not ready for notify_failure()!");
return -EIO;
}
+   if (mf_flags & MF_MEM_PRE_REMOVE) {


It appears this patch is corrupted here. I confirmed that b4 sees the
same when trying to apply it.


Can't this patch be applied?  It is based on 6.0-rc2 + Darrick's patch. 
It's also ok to rebase on 6.0-rc3 + Darrick's patch.





+   xfs_info(mp, "device is about to be removed!");
+   down_write(&mp->m_super->s_umount);
+   error = sync_filesystem(mp->m_super);


This syncs to make data persistent, but for DAX this also needs to
invalidate all current DAX mappings. I do not see that in these changes.


I'll add it.


--
Thanks,
Ruan.




+   up_write(&mp->m_super->s_umount);
+   if (error)
+   return error;
+   }
+
if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
xfs_warn(mp,
 "notify_failure() not supported on realtime device!");
@@ -196,6 +209,8 @@ xfs_dax_notify_failure(
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTE

[PATCH v7] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-08-26 Thread Shiyang Ruan

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem (or mapped device) on it to unmap all files in use and
notify the processes that are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()
-> kill_dax()
 -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
  -> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event, so that it does not shut down directly if something is not
supported, or if the failure range includes the metadata area.  Make
sure all files and processes are handled correctly.

==
Changes since v6:
  1. Rebase on 6.0-rc2 and Darrick's patch[2].

Changes since v5:
  1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
  2. hold s_umount before sync_filesystem()
  3. do sync_filesystem() after SB_BORN check
  4. Rebased on next-20220714

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/

[2]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
---
 drivers/dax/super.c |  3 ++-
 fs/xfs/xfs_notify_failure.c | 15 +++
 include/linux/mm.h  |  1 +
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..cf9a64563fbe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
if (dax_dev->holder_data != NULL)
-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+   MF_MEM_PRE_REMOVE);
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 65d5eb20878e..a9769f17e998 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -77,6 +77,9 @@ xfs_dax_failure_fn(
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+   /* Do not shut down so early when the device is to be removed */
+   if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
notify->want_shutdown = true;
return 0;
}
@@ -182,12 +185,22 @@ xfs_dax_notify_failure(
struct xfs_mount*mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+   int error;
if (!(mp->m_sb.sb_flags & SB_BORN)) {
xfs_warn(mp, "filesystem is not ready for notify_failure()!");
return -EIO;
}
+   if (mf_flags & MF_MEM_PRE_REMOVE) {
+   xfs_info(mp, "device is about to be removed!");
+   down_write(&mp->m_super->s_umount);
+   error = sync_filesystem(mp->m_super);
+   up_write(&mp->m_super->s_umount);
+   if (error)
+   return error;
+   }
+
if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
xfs_warn(mp,
 "notify_failure() not supported on realtime device!");
@@ -196,6 +209,8 @@ xfs_dax_notify_failure(
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 982f2607180b..2c7c132e6512 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3176,6 +3176,7 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
MF_NO_RETRY = 1 << 6,
+   MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
  unsigned long count, int mf_flags);
--
2.37.2




Re: [RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-08-18 Thread Shiyang Ruan




On 2022/8/3 12:33, Darrick J. Wong wrote:

On Wed, Aug 03, 2022 at 02:43:20AM +, ruansy.f...@fujitsu.com wrote:


在 2022/7/19 6:56, Dan Williams 写道:

Darrick J. Wong wrote:

On Thu, Jul 14, 2022 at 11:21:44AM -0700, Dan Williams wrote:

ruansy.f...@fujitsu.com wrote:

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem (or mapped device) on it to unmap all files in use and
notify the processes that are using those files.

Call trace:
trigger unbind
   -> unbind_store()
-> ... (skip)
 -> devres_release_all()   # was pmem driver ->remove() in v1
  -> kill_dax()
   -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
-> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event, so that it does not shut down directly if something is not
supported, or if the failure range includes the metadata area.  Make
sure all files and processes are handled correctly.

==
Changes since v5:
1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
2. hold s_umount before sync_filesystem()
3. move sync_filesystem() after SB_BORN check
4. Rebased on next-20220714

Changes since v4:
1. sync_filesystem() at the beginning when MF_MEM_REMOVE
2. Rebased on next-20220706

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan 
---
   drivers/dax/super.c |  3 ++-
   fs/xfs/xfs_notify_failure.c | 15 +++
   include/linux/mm.h  |  1 +
   3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..cf9a64563fbe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
   
   	if (dax_dev->holder_data != NULL)

-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+   MF_MEM_PRE_REMOVE);
   
  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);

synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 69d9c83ea4b2..6da6747435eb 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -76,6 +76,9 @@ xfs_dax_failure_fn(
   
   	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||

(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+   /* Do not shut down so early when the device is to be removed */
+   if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
}
@@ -174,12 +177,22 @@ xfs_dax_notify_failure(
struct xfs_mount*mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+   int error;
   
   	if (!(mp->m_sb.sb_flags & SB_BORN)) {

xfs_warn(mp, "filesystem is not ready for notify_failure()!");
return -EIO;
}
   
+	if (mf_flags & MF_MEM_PRE_REMOVE) {

+   xfs_info(mp, "device is about to be removed!");
+   down_write(&mp->m_super->s_umount);
+   error = sync_filesystem(mp->m_super);
+   up_write(&mp->m_super->s_umount);


Are all mappings invalidated after this point?


No; all this step does is push dirty filesystem [meta]data to pmem
before we lose DAXDEV_ALIVE...


The goal of the removal notification is to invalidate all DAX mappings
that are now pointing to pfns that do not exist anymore, so just syncing
does not seem like enough, and the shutdown is skipped above. What am I
missing?


...however, the shutdown above only applies to filesystem metadata.  In
effect, we avoid the fs shutdown in MF_MEM_PRE_REMOVE mode, which
enables the mf_dax_kill_procs calls to proceed against mapped file data.
I have a nagging suspicion that in non-PREREMOVE mode, we can end up
shutting down the filesystem on an xattr block and the 'return
-EFSCORRUPTED' actually prevents us from reaching all the remaining file
data mappings.

IOWs, I think that clause above really ought to have returned zero so
that we keep the filesystem up while we're tearing down mappings, and
only call xfs_force_shutdown() after we've had a chance to let
xfs_dax_notify_ddev_failure() tear down all the mappings.

I missed that subtlety in the initial ~30 rounds of review, but I figure
at this point let's just land it in 5.20 and clean up that quirk for
-rc1.
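A sketch of the ordering Darrick describes (illustrative only, not the
actual patch; want_shutdown stands in for the accounting the later
xfs_failure_info versions carry):
```
/* Walk the rmap and tear down all the mappings first... */
error = xfs_dax_notify_ddev_failure(mp, daddr, bblen, mf_flags);

/* ...then shut down at most once, and never in PRE_REMOVE mode. */
if (want_shutdown && !(mf_flags & MF_MEM_PRE_REMOVE)) {
	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
	if (!error)
		error = -EFSCORRUPTED;
}
```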


Sure, this is a good baseline to incrementally improve.


Hi Dan, Darrick

Do I need to fix anything in this patch?  I'm not su

Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-08-03 Thread Shiyang Ruan




On 2022/8/4 8:51, Darrick J. Wong wrote:

On Wed, Aug 03, 2022 at 06:47:24AM +, ruansy.f...@fujitsu.com wrote:



On 2022/7/29 12:54, Darrick J. Wong wrote:

On Fri, Jul 29, 2022 at 03:55:24AM +, ruansy.f...@fujitsu.com wrote:



On 2022/7/22 0:16, Darrick J. Wong wrote:

On Thu, Jul 21, 2022 at 02:06:10PM +, ruansy.f...@fujitsu.com wrote:

On 2022/7/1 8:31, Darrick J. Wong wrote:

On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:

Failure notification is not supported on partitions.  So, when we mount
a reflink-enabled xfs on a partition with the dax option, let it fail
with -EINVAL.

Signed-off-by: Shiyang Ruan 


Looks good to me, though I think this patch applies to ... wherever all
those rmap+reflink+dax patches went.  I think that's akpm's tree, right?

Ideally this would go in through there to keep the pieces together, but
I don't mind tossing this in at the end of the 5.20 merge window if akpm
is unwilling.


BTW, since these patches (dax + THIS + pmem-unbind) are
waiting to be merged, is it time to think about "removing the
experimental tag" again?  :)


It's probably time to take up that question again.

Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
and at least one of those dm layers no longer allows fsdax pass-through,
so XFS silently turned mount -o dax into -o dax=never. :(


Hi Darrick,

I tried generic/470 but it didn't run:
 [not run] Cannot use thin-pool devices on DAX capable block devices.

Did you modify the _require_dm_target() in common/rc?  I added thin-pool
so that it does not check the dax capability:

   case $target in
   stripe|linear|log-writes|thin-pool)  # add thin-pool here
   ;;

then the case finally ran and it silently turned off dax as you said.

Are the steps for reproduction correct? If so, I will continue to
investigate this problem.


Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
mention that.  I suspect that the removal of dm support for pmem is
going to force us to completely redesign this test.  I can't really
think of how, though, since there's no good way that I know of to gain a
point-in-time snapshot of a pmem device.


Hi Darrick,

  > removal of dm support for pmem
I think here we are talking about xfstests, which removed the support,
not the kernel?

I found some xfstests commits:
fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the restriction.
fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool

So, this case has never been able to run since the second commit?  (I didn't
notice the 'not run' case.  I thought it was expected not to run.)

And according to the first commit, the restriction was added because
some dm devices don't support dax.  So my understanding is: we should
redesign the case to make it work, and first, we should add dax
support for dm devices in the kernel.


dm devices used to have fsdax support; I think Christoph is actively
removing (or already has removed) all that support.


In addition, are there any other testcases with the same problem, so that
we can deal with them together?


The last I checked, there aren't any that require MAP_SYNC or pmem aside
from g/470 and the three poison notification tests that you sent a few
days ago.


Ok.  Got it.  Thank you!


--
Ruan.



--D



--
Thanks,
Ruan




--D



--
Thanks,
Ruan.





I'm not sure how to fix that...

--D



--
Thanks,
Ruan.



Reviewed-by: Darrick J. Wong 

--D


---
 fs/xfs/xfs_super.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8495ef076ffc..a3c221841fa6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -348,8 +348,10 @@ xfs_setup_dax_always(
goto disable_dax;
}
 
-	if (xfs_has_reflink(mp)) {

-   xfs_alert(mp, "DAX and reflink cannot be used together!");
+   if (xfs_has_reflink(mp) &&
+   bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
+   xfs_alert(mp,
+   "DAX and reflink cannot work with multi-partitions!");
return -EINVAL;
}
 
--

2.36.1







[RFC PATCH v4] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-07-03 Thread Shiyang Ruan
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem (or mapped device) on it to unmap all files in use and
notify the processes that are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()   # was pmem driver ->remove() in v1
-> kill_dax()
 -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE)
  -> xfs_dax_notify_failure()

Introduce MF_MEM_REMOVE to let the filesystem know this is a remove event.
So do not shut down the filesystem directly if something is not supported,
or if the failure range includes the metadata area.  Make sure all files
and processes are handled correctly.
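
For illustration, a minimal sketch (not the actual XFS code) of a holder's
->notify_failure() handler that treats MF_MEM_REMOVE as a flush-and-detach
event rather than a media error; the my_*() names are hypothetical, while
dax_holder() and MF_MEM_REMOVE come from this patchset:
```
static int my_dax_notify_failure(struct dax_device *dax_dev,
		u64 offset, u64 len, int mf_flags)
{
	struct super_block *sb = dax_holder(dax_dev);

	if (mf_flags & MF_MEM_REMOVE) {
		/* Device is going away: write back what we can, no shutdown. */
		sync_filesystem(sb);
		return 0;
	}

	/* A real media failure: unmap the range and notify its users. */
	return my_unmap_and_kill_range(sb, offset, len, mf_flags);
}
```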

==
Changes since v3:
  1. Flush dirty files and logs when pmem is about to be removed.
  2. Rebased on next-20220701

Changes since v2:
  1. Rebased on next-20220615

Changes since v1:
  1. Drop the needless change of moving {kill,put}_dax()
  2. Rebased on '[PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink'[2]

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
[2]: 
https://lore.kernel.org/linux-xfs/20220508143620.1775214-1-ruansy.f...@fujitsu.com/

Signed-off-by: Shiyang Ruan 
---
 drivers/dax/super.c |  2 +-
 fs/xfs/xfs_notify_failure.c | 23 ++-
 include/linux/mm.h  |  1 +
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..d4bc83159d46 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,7 @@ void kill_dax(struct dax_device *dax_dev)
return;

if (dax_dev->holder_data != NULL)
-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE);

	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index aa8dc27c599c..269e21b3341c 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -18,6 +18,7 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_rtalloc.h"
 #include "xfs_trans.h"
+#include "xfs_log.h"

 #include 
 #include 
@@ -75,6 +76,10 @@ xfs_dax_failure_fn(

if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+   /* Do not shutdown so early when device is to be removed */
+   if (notify->mf_flags & MF_MEM_REMOVE) {
+   return 0;
+   }
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
}
@@ -168,6 +173,7 @@ xfs_dax_notify_failure(
	struct xfs_mount	*mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+   int error;

if (!(mp->m_sb.sb_flags & SB_BORN)) {
xfs_warn(mp, "filesystem is not ready for notify_failure()!");
@@ -182,6 +188,13 @@ xfs_dax_notify_failure(

if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_REMOVE) {
+   /* Flush the log since device is about to be removed. */
+   error = xfs_log_force(mp, XFS_LOG_SYNC);
+   if (error)
+   return error;
+   return -EOPNOTSUPP;
+   }
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -211,8 +224,16 @@ xfs_dax_notify_failure(
if (offset + len > ddev_end)
len -= ddev_end - offset;

-   return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
+   error = xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
mf_flags);
+   if (error)
+   return error;
+
+   if (mf_flags & MF_MEM_REMOVE) {
+   xfs_flush_inodes(mp);
+   error = xfs_log_force(mp, XFS_LOG_SYNC);
+   }
+   return error;
 }

 const struct dax_holder_operations xfs_dax_holder_operations = {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a2270e35a676..e66d23188323 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3236,6 +3236,7 @@ enum mf_flags {
MF_SOFT_OFFLINE = 1 << 3,
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
+   MF_MEM_REMOVE = 1 << 6,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
  unsigned long count, int mf_flags);
--
2.36.1






Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-06-30 Thread Shiyang Ruan




在 2022/7/1 8:31, Darrick J. Wong 写道:

On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:

Failure notification is not supported on partitions.  So, when we mount
a reflink enabled xfs on a partition with dax option, let it fail with
-EINVAL code.

Signed-off-by: Shiyang Ruan 


Looks good to me, though I think this patch applies to ... wherever all
those rmap+reflink+dax patches went.  I think that's akpm's tree, right?


Yes, they are in his tree, still in mm-unstable now.



Ideally this would go in through there to keep the pieces together, but
I don't mind tossing this in at the end of the 5.20 merge window if akpm
is unwilling.


Both are fine to me.  Thanks!


--
Ruan.



Reviewed-by: Darrick J. Wong 

--D


---
  fs/xfs/xfs_super.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8495ef076ffc..a3c221841fa6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -348,8 +348,10 @@ xfs_setup_dax_always(
goto disable_dax;
}
  
-	if (xfs_has_reflink(mp)) {
-		xfs_alert(mp, "DAX and reflink cannot be used together!");
+   if (xfs_has_reflink(mp) &&
+   bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
+   xfs_alert(mp,
+   "DAX and reflink cannot work with multi-partitions!");
return -EINVAL;
}
  
--

2.36.1









Re: [RFC PATCH v3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-06-23 Thread Shiyang Ruan




在 2022/6/23 0:49, Darrick J. Wong 写道:

On Wed, Jun 15, 2022 at 08:54:00PM +0800, Shiyang Ruan wrote:

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
(or mapped device) on it to unmap all files in use and notify processes
who are using those files.

Call trace:
trigger unbind
  -> unbind_store()
   -> ... (skip)
-> devres_release_all()   # was pmem driver ->remove() in v1
 -> kill_dax()
  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE)
   -> xfs_dax_notify_failure()

Introduce MF_MEM_REMOVE to let the filesystem know this is a remove event.
So do not shut down the filesystem directly if something is not supported,
or if the failure range includes the metadata area.  Make sure all files
and processes are handled correctly.

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan 

==
Changes since v2:
   1. Rebased on next-20220615

Changes since v1:
   1. Drop the needless change of moving {kill,put}_dax()
   2. Rebased on '[PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink'[2]

---
  drivers/dax/super.c | 2 +-
  fs/xfs/xfs_notify_failure.c | 6 +-
  include/linux/mm.h  | 1 +
  3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..d4bc83159d46 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,7 @@ void kill_dax(struct dax_device *dax_dev)
return;
  
  	if (dax_dev->holder_data != NULL)

-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE);


At the point we're initiating a MEM_REMOVE call, is the pmem already
gone, or is it about to be gone?


It's about to be gone.

I found two cases:
  1. exec `unbind` by user, who wants to unplug the pmem
  2. handle failures during initialization



  
  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
  	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index aa8dc27c599c..91d3f05d4241 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -73,7 +73,9 @@ xfs_dax_failure_fn(
struct failure_info *notify = data;
int error = 0;
  
-	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||

+   /* Do not shutdown so early when device is to be removed */
+   if (!(notify->mf_flags & MF_MEM_REMOVE) ||
+   XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -182,6 +184,8 @@ xfs_dax_notify_failure(
  
  	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&

mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_REMOVE)
+   return -EOPNOTSUPP;


The reason I ask is that if the pmem is *about to be* but not yet
removed from the system, shouldn't we at least try to flush all dirty
files and the log to reduce data loss and minimize recovery time?


Yes, they should be flushed.  Will add it.


--
Thanks,
Ruan.



If it's already gone, then you might as well shut down immediately,
unless there's a chance the pmem will come back(?)

--D


xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 623c2ee8330a..bbeb31883362 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3249,6 +3249,7 @@ enum mf_flags {
MF_SOFT_OFFLINE = 1 << 3,
MF_UNPOISON = 1 << 4,
MF_NO_RETRY = 1 << 5,
+   MF_MEM_REMOVE = 1 << 6,
  };
  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
  unsigned long count, int mf_flags);
--
2.36.1









[RFC PATCH v3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-06-15 Thread Shiyang Ruan
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
(or mapped device) on it to unmap all files in use and notify processes
who are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()   # was pmem driver ->remove() in v1
-> kill_dax()
 -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE)
  -> xfs_dax_notify_failure()

Introduce MF_MEM_REMOVE to let the filesystem know this is a remove event.
So do not shut down the filesystem directly if something is not supported,
or if the failure range includes the metadata area.  Make sure all files
and processes are handled correctly.

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan 

==
Changes since v2:
  1. Rebased on next-20220615

Changes since v1:
  1. Drop the needless change of moving {kill,put}_dax()
  2. Rebased on '[PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink'[2]

---
 drivers/dax/super.c | 2 +-
 fs/xfs/xfs_notify_failure.c | 6 +-
 include/linux/mm.h  | 1 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..d4bc83159d46 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,7 @@ void kill_dax(struct dax_device *dax_dev)
return;
 
if (dax_dev->holder_data != NULL)
-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE);
 
	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index aa8dc27c599c..91d3f05d4241 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -73,7 +73,9 @@ xfs_dax_failure_fn(
struct failure_info *notify = data;
int error = 0;
 
-   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
+   /* Do not shutdown so early when device is to be removed */
+   if (!(notify->mf_flags & MF_MEM_REMOVE) ||
+   XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -182,6 +184,8 @@ xfs_dax_notify_failure(
 
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_REMOVE)
+   return -EOPNOTSUPP;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 623c2ee8330a..bbeb31883362 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3249,6 +3249,7 @@ enum mf_flags {
MF_SOFT_OFFLINE = 1 << 3,
MF_UNPOISON = 1 << 4,
MF_NO_RETRY = 1 << 5,
+   MF_MEM_REMOVE = 1 << 6,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
  unsigned long count, int mf_flags);
-- 
2.36.1






[PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-06-09 Thread Shiyang Ruan
Failure notification is not supported on partitions.  So, when we mount
a reflink enabled xfs on a partition with dax option, let it fail with
-EINVAL code.

Signed-off-by: Shiyang Ruan 
---
 fs/xfs/xfs_super.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8495ef076ffc..a3c221841fa6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -348,8 +348,10 @@ xfs_setup_dax_always(
goto disable_dax;
}
 
-   if (xfs_has_reflink(mp)) {
-   xfs_alert(mp, "DAX and reflink cannot be used together!");
+   if (xfs_has_reflink(mp) &&
+   bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
+   xfs_alert(mp,
+   "DAX and reflink cannot work with multi-partitions!");
return -EINVAL;
}
 
-- 
2.36.1






[PATCH v2.1 08/14] fsdax: Output address in dax_iomap_pfn() and rename it

2022-06-07 Thread Shiyang Ruan
Add address output in dax_iomap_pfn() in order to perform a memcpy() in
the CoW case.  Since this function outputs both the address and the pfn,
rename it to dax_iomap_direct_access().

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ritesh Harjani 
Reviewed-by: Dan Williams 
Reviewed-by: Darrick J. Wong 

==
Hi Andrew,

As Dan mentioned[1], the rc should be initialized.  I fixed it and resend this 
patch.

[1] https://lore.kernel.org/linux-fsdevel/Yp8FUZnO64Qvyx5G@kili/

---
 fs/dax.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b59b864017ad..7a8eb1e30a1b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1026,20 +1026,22 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
-pfn_t *pfnp)
+static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
+   size_t size, void **kaddr, pfn_t *pfnp)
 {
pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
-   int id, rc;
+   int id, rc = 0;
long length;
 
id = dax_read_lock();
length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
-  DAX_ACCESS, NULL, pfnp);
+  DAX_ACCESS, kaddr, pfnp);
if (length < 0) {
rc = length;
goto out;
}
+   if (!pfnp)
+   goto out_check_addr;
rc = -EINVAL;
if (PFN_PHYS(length) < size)
goto out;
@@ -1049,6 +1051,12 @@ static int dax_iomap_pfn(const struct iomap *iomap, 
loff_t pos, size_t size,
if (length > 1 && !pfn_t_devmap(*pfnp))
goto out;
rc = 0;
+
+out_check_addr:
+   if (!kaddr)
+   goto out;
+   if (!*kaddr)
+   rc = -EFAULT;
 out:
dax_read_unlock(id);
return rc;
@@ -1456,7 +1464,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
}
 
-	err = dax_iomap_pfn(&iter->iomap, pos, size, &pfn);
+	err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
if (err)
return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-- 
2.36.1






[PATCH v2 14/14] xfs: Add dax dedupe support

2022-06-02 Thread Shiyang Ruan
Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
that are going to be deduped.  After that, call the compare range function
only when both files are DAX, or neither is.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
---
 fs/xfs/xfs_file.c|  2 +-
 fs/xfs/xfs_inode.c   | 69 +---
 fs/xfs/xfs_inode.h   |  1 +
 fs/xfs/xfs_reflink.c |  4 +--
 4 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 07ec4ada5163..9f433006edcd 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -808,7 +808,7 @@ xfs_wait_dax_page(
xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 }
 
-static int
+int
 xfs_break_dax_layouts(
	struct inode		*inode,
	bool			*retry)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b2879870a17e..96308065a2b3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3767,6 +3767,50 @@ xfs_iolock_two_inodes_and_break_layout(
return 0;
 }
 
+static int
+xfs_mmaplock_two_inodes_and_break_dax_layout(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	int			error;
+	bool			retry;
+	struct page		*page;
+
+   if (ip1->i_ino > ip2->i_ino)
+   swap(ip1, ip2);
+
+again:
+   retry = false;
+   /* Lock the first inode */
+   xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
+	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
+   if (error || retry) {
+   xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+   if (error == 0 && retry)
+   goto again;
+   return error;
+   }
+
+   if (ip1 == ip2)
+   return 0;
+
+   /* Nested lock the second inode */
+   xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
+   /*
+* We cannot use xfs_break_dax_layouts() directly here because it may
+* need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
+* for this nested lock case.
+*/
+   page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
+   if (page && page_ref_count(page) != 1) {
+   xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+   xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+   goto again;
+   }
+
+   return 0;
+}
+
 /*
  * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
  * mmap activity.
@@ -3781,8 +3825,19 @@ xfs_ilock2_io_mmap(
ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
if (ret)
return ret;
-   filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
-   VFS_I(ip2)->i_mapping);
+
+   if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
+   ret = xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
+   if (ret) {
+   inode_unlock(VFS_I(ip2));
+   if (ip1 != ip2)
+   inode_unlock(VFS_I(ip1));
+   return ret;
+   }
+   } else
+   filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
+   VFS_I(ip2)->i_mapping);
+
return 0;
 }
 
@@ -3792,8 +3847,14 @@ xfs_iunlock2_io_mmap(
	struct xfs_inode	*ip1,
	struct xfs_inode	*ip2)
 {
-   filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
- VFS_I(ip2)->i_mapping);
+   if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
+   xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+   if (ip1 != ip2)
+   xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+   } else
+   filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
+ VFS_I(ip2)->i_mapping);
+
inode_unlock(VFS_I(ip2));
if (ip1 != ip2)
inode_unlock(VFS_I(ip1));
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 7be6f8e705ab..8313cc83b6ee 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -467,6 +467,7 @@ xfs_itruncate_extents(
 }
 
 /* from xfs_file.c */
+int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
 int	xfs_break_layouts(struct inode *inode, uint *iolock,
enum layout_break_reason reason);
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index cbaf36d21020..d07f06ff0f13 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1363,8 +1363,8 @@ xfs_reflink_remap_prep(
if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
goto out_unlock;
 
-   /* Don't share DAX file data for now. */
-   if (IS_DAX(inode_in) || IS_DAX(inode_out))
+   /* Don't share DAX file data with non-DAX file. */
+   if

[PATCH v2 12/14] fsdax: Dedup file range to use a compare function

2022-06-02 Thread Shiyang Ruan
With dax we cannot deal with readpage() etc.  So, we create a dax
comparison function which is similar to
vfs_dedupe_file_range_compare().
And introduce dax_remap_file_range_prep() for filesystem use.
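
For illustration, a hedged sketch of how a filesystem's remap-prep path
might pick between the new dax helper and the generic one; the variables
follow the usual ->remap_file_range() prototype, and my_fs_read_iomap_ops
is a hypothetical name:
```
	if (IS_DAX(file_inode(file_in)))
		ret = dax_remap_file_range_prep(file_in, pos_in,
				file_out, pos_out, &len, remap_flags,
				&my_fs_read_iomap_ops);
	else
		ret = generic_remap_file_range_prep(file_in, pos_in,
				file_out, pos_out, &len, remap_flags);
```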

Signed-off-by: Goldwyn Rodrigues 
Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
---
 fs/dax.c | 82 
 fs/remap_range.c | 31 ++---
 fs/xfs/xfs_reflink.c |  8 +++--
 include/linux/dax.h  |  8 +
 include/linux/fs.h   | 12 ---
 5 files changed, 130 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 24d8b4f99e98..cda43a819509 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1873,3 +1873,85 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t dax_range_compare_iter(struct iomap_iter *it_src,
+   struct iomap_iter *it_dest, u64 len, bool *same)
+{
+	const struct iomap *smap = &it_src->iomap;
+	const struct iomap *dmap = &it_dest->iomap;
+   loff_t pos1 = it_src->pos, pos2 = it_dest->pos;
+   void *saddr, *daddr;
+   int id, ret;
+
+   len = min(len, min(smap->length, dmap->length));
+
+   if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
+   *same = true;
+   return len;
+   }
+
+   if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
+   *same = false;
+   return 0;
+   }
+
+   id = dax_read_lock();
+	ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
+				      &saddr, NULL);
+   if (ret < 0)
+   goto out_unlock;
+
+	ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
+				      &daddr, NULL);
+   if (ret < 0)
+   goto out_unlock;
+
+   *same = !memcmp(saddr, daddr, len);
+   if (!*same)
+   len = 0;
+   dax_read_unlock(id);
+   return len;
+
+out_unlock:
+   dax_read_unlock(id);
+   return -EIO;
+}
+
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+   struct inode *dst, loff_t dstoff, loff_t len, bool *same,
+   const struct iomap_ops *ops)
+{
+   struct iomap_iter src_iter = {
+   .inode  = src,
+   .pos= srcoff,
+   .len= len,
+   .flags  = IOMAP_DAX,
+   };
+   struct iomap_iter dst_iter = {
+   .inode  = dst,
+   .pos= dstoff,
+   .len= len,
+   .flags  = IOMAP_DAX,
+   };
+   int ret;
+
+	while ((ret = iomap_iter(&src_iter, ops)) > 0) {
+		while ((ret = iomap_iter(&dst_iter, ops)) > 0) {
+			dst_iter.processed = dax_range_compare_iter(&src_iter,
+					&dst_iter, len, same);
+   }
+   if (ret <= 0)
+   src_iter.processed = ret;
+   }
+   return ret;
+}
+
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops)
+{
+   return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+  pos_out, len, remap_flags, ops);
+}
+EXPORT_SYMBOL_GPL(dax_remap_file_range_prep);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index e112b5424cdb..231de627c1b9 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #include 
@@ -271,9 +272,11 @@ static int vfs_dedupe_file_range_compare(struct file *src, 
loff_t srcoff,
  * If there's an error, then the usual negative error code is returned.
  * Otherwise returns 0 with *len set to the request length.
  */
-int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- loff_t *len, unsigned int remap_flags)
+int
+__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   loff_t *len, unsigned int remap_flags,
+   const struct iomap_ops *dax_read_ops)
 {
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -333,8 +336,18 @@ int generic_remap_file_range_prep(struct file *file_in, 
loff_t pos_in,
if (remap_flags & REMAP_FILE_DEDUP) {
boolis_

[PATCH v2 13/14] xfs: support CoW in fsdax mode

2022-06-02 Thread Shiyang Ruan
In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
After that, the newly allocated extents need to be remapped to the file.
So, add a CoW identification in ->iomap_begin(), and implement
->iomap_end() to do the remapping work.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
---
 fs/xfs/xfs_file.c  | 33 -
 fs/xfs/xfs_iomap.c | 30 +-
 fs/xfs/xfs_iomap.h |  1 +
 3 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a60632ecc3f0..07ec4ada5163 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -25,6 +25,7 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 
+#include 
 #include 
 #include 
 #include 
@@ -669,7 +670,7 @@ xfs_file_dax_write(
pos = iocb->ki_pos;
 
trace_xfs_file_dax_write(iocb, from);
-	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
i_size_write(inode, iocb->ki_pos);
error = xfs_setfilesize(ip, pos, ret);
@@ -1254,6 +1255,31 @@ xfs_file_llseek(
return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
 }
 
+#ifdef CONFIG_FS_DAX
+int
+xfs_dax_fault(
+	struct vm_fault		*vmf,
+	enum page_entry_size	pe_size,
+	bool			write_fault,
+	pfn_t			*pfn)
+{
+	return dax_iomap_fault(vmf, pe_size, pfn, NULL,
+			(write_fault && !vmf->cow_page) ?
+			&xfs_dax_write_iomap_ops :
+			&xfs_read_iomap_ops);
+}
+#else
+int
+xfs_dax_fault(
+	struct vm_fault		*vmf,
+	enum page_entry_size	pe_size,
+	bool			write_fault,
+	pfn_t			*pfn)
+{
+   return 0;
+}
+#endif
+
 /*
  * Locking for serialisation of IO during page faults. This results in a lock
  * ordering of:
@@ -1285,10 +1311,7 @@ __xfs_filemap_fault(
pfn_t pfn;
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
-			(write_fault && !vmf->cow_page) ?
-			&xfs_direct_write_iomap_ops :
-			&xfs_read_iomap_ops);
+	ret = xfs_dax_fault(vmf, pe_size, write_fault, &pfn);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5a393259a3a3..4c07f5e718fb 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -773,7 +773,8 @@ xfs_direct_write_iomap_begin(
 
/* may drop and re-acquire the ilock */
 	error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-			&lockmode, flags & IOMAP_DIRECT);
+			&lockmode,
+			(flags & IOMAP_DIRECT) || IS_DAX(inode));
if (error)
goto out_unlock;
if (shared)
@@ -867,6 +868,33 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
.iomap_begin= xfs_direct_write_iomap_begin,
 };
 
+static int
+xfs_dax_write_iomap_end(
+	struct inode		*inode,
+	loff_t			pos,
+	loff_t			length,
+	ssize_t			written,
+	unsigned		flags,
+	struct iomap		*iomap)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+
+   if (!xfs_is_cow_inode(ip))
+   return 0;
+
+   if (!written) {
+   xfs_reflink_cancel_cow_range(ip, pos, length, true);
+   return 0;
+   }
+
+   return xfs_reflink_end_cow(ip, pos, written);
+}
+
+const struct iomap_ops xfs_dax_write_iomap_ops = {
+   .iomap_begin= xfs_direct_write_iomap_begin,
+   .iomap_end  = xfs_dax_write_iomap_end,
+};
+
 static int
 xfs_buffered_write_iomap_begin(
struct inode*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index e88dc162c785..c782e8c0479c 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -51,5 +51,6 @@ extern const struct iomap_ops xfs_direct_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
+extern const struct iomap_ops xfs_dax_write_iomap_ops;
 
 #endif /* __XFS_IOMAP_H__*/
-- 
2.36.1






[PATCH v2 10/14] fsdax: Replace mmap entry in case of CoW

2022-06-02 Thread Shiyang Ruan
Replace the existing entry with the newly allocated one in the CoW case.
Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
entry as writeprotected.  This helps with snapshots: new write
pagefaults after a snapshot trigger a CoW.

Signed-off-by: Goldwyn Rodrigues 
Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ritesh Harjani 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c | 77 ++--
 1 file changed, 42 insertions(+), 35 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 3fe8e3714327..f69e937f6496 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -829,6 +829,23 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const 
struct iomap_iter *iter
return 0;
 }
 
+/*
+ * MAP_SYNC on a dax mapping guarantees dirty metadata is
+ * flushed on write-faults (non-cow), but not read-faults.
+ */
+static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
+   struct vm_area_struct *vma)
+{
+   return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) &&
+   (iter->iomap.flags & IOMAP_F_DIRTY);
+}
+
+static bool dax_fault_is_cow(const struct iomap_iter *iter)
+{
+   return (iter->flags & IOMAP_WRITE) &&
+   (iter->iomap.flags & IOMAP_F_SHARED);
+}
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -836,16 +853,19 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const 
struct iomap_iter *iter
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas,
-   struct address_space *mapping, struct vm_fault *vmf,
-   void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+   const struct iomap_iter *iter, void *entry, pfn_t pfn,
+   unsigned long flags)
 {
+   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
void *new_entry = dax_make_entry(pfn, flags);
+   bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
+   bool cow = dax_fault_is_cow(iter);
 
if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-   if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+   if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
unsigned long index = xas->xa_index;
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
@@ -857,12 +877,12 @@ static void *dax_insert_entry(struct xa_state *xas,
 
xas_reset(xas);
xas_lock_irq(xas);
-   if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+   if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;
 
dax_disassociate_entry(entry, mapping, false);
dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
-   false);
+   cow);
/*
 * Only swap our new entry into the page cache if the current
 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -882,6 +902,9 @@ static void *dax_insert_entry(struct xa_state *xas,
if (dirty)
xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
+   if (cow)
+   xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
xas_unlock_irq(xas);
return entry;
 }
@@ -1123,17 +1146,15 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t 
length, size_t align_size,
  * If this page is ever written to we will re-fault and change the mapping to
  * point to real DAX storage instead.
  */
-static vm_fault_t dax_load_hole(struct xa_state *xas,
-   struct address_space *mapping, void **entry,
-   struct vm_fault *vmf)
+static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
+   const struct iomap_iter *iter, void **entry)
 {
-   struct inode *inode = mapping->host;
+   struct inode *inode = iter->inode;
unsigned long vaddr = vmf->address;
pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
vm_fault_t ret;
 
-   *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-   DAX_ZERO_PAGE, false);
+   *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
 
ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
trace_dax_load_hole(inode, vmf, ret);
@@ -1142,7 +1163,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 
 #ifdef CONFIG_FS_DAX_PMD
 static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vm

[PATCH v2 06/14] xfs: Implement ->notify_failure() for XFS

2022-06-02 Thread Shiyang Ruan
Introduce xfs_notify_failure.c to handle failure related works, such as
implement ->notify_failure(), register/unregister dax holder in xfs, and
so on.

If the rmap feature of XFS is enabled, we can query it to find the files
and metadata which are associated with the corrupt data.  For now all we
do is kill processes with that file mapped into their address spaces, but
future patches could actually do something about corrupt metadata.

After that, the memory failure needs to notify the processes who are
using those files.
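
For illustration, a hedged sketch of the rmap query described above
(cursor setup and error handling are elided; the variable names are
illustrative):
```
	struct xfs_rmap_irec	ri_low = { };
	struct xfs_rmap_irec	ri_high;

	/* build a key range covering the failed blocks in this AG */
	memset(&ri_high, 0xFF, sizeof(ri_high));
	ri_low.rm_startblock = XFS_FSB_TO_AGBNO(mp, start_fsbno);
	ri_high.rm_startblock = XFS_FSB_TO_AGBNO(mp, end_fsbno);

	/* invoke the failure callback for every owner of those blocks */
	error = xfs_rmap_query_range(cur, &ri_low, &ri_high,
			xfs_dax_failure_fn, &notify);
```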

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Darrick J. Wong 
---
 fs/xfs/Makefile |   5 +
 fs/xfs/xfs_buf.c|  11 +-
 fs/xfs/xfs_fsops.c  |   3 +
 fs/xfs/xfs_mount.h  |   1 +
 fs/xfs/xfs_notify_failure.c | 220 
 fs/xfs/xfs_super.h  |   1 +
 6 files changed, 238 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/xfs_notify_failure.c

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index b056cfc6398e..805a0d0a88c1 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -129,6 +129,11 @@ xfs-$(CONFIG_SYSCTL)   += xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)   += xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)   += xfs_pnfs.o
 
+# notify failure
+ifeq ($(CONFIG_MEMORY_FAILURE),y)
+xfs-$(CONFIG_FS_DAX)   += xfs_notify_failure.o
+endif
+
 # online scrub/repair
 ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
 
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1ec2a7b6d44e..59c6b62fde57 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -5,6 +5,7 @@
  */
 #include "xfs.h"
 #include 
+#include 
 
 #include "xfs_shared.h"
 #include "xfs_format.h"
@@ -1911,7 +1912,7 @@ xfs_free_buftarg(
	list_lru_destroy(&btp->bt_lru);
 
blkdev_issue_flush(btp->bt_bdev);
-   fs_put_dax(btp->bt_daxdev, NULL);
+   fs_put_dax(btp->bt_daxdev, btp->bt_mount);
 
kmem_free(btp);
 }
@@ -1958,14 +1959,18 @@ xfs_alloc_buftarg(
struct block_device *bdev)
 {
xfs_buftarg_t   *btp;
+   const struct dax_holder_operations *ops = NULL;
 
+#if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
+	ops = &xfs_dax_holder_operations;
+#endif
btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
 
btp->bt_mount = mp;
btp->bt_dev =  bdev->bd_dev;
btp->bt_bdev = bdev;
-	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, NULL,
-			NULL);
+	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off,
+			mp, ops);
 
/*
 * Buffer IO error rate limiting. Limit it to no more than 10 messages
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 39e75d11..ea9159967eaa 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -533,6 +533,9 @@ xfs_do_force_shutdown(
} else if (flags & SHUTDOWN_CORRUPT_INCORE) {
tag = XFS_PTAG_SHUTDOWN_CORRUPT;
why = "Corruption of in-memory data";
+   } else if (flags & SHUTDOWN_CORRUPT_ONDISK) {
+   tag = XFS_PTAG_SHUTDOWN_CORRUPT;
+   why = "Corruption of on-disk metadata";
} else {
tag = XFS_PTAG_SHUTDOWN_IOERROR;
why = "Metadata I/O Error";
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8c42786e4942..540924b9e583 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -438,6 +438,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, uint32_t 
flags, char *fname,
 #define SHUTDOWN_LOG_IO_ERROR  (1u << 1) /* write attempt to the log failed */
 #define SHUTDOWN_FORCE_UMOUNT  (1u << 2) /* shutdown from a forced unmount */
 #define SHUTDOWN_CORRUPT_INCORE	(1u << 3) /* corrupt in-memory structures */
+#define SHUTDOWN_CORRUPT_ONDISK	(1u << 4) /* corrupt metadata on device */
 
 #define XFS_SHUTDOWN_STRINGS \
{ SHUTDOWN_META_IO_ERROR,   "metadata_io" }, \
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
new file mode 100644
index ..aa8dc27c599c
--- /dev/null
+++ b/fs/xfs/xfs_notify_failure.c
@@ -0,0 +1,220 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022 Fujitsu.  All Rights Reserved.
+ */
+
+#include "xfs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_alloc.h"
+#include "xfs_bit.h"
+#include "xfs_btree.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_rtalloc.h"
+#include "xfs_trans.h"
+
+#include 
+#include 

[PATCH v2 05/14] mm: Introduce mf_dax_kill_procs() for fsdax case

2022-06-02 Thread Shiyang Ruan
This new function is a variant of mf_generic_kill_procs that accepts a
(file, offset) pair instead of a struct page, to support multiple files
sharing a DAX mapping.  It is intended to be called by file systems as
part of the memory_failure handler after the file system has performed a
reverse mapping from the storage address to the file and file offset.
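
For illustration, a hedged usage sketch: once the filesystem has resolved
the failing storage range to a file, it converts the in-file byte range to
page units and hands it to mf_dax_kill_procs() (offset and len here are
illustrative byte quantities):
```
	rc = mf_dax_kill_procs(inode->i_mapping,
			       offset >> PAGE_SHIFT,	/* first page index */
			       len >> PAGE_SHIFT,	/* number of pages */
			       mf_flags);
```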

Signed-off-by: Shiyang Ruan 
Reviewed-by: Dan Williams 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Miaohe Lin 
---
 include/linux/mm.h  |  2 +
 mm/memory-failure.c | 96 -
 2 files changed, 88 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8a96197b9afd..623c2ee8330a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3250,6 +3250,8 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_NO_RETRY = 1 << 5,
 };
+int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
+ unsigned long count, int mf_flags);
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a9d93c30a1e4..5d015e1387bd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -301,10 +301,9 @@ void shake_page(struct page *p)
 }
 EXPORT_SYMBOL_GPL(shake_page);
 
-static unsigned long dev_pagemap_mapping_shift(struct page *page,
-   struct vm_area_struct *vma)
+static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
+   unsigned long address)
 {
-   unsigned long address = vma_address(page, vma);
unsigned long ret = 0;
pgd_t *pgd;
p4d_t *p4d;
@@ -344,10 +343,14 @@ static unsigned long dev_pagemap_mapping_shift(struct 
page *page,
 /*
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ *
+ * Notice: @fsdax_pgoff is used only when @p is a fsdax page.
+ *   In other cases, such as anonymous and file-backed pages, the address to be
+ *   killed can be calculated from @p itself.
  */
 static void add_to_kill(struct task_struct *tsk, struct page *p,
-  struct vm_area_struct *vma,
-  struct list_head *to_kill)
+   pgoff_t fsdax_pgoff, struct vm_area_struct *vma,
+   struct list_head *to_kill)
 {
struct to_kill *tk;
 
@@ -358,9 +361,15 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
}
 
tk->addr = page_address_in_vma(p, vma);
-   if (is_zone_device_page(p))
-   tk->size_shift = dev_pagemap_mapping_shift(p, vma);
-   else
+   if (is_zone_device_page(p)) {
+   /*
+* Since page->mapping is not used for fsdax, we need
+* calculate the address based on the vma.
+*/
+   if (p->pgmap->type == MEMORY_DEVICE_FS_DAX)
+   tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma);
+   tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
+   } else
tk->size_shift = page_shift(compound_head(p));
 
/*
@@ -509,7 +518,7 @@ static void collect_procs_anon(struct page *page, struct 
list_head *to_kill,
if (!page_mapped_in_vma(page, vma))
continue;
if (vma->vm_mm == t->mm)
-   add_to_kill(t, page, vma, to_kill);
+   add_to_kill(t, page, 0, vma, to_kill);
}
}
	read_unlock(&tasklist_lock);
@@ -545,13 +554,41 @@ static void collect_procs_file(struct page *page, struct 
list_head *to_kill,
 * to be informed of all such data corruptions.
 */
if (vma->vm_mm == t->mm)
-   add_to_kill(t, page, vma, to_kill);
+   add_to_kill(t, page, 0, vma, to_kill);
}
}
	read_unlock(&tasklist_lock);
i_mmap_unlock_read(mapping);
 }
 
+#ifdef CONFIG_FS_DAX
+/*
+ * Collect processes when the error hit a fsdax page.
+ */
+static void collect_procs_fsdax(struct page *page,
+   struct address_space *mapping, pgoff_t pgoff,
+   struct list_head *to_kill)
+{
+   struct vm_area_struct *vma;
+   struct task_struct *tsk;
+
+   i_mmap_lock_read(mapping);
+	read_lock(&tasklist_lock);
+   for_each_process(tsk) {
+   struct task_struct *t = task_early_kill(tsk, true);
+
+   if (!t)
+   continue;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+   if (vma->vm_mm == t->mm)
+   add_to_kill(t, page, pgoff, vm

[PATCH v2 08/14] fsdax: Output address in dax_iomap_pfn() and rename it

2022-06-02 Thread Shiyang Ruan
Add address output in dax_iomap_pfn() in order to perform a memcpy() in
the CoW case.  Since this function outputs both the address and the pfn,
rename it to dax_iomap_direct_access().

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ritesh Harjani 
Reviewed-by: Dan Williams 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b59b864017ad..ab659c9f142a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1026,8 +1026,8 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
-pfn_t *pfnp)
+static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
+   size_t size, void **kaddr, pfn_t *pfnp)
 {
pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
int id, rc;
@@ -1035,11 +1035,13 @@ static int dax_iomap_pfn(const struct iomap *iomap, 
loff_t pos, size_t size,
 
id = dax_read_lock();
length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
-  DAX_ACCESS, NULL, pfnp);
+  DAX_ACCESS, kaddr, pfnp);
if (length < 0) {
rc = length;
goto out;
}
+   if (!pfnp)
+   goto out_check_addr;
rc = -EINVAL;
if (PFN_PHYS(length) < size)
goto out;
@@ -1049,6 +1051,12 @@ static int dax_iomap_pfn(const struct iomap *iomap, 
loff_t pos, size_t size,
if (length > 1 && !pfn_t_devmap(*pfnp))
goto out;
rc = 0;
+
+out_check_addr:
+   if (!kaddr)
+   goto out;
+   if (!*kaddr)
+   rc = -EFAULT;
 out:
dax_read_unlock(id);
return rc;
@@ -1456,7 +1464,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
}
 
-	err = dax_iomap_pfn(&iter->iomap, pos, size, &pfn);
+	err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
if (err)
return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-- 
2.36.1






[PATCH v2 07/14] fsdax: set a CoW flag when associate reflink mappings

2022-06-02 Thread Shiyang Ruan
Introduce a PAGE_MAPPING_DAX_COW flag to support association with CoW file
mappings.  In this case, since the dax-rmap has already taken
responsibility for looking up shared files by a given dax page,
the page->mapping is no longer used for rmap but for marking that
this dax page is shared.  And to make sure disassociation works fine, we
use page->index as a refcount, and clear page->mapping to the initial
state when page->index decreases to 0.

With the help of this new flag, we are able to distinguish the normal case
from the CoW case, and keep the warning in the normal case.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c   | 50 +++---
 include/linux/page-flags.h |  6 +
 2 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 65e44d78b3bb..b59b864017ad 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -334,13 +334,35 @@ static unsigned long dax_end_pfn(void *entry)
for (pfn = dax_to_pfn(entry); \
pfn < dax_end_pfn(entry); pfn++)
 
+static inline bool dax_mapping_is_cow(struct address_space *mapping)
+{
+   return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
+}
+
 /*
- * TODO: for reflink+dax we need a way to associate a single page with
- * multiple address_space instances at different linear_page_index()
- * offsets.
+ * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
+ */
+static inline void dax_mapping_set_cow(struct page *page)
+{
+   if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
+   /*
+* Reset the index if the page was already mapped
+* regularly before.
+*/
+   if (page->mapping)
+   page->index = 1;
+   page->mapping = (void *)PAGE_MAPPING_DAX_COW;
+   }
+   page->index++;
+}
+
+/*
+ * When it is called in dax_insert_entry(), the cow flag will indicate whether
+ * this entry is shared by multiple files.  If so, set the page->mapping to
+ * FS_DAX_MAPPING_COW, and use page->index as a refcount.
  */
 static void dax_associate_entry(void *entry, struct address_space *mapping,
-   struct vm_area_struct *vma, unsigned long address)
+   struct vm_area_struct *vma, unsigned long address, bool cow)
 {
unsigned long size = dax_entry_size(entry), pfn, index;
int i = 0;
@@ -352,9 +374,13 @@ static void dax_associate_entry(void *entry, struct 
address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   WARN_ON_ONCE(page->mapping);
-   page->mapping = mapping;
-   page->index = index + i++;
+   if (cow) {
+   dax_mapping_set_cow(page);
+   } else {
+   WARN_ON_ONCE(page->mapping);
+   page->mapping = mapping;
+   page->index = index + i++;
+   }
}
 }
 
@@ -370,7 +396,12 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
struct page *page = pfn_to_page(pfn);
 
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+   if (dax_mapping_is_cow(page->mapping)) {
+   /* keep the CoW flag if this page is still shared */
+   if (page->index-- > 0)
+   continue;
+   } else
+   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
page->index = 0;
}
@@ -830,7 +861,8 @@ static void *dax_insert_entry(struct xa_state *xas,
void *old;
 
dax_disassociate_entry(entry, mapping, false);
-   dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
+   dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
+   false);
/*
 * Only swap our new entry into the page cache if the current
 * entry is a zero page or an empty entry.  If a normal PTE or
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e66f7aa3191d..a5263a21b72f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -650,6 +650,12 @@ __PAGEFLAG(Reported, reported, PF_NO_COMPOUND)
 #define PAGE_MAPPING_KSM   (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
 #define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
 
+/*
+ * Unlike the flags above, this flag is used only for fsdax mode.  It
+ * indicates that this page->mapping is now under the reflink case.
+ */

[PATCH v2 11/14] fsdax: Add dax_iomap_cow_copy() for dax zero

2022-06-02 Thread Shiyang Ruan
Punching a hole on a reflinked file needs dax_iomap_cow_copy() too.
Otherwise, data in the unaligned area will not be correct.  So, add the
CoW operation for the unaligned case in dax_memzero().

Signed-off-by: Shiyang Ruan 
Reviewed-by: Ritesh Harjani 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 

==
This patch changed a lot when rebasing to next-20220504 branch.  Though it
has been tested by myself, I think it needs a re-review.
==
---
 fs/dax.c | 27 +++
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f69e937f6496..24d8b4f99e98 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1221,17 +1221,28 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
*xas, struct vm_fault *vmf,
 }
 #endif /* CONFIG_FS_DAX_PMD */
 
-static int dax_memzero(struct dax_device *dax_dev, pgoff_t pgoff,
-   unsigned int offset, size_t size)
+static int dax_memzero(struct iomap_iter *iter, loff_t pos, size_t size)
 {
+   const struct iomap *iomap = >iomap;
+   const struct iomap *srcmap = iomap_iter_srcmap(iter);
+   unsigned offset = offset_in_page(pos);
+   pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
void *kaddr;
long ret;
 
-	ret = dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
-   if (ret > 0) {
-   memset(kaddr + offset, 0, size);
-   dax_flush(dax_dev, kaddr + offset, size);
-   }
+	ret = dax_direct_access(iomap->dax_dev, pgoff, 1, DAX_ACCESS, &kaddr,
+			NULL);
+   if (ret < 0)
+   return ret;
+   memset(kaddr + offset, 0, size);
+   if (srcmap->addr != iomap->addr) {
+   ret = dax_iomap_cow_copy(pos, size, PAGE_SIZE, srcmap,
+kaddr);
+   if (ret < 0)
+   return ret;
+   dax_flush(iomap->dax_dev, kaddr, PAGE_SIZE);
+   } else
+   dax_flush(iomap->dax_dev, kaddr + offset, size);
return ret;
 }
 
@@ -1258,7 +1269,7 @@ static s64 dax_zero_iter(struct iomap_iter *iter, bool 
*did_zero)
if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
else
-   rc = dax_memzero(iomap->dax_dev, pgoff, offset, size);
+   rc = dax_memzero(iter, pos, size);
dax_read_unlock(id);
 
if (rc < 0)
-- 
2.36.1






[PATCH v2 09/14] fsdax: Introduce dax_iomap_cow_copy()

2022-06-02 Thread Shiyang Ruan
In the case where the iomap is a write operation and the iomap is not equal
to the srcmap after iomap_begin, we consider it a CoW operation.

In this case, the destination (iomap->addr) points to a newly allocated
extent.  The data needs to be copied from the srcmap to that extent.  In
theory, it is better to copy only the head and tail ranges which lie outside
of the unaligned area instead of copying the whole aligned range.  But in
a dax page fault, it will always be an aligned range.  So copy the whole
range in this case.
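
A worked example of the head/tail copy, assuming align_size == PAGE_SIZE
== 4096 and an unaligned write at pos = 4608 for length = 1536 bytes:
```
	/*
	 * head_off = 4608 & 4095          = 512
	 * end      = 4608 + 1536          = 6144
	 * pg_end   = round_up(6144, 4096) = 8192
	 *
	 * So the head copy moves 512 old bytes into daddr[0, 512), the
	 * tail copy moves 2048 old bytes into daddr[2048, 4096), and the
	 * caller then writes the new 1536 bytes into daddr[512, 2048).
	 */
```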

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c | 88 
 1 file changed, 83 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ab659c9f142a..3fe8e3714327 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1062,6 +1062,60 @@ static int dax_iomap_direct_access(const struct iomap 
*iomap, loff_t pos,
return rc;
 }
 
+/**
+ * dax_iomap_cow_copy - Copy the data from source to destination before write
+ * @pos:   address to do copy from.
+ * @length:size of copy operation.
+ * @align_size:aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
+ * @srcmap:iomap srcmap
+ * @daddr: destination address to copy to.
+ *
+ * This can be called from two places. Either during DAX write fault (page
+ * aligned), to copy the length size data to daddr. Or, while doing normal DAX
+ * write operation, dax_iomap_actor() might call this to do the copy of either
+ * start or end unaligned address. In the latter case the rest of the copy of
+ * aligned ranges is taken care by dax_iomap_actor() itself.
+ */
+static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
+   const struct iomap *srcmap, void *daddr)
+{
+   loff_t head_off = pos & (align_size - 1);
+   size_t size = ALIGN(head_off + length, align_size);
+   loff_t end = pos + length;
+   loff_t pg_end = round_up(end, align_size);
+   bool copy_all = head_off == 0 && end == pg_end;
+   void *saddr = 0;
+   int ret = 0;
+
+	ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+   if (ret)
+   return ret;
+
+   if (copy_all) {
+   ret = copy_mc_to_kernel(daddr, saddr, length);
+   return ret ? -EIO : 0;
+   }
+
+   /* Copy the head part of the range */
+   if (head_off) {
+   ret = copy_mc_to_kernel(daddr, saddr, head_off);
+   if (ret)
+   return -EIO;
+   }
+
+   /* Copy the tail part of the range */
+   if (end < pg_end) {
+   loff_t tail_off = head_off + length;
+   loff_t tail_len = pg_end - end;
+
+   ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
+   tail_len);
+   if (ret)
+   return -EIO;
+   }
+   return 0;
+}
+
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
  * page in the file would cause excessive storage usage for workloads with
@@ -1232,15 +1286,17 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
struct iov_iter *iter)
 {
 	const struct iomap *iomap = &iomi->iomap;
+	const struct iomap *srcmap = &iomi->srcmap;
loff_t length = iomap_length(iomi);
loff_t pos = iomi->pos;
struct dax_device *dax_dev = iomap->dax_dev;
loff_t end = pos + length, done = 0;
+   bool write = iov_iter_rw(iter) == WRITE;
ssize_t ret = 0;
size_t xfer;
int id;
 
-   if (iov_iter_rw(iter) == READ) {
+   if (!write) {
end = min(end, i_size_read(iomi->inode));
if (pos >= end)
return 0;
@@ -1249,7 +1305,12 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
return iov_iter_zero(min(length, end - pos), iter);
}
 
-   if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+   /*
+* In DAX mode, enforce either pure overwrites of written extents, or
+* writes to unwritten extents as part of a copy-on-write operation.
+*/
+   if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
+   !(iomap->flags & IOMAP_F_SHARED)))
return -EIO;
 
/*
@@ -1291,6 +1352,14 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
*iomi,
break;
}
 
+   if (write &&
+   srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+   ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
+kaddr);
+   if (ret)
+   break;
+   }
+
map_len = PFN_PHYS(map_len);
kaddr

[PATCH v2 01/14] dax: Introduce holder for dax_device

2022-06-02 Thread Shiyang Ruan
To easily track the filesystem from a pmem device, we introduce a holder for
the dax_device structure, and also its operations.  This holder is used to
remember who is using this dax_device:
 - When it is the backend of a filesystem, the holder will be the
   instance of this filesystem.
 - When this pmem device is one of the targets in a mapped device, the
   holder will be this mapped device.  In this case, the mapped device
   has its own dax_device and it will follow the first rule.  So we
   can finally track down to the filesystem we need.

The holder and holder_ops will be set when the filesystem is being mounted,
or a target device is being activated.
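
For illustration, a minimal sketch of how a filesystem might register
itself as the holder at mount time using these interfaces; the my_*()
names are hypothetical:
```
static const struct dax_holder_operations my_dax_holder_ops = {
	.notify_failure	= my_dax_notify_failure,
};

	/* at mount: become the holder of the backing dax_device */
	dax_dev = fs_dax_get_by_bdev(bdev, &dax_part_off, sb,
			&my_dax_holder_ops);

	/* at unmount: drop holdership along with the reference */
	fs_put_dax(dax_dev, sb);
```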

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dan Williams 
Reviewed-by: Darrick J. Wong 
---
 drivers/dax/super.c | 67 -
 drivers/md/dm.c |  2 +-
 fs/erofs/super.c| 10 ---
 fs/ext2/super.c |  7 +++--
 fs/ext4/super.c |  9 +++---
 fs/xfs/xfs_buf.c|  5 ++--
 include/linux/dax.h | 33 --
 7 files changed, 110 insertions(+), 23 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 50a08b2ec247..9b5e2a5eb0ae 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -22,6 +22,8 @@
  * @private: dax driver private data
  * @flags: state and boolean properties
  * @ops: operations for this device
+ * @holder_data: holder of a dax_device: could be filesystem or mapped device
+ * @holder_ops: operations for the inner holder
  */
 struct dax_device {
struct inode inode;
@@ -29,6 +31,8 @@ struct dax_device {
void *private;
unsigned long flags;
const struct dax_operations *ops;
+   void *holder_data;
+   const struct dax_holder_operations *holder_ops;
 };
 
 static dev_t dax_devt;
@@ -71,8 +75,11 @@ EXPORT_SYMBOL_GPL(dax_remove_host);
  * fs_dax_get_by_bdev() - temporary lookup mechanism for filesystem-dax
  * @bdev: block device to find a dax_device for
  * @start_off: returns the byte offset into the dax_device that @bdev starts
+ * @holder: filesystem or mapped device inside the dax_device
+ * @ops: operations for the inner holder
  */
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off)
+struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off,
+		void *holder, const struct dax_holder_operations *ops)
 {
struct dax_device *dax_dev;
u64 part_size;
@@ -92,11 +99,26 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device 
*bdev, u64 *start_off)
 	dax_dev = xa_load(&dax_hosts, (unsigned long)bdev->bd_disk);
 	if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
 		dax_dev = NULL;
+	else if (holder) {
+		if (!cmpxchg(&dax_dev->holder_data, NULL, holder))
+   dax_dev->holder_ops = ops;
+   else
+   dax_dev = NULL;
+   }
dax_read_unlock(id);
 
return dax_dev;
 }
 EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
+
+void fs_put_dax(struct dax_device *dax_dev, void *holder)
+{
+   if (dax_dev && holder &&
+	    cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
+   dax_dev->holder_ops = NULL;
+   put_dax(dax_dev);
+}
+EXPORT_SYMBOL_GPL(fs_put_dax);
 #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
 
 enum dax_device_flags {
@@ -204,6 +226,29 @@ size_t dax_recovery_write(struct dax_device *dax_dev, 
pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_recovery_write);
 
+int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
+ u64 len, int mf_flags)
+{
+   int rc, id;
+
+   id = dax_read_lock();
+   if (!dax_alive(dax_dev)) {
+   rc = -ENXIO;
+   goto out;
+   }
+
+   if (!dax_dev->holder_ops) {
+   rc = -EOPNOTSUPP;
+   goto out;
+   }
+
+   rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
+out:
+   dax_read_unlock(id);
+   return rc;
+}
+EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
@@ -277,8 +322,15 @@ void kill_dax(struct dax_device *dax_dev)
if (!dax_dev)
return;
 
+   if (dax_dev->holder_data != NULL)
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
+
+   /* clear holder data */
+   dax_dev->holder_ops = NULL;
+   dax_dev->holder_data = NULL;
 }
 EXPORT_SYMBOL_GPL(kill_dax);
 
@@ -420,6 +472,19 @@ void put_dax(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(put_dax);
 
+/**
+ * dax_holder() - obtain the holder of a dax device
+ * @dax_dev: a dax_device instance
+
+ * Return: the holder's data

[PATCHSETS v2] v14 fsdax-rmap + v11 fsdax-reflink

2022-06-02 Thread Shiyang Ruan
 Changes since v1[1]:
  1. Rebased to mm-unstable, solved many conflicts

[1] 
https://lore.kernel.org/linux-xfs/20220508143620.1775214-1-ruansy.f...@fujitsu.com/


This is an *updated* combination of two patchsets:
 1.fsdax-rmap: 
https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.f...@fujitsu.com/
 2.fsdax-reflink: 
https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.f...@fujitsu.com/


==
Shiyang Ruan (14):
  dax: Introduce holder for dax_device
  mm: factor helpers for memory_failure_dev_pagemap
  pagemap,pmem: Introduce ->memory_failure()
  fsdax: Introduce dax_lock_mapping_entry()
  mm: Introduce mf_dax_kill_procs() for fsdax case
  xfs: Implement ->notify_failure() for XFS
  fsdax: set a CoW flag when associate reflink mappings
  fsdax: Output address in dax_iomap_pfn() and rename it
  fsdax: Introduce dax_iomap_cow_copy()
  fsdax: Replace mmap entry in case of CoW
  fsdax: Add dax_iomap_cow_copy() for dax zero
  fsdax: Dedup file range to use a compare function
  xfs: support CoW in fsdax mode
  xfs: Add dax dedupe support

 drivers/dax/super.c |  67 +-
 drivers/md/dm.c |   2 +-
 drivers/nvdimm/pmem.c   |  17 ++
 fs/dax.c| 399 ++--
 fs/erofs/super.c|  10 +-
 fs/ext2/super.c |   7 +-
 fs/ext4/super.c |   9 +-
 fs/remap_range.c|  31 ++-
 fs/xfs/Makefile |   5 +
 fs/xfs/xfs_buf.c|  10 +-
 fs/xfs/xfs_file.c   |  35 +++-
 fs/xfs/xfs_fsops.c  |   3 +
 fs/xfs/xfs_inode.c  |  69 ++-
 fs/xfs/xfs_inode.h  |   1 +
 fs/xfs/xfs_iomap.c  |  30 ++-
 fs/xfs/xfs_iomap.h  |   1 +
 fs/xfs/xfs_mount.h  |   1 +
 fs/xfs/xfs_notify_failure.c | 220 
 fs/xfs/xfs_reflink.c|  12 +-
 fs/xfs/xfs_super.h  |   1 +
 include/linux/dax.h |  56 -
 include/linux/fs.h  |  12 +-
 include/linux/memremap.h|  12 ++
 include/linux/mm.h  |   2 +
 include/linux/page-flags.h  |   6 +
 mm/memory-failure.c | 265 +---
 26 files changed, 1098 insertions(+), 185 deletions(-)
 create mode 100644 fs/xfs/xfs_notify_failure.c

-- 
2.36.1






[PATCH v2 04/14] fsdax: Introduce dax_lock_mapping_entry()

2022-06-02 Thread Shiyang Ruan
The current dax_lock_page() locks a dax entry by obtaining the mapping
and index from the page.  To support 1-to-N RMAP in NVDIMM, we need a
new function that locks the specific dax entry corresponding to a given
file's mapping and index, and that outputs the page corresponding to
that dax entry for the caller's use.
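
A minimal caller-side sketch of how the pair is intended to be used
(the helper name is hypothetical; error handling is elided):

```
/*
 * Hypothetical caller: lock the dax entry at (mapping, index), poison
 * the backing page if one exists, then unlock.  Zero/empty/absent
 * entries return a special cookie and leave @page untouched.
 */
static void example_poison_entry(struct address_space *mapping,
				 pgoff_t index)
{
	struct page *page = NULL;
	dax_entry_t cookie;

	cookie = dax_lock_mapping_entry(mapping, index, &page);
	if (!cookie)
		return;		/* entry could not be locked */

	if (page)
		SetPageHWPoison(page);

	dax_unlock_mapping_entry(mapping, index, cookie);
}
```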

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c| 63 +
 include/linux/dax.h | 15 +++
 2 files changed, 78 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 4155a6107fa1..65e44d78b3bb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -455,6 +455,69 @@ void dax_unlock_page(struct page *page, dax_entry_t cookie)
	dax_unlock_entry(&xas, (void *)cookie);
 }
 
+/*
+ * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
+ * @mapping: the file's mapping whose entry we want to lock
+ * @index: the offset within this file
+ * @page: output the dax page corresponding to this dax entry
+ *
+ * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
+ * could not be locked.
+ */
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t 
index,
+   struct page **page)
+{
+   XA_STATE(xas, NULL, 0);
+   void *entry;
+
+   rcu_read_lock();
+   for (;;) {
+   entry = NULL;
+   if (!dax_mapping(mapping))
+   break;
+
+   xas.xa = &mapping->i_pages;
+   xas_lock_irq(&xas);
+   xas_set(&xas, index);
+   entry = xas_load(&xas);
+   if (dax_is_locked(entry)) {
+   rcu_read_unlock();
+   wait_entry_unlocked(&xas, entry);
+   rcu_read_lock();
+   continue;
+   }
+   if (!entry ||
+   dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+   /*
+    * Because we are looking up the entry via the file's mapping
+    * and index, the entry may not have been inserted yet, or may
+    * even be a zero/empty entry.  We don't treat this as an
+    * error; return a special value and do not output @page.
+    */
+   entry = (void *)~0UL;
+   } else {
+   *page = pfn_to_page(dax_to_pfn(entry));
+   dax_lock_entry(&xas, entry);
+   }
+   xas_unlock_irq(&xas);
+   break;
+   }
+   rcu_read_unlock();
+   return (dax_entry_t)entry;
+}
+
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
+   dax_entry_t cookie)
+{
+   XA_STATE(xas, &mapping->i_pages, index);
+
+   if (cookie == ~0UL)
+   return;
+
+   dax_unlock_entry(&xas, (void *)cookie);
+}
+
 /*
  * Find page cache entry at given index. If it is a DAX entry, return it
  * with the entry locked. If the page cache doesn't contain an entry at
diff --git a/include/linux/dax.h b/include/linux/dax.h
index cf85fc36da5f..7116681b48c0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -161,6 +161,10 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping);
 struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t 
start, loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
+   unsigned long index, struct page **page);
+void dax_unlock_mapping_entry(struct address_space *mapping,
+   unsigned long index, dax_entry_t cookie);
 #else
 static inline struct page *dax_layout_busy_page(struct address_space *mapping)
 {
@@ -188,6 +192,17 @@ static inline dax_entry_t dax_lock_page(struct page *page)
 static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
 {
 }
+
+static inline dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
+   unsigned long index, struct page **page)
+{
+   return 0;
+}
+
+static inline void dax_unlock_mapping_entry(struct address_space *mapping,
+   unsigned long index, dax_entry_t cookie)
+{
+}
 #endif
 
 int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
-- 
2.36.1






[PATCH v2 03/14] pagemap,pmem: Introduce ->memory_failure()

2022-06-02 Thread Shiyang Ruan
When a memory failure occurs, we call this function, which is
implemented by each kind of device.  For the fsdax case, the pmem device
driver implements it: the pmem device driver will find out the
filesystem in which the corrupted page is located.

With dax_holder notify support, we are able to notify the memory failure
from the pmem driver to the upper layers.  If something is not supported
in the notify routine, memory_failure() will fall back to the generic
handler.
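
For orientation, a sketch of the receiving side assumed by this series
(the my_fs_* names are hypothetical; the registration itself comes from
the dax holder patch):

```
/*
 * Hypothetical holder-side stub: pmem_pagemap_memory_failure() below
 * forwards the failed byte range here via dax_holder_notify_failure().
 */
static int my_fs_notify_failure(struct dax_device *dax_dev,
				u64 offset, u64 len, int mf_flags)
{
	/*
	 * Reverse-map (offset, len) to the affected files, then unmap
	 * and signal the processes using them.  Returning -EOPNOTSUPP
	 * tells memory_failure() to fall back to the generic handler.
	 */
	return -EOPNOTSUPP;
}

static const struct dax_holder_operations my_fs_holder_ops = {
	.notify_failure = my_fs_notify_failure,
};
```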

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dan Williams 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Naoya Horiguchi 
---
 drivers/nvdimm/pmem.c| 17 +
 include/linux/memremap.h | 12 
 mm/memory-failure.c  | 14 ++
 3 files changed, 43 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 629d10fcf53b..107c9cb3d57d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -453,6 +453,21 @@ static void pmem_release_disk(void *__pmem)
blk_cleanup_disk(pmem->disk);
 }
 
+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
+   unsigned long pfn, unsigned long nr_pages, int mf_flags)
+{
+   struct pmem_device *pmem =
+   container_of(pgmap, struct pmem_device, pgmap);
+   u64 offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
+   u64 len = nr_pages << PAGE_SHIFT;
+
+   return dax_holder_notify_failure(pmem->dax_dev, offset, len, mf_flags);
+}
+
+static const struct dev_pagemap_ops fsdax_pagemap_ops = {
+   .memory_failure = pmem_pagemap_memory_failure,
+};
+
 static int pmem_attach_disk(struct device *dev,
struct nd_namespace_common *ndns)
 {
@@ -514,6 +529,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags = PFN_DEV;
if (is_nd_pfn(dev)) {
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+   pmem->pgmap.ops = &fsdax_pagemap_ops;
	addr = devm_memremap_pages(dev, &pmem->pgmap);
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
@@ -527,6 +543,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pgmap.range.end = res->end;
pmem->pgmap.nr_range = 1;
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+   pmem->pgmap.ops = &fsdax_pagemap_ops;
	addr = devm_memremap_pages(dev, &pmem->pgmap);
pmem->pfn_flags |= PFN_MAP;
bb_range = pmem->pgmap.range;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9f752ebed613..334ce79a3b91 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -87,6 +87,18 @@ struct dev_pagemap_ops {
 * the page back to a CPU accessible page.
 */
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+   /*
+* Handle the memory failure happens on a range of pfns.  Notify the
+* processes who are using these pfns, and try to recover the data on
+* them if necessary.  The mf_flags is finally passed to the recover
+* function through the whole notify routine.
+*
+* When this is not implemented, or it returns -EOPNOTSUPP, the caller
+* will fall back to a common handler called mf_generic_kill_procs().
+*/
+   int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
+ unsigned long nr_pages, int mf_flags);
 };
 
 #define PGMAP_ALTMAP_VALID (1 << 0)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b39424da7625..a9d93c30a1e4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1737,6 +1737,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, 
int flags,
if (!pgmap_pfn_valid(pgmap, pfn))
goto out;
 
+   /*
+* Call driver's implementation to handle the memory failure, otherwise
+* fall back to generic handler.
+*/
+   if (pgmap->ops->memory_failure) {
+   rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);
+   /*
+* Fall back to generic handler too if operation is not
+* supported inside the driver/device/filesystem.
+*/
+   if (rc != -EOPNOTSUPP)
+   goto out;
+   }
+
rc = mf_generic_kill_procs(pfn, flags, pgmap);
 out:
/* drop pgmap ref acquired in caller */
-- 
2.36.1






[PATCH v2 02/14] mm: factor helpers for memory_failure_dev_pagemap

2022-06-02 Thread Shiyang Ruan
The memory_failure_dev_pagemap code is a bit complex before the RMAP
feature for fsdax is introduced.  So factor out some helper functions to
simplify the code.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dan Williams 
Reviewed-by: Miaohe Lin 
---
 mm/memory-failure.c | 167 
 1 file changed, 92 insertions(+), 75 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 36072c10658a..b39424da7625 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1502,6 +1502,95 @@ static int try_to_split_thp_page(struct page *page, 
const char *msg)
return 0;
 }
 
+static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
+   struct address_space *mapping, pgoff_t index, int flags)
+{
+   struct to_kill *tk;
+   unsigned long size = 0;
+
+   list_for_each_entry(tk, to_kill, nd)
+   if (tk->size_shift)
+   size = max(size, 1UL << tk->size_shift);
+
+   if (size) {
+   /*
+* Unmap the largest mapping to avoid breaking up device-dax
+* mappings which are constant size. The actual size of the
+* mapping being torn down is communicated in siginfo, see
+* kill_proc()
+*/
+   loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
+
+   unmap_mapping_range(mapping, start, size, 0);
+   }
+
+   kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
+}
+
+static int mf_generic_kill_procs(unsigned long long pfn, int flags,
+   struct dev_pagemap *pgmap)
+{
+   struct page *page = pfn_to_page(pfn);
+   LIST_HEAD(to_kill);
+   dax_entry_t cookie;
+   int rc = 0;
+
+   /*
+* Pages instantiated by device-dax (not filesystem-dax)
+* may be compound pages.
+*/
+   page = compound_head(page);
+
+   /*
+* Prevent the inode from being freed while we are interrogating
+* the address_space, typically this would be handled by
+* lock_page(), but dax pages do not use the page lock. This
+* also prevents changes to the mapping of this pfn until
+* poison signaling is complete.
+*/
+   cookie = dax_lock_page(page);
+   if (!cookie)
+   return -EBUSY;
+
+   if (hwpoison_filter(page)) {
+   rc = -EOPNOTSUPP;
+   goto unlock;
+   }
+
+   switch (pgmap->type) {
+   case MEMORY_DEVICE_PRIVATE:
+   case MEMORY_DEVICE_COHERENT:
+   /*
+* TODO: Handle device pages which may need coordination
+* with device-side memory.
+*/
+   rc = -ENXIO;
+   goto unlock;
+   default:
+   break;
+   }
+
+   /*
+* Use this flag as an indication that the dax page has been
+* remapped UC to prevent speculative consumption of poison.
+*/
+   SetPageHWPoison(page);
+
+   /*
+* Unlike System-RAM there is no possibility to swap in a
+* different physical page at a given virtual address, so all
+* userspace consumption of ZONE_DEVICE memory necessitates
+* SIGBUS (i.e. MF_MUST_KILL)
+*/
+   flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+   collect_procs(page, &to_kill, true);
+
+   unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
+unlock:
+   dax_unlock_page(page, cookie);
+   return rc;
+}
+
 /*
  * Called from hugetlb code with hugetlb_lock held.
  *
@@ -1636,12 +1725,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, 
int flags,
struct dev_pagemap *pgmap)
 {
struct page *page = pfn_to_page(pfn);
-   unsigned long size = 0;
-   struct to_kill *tk;
-   LIST_HEAD(tokill);
-   int rc = -EBUSY;
-   loff_t start;
-   dax_entry_t cookie;
+   int rc = -ENXIO;
 
if (flags & MF_COUNT_INCREASED)
/*
@@ -1650,77 +1734,10 @@ static int memory_failure_dev_pagemap(unsigned long 
pfn, int flags,
put_page(page);
 
/* device metadata space is not recoverable */
-   if (!pgmap_pfn_valid(pgmap, pfn)) {
-   rc = -ENXIO;
-   goto out;
-   }
-
-   /*
-* Pages instantiated by device-dax (not filesystem-dax)
-* may be compound pages.
-*/
-   page = compound_head(page);
-
-   /*
-* Prevent the inode from being freed while we are interrogating
-* the address_space, typically this would be handled by
-* lock_page(), but dax pages do not use the page lock. This
-* also prevents changes to the mapping of this pfn until
-* poison signaling is complete.
-*/
-   cookie = dax_lock_page(page);
-   if (!cookie)

Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-06-02 Thread Shiyang Ruan




On 2022/6/3 9:07, Shiyang Ruan wrote:



On 2022/6/3 2:56, Andrew Morton wrote:
On Sun, 8 May 2022 22:36:06 +0800 Shiyang Ruan 
 wrote:



This is a combination of two patchsets:
  1.fsdax-rmap: 
https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.f...@fujitsu.com/ 

  2.fsdax-reflink: 
https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.f...@fujitsu.com/ 



I'm getting lost in conflicts trying to get this merged up.  Mainly
memory-failure.c due to patch series "mm, hwpoison: enable 1GB hugepage
support".

Could you please take a look at what's in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm a few hours from
now?  Or the next linux-next.


OK, let me rebase this patchset on your mm-unstable branch.


--
Thanks,
Ruan.



And I suggest that converting it all into a single 14-patch series
would be more straightforward.


The patchset in this thread is the 14-patch series.  I have solved many
conflicts.  It's an updated / newest version, and a combination of the 2
URLs quoted above.  In other words, instead of using these two:


 >> This is a combination of two patchsets:
 >>   1.fsdax-rmap: https://...
 >>   2.fsdax-reflink: https://...

you could take this (the url of the current thread):
https://lore.kernel.org/linux-xfs/20220508143620.1775214-1-ruansy.f...@fujitsu.com/ 



My description misled you.  Sorry for that.


--
Thanks,
Ruan.



Thanks.










Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-06-02 Thread Shiyang Ruan




On 2022/6/3 2:56, Andrew Morton wrote:

On Sun, 8 May 2022 22:36:06 +0800 Shiyang Ruan  wrote:


This is a combination of two patchsets:
  1.fsdax-rmap: 
https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.f...@fujitsu.com/
  2.fsdax-reflink: 
https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.f...@fujitsu.com/


I'm getting lost in conflicts trying to get this merged up.  Mainly
memory-failure.c due to patch series "mm, hwpoison: enable 1GB hugepage
support".

Could you please take a look at what's in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm a few hours from
now?  Or the next linux-next.

And I suggest that converting it all into a single 14-patch series
would be more straightforward.


The patchset in this thread is the 14-patch series.  I have solved many
conflicts.  It's an updated / newest version, and a combination of the 2
URLs quoted above.  In other words, instead of using these two:


>> This is a combination of two patchsets:
>>   1.fsdax-rmap: https://...
>>   2.fsdax-reflink: https://...

you could take this (the url of the current thread):
https://lore.kernel.org/linux-xfs/20220508143620.1775214-1-ruansy.f...@fujitsu.com/

My description misled you.  Sorry for that.


--
Thanks,
Ruan.



Thanks.






Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-06-02 Thread Shiyang Ruan

Hi,

Is there any other work I should do with these two patchsets?  I think 
they are good for now.  So... since the 5.19-rc1 is coming, could the 
notify_failure() part be merged as your plan?



--
Thanks,
Ruan.


On 2022/5/12 20:27, Shiyang Ruan wrote:



On 2022/5/11 23:46, Dan Williams wrote:
On Wed, May 11, 2022 at 8:21 AM Darrick J. Wong  
wrote:


On Tue, May 10, 2022 at 10:24:28PM -0700, Andrew Morton wrote:
On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong" 
 wrote:



On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
On Tue, 10 May 2022 18:55:50 -0700 Dan Williams 
 wrote:



It'll need to be a stable branch somewhere, but I don't think it
really matters where as long as it's merged into the xfs for-next
tree so it gets filesystem test coverage...


So how about letting the notify_failure() bits go through -mm this
cycle,
if Andrew will have it, and then the reflink work has a clean
v5.19-rc1

baseline to build from?


What are we referring to here?  I think a minimal thing would be the
memremap.h and memory-failure.c changes from
https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.f...@fujitsu.com 
?


Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
would probably be straining things to slip it into 5.19.

The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
right thing, but it's a networking errno.  I suppose livable with 
if it

never escapes the kernel, but if it can get back to userspace then a
user would be justified in wondering how the heck a filesystem
operation generated a networking errno?


 most filesystems return EOPNOTSUPP rather enthusiastically 
when

they don't know how to do something...


Can it propagate back to userspace?


AFAICT, the new code falls back to the current (mf_generic_kill_procs)
failure code if the filesystem doesn't provide a ->memory_failure
function or if it returns -EOPNOTSUPP.  mf_generic_kill_procs can also
return -EOPNOTSUPP, but all the memory_failure() callers (madvise, etc.)
convert that to 0 before returning it to userspace.

I suppose the weirder question is going to be what happens when madvise
starts returning filesystem errors like EIO or EFSCORRUPTED when pmem
loses half its brains and even the fs can't deal with it.


Even then that notification is not in a system call context so it
would still result in a SIGBUS notification not a EOPNOTSUPP return
code. The only potential gap I see are what are the possible error
codes that MADV_SOFT_OFFLINE might see? The man page is silent on soft
offline failure codes. Shiyang, that's something to check / update if
necessary.


According to the code around MADV_SOFT_OFFLINE, it will return -EIO when 
the backend is NVDIMM.


Here is the logic:
  madvise_inject_error() {
      ...
      if (MADV_SOFT_OFFLINE) {
          ret = soft_offline_page() {
              ...
              /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
              page = pfn_to_online_page(pfn);
              if (!page) {
                  put_ref_page(ref_page);
                  return -EIO;
              }
              ...
          }
      } else {
          ret = memory_failure()
      }
      return ret
  }


--
Thanks,
Ruan.








[RFC PATCH v2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-05-19 Thread Shiyang Ruan
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem (or mapped device) on it to unmap all files in use and notify
the processes that are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()   # was pmem driver ->remove() in v1
-> kill_dax()
 -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE)
  -> xfs_dax_notify_failure()

Introduce MF_MEM_REMOVE to let the filesystem know this is a remove
event, so it does not shut down the filesystem directly if something is
not supported, or if the failure range includes a metadata area.  Make
sure all files and processes are handled correctly.
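
A holder-side sketch of the intended policy split (the function is
hypothetical; the real XFS logic is in the diff below):

```
static int my_fs_notify_failure(struct dax_device *dax_dev,
				u64 offset, u64 len, int mf_flags)
{
	if (mf_flags & MF_MEM_REMOVE) {
		/*
		 * Whole-device removal: unmap all files and notify the
		 * processes using them, but do not force a shutdown as
		 * if the on-disk metadata were corrupt.
		 */
		return 0;
	}
	/*
	 * Media error: shut down with SHUTDOWN_CORRUPT_ONDISK if the
	 * failed range covers metadata, as before.
	 */
	return 0;
}
```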

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
[2]: 
https://lore.kernel.org/linux-xfs/20220508143620.1775214-1-ruansy.f...@fujitsu.com/

Signed-off-by: Shiyang Ruan 

==
Changes since v1:
  1. Drop the needless change of moving {kill,put}_dax()
  2. Rebased on '[PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink'[2]

---
 drivers/dax/super.c | 2 +-
 fs/xfs/xfs_notify_failure.c | 6 +-
 include/linux/mm.h  | 1 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 5ddb159c4653..44ca3b488e2a 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -313,7 +313,7 @@ void kill_dax(struct dax_device *dax_dev)
return;
 
if (dax_dev->holder_data != NULL)
-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE);
 
	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index aa8dc27c599c..91d3f05d4241 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -73,7 +73,9 @@ xfs_dax_failure_fn(
struct failure_info *notify = data;
int error = 0;
 
-   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
+   /* Do not shutdown so early when device is to be removed */
+   if (!(notify->mf_flags & MF_MEM_REMOVE) ||
+   XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -182,6 +184,8 @@ xfs_dax_notify_failure(
 
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_REMOVE)
+   return -EOPNOTSUPP;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e2c0f69f0660..ebcb5a7f3295 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3226,6 +3226,7 @@ enum mf_flags {
MF_MUST_KILL = 1 << 2,
MF_SOFT_OFFLINE = 1 << 3,
MF_UNPOISON = 1 << 4,
+   MF_MEM_REMOVE = 1 << 5,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
  unsigned long count, int mf_flags);
-- 
2.35.1






Re: [RFC PATCH] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-05-19 Thread Shiyang Ruan




On 2022/4/11 15:06, Christoph Hellwig wrote:

On Mon, Apr 11, 2022 at 01:16:23AM +0800, Shiyang Ruan wrote:

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bd502957cfdf..72d9e69aea98 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -359,7 +359,6 @@ static void pmem_release_disk(void *__pmem)
struct pmem_device *pmem = __pmem;
  
  	dax_remove_host(pmem->disk);

-   kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);
del_gendisk(pmem->disk);
  
@@ -597,6 +596,8 @@ static void nd_pmem_remove(struct device *dev)

pmem->bb_state = NULL;
}
nvdimm_flush(to_nd_region(dev->parent), NULL);
+
+   kill_dax(pmem->dax_dev);


I think the put_dax will have to move as well.


After reading the implementation of 'devm_add_action_or_reset()', I 
think there is no need to move kill_dax() and put_dax() into ->remove().


In unbind, it will call both drv->remove() and devres_release_all(). 
The action, pmem_release_disk(), added in devm_add_action_or_reset() 
will be executed in devres_release_all().  So, during the unbind process,
{kill,put}_dax() will finally be called to notify the REMOVE signal.


In addition, if devm_add_action_or_reset() fails in pmem_attach_disk(), 
pmem_release_disk() will be called to cleanup the pmem->dax_dev.



--
Thanks,
Ruan.



This part should probably also be a separate, well-documented
cleanup patch.






Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-05-12 Thread Shiyang Ruan




On 2022/5/11 23:46, Dan Williams wrote:

On Wed, May 11, 2022 at 8:21 AM Darrick J. Wong  wrote:


On Tue, May 10, 2022 at 10:24:28PM -0700, Andrew Morton wrote:

On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong"  wrote:


On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:

On Tue, 10 May 2022 18:55:50 -0700 Dan Williams  
wrote:


It'll need to be a stable branch somewhere, but I don't think it
really matters where as long as it's merged into the xfs for-next
tree so it gets filesystem test coverage...


So how about letting the notify_failure() bits go through -mm this cycle,
if Andrew will have it, and then the reflink work has a clean v5.19-rc1
baseline to build from?


What are we referring to here?  I think a minimal thing would be the
memremap.h and memory-failure.c changes from
https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.f...@fujitsu.com ?

Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
would probably be straining things to slip it into 5.19.

The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
right thing, but it's a networking errno.  I suppose livable with if it
never escapes the kernel, but if it can get back to userspace then a
user would be justified in wondering how the heck a filesystem
operation generated a networking errno?


 most filesystems return EOPNOTSUPP rather enthusiastically when
they don't know how to do something...


Can it propagate back to userspace?


AFAICT, the new code falls back to the current (mf_generic_kill_procs)
failure code if the filesystem doesn't provide a ->memory_failure
function or if it returns -EOPNOTSUPP.  mf_generic_kill_procs can also
return -EOPNOTSUPP, but all the memory_failure() callers (madvise, etc.)
convert that to 0 before returning it to userspace.

I suppose the weirder question is going to be what happens when madvise
starts returning filesystem errors like EIO or EFSCORRUPTED when pmem
loses half its brains and even the fs can't deal with it.


Even then that notification is not in a system call context so it
would still result in a SIGBUS notification not a EOPNOTSUPP return
code. The only potential gap I see are what are the possible error
codes that MADV_SOFT_OFFLINE might see? The man page is silent on soft
offline failure codes. Shiyang, that's something to check / update if
necessary.


According to the code around MADV_SOFT_OFFLINE, it will return -EIO when 
the backend is NVDIMM.


Here is the logic:
  madvise_inject_error() {
      ...
      if (MADV_SOFT_OFFLINE) {
          ret = soft_offline_page() {
              ...
              /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
              page = pfn_to_online_page(pfn);
              if (!page) {
                  put_ref_page(ref_page);
                  return -EIO;
              }
              ...
          }
      } else {
          ret = memory_failure()
      }
      return ret
  }


--
Thanks,
Ruan.





[PATCH v11.1 06/07] xfs: support CoW in fsdax mode

2022-05-10 Thread Shiyang Ruan
In fsdax mode, WRITE and ZERO on a shared extent need CoW to be
performed.  After that, the newly allocated extents need to be remapped
to the file.
So, add a CoW identification in ->iomap_begin(), and implement
->iomap_end() to do the remapping work.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
---
 fs/xfs/xfs_file.c  | 33 -
 fs/xfs/xfs_iomap.c | 30 +-
 fs/xfs/xfs_iomap.h |  1 +
 3 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index af954a5b71f8..fe9f92586acf 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -25,6 +25,7 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 
+#include <linux/dax.h>
#include <linux/falloc.h>
#include <linux/backing-dev.h>
#include <linux/mman.h>
@@ -669,7 +670,7 @@ xfs_file_dax_write(
pos = iocb->ki_pos;
 
trace_xfs_file_dax_write(iocb, from);
-   ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
+   ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
i_size_write(inode, iocb->ki_pos);
error = xfs_setfilesize(ip, pos, ret);
@@ -1254,6 +1255,31 @@ xfs_file_llseek(
return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
 }
 
+#ifdef CONFIG_FS_DAX
+int
+xfs_dax_fault(
+   struct vm_fault *vmf,
+   enum page_entry_sizepe_size,
+   boolwrite_fault,
+   pfn_t   *pfn)
+{
+   return dax_iomap_fault(vmf, pe_size, pfn, NULL,
+   (write_fault && !vmf->cow_page) ?
+   &xfs_dax_write_iomap_ops :
+   &xfs_read_iomap_ops);
+}
+#else
+int
+xfs_dax_fault(
+   struct vm_fault *vmf,
+   enum page_entry_sizepe_size,
+   boolwrite_fault,
+   pfn_t   *pfn)
+{
+   return 0;
+}
+#endif
+
 /*
  * Locking for serialisation of IO during page faults. This results in a lock
  * ordering of:
@@ -1285,10 +1311,7 @@ __xfs_filemap_fault(
pfn_t pfn;
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-   ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
-   (write_fault && !vmf->cow_page) ?
-    &xfs_direct_write_iomap_ops :
-    &xfs_read_iomap_ops);
+   ret = xfs_dax_fault(vmf, pe_size, write_fault, &pfn);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5a393259a3a3..4c07f5e718fb 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -773,7 +773,8 @@ xfs_direct_write_iomap_begin(
 
/* may drop and re-acquire the ilock */
	error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-   &lockmode, flags & IOMAP_DIRECT);
+   &lockmode,
+   (flags & IOMAP_DIRECT) || IS_DAX(inode));
if (error)
goto out_unlock;
if (shared)
@@ -867,6 +868,33 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
.iomap_begin= xfs_direct_write_iomap_begin,
 };
 
+static int
+xfs_dax_write_iomap_end(
+   struct inode*inode,
+   loff_t  pos,
+   loff_t  length,
+   ssize_t written,
+   unsignedflags,
+   struct iomap*iomap)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+
+   if (!xfs_is_cow_inode(ip))
+   return 0;
+
+   if (!written) {
+   xfs_reflink_cancel_cow_range(ip, pos, length, true);
+   return 0;
+   }
+
+   return xfs_reflink_end_cow(ip, pos, written);
+}
+
+const struct iomap_ops xfs_dax_write_iomap_ops = {
+   .iomap_begin= xfs_direct_write_iomap_begin,
+   .iomap_end  = xfs_dax_write_iomap_end,
+};
+
 static int
 xfs_buffered_write_iomap_begin(
struct inode*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index e88dc162c785..c782e8c0479c 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -51,5 +51,6 @@ extern const struct iomap_ops xfs_direct_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
+extern const struct iomap_ops xfs_dax_write_iomap_ops;
 
 #endif /* __XFS_IOMAP_H__*/
-- 
2.35.1






Re: [PATCH v11 06/07] xfs: support CoW in fsdax mode

2022-05-10 Thread Shiyang Ruan




On 2022/5/10 13:45, Christoph Hellwig wrote:

+#ifdef CONFIG_FS_DAX
+int
+xfs_dax_fault(
+   struct vm_fault *vmf,
+   enum page_entry_sizepe_size,
+   boolwrite_fault,
+   pfn_t   *pfn)
+{
+   return dax_iomap_fault(vmf, pe_size, pfn, NULL,
+   (write_fault && !vmf->cow_page) ?
+   &xfs_dax_write_iomap_ops :
+   &xfs_read_iomap_ops);
+}
+#endif


Is there any reason this is in xfs_iomap.c and not xfs_file.c?


Yes, it's better to put it in xfs_file.c since it's the only caller.  I
didn't notice it...



--
Thanks,
Ruan.



Otherwise the patch looks good:


Reviewed-by: Christoph Hellwig 






[PATCH v14 05/07] mm: Introduce mf_dax_kill_procs() for fsdax case

2022-05-08 Thread Shiyang Ruan
This new function is a variant of mf_generic_kill_procs that accepts a
file, offset pair instead of a struct to support multiple files sharing
a DAX mapping.  It is intended to be called by the file systems as part
of the memory_failure handler after the file system performed a reverse
mapping from the storage address to the file and file offset.
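
A sketch of the intended call site, assuming the filesystem's reverse
mapping has already resolved the failed range to one file (the
surrounding helper name is hypothetical):

```
/*
 * Hypothetical fs-side fragment: after rmap resolved the failed range
 * to (mapping, index, count), kill/notify the processes mapping it.
 * mf_dax_kill_procs() locks each dax entry, collects the tasks via the
 * file's i_mmap tree and sends SIGBUS.
 */
static int my_fs_handle_failed_file(struct address_space *mapping,
				    pgoff_t index, unsigned long count,
				    int mf_flags)
{
	return mf_dax_kill_procs(mapping, index, count, mf_flags);
}
```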

Signed-off-by: Shiyang Ruan 
Reviewed-by: Dan Williams 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Miaohe Lin 
---
 include/linux/mm.h  |  2 +
 mm/memory-failure.c | 96 -
 2 files changed, 88 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index de32c0383387..e2c0f69f0660 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3227,6 +3227,8 @@ enum mf_flags {
MF_SOFT_OFFLINE = 1 << 3,
MF_UNPOISON = 1 << 4,
 };
+int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
+ unsigned long count, int mf_flags);
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index aeb19593af9c..aedfc5097420 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -295,10 +295,9 @@ void shake_page(struct page *p)
 }
 EXPORT_SYMBOL_GPL(shake_page);
 
-static unsigned long dev_pagemap_mapping_shift(struct page *page,
-   struct vm_area_struct *vma)
+static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
+   unsigned long address)
 {
-   unsigned long address = vma_address(page, vma);
unsigned long ret = 0;
pgd_t *pgd;
p4d_t *p4d;
@@ -338,10 +337,14 @@ static unsigned long dev_pagemap_mapping_shift(struct 
page *page,
 /*
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ *
+ * Notice: @fsdax_pgoff is used only when @p is a fsdax page.
+ *   In other cases, such as anonymous and file-backed pages, the address to
+ *   be killed can be calculated from @p itself.
  */
 static void add_to_kill(struct task_struct *tsk, struct page *p,
-  struct vm_area_struct *vma,
-  struct list_head *to_kill)
+   pgoff_t fsdax_pgoff, struct vm_area_struct *vma,
+   struct list_head *to_kill)
 {
struct to_kill *tk;
 
@@ -352,9 +355,15 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
}
 
tk->addr = page_address_in_vma(p, vma);
-   if (is_zone_device_page(p))
-   tk->size_shift = dev_pagemap_mapping_shift(p, vma);
-   else
+   if (is_zone_device_page(p)) {
+   /*
+    * Since page->mapping is not used for fsdax, we need to
+    * calculate the address based on the vma.
+*/
+   if (p->pgmap->type == MEMORY_DEVICE_FS_DAX)
+   tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma);
+   tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
+   } else
tk->size_shift = page_shift(compound_head(p));
 
/*
@@ -503,7 +512,7 @@ static void collect_procs_anon(struct page *page, struct 
list_head *to_kill,
if (!page_mapped_in_vma(page, vma))
continue;
if (vma->vm_mm == t->mm)
-   add_to_kill(t, page, vma, to_kill);
+   add_to_kill(t, page, 0, vma, to_kill);
}
}
	read_unlock(&tasklist_lock);
@@ -539,13 +548,41 @@ static void collect_procs_file(struct page *page, struct 
list_head *to_kill,
 * to be informed of all such data corruptions.
 */
if (vma->vm_mm == t->mm)
-   add_to_kill(t, page, vma, to_kill);
+   add_to_kill(t, page, 0, vma, to_kill);
}
}
	read_unlock(&tasklist_lock);
i_mmap_unlock_read(mapping);
 }
 
+#ifdef CONFIG_FS_DAX
+/*
+ * Collect processes when the error hit a fsdax page.
+ */
+static void collect_procs_fsdax(struct page *page,
+   struct address_space *mapping, pgoff_t pgoff,
+   struct list_head *to_kill)
+{
+   struct vm_area_struct *vma;
+   struct task_struct *tsk;
+
+   i_mmap_lock_read(mapping);
+   read_lock(&tasklist_lock);
+   for_each_process(tsk) {
+   struct task_struct *t = task_early_kill(tsk, true);
+
+   if (!t)
+   continue;
+   vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+   if (vma->vm_mm == t->mm)
+   

[PATCH v14 06/07] xfs: Implement ->notify_failure() for XFS

2022-05-08 Thread Shiyang Ruan
Introduce xfs_notify_failure.c to handle failure-related work, such as
implementing ->notify_failure() and registering/unregistering the dax
holder in xfs.

If the rmap feature of XFS is enabled, we can query it to find the files
and metadata which are associated with the corrupt data.  For now all we
do is kill processes with that file mapped into their address spaces,
but future patches could actually do something about corrupt metadata.

After that, the memory-failure handling needs to notify the processes
that are using those files.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Darrick J. Wong 
---
 fs/xfs/Makefile |   5 +
 fs/xfs/xfs_buf.c|  11 +-
 fs/xfs/xfs_fsops.c  |   3 +
 fs/xfs/xfs_mount.h  |   1 +
 fs/xfs/xfs_notify_failure.c | 220 
 fs/xfs/xfs_super.h  |   1 +
 6 files changed, 238 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/xfs_notify_failure.c

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 04611a1068b4..09f5560e29f2 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -128,6 +128,11 @@ xfs-$(CONFIG_SYSCTL)   += xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)   += xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)   += xfs_pnfs.o
 
+# notify failure
+ifeq ($(CONFIG_MEMORY_FAILURE),y)
+xfs-$(CONFIG_FS_DAX)   += xfs_notify_failure.o
+endif
+
 # online scrub/repair
 ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
 
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 63b8d9b5096f..93e4248240fc 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -5,6 +5,7 @@
  */
 #include "xfs.h"
#include <linux/backing-dev.h>
+#include <linux/dax.h>
 
 #include "xfs_shared.h"
 #include "xfs_format.h"
@@ -1911,7 +1912,7 @@ xfs_free_buftarg(
	list_lru_destroy(&btp->bt_lru);
 
blkdev_issue_flush(btp->bt_bdev);
-   fs_put_dax(btp->bt_daxdev, NULL);
+   fs_put_dax(btp->bt_daxdev, btp->bt_mount);
 
kmem_free(btp);
 }
@@ -1958,14 +1959,18 @@ xfs_alloc_buftarg(
struct block_device *bdev)
 {
xfs_buftarg_t   *btp;
+   const struct dax_holder_operations *ops = NULL;
 
+#if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
+   ops = &xfs_dax_holder_operations;
+#endif
btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
 
btp->bt_mount = mp;
btp->bt_dev =  bdev->bd_dev;
btp->bt_bdev = bdev;
-   btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, NULL,
-   NULL);
+   btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off,
+   mp, ops);
 
/*
 * Buffer IO error rate limiting. Limit it to no more than 10 messages
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 39e75d11..ea9159967eaa 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -533,6 +533,9 @@ xfs_do_force_shutdown(
} else if (flags & SHUTDOWN_CORRUPT_INCORE) {
tag = XFS_PTAG_SHUTDOWN_CORRUPT;
why = "Corruption of in-memory data";
+   } else if (flags & SHUTDOWN_CORRUPT_ONDISK) {
+   tag = XFS_PTAG_SHUTDOWN_CORRUPT;
+   why = "Corruption of on-disk metadata";
} else {
tag = XFS_PTAG_SHUTDOWN_IOERROR;
why = "Metadata I/O Error";
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8c42786e4942..540924b9e583 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -438,6 +438,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, uint32_t 
flags, char *fname,
 #define SHUTDOWN_LOG_IO_ERROR  (1u << 1) /* write attempt to the log failed */
 #define SHUTDOWN_FORCE_UMOUNT  (1u << 2) /* shutdown from a forced unmount */
 #define SHUTDOWN_CORRUPT_INCORE(1u << 3) /* corrupt in-memory 
structures */
+#define SHUTDOWN_CORRUPT_ONDISK(1u << 4)  /* corrupt metadata on 
device */
 
 #define XFS_SHUTDOWN_STRINGS \
{ SHUTDOWN_META_IO_ERROR,   "metadata_io" }, \
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
new file mode 100644
index ..aa8dc27c599c
--- /dev/null
+++ b/fs/xfs/xfs_notify_failure.c
@@ -0,0 +1,220 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022 Fujitsu.  All Rights Reserved.
+ */
+
+#include "xfs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_alloc.h"
+#include "xfs_bit.h"
+#include "xfs_btree.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_rtalloc.h"
+#include "xfs_trans.h"
+
+#include <linux/mm.h>
+#include <linux/dax.h>

[PATCH v11 04/07] fsdax: Add dax_iomap_cow_copy() for dax zero

2022-05-08 Thread Shiyang Ruan
Punching a hole on a reflinked file needs dax_iomap_cow_copy() too.
Otherwise, data in the unaligned area will not be correct.  So, add the
CoW operation for the unaligned case in dax_memzero().

Signed-off-by: Shiyang Ruan 
Reviewed-by: Ritesh Harjani 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 

==
This patch changed a lot when rebasing to next-20220504 branch.  Though it
has been tested by myself, I think it needs a re-review.
==
---
 fs/dax.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 78e26204697b..b3aa863e9fec 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1220,17 +1220,27 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
*xas, struct vm_fault *vmf,
 }
 #endif /* CONFIG_FS_DAX_PMD */
 
-static int dax_memzero(struct dax_device *dax_dev, pgoff_t pgoff,
-   unsigned int offset, size_t size)
+static int dax_memzero(struct iomap_iter *iter, loff_t pos, size_t size)
 {
+   const struct iomap *iomap = &iter->iomap;
+   const struct iomap *srcmap = iomap_iter_srcmap(iter);
+   unsigned offset = offset_in_page(pos);
+   pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
void *kaddr;
long ret;
 
-   ret = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
-   if (ret > 0) {
-   memset(kaddr + offset, 0, size);
-   dax_flush(dax_dev, kaddr + offset, size);
-   }
+   ret = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
+   if (ret < 0)
+   return ret;
+   memset(kaddr + offset, 0, size);
+   if (srcmap->addr != iomap->addr) {
+   ret = dax_iomap_cow_copy(pos, size, PAGE_SIZE, srcmap,
+kaddr);
+   if (ret < 0)
+   return ret;
+   dax_flush(iomap->dax_dev, kaddr, PAGE_SIZE);
+   } else
+   dax_flush(iomap->dax_dev, kaddr + offset, size);
return ret;
 }
 
@@ -1257,7 +1267,7 @@ static s64 dax_zero_iter(struct iomap_iter *iter, bool 
*did_zero)
if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
else
-   rc = dax_memzero(iomap->dax_dev, pgoff, offset, size);
+   rc = dax_memzero(iter, pos, size);
dax_read_unlock(id);
 
if (rc < 0)
-- 
2.35.1






[PATCH v14 01/07] dax: Introduce holder for dax_device

2022-05-08 Thread Shiyang Ruan
To easily track the filesystem from a pmem device, we introduce a holder
for the dax_device structure, along with its operations.  This holder is
used to remember who is using this dax_device:
 - When it is the backend of a filesystem, the holder will be the
   instance of this filesystem.
 - When this pmem device is one of the targets in a mapped device, the
   holder will be this mapped device.  In this case, the mapped device
   has its own dax_device which follows the first rule, so we can
   finally track down to the filesystem we need.

The holder and holder_ops will be set when the filesystem is being
mounted, or when a target device is being activated.
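
A filesystem-side usage sketch of the new interface (mount-time
registration and unmount-time release; the my_fs_* names are
illustrative, not part of this patch):

```
static const struct dax_holder_operations my_fs_holder_ops = {
	.notify_failure = my_fs_notify_failure,	/* see later patches */
};

/* Hypothetical mount path: register the superblock as the holder. */
static int my_fs_setup_dax(struct super_block *sb,
			   struct block_device *bdev)
{
	u64 start_off;
	struct dax_device *dax_dev;

	dax_dev = fs_dax_get_by_bdev(bdev, &start_off, sb,
				     &my_fs_holder_ops);
	if (!dax_dev)
		return -ENODEV;	/* no dax, or already held by someone */
	/* remember dax_dev and start_off in the fs private data... */
	return 0;
}

/* Hypothetical unmount path: drop the holder, then the reference. */
static void my_fs_release_dax(struct super_block *sb,
			      struct dax_device *dax_dev)
{
	fs_put_dax(dax_dev, sb);
}
```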

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dan Williams 
Reviewed-by: Darrick J. Wong 
---
 drivers/dax/super.c | 67 -
 drivers/md/dm.c |  2 +-
 fs/erofs/super.c| 13 +
 fs/ext2/super.c |  7 +++--
 fs/ext4/super.c |  9 +++---
 fs/xfs/xfs_buf.c|  5 ++--
 include/linux/dax.h | 33 --
 7 files changed, 112 insertions(+), 24 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0211e6f7b47a..5ddb159c4653 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -22,6 +22,8 @@
  * @private: dax driver private data
  * @flags: state and boolean properties
  * @ops: operations for this device
+ * @holder_data: holder of a dax_device: could be filesystem or mapped device
+ * @holder_ops: operations for the inner holder
  */
 struct dax_device {
struct inode inode;
@@ -29,6 +31,8 @@ struct dax_device {
void *private;
unsigned long flags;
const struct dax_operations *ops;
+   void *holder_data;
+   const struct dax_holder_operations *holder_ops;
 };
 
 static dev_t dax_devt;
@@ -71,8 +75,11 @@ EXPORT_SYMBOL_GPL(dax_remove_host);
  * fs_dax_get_by_bdev() - temporary lookup mechanism for filesystem-dax
  * @bdev: block device to find a dax_device for
  * @start_off: returns the byte offset into the dax_device that @bdev starts
+ * @holder: filesystem or mapped device inside the dax_device
+ * @ops: operations for the inner holder
  */
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 
*start_off)
+struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 
*start_off,
+   void *holder, const struct dax_holder_operations *ops)
 {
struct dax_device *dax_dev;
u64 part_size;
@@ -92,11 +99,26 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device 
*bdev, u64 *start_off)
	dax_dev = xa_load(&dax_hosts, (unsigned long)bdev->bd_disk);
	if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
dax_dev = NULL;
+   else if (holder) {
+   if (!cmpxchg(&dax_dev->holder_data, NULL, holder))
+   dax_dev->holder_ops = ops;
+   else
+   dax_dev = NULL;
+   }
dax_read_unlock(id);
 
return dax_dev;
 }
 EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
+
+void fs_put_dax(struct dax_device *dax_dev, void *holder)
+{
+   if (dax_dev && holder &&
+   cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
+   dax_dev->holder_ops = NULL;
+   put_dax(dax_dev);
+}
+EXPORT_SYMBOL_GPL(fs_put_dax);
 #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
 
 enum dax_device_flags {
@@ -194,6 +216,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, 
pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
 
+int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
+ u64 len, int mf_flags)
+{
+   int rc, id;
+
+   id = dax_read_lock();
+   if (!dax_alive(dax_dev)) {
+   rc = -ENXIO;
+   goto out;
+   }
+
+   if (!dax_dev->holder_ops) {
+   rc = -EOPNOTSUPP;
+   goto out;
+   }
+
+   rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
+out:
+   dax_read_unlock(id);
+   return rc;
+}
+EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
@@ -267,8 +312,15 @@ void kill_dax(struct dax_device *dax_dev)
if (!dax_dev)
return;
 
+   if (dax_dev->holder_data != NULL)
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+
	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
	synchronize_srcu(&dax_srcu);
+
+   /* clear holder data */
+   dax_dev->holder_ops = NULL;
+   dax_dev->holder_data = NULL;
 }
 EXPORT_SYMBOL_GPL(kill_dax);
 
@@ -410,6 +462,19 @@ void put_dax(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(put_dax);
 
+/**
+ * dax_holder() - obtain the holder of a dax device
+ * @dax_dev: a dax_device instance
+
+ * Return: the holder's data

[PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-05-08 Thread Shiyang Ruan
This is a combination of two patchsets:
 1.fsdax-rmap: 
https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.f...@fujitsu.com/
 2.fsdax-reflink: 
https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.f...@fujitsu.com/

 Changes since v13 of fsdax-rmap:
  1. Fixed mistakes during rebasing code to latest next-
  2. Rebased to next-20220504

 Changes since v10 of fsdax-reflink:
  1. Rebased to next-20220504 and fsdax-rmap
  2. Dropped a needless cleanup patch: 'fsdax: Convert dax_iomap_zero to
  iter model'
  3. Fixed many conflicts during rebasing
  4. Fixed a dedupe bug in Patch 05: the actual length to compare could be
  shorter than smap->length or dmap->length.
  PS: There are many changes during rebasing.  I think it's better to
  review again.

==
Shiyang Ruan (14):
  fsdax-rmap:
dax: Introduce holder for dax_device
mm: factor helpers for memory_failure_dev_pagemap
pagemap,pmem: Introduce ->memory_failure()
fsdax: Introduce dax_lock_mapping_entry()
mm: Introduce mf_dax_kill_procs() for fsdax case
xfs: Implement ->notify_failure() for XFS
fsdax: set a CoW flag when associate reflink mappings
  fsdax-reflink:
fsdax: Output address in dax_iomap_pfn() and rename it
fsdax: Introduce dax_iomap_cow_copy()
fsdax: Replace mmap entry in case of CoW
fsdax: Add dax_iomap_cow_copy() for dax zero
fsdax: Dedup file range to use a compare function
xfs: support CoW in fsdax mode
xfs: Add dax dedupe support

 drivers/dax/super.c |  67 +-
 drivers/md/dm.c |   2 +-
 drivers/nvdimm/pmem.c   |  17 ++
 fs/dax.c| 398 ++--
 fs/erofs/super.c|  13 +-
 fs/ext2/super.c |   7 +-
 fs/ext4/super.c |   9 +-
 fs/remap_range.c|  31 ++-
 fs/xfs/Makefile |   5 +
 fs/xfs/xfs_buf.c|  10 +-
 fs/xfs/xfs_file.c   |   9 +-
 fs/xfs/xfs_fsops.c  |   3 +
 fs/xfs/xfs_inode.c  |  69 ++-
 fs/xfs/xfs_inode.h  |   1 +
 fs/xfs/xfs_iomap.c  |  46 -
 fs/xfs/xfs_iomap.h  |   3 +
 fs/xfs/xfs_mount.h  |   1 +
 fs/xfs/xfs_notify_failure.c | 220 
 fs/xfs/xfs_reflink.c|  12 +-
 fs/xfs/xfs_super.h  |   1 +
 include/linux/dax.h |  56 -
 include/linux/fs.h  |  12 +-
 include/linux/memremap.h|  12 ++
 include/linux/mm.h  |   2 +
 include/linux/page-flags.h  |   6 +
 mm/memory-failure.c | 257 ---
 26 files changed, 1087 insertions(+), 182 deletions(-)
 create mode 100644 fs/xfs/xfs_notify_failure.c

-- 
2.35.1






[PATCH v14 02/07] mm: factor helpers for memory_failure_dev_pagemap

2022-05-08 Thread Shiyang Ruan
The memory_failure_dev_pagemap code is a bit complex before the RMAP
feature for fsdax is introduced.  So factor out some helper functions to
simplify the code.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dan Williams 
Reviewed-by: Miaohe Lin 
---
 mm/memory-failure.c | 159 
 1 file changed, 88 insertions(+), 71 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d9343cb28a58..9e4728103c0d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1495,6 +1495,91 @@ static int try_to_split_thp_page(struct page *page, 
const char *msg)
return 0;
 }
 
+static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
+   struct address_space *mapping, pgoff_t index, int flags)
+{
+   struct to_kill *tk;
+   unsigned long size = 0;
+
+   list_for_each_entry(tk, to_kill, nd)
+   if (tk->size_shift)
+   size = max(size, 1UL << tk->size_shift);
+
+   if (size) {
+   /*
+* Unmap the largest mapping to avoid breaking up device-dax
+* mappings which are constant size. The actual size of the
+* mapping being torn down is communicated in siginfo, see
+* kill_proc()
+*/
+   loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
+
+   unmap_mapping_range(mapping, start, size, 0);
+   }
+
+   kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
+}
+
+static int mf_generic_kill_procs(unsigned long long pfn, int flags,
+   struct dev_pagemap *pgmap)
+{
+   struct page *page = pfn_to_page(pfn);
+   LIST_HEAD(to_kill);
+   dax_entry_t cookie;
+   int rc = 0;
+
+   /*
+* Pages instantiated by device-dax (not filesystem-dax)
+* may be compound pages.
+*/
+   page = compound_head(page);
+
+   /*
+* Prevent the inode from being freed while we are interrogating
+* the address_space, typically this would be handled by
+* lock_page(), but dax pages do not use the page lock. This
+* also prevents changes to the mapping of this pfn until
+* poison signaling is complete.
+*/
+   cookie = dax_lock_page(page);
+   if (!cookie)
+   return -EBUSY;
+
+   if (hwpoison_filter(page)) {
+   rc = -EOPNOTSUPP;
+   goto unlock;
+   }
+
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+   /*
+* TODO: Handle HMM pages which may need coordination
+* with device-side memory.
+*/
+   rc = -EBUSY;
+   goto unlock;
+   }
+
+   /*
+* Use this flag as an indication that the dax page has been
+* remapped UC to prevent speculative consumption of poison.
+*/
+   SetPageHWPoison(page);
+
+   /*
+* Unlike System-RAM there is no possibility to swap in a
+* different physical page at a given virtual address, so all
+* userspace consumption of ZONE_DEVICE memory necessitates
+* SIGBUS (i.e. MF_MUST_KILL)
+*/
+   flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+   collect_procs(page, &to_kill, true);
+
+   unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
+unlock:
+   dax_unlock_page(page, cookie);
+   return rc;
+}
+
 /*
  * Called from hugetlb code with hugetlb_lock held.
  *
@@ -1643,12 +1728,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, 
int flags,
struct dev_pagemap *pgmap)
 {
struct page *page = pfn_to_page(pfn);
-   unsigned long size = 0;
-   struct to_kill *tk;
-   LIST_HEAD(tokill);
-   int rc = -EBUSY;
-   loff_t start;
-   dax_entry_t cookie;
+   int rc = -ENXIO;
 
if (flags & MF_COUNT_INCREASED)
/*
@@ -1657,73 +1737,10 @@ static int memory_failure_dev_pagemap(unsigned long 
pfn, int flags,
put_page(page);
 
/* device metadata space is not recoverable */
-   if (!pgmap_pfn_valid(pgmap, pfn)) {
-   rc = -ENXIO;
-   goto out;
-   }
-
-   /*
-* Pages instantiated by device-dax (not filesystem-dax)
-* may be compound pages.
-*/
-   page = compound_head(page);
-
-   /*
-* Prevent the inode from being freed while we are interrogating
-* the address_space, typically this would be handled by
-* lock_page(), but dax pages do not use the page lock. This
-* also prevents changes to the mapping of this pfn until
-* poison signaling is complete.
-*/
-   cookie = dax_lock_page(page);
-   if (!cookie)
+   if (!pgmap_pfn_valid(pgmap, pfn))
goto out;
 
-   if (hwpoison_filter(pa

[PATCH v14 03/07] pagemap,pmem: Introduce ->memory_failure()

2022-05-08 Thread Shiyang Ruan
When a memory failure occurs, we call this function, which is
implemented by each kind of device.  For the fsdax case, the pmem device
driver implements it: the pmem device driver will find out the
filesystem in which the corrupted page is located.

With dax_holder notify support, we are able to notify the memory failure
from the pmem driver to the upper layers.  If something is not supported
in the notify routine, memory_failure() will fall back to the generic
handler.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dan Williams 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Naoya Horiguchi 
---
 drivers/nvdimm/pmem.c| 17 +
 include/linux/memremap.h | 12 
 mm/memory-failure.c  | 14 ++
 3 files changed, 43 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..bd502957cfdf 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -366,6 +366,21 @@ static void pmem_release_disk(void *__pmem)
blk_cleanup_disk(pmem->disk);
 }
 
+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
+   unsigned long pfn, unsigned long nr_pages, int mf_flags)
+{
+   struct pmem_device *pmem =
+   container_of(pgmap, struct pmem_device, pgmap);
+   u64 offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
+   u64 len = nr_pages << PAGE_SHIFT;
+
+   return dax_holder_notify_failure(pmem->dax_dev, offset, len, mf_flags);
+}
+
+static const struct dev_pagemap_ops fsdax_pagemap_ops = {
+   .memory_failure = pmem_pagemap_memory_failure,
+};
+
 static int pmem_attach_disk(struct device *dev,
struct nd_namespace_common *ndns)
 {
@@ -427,6 +442,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags = PFN_DEV;
if (is_nd_pfn(dev)) {
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+   pmem->pgmap.ops = &fsdax_pagemap_ops;
	addr = devm_memremap_pages(dev, &pmem->pgmap);
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
@@ -440,6 +456,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pgmap.range.end = res->end;
pmem->pgmap.nr_range = 1;
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+   pmem->pgmap.ops = &fsdax_pagemap_ops;
	addr = devm_memremap_pages(dev, &pmem->pgmap);
pmem->pfn_flags |= PFN_MAP;
bb_range = pmem->pgmap.range;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 8af304f6b504..f1d413ef7c0d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -79,6 +79,18 @@ struct dev_pagemap_ops {
 * the page back to a CPU accessible page.
 */
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+   /*
+* Handle the memory failure that happens on a range of pfns.  Notify
+* the processes that are using these pfns, and try to recover the data
+* on them if necessary.  The mf_flags is finally passed to the recovery
+* function through the whole notify routine.
+*
+* When this is not implemented, or it returns -EOPNOTSUPP, the caller
+* will fall back to a common handler called mf_generic_kill_procs().
+*/
+   int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
+ unsigned long nr_pages, int mf_flags);
 };
 
 #define PGMAP_ALTMAP_VALID (1 << 0)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9e4728103c0d..aeb19593af9c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1740,6 +1740,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
if (!pgmap_pfn_valid(pgmap, pfn))
goto out;
 
+   /*
+* Call driver's implementation to handle the memory failure, otherwise
+* fall back to generic handler.
+*/
+   if (pgmap->ops->memory_failure) {
+   rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);
+   /*
+* Fall back to generic handler too if operation is not
+* supported inside the driver/device/filesystem.
+*/
+   if (rc != -EOPNOTSUPP)
+   goto out;
+   }
+
rc = mf_generic_kill_procs(pfn, flags, pgmap);
 out:
/* drop pgmap ref acquired in caller */
-- 
2.35.1






[PATCH v11 05/07] fsdax: Dedup file range to use a compare function

2022-05-08 Thread Shiyang Ruan
With dax we cannot deal with readpage() etc., so we create a dax
comparison function similar to vfs_dedupe_file_range_compare(), and
introduce dax_remap_file_range_prep() for filesystem use.
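
The intended call site, sketched from the fs/xfs/xfs_reflink.c change
listed in the diffstat (the archive truncates this message before that
hunk), picks the DAX comparison path when the inode is DAX:

```
/* In the filesystem's remap preparation (sketch, error handling elided): */
if (IS_DAX(inode_in))
	ret = dax_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
			len, remap_flags, &xfs_read_iomap_ops);
else
	ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
			pos_out, len, remap_flags);
```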

Signed-off-by: Goldwyn Rodrigues 
Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
---
 fs/dax.c | 82 
 fs/remap_range.c | 31 ++---
 fs/xfs/xfs_reflink.c |  8 +++--
 include/linux/dax.h  |  8 +
 include/linux/fs.h   | 12 ---
 5 files changed, 130 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b3aa863e9fec..601a23c6378c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1860,3 +1860,85 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t dax_range_compare_iter(struct iomap_iter *it_src,
+   struct iomap_iter *it_dest, u64 len, bool *same)
+{
+   const struct iomap *smap = &it_src->iomap;
+   const struct iomap *dmap = &it_dest->iomap;
+   loff_t pos1 = it_src->pos, pos2 = it_dest->pos;
+   void *saddr, *daddr;
+   int id, ret;
+
+   len = min(len, min(smap->length, dmap->length));
+
+   if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
+   *same = true;
+   return len;
+   }
+
+   if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
+   *same = false;
+   return 0;
+   }
+
+   id = dax_read_lock();
+   ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
+ &saddr, NULL);
+   if (ret < 0)
+   goto out_unlock;
+
+   ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
+ &daddr, NULL);
+   if (ret < 0)
+   goto out_unlock;
+
+   *same = !memcmp(saddr, daddr, len);
+   if (!*same)
+   len = 0;
+   dax_read_unlock(id);
+   return len;
+
+out_unlock:
+   dax_read_unlock(id);
+   return -EIO;
+}
+
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+   struct inode *dst, loff_t dstoff, loff_t len, bool *same,
+   const struct iomap_ops *ops)
+{
+   struct iomap_iter src_iter = {
+   .inode  = src,
+   .pos= srcoff,
+   .len= len,
+   .flags  = IOMAP_DAX,
+   };
+   struct iomap_iter dst_iter = {
+   .inode  = dst,
+   .pos= dstoff,
+   .len= len,
+   .flags  = IOMAP_DAX,
+   };
+   int ret;
+
+   while ((ret = iomap_iter(&src_iter, ops)) > 0) {
+   while ((ret = iomap_iter(&dst_iter, ops)) > 0) {
+   dst_iter.processed = dax_range_compare_iter(&src_iter,
+   &dst_iter, len, same);
+   }
+   if (ret <= 0)
+   src_iter.processed = ret;
+   }
+   return ret;
+}
+
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops)
+{
+   return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+  pos_out, len, remap_flags, ops);
+}
+EXPORT_SYMBOL_GPL(dax_remap_file_range_prep);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index e112b5424cdb..231de627c1b9 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #include 
@@ -271,9 +272,11 @@ static int vfs_dedupe_file_range_compare(struct file *src, loff_t srcoff,
  * If there's an error, then the usual negative error code is returned.
  * Otherwise returns 0 with *len set to the request length.
  */
-int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- loff_t *len, unsigned int remap_flags)
+int
+__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   loff_t *len, unsigned int remap_flags,
+   const struct iomap_ops *dax_read_ops)
 {
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -333,8 +336,18 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
if (remap_flags & REMAP_FILE_DEDUP) {
bool is_same = false;

[PATCH v11 06/07] xfs: support CoW in fsdax mode

2022-05-08 Thread Shiyang Ruan
In fsdax mode, WRITE and ZERO on a shared extent need CoW to be performed.
After that, the newly allocated extents need to be remapped to the file.
So, add CoW identification in ->iomap_begin(), and implement ->iomap_end()
to do the remapping work.
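
Put together, the resulting fsdax write path is roughly the following (a
simplified sketch, not the literal patch; the real entry point is the
xfs_file_dax_write() hunk below):

```
/* Sketch: how the two new hooks cooperate on a dax write. */
static ssize_t example_dax_write(struct kiocb *iocb, struct iov_iter *from)
{
	/*
	 * 1. ->iomap_begin() (xfs_direct_write_iomap_begin) allocates a CoW
	 *    extent when the range is shared.
	 * 2. dax_iomap_rw() copies the old data around the written bytes and
	 *    writes the new data into the CoW extent.
	 * 3. ->iomap_end() (xfs_dax_write_iomap_end) remaps the new extents
	 *    via xfs_reflink_end_cow(), or cancels the reservation when
	 *    nothing was written.
	 */
	return dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
}
```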

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
---
 fs/xfs/xfs_file.c  |  7 ++-
 fs/xfs/xfs_iomap.c | 46 +-
 fs/xfs/xfs_iomap.h |  3 +++
 3 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index af954a5b71f8..5a4508b23b51 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -669,7 +669,7 @@ xfs_file_dax_write(
pos = iocb->ki_pos;
 
trace_xfs_file_dax_write(iocb, from);
-   ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
+   ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
i_size_write(inode, iocb->ki_pos);
error = xfs_setfilesize(ip, pos, ret);
@@ -1285,10 +1285,7 @@ __xfs_filemap_fault(
pfn_t pfn;
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-   ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
-   (write_fault && !vmf->cow_page) ?
-   &xfs_direct_write_iomap_ops :
-   &xfs_read_iomap_ops);
+   ret = xfs_dax_fault(vmf, pe_size, write_fault, &pfn);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5a393259a3a3..e35842215d22 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -27,6 +27,7 @@
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
+#include "linux/dax.h"
 
 #define XFS_ALLOC_ALIGN(mp, off) \
(((off) >> mp->m_allocsize_log) << mp->m_allocsize_log)
@@ -773,7 +774,8 @@ xfs_direct_write_iomap_begin(
 
/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-   &lockmode, flags & IOMAP_DIRECT);
+   &lockmode,
+   (flags & IOMAP_DIRECT) || IS_DAX(inode));
if (error)
goto out_unlock;
if (shared)
@@ -867,6 +869,33 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
.iomap_begin= xfs_direct_write_iomap_begin,
 };
 
+static int
+xfs_dax_write_iomap_end(
+   struct inode*inode,
+   loff_t  pos,
+   loff_t  length,
+   ssize_t written,
+   unsignedflags,
+   struct iomap*iomap)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+
+   if (!xfs_is_cow_inode(ip))
+   return 0;
+
+   if (!written) {
+   xfs_reflink_cancel_cow_range(ip, pos, length, true);
+   return 0;
+   }
+
+   return xfs_reflink_end_cow(ip, pos, written);
+}
+
+const struct iomap_ops xfs_dax_write_iomap_ops = {
+   .iomap_begin= xfs_direct_write_iomap_begin,
+   .iomap_end  = xfs_dax_write_iomap_end,
+};
+
 static int
 xfs_buffered_write_iomap_begin(
struct inode*inode,
@@ -1358,3 +1387,18 @@ xfs_truncate_page(
return iomap_truncate_page(inode, pos, did_zero,
   &xfs_buffered_write_iomap_ops);
 }
+
+#ifdef CONFIG_FS_DAX
+int
+xfs_dax_fault(
+   struct vm_fault *vmf,
+   enum page_entry_sizepe_size,
+   boolwrite_fault,
+   pfn_t   *pfn)
+{
+   return dax_iomap_fault(vmf, pe_size, pfn, NULL,
+   (write_fault && !vmf->cow_page) ?
+   &xfs_dax_write_iomap_ops :
+   &xfs_read_iomap_ops);
+}
+#endif
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index e88dc162c785..89dfa3bb099f 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -25,6 +25,8 @@ int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
 int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
bool *did_zero);
 int xfs_truncate_page(struct xfs_inode *ip, loff_t pos, bool *did_zero);
+int xfs_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
+   bool write_fault, pfn_t *pfn);
 
 static inline xfs_filblks_t
 xfs_aligned_fsb_count(
@@ -51,5 +53,6 @@ extern const struct iomap_ops xfs_direct_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
+extern const struct iomap_ops xfs_dax_write_iomap_ops;
 
 #endif /* __XFS_IOMAP_H__*/
-- 
2.35.1






[PATCH v11 07/07] xfs: Add dax dedupe support

2022-05-08 Thread Shiyang Ruan
Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
that are going to be deduped.  After that, call the compare-range function
only when both files are DAX, or neither is.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
---
 fs/xfs/xfs_file.c|  2 +-
 fs/xfs/xfs_inode.c   | 69 +---
 fs/xfs/xfs_inode.h   |  1 +
 fs/xfs/xfs_reflink.c |  4 +--
 4 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5a4508b23b51..cf78eb393258 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -807,7 +807,7 @@ xfs_wait_dax_page(
xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 }
 
-static int
+int
 xfs_break_dax_layouts(
struct inode*inode,
bool*retry)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b2879870a17e..96308065a2b3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3767,6 +3767,50 @@ xfs_iolock_two_inodes_and_break_layout(
return 0;
 }
 
+static int
+xfs_mmaplock_two_inodes_and_break_dax_layout(
+   struct xfs_inode*ip1,
+   struct xfs_inode*ip2)
+{
+   int error;
+   boolretry;
+   struct page *page;
+
+   if (ip1->i_ino > ip2->i_ino)
+   swap(ip1, ip2);
+
+again:
+   retry = false;
+   /* Lock the first inode */
+   xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
+   error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
+   if (error || retry) {
+   xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+   if (error == 0 && retry)
+   goto again;
+   return error;
+   }
+
+   if (ip1 == ip2)
+   return 0;
+
+   /* Nested lock the second inode */
+   xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
+   /*
+* We cannot use xfs_break_dax_layouts() directly here because it may
+* need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
+* for this nested lock case.
+*/
+   page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
+   if (page && page_ref_count(page) != 1) {
+   xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+   xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+   goto again;
+   }
+
+   return 0;
+}
+
 /*
  * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
  * mmap activity.
@@ -3781,8 +3825,19 @@ xfs_ilock2_io_mmap(
ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
if (ret)
return ret;
-   filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
-   VFS_I(ip2)->i_mapping);
+
+   if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
+   ret = xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
+   if (ret) {
+   inode_unlock(VFS_I(ip2));
+   if (ip1 != ip2)
+   inode_unlock(VFS_I(ip1));
+   return ret;
+   }
+   } else
+   filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
+   VFS_I(ip2)->i_mapping);
+
return 0;
 }
 
@@ -3792,8 +3847,14 @@ xfs_iunlock2_io_mmap(
struct xfs_inode*ip1,
struct xfs_inode*ip2)
 {
-   filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
- VFS_I(ip2)->i_mapping);
+   if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
+   xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+   if (ip1 != ip2)
+   xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+   } else
+   filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
+ VFS_I(ip2)->i_mapping);
+
inode_unlock(VFS_I(ip2));
if (ip1 != ip2)
inode_unlock(VFS_I(ip1));
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 7be6f8e705ab..8313cc83b6ee 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -467,6 +467,7 @@ xfs_itruncate_extents(
 }
 
 /* from xfs_file.c */
+intxfs_break_dax_layouts(struct inode *inode, bool *retry);
 intxfs_break_layouts(struct inode *inode, uint *iolock,
enum layout_break_reason reason);
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 10a9947e35d9..7cceea510a01 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1338,8 +1338,8 @@ xfs_reflink_remap_prep(
if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
goto out_unlock;
 
-   /* Don't share DAX file data for now. */
-   if (IS_DAX(inode_in) || IS_DAX(inode_out))
+   /* Don't share DAX file data with non-DAX file. */
+   if (IS_DAX(inode_in) != IS_DAX(inode_out))
+   goto out_unlock;

[PATCH v11 03/07] fsdax: Replace mmap entry in case of CoW

2022-05-08 Thread Shiyang Ruan
Replace the existing entry with the newly allocated one in the CoW case.
Also, mark the entry with PAGECACHE_TAG_TOWRITE so writeback marks this
entry as write-protected.  This helps with snapshots: new write page
faults after a snapshot trigger a CoW.
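
The consumer of that tag is the existing dax writeback path; a simplified
sketch of the dax_writeback_mapping_range() loop that picks these entries
up (simplified from fs/dax.c, not part of this patch):

```
/* Writeback walks entries tagged TOWRITE, flushes them and write-protects
 * the mappings, so the next write after a snapshot faults and re-enters
 * the CoW path.
 */
xas_for_each_marked(&xas, entry, end_index, PAGECACHE_TAG_TOWRITE) {
	ret = dax_writeback_one(&xas, dax_dev, mapping, entry);
	if (ret < 0)
		break;
}
```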

Signed-off-by: Goldwyn Rodrigues 
Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ritesh Harjani 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c | 77 ++--
 1 file changed, 42 insertions(+), 35 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 00d2cb72ec58..78e26204697b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -828,6 +828,23 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
return 0;
 }
 
+/*
+ * MAP_SYNC on a dax mapping guarantees dirty metadata is
+ * flushed on write-faults (non-cow), but not read-faults.
+ */
+static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
+   struct vm_area_struct *vma)
+{
+   return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) &&
+   (iter->iomap.flags & IOMAP_F_DIRTY);
+}
+
+static bool dax_fault_is_cow(const struct iomap_iter *iter)
+{
+   return (iter->flags & IOMAP_WRITE) &&
+   (iter->iomap.flags & IOMAP_F_SHARED);
+}
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -835,16 +852,19 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas,
-   struct address_space *mapping, struct vm_fault *vmf,
-   void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+   const struct iomap_iter *iter, void *entry, pfn_t pfn,
+   unsigned long flags)
 {
+   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
void *new_entry = dax_make_entry(pfn, flags);
+   bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
+   bool cow = dax_fault_is_cow(iter);
 
if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-   if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+   if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
unsigned long index = xas->xa_index;
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
@@ -856,12 +876,12 @@ static void *dax_insert_entry(struct xa_state *xas,
 
xas_reset(xas);
xas_lock_irq(xas);
-   if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+   if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;
 
dax_disassociate_entry(entry, mapping, false);
dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
-   false);
+   cow);
/*
 * Only swap our new entry into the page cache if the current
 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -881,6 +901,9 @@ static void *dax_insert_entry(struct xa_state *xas,
if (dirty)
xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
+   if (cow)
+   xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
xas_unlock_irq(xas);
return entry;
 }
@@ -1122,17 +1145,15 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
  * If this page is ever written to we will re-fault and change the mapping to
  * point to real DAX storage instead.
  */
-static vm_fault_t dax_load_hole(struct xa_state *xas,
-   struct address_space *mapping, void **entry,
-   struct vm_fault *vmf)
+static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
+   const struct iomap_iter *iter, void **entry)
 {
-   struct inode *inode = mapping->host;
+   struct inode *inode = iter->inode;
unsigned long vaddr = vmf->address;
pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
vm_fault_t ret;
 
-   *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-   DAX_ZERO_PAGE, false);
+   *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
 
ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
trace_dax_load_hole(inode, vmf, ret);
@@ -1141,7 +1162,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 
 #ifdef CONFIG_FS_DAX_PMD
static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,

[PATCH v11 02/07] fsdax: Introduce dax_iomap_cow_copy()

2022-05-08 Thread Shiyang Ruan
When the iomap is a write operation and the iomap is not equal to the
srcmap after ->iomap_begin(), we consider it a CoW operation.

In this case, the destination (iomap->addr) points to a newly allocated
extent, and the data needs to be copied from the srcmap to that extent.
In theory, it is better to copy only the head and tail ranges that fall
outside the written area instead of copying the whole aligned range.  But
in a dax page fault the range is always aligned, so copy the whole range
in that case.
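
A worked example of the head/tail arithmetic (plain userspace C for
illustration only; the variable names mirror dax_iomap_cow_copy() in the
diff below):

```
#include <stdio.h>

int main(void)
{
	/* An unaligned 2048-byte write at pos 1024 within one 4096-byte page. */
	unsigned long pos = 1024, length = 2048, align_size = 4096;
	unsigned long head_off = pos & (align_size - 1);                   /* 1024 */
	unsigned long end = pos + length;                                  /* 3072 */
	unsigned long pg_end = (end + align_size - 1) & ~(align_size - 1); /* 4096 */

	printf("copy head [0, %lu) from the source extent\n", head_off);
	printf("leave [%lu, %lu) for the caller's new data\n", pos, end);
	printf("copy tail [%lu, %lu) from the source extent\n", end, pg_end);
	return 0;
}
```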

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c | 88 
 1 file changed, 83 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index d4f195aeaa12..a4d56cfa33d0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1061,6 +1061,60 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
return rc;
 }
 
+/**
+ * dax_iomap_cow_copy - Copy the data from source to destination before write
+ * @pos:   address to do copy from.
+ * @length:size of copy operation.
+ * @align_size:aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
+ * @srcmap:iomap srcmap
+ * @daddr: destination address to copy to.
+ *
+ * This can be called from two places. Either during DAX write fault (page
+ * aligned), to copy the length size data to daddr. Or, while doing normal DAX
+ * write operation, dax_iomap_actor() might call this to do the copy of either
+ * start or end unaligned address. In the latter case the rest of the copy of
+ * aligned ranges is taken care by dax_iomap_actor() itself.
+ */
+static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
+   const struct iomap *srcmap, void *daddr)
+{
+   loff_t head_off = pos & (align_size - 1);
+   size_t size = ALIGN(head_off + length, align_size);
+   loff_t end = pos + length;
+   loff_t pg_end = round_up(end, align_size);
+   bool copy_all = head_off == 0 && end == pg_end;
+   void *saddr = 0;
+   int ret = 0;
+
+   ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+   if (ret)
+   return ret;
+
+   if (copy_all) {
+   ret = copy_mc_to_kernel(daddr, saddr, length);
+   return ret ? -EIO : 0;
+   }
+
+   /* Copy the head part of the range */
+   if (head_off) {
+   ret = copy_mc_to_kernel(daddr, saddr, head_off);
+   if (ret)
+   return -EIO;
+   }
+
+   /* Copy the tail part of the range */
+   if (end < pg_end) {
+   loff_t tail_off = head_off + length;
+   loff_t tail_len = pg_end - end;
+
+   ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
+   tail_len);
+   if (ret)
+   return -EIO;
+   }
+   return 0;
+}
+
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
  * page in the file would cause excessive storage usage for workloads with
@@ -1231,15 +1285,17 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
struct iov_iter *iter)
 {
const struct iomap *iomap = &iomi->iomap;
+   const struct iomap *srcmap = &iomi->srcmap;
loff_t length = iomap_length(iomi);
loff_t pos = iomi->pos;
struct dax_device *dax_dev = iomap->dax_dev;
loff_t end = pos + length, done = 0;
+   bool write = iov_iter_rw(iter) == WRITE;
ssize_t ret = 0;
size_t xfer;
int id;
 
-   if (iov_iter_rw(iter) == READ) {
+   if (!write) {
end = min(end, i_size_read(iomi->inode));
if (pos >= end)
return 0;
@@ -1248,7 +1304,12 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
return iov_iter_zero(min(length, end - pos), iter);
}
 
-   if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+   /*
+* In DAX mode, enforce either pure overwrites of written extents, or
+* writes to unwritten extents as part of a copy-on-write operation.
+*/
+   if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
+   !(iomap->flags & IOMAP_F_SHARED)))
return -EIO;
 
/*
@@ -1282,13 +1343,21 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
break;
}
 
+   if (write &&
+   srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+   ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
+kaddr);
+   if (ret)
+   break;
+   }
+
map_len = PFN_PHYS(map_len);
kaddr += offset;

[PATCH v11 01/07] fsdax: Output address in dax_iomap_pfn() and rename it

2022-05-08 Thread Shiyang Ruan
Add an address output to dax_iomap_pfn() in order to perform a memcpy() in
the CoW case.  Since this function now outputs both the address and the
pfn, rename it to dax_iomap_direct_access().
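
Both output parameters are optional, so callers ask only for what they
need; a sketch based on the call sites in this series (declarations
elided):

```
/* Fault path: only the pfn is needed to install the page table entry. */
err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);

/* CoW copy path (added in the next patch): only the kernel address. */
err = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
```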

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ritesh Harjani 
Reviewed-by: Dan Williams 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4d3dfc8bee33..d4f195aeaa12 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1025,8 +1025,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
-pfn_t *pfnp)
+static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
+   size_t size, void **kaddr, pfn_t *pfnp)
 {
pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
int id, rc;
@@ -1034,11 +1034,13 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
 
id = dax_read_lock();
length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
-  NULL, pfnp);
+  kaddr, pfnp);
if (length < 0) {
rc = length;
goto out;
}
+   if (!pfnp)
+   goto out_check_addr;
rc = -EINVAL;
if (PFN_PHYS(length) < size)
goto out;
@@ -1048,6 +1050,12 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
if (length > 1 && !pfn_t_devmap(*pfnp))
goto out;
rc = 0;
+
+out_check_addr:
+   if (!kaddr)
+   goto out;
+   if (!*kaddr)
+   rc = -EFAULT;
 out:
dax_read_unlock(id);
return rc;
@@ -1444,7 +1452,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
}
 
-   err = dax_iomap_pfn(&iter->iomap, pos, size, &pfn);
+   err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
if (err)
return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-- 
2.35.1






[PATCH v14 07/07] fsdax: set a CoW flag when associate reflink mappings

2022-05-08 Thread Shiyang Ruan
Introduce a PAGE_MAPPING_DAX_COW flag to support association with CoW file
mappings.  In this case, since the dax-rmap has already taken over the
responsibility of looking up shared files for a given dax page,
page->mapping is no longer used for rmap but for marking that this dax
page is shared.  And to make sure disassociation works fine, we use
page->index as a refcount, and reset page->mapping to its initial state
when page->index drops to 0.

With the help of this new flag, we are able to distinguish the normal
case from the CoW case, and keep the warning in the normal case.
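
To see the page->index refcount protocol concretely, here is a small
userspace model (an illustration only, not kernel code; the 0x1 flag
value mirrors the PAGE_MAPPING_DAX_COW definition in the diff below):

```
#include <stdio.h>
#include <stdint.h>

#define PAGE_MAPPING_DAX_COW 0x1UL

struct toy_page { uintptr_t mapping; unsigned long index; };

/* Models dax_mapping_set_cow(): converting a regularly mapped page resets
 * index to 1 to count the existing user, then each sharer bumps it. */
static void set_cow(struct toy_page *p)
{
	if (p->mapping != PAGE_MAPPING_DAX_COW) {
		if (p->mapping)		/* page was mapped regularly before */
			p->index = 1;
		p->mapping = PAGE_MAPPING_DAX_COW;
	}
	p->index++;
}

int main(void)
{
	struct toy_page p = { .mapping = 0xabc000, .index = 42 };

	set_cow(&p);	/* a second file maps the block: 2 users */
	set_cow(&p);	/* a third file: 3 users */
	printf("sharers = %lu\n", p.index);	/* prints: sharers = 3 */
	return 0;
}
```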

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c   | 50 +++---
 include/linux/page-flags.h |  6 +
 2 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 57efd3f73655..4d3dfc8bee33 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -334,13 +334,35 @@ static unsigned long dax_end_pfn(void *entry)
for (pfn = dax_to_pfn(entry); \
pfn < dax_end_pfn(entry); pfn++)
 
+static inline bool dax_mapping_is_cow(struct address_space *mapping)
+{
+   return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
+}
+
 /*
- * TODO: for reflink+dax we need a way to associate a single page with
- * multiple address_space instances at different linear_page_index()
- * offsets.
+ * Set page->mapping to the PAGE_MAPPING_DAX_COW flag and increase the refcount.
+ */
+static inline void dax_mapping_set_cow(struct page *page)
+{
+   if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
+   /*
+* Reset the index if the page was already mapped
+* regularly before.
+*/
+   if (page->mapping)
+   page->index = 1;
+   page->mapping = (void *)PAGE_MAPPING_DAX_COW;
+   }
+   page->index++;
+}
+
+/*
+ * When called from dax_insert_entry(), the cow flag indicates whether this
+ * entry is shared by multiple files.  If so, set page->mapping to
+ * PAGE_MAPPING_DAX_COW, and use page->index as a refcount.
  */
 static void dax_associate_entry(void *entry, struct address_space *mapping,
-   struct vm_area_struct *vma, unsigned long address)
+   struct vm_area_struct *vma, unsigned long address, bool cow)
 {
unsigned long size = dax_entry_size(entry), pfn, index;
int i = 0;
@@ -352,9 +374,13 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   WARN_ON_ONCE(page->mapping);
-   page->mapping = mapping;
-   page->index = index + i++;
+   if (cow) {
+   dax_mapping_set_cow(page);
+   } else {
+   WARN_ON_ONCE(page->mapping);
+   page->mapping = mapping;
+   page->index = index + i++;
+   }
}
 }
 
@@ -370,7 +396,12 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
struct page *page = pfn_to_page(pfn);
 
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+   if (dax_mapping_is_cow(page->mapping)) {
+   /* keep the CoW flag if this page is still shared */
+   if (page->index-- > 0)
+   continue;
+   } else
+   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
page->index = 0;
}
@@ -829,7 +860,8 @@ static void *dax_insert_entry(struct xa_state *xas,
void *old;
 
dax_disassociate_entry(entry, mapping, false);
-   dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
+   dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
+   false);
/*
 * Only swap our new entry into the page cache if the current
 * entry is a zero page or an empty entry.  If a normal PTE or
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index fe47ee8dc258..cad9aeb5e75c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -650,6 +650,12 @@ __PAGEFLAG(Reported, reported, PF_NO_COMPOUND)
 #define PAGE_MAPPING_KSM   (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
 #define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
 
+/*
+ * Unlike the flags above, this flag is used only for fsdax mode.  It
+ * indicates that this page->mapping is now in the reflink case.
+ */
+#define PAGE_MAPPING_DAX_COW   0x1

[PATCH v14 04/07] fsdax: Introduce dax_lock_mapping_entry()

2022-05-08 Thread Shiyang Ruan
The current dax_lock_page() locks a dax entry by obtaining the mapping and
index from the page.  To support 1-to-N RMAP in NVDIMM, we need a new
function that locks the specific dax entry corresponding to a given file's
mapping and index, and that outputs the page corresponding to that dax
entry for the caller's use.
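
The expected calling pattern, modeled on the memory-failure user added
later in this series (simplified; the surrounding code is hypothetical):

```
/* Lock the entry for (mapping, index), act on the page, unlock. */
struct page *page = NULL;
dax_entry_t cookie;

cookie = dax_lock_mapping_entry(mapping, index, &page);
if (!cookie)
	return -EBUSY;	/* the entry could not be locked */

if (page) {
	/* A real pfn is mapped at (mapping, index): act on the page here. */
}

dax_unlock_mapping_entry(mapping, index, cookie);
```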

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Darrick J. Wong 
---
 fs/dax.c| 63 +
 include/linux/dax.h | 15 +++
 2 files changed, 78 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 1ac12e877f4f..57efd3f73655 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -455,6 +455,69 @@ void dax_unlock_page(struct page *page, dax_entry_t cookie)
dax_unlock_entry(&xas, (void *)cookie);
 }
 
+/*
+ * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
+ * @mapping: the file's mapping whose entry we want to lock
+ * @index: the offset within this file
+ * @page: output the dax page corresponding to this dax entry
+ *
+ * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
+ * could not be locked.
+ */
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index,
+   struct page **page)
+{
+   XA_STATE(xas, NULL, 0);
+   void *entry;
+
+   rcu_read_lock();
+   for (;;) {
+   entry = NULL;
+   if (!dax_mapping(mapping))
+   break;
+
+   xas.xa = &mapping->i_pages;
+   xas_lock_irq(&xas);
+   xas_set(&xas, index);
+   entry = xas_load(&xas);
+   if (dax_is_locked(entry)) {
+   rcu_read_unlock();
+   wait_entry_unlocked(&xas, entry);
+   rcu_read_lock();
+   continue;
+   }
+   if (!entry ||
+   dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+   /*
+* Because we look up the entry by the file's mapping
+* and index, the entry may not have been inserted yet,
+* or may be a zero/empty entry.  We don't consider
+* this an error case.  So, return a special value and
+* do not output @page.
+*/
+   entry = (void *)~0UL;
+   } else {
+   *page = pfn_to_page(dax_to_pfn(entry));
+   dax_lock_entry(&xas, entry);
+   }
+   xas_unlock_irq(&xas);
+   break;
+   }
+   rcu_read_unlock();
+   return (dax_entry_t)entry;
+}
+
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
+   dax_entry_t cookie)
+{
+   XA_STATE(xas, &mapping->i_pages, index);
+
+   if (cookie == ~0UL)
+   return;
+
+   dax_unlock_entry(&xas, (void *)cookie);
+}
+
 /*
  * Find page cache entry at given index. If it is a DAX entry, return it
  * with the entry locked. If the page cache doesn't contain an entry at
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9c426a207ba8..c152f315d1c9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
 struct page *dax_layout_busy_page(struct address_space *mapping);
 struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
+   unsigned long index, struct page **page);
+void dax_unlock_mapping_entry(struct address_space *mapping,
+   unsigned long index, dax_entry_t cookie);
 #else
 static inline struct page *dax_layout_busy_page(struct address_space *mapping)
 {
@@ -170,6 +174,17 @@ static inline dax_entry_t dax_lock_page(struct page *page)
 static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
 {
 }
+
+static inline dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
+   unsigned long index, struct page **page)
+{
+   return 0;
+}
+
+static inline void dax_unlock_mapping_entry(struct address_space *mapping,
+   unsigned long index, dax_entry_t cookie)
+{
+}
 #endif
 
 int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
-- 
2.35.1






Re: [PATCH v13 3/7] pagemap,pmem: Introduce ->memory_failure()

2022-04-22 Thread Shiyang Ruan




在 2022/4/21 16:24, Miaohe Lin 写道:

On 2022/4/19 12:50, Shiyang Ruan wrote:

When a memory failure occurs, we call this function, which is implemented
by each kind of device.  For the fsdax case, the pmem device driver
implements it: the driver finds out the filesystem in which the corrupted
page is located.

With dax_holder notify support, we are able to propagate the memory
failure from the pmem driver to the upper layers.  If something is not
supported in the notify routine, memory_failure will fall back to the
generic handler.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dan Williams 
---
  drivers/nvdimm/pmem.c| 17 +
  include/linux/memremap.h | 12 
  mm/memory-failure.c  | 14 ++
  3 files changed, 43 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..bd502957cfdf 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -366,6 +366,21 @@ static void pmem_release_disk(void *__pmem)
blk_cleanup_disk(pmem->disk);
  }
  
+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,

+   unsigned long pfn, unsigned long nr_pages, int mf_flags)
+{
+   struct pmem_device *pmem =
+   container_of(pgmap, struct pmem_device, pgmap);
+   u64 offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
+   u64 len = nr_pages << PAGE_SHIFT;
+
+   return dax_holder_notify_failure(pmem->dax_dev, offset, len, mf_flags);
+}
+
+static const struct dev_pagemap_ops fsdax_pagemap_ops = {
+   .memory_failure = pmem_pagemap_memory_failure,
+};
+
  static int pmem_attach_disk(struct device *dev,
struct nd_namespace_common *ndns)
  {
@@ -427,6 +442,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags = PFN_DEV;
if (is_nd_pfn(dev)) {
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+   pmem->pgmap.ops = &fsdax_pagemap_ops;
addr = devm_memremap_pages(dev, &pmem->pgmap);
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
@@ -440,6 +456,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pgmap.range.end = res->end;
pmem->pgmap.nr_range = 1;
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+   pmem->pgmap.ops = &fsdax_pagemap_ops;
addr = devm_memremap_pages(dev, &pmem->pgmap);
pmem->pfn_flags |= PFN_MAP;
bb_range = pmem->pgmap.range;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index ad6062d736cd..bcfb6bf4ce5a 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -79,6 +79,18 @@ struct dev_pagemap_ops {
 * the page back to a CPU accessible page.
 */
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+   /*
+* Handle the memory failure that happens on a range of pfns.  Notify
+* the processes that are using these pfns, and try to recover the data
+* on them if necessary.  The mf_flags is finally passed to the recovery
+* function through the whole notify routine.
+*
+* When this is not implemented, or it returns -EOPNOTSUPP, the caller
+* will fall back to a common handler called mf_generic_kill_procs().
+*/
+   int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
+ unsigned long nr_pages, int mf_flags);
  };
  
  #define PGMAP_ALTMAP_VALID	(1 << 0)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7c8c047bfdc8..a40e79e634a4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1741,6 +1741,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
if (!pgmap_pfn_valid(pgmap, pfn))
goto out;
  
+	/*

+* Call driver's implementation to handle the memory failure, otherwise
+* fall back to generic handler.
+*/
+   if (pgmap->ops->memory_failure) {
+   rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);
+   /*
+* Fall back to generic handler too if operation is not
+* supported inside the driver/device/filesystem.
+*/
+   if (rc != -EOPNOTSUPP)
+   goto out;
+   }
+


Thanks for your patch. There are two questions:

1. Is the dax_lock_page() + dax_unlock_page() pair needed here?


They are moved into mf_generic_kill_procs() in Patch 2.  The callback will
implement its own dax lock/unlock method.  For example, for
mf_dax_kill_procs() in Patch 4, we implemented
dax_lock_mapping_entry()/dax_unlock_mapping_entry().



2. Will hwpoison_filter() and SetPageHWPoison() be handled by the callback,
or are they deliberately ignored?


SetPageHWPoison() will be handled by callback or b

Re: [PATCH v13 2/7] mm: factor helpers for memory_failure_dev_pagemap

2022-04-21 Thread Shiyang Ruan




在 2022/4/21 14:13, HORIGUCHI NAOYA(堀口 直也) 写道:

On Tue, Apr 19, 2022 at 12:50:40PM +0800, Shiyang Ruan wrote:

The memory_failure_dev_pagemap() code is a bit complex even before the
RMAP feature for fsdax is introduced, so factor out some helper functions
to simplify it.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dan Williams 


Thanks for the refactoring.  As I commented on 0/7, the conflict with
"mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()"
can be trivially resolved.

Another few comment below ...


---
  mm/memory-failure.c | 157 
  1 file changed, 87 insertions(+), 70 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e3fbff5bd467..7c8c047bfdc8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1498,6 +1498,90 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
return 0;
  }

+static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
+   struct address_space *mapping, pgoff_t index, int flags)
+{
+   struct to_kill *tk;
+   unsigned long size = 0;
+
+   list_for_each_entry(tk, to_kill, nd)
+   if (tk->size_shift)
+   size = max(size, 1UL << tk->size_shift);
+
+   if (size) {
+   /*
+* Unmap the largest mapping to avoid breaking up device-dax
+* mappings which are constant size. The actual size of the
+* mapping being torn down is communicated in siginfo, see
+* kill_proc()
+*/
+   loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
+
+   unmap_mapping_range(mapping, start, size, 0);
+   }
+
+   kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
+}
+
+static int mf_generic_kill_procs(unsigned long long pfn, int flags,
+   struct dev_pagemap *pgmap)
+{
+   struct page *page = pfn_to_page(pfn);
+   LIST_HEAD(to_kill);
+   dax_entry_t cookie;
+   int rc = 0;
+
+   /*
+* Pages instantiated by device-dax (not filesystem-dax)
+* may be compound pages.
+*/
+   page = compound_head(page);
+
+   /*
+* Prevent the inode from being freed while we are interrogating
+* the address_space, typically this would be handled by
+* lock_page(), but dax pages do not use the page lock. This
+* also prevents changes to the mapping of this pfn until
+* poison signaling is complete.
+*/
+   cookie = dax_lock_page(page);
+   if (!cookie)
+   return -EBUSY;
+
+   if (hwpoison_filter(page)) {
+   rc = -EOPNOTSUPP;
+   goto unlock;
+   }
+
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+   /*
+* TODO: Handle HMM pages which may need coordination
+* with device-side memory.
+*/
+   return -EBUSY;


Don't we need to go to dax_unlock_page() as the original code does?


+   }
+
+   /*
+* Use this flag as an indication that the dax page has been
+* remapped UC to prevent speculative consumption of poison.
+*/
+   SetPageHWPoison(page);
+
+   /*
+* Unlike System-RAM there is no possibility to swap in a
+* different physical page at a given virtual address, so all
+* userspace consumption of ZONE_DEVICE memory necessitates
+* SIGBUS (i.e. MF_MUST_KILL)
+*/
+   flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+   collect_procs(page, &to_kill, true);
+
+   unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
+unlock:
+   dax_unlock_page(page, cookie);
+   return rc;
+}
+
  /*
   * Called from hugetlb code with hugetlb_lock held.
   *
@@ -1644,12 +1728,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
struct dev_pagemap *pgmap)
  {
struct page *page = pfn_to_page(pfn);
-   unsigned long size = 0;
-   struct to_kill *tk;
LIST_HEAD(tokill);


Is this variable unused in this function?


Yes, this one and the one above are mistakes I didn't notice when
resolving conflicts with the newer -next branch.  I'll fix them in the
next version.



--
Thanks,
Ruan.



Thanks,
Naoya Horiguchi






Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

2022-04-20 Thread Shiyang Ruan

Hi Dave,

在 2022/4/21 9:20, Dave Chinner 写道:

Hi Ruan,

On Tue, Apr 19, 2022 at 12:50:38PM +0800, Shiyang Ruan wrote:

This patchset aims to support shared-page tracking for fsdax.


Now that this is largely reviewed, it's time to work out the
logistics of merging it.


Thanks!




Changes since V12:
   - Rebased onto next-20220414


What does this depend on that is in the linux-next kernel?

i.e. can this be applied successfully to a v5.18-rc2 kernel without
needing to drag in any other patchsets/commits/trees?


First, I tried to apply it to v5.18-rc2, but that failed.

There are some changes in memory-failure.c besides my Patch-02, e.g.:
  "mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()"
  https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=423228ce93c6a283132be38d442120c8e4cdb061

As for why it is based on linux-next: I was told[1] there is a better fix
for pgoff_address() in linux-next:
  "mm: rmap: introduce pfn_mkclean_range() to cleans PTEs"
  https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=65c9605009f8317bb3983519874d755a0b2ca746
so I rebased my patches onto it and dropped one of mine.

[1] https://lore.kernel.org/linux-xfs/ykpuoogd139wp...@infradead.org/



What are your plans for the followup patches that enable
reflink+fsdax in XFS? AFAICT that patchset hasn't been posted for a
while, so I don't know what its status is. Is that patchset anywhere
near ready for merge in this cycle?

If that patchset is not a candidate for this cycle, then it largely
doesn't matter what tree this is merged through as there shouldn't
be any major XFS or dax dependencies being built on top of it during
this cycle. The filesystem side changes are isolated and won't
conflict with other work in XFS, either, so this could easily go
through Dan's tree.

However, if the reflink enablement is ready to go, then this all
needs to be in the XFS tree so that we can run it through filesystem
level DAX+reflink testing. That will mean we need this in a stable
shared topic branch and tighter co-ordination between the trees.

So before we go any further we need to know if the dax+reflink
enablement patchset is near being ready to merge.


The "reflink+fsdax" patchset is here:

https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.f...@fujitsu.com/

It was based on v5.15-rc3; I think I should do a rebase.


--
Thanks,
Ruan.



Cheers,

Dave.





