From: John Groves <[email protected]>
This series applies bug fixes (mostly found via sashiko) to the dax/fsdev
series. It has been soaking in the famfs CI pipeline and 1) won't affect
anything that doesn't use drivers/dax/fsdev.c, and 2) doesn't affect any
known workloads -- although the bugs would have manifested when multi-range
DCD dax devices are a thing (soon-ish).
Most of the series is confined to drivers/dax/fsdev.c. Two patches touch
shared DAX core in drivers/dax/super.c: patch 8 reads holder_ops once in
dax_holder_notify_failure() to close a double-fetch NULL dereference, and
patch 9 reorders fs_put_dax() and adds a WARN_ON(). fs_put_dax() is used by
ext2/ext4/erofs/xfs, but only holder-passing callers (like XFS in-tree) will
see a behavior change, and only a new warning if they misuse it.
Changes since V5:
- Patch 2 (multi-range memory_failure offset): reverted to walking
dev_dax->ranges[]. V5 walked the immutable pgmap->ranges[] to avoid the
lockless read of dev_dax->ranges[], but that regressed static devices:
pgmap->ranges[0].start can sit data_offset below the data start, so the
reported failure offset came out data_offset too high and the holder
would act on the wrong blocks. dev_dax->ranges[] gives the correct base
(ranges[0].start, the device data start) for both static and dynamic
devices. The lockless read is pre-existing -- the original single-range
code read dev_dax->ranges[0] locklessly too -- so this adds no new race;
closing it is left to a separate change. (Richard Cheng spotted the
static regression.)
- New patch 5 (dax/fsdev: clear pgmap ops and owner on unbind): fsdev sets
pgmap->ops and pgmap->owner at probe but nothing ever cleared them. For a
static device the pgmap is shared and long-lived, so after fsdev unbind
the stale fsdev ops survive; a later rebind to device_dax or an rmmod of
fsdev_dax could then dispatch memory_failure through a stale -- and
possibly freed -- handler. Add a devm action that clears both on unbind,
symmetric with setting them at probe. Suggested by Richard Cheng.
- Patch 8 (read holder_ops once): reworded the commit message and the code
comment to name only fs_put_dax() as the racing clearer. kill_dax()
clears holder_ops only after synchronize_srcu(), so it cannot race a
reader under dax_read_lock(); fs_put_dax() does no such synchronization.
Collected Reviewed-by from Richard Cheng.
- Patch 7 (fail probe on invalid pgmap offset): collected Reviewed-by from
Pankaj Gupta.
- Rebased onto the v7.1 release (V5 was based on v7.1-rc7); no other change
from the rebase.
Changes since V4:
- New patch 7 (dax: read holder_ops once in dax_holder_notify_failure()):
split the reader-side READ_ONCE() fix out of the fs_put_dax() patch and
placed it first, so the fs_put_dax() patch's "a concurrent
dax_holder_notify_failure() that sees NULL ops returns -EOPNOTSUPP
cleanly" reasoning actually holds when it lands. dax_holder_notify_failure()
read holder_ops twice without READ_ONCE(); a concurrent clear could make
the NULL check pass while the indirect call dereferenced NULL. Carries
Fixes: 8012b86608552 ("dax: introduce holder for dax_device"), the commit
that introduced the unmarked double fetch. Suggested by Richard Cheng (and
the Sashiko bot).
- Patch 2 (multi-range memory_failure offset): the ->memory_failure callback
now walks the pagemap's own immutable range array (pgmap->ranges[]) rather
than dev_dax->ranges[], which a concurrent sysfs mapping_store() can
krealloc() under dax_region_rwsem while this callback holds no such lock.
For dynamic devices the two arrays are identical, so the reported offset is
unchanged for the multi-range case this patch targets. Suggested by Richard
Cheng (and the Sashiko bot).
- Dropped the dax_dev_get()/dax_dev_find() patch (V4 patch 8) from this
revision. There is no in-tree caller yet; it will be sent together with the
famfs filesystem series that introduces the caller. (Per Richard Cheng /
Sashiko.)
- Patch 8 (holder_ops race in fs_put_dax()): unchanged from V4 (renumbered
from 7 to 8).
- Collected Reviewed-by from Dave Jiang on patches 4 and 6.
Changes since V3:
- Patch 4: Adopted Dave's suggested refactor -- factor out
fsdev_acquire_pgmap() and defer the dev_dax->pgmap assignment until
probe can no longer fail, replacing the goto-based cleanup. Did not
carry Alison's V3 Reviewed-by due to the rewrite.
- Patch 5: Also remove the now write-only dev_dax->virt_addr field,
per Dave's review.
- Patch 7: Fixed the WARN_ON() to tolerate holder_data == NULL, which
legitimately occurs when kill_dax() clears it during device removal
under a live holder (per Dave's review). Wrong-holder calls still
warn.
- Patch 8: Kept the Fixes tag -- the exported symbol itself is the
hazard; stable kernels carrying the export should want this fix.
Changes since V2:
* Patch 1 (comment fix): No change. Responded to Dave's question about
the dropped precondition -- the new comment correctly covers both
callers; fsdev_clear_folio_state() does not guarantee share==0 before
calling, so the old precondition was no longer universally true.
* V2 patch 2 (three fixes): Split into three separate patches (patches
2-4) per Dave's review.
* V2 patch 3 (two fixes): Split into two separate patches (patches 5-6)
per Dave's review.
* V2 patch 4 (clamp direct_access / remove cached_size): Dropped.
Dave's analysis correctly showed the claimed bug does not exist --
dax_pgoff_to_phys() already enforces that the full requested size fits
within a single range before returning, making the clamp a no-op in
every reachable path.
* V2 patch 5 (holder_ops race): Use WRITE_ONCE() for the holder_ops
store; add WARN_ON() on the cmpxchg result to catch wrong-holder and
double-put API contract violations; fix the inline comment, which
incorrectly claimed dax_holder_notify_failure() consults holder_ops
only when holder_data is non-NULL.
* V2 patch 6 (dax_dev_find): Add dax_alive() check under dax_read_lock()
after ilookup5() to prevent returning a device that is concurrently
being torn down by kill_dax().
* V2 patch 7 (formatting cleanup): Drop incorrect Fixes: tag; add
Dave's Reviewed-by.
* The series grows from 7 to 9 patches.
Changes since v1:
* Dropped modes from patch 6 to fs/fuse/famfs.c and
fs/famfs/famfs_inode.c, which are not upstream so it broke
attempts to apply the series. Oops...
* Added patch 7, which addresses a previously-missed review comment
from Jonathan - minor cleanup
John Groves (10):
dax: fix misleading comment about share/index union in
dax_folio_reset_order()
dax/fsdev: fix multi-range offset in memory_failure handler
dax/fsdev: clear vmemmap_shift when binding static pgmap
dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
dax/fsdev: clear pgmap ops and owner on unbind
dax/fsdev: use __va(phys) for kaddr in direct_access
dax/fsdev: fail probe on invalid pgmap offset
dax: read holder_ops once in dax_holder_notify_failure()
dax: fix holder_ops race in fs_put_dax()
dax: fsdev.c minor formatting cleanup
drivers/dax/dax-private.h | 2 -
drivers/dax/fsdev.c | 148 ++++++++++++++++++++++++++------------
drivers/dax/super.c | 54 ++++++++++++--
fs/dax.c | 12 ++--
4 files changed, 159 insertions(+), 57 deletions(-)
base-commit: 8cd9520d35a6c38db6567e97dd93b1f11f185dc6
--
2.53.0