Re: [PATCH v6 5/9] Documentation: add a new file documenting multigrain timestamps

2024-07-17 Thread Jeff Layton
On Wed, 2024-07-17 at 13:31 +0200, Jan Kara wrote:
> On Mon 15-07-24 08:48:56, Jeff Layton wrote:
> > Add a high-level document that describes how multigrain timestamps work,
> > rationale for them, and some info about implementation and tradeoffs.
> > 
> > Reviewed-by: Josef Bacik 
> > Signed-off-by: Jeff Layton 
> 
> One comment below. With that fixed feel free to add:
> 
> Reviewed-by: Jan Kara 
> 
> > +Implementation Notes
> > +====================
> > +Multigrain timestamps are intended for use by local filesystems that get
> > +ctime values from the local clock. This is in contrast to network filesystems
> > +and the like that just mirror timestamp values from a server.
> > +
> > +For most filesystems, it's sufficient to just set the FS_MGTIME flag in the
> > +fstype->fs_flags in order to opt-in, providing the ctime is only ever set via
> > +inode_set_ctime_current(). If the filesystem has a ->getattr routine that
> > +doesn't call generic_fillattr, then you should have it call fill_mg_cmtime to
> > +fill those values.
> 
> I think you should explicitly mention that ->setattr() implementation
> needs to use setattr_copy() or otherwise mimic its behavior...
> 
>   Honza

I've added a sentence like you suggest to the patch in my tree. Thanks
for all the reviews!
-- 
Jeff Layton 



Re: [PATCH v6 0/9] fs: multigrain timestamp redux

2024-07-16 Thread Jeff Layton
On Tue, 2024-07-16 at 09:37 +0200, Christian Brauner wrote:
> On Mon, Jul 15, 2024 at 08:48:51AM GMT, Jeff Layton wrote:
> > I think this is pretty much ready for linux-next now. Since the latest
> > changes are pretty minimal, I've left the Reviewed-by's intact. It would
> > be nice to have acks or reviews from maintainers for ext4 and tmpfs too.
> > 
> > I did try to plumb this into bcachefs too, but the way it handles
> > timestamps makes that pretty difficult. It keeps the active copies in an
> > internal representation of the on-disk inode and periodically copies
> > them to struct inode. This is backward from the way most blockdev
> > filesystems do this.
> > 
> > Christian, would you be willing to pick these up with an eye toward
> > v6.12 after the merge window settles?
> 
> Yup. About to queue it up. I'll try to find some time to go through it
> so I might have some replies later but that shouldn't hold up linux-next
> at all.

Great!

There is one minor update to the percpu counter patch to compile those
out when debugfs isn't enabled, so it may be best to pick the series
from the "mgtime" branch in my public git tree. Let me know if you'd
rather I re-post the series though.

Thanks!
-- 
Jeff Layton 



Re: [PATCH v6 3/9] fs: add percpu counters for significant multigrain timestamp events

2024-07-15 Thread Jeff Layton
On Mon, 2024-07-15 at 11:32 -0700, Darrick J. Wong wrote:
> On Mon, Jul 15, 2024 at 08:48:54AM -0400, Jeff Layton wrote:
> > Four percpu counters for counting various stats around mgtimes, and a
> > new debugfs file for displaying them:
> > 
> > - number of attempted ctime updates
> > - number of successful i_ctime_nsec swaps
> > - number of fine-grained timestamp fetches
> > - number of floor value swaps
> > 
> > Reviewed-by: Josef Bacik 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/inode.c | 70 +-
> >  1 file changed, 69 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 869994285e87..fff844345c35 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -21,6 +21,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >  #include 
> >  #define CREATE_TRACE_POINTS
> >  #include 
> > @@ -80,6 +82,10 @@ EXPORT_SYMBOL(empty_aops);
> >  
> >  static DEFINE_PER_CPU(unsigned long, nr_inodes);
> >  static DEFINE_PER_CPU(unsigned long, nr_unused);
> > +static DEFINE_PER_CPU(unsigned long, mg_ctime_updates);
> > +static DEFINE_PER_CPU(unsigned long, mg_fine_stamps);
> > +static DEFINE_PER_CPU(unsigned long, mg_floor_swaps);
> > +static DEFINE_PER_CPU(unsigned long, mg_ctime_swaps);
> 
> Should this all get switched off if CONFIG_DEBUG_FS=n?
> 
> --D
> 

Sure, why not. That's simple enough to do.

I pushed an updated mgtime branch to my git tree. Here's the updated
patch that's the only difference:


https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/commit/?h=mgtime&id=ee7fe6e9c0598754861c8620230f15f3de538ca5

Seems to build OK both with and without CONFIG_DEBUG_FS.
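
For anyone reading along without following the link, here is a minimal userspace sketch of one common way to compile counters out behind a config symbol. It is not taken from the commit above: MODEL_CONFIG_DEBUG_FS and all of the model_* names are hypothetical stand-ins, and a plain global takes the place of the percpu variable.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the kernel config symbol. Remove this
 * define to model a CONFIG_DEBUG_FS=n build.
 */
#define MODEL_CONFIG_DEBUG_FS 1

#ifdef MODEL_CONFIG_DEBUG_FS
/* Counter storage only exists when the debug interface is built in. */
static unsigned long model_mg_ctime_updates;

#define model_mgtime_counter_inc(name) ((void)(model_##name++))

static unsigned long model_mgtime_counter_read(void)
{
	return model_mg_ctime_updates;
}
#else
/* No storage and no code: the increment expands to an empty statement. */
#define model_mgtime_counter_inc(name) do { } while (0)

static unsigned long model_mgtime_counter_read(void)
{
	return 0;
}
#endif

/* Models a hot path that bumps a statistic on every call. */
static void model_stamp_inode(void)
{
	/* ... the actual timestamp work would go here ... */
	model_mgtime_counter_inc(mg_ctime_updates);
}
```

With the define removed, the call site in model_stamp_inode() stays unchanged but generates no code, which is the effect being asked for when debugfs is disabled.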
 
> >  
> >  static struct kmem_cache *inode_cachep __ro_after_init;
> >  
> > @@ -101,6 +107,42 @@ static inline long get_nr_inodes_unused(void)
> >     return sum < 0 ? 0 : sum;
> >  }
> >  
> > +static long get_mg_ctime_updates(void)
> > +{
> > +   int i;
> > +   long sum = 0;
> > +   for_each_possible_cpu(i)
> > +   sum += per_cpu(mg_ctime_updates, i);
> > +   return sum < 0 ? 0 : sum;
> > +}
> > +
> > +static long get_mg_fine_stamps(void)
> > +{
> > +   int i;
> > +   long sum = 0;
> > +   for_each_possible_cpu(i)
> > +   sum += per_cpu(mg_fine_stamps, i);
> > +   return sum < 0 ? 0 : sum;
> > +}
> > +
> > +static long get_mg_floor_swaps(void)
> > +{
> > +   int i;
> > +   long sum = 0;
> > +   for_each_possible_cpu(i)
> > +   sum += per_cpu(mg_floor_swaps, i);
> > +   return sum < 0 ? 0 : sum;
> > +}
> > +
> > +static long get_mg_ctime_swaps(void)
> > +{
> > +   int i;
> > +   long sum = 0;
> > +   for_each_possible_cpu(i)
> > +   sum += per_cpu(mg_ctime_swaps, i);
> > +   return sum < 0 ? 0 : sum;
> > +}
> > +
> >  long get_nr_dirty_inodes(void)
> >  {
> >     /* not actually dirty inodes, but a wild approximation */
> > @@ -2655,6 +2697,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
> >  
> >     /* Get a fine-grained time */
> >     fine = ktime_get();
> > +   this_cpu_inc(mg_fine_stamps);
> >  
> >     /*
> >  * If the cmpxchg works, we take the new floor value. If
> > @@ -2663,11 +2706,14 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
> >  * as good, so keep it.
> >  */
> >     old = floor;
> > -   if (!atomic64_try_cmpxchg(&ctime_floor, &old, fine))
> > +   if (atomic64_try_cmpxchg(&ctime_floor, &old, fine))
> > +   this_cpu_inc(mg_floor_swaps);
> > +   else
> >     fine = old;
> >     now = ktime_mono_to_real(fine);
> >     }
> >     }
> > +   this_cpu_inc(mg_ctime_updates);
> >     now_ts = timestamp_truncate(ktime_to_timespec64(now), inode);
> >     cur = cns;
> >  
> > @@ -2682,6 +2728,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
> >     /* If swap occurred, then we're (mostly) done */
> >     inode->i_ctime_sec = now_ts.t

[PATCH v6 9/9] tmpfs: add support for multigrain timestamps

2024-07-15 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

tmpfs only requires the FS_MGTIME flag.

Reviewed-by: Josef Bacik 
Signed-off-by: Jeff Layton 
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 7f2b609945a5..75a9a73a769f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4660,7 +4660,7 @@ static struct file_system_type shmem_fs_type = {
.parameters = shmem_fs_parameters,
 #endif
.kill_sb= kill_litter_super,
-   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 
 void __init shmem_init(void)

-- 
2.45.2




[PATCH v6 8/9] btrfs: convert to multigrain timestamps

2024-07-15 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have it overwrite the timestamps unconditionally.

Note that this also drops the IS_I_VERSION check and unconditionally
bumps the change attribute, since SB_I_VERSION is always set on btrfs.

Reviewed-by: Josef Bacik 
Signed-off-by: Jeff Layton 
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/super.c |  3 ++-
 2 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d90138683a0a..409628c0c3cc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1120,26 +1120,6 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
btrfs_drew_write_unlock(&inode->root->snapshot_lock);
 }
 
-static void update_time_for_write(struct inode *inode)
-{
-   struct timespec64 now, ts;
-
-   if (IS_NOCMTIME(inode))
-   return;
-
-   now = current_time(inode);
-   ts = inode_get_mtime(inode);
-   if (!timespec64_equal(&ts, &now))
-   inode_set_mtime_to_ts(inode, now);
-
-   ts = inode_get_ctime(inode);
-   if (!timespec64_equal(&ts, &now))
-   inode_set_ctime_to_ts(inode, now);
-
-   if (IS_I_VERSION(inode))
-   inode_inc_iversion(inode);
-}
-
 static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 size_t count)
 {
@@ -1171,7 +1151,10 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 * need to start yet another transaction to update the inode as we will
 * update the inode when we finish writing whatever data we write.
 */
-   update_time_for_write(inode);
+   if (!IS_NOCMTIME(inode)) {
+   inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+   inode_inc_iversion(inode);
+   }
 
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f05cce7c8b8d..1cd50293b98d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2173,7 +2173,8 @@ static struct file_system_type btrfs_fs_type = {
.init_fs_context= btrfs_init_fs_context,
.parameters = btrfs_fs_parameters,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+ FS_ALLOW_IDMAP | FS_MGTIME,
  };
 
 MODULE_ALIAS_FS("btrfs");

-- 
2.45.2




[PATCH v6 7/9] ext4: switch to multigrain timestamps

2024-07-15 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

For ext4, we only need to enable the FS_MGTIME flag.

Reviewed-by: Josef Bacik 
Signed-off-by: Jeff Layton 
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index eb899628e121..95d4d7c0957a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -7294,7 +7294,7 @@ static struct file_system_type ext4_fs_type = {
.init_fs_context= ext4_init_fs_context,
.parameters = ext4_param_specs,
.kill_sb= ext4_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("ext4");
 

-- 
2.45.2
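
Since the whole ext4 conversion is a single flag, it may help to see how such a flag can act as an opt-in. What follows is a userspace sketch only: the m_* types, the M_FS_* bit values and model_is_mgtime() are hypothetical, and the way the kernel actually implements its is_mgtime() check is not shown in this series excerpt.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for the model; the kernel's FS_* values differ. */
#define M_FS_REQUIRES_DEV (1u << 0)
#define M_FS_ALLOW_IDMAP  (1u << 1)
#define M_FS_MGTIME       (1u << 2)

/* Cut-down analogues of file_system_type, super_block and inode. */
struct m_fs_type {
	const char *name;
	unsigned int fs_flags;
};

struct m_super_block {
	const struct m_fs_type *s_type;
};

struct m_inode {
	const struct m_super_block *i_sb;
};

/* True when the inode's filesystem type opted in to multigrain timestamps. */
static bool model_is_mgtime(const struct m_inode *inode)
{
	return (inode->i_sb->s_type->fs_flags & M_FS_MGTIME) != 0;
}
```

Because the flag lives on the filesystem type, every inode of that type is covered at once, and filesystems that never set it keep their existing behaviour.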




[PATCH v6 6/9] xfs: switch to multigrain timestamps

2024-07-15 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.

Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
should give us better semantics now.

Reviewed-by: Josef Bacik 
Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_trans_inode.c |  6 +++---
 fs/xfs/xfs_iops.c   | 10 +++---
 fs/xfs/xfs_super.c  |  2 +-
 3 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 69fc5b981352..1f3639bbf5f0 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -62,12 +62,12 @@ xfs_trans_ichgtime(
ASSERT(tp);
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
 
-   tv = current_time(inode);
+   /* If the mtime changes, then ctime must also change */
+   ASSERT(flags & XFS_ICHGTIME_CHG);
 
+   tv = inode_set_ctime_current(inode);
if (flags & XFS_ICHGTIME_MOD)
inode_set_mtime_to_ts(inode, tv);
-   if (flags & XFS_ICHGTIME_CHG)
-   inode_set_ctime_to_ts(inode, tv);
if (flags & XFS_ICHGTIME_CREATE)
ip->i_crtime = tv;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a00dcbc77e12..d25872f818fa 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -592,8 +592,9 @@ xfs_vn_getattr(
stat->gid = vfsgid_into_kgid(vfsgid);
stat->ino = ip->i_ino;
stat->atime = inode_get_atime(inode);
-   stat->mtime = inode_get_mtime(inode);
-   stat->ctime = inode_get_ctime(inode);
+
+   fill_mg_cmtime(stat, request_mask, inode);
+
stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
 
if (xfs_has_v3inodes(mp)) {
@@ -603,11 +604,6 @@ xfs_vn_getattr(
}
}
 
-   if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
-   stat->change_cookie = inode_query_iversion(inode);
-   stat->result_mask |= STATX_CHANGE_COOKIE;
-   }
-
/*
 * Note: If you add another clause to set an attribute flag, please
 * update attributes_mask below.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..210481b03fdb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
.init_fs_context= xfs_init_fs_context,
.parameters = xfs_fs_parameters,
.kill_sb= xfs_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("xfs");
 

-- 
2.45.2
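
The new xfs_trans_ichgtime() contract (ctime always changes, mtime and crtime only on request) can be modelled in userspace as below. This is a sketch under assumptions: the M_ICHGTIME_* values and the m_xfs_times struct are placeholders made up for the example, and the stamp is passed in rather than fetched from a clock.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder flag values for the model. */
enum {
	M_ICHGTIME_MOD    = 1 << 0,	/* data modification time */
	M_ICHGTIME_CHG    = 1 << 1,	/* inode change time */
	M_ICHGTIME_CREATE = 1 << 2,	/* creation time */
};

struct m_xfs_times {
	int64_t mtime;
	int64_t ctime;
	int64_t crtime;
};

/*
 * Model of the reworked helper: callers must always ask for a ctime
 * change, and the same stamp is reused for whatever else they request.
 */
static void model_ichgtime(struct m_xfs_times *t, int flags, int64_t stamp)
{
	/* If the mtime changes, then ctime must also change. */
	assert(flags & M_ICHGTIME_CHG);

	t->ctime = stamp;
	if (flags & M_ICHGTIME_MOD)
		t->mtime = stamp;
	if (flags & M_ICHGTIME_CREATE)
		t->crtime = stamp;
}
```

Taking the ctime first and reusing its value keeps mtime and ctime identical for a single operation, which is what the patch gets by assigning the return value of inode_set_ctime_current() to tv.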




[PATCH v6 5/9] Documentation: add a new file documenting multigrain timestamps

2024-07-15 Thread Jeff Layton
Add a high-level document that describes how multigrain timestamps work,
rationale for them, and some info about implementation and tradeoffs.

Reviewed-by: Josef Bacik 
Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/multigrain-ts.rst | 120 
 1 file changed, 120 insertions(+)

diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index 000000000000..5cefc204ecec
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+============
+Historically, the kernel has always used coarse time values to stamp
+inodes. This value is updated on every jiffy, so any change that happens
+within that jiffy will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first gets
+the current time and then compares it to the existing timestamp(s) to see
+whether anything will change. If nothing changed, then it can avoid updating
+the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since they
+reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen in a
+jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps can
+make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, causing the NFS server to substitute
+the ctime in many cases.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the current
+coarse-grained time does not cause a change.
+
+Inode Timestamps
+================
+There are currently 3 timestamps in the inode that are updated to the current
+wallclock time on different activity:
+
+ctime:
+  The inode change time. This is stamped with the current time whenever
+  the inode's metadata is changed. Note that this value is not settable
+  from userland.
+
+mtime:
+  The inode modification time. This is stamped with the current time
+  any time a file's contents change.
+
+atime:
+  The inode access time. This is stamped whenever an inode's contents are
+  read. Widely considered to be a terrible mistake. Usually avoided with
+  options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
+not affected and always use the coarse-grained value (subject to the floor).
+
+Inode Timestamp Ordering
+========================
+
+In addition to just providing info about changes to individual files, file
+timestamps also serve an important purpose in applications like "make". These
+programs measure timestamps in order to determine whether source files might be
+newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first. The
+same is true if two threads are issuing similar operations that do not overlap
+in time.
+
+If however, two threads have racing syscalls that overlap in time, then there
+is no such guarantee, and the second file may appear to have been modified
+before, after or at the same time as the first, regardless of which one was
+submitted first.
+
+Multigrain Timestamps
+=====================
+Multigrain timestamps are aimed at ensuring that changes to a single file are
+always recognizable, without violating the ordering guarantees when multiple
+different files are modified. This affects the mtime and the ctime, but the
+atime will always use coarse-grained timestamps.
+
+It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
+or ctime has been queried. If either or both have, then the kernel takes
+special care to ensure the next timestamp update will display a visible change.
+This ensures tight cache coherency for use-cases like NFS, without sacrificing
+the benefits of reduced metadata updates when files aren't being watched.
+
+The Ctime Floor Value
+=====================
+It's not sufficient to simply use fine or coarse-grained timestamps based on
+whether the mtime or ctime has been queried. A file could get a fi

[PATCH v6 4/9] fs: have setattr_copy handle multigrain timestamps appropriately

2024-07-15 Thread Jeff Layton
The setattr codepath is still using coarse-grained timestamps, even on
multigrain filesystems. To fix this, we need to fetch the timestamp for
ctime updates later, at the point where the assignment occurs in
setattr_copy.

On a multigrain inode, ignore the ia_ctime in the attrs, and always
update the ctime to the current clock value. Update the atime and mtime
with the same value (if needed) unless they are being set to other
specific values, à la utimes().

Note that we don't want to do this universally however, as some
filesystems (e.g. most networked fs) want to do an explicit update
elsewhere before updating the local inode.

Reviewed-by: Darrick J. Wong 
Reviewed-by: Josef Bacik 
Signed-off-by: Jeff Layton 
---
 fs/attr.c | 52 ++--
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 825007d5cda4..e03ea6951864 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -271,6 +271,42 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
 }
 EXPORT_SYMBOL(inode_newsize_ok);
 
+/**
+ * setattr_copy_mgtime - update timestamps for mgtime inodes
+ * @inode: inode timestamps to be updated
+ * @attr: attrs for the update
+ *
+ * With multigrain timestamps, we need to take more care to prevent races
+ * when updating the ctime. Always update the ctime to the very latest
+ * using the standard mechanism, and use that to populate the atime and
+ * mtime appropriately (unless we're setting those to specific values).
+ */
+static void setattr_copy_mgtime(struct inode *inode, const struct iattr *attr)
+{
+   unsigned int ia_valid = attr->ia_valid;
+   struct timespec64 now;
+
+   /*
+* If the ctime isn't being updated then nothing else should be
+* either.
+*/
+   if (!(ia_valid & ATTR_CTIME)) {
+   WARN_ON_ONCE(ia_valid & (ATTR_ATIME|ATTR_MTIME));
+   return;
+   }
+
+   now = inode_set_ctime_current(inode);
+   if (ia_valid & ATTR_ATIME_SET)
+   inode_set_atime_to_ts(inode, attr->ia_atime);
+   else if (ia_valid & ATTR_ATIME)
+   inode_set_atime_to_ts(inode, now);
+
+   if (ia_valid & ATTR_MTIME_SET)
+   inode_set_mtime_to_ts(inode, attr->ia_mtime);
+   else if (ia_valid & ATTR_MTIME)
+   inode_set_mtime_to_ts(inode, now);
+}
+
 /**
  * setattr_copy - copy simple metadata updates into the generic inode
  * @idmap: idmap of the mount the inode was found from
@@ -303,12 +339,6 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
 
i_uid_update(idmap, attr, inode);
i_gid_update(idmap, attr, inode);
-   if (ia_valid & ATTR_ATIME)
-   inode_set_atime_to_ts(inode, attr->ia_atime);
-   if (ia_valid & ATTR_MTIME)
-   inode_set_mtime_to_ts(inode, attr->ia_mtime);
-   if (ia_valid & ATTR_CTIME)
-   inode_set_ctime_to_ts(inode, attr->ia_ctime);
if (ia_valid & ATTR_MODE) {
umode_t mode = attr->ia_mode;
if (!in_group_or_capable(idmap, inode,
@@ -316,6 +346,16 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
mode &= ~S_ISGID;
inode->i_mode = mode;
}
+
+   if (is_mgtime(inode))
+   return setattr_copy_mgtime(inode, attr);
+
+   if (ia_valid & ATTR_ATIME)
+   inode_set_atime_to_ts(inode, attr->ia_atime);
+   if (ia_valid & ATTR_MTIME)
+   inode_set_mtime_to_ts(inode, attr->ia_mtime);
+   if (ia_valid & ATTR_CTIME)
+   inode_set_ctime_to_ts(inode, attr->ia_ctime);
 }
 EXPORT_SYMBOL(setattr_copy);
 

-- 
2.45.2
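
To make the decision table in setattr_copy_mgtime() easier to follow, here is a userspace model of it. It is a sketch, not kernel code: the M_ATTR_* values and the m_times/m_iattr structs are placeholders invented for the example, timestamps are plain integers, and the current clock value is passed in as an argument instead of coming from inode_set_ctime_current().

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder flag values for the model, not the kernel's ATTR_* values. */
enum {
	M_ATTR_ATIME     = 1 << 0,
	M_ATTR_MTIME     = 1 << 1,
	M_ATTR_CTIME     = 1 << 2,
	M_ATTR_ATIME_SET = 1 << 3,	/* caller supplied an explicit atime */
	M_ATTR_MTIME_SET = 1 << 4,	/* caller supplied an explicit mtime */
};

struct m_times {
	int64_t atime, mtime, ctime;
};

struct m_iattr {
	unsigned int valid;
	int64_t atime, mtime, ctime;
};

static void m_setattr_copy_mgtime(struct m_times *t,
				  const struct m_iattr *attr,
				  int64_t clock_now)
{
	/* No ctime update requested: leave every timestamp alone. */
	if (!(attr->valid & M_ATTR_CTIME))
		return;

	/* attr->ctime is deliberately ignored; the clock value always wins. */
	t->ctime = clock_now;

	if (attr->valid & M_ATTR_ATIME_SET)
		t->atime = attr->atime;
	else if (attr->valid & M_ATTR_ATIME)
		t->atime = clock_now;

	if (attr->valid & M_ATTR_MTIME_SET)
		t->mtime = attr->mtime;
	else if (attr->valid & M_ATTR_MTIME)
		t->mtime = clock_now;
}
```

The kernel version additionally warns once if atime or mtime updates are requested without a ctime update; the model simply returns in that case.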




[PATCH v6 3/9] fs: add percpu counters for significant multigrain timestamp events

2024-07-15 Thread Jeff Layton
Four percpu counters for counting various stats around mgtimes, and a
new debugfs file for displaying them:

- number of attempted ctime updates
- number of successful i_ctime_nsec swaps
- number of fine-grained timestamp fetches
- number of floor value swaps

Reviewed-by: Josef Bacik 
Signed-off-by: Jeff Layton 
---
 fs/inode.c | 70 +-
 1 file changed, 69 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index 869994285e87..fff844345c35 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #define CREATE_TRACE_POINTS
 #include 
@@ -80,6 +82,10 @@ EXPORT_SYMBOL(empty_aops);
 
 static DEFINE_PER_CPU(unsigned long, nr_inodes);
 static DEFINE_PER_CPU(unsigned long, nr_unused);
+static DEFINE_PER_CPU(unsigned long, mg_ctime_updates);
+static DEFINE_PER_CPU(unsigned long, mg_fine_stamps);
+static DEFINE_PER_CPU(unsigned long, mg_floor_swaps);
+static DEFINE_PER_CPU(unsigned long, mg_ctime_swaps);
 
 static struct kmem_cache *inode_cachep __ro_after_init;
 
@@ -101,6 +107,42 @@ static inline long get_nr_inodes_unused(void)
return sum < 0 ? 0 : sum;
 }
 
+static long get_mg_ctime_updates(void)
+{
+   int i;
+   long sum = 0;
+   for_each_possible_cpu(i)
+   sum += per_cpu(mg_ctime_updates, i);
+   return sum < 0 ? 0 : sum;
+}
+
+static long get_mg_fine_stamps(void)
+{
+   int i;
+   long sum = 0;
+   for_each_possible_cpu(i)
+   sum += per_cpu(mg_fine_stamps, i);
+   return sum < 0 ? 0 : sum;
+}
+
+static long get_mg_floor_swaps(void)
+{
+   int i;
+   long sum = 0;
+   for_each_possible_cpu(i)
+   sum += per_cpu(mg_floor_swaps, i);
+   return sum < 0 ? 0 : sum;
+}
+
+static long get_mg_ctime_swaps(void)
+{
+   int i;
+   long sum = 0;
+   for_each_possible_cpu(i)
+   sum += per_cpu(mg_ctime_swaps, i);
+   return sum < 0 ? 0 : sum;
+}
+
 long get_nr_dirty_inodes(void)
 {
/* not actually dirty inodes, but a wild approximation */
@@ -2655,6 +2697,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
 
/* Get a fine-grained time */
fine = ktime_get();
+   this_cpu_inc(mg_fine_stamps);
 
/*
 * If the cmpxchg works, we take the new floor value. If
@@ -2663,11 +2706,14 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
 * as good, so keep it.
 */
old = floor;
-   if (!atomic64_try_cmpxchg(&ctime_floor, &old, fine))
+   if (atomic64_try_cmpxchg(&ctime_floor, &old, fine))
+   this_cpu_inc(mg_floor_swaps);
+   else
fine = old;
now = ktime_mono_to_real(fine);
}
}
+   this_cpu_inc(mg_ctime_updates);
now_ts = timestamp_truncate(ktime_to_timespec64(now), inode);
cur = cns;
 
@@ -2682,6 +2728,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
/* If swap occurred, then we're (mostly) done */
inode->i_ctime_sec = now_ts.tv_sec;
trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec, cur);
+   this_cpu_inc(mg_ctime_swaps);
} else {
/*
 * Was the change due to someone marking the old ctime QUERIED?
@@ -2751,3 +2798,24 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap,
return mode & ~S_ISGID;
 }
 EXPORT_SYMBOL(mode_strip_sgid);
+
+static int mgts_show(struct seq_file *s, void *p)
+{
+   long ctime_updates = get_mg_ctime_updates();
+   long ctime_swaps = get_mg_ctime_swaps();
+   long fine_stamps = get_mg_fine_stamps();
+   long floor_swaps = get_mg_floor_swaps();
+
+   seq_printf(s, "%lu %lu %lu %lu\n",
+  ctime_updates, ctime_swaps, fine_stamps, floor_swaps);
+   return 0;
+}
+
+DEFINE_SHOW_ATTRIBUTE(mgts);
+
+static int __init mg_debugfs_init(void)
+{
+   debugfs_create_file("multigrain_timestamps", S_IFREG | S_IRUGO, NULL, NULL, &mgts_fops);
+   return 0;
+}
+late_initcall(mg_debugfs_init);

-- 
2.45.2
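
The counting pattern in this patch (each CPU bumps its own slot, a reader folds all slots together) can be tried out in userspace with the sketch below. The names are hypothetical: MODEL_NR_CPUS, the struct and both helpers exist only for this example, and a plain array indexed by CPU number stands in for DEFINE_PER_CPU storage.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count for the model */

/* One slot per CPU, so writers never touch each other's memory. */
struct model_percpu_counter {
	unsigned long slot[MODEL_NR_CPUS];
};

/* Analogue of this_cpu_inc(): bump only the calling CPU's slot. */
static void model_cpu_inc(struct model_percpu_counter *c, size_t cpu)
{
	c->slot[cpu]++;
}

/* Analogue of the get_mg_*() helpers: sum every slot, clamping at zero. */
static long model_counter_sum(const struct model_percpu_counter *c)
{
	long sum = 0;

	for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += (long)c->slot[cpu];
	return sum < 0 ? 0 : sum;
}
```

The trade-off is the usual one for statistics: increments are cheap and contention-free, while reads walk every slot and only give an approximate snapshot if writers are still running.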




[PATCH v6 2/9] fs: tracepoints around multigrain timestamp events

2024-07-15 Thread Jeff Layton
Add some tracepoints around various multigrain timestamp events.

Reviewed-by: Josef Bacik 
Signed-off-by: Jeff Layton 
---
 fs/inode.c   |   9 ++-
 fs/stat.c|   3 +
 include/trace/events/timestamp.h | 124 +++
 3 files changed, 135 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index 417acbeabef3..869994285e87 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,9 @@
 #include 
 #include 
 #include 
+#define CREATE_TRACE_POINTS
+#include 
+
 #include "internal.h"
 
 /*
@@ -2569,6 +2572,7 @@ EXPORT_SYMBOL(inode_nohighmem);
 
 struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
 {
+   trace_inode_set_ctime_to_ts(inode, &ts);
set_normalized_timespec64(&ts, ts.tv_sec, ts.tv_nsec);
inode->i_ctime_sec = ts.tv_sec;
inode->i_ctime_nsec = ts.tv_nsec;
@@ -2668,13 +2672,16 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
cur = cns;
 
/* No need to cmpxchg if it's exactly the same */
-   if (cns == now_ts.tv_nsec && inode->i_ctime_sec == now_ts.tv_sec)
+   if (cns == now_ts.tv_nsec && inode->i_ctime_sec == now_ts.tv_sec) {
+   trace_ctime_xchg_skip(inode, &now_ts);
goto out;
+   }
 retry:
/* Try to swap the nsec value into place. */
if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now_ts.tv_nsec)) {
/* If swap occurred, then we're (mostly) done */
inode->i_ctime_sec = now_ts.tv_sec;
+   trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec, cur);
} else {
/*
 * Was the change due to someone marking the old ctime QUERIED?
diff --git a/fs/stat.c b/fs/stat.c
index df7fdd3afed9..552dfd67688b 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 
+#include 
+
 #include "internal.h"
 #include "mount.h"
 
@@ -49,6 +51,7 @@ void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
stat->mtime = inode_get_mtime(inode);
stat->ctime.tv_sec = inode->i_ctime_sec;
stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn)) & ~I_CTIME_QUERIED;
+   trace_fill_mg_cmtime(inode, &stat->ctime, &stat->mtime);
 }
 EXPORT_SYMBOL(fill_mg_cmtime);
 
diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
new file mode 100644
index 000000000000..c9e5ec930054
--- /dev/null
+++ b/include/trace/events/timestamp.h
@@ -0,0 +1,124 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM timestamp
+
+#if !defined(_TRACE_TIMESTAMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TIMESTAMP_H
+
+#include 
+#include 
+
+#define CTIME_QUERIED_FLAGS \
+   { I_CTIME_QUERIED, "Q" }
+
+DECLARE_EVENT_CLASS(ctime,
+   TP_PROTO(struct inode *inode,
+struct timespec64 *ctime),
+
+   TP_ARGS(inode, ctime),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(time64_t,   ctime_s)
+   __field(u32,ctime_ns)
+   __field(u32,gen)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->gen= inode->i_generation;
+   __entry->ctime_s= ctime->tv_sec;
+   __entry->ctime_ns   = ctime->tv_nsec;
+   ),
+
+   TP_printk("ino=%d:%d:%ld:%u ctime=%lld.%u",
+   MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, __entry->gen,
+   __entry->ctime_s, __entry->ctime_ns
+   )
+);
+
+DEFINE_EVENT(ctime, inode_set_ctime_to_ts,
+   TP_PROTO(struct inode *inode,
+struct timespec64 *ctime),
+   TP_ARGS(inode, ctime));
+
+DEFINE_EVENT(ctime, ctime_xchg_skip,
+   TP_PROTO(struct inode *inode,
+struct timespec64 *ctime),
+   TP_ARGS(inode, ctime));
+
+TRACE_EVENT(ctime_ns_xchg,
+   TP_PROTO(struct inode *inode,
+u32 old,
+u32 new,
+u32 cur),
+
+   TP_ARGS(inode, old, new, cur),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(u32,gen)
+   __field(u32,old)
+   __field(u32,new)
+   __field(u32,cur)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino  

[PATCH v6 1/9] fs: add infrastructure for multigrain timestamps

2024-07-15 Thread Jeff Layton
The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot of metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide when to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g. backup
applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
as a flag that indicates whether the current timestamps have been
queried via stat() or the like. When it's set, we allow the kernel to
use a fine-grained timestamp iff it's necessary to make the ctime show
a different value.
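
As an illustration of the flag trick described above, here is a small userspace sketch using C11 atomics. It assumes the flag is bit 31 of a 32-bit nanoseconds field (nanosecond values stay below 2^30, so that bit is otherwise unused); the M_CTIME_QUERIED name, the m_inode struct and both helpers are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag position for the model: the top bit of the nsec field. */
#define M_CTIME_QUERIED (UINT32_C(1) << 31)

struct m_inode {
	int64_t ctime_sec;
	_Atomic uint32_t ctime_nsec;	/* nanoseconds, plus the flag bit */
};

/*
 * stat()-like read: atomically set the flag and hand back the
 * nanoseconds with the flag masked off.
 */
static uint32_t m_query_ctime_nsec(struct m_inode *inode)
{
	uint32_t old = atomic_fetch_or(&inode->ctime_nsec, M_CTIME_QUERIED);

	return old & ~M_CTIME_QUERIED;
}

/* Has anyone looked at this timestamp since it was last written? */
static bool m_ctime_was_queried(struct m_inode *inode)
{
	return (atomic_load(&inode->ctime_nsec) & M_CTIME_QUERIED) != 0;
}
```

Setting the flag with a single fetch-or means a reader never needs a lock, and a writer can later tell from the same word whether the value it is about to replace was ever observed.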

This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible for a
file being changed to get a fine-grained timestamp. A file that is
altered just a bit later can then get a coarse-grained one that appears
older than the earlier fine-grained time. This violates timestamp
ordering guarantees.

To remedy this, keep a global monotonic atomic64_t value that acts as a
timestamp floor. When we go to stamp a file, we first get the later of
the current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it with
that value.

If it has been queried, then first see whether the current coarse time
is later than the existing ctime. If it is, then we accept that value.
If it isn't, then we get a fine-grained time and try to swap that into
the global floor. Whether that succeeds or fails, we take the resulting
floor time, convert it to realtime and try to swap that into the ctime.

We take the result of the ctime swap whether it succeeds or fails, since
either is just as valid.

Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same floor
value as multigrain filesystems).
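
To make the sequence above concrete, here is a minimal userspace sketch. It
is not the code from this patch. It assumes a single clock domain (so there
is no monotonic-to-realtime conversion), takes the coarse and fine clock
readings as parameters, and keeps the whole timestamp in one 64-bit word with
the flag in the top bit. The toy_* names and TOY_QUERIED are made up for
illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag: top bit of a 64-bit nanosecond stamp. */
#define TOY_QUERIED (UINT64_C(1) << 63)

/* Global floor: latest fine-grained time handed out so far. */
static _Atomic uint64_t toy_floor;

struct toy_inode {
	_Atomic uint64_t ctime;	/* nanoseconds, plus TOY_QUERIED */
};

/* stat()-like read: return the ctime and flag it as queried. */
static uint64_t toy_query(struct toy_inode *inode)
{
	return atomic_fetch_or(&inode->ctime, TOY_QUERIED) & ~TOY_QUERIED;
}

/* Stamp the inode; coarse and fine are the current clock readings. */
static uint64_t toy_stamp(struct toy_inode *inode, uint64_t coarse,
			  uint64_t fine)
{
	uint64_t floor = atomic_load(&toy_floor);
	/* Start from the later of the floor and the coarse time. */
	uint64_t now = coarse > floor ? coarse : floor;
	uint64_t cur = atomic_load(&inode->ctime);

	/* Queried, and the coarse value would not show a change? */
	if ((cur & TOY_QUERIED) && now <= (cur & ~TOY_QUERIED)) {
		assert(fine >= floor);	/* both come from one clock */
		/* Try to advance the floor to the fine-grained time. */
		if (atomic_compare_exchange_strong(&toy_floor, &floor, fine))
			now = fine;
		else
			now = floor;	/* holds the racing task's value */
	}

	/*
	 * Swap the new value in (this also clears the flag). If the swap
	 * fails, report whatever is there now. The real code retries when
	 * only the flag changed; this sketch does not.
	 */
	if (atomic_compare_exchange_strong(&inode->ctime, &cur, now))
		return now;
	return cur & ~TOY_QUERIED;
}
```

With a coarse tick of 1000, stamping an inode twice yields 1000 both times.
After toy_query(), the next stamp in the same tick takes the fine value (say
1009), and a later stamp of a different, unqueried inode in that tick also
gets 1009 from the floor rather than the older 1000.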

Reviewed-by: Darrick J. Wong 
Reviewed-by: Josef Bacik 
Signed-off-by: Jeff Layton 
---
 fs/inode.c | 176 +
 fs/stat.c  |  36 ++-
 include/linux/fs.h |  34 ---
 3 files changed, 209 insertions(+), 37 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index f356fe2ec2b6..417acbeabef3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -60,6 +60,13 @@ static unsigned int i_hash_shift __ro_after_init;
 static struct hlist_head *inode_hashtable __ro_after_init;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
+/*
+ * This represents the latest fine-grained time that we have handed out as a
+ * timestamp on the system. Tracked as a monotonic value, and converted to the
+ * realtime clock on an as-needed basis.
+ */
+static __cacheline_aligned_in_smp atomic64_t ctime_floor;
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2127,19 +2134,72 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);
 
+/**
+ * coarse_ctime - return the current coarse-grained time
+ * @floor: current (monotonic) ctime_floor value
+ *
+ * Get the coarse-grained time, and then determine whether to
+ * return it or the current floor value. Returns the later of the
+ * floor and coarse grained timestamps, converted to realtime
+ * clock value.
+ */
+static ktime_t coarse_ctime(ktime_t floor)
+{
+   ktime_t coarse = ktime_get_coarse();
+
+   /* If coarse time is already newer, return that */
+   if (!ktime_after(floor, coarse))
+       return ktime_get_coarse_real();
+   return ktime_mono_to_real(floor);
+}
+
+/**
+ * current_time - Return FS time (possibly fine-grained)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
+ * as having been QUERIED, get a fine-grained timestamp.
+ */
+struct timespec64 current_time(struct inode *inode)
+{
+   ktime_t floor = atomic64_read(&ctime_floor);
+   ktime_t now = coarse_ctime(floor);
+   struct timespec64 now_ts = ktime_to_timespec64(now);
+   u32 cns;
+
+   if (!is_mgtime(inode))
+       goto out;
+
+   /* If nothing has quer

[PATCH v6 0/9] fs: multigrain timestamp redux

2024-07-15 Thread Jeff Layton
I think this is pretty much ready for linux-next now. Since the latest
changes are pretty minimal, I've left the Reviewed-by's intact. It would
be nice to have acks or reviews from maintainers for ext4 and tmpfs too.

I did try to plumb this into bcachefs too, but the way it handles
timestamps makes that pretty difficult. It keeps the active copies in an
internal representation of the on-disk inode and periodically copies
them to struct inode. This is backward from the way most blockdev
filesystems do this.

Christian, would you be willing to pick these up with an eye toward
v6.12 after the merge window settles?

Thanks!

Signed-off-by: Jeff Layton 
---
Changes in v6:
- Normalize timespec64 in inode_set_ctime_to_ts
- use DEFINE_PER_CPU counters for better vfs consistency
- skip ctime cmpxchg if the result means nothing will change
- add trace_ctime_xchg_skip to track skipped ctime updates
- use __print_flags in ctime_ns_xchg tracepoint
- Link to v5: 
https://lore.kernel.org/r/20240711-mgtime-v5-0-37bb5b465...@kernel.org

Changes in v5:
- refetch coarse time in coarse_ctime if not returning floor
- timestamp_truncate before swapping new ctime value into place
- track floor value as atomic64_t
- cleanups to Documentation file
- Link to v4: 
https://lore.kernel.org/r/20240708-mgtime-v4-0-a0f3c6fb5...@kernel.org

Changes in v4:
- reordered tracepoint fields for better packing
- rework percpu counters again to also count fine grained timestamps
- switch to try_cmpxchg for better efficiency
- Link to v3: 
https://lore.kernel.org/r/20240705-mgtime-v3-0-85b2daa9b...@kernel.org

Changes in v3:
- Drop the conversion of i_ctime fields to ktime_t, and use an unused bit
  of the i_ctime_nsec field as QUERIED flag.
- Better tracepoints for tracking floor and ctime updates
- Reworked percpu counters to be more useful
- Track floor as monotonic value, which eliminates clock-jump problem

Changes in v2:
- Added Documentation file
- Link to v1: 
https://lore.kernel.org/r/20240626-mgtime-v1-0-a189352d0...@kernel.org

---
Jeff Layton (9):
  fs: add infrastructure for multigrain timestamps
  fs: tracepoints around multigrain timestamp events
  fs: add percpu counters for significant multigrain timestamp events
  fs: have setattr_copy handle multigrain timestamps appropriately
  Documentation: add a new file documenting multigrain timestamps
  xfs: switch to multigrain timestamps
  ext4: switch to multigrain timestamps
  btrfs: convert to multigrain timestamps
  tmpfs: add support for multigrain timestamps

 Documentation/filesystems/multigrain-ts.rst | 120 +
 fs/attr.c   |  52 +-
 fs/btrfs/file.c |  25 +--
 fs/btrfs/super.c|   3 +-
 fs/ext4/super.c |   2 +-
 fs/inode.c  | 251 +---
 fs/stat.c   |  39 -
 fs/xfs/libxfs/xfs_trans_inode.c |   6 +-
 fs/xfs/xfs_iops.c   |  10 +-
 fs/xfs/xfs_super.c  |   2 +-
 include/linux/fs.h  |  34 +++-
 include/trace/events/timestamp.h| 124 ++
 mm/shmem.c  |   2 +-
 13 files changed, 592 insertions(+), 78 deletions(-)
---
base-commit: bb83a76c647a96db4c9ae77b0577170da4d7bd77
change-id: 20240626-mgtime-5cd80b18d810

Best regards,
-- 
Jeff Layton 




Re: [PATCH v5 6/9] xfs: switch to multigrain timestamps

2024-07-11 Thread Jeff Layton
On Thu, 2024-07-11 at 12:14 -0700, Darrick J. Wong wrote:
> On Thu, Jul 11, 2024 at 11:58:59AM -0400, Jeff Layton wrote:
> > On Thu, 2024-07-11 at 08:09 -0700, Darrick J. Wong wrote:
> > > On Thu, Jul 11, 2024 at 07:08:10AM -0400, Jeff Layton wrote:
> > > > Enable multigrain timestamps, which should ensure that there is an
> > > > apparent change to the timestamp whenever it has been written after
> > > > being actively observed via getattr.
> > > > 
> > > > Also, anytime the mtime changes, the ctime must also change, and those
> > > > are now the only two options for xfs_trans_ichgtime. Have that function
> > > > unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> > > > always set.
> > > > 
> > > > Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
> > > > should give us better semantics now.
> > > 
> > > Following up on "As long as the fs isn't touching i_ctime_nsec directly,
> > > you shouldn't need to worry about this" from:
> > > https://lore.kernel.org/linux-xfs/cae5c28f172ac57b7eaaa98a00b23f342f01ba64.ca...@kernel.org/
> > > 
> > > xfs /does/ touch i_ctime_nsec directly when it's writing inodes to disk.
> > > From xfs_inode_to_disk, see:
> > > 
> > >   to->di_ctime = xfs_inode_to_disk_ts(ip, inode_get_ctime(inode));
> > > 
> > > AFAICT, inode_get_ctime itself remains unchanged, and still returns
> > > inode->__i_ctime, right?  In which case it's returning a raw timespec64,
> > > which can include the QUERIED flag in tv_nsec, right?
> > > 
> > 
> > No, in the first patch in the series, inode_get_ctime becomes this:
> > 
> > #define I_CTIME_QUERIED ((u32)BIT(31))
> > 
> > static inline time64_t inode_get_ctime_sec(const struct inode *inode)
> > {
> >     return inode->i_ctime_sec;
> > }
> > 
> > static inline long inode_get_ctime_nsec(const struct inode *inode)
> > {
> >     return inode->i_ctime_nsec & ~I_CTIME_QUERIED;
> > }
> > 
> > static inline struct timespec64 inode_get_ctime(const struct inode *inode)
> > {
> >     struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
> >                              .tv_nsec = inode_get_ctime_nsec(inode) };
> > 
> >     return ts;
> > }
> 
> Doh!  I forgot that this has already been soaking in the vfs tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/include/linux/fs.h?h=next-20240711&id=3aa63a569c64e708df547a8913c84e64a06e7853
> 
> > ...which should ensure that you never store the QUERIED bit.
> 
> So yep, we're fine here.  Sorry about the noise; this was the very
> subtle clue in the diff that the change had already been applied:
> 
>  static inline struct timespec64 inode_get_ctime(const struct inode *inode)
> @@ -1626,13 +1637,7 @@ static inline struct timespec64 inode_get_ctime(const 
> struct inode *inode)
>   return ts;
>  }
> 
> (Doh doh doh doh doh...)
> 
> > > Now let's look at the consumer:
> > > 
> > > static inline xfs_timestamp_t
> > > xfs_inode_to_disk_ts(
> > >   struct xfs_inode*ip,
> > >   const struct timespec64 tv)
> > > {
> > >   struct xfs_legacy_timestamp *lts;
> > >   xfs_timestamp_t ts;
> > > 
> > >   if (xfs_inode_has_bigtime(ip))
> > >   return cpu_to_be64(xfs_inode_encode_bigtime(tv));
> > > 
> > >   lts = (struct xfs_legacy_timestamp *)&ts;
> > >   lts->t_sec = cpu_to_be32(tv.tv_sec);
> > >   lts->t_nsec = cpu_to_be32(tv.tv_nsec);
> > > 
> > >   return ts;
> > > }
> > > 
> > > For the !bigtime case (aka before we added y2038 support) the queried
> > > flag gets encoded into the tv_nsec field since xfs doesn't filter the
> > > queried flag.
> > > 
> > > For the bigtime case, the timespec is turned into an absolute nsec count
> > > since the xfs epoch (which is the minimum timestamp possible under the
> > > old encoding scheme):
> > > 
> > > static inline uint64_t xfs_inode_encode_bigtime(struct timespec64 tv)
> > > {
> > >   return xfs_unix_to_bigtime(tv.tv_sec) * NSEC_PER_SEC + tv.tv_nsec;
> > > }
> > > 
> > > Here we'd also be mixing in the QUERIED flag, only now we've encoded a
> > > time that's a second in the fut

Re: [PATCH v5 5/9] Documentation: add a new file documenting multigrain timestamps

2024-07-11 Thread Jeff Layton
On Thu, 2024-07-11 at 12:12 -0700, Darrick J. Wong wrote:
> On Thu, Jul 11, 2024 at 07:08:09AM -0400, Jeff Layton wrote:
> > Add a high-level document that describes how multigrain timestamps work,
> > rationale for them, and some info about implementation and tradeoffs.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  Documentation/filesystems/multigrain-ts.rst | 120 
> > 
> >  1 file changed, 120 insertions(+)
> > 
> > diff --git a/Documentation/filesystems/multigrain-ts.rst 
> > b/Documentation/filesystems/multigrain-ts.rst
> > new file mode 100644
> > index ..5cefc204ecec
> > --- /dev/null
> > +++ b/Documentation/filesystems/multigrain-ts.rst
> > @@ -0,0 +1,120 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=====================
> > +Multigrain Timestamps
> > +=====================
> > +
> > +Introduction
> > +
> > +Historically, the kernel has always used coarse time values to stamp
> > +inodes. This value is updated on every jiffy, so any change that happens
> > +within that jiffy will end up with the same timestamp.
> > +
> > +When the kernel goes to stamp an inode (due to a read or write), it first gets
> > +the current time and then compares it to the existing timestamp(s) to see
> > +whether anything will change. If nothing changed, then it can avoid updating
> > +the inode's metadata.
> > +
> > +Coarse timestamps are therefore good from a performance standpoint, since they
> > +reduce the need for metadata updates, but bad from the standpoint of
> > +determining whether anything has changed, since a lot of things can happen in a
> > +jiffy.
> > +
> > +They are particularly troublesome with NFSv3, where unchanging timestamps can
> > +make it difficult to tell whether to invalidate caches. NFSv4 provides a
> > +dedicated change attribute that should always show a visible change, but not
> > +all filesystems implement this properly, causing the NFS server to substitute
> > +the ctime in many cases.
> > +
> > +Multigrain timestamps aim to remedy this by selectively using fine-grained
> > +timestamps when a file has had its timestamps queried recently, and the current
> > +coarse-grained time does not cause a change.
> > +
> > +Inode Timestamps
> > +
> > +There are currently 3 timestamps in the inode that are updated to the current
> > +wallclock time on different activity:
> > +ctime:
> > +  The inode change time. This is stamped with the current time whenever
> > +  the inode's metadata is changed. Note that this value is not settable
> > +  from userland.
> > +
> > +mtime:
> > +  The inode modification time. This is stamped with the current time
> > +  any time a file's contents change.
> > +
> > +atime:
> > +  The inode access time. This is stamped whenever an inode's contents are
> > +  read. Widely considered to be a terrible mistake. Usually avoided with
> > +  options like noatime or relatime.
> 
> And for btime/crtime (aka creation time) a filesystem can take the
> coarse timestamp, right?  It's not settable by userspace, and I think
> statx is the only way those are ever exposed.  QUERIED is never set when
> the file is being created.
> 

Yep. I'd just copy the ctime to the btime after it's set on creation so
that everything lines up nicely.

> > +Updating the mtime always implies a change to the ctime, but updating the
> > +atime due to a read request does not.
> > +
> > +Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
> > +not affected and always use the coarse-grained value (subject to the floor).
> 
> Is it ok if an atime update uses the same timespec as was used for a
> ctime update?  There's a pending update for 6.11 that changes
> xfs_trans_ichgtime to do:
>
>   tv = current_time(inode);
> 
>   if (flags & XFS_ICHGTIME_MOD)
>   inode_set_mtime_to_ts(inode, tv);
>   if (flags & XFS_ICHGTIME_CHG)
>   inode_set_ctime_to_ts(inode, tv);
>   if (flags & XFS_ICHGTIME_ACCESS)
>   inode_set_atime_to_ts(inode, tv);
>   if (flags & XFS_ICHGTIME_CREATE)
>   ip->i_crtime = tv;
> 

Yeah, that should be fine. If you were doing some (hypothetical)
operation that needs to set both the ctime and the atime, then the
natural thing to do is to just l

Re: [PATCH v5 2/9] fs: tracepoints around multigrain timestamp events

2024-07-11 Thread Jeff Layton
On Thu, 2024-07-11 at 09:49 -0700, Darrick J. Wong wrote:
> On Thu, Jul 11, 2024 at 07:08:06AM -0400, Jeff Layton wrote:
> > Add some tracepoints around various multigrain timestamp events.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/inode.c   |   5 ++
> >  fs/stat.c    |   3 ++
> >  include/trace/events/timestamp.h | 109
> > +++
> >  3 files changed, 117 insertions(+)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 2b5889ff7b36..81b45e0a95a6 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -22,6 +22,9 @@
> >  #include 
> >  #include 
> >  #include 
> > +#define CREATE_TRACE_POINTS
> > +#include 
> > +
> >  #include "internal.h"
> >  
> >  /*
> > @@ -2571,6 +2574,7 @@ struct timespec64
> > inode_set_ctime_to_ts(struct inode *inode, struct timespec64 t
> >  {
> >     inode->i_ctime_sec = ts.tv_sec;
> >     inode->i_ctime_nsec = ts.tv_nsec & ~I_CTIME_QUERIED;
> > +   trace_inode_set_ctime_to_ts(inode, &ts);
> >     return ts;
> >  }
> >  EXPORT_SYMBOL(inode_set_ctime_to_ts);
> > @@ -2670,6 +2674,7 @@ struct timespec64
> > inode_set_ctime_current(struct inode *inode)
> >     if (try_cmpxchg(&inode->i_ctime_nsec, &cur,
> > now_ts.tv_nsec)) {
> >     /* If swap occurred, then we're (mostly) done */
> >     inode->i_ctime_sec = now_ts.tv_sec;
> > +   trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec,
> > cur);
> >     } else {
> >     /*
> >  * Was the change due to someone marking the old
> > ctime QUERIED?
> > diff --git a/fs/stat.c b/fs/stat.c
> > index df7fdd3afed9..552dfd67688b 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -23,6 +23,8 @@
> >  #include 
> >  #include 
> >  
> > +#include 
> > +
> >  #include "internal.h"
> >  #include "mount.h"
> >  
> > @@ -49,6 +51,7 @@ void fill_mg_cmtime(struct kstat *stat, u32
> > request_mask, struct inode *inode)
> >     stat->mtime = inode_get_mtime(inode);
> >     stat->ctime.tv_sec = inode->i_ctime_sec;
> >     stat->ctime.tv_nsec =
> > ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn)) & ~I_CTIME_QUERIED;
> > +   trace_fill_mg_cmtime(inode, &stat->ctime, &stat->mtime);
> >  }
> >  EXPORT_SYMBOL(fill_mg_cmtime);
> >  
> > diff --git a/include/trace/events/timestamp.h
> > b/include/trace/events/timestamp.h
> > new file mode 100644
> > index ..3a603190b46c
> > --- /dev/null
> > +++ b/include/trace/events/timestamp.h
> > @@ -0,0 +1,109 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM timestamp
> > +
> > +#if !defined(_TRACE_TIMESTAMP_H) ||
> > defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_TIMESTAMP_H
> > +
> > +#include 
> > +#include 
> > +
> > +TRACE_EVENT(inode_set_ctime_to_ts,
> > +   TP_PROTO(struct inode *inode,
> > +struct timespec64 *ctime),
> > +
> > +   TP_ARGS(inode, ctime),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(dev_t,  dev)
> > +   __field(ino_t,  ino)
> > +   __field(time64_t,   ctime_s)
> > +   __field(u32,ctime_ns)
> > +   __field(u32,gen)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->dev= inode->i_sb->s_dev;
> 
> Odd indenting of the second columns between the struct definition
> above
> and the assignment code here.
> 
> > +   __entry->ino= inode->i_ino;
> > +   __entry->gen= inode->i_generation;
> > +   __entry->ctime_s= ctime->tv_sec;
> > +   __entry->ctime_ns   = ctime->tv_nsec;
> > +   ),
> > +
> > +   TP_printk("ino=%d:%d:%ld:%u ctime=%lld.%u",
> > +   MAJOR(__entry->dev), MINOR(__entry->dev), __entry-
> > >ino, __entry->gen,
> > +   __entry->ctime_s, __entry->ctime_ns
> > +   )
> > +);
> > +
> > +TRACE_EVENT(ctime_ns_xchg,
> > +   TP_PROTO(struct inode *inode,
> > +u32 old,
> > +u32 new,
> > +u32 cur),
> > +
> > +   TP_ARGS(inode, old, new, cur),
>

Re: [PATCH v5 6/9] xfs: switch to multigrain timestamps

2024-07-11 Thread Jeff Layton
On Thu, 2024-07-11 at 08:09 -0700, Darrick J. Wong wrote:
> On Thu, Jul 11, 2024 at 07:08:10AM -0400, Jeff Layton wrote:
> > Enable multigrain timestamps, which should ensure that there is an
> > apparent change to the timestamp whenever it has been written after
> > being actively observed via getattr.
> > 
> > Also, anytime the mtime changes, the ctime must also change, and those
> > are now the only two options for xfs_trans_ichgtime. Have that function
> > unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> > always set.
> > 
> > Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
> > should give us better semantics now.
> 
> Following up on "As long as the fs isn't touching i_ctime_nsec directly,
> you shouldn't need to worry about this" from:
> https://lore.kernel.org/linux-xfs/cae5c28f172ac57b7eaaa98a00b23f342f01ba64.ca...@kernel.org/
> 
> xfs /does/ touch i_ctime_nsec directly when it's writing inodes to disk.
> From xfs_inode_to_disk, see:
> 
>   to->di_ctime = xfs_inode_to_disk_ts(ip, inode_get_ctime(inode));
> 
> AFAICT, inode_get_ctime itself remains unchanged, and still returns
> inode->__i_ctime, right?  In which case it's returning a raw timespec64,
> which can include the QUERIED flag in tv_nsec, right?
> 

No, in the first patch in the series, inode_get_ctime becomes this:

#define I_CTIME_QUERIED ((u32)BIT(31))

static inline time64_t inode_get_ctime_sec(const struct inode *inode)
{
    return inode->i_ctime_sec;
}

static inline long inode_get_ctime_nsec(const struct inode *inode)
{
    return inode->i_ctime_nsec & ~I_CTIME_QUERIED;
}

static inline struct timespec64 inode_get_ctime(const struct inode *inode)
{
    struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
                             .tv_nsec = inode_get_ctime_nsec(inode) };

    return ts;
}

...which should ensure that you never store the QUERIED bit.

> Now let's look at the consumer:
> 
> static inline xfs_timestamp_t
> xfs_inode_to_disk_ts(
>   struct xfs_inode*ip,
>   const struct timespec64 tv)
> {
>   struct xfs_legacy_timestamp *lts;
>   xfs_timestamp_t ts;
> 
>   if (xfs_inode_has_bigtime(ip))
>   return cpu_to_be64(xfs_inode_encode_bigtime(tv));
> 
>   lts = (struct xfs_legacy_timestamp *)&ts;
>   lts->t_sec = cpu_to_be32(tv.tv_sec);
>   lts->t_nsec = cpu_to_be32(tv.tv_nsec);
> 
>   return ts;
> }
> 
> For the !bigtime case (aka before we added y2038 support) the queried
> flag gets encoded into the tv_nsec field since xfs doesn't filter the
> queried flag.
> 
> For the bigtime case, the timespec is turned into an absolute nsec count
> since the xfs epoch (which is the minimum timestamp possible under the
> old encoding scheme):
> 
> static inline uint64_t xfs_inode_encode_bigtime(struct timespec64 tv)
> {
>   return xfs_unix_to_bigtime(tv.tv_sec) * NSEC_PER_SEC + tv.tv_nsec;
> }
> 
> Here we'd also be mixing in the QUERIED flag, only now we've encoded a
> time that's a second in the future.  I think the solution is to add a:
> 
> static inline struct timespec64
> inode_peek_ctime(const struct inode *inode)
> {
>   return (struct timespec64){
>   .tv_sec = inode->__i_ctime.tv_sec,
>   .tv_nsec = inode->__i_ctime.tv_nsec & ~I_CTIME_QUERIED,
>   };
> }
> 
> similar to what inode_peek_iversion does for iversion; and then
> xfs_inode_to_disk can do:
> 
>   to->di_ctime = xfs_inode_to_disk_ts(ip, inode_peek_ctime(inode));
> 
> which would prevent I_CTIME_QUERIED from going out to disk.
> 
> At load time, xfs_inode_from_disk uses inode_set_ctime_to_ts so I think
> xfs won't accidentally introduce QUERIED when it's loading an inode from
> disk.
> 
> 

Also already done in this patchset:

struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
{
    inode->i_ctime_sec = ts.tv_sec;
    inode->i_ctime_nsec = ts.tv_nsec & ~I_CTIME_QUERIED;
    trace_inode_set_ctime_to_ts(inode, &ts);
    return ts;
}
EXPORT_SYMBOL(inode_set_ctime_to_ts);

Basically, we never want to store or fetch the QUERIED flag from disk,
and since it's in an unused bit, we can just universally mask it off
when dealing with "external" users of it.

One caveat -- I am using the sign bit for the QUERIED flag, so I'm
assuming that no one should ever pass inode_set_ctime_to_ts a negative
tv_nsec value.
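
As an illustration only (not code from the patchset), here is a small
userspace sketch of why masking is safe. A valid tv_nsec is below
1,000,000,000, which is less than 2^30, so bit 31 is never part of a real
nanosecond value. The toy_* names are hypothetical, and the assert() merely
stands in for whatever sanity check the kernel might use.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for I_CTIME_QUERIED: the top bit of a 32-bit nsec field. */
#define TOY_CTIME_QUERIED (UINT32_C(1) << 31)
#define TOY_NSEC_PER_SEC  1000000000L

struct toy_inode {
	int64_t  ctime_sec;
	uint32_t ctime_nsec;	/* low bits: nanoseconds, bit 31: flag */
};

/* External readers never see the flag. */
static long toy_get_ctime_nsec(const struct toy_inode *inode)
{
	return (long)(inode->ctime_nsec & ~TOY_CTIME_QUERIED);
}

/* External writers can never set the flag. */
static void toy_set_ctime(struct toy_inode *inode, int64_t sec, long nsec)
{
	/* A negative or oversized nsec would collide with the flag bit. */
	assert(nsec >= 0 && nsec < TOY_NSEC_PER_SEC);
	inode->ctime_sec = sec;
	inode->ctime_nsec = (uint32_t)nsec & ~TOY_CTIME_QUERIED;
}
```

Setting the flag directly on ctime_nsec and then calling
toy_get_ctime_nsec() returns the nanosecond value that was originally
stored, with the flag stripped.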

Maybe I should add a WARN_ON_ONCE here to check for that? It seems
nonsensical, but y

[PATCH v5 9/9] tmpfs: add support for multigrain timestamps

2024-07-11 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

tmpfs only requires the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 7f2b609945a5..75a9a73a769f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4660,7 +4660,7 @@ static struct file_system_type shmem_fs_type = {
.parameters = shmem_fs_parameters,
 #endif
.kill_sb= kill_litter_super,
-   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 
 void __init shmem_init(void)

-- 
2.45.2




[PATCH v5 8/9] btrfs: convert to multigrain timestamps

2024-07-11 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have it overwrite the timestamps unconditionally.

Note that this also drops the IS_I_VERSION check and unconditionally
bumps the change attribute, since SB_I_VERSION is always set on btrfs.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/super.c |  3 ++-
 2 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d90138683a0a..409628c0c3cc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1120,26 +1120,6 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
btrfs_drew_write_unlock(&inode->root->snapshot_lock);
 }
 
-static void update_time_for_write(struct inode *inode)
-{
-   struct timespec64 now, ts;
-
-   if (IS_NOCMTIME(inode))
-   return;
-
-   now = current_time(inode);
-   ts = inode_get_mtime(inode);
-   if (!timespec64_equal(&ts, &now))
-   inode_set_mtime_to_ts(inode, now);
-
-   ts = inode_get_ctime(inode);
-   if (!timespec64_equal(&ts, &now))
-   inode_set_ctime_to_ts(inode, now);
-
-   if (IS_I_VERSION(inode))
-   inode_inc_iversion(inode);
-}
-
 static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 size_t count)
 {
@@ -1171,7 +1151,10 @@ static int btrfs_write_check(struct kiocb *iocb, struct 
iov_iter *from,
 * need to start yet another transaction to update the inode as we will
 * update the inode when we finish writing whatever data we write.
 */
-   update_time_for_write(inode);
+   if (!IS_NOCMTIME(inode)) {
+       inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+       inode_inc_iversion(inode);
+   }
 
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f05cce7c8b8d..1cd50293b98d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2173,7 +2173,8 @@ static struct file_system_type btrfs_fs_type = {
.init_fs_context= btrfs_init_fs_context,
.parameters = btrfs_fs_parameters,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | 
FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+ FS_ALLOW_IDMAP | FS_MGTIME,
  };
 
 MODULE_ALIAS_FS("btrfs");

-- 
2.45.2




[PATCH v5 7/9] ext4: switch to multigrain timestamps

2024-07-11 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

For ext4, we only need to enable the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index eb899628e121..95d4d7c0957a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -7294,7 +7294,7 @@ static struct file_system_type ext4_fs_type = {
.init_fs_context= ext4_init_fs_context,
.parameters = ext4_param_specs,
.kill_sb= ext4_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("ext4");
 

-- 
2.45.2




[PATCH v5 6/9] xfs: switch to multigrain timestamps

2024-07-11 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.

Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
should give us better semantics now.

Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_trans_inode.c |  6 +++---
 fs/xfs/xfs_iops.c   | 10 +++---
 fs/xfs/xfs_super.c  |  2 +-
 3 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 69fc5b981352..1f3639bbf5f0 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -62,12 +62,12 @@ xfs_trans_ichgtime(
ASSERT(tp);
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
 
-   tv = current_time(inode);
+   /* If the mtime changes, then ctime must also change */
+   ASSERT(flags & XFS_ICHGTIME_CHG);
 
+   tv = inode_set_ctime_current(inode);
if (flags & XFS_ICHGTIME_MOD)
inode_set_mtime_to_ts(inode, tv);
-   if (flags & XFS_ICHGTIME_CHG)
-   inode_set_ctime_to_ts(inode, tv);
if (flags & XFS_ICHGTIME_CREATE)
ip->i_crtime = tv;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a00dcbc77e12..d25872f818fa 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -592,8 +592,9 @@ xfs_vn_getattr(
stat->gid = vfsgid_into_kgid(vfsgid);
stat->ino = ip->i_ino;
stat->atime = inode_get_atime(inode);
-   stat->mtime = inode_get_mtime(inode);
-   stat->ctime = inode_get_ctime(inode);
+
+   fill_mg_cmtime(stat, request_mask, inode);
+
stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
 
if (xfs_has_v3inodes(mp)) {
@@ -603,11 +604,6 @@ xfs_vn_getattr(
}
}
 
-   if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
-   stat->change_cookie = inode_query_iversion(inode);
-   stat->result_mask |= STATX_CHANGE_COOKIE;
-   }
-
/*
 * Note: If you add another clause to set an attribute flag, please
 * update attributes_mask below.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..210481b03fdb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
.init_fs_context= xfs_init_fs_context,
.parameters = xfs_fs_parameters,
.kill_sb= xfs_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("xfs");
 

-- 
2.45.2




[PATCH v5 5/9] Documentation: add a new file documenting multigrain timestamps

2024-07-11 Thread Jeff Layton
Add a high-level document that describes how multigrain timestamps work,
rationale for them, and some info about implementation and tradeoffs.

Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/multigrain-ts.rst | 120 
 1 file changed, 120 insertions(+)

diff --git a/Documentation/filesystems/multigrain-ts.rst 
b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index ..5cefc204ecec
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+
+Historically, the kernel has always used coarse time values to stamp
+inodes. This value is updated on every jiffy, so any change that happens
+within that jiffy will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first gets
+the current time and then compares it to the existing timestamp(s) to see
+whether anything will change. If nothing changed, then it can avoid updating
+the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since they
+reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen in a
+jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps can
+make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, causing the NFS server to substitute
+the ctime in many cases.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the current
+coarse-grained time does not cause a change.
+
+Inode Timestamps
+
+There are currently 3 timestamps in the inode that are updated to the current
+wallclock time on different activity:
+
+ctime:
+  The inode change time. This is stamped with the current time whenever
+  the inode's metadata is changed. Note that this value is not settable
+  from userland.
+
+mtime:
+  The inode modification time. This is stamped with the current time
+  any time a file's contents change.
+
+atime:
+  The inode access time. This is stamped whenever an inode's contents are
+  read. Widely considered to be a terrible mistake. Usually avoided with
+  options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
+not affected and always use the coarse-grained value (subject to the floor).
+
+Inode Timestamp Ordering
+
+
+In addition to just providing info about changes to individual files, file
+timestamps also serve an important purpose in applications like "make". These
+programs measure timestamps in order to determine whether source files might be
+newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first. The
+same is true if two threads are issuing similar operations that do not overlap
+in time.
+
+If, however, two threads have racing syscalls that overlap in time, then there
+is no such guarantee, and the second file may appear to have been modified
+before, after or at the same time as the first, regardless of which one was
+submitted first.
+
+Multigrain Timestamps
+=====================
+Multigrain timestamps are aimed at ensuring that changes to a single file are
+always recognizable, without violating the ordering guarantees when multiple
+different files are modified. This affects the mtime and the ctime, but the
+atime will always use coarse-grained timestamps.
+
+It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
+or ctime has been queried. If either or both have, then the kernel takes
+special care to ensure the next timestamp update will display a visible change.
+This ensures tight cache coherency for use-cases like NFS, without sacrificing
+the benefits of reduced metadata updates when files aren't being watched.
+
+The Ctime Floor Value
+=====================
+It's not sufficient to simply use fine or coarse-grained timestamps based on
+whether the mtime or ctime has been queried. A file could get a fine grained
+timestamp, and 

[PATCH v5 4/9] fs: have setattr_copy handle multigrain timestamps appropriately

2024-07-11 Thread Jeff Layton
The setattr codepath is still using coarse-grained timestamps, even on
multigrain filesystems. To fix this, we need to fetch the timestamp for
ctime updates later, at the point where the assignment occurs in
setattr_copy.

On a multigrain inode, ignore the ia_ctime in the attrs, and always
update the ctime to the current clock value. Update the atime and mtime
with the same value (if needed) unless they are being set to other
specific values, à la utimes().

Note that we don't want to do this universally however, as some
filesystems (e.g. most networked fs) want to do an explicit update
elsewhere before updating the local inode.
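
To make the rules above concrete, here is a small user-space model of how
the timestamps get chosen. It is only a sketch, not the kernel code: the
M_* flag values, the model_* types and the "now" parameter (standing in
for whatever inode_set_ctime_current() returns) are made up for
illustration, and assert() stands in for the kernel's WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; the kernel's ATTR_* values differ. */
enum {
	M_ATIME     = 1 << 0,	/* update the atime */
	M_MTIME     = 1 << 1,	/* update the mtime */
	M_CTIME     = 1 << 2,	/* update the ctime */
	M_ATIME_SET = 1 << 3,	/* caller supplied an explicit atime */
	M_MTIME_SET = 1 << 4,	/* caller supplied an explicit mtime */
};

/* Requested changes, loosely modelled on struct iattr. */
struct model_attr {
	unsigned int valid;
	int64_t atime, mtime, ctime;
};

/* The three timestamps of a model inode, in arbitrary time units. */
struct model_times {
	int64_t atime, mtime, ctime;
};

/* Apply the timestamp part of a setattr on a multigrain inode. */
static void model_setattr_times(struct model_times *t,
				const struct model_attr *attr, int64_t now)
{
	/* If the ctime isn't being updated, nothing else should be either. */
	if (!(attr->valid & M_CTIME)) {
		assert(!(attr->valid & (M_ATIME | M_MTIME)));
		return;
	}

	/* attr->ctime is deliberately ignored: always use the current clock. */
	t->ctime = now;

	if (attr->valid & M_ATIME_SET)
		t->atime = attr->atime;		/* explicit value, as with utimes() */
	else if (attr->valid & M_ATIME)
		t->atime = now;

	if (attr->valid & M_MTIME_SET)
		t->mtime = attr->mtime;		/* explicit value, as with utimes() */
	else if (attr->valid & M_MTIME)
		t->mtime = now;
}
```

The point of the model is that the ctime carried in the attrs never
reaches the inode, while explicitly set atime/mtime values still do.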

Signed-off-by: Jeff Layton 
---
 fs/attr.c | 52 ++--
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 825007d5cda4..e03ea6951864 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -271,6 +271,42 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
 }
 EXPORT_SYMBOL(inode_newsize_ok);
 
+/**
+ * setattr_copy_mgtime - update timestamps for mgtime inodes
+ * @inode: inode timestamps to be updated
+ * @attr: attrs for the update
+ *
+ * With multigrain timestamps, we need to take more care to prevent races
+ * when updating the ctime. Always update the ctime to the very latest
+ * using the standard mechanism, and use that to populate the atime and
+ * mtime appropriately (unless we're setting those to specific values).
+ */
+static void setattr_copy_mgtime(struct inode *inode, const struct iattr *attr)
+{
+	unsigned int ia_valid = attr->ia_valid;
+	struct timespec64 now;
+
+	/*
+	 * If the ctime isn't being updated then nothing else should be
+	 * either.
+	 */
+	if (!(ia_valid & ATTR_CTIME)) {
+		WARN_ON_ONCE(ia_valid & (ATTR_ATIME|ATTR_MTIME));
+		return;
+	}
+
+	now = inode_set_ctime_current(inode);
+	if (ia_valid & ATTR_ATIME_SET)
+		inode_set_atime_to_ts(inode, attr->ia_atime);
+	else if (ia_valid & ATTR_ATIME)
+		inode_set_atime_to_ts(inode, now);
+
+	if (ia_valid & ATTR_MTIME_SET)
+		inode_set_mtime_to_ts(inode, attr->ia_mtime);
+	else if (ia_valid & ATTR_MTIME)
+		inode_set_mtime_to_ts(inode, now);
+}
+
 /**
  * setattr_copy - copy simple metadata updates into the generic inode
  * @idmap: idmap of the mount the inode was found from
@@ -303,12 +339,6 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
 
 	i_uid_update(idmap, attr, inode);
 	i_gid_update(idmap, attr, inode);
-	if (ia_valid & ATTR_ATIME)
-		inode_set_atime_to_ts(inode, attr->ia_atime);
-	if (ia_valid & ATTR_MTIME)
-		inode_set_mtime_to_ts(inode, attr->ia_mtime);
-	if (ia_valid & ATTR_CTIME)
-		inode_set_ctime_to_ts(inode, attr->ia_ctime);
 	if (ia_valid & ATTR_MODE) {
 		umode_t mode = attr->ia_mode;
 		if (!in_group_or_capable(idmap, inode,
@@ -316,6 +346,16 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
 			mode &= ~S_ISGID;
 		inode->i_mode = mode;
 	}
+
+	if (is_mgtime(inode))
+		return setattr_copy_mgtime(inode, attr);
+
+	if (ia_valid & ATTR_ATIME)
+		inode_set_atime_to_ts(inode, attr->ia_atime);
+	if (ia_valid & ATTR_MTIME)
+		inode_set_mtime_to_ts(inode, attr->ia_mtime);
+	if (ia_valid & ATTR_CTIME)
+		inode_set_ctime_to_ts(inode, attr->ia_ctime);
 }
 EXPORT_SYMBOL(setattr_copy);
 

-- 
2.45.2




[PATCH v5 3/9] fs: add percpu counters for significant multigrain timestamp events

2024-07-11 Thread Jeff Layton
Add four percpu counters for counting various stats around mgtimes, and a
new debugfs file for displaying them:

- number of attempted ctime updates
- number of successful i_ctime_nsec swaps
- number of fine-grained timestamp fetches
- number of floor value swaps
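
For anyone who wants to consume the new file from userland, a rough
sketch of a parser follows. It assumes the single-line, four-column
format that mgts_show() prints in this patch, in the same order as the
list above. The struct and function names are made up for illustration,
and the usual debugfs mount point (/sys/kernel/debug) is an assumption
about the system, not something this patch guarantees.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the four counters, in the order they are printed. */
struct mgts_stats {
	unsigned long long ctime_updates;	/* attempted ctime updates */
	unsigned long long ctime_swaps;		/* successful i_ctime_nsec swaps */
	unsigned long long fine_stamps;		/* fine-grained timestamp fetches */
	unsigned long long floor_swaps;		/* floor value swaps */
};

/* Parse one line of output; returns 0 on success, -1 on malformed input. */
static int mgts_parse(const char *line, struct mgts_stats *st)
{
	int n;

	assert(line && st);
	n = sscanf(line, "%llu %llu %llu %llu",
		   &st->ctime_updates, &st->ctime_swaps,
		   &st->fine_stamps, &st->floor_swaps);
	return n == 4 ? 0 : -1;
}

/* Read and parse the stats file at @path; returns 0 on success, -1 on error. */
static int mgts_read(const char *path, struct mgts_stats *st)
{
	char buf[128];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return mgts_parse(buf, st);
}
```

With debugfs mounted in the usual place, a caller would pass something
like "/sys/kernel/debug/multigrain_timestamps" to mgts_read(). Comparing
fine_stamps against ctime_updates gives a rough idea of how often the
fine-grained path is actually taken.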

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 60 +++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index 81b45e0a95a6..011148c82901 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #define CREATE_TRACE_POINTS
 #include 
@@ -70,6 +72,11 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
  */
 static __cacheline_aligned_in_smp atomic64_t ctime_floor;
 
+static struct percpu_counter mg_ctime_updates;
+static struct percpu_counter mg_floor_swaps;
+static struct percpu_counter mg_ctime_swaps;
+static struct percpu_counter mg_fine_stamps;
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2654,6 +2661,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
 
/* Get a fine-grained time */
fine = ktime_get();
+   percpu_counter_inc(&mg_fine_stamps);
 
/*
 * If the cmpxchg works, we take the new floor value. If
@@ -2662,11 +2670,14 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
 * as good, so keep it.
 */
old = floor;
-   if (!atomic64_try_cmpxchg(&ctime_floor, &old, fine))
+   if (atomic64_try_cmpxchg(&ctime_floor, &old, fine))
+   percpu_counter_inc(&mg_floor_swaps);
+   else
fine = old;
now = ktime_mono_to_real(fine);
}
}
+   percpu_counter_inc(&mg_ctime_updates);
now_ts = timestamp_truncate(ktime_to_timespec64(now), inode);
cur = cns;
 retry:
@@ -2675,6 +2686,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
/* If swap occurred, then we're (mostly) done */
inode->i_ctime_sec = now_ts.tv_sec;
trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec, cur);
+   percpu_counter_inc(&mg_ctime_swaps);
} else {
/*
 * Was the change due to someone marking the old ctime QUERIED?
@@ -2744,3 +2756,49 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap,
return mode & ~S_ISGID;
 }
 EXPORT_SYMBOL(mode_strip_sgid);
+
+static int mgts_show(struct seq_file *s, void *p)
+{
+	u64 ctime_updates = percpu_counter_sum(&mg_ctime_updates);
+	u64 ctime_swaps = percpu_counter_sum(&mg_ctime_swaps);
+	u64 fine_stamps = percpu_counter_sum(&mg_fine_stamps);
+	u64 floor_swaps = percpu_counter_sum(&mg_floor_swaps);
+
+	seq_printf(s, "%llu %llu %llu %llu\n",
+		   ctime_updates, ctime_swaps, fine_stamps, floor_swaps);
+	return 0;
+}
+
+DEFINE_SHOW_ATTRIBUTE(mgts);
+
+static int __init mg_debugfs_init(void)
+{
+	int ret = percpu_counter_init(&mg_ctime_updates, 0, GFP_KERNEL);
+
+	if (ret)
+		return ret;
+
+	ret = percpu_counter_init(&mg_floor_swaps, 0, GFP_KERNEL);
+	if (ret) {
+		percpu_counter_destroy(&mg_ctime_updates);
+		return ret;
+	}
+
+	ret = percpu_counter_init(&mg_ctime_swaps, 0, GFP_KERNEL);
+	if (ret) {
+		percpu_counter_destroy(&mg_floor_swaps);
+		percpu_counter_destroy(&mg_ctime_updates);
+		return ret;
+	}
+
+	ret = percpu_counter_init(&mg_fine_stamps, 0, GFP_KERNEL);
+	if (ret) {
+		percpu_counter_destroy(&mg_floor_swaps);
+		percpu_counter_destroy(&mg_ctime_updates);
+		percpu_counter_destroy(&mg_ctime_swaps);
+		return ret;
+	}
+	debugfs_create_file("multigrain_timestamps", S_IFREG | S_IRUGO, NULL, NULL, &mgts_fops);
+	return 0;
+}
+late_initcall(mg_debugfs_init);

-- 
2.45.2




[PATCH v5 2/9] fs: tracepoints around multigrain timestamp events

2024-07-11 Thread Jeff Layton
Add some tracepoints around various multigrain timestamp events.

Signed-off-by: Jeff Layton 
---
 fs/inode.c   |   5 ++
 fs/stat.c|   3 ++
 include/trace/events/timestamp.h | 109 +++
 3 files changed, 117 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 2b5889ff7b36..81b45e0a95a6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,9 @@
 #include 
 #include 
 #include 
+#define CREATE_TRACE_POINTS
+#include 
+
 #include "internal.h"
 
 /*
@@ -2571,6 +2574,7 @@ struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 t
 {
 	inode->i_ctime_sec = ts.tv_sec;
 	inode->i_ctime_nsec = ts.tv_nsec & ~I_CTIME_QUERIED;
+	trace_inode_set_ctime_to_ts(inode, &ts);
 	return ts;
 }
 EXPORT_SYMBOL(inode_set_ctime_to_ts);
@@ -2670,6 +2674,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now_ts.tv_nsec)) {
/* If swap occurred, then we're (mostly) done */
inode->i_ctime_sec = now_ts.tv_sec;
+   trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec, cur);
} else {
/*
 * Was the change due to someone marking the old ctime QUERIED?
diff --git a/fs/stat.c b/fs/stat.c
index df7fdd3afed9..552dfd67688b 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 
+#include 
+
 #include "internal.h"
 #include "mount.h"
 
@@ -49,6 +51,7 @@ void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
 	stat->mtime = inode_get_mtime(inode);
 	stat->ctime.tv_sec = inode->i_ctime_sec;
 	stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn)) & ~I_CTIME_QUERIED;
+	trace_fill_mg_cmtime(inode, &stat->ctime, &stat->mtime);
 }
 EXPORT_SYMBOL(fill_mg_cmtime);
 
diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
new file mode 100644
index ..3a603190b46c
--- /dev/null
+++ b/include/trace/events/timestamp.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM timestamp
+
+#if !defined(_TRACE_TIMESTAMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TIMESTAMP_H
+
+#include 
+#include 
+
+TRACE_EVENT(inode_set_ctime_to_ts,
+   TP_PROTO(struct inode *inode,
+struct timespec64 *ctime),
+
+   TP_ARGS(inode, ctime),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(time64_t,   ctime_s)
+   __field(u32,ctime_ns)
+   __field(u32,gen)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->gen= inode->i_generation;
+   __entry->ctime_s= ctime->tv_sec;
+   __entry->ctime_ns   = ctime->tv_nsec;
+   ),
+
+   TP_printk("ino=%d:%d:%ld:%u ctime=%lld.%u",
+   MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, __entry->gen,
+   __entry->ctime_s, __entry->ctime_ns
+   )
+);
+
+TRACE_EVENT(ctime_ns_xchg,
+   TP_PROTO(struct inode *inode,
+u32 old,
+u32 new,
+u32 cur),
+
+   TP_ARGS(inode, old, new, cur),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(u32,gen)
+   __field(u32,old)
+   __field(u32,new)
+   __field(u32,cur)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->gen= inode->i_generation;
+   __entry->old= old;
+   __entry->new= new;
+   __entry->cur= cur;
+   ),
+
+   TP_printk("ino=%d:%d:%ld:%u old=%u:%c new=%u cur=%u:%c",
+   MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, __entry->gen,
+   __entry->old & ~I_CTIME_QUERIED, __entry->old & I_CTIME_QUERIED ? 'Q' : '-',
+   __entry->new,
+   __entry->cur & ~I_CTIME_QUERIED, __entry->cur & I_CTIME_QUERIED ? 'Q' : '-'
+   )
+);
+
+TRACE_EVENT(fill_mg_cmtime

[PATCH v5 1/9] fs: add infrastructure for multigrain timestamps

2024-07-11 Thread Jeff Layton
The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot of metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide when to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g. backup
applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
as a flag that indicates whether the current timestamps have been
queried via stat() or the like. When it's set, we allow the kernel to
use a fine-grained timestamp iff it's necessary to make the ctime show
a different value.

This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible for a
file being changed to get a fine-grained timestamp. A file that is
altered just a bit later can then get a coarse-grained one that appears
older than the earlier fine-grained time. This violates timestamp
ordering guarantees.

To remedy this, keep a global monotonic ktime_t value that acts as a
timestamp floor.  When we go to stamp a file, we first get the later of
the current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it with
that value.

If it has been queried, then first see whether the current coarse time
is later than the existing ctime. If it is, then we accept that value.
If it isn't, then we get a fine-grained time and try to swap that into
the global floor. Whether that succeeds or fails, we take the resulting
floor time, convert it to realtime and try to swap that into the ctime.

We take the result of the ctime swap whether it succeeds or fails, since
either is just as valid.
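
The selection logic described above can be modelled in user space roughly
as follows. This is a simplified sketch, not the kernel implementation:
times are plain nanosecond counts in a single clock domain (so there is no
monotonic-to-realtime conversion), the per-inode ctime update is a plain
store rather than a cmpxchg with retry, and the model_* names and the
explicit coarse/fine clock parameters are invented so that the behaviour
is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the global floor: the latest fine-grained time handed out. */
static _Atomic int64_t model_floor;

/* Hypothetical stand-in for the parts of an inode that matter here. */
struct model_file {
	int64_t ctime;	/* last stamped time */
	bool queried;	/* set when something like stat() has read the ctime */
};

/*
 * Pick a timestamp for a change to @f. @coarse and @fine are what the
 * coarse and fine-grained clocks would return right now (fine >= coarse).
 */
static int64_t model_stamp(struct model_file *f, int64_t coarse, int64_t fine)
{
	int64_t floor = atomic_load(&model_floor);
	int64_t now = coarse > floor ? coarse : floor;	/* later of the two */

	assert(fine >= coarse);

	/* Only go fine-grained if a watcher would otherwise miss the change. */
	if (f->queried && now <= f->ctime) {
		int64_t old = floor;

		if (atomic_compare_exchange_strong(&model_floor, &old, fine))
			now = fine;	/* we advanced the floor */
		else
			now = old;	/* someone else did; their value is just as good */
	}

	f->ctime = now;
	f->queried = false;	/* a fresh stamp has not been seen by anyone yet */
	return now;
}
```

In this model, a second file stamped in the same coarse tick right after
a fine-grained stamp picks up the floor value instead of the older coarse
time, so it can tie with the first file but never appear older.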

Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same floor
value as multigrain filesystems).

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 171 -
 fs/stat.c  |  36 ++-
 include/linux/fs.h |  34 ---
 3 files changed, 204 insertions(+), 37 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index f356fe2ec2b6..2b5889ff7b36 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -60,6 +60,13 @@ static unsigned int i_hash_shift __ro_after_init;
 static struct hlist_head *inode_hashtable __ro_after_init;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
+/*
+ * This represents the latest fine-grained time that we have handed out as a
+ * timestamp on the system. Tracked as a monotonic value, and converted to the
+ * realtime clock on an as-needed basis.
+ */
+static __cacheline_aligned_in_smp atomic64_t ctime_floor;
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2127,19 +2134,72 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);
 
+/**
+ * coarse_ctime - return the current coarse-grained time
+ * @floor: current (monotonic) ctime_floor value
+ *
+ * Get the coarse-grained time, and then determine whether to
+ * return it or the current floor value. Returns the later of the
+ * floor and coarse grained timestamps, converted to realtime
+ * clock value.
+ */
+static ktime_t coarse_ctime(ktime_t floor)
+{
+	ktime_t coarse = ktime_get_coarse();
+
+	/* If coarse time is already newer, return that */
+	if (!ktime_after(floor, coarse))
+		return ktime_get_coarse_real();
+	return ktime_mono_to_real(floor);
+}
+
+/**
+ * current_time - Return FS time (possibly fine-grained)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
+ * as having been QUERIED, get a fine-grained timestamp.
+ */
+struct timespec64 current_time(struct inode *inode)
+{
+   ktime_t floor = atomic64_read(&ctime_floor);
+   ktime_t now = coarse_ctime(floor);
+   struct timespec64 now_ts = ktime_to_timespec64(now);
+   u32 cns;
+
+   if (!is_mgtime(inode))
+   goto out;
+
+   /* If nothing has queried it, then coarse time is fine */
+   cns = smp_load

[PATCH v5 0/9] fs: multigrain timestamp redux

2024-07-11 Thread Jeff Layton
tl;dr for those who have been following along:

There are several changes in this version. The conversion of ctime to
be a ktime_t value has been dropped, and we now use an unused bit in
the nsec field as the QUERIED flag (like the earlier patchset did).

The floor value is now tracked as a monotonic clock value, and is
converted to a realtime value on an as-needed basis. This eliminates the
problem of trying to detect when the realtime clock jumps backward.

Longer patch description for those just joining in:

At LSF/MM this year, we had a discussion about the inode change
attribute. At the time I mentioned that I thought I could salvage the
multigrain timestamp work that had to be reverted last year [1].

That version had to be reverted because it was possible for a file to
get a coarse grained timestamp that appeared to be earlier than another
file that had recently gotten a fine-grained stamp.

This version corrects the problem by establishing a global ctime_floor
value that should prevent this from occurring. In the above
situation, the two files might end up with the same timestamp value, but
they won't appear to have been modified in the wrong order.

That problem was discovered by the test-stat-time gnulib test. Note that
that test still fails on multigrain timestamps, but that's because its
method of determining the minimum delay that will show a timestamp
change will no longer work with multigrain timestamps. I have a patch to
change the testcase to use a different method that is in the process of
being merged.

The testing I've done seems to show performance parity with multigrain
timestamps enabled vs. disabled, but it's hard to rule out a regression
in some workload.

This set is based on top of Christian's vfs.misc branch (which has the
earlier change to track inode timestamps as discrete integers). If there
are no major objections, I'd like to have this considered for v6.12,
after a nice long full-cycle soak in linux-next.

PS: I took a stab at a conversion for bcachefs too, but it's not
trivial. bcachefs handles timestamps backward from the way most
block-based filesystems do. Instead of updating them in struct inode and
eventually copying them to a disk-based representation, it does the
reverse and updates the timestamps in its in-core image of the on-disk
inode, and then copies that into struct inode. Either that will need to
be changed, or we'll need to come up with a different way to do this for
bcachefs.

[1]: 
https://lore.kernel.org/linux-fsdevel/20230807-mgctime-v7-0-d1dec143a...@kernel.org/

Signed-off-by: Jeff Layton 
---
Changes in v5:
- refetch coarse time in coarse_ctime if not returning floor
- timestamp_truncate before swapping new ctime value into place
- track floor value as atomic64_t
- cleanups to Documentation file
- Link to v4: 
https://lore.kernel.org/r/20240708-mgtime-v4-0-a0f3c6fb5...@kernel.org

Changes in v4:
- reordered tracepoint fields for better packing
- rework percpu counters again to also count fine grained timestamps
- switch to try_cmpxchg for better efficiency
- Link to v3: 
https://lore.kernel.org/r/20240705-mgtime-v3-0-85b2daa9b...@kernel.org

Changes in v3:
- Drop the conversion of i_ctime fields to ktime_t, and use an unused bit
  of the i_ctime_nsec field as QUERIED flag.
- Better tracepoints for tracking floor and ctime updates
- Reworked percpu counters to be more useful
- Track floor as monotonic value, which eliminates clock-jump problem

Changes in v2:
- Added Documentation file
- Link to v1: 
https://lore.kernel.org/r/20240626-mgtime-v1-0-a189352d0...@kernel.org

---
Jeff Layton (9):
  fs: add infrastructure for multigrain timestamps
  fs: tracepoints around multigrain timestamp events
  fs: add percpu counters for significant multigrain timestamp events
  fs: have setattr_copy handle multigrain timestamps appropriately
  Documentation: add a new file documenting multigrain timestamps
  xfs: switch to multigrain timestamps
  ext4: switch to multigrain timestamps
  btrfs: convert to multigrain timestamps
  tmpfs: add support for multigrain timestamps

 Documentation/filesystems/multigrain-ts.rst | 120 ++
 fs/attr.c   |  52 ++-
 fs/btrfs/file.c |  25 +--
 fs/btrfs/super.c|   3 +-
 fs/ext4/super.c |   2 +-
 fs/inode.c  | 234 
 fs/stat.c   |  39 -
 fs/xfs/libxfs/xfs_trans_inode.c |   6 +-
 fs/xfs/xfs_iops.c   |  10 +-
 fs/xfs/xfs_super.c  |   2 +-
 include/linux/fs.h  |  34 +++-
 include/trace/events/timestamp.h| 109 +
 mm/shmem.c  |   2 +-
 13 files changed, 560 insert

Re: [PATCH v4 6/9] xfs: switch to multigrain timestamps

2024-07-08 Thread Jeff Layton
On Mon, 2024-07-08 at 14:51 -0400, Jeff Layton wrote:
> On Mon, 2024-07-08 at 11:47 -0700, Darrick J. Wong wrote:
> > On Mon, Jul 08, 2024 at 11:53:39AM -0400, Jeff Layton wrote:
> > > Enable multigrain timestamps, which should ensure that there is an
> > > apparent change to the timestamp whenever it has been written after
> > > being actively observed via getattr.
> > > 
> > > Also, anytime the mtime changes, the ctime must also change, and those
> > > are now the only two options for xfs_trans_ichgtime. Have that function
> > > unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> > > always set.
> > > 
> > > Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
> > > should give us better semantics now.
> > > 
> > > Signed-off-by: Jeff Layton 
> > > ---
> > >  fs/xfs/libxfs/xfs_trans_inode.c |  6 +++---
> > >  fs/xfs/xfs_iops.c   | 10 +++---
> > >  fs/xfs/xfs_super.c  |  2 +-
> > >  3 files changed, 7 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c 
> > > b/fs/xfs/libxfs/xfs_trans_inode.c
> > > index 69fc5b981352..1f3639bbf5f0 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > > @@ -62,12 +62,12 @@ xfs_trans_ichgtime(
> > >   ASSERT(tp);
> > >   xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
> > >  
> > > - tv = current_time(inode);
> > > + /* If the mtime changes, then ctime must also change */
> > > + ASSERT(flags & XFS_ICHGTIME_CHG);
> > >  
> > > + tv = inode_set_ctime_current(inode);
> > >   if (flags & XFS_ICHGTIME_MOD)
> > >   inode_set_mtime_to_ts(inode, tv);
> > > - if (flags & XFS_ICHGTIME_CHG)
> > > - inode_set_ctime_to_ts(inode, tv);
> > >   if (flags & XFS_ICHGTIME_CREATE)
> > >   ip->i_crtime = tv;
> > >  }
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index a00dcbc77e12..d25872f818fa 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -592,8 +592,9 @@ xfs_vn_getattr(
> > >   stat->gid = vfsgid_into_kgid(vfsgid);
> > >   stat->ino = ip->i_ino;
> > >   stat->atime = inode_get_atime(inode);
> > > - stat->mtime = inode_get_mtime(inode);
> > > - stat->ctime = inode_get_ctime(inode);
> > > +
> > > + fill_mg_cmtime(stat, request_mask, inode);
> > 
> > Sooo... for setting up a commit-range operation[1], XFS_IOC_START_COMMIT
> > could populate its freshness data by calling:
> > 
> > struct kstat dummy;
> > 
> > fill_mg_ctime(&dummy, STATX_CTIME | STATX_MTIME, inode);
> > 
> > and then using dummy.[cm]time to populate the freshness data that it
> > gives to userspace, right?  Having set QUERIED, a write to the file
> > immediately afterwards will cause a (tiny) increase in ctime_nsec which
> > will cause the XFS_IOC_COMMIT_RANGE to reject the commit[2].  Right?
> > 
> 
> Yes. Once you call fill_mg_ctime, the first write after that point
> should cause the kernel to ensure that there is a distinct change in
> the ctime.
> 
> IOW, I think this should alleviate the concerns I had before with using
> timestamps with the XFS_IOC_COMMIT_RANGE interface.
> 
> 

Oh, and to be clear, if you're _only_ worried about changes to the
contents of the file (and not the metadata), you should be able to do
this instead:

fill_mg_cmtime(&dummy, STATX_MTIME, inode);

...and that should avoid false positives from metadata-only changes.

Querying only the mtime still causes the QUERIED flag to be set, and
the kernel to give you distinct timestamps.
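
As a rough illustration of that handshake, here is a user-space model
using C11 atomics. It is a sketch under assumptions, not the kernel code:
the flag is taken to be the top bit of a 32-bit nanoseconds field (as the
patch description says), and the model_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: the QUERIED flag lives in the top bit of the nsec field. */
#define MODEL_QUERIED (UINT32_C(1) << 31)

/* Hypothetical stand-in for the inode's ctime nanoseconds field. */
struct model_inode {
	_Atomic uint32_t ctime_nsec;
};

/* Read the nsec value and mark it as queried, the way a getattr would. */
static uint32_t model_query_nsec(struct model_inode *inode)
{
	uint32_t prev = atomic_fetch_or(&inode->ctime_nsec, MODEL_QUERIED);

	return prev & ~MODEL_QUERIED;	/* never leak the flag to the caller */
}

/* Has anyone looked at the current timestamp since it was last set? */
static int model_was_queried(struct model_inode *inode)
{
	return (atomic_load(&inode->ctime_nsec) & MODEL_QUERIED) != 0;
}

/* Store a new nsec value; a fresh stamp starts out unqueried. */
static void model_set_nsec(struct model_inode *inode, uint32_t nsec)
{
	assert(nsec < 1000000000u);	/* valid nanoseconds never reach bit 31 */
	atomic_store(&inode->ctime_nsec, nsec);
}
```

A valid nanoseconds value is always below 10^9, which is less than 2^30,
so the top bit is free to carry the flag without changing what callers
see.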

> > --D
> > 
> > [1] https://lore.kernel.org/linux-xfs/20240227174649.GL6184@frogsfrogsfrogs/
> > [2] 
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=atomic-file-commits&id=0520d89c2698874c1f56ddf52ec4b8a3595baa14
> > 
> > > +
> > >   stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
> > >  
> > >   if (xfs_has_v3inodes(mp)) {
> > > @@ -603,11 +604,6 @@ xfs_vn_getattr(
> > >   }
> > >   }
> > >  
> > > - if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> > > - stat->change_cookie = inode_query_iversion(inode);
> > > - stat->result_mask |= STATX_CHANGE_COOKIE;
> > > - }
> > > -
> > >   /*
> > >* Note: If you add another clause to set an attribute flag, please
> > >* update attributes_mask below.
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 27e9f749c4c7..210481b03fdb 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
> > >   .init_fs_context= xfs_init_fs_context,
> > >   .parameters = xfs_fs_parameters,
> > >   .kill_sb= xfs_kill_sb,
> > > - .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
> > > + .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
> > >  };
> > >  MODULE_ALIAS_FS("xfs");
> > >  
> > > 
> > > -- 
> > > 2.45.2
> > > 
> > > 
> 

-- 
Jeff Layton 



Re: [PATCH v4 6/9] xfs: switch to multigrain timestamps

2024-07-08 Thread Jeff Layton
On Mon, 2024-07-08 at 11:47 -0700, Darrick J. Wong wrote:
> On Mon, Jul 08, 2024 at 11:53:39AM -0400, Jeff Layton wrote:
> > Enable multigrain timestamps, which should ensure that there is an
> > apparent change to the timestamp whenever it has been written after
> > being actively observed via getattr.
> > 
> > Also, anytime the mtime changes, the ctime must also change, and those
> > are now the only two options for xfs_trans_ichgtime. Have that function
> > unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> > always set.
> > 
> > Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
> > should give us better semantics now.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/xfs/libxfs/xfs_trans_inode.c |  6 +++---
> >  fs/xfs/xfs_iops.c   | 10 +++---
> >  fs/xfs/xfs_super.c  |  2 +-
> >  3 files changed, 7 insertions(+), 11 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c 
> > b/fs/xfs/libxfs/xfs_trans_inode.c
> > index 69fc5b981352..1f3639bbf5f0 100644
> > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > @@ -62,12 +62,12 @@ xfs_trans_ichgtime(
> > ASSERT(tp);
> > xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
> >  
> > -   tv = current_time(inode);
> > +   /* If the mtime changes, then ctime must also change */
> > +   ASSERT(flags & XFS_ICHGTIME_CHG);
> >  
> > +   tv = inode_set_ctime_current(inode);
> > if (flags & XFS_ICHGTIME_MOD)
> > inode_set_mtime_to_ts(inode, tv);
> > -   if (flags & XFS_ICHGTIME_CHG)
> > -   inode_set_ctime_to_ts(inode, tv);
> > if (flags & XFS_ICHGTIME_CREATE)
> > ip->i_crtime = tv;
> >  }
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index a00dcbc77e12..d25872f818fa 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -592,8 +592,9 @@ xfs_vn_getattr(
> > stat->gid = vfsgid_into_kgid(vfsgid);
> > stat->ino = ip->i_ino;
> > stat->atime = inode_get_atime(inode);
> > -   stat->mtime = inode_get_mtime(inode);
> > -   stat->ctime = inode_get_ctime(inode);
> > +
> > +   fill_mg_cmtime(stat, request_mask, inode);
> 
> Sooo... for setting up a commit-range operation[1], XFS_IOC_START_COMMIT
> could populate its freshness data by calling:
> 
>   struct kstat dummy;
> 
>   fill_mg_ctime(&dummy, STATX_CTIME | STATX_MTIME, inode);
> 
> and then using dummy.[cm]time to populate the freshness data that it
> gives to userspace, right?  Having set QUERIED, a write to the file
> immediately afterwards will cause a (tiny) increase in ctime_nsec which
> will cause the XFS_IOC_COMMIT_RANGE to reject the commit[2].  Right?
> 

Yes. Once you call fill_mg_cmtime, the first write after that point
should cause the kernel to ensure that there is a distinct change in
the ctime.

IOW, I think this should alleviate the concerns I had before with using
timestamps with the XFS_IOC_COMMIT_RANGE interface.


> --D
> 
> [1] https://lore.kernel.org/linux-xfs/20240227174649.GL6184@frogsfrogsfrogs/
> [2] 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=atomic-file-commits&id=0520d89c2698874c1f56ddf52ec4b8a3595baa14
> 
> > +
> > stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
> >  
> > if (xfs_has_v3inodes(mp)) {
> > @@ -603,11 +604,6 @@ xfs_vn_getattr(
> > }
> > }
> >  
> > -   if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> > -   stat->change_cookie = inode_query_iversion(inode);
> > -   stat->result_mask |= STATX_CHANGE_COOKIE;
> > -   }
> > -
> > /*
> >  * Note: If you add another clause to set an attribute flag, please
> >  * update attributes_mask below.
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 27e9f749c4c7..210481b03fdb 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
> > .init_fs_context= xfs_init_fs_context,
> > .parameters = xfs_fs_parameters,
> > .kill_sb= xfs_kill_sb,
> > -   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
> > +   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
> >  };
> >  MODULE_ALIAS_FS("xfs");
> >  
> > 
> > -- 
> > 2.45.2
> > 
> > 

-- 
Jeff Layton 



Re: [PATCH v4 1/9] fs: add infrastructure for multigrain timestamps

2024-07-08 Thread Jeff Layton
On Mon, 2024-07-08 at 11:39 -0700, Darrick J. Wong wrote:
> On Mon, Jul 08, 2024 at 11:53:34AM -0400, Jeff Layton wrote:
> > The VFS has always used coarse-grained timestamps when updating the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away a lot of metadata updates, down to around 1
> > per jiffy, even when a file is under heavy writes.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. A lot of changes
> > can happen in a jiffy, so timestamps aren't sufficient to help the
> > client decide when to invalidate the cache. Even with NFSv4, a lot of
> > exported filesystems don't properly support a change attribute and are
> > subject to the same problems with timestamp granularity. Other
> > applications have similar issues with timestamps (e.g. backup
> > applications).
> > 
> > If we were to always use fine-grained timestamps, that would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem would have to log a lot more metadata updates.
> > 
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
> > as a flag that indicates whether the current timestamps have been
> > queried via stat() or the like. When it's set, we allow the kernel to
> > use a fine-grained timestamp iff it's necessary to make the ctime show
> > a different value.
> 
> I appreciate the v3->v4 change that we hide the QUERIED flag in the
> upper bit of the ctime nanoseconds, instead of giving up all support
> for post-2262 timestamps.  Thank you. :)
> 

Yeah, it was a nice idea, but there are too many unknowns with doing it
that way. This should work just as well.

> > This solves the problem of being able to distinguish the timestamp
> > between updates, but introduces a new problem: it's now possible for a
> > file being changed to get a fine-grained timestamp. A file that is
> > altered just a bit later can then get a coarse-grained one that appears
> > older than the earlier fine-grained time. This violates timestamp
> > ordering guarantees.
> > 
> > To remedy this, keep a global monotonic ktime_t value that acts as a
> > timestamp floor.  When we go to stamp a file, we first get the later of
> > the current floor value and the current coarse-grained time. If the
> > inode ctime hasn't been queried then we just attempt to stamp it with
> > that value.
> > 
> > If it has been queried, then first see whether the current coarse time
> > is later than the existing ctime. If it is, then we accept that value.
> > If it isn't, then we get a fine-grained time and try to swap that into
> > the global floor. Whether that succeeds or fails, we take the resulting
> > floor time, convert it to realtime and try to swap that into the ctime.
> 
> Makes sense to me.  One question, though -- mgtime filesystems that want
> to persist a ctime to disk are going to have to do something like this,
> right?
> 
> di_ctime_ns = cpu_to_be32(atomic_read(&inode->i_ctime_nsec) &
> ~I_CTIME_QUERIED);
> 
> IOWs, they need to mask off the QUERIED flag (aka bit 31) so that they
> never store a strange looking nanoseconds value.  Probably they should
> already be doing this, but I wouldn't trust them already to be clamping
> the nsec value.
> 
> I'm mostly thinking of xfs_inode_to_disk, which currently calls
> inode_get_ctime() but doesn't clamp nsec at all before writing it to
> disk.  Does that need to mask off I_CTIME_QUERIED explicitly?
> 

The accessors should already take care of that. For instance:

static inline long inode_get_ctime_nsec(const struct inode *inode)
{
	return inode->i_ctime_nsec & ~I_CTIME_QUERIED;
}

As long as the fs isn't touching i_ctime_nsec directly, you shouldn't
need to worry about this.
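
To make the masking concrete, here is a minimal user-space sketch (not
kernel code). It assumes the flag is bit 31 of the nsec word, as noted
above; the macro and helper names are hypothetical and only mirror what
the accessor does:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: QUERIED lives in bit 31 of the 32-bit nsec word. */
#define MODEL_CTIME_QUERIED (UINT32_C(1) << 31)

/* Hypothetical helper: the nanoseconds value with the flag stripped. */
static inline uint32_t model_ctime_nsec(uint32_t raw)
{
	return raw & ~MODEL_CTIME_QUERIED;
}

/* Hypothetical helper: has the ctime been observed since the last update? */
static inline bool model_ctime_queried(uint32_t raw)
{
	return (raw & MODEL_CTIME_QUERIED) != 0;
}
```

A valid nsec value is always below 1,000,000,000, which fits in 30 bits,
so bit 31 never collides with a real nanoseconds value.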

> > We take the result of the ctime swap whether it succeeds or fails, since
> > either is just as valid.
> > 
> > Filesystems can opt into this by setting the FS_MGTIME fstype flag.
> > Others should be unaffected (other than being subject to the same floor
> > value as multigrain filesystems).
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/inode.c | 171 
> > -
> >  fs/stat.c  |  36 ++-
> >  include/linux/fs.h |  34 ---
> >  3 files changed, 204 insertions(+), 37 deletions(-)
> > 

[PATCH v4 9/9] tmpfs: add support for multigrain timestamps

2024-07-08 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

tmpfs only requires the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 440e2a9d8726..6dc817064140 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4649,7 +4649,7 @@ static struct file_system_type shmem_fs_type = {
.parameters = shmem_fs_parameters,
 #endif
.kill_sb= kill_litter_super,
-   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 
 void __init shmem_init(void)

-- 
2.45.2




[PATCH v4 8/9] btrfs: convert to multigrain timestamps

2024-07-08 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have it overwrite the timestamps unconditionally.

Note that this also drops the IS_I_VERSION check and unconditionally
bumps the change attribute, since SB_I_VERSION is always set on btrfs.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/super.c |  3 ++-
 2 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d90138683a0a..409628c0c3cc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1120,26 +1120,6 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
btrfs_drew_write_unlock(&inode->root->snapshot_lock);
 }
 
-static void update_time_for_write(struct inode *inode)
-{
-   struct timespec64 now, ts;
-
-   if (IS_NOCMTIME(inode))
-   return;
-
-   now = current_time(inode);
-   ts = inode_get_mtime(inode);
-   if (!timespec64_equal(&ts, &now))
-   inode_set_mtime_to_ts(inode, now);
-
-   ts = inode_get_ctime(inode);
-   if (!timespec64_equal(&ts, &now))
-   inode_set_ctime_to_ts(inode, now);
-
-   if (IS_I_VERSION(inode))
-   inode_inc_iversion(inode);
-}
-
 static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 size_t count)
 {
@@ -1171,7 +1151,10 @@ static int btrfs_write_check(struct kiocb *iocb, struct 
iov_iter *from,
 * need to start yet another transaction to update the inode as we will
 * update the inode when we finish writing whatever data we write.
 */
-   update_time_for_write(inode);
+   if (!IS_NOCMTIME(inode)) {
+   inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+   inode_inc_iversion(inode);
+   }
 
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f05cce7c8b8d..1cd50293b98d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2173,7 +2173,8 @@ static struct file_system_type btrfs_fs_type = {
.init_fs_context= btrfs_init_fs_context,
.parameters = btrfs_fs_parameters,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | 
FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+ FS_ALLOW_IDMAP | FS_MGTIME,
  };
 
 MODULE_ALIAS_FS("btrfs");

-- 
2.45.2




[PATCH v4 7/9] ext4: switch to multigrain timestamps

2024-07-08 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

For ext4, we only need to enable the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index eb899628e121..95d4d7c0957a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -7294,7 +7294,7 @@ static struct file_system_type ext4_fs_type = {
.init_fs_context= ext4_init_fs_context,
.parameters = ext4_param_specs,
.kill_sb= ext4_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("ext4");
 

-- 
2.45.2




[PATCH v4 6/9] xfs: switch to multigrain timestamps

2024-07-08 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.

Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
should give us better semantics now.

Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_trans_inode.c |  6 +++---
 fs/xfs/xfs_iops.c   | 10 +++---
 fs/xfs/xfs_super.c  |  2 +-
 3 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 69fc5b981352..1f3639bbf5f0 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -62,12 +62,12 @@ xfs_trans_ichgtime(
ASSERT(tp);
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
 
-   tv = current_time(inode);
+   /* If the mtime changes, then ctime must also change */
+   ASSERT(flags & XFS_ICHGTIME_CHG);
 
+   tv = inode_set_ctime_current(inode);
if (flags & XFS_ICHGTIME_MOD)
inode_set_mtime_to_ts(inode, tv);
-   if (flags & XFS_ICHGTIME_CHG)
-   inode_set_ctime_to_ts(inode, tv);
if (flags & XFS_ICHGTIME_CREATE)
ip->i_crtime = tv;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a00dcbc77e12..d25872f818fa 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -592,8 +592,9 @@ xfs_vn_getattr(
stat->gid = vfsgid_into_kgid(vfsgid);
stat->ino = ip->i_ino;
stat->atime = inode_get_atime(inode);
-   stat->mtime = inode_get_mtime(inode);
-   stat->ctime = inode_get_ctime(inode);
+
+   fill_mg_cmtime(stat, request_mask, inode);
+
stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
 
if (xfs_has_v3inodes(mp)) {
@@ -603,11 +604,6 @@ xfs_vn_getattr(
}
}
 
-   if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
-   stat->change_cookie = inode_query_iversion(inode);
-   stat->result_mask |= STATX_CHANGE_COOKIE;
-   }
-
/*
 * Note: If you add another clause to set an attribute flag, please
 * update attributes_mask below.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..210481b03fdb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
.init_fs_context= xfs_init_fs_context,
.parameters = xfs_fs_parameters,
.kill_sb= xfs_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("xfs");
 

-- 
2.45.2




[PATCH v4 5/9] Documentation: add a new file documenting multigrain timestamps

2024-07-08 Thread Jeff Layton
Add a high-level document that describes how multigrain timestamps work,
rationale for them, and some info about implementation and tradeoffs.

Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/multigrain-ts.rst | 120 
 1 file changed, 120 insertions(+)

diff --git a/Documentation/filesystems/multigrain-ts.rst 
b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index ..e4f52a9e3c51
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+============
+Historically, the kernel has always used coarse time values to stamp
+inodes. This value is updated on every jiffy, so any change that happens
+within that jiffy will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first gets
+the current time and then compares it to the existing timestamp(s) to see
+whether anything will change. If nothing changed, then it can avoid updating
+the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since they
+reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen in a
+jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps can
+make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, causing the NFS server to substitute
+the ctime in many cases.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the current
+coarse-grained time does not cause a change.
+
+Inode Timestamps
+================
+There are currently 3 timestamps in the inode that are updated to the current
+wallclock time on different activity:
+
+ctime:
+  The inode change time. This is stamped with the current time whenever
+  the inode's metadata is changed. Note that this value is not settable
+  from userland.
+
+mtime:
+  The inode modification time. This is stamped with the current time
+  any time a file's contents change.
+
+atime:
+  The inode access time. This is stamped whenever an inode's contents are
+  read. Widely considered to be a terrible mistake. Usually avoided with
+  options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
+not affected and always use the coarse-grained value (subject to the floor).
+
+Inode Timestamp Ordering
+========================
+
+In addition to just providing info about changes to individual files, file
+timestamps also serve an important purpose in applications like "make". These
+programs measure timestamps in order to determine whether source files might be
+newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first. The
+same is true if two threads are issuing similar operations that do not overlap
+in time.
+
+If however, two threads have racing syscalls that overlap in time, then there
+is no such guarantee, and the second file may appear to have been modified
+before, after or at the same time as the first, regardless of which one was
+submitted first.
+
+Multigrain Timestamps
+=====================
+Multigrain timestamps are aimed at ensuring that changes to a single file are
+always recognizable, without violating the ordering guarantees when multiple
+different files are modified. This affects the mtime and the ctime, but the
+atime will always use coarse-grained timestamps.
+
+It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
+or ctime has been queried. If either or both have, then the kernel takes
+special care to ensure the next timestamp update will display a visible change.
+This ensures tight cache coherency for use-cases like NFS, without sacrificing
+the benefits of reduced metadata updates when files aren't being watched.
+
+The Ctime Floor Value
+=====================
+It's not sufficient to simply use fine or coarse-grained timestamps based on
+whether the mtime or ctime has been queried. A file could get a fine grained
+timestamp, and 

[PATCH v4 4/9] fs: have setattr_copy handle multigrain timestamps appropriately

2024-07-08 Thread Jeff Layton
The setattr codepath is still using coarse-grained timestamps, even on
multigrain filesystems. To fix this, we need to fetch the timestamp for
ctime updates later, at the point where the assignment occurs in
setattr_copy.

On a multigrain inode, ignore the ia_ctime in the attrs, and always
update the ctime to the current clock value. Update the atime and mtime
with the same value (if needed) unless they are being set to other
specific values, à la utimes().

Note that we don't want to do this universally however, as some
filesystems (e.g. most networked fs) want to do an explicit update
elsewhere before updating the local inode.
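
As a rough illustration of the rule above, here is a user-space model
of the timestamp selection. The flag bits and struct names are
hypothetical stand-ins for the kernel's ATTR_* values and struct iattr,
and "now" stands for whatever inode_set_ctime_current() would return:

```c
#include <stdint.h>

/* Hypothetical flag bits, standing in for the kernel's ATTR_* values. */
enum {
	M_ATIME     = 1 << 0,
	M_MTIME     = 1 << 1,
	M_CTIME     = 1 << 2,
	M_ATIME_SET = 1 << 3,
	M_MTIME_SET = 1 << 4,
};

struct model_times { int64_t atime, mtime, ctime; };
struct model_attr  { unsigned int valid; int64_t atime, mtime, ctime; };

static void model_setattr_times(struct model_times *t,
				const struct model_attr *attr, int64_t now)
{
	/* No ctime update requested: leave every timestamp alone. */
	if (!(attr->valid & M_CTIME))
		return;

	/* The ctime always takes the current clock; attr->ctime is ignored. */
	t->ctime = now;

	/* Explicit values win; otherwise reuse the ctime's stamp. */
	if (attr->valid & M_ATIME_SET)
		t->atime = attr->atime;
	else if (attr->valid & M_ATIME)
		t->atime = now;

	if (attr->valid & M_MTIME_SET)
		t->mtime = attr->mtime;
	else if (attr->valid & M_MTIME)
		t->mtime = now;
}
```

The point of reusing "now" is that the mtime and ctime written by one
setattr call cannot disagree unless the caller asked for a specific
mtime.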

Signed-off-by: Jeff Layton 
---
 fs/attr.c | 52 ++--
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 825007d5cda4..e03ea6951864 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -271,6 +271,42 @@ int inode_newsize_ok(const struct inode *inode, loff_t 
offset)
 }
 EXPORT_SYMBOL(inode_newsize_ok);
 
+/**
+ * setattr_copy_mgtime - update timestamps for mgtime inodes
+ * @inode: inode timestamps to be updated
+ * @attr: attrs for the update
+ *
+ * With multigrain timestamps, we need to take more care to prevent races
+ * when updating the ctime. Always update the ctime to the very latest
+ * using the standard mechanism, and use that to populate the atime and
+ * mtime appropriately (unless we're setting those to specific values).
+ */
+static void setattr_copy_mgtime(struct inode *inode, const struct iattr *attr)
+{
+   unsigned int ia_valid = attr->ia_valid;
+   struct timespec64 now;
+
+   /*
+* If the ctime isn't being updated then nothing else should be
+* either.
+*/
+   if (!(ia_valid & ATTR_CTIME)) {
+   WARN_ON_ONCE(ia_valid & (ATTR_ATIME|ATTR_MTIME));
+   return;
+   }
+
+   now = inode_set_ctime_current(inode);
+   if (ia_valid & ATTR_ATIME_SET)
+   inode_set_atime_to_ts(inode, attr->ia_atime);
+   else if (ia_valid & ATTR_ATIME)
+   inode_set_atime_to_ts(inode, now);
+
+   if (ia_valid & ATTR_MTIME_SET)
+   inode_set_mtime_to_ts(inode, attr->ia_mtime);
+   else if (ia_valid & ATTR_MTIME)
+   inode_set_mtime_to_ts(inode, now);
+}
+
 /**
  * setattr_copy - copy simple metadata updates into the generic inode
  * @idmap: idmap of the mount the inode was found from
@@ -303,12 +339,6 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode 
*inode,
 
i_uid_update(idmap, attr, inode);
i_gid_update(idmap, attr, inode);
-   if (ia_valid & ATTR_ATIME)
-   inode_set_atime_to_ts(inode, attr->ia_atime);
-   if (ia_valid & ATTR_MTIME)
-   inode_set_mtime_to_ts(inode, attr->ia_mtime);
-   if (ia_valid & ATTR_CTIME)
-   inode_set_ctime_to_ts(inode, attr->ia_ctime);
if (ia_valid & ATTR_MODE) {
umode_t mode = attr->ia_mode;
if (!in_group_or_capable(idmap, inode,
@@ -316,6 +346,16 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode 
*inode,
mode &= ~S_ISGID;
inode->i_mode = mode;
}
+
+   if (is_mgtime(inode))
+   return setattr_copy_mgtime(inode, attr);
+
+   if (ia_valid & ATTR_ATIME)
+   inode_set_atime_to_ts(inode, attr->ia_atime);
+   if (ia_valid & ATTR_MTIME)
+   inode_set_mtime_to_ts(inode, attr->ia_mtime);
+   if (ia_valid & ATTR_CTIME)
+   inode_set_ctime_to_ts(inode, attr->ia_ctime);
 }
 EXPORT_SYMBOL(setattr_copy);
 

-- 
2.45.2




[PATCH v4 3/9] fs: add percpu counters for significant multigrain timestamp events

2024-07-08 Thread Jeff Layton
Add four percpu counters for counting various stats around mgtimes, and
a new debugfs file for displaying them:

- number of attempted ctime updates
- number of successful i_ctime_nsec swaps
- number of fine-grained timestamp fetches
- number of floor value swaps
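
For anyone wanting to consume the file from a test harness, a minimal
user-space parser might look like the sketch below. It assumes the
single-line, four-field output that mgts_show() prints, in the order
listed above; the struct and function names are hypothetical, and the
usual /sys/kernel/debug mount point for debugfs is an assumption about
the system rather than something this patch controls:

```c
#include <stdio.h>

/* Hypothetical container for the four counters, in output order. */
struct mgts_stats {
	unsigned long long ctime_updates;
	unsigned long long ctime_swaps;
	unsigned long long fine_stamps;
	unsigned long long floor_swaps;
};

/* Parse one line of the debugfs file. Returns 0 on success, -1 otherwise. */
static int mgts_parse(const char *line, struct mgts_stats *st)
{
	int n = sscanf(line, "%llu %llu %llu %llu",
		       &st->ctime_updates, &st->ctime_swaps,
		       &st->fine_stamps, &st->floor_swaps);

	return n == 4 ? 0 : -1;
}
```

Comparing fine_stamps against ctime_updates gives a rough idea of how
often a workload actually needs a fine-grained stamp.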

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 60 +++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index b2ff309a400a..9b93d0a47e55 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #define CREATE_TRACE_POINTS
 #include 
@@ -70,6 +72,11 @@ static __cacheline_aligned_in_smp 
DEFINE_SPINLOCK(inode_hash_lock);
  */
 static __cacheline_aligned_in_smp ktime_t ctime_floor;
 
+static struct percpu_counter mg_ctime_updates;
+static struct percpu_counter mg_floor_swaps;
+static struct percpu_counter mg_ctime_swaps;
+static struct percpu_counter mg_fine_stamps;
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2654,6 +2661,7 @@ struct timespec64 inode_set_ctime_current(struct inode 
*inode)
 
/* Get a fine-grained time */
fine = ktime_get();
+   percpu_counter_inc(&mg_fine_stamps);
 
/*
 * If the cmpxchg works, we take the new floor value. If
@@ -2662,11 +2670,14 @@ struct timespec64 inode_set_ctime_current(struct inode 
*inode)
 * as good, so keep it.
 */
old = floor;
-   if (!try_cmpxchg(&ctime_floor, &old, fine))
+   if (try_cmpxchg(&ctime_floor, &old, fine))
+   percpu_counter_inc(&mg_floor_swaps);
+   else
fine = old;
now = ktime_mono_to_real(fine);
}
}
+   percpu_counter_inc(&mg_ctime_updates);
now_ts = ktime_to_timespec64(now);
cur = cns;
 retry:
@@ -2675,6 +2686,7 @@ struct timespec64 inode_set_ctime_current(struct inode 
*inode)
/* If swap occurred, then we're (mostly) done */
inode->i_ctime_sec = now_ts.tv_sec;
trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec, cur);
+   percpu_counter_inc(&mg_ctime_swaps);
} else {
/*
 * Was the change due to someone marking the old ctime QUERIED?
@@ -2744,3 +2756,49 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap,
return mode & ~S_ISGID;
 }
 EXPORT_SYMBOL(mode_strip_sgid);
+
+static int mgts_show(struct seq_file *s, void *p)
+{
+   u64 ctime_updates = percpu_counter_sum(&mg_ctime_updates);
+   u64 ctime_swaps = percpu_counter_sum(&mg_ctime_swaps);
+   u64 fine_stamps = percpu_counter_sum(&mg_fine_stamps);
+   u64 floor_swaps = percpu_counter_sum(&mg_floor_swaps);
+
+   seq_printf(s, "%llu %llu %llu %llu\n",
+  ctime_updates, ctime_swaps, fine_stamps, floor_swaps);
+   return 0;
+}
+
+DEFINE_SHOW_ATTRIBUTE(mgts);
+
+static int __init mg_debugfs_init(void)
+{
+   int ret = percpu_counter_init(&mg_ctime_updates, 0, GFP_KERNEL);
+
+   if (ret)
+   return ret;
+
+   ret = percpu_counter_init(&mg_floor_swaps, 0, GFP_KERNEL);
+   if (ret) {
+   percpu_counter_destroy(&mg_ctime_updates);
+   return ret;
+   }
+
+   ret = percpu_counter_init(&mg_ctime_swaps, 0, GFP_KERNEL);
+   if (ret) {
+   percpu_counter_destroy(&mg_floor_swaps);
+   percpu_counter_destroy(&mg_ctime_updates);
+   return ret;
+   }
+
+   ret = percpu_counter_init(&mg_fine_stamps, 0, GFP_KERNEL);
+   if (ret) {
+   percpu_counter_destroy(&mg_floor_swaps);
+   percpu_counter_destroy(&mg_ctime_updates);
+   percpu_counter_destroy(&mg_ctime_swaps);
+   return ret;
+   }
+   debugfs_create_file("multigrain_timestamps", S_IFREG | S_IRUGO, NULL, 
NULL, &mgts_fops);
+   return 0;
+}
+late_initcall(mg_debugfs_init);

-- 
2.45.2




[PATCH v4 2/9] fs: tracepoints around multigrain timestamp events

2024-07-08 Thread Jeff Layton
Add some tracepoints around various multigrain timestamp events.

Signed-off-by: Jeff Layton 
---
 fs/inode.c   |   5 ++
 fs/stat.c|   3 ++
 include/trace/events/timestamp.h | 109 +++
 3 files changed, 117 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 10ed1d3d9b52..b2ff309a400a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,9 @@
 #include 
 #include 
 #include 
+#define CREATE_TRACE_POINTS
+#include 
+
 #include "internal.h"
 
 /*
@@ -2571,6 +2574,7 @@ struct timespec64 inode_set_ctime_to_ts(struct inode 
*inode, struct timespec64 t
 {
inode->i_ctime_sec = ts.tv_sec;
inode->i_ctime_nsec = ts.tv_nsec & ~I_CTIME_QUERIED;
+   trace_inode_set_ctime_to_ts(inode, &ts);
return ts;
 }
 EXPORT_SYMBOL(inode_set_ctime_to_ts);
@@ -2670,6 +2674,7 @@ struct timespec64 inode_set_ctime_current(struct inode 
*inode)
if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now_ts.tv_nsec)) {
/* If swap occurred, then we're (mostly) done */
inode->i_ctime_sec = now_ts.tv_sec;
+   trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec, cur);
} else {
/*
 * Was the change due to someone marking the old ctime QUERIED?
diff --git a/fs/stat.c b/fs/stat.c
index df7fdd3afed9..552dfd67688b 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 
+#include 
+
 #include "internal.h"
 #include "mount.h"
 
@@ -49,6 +51,7 @@ void fill_mg_cmtime(struct kstat *stat, u32 request_mask, 
struct inode *inode)
stat->mtime = inode_get_mtime(inode);
stat->ctime.tv_sec = inode->i_ctime_sec;
stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn)) & 
~I_CTIME_QUERIED;
+   trace_fill_mg_cmtime(inode, &stat->ctime, &stat->mtime);
 }
 EXPORT_SYMBOL(fill_mg_cmtime);
 
diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
new file mode 100644
index ..3a603190b46c
--- /dev/null
+++ b/include/trace/events/timestamp.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM timestamp
+
+#if !defined(_TRACE_TIMESTAMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TIMESTAMP_H
+
+#include 
+#include 
+
+TRACE_EVENT(inode_set_ctime_to_ts,
+   TP_PROTO(struct inode *inode,
+struct timespec64 *ctime),
+
+   TP_ARGS(inode, ctime),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(time64_t,   ctime_s)
+   __field(u32,ctime_ns)
+   __field(u32,gen)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->gen= inode->i_generation;
+   __entry->ctime_s= ctime->tv_sec;
+   __entry->ctime_ns   = ctime->tv_nsec;
+   ),
+
+   TP_printk("ino=%d:%d:%ld:%u ctime=%lld.%u",
+   MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, 
__entry->gen,
+   __entry->ctime_s, __entry->ctime_ns
+   )
+);
+
+TRACE_EVENT(ctime_ns_xchg,
+   TP_PROTO(struct inode *inode,
+u32 old,
+u32 new,
+u32 cur),
+
+   TP_ARGS(inode, old, new, cur),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(u32,gen)
+   __field(u32,old)
+   __field(u32,new)
+   __field(u32,cur)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->gen= inode->i_generation;
+   __entry->old= old;
+   __entry->new= new;
+   __entry->cur= cur;
+   ),
+
+   TP_printk("ino=%d:%d:%ld:%u old=%u:%c new=%u cur=%u:%c",
+   MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, 
__entry->gen,
+   __entry->old & ~I_CTIME_QUERIED, __entry->old & I_CTIME_QUERIED 
? 'Q' : '-',
+   __entry->new,
+   __entry->cur & ~I_CTIME_QUERIED, __entry->cur & I_CTIME_QUERIED 
? 'Q' : '-'
+   )
+);
+
+TRACE_EVENT(fill_mg_cmtime

[PATCH v4 1/9] fs: add infrastructure for multigrain timestamps

2024-07-08 Thread Jeff Layton
The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot of metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide when to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g. backup
applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
as a flag that indicates whether the current timestamps have been
queried via stat() or the like. When it's set, we allow the kernel to
use a fine-grained timestamp iff it's necessary to make the ctime show
a different value.

This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible for a
file being changed to get a fine-grained timestamp. A file that is
altered just a bit later can then get a coarse-grained one that appears
older than the earlier fine-grained time. This violates timestamp
ordering guarantees.

To remedy this, keep a global monotonic ktime_t value that acts as a
timestamp floor.  When we go to stamp a file, we first get the later of
the current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it with
that value.

If it has been queried, then first see whether the current coarse time
is later than the existing ctime. If it is, then we accept that value.
If it isn't, then we get a fine-grained time and try to swap that into
the global floor. Whether that succeeds or fails, we take the resulting
floor time, convert it to realtime and try to swap that into the ctime.

We take the result of the ctime swap whether it succeeds or fails, since
either is just as valid.
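
The steps above can be sketched as a single-threaded user-space model.
This is only an illustration: the names are hypothetical, times are
plain nanosecond counts, the clock readings are passed in so the result
is deterministic, and the cmpxchg races and the monotonic-to-realtime
conversion are left out:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-inode state. */
struct model_inode {
	int64_t ctime;	/* last stamped ctime */
	bool queried;	/* set once stat() or similar has observed the ctime */
};

/*
 * Stamp @ino. @floor_ns is the shared floor; @coarse and @fine are the
 * clock readings the kernel would take itself. Returns the new ctime.
 */
static int64_t model_stamp(struct model_inode *ino, int64_t *floor_ns,
			   int64_t coarse, int64_t fine)
{
	/* Start from the later of the floor and the coarse clock. */
	int64_t now = (*floor_ns > coarse) ? *floor_ns : coarse;

	/* Only a queried ctime that would not visibly change needs more. */
	if (ino->queried && now <= ino->ctime) {
		/* Advance the floor to the fine-grained time if it is newer. */
		if (fine > *floor_ns)
			*floor_ns = fine;
		now = *floor_ns;
	}

	ino->ctime = now;
	ino->queried = false;	/* a fresh stamp has not been observed yet */
	return now;
}
```

In this model, once one inode has been given a fine-grained stamp, any
inode stamped afterwards gets at least that value, so a later change
can never appear older than an earlier one.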

Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same floor
value as multigrain filesystems).

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 171 -
 fs/stat.c  |  36 ++-
 include/linux/fs.h |  34 ---
 3 files changed, 204 insertions(+), 37 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index f356fe2ec2b6..10ed1d3d9b52 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -60,6 +60,13 @@ static unsigned int i_hash_shift __ro_after_init;
 static struct hlist_head *inode_hashtable __ro_after_init;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
+/*
+ * This represents the latest fine-grained time that we have handed out as a
+ * timestamp on the system. Tracked as a monotonic value, and converted to the
+ * realtime clock on an as-needed basis.
+ */
+static __cacheline_aligned_in_smp ktime_t ctime_floor;
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2127,19 +2134,72 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);
 
+/**
+ * coarse_ctime - return the current coarse-grained time
+ * @floor: current (monotonic) ctime_floor value
+ *
+ * Get the coarse-grained time, and then determine whether to
+ * return it or the current floor value. Returns the later of the
+ * floor and coarse grained timestamps, converted to realtime
+ * clock value.
+ */
+static ktime_t coarse_ctime(ktime_t floor)
+{
+   ktime_t coarse = ktime_get_coarse();
+
+   /* If coarse time is already newer, return that */
+   if (!ktime_after(floor, coarse))
+   return ktime_mono_to_real(coarse);
+   return ktime_mono_to_real(floor);
+}
+
+/**
+ * current_time - Return FS time (possibly fine-grained)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
+ * as having been QUERIED, get a fine-grained timestamp.
+ */
+struct timespec64 current_time(struct inode *inode)
+{
+   ktime_t floor = smp_load_acquire(&ctime_floor);
+   ktime_t now = coarse_ctime(floor);
+   struct timespec64 now_ts = ktime_to_timespec64(now);
+   u32 cns;
+
+   if (!is_mgtime(inode))
+   goto out;
+
+   /* If nothing has queried it, then coarse time is fine */
+   cns = smp_load

[PATCH v4 0/9] fs: multigrain timestamp redux

2024-07-08 Thread Jeff Layton
tl;dr for those who have been following along:

There are several changes in this version. The conversion of ctime to
be a ktime_t value has been dropped, and we now use an unused bit in
the nsec field as the QUERIED flag (like the earlier patchset did).
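
For anyone who wants to see the flag-in-nsec idea in isolation, here is a
minimal userspace sketch. It assumes the flag sits in the top bit of a 32-bit
nanoseconds field (the patch description calls it the unused top bit of
i_ctime_nsec). All of the sketch_* names are made up for illustration and are
not the kernel's helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the flag is bit 31 of a 32-bit nanoseconds field. */
#define SKETCH_CTIME_QUERIED (UINT32_C(1) << 31)

/* Valid nanosecond values are below 1,000,000,000, which fits in 30 bits. */
#define SKETCH_NSEC_PER_SEC UINT32_C(1000000000)

/* Store a plain nanoseconds value; the flag starts out clear. */
static inline uint32_t sketch_pack_nsec(uint32_t nsec)
{
	assert(nsec < SKETCH_NSEC_PER_SEC);
	return nsec;
}

/* Record that something (a stat() call, say) has looked at the timestamp. */
static inline uint32_t sketch_mark_queried(uint32_t raw)
{
	return raw | SKETCH_CTIME_QUERIED;
}

/* Has the timestamp been looked at since it was last written? */
static inline bool sketch_was_queried(uint32_t raw)
{
	return (raw & SKETCH_CTIME_QUERIED) != 0;
}

/* Strip the flag to recover the real nanoseconds value. */
static inline uint32_t sketch_nsec(uint32_t raw)
{
	return raw & ~SKETCH_CTIME_QUERIED;
}
```

The point of packing the flag into the same word is that the value and the
flag can be read and swapped together in one 32-bit operation.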

The floor value is now tracked as a monotonic clock value, and is
converted to a realtime value on an as-needed basis. This eliminates the
problem of trying to detect when the realtime clock jumps backward.
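
A rough userspace sketch of why the monotonic floor sidesteps clock jumps,
assuming realtime can be modelled as monotonic time plus an adjustable offset.
The struct and the sketch_* helpers are hypothetical; only the idea of
converting on demand comes from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical clock state: realtime is monotonic time plus an offset. */
struct sketch_clock {
	int64_t mono_ns;         /* never goes backward */
	int64_t real_offset_ns;  /* may change when the wall clock is set */
};

/* Monotonic time only ever moves forward. */
static void sketch_clock_advance(struct sketch_clock *clk, int64_t delta_ns)
{
	assert(delta_ns >= 0);
	clk->mono_ns += delta_ns;
}

/* Convert a monotonic value to realtime only at the point of use. */
static int64_t sketch_mono_to_real(const struct sketch_clock *clk,
				   int64_t mono_ns)
{
	return mono_ns + clk->real_offset_ns;
}

/* The floor comparison involves monotonic values only. */
static bool sketch_floor_is_later(int64_t floor_mono_ns,
				  const struct sketch_clock *clk)
{
	return floor_mono_ns > clk->mono_ns;
}
```

Setting the wall clock backward changes real_offset_ns but not the outcome of
sketch_floor_is_later(), so there is no backward jump to detect.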

Longer patch description for those just joining in:

At LSF/MM this year, we had a discussion about the inode change
attribute. At the time I mentioned that I thought I could salvage the
multigrain timestamp work that had to be reverted last year [1].

That version had to be reverted because it was possible for a file to
get a coarse-grained timestamp that appeared to be earlier than that of
another file that had recently gotten a fine-grained stamp.

This version corrects the problem by establishing a global
ctime_floor value that should prevent this from occurring. In the above
situation, the two files might end up with the same timestamp value, but
they won't appear to have been modified in the wrong order.
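
To make the ordering argument concrete, here is a single-threaded userspace
sketch of the floor. It assumes clock readings are passed in as plain
nanosecond values and ignores concurrency (the kernel patches update the floor
with a compare-and-exchange). The sketch_* names are invented for the example.

```c
#include <stdint.h>

typedef int64_t sketch_ns_t;	/* nanoseconds on a monotonic clock */

/* Latest fine-grained time handed out so far. */
static sketch_ns_t sketch_floor;

/* Coarse stamp: the later of the floor and the coarse clock reading. */
static sketch_ns_t sketch_stamp_coarse(sketch_ns_t coarse_now)
{
	return coarse_now > sketch_floor ? coarse_now : sketch_floor;
}

/* Fine stamp: advance the floor, but never move it backward. */
static sketch_ns_t sketch_stamp_fine(sketch_ns_t fine_now)
{
	if (fine_now > sketch_floor)
		sketch_floor = fine_now;
	return sketch_floor;
}
```

If file A takes a fine stamp of 1500 while the coarse clock still reads 1000,
a later coarse stamp for file B comes back as 1500 rather than 1000. The two
files tie, but B never looks older than A.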

That problem was discovered by the test-stat-time gnulib test. Note that
that test still fails on multigrain timestamps, but that's because its
method of determining the minimum delay that will show a timestamp
change will no longer work with multigrain timestamps. I have a patch to
change the testcase to use a different method that is in the process of
being merged.

The testing I've done seems to show performance parity with multigrain
timestamps enabled vs. disabled, but it's hard to rule out a regression
in some workload.

This set is based on top of Christian's vfs.misc branch (which has the
earlier change to track inode timestamps as discrete integers). If there
are no major objections, I'd like to have this considered for v6.12,
after a nice long full-cycle soak in linux-next.

PS: I took a stab at a conversion for bcachefs too, but it's not
trivial. bcachefs handles timestamps backward from the way most
block-based filesystems do. Instead of updating them in struct inode and
eventually copying them to a disk-based representation, it does the
reverse and updates the timestamps in its in-core image of the on-disk
inode, and then copies that into struct inode. Either that will need to
be changed, or we'll need to come up with a different way to do this for
bcachefs.

[1]: 
https://lore.kernel.org/linux-fsdevel/20230807-mgctime-v7-0-d1dec143a...@kernel.org/

Signed-off-by: Jeff Layton 
---
Changes in v4:
- reordered tracepoint fields for better packing
- rework percpu counters again to also count fine grained timestamps
- switch to try_cmpxchg for better efficiency
- Link to v3: 
https://lore.kernel.org/r/20240705-mgtime-v3-0-85b2daa9b...@kernel.org

Changes in v3:
- Drop the conversion of i_ctime fields to ktime_t, and use an unused bit
  of the i_ctime_nsec field as QUERIED flag.
- Better tracepoints for tracking floor and ctime updates
- Reworked percpu counters to be more useful
- Track floor as monotonic value, which eliminates clock-jump problem

Changes in v2:
- Added Documentation file
- Link to v1: 
https://lore.kernel.org/r/20240626-mgtime-v1-0-a189352d0...@kernel.org

---
Jeff Layton (9):
  fs: add infrastructure for multigrain timestamps
  fs: tracepoints around multigrain timestamp events
  fs: add percpu counters for significant multigrain timestamp events
  fs: have setattr_copy handle multigrain timestamps appropriately
  Documentation: add a new file documenting multigrain timestamps
  xfs: switch to multigrain timestamps
  ext4: switch to multigrain timestamps
  btrfs: convert to multigrain timestamps
  tmpfs: add support for multigrain timestamps

 Documentation/filesystems/multigrain-ts.rst | 120 ++
 fs/attr.c   |  52 ++-
 fs/btrfs/file.c |  25 +--
 fs/btrfs/super.c|   3 +-
 fs/ext4/super.c |   2 +-
 fs/inode.c  | 234 
 fs/stat.c   |  39 -
 fs/xfs/libxfs/xfs_trans_inode.c |   6 +-
 fs/xfs/xfs_iops.c   |  10 +-
 fs/xfs/xfs_super.c  |   2 +-
 include/linux/fs.h  |  34 +++-
 include/trace/events/timestamp.h| 109 +
 mm/shmem.c  |   2 +-
 13 files changed, 560 insertions(+), 78 deletions(-)
---
base-commit: 49cb2d11beee730253f6f87263602a8c75f81f9b
change-id: 20240626-mgtime-5cd80b18d810

Best regards,
-- 
Jeff Layton 




Re: [PATCH v3 1/9] fs: add infrastructure for multigrain timestamps

2024-07-08 Thread Jeff Layton
On Mon, 2024-07-08 at 08:30 -0400, Jeff Layton wrote:
> On Fri, 2024-07-05 at 13:02 -0400, Jeff Layton wrote:
> > The VFS has always used coarse-grained timestamps when updating the
> > ctime and mtime after a change. This has the benefit of allowing
> > > filesystems to optimize away a lot of metadata updates, down to around 1
> > per jiffy, even when a file is under heavy writes.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. A lot of changes
> > can happen in a jiffy, so timestamps aren't sufficient to help the
> > client decide when to invalidate the cache. Even with NFSv4, a lot of
> > exported filesystems don't properly support a change attribute and are
> > subject to the same problems with timestamp granularity. Other
> > > applications have similar issues with timestamps (e.g. backup
> > applications).
> > 
> > If we were to always use fine-grained timestamps, that would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem would have to log a lot more metadata updates.
> > 
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
> > as a flag that indicates whether the current timestamps have been
> > queried via stat() or the like. When it's set, we allow the kernel to
> > use a fine-grained timestamp iff it's necessary to make the ctime show
> > a different value.
> > 
> > This solves the problem of being able to distinguish the timestamp
> > between updates, but introduces a new problem: it's now possible for a
> > file being changed to get a fine-grained timestamp. A file that is
> > altered just a bit later can then get a coarse-grained one that appears
> > older than the earlier fine-grained time. This violates timestamp
> > ordering guarantees.
> > 
> > To remedy this, keep a global monotonic ktime_t value that acts as a
> > > timestamp floor.  When we go to stamp a file, we first get the later of
> > the current floor value and the current coarse-grained time. If the
> > inode ctime hasn't been queried then we just attempt to stamp it with
> > that value.
> > 
> > If it has been queried, then first see whether the current coarse time
> > is later than the existing ctime. If it is, then we accept that value.
> > If it isn't, then we get a fine-grained time and try to swap that into
> > the global floor. Whether that succeeds or fails, we take the resulting
> > floor time, convert it to realtime and try to swap that into the ctime.
> > 
> > We take the result of the ctime swap whether it succeeds or fails, since
> > either is just as valid.
> > 
> > Filesystems can opt into this by setting the FS_MGTIME fstype flag.
> > Others should be unaffected (other than being subject to the same floor
> > value as multigrain filesystems).
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/inode.c | 172 
> > -
> >  fs/stat.c  |  36 ++-
> >  include/linux/fs.h |  34 ---
> >  3 files changed, 205 insertions(+), 37 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index f356fe2ec2b6..844ff0750959 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -60,6 +60,12 @@ static unsigned int i_hash_shift __ro_after_init;
> >  static struct hlist_head *inode_hashtable __ro_after_init;
> >  static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
> >  
> > +/*
> > + * This represents the latest time that we have handed out as a
> > + * timestamp on the system. Tracked as a MONOTONIC value, and
> > + * converted to the realtime clock on an as-needed basis.
> > + */
> > +static __cacheline_aligned_in_smp ktime_t ctime_floor;
> 
> 
> Now that this is being tracked as a monotonic value, I think I probably
> do need to move this to being per time_namespace. I'll plan to
> integrate that before the next posting.
> 

I take it back.

time_namespaces are all about virtualizing the clock for userland
consumption. They're implemented as a set of offsets from the global
timekeeper monotonic clock.

Since the floor value is an internal kernel value that is never
presented directly to userland, I don't think we need to make this per-
time_namespace after all. That would just mean dealing with extra
offset calculation.

I'll plan to resend with the latest changes here soon.

> >  /

Re: [PATCH v3 1/9] fs: add infrastructure for multigrain timestamps

2024-07-08 Thread Jeff Layton
On Fri, 2024-07-05 at 13:02 -0400, Jeff Layton wrote:
> The VFS has always used coarse-grained timestamps when updating the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away a lot of metadata updates, down to around 1
> per jiffy, even when a file is under heavy writes.
> 
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. A lot of changes
> can happen in a jiffy, so timestamps aren't sufficient to help the
> client decide when to invalidate the cache. Even with NFSv4, a lot of
> exported filesystems don't properly support a change attribute and are
> subject to the same problems with timestamp granularity. Other
> applications have similar issues with timestamps (e.g. backup
> applications).
> 
> If we were to always use fine-grained timestamps, that would improve the
> situation, but that becomes rather expensive, as the underlying
> filesystem would have to log a lot more metadata updates.
> 
> What we need is a way to only use fine-grained timestamps when they are
> being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
> as a flag that indicates whether the current timestamps have been
> queried via stat() or the like. When it's set, we allow the kernel to
> use a fine-grained timestamp iff it's necessary to make the ctime show
> a different value.
> 
> This solves the problem of being able to distinguish the timestamp
> between updates, but introduces a new problem: it's now possible for a
> file being changed to get a fine-grained timestamp. A file that is
> altered just a bit later can then get a coarse-grained one that appears
> older than the earlier fine-grained time. This violates timestamp
> ordering guarantees.
> 
> To remedy this, keep a global monotonic ktime_t value that acts as a
> timestamp floor.  When we go to stamp a file, we first get the later of
> the current floor value and the current coarse-grained time. If the
> inode ctime hasn't been queried then we just attempt to stamp it with
> that value.
> 
> If it has been queried, then first see whether the current coarse time
> is later than the existing ctime. If it is, then we accept that value.
> If it isn't, then we get a fine-grained time and try to swap that into
> the global floor. Whether that succeeds or fails, we take the resulting
> floor time, convert it to realtime and try to swap that into the ctime.
> 
> We take the result of the ctime swap whether it succeeds or fails, since
> either is just as valid.
> 
> Filesystems can opt into this by setting the FS_MGTIME fstype flag.
> Others should be unaffected (other than being subject to the same floor
> value as multigrain filesystems).
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/inode.c | 172 
> -
>  fs/stat.c  |  36 ++-
>  include/linux/fs.h |  34 ---
>  3 files changed, 205 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index f356fe2ec2b6..844ff0750959 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -60,6 +60,12 @@ static unsigned int i_hash_shift __ro_after_init;
>  static struct hlist_head *inode_hashtable __ro_after_init;
>  static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
>  
> +/*
> + * This represents the latest time that we have handed out as a
> + * timestamp on the system. Tracked as a MONOTONIC value, and
> + * converted to the realtime clock on an as-needed basis.
> + */
> +static __cacheline_aligned_in_smp ktime_t ctime_floor;


Now that this is being tracked as a monotonic value, I think I probably
do need to move this to being per time_namespace. I'll plan to
integrate that before the next posting.

>  /*
>   * Empty aops. Can be used for the cases where the user does not
>   * define any of the address_space operations.
> @@ -2127,19 +2133,72 @@ int file_remove_privs(struct file *file)
>  }
>  EXPORT_SYMBOL(file_remove_privs);
>  
> +/**
> + * coarse_ctime - return the current coarse-grained time
> + * @floor: current ctime_floor value
> + *
> + * Get the coarse-grained time, and then determine whether to
> + * return it or the current floor value. Returns the later of the
> + * floor and coarse grained timestamps, converted to realtime
> + * clock value.
> + */
> +static ktime_t coarse_ctime(ktime_t floor)
> +{
> + ktime_t coarse = ktime_get_coarse();
> +
> + /* If coarse time is already newer, return that */
> + if (!ktime_after(floor, coarse))
> +     return ktime_mono_to_real(coarse);
> + return ktime_mono_to_real(floor);
>

Re: [PATCH v3 2/9] fs: tracepoints around multigrain timestamp events

2024-07-05 Thread Jeff Layton
On Fri, 2024-07-05 at 14:07 -0400, Steven Rostedt wrote:
> On Fri, 05 Jul 2024 13:02:36 -0400
> Jeff Layton  wrote:
> 
> > diff --git a/include/trace/events/timestamp.h
> > b/include/trace/events/timestamp.h
> > new file mode 100644
> > index ..a004e5572673
> > --- /dev/null
> > +++ b/include/trace/events/timestamp.h
> > @@ -0,0 +1,109 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM timestamp
> > +
> > > +#if !defined(_TRACE_TIMESTAMP_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_TIMESTAMP_H
> > +
> > +#include 
> > +#include 
> > +
> > +TRACE_EVENT(inode_set_ctime_to_ts,
> > +   TP_PROTO(struct inode *inode,
> > +struct timespec64 *ctime),
> > +
> > +   TP_ARGS(inode, ctime),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(dev_t,  dev)
> > +   __field(ino_t,  ino)
> > +   __field(u32,gen)
> 
> It's best to keep the above 4 byte word below 8 byte words,
> otherwise,
> it will likely create a 4 byte hole in between.
> 

Thanks, I'll fix up both!
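
For reference, a small userspace sketch of the hole Steve is describing. It
uses fixed-width stand-ins for dev_t, ino_t and time64_t (an assumption, the
kernel types differ by architecture), and the reordered layout is just one
possible arrangement, not necessarily the one that ends up in the next
posting.

```c
#include <stddef.h>
#include <stdint.h>

/* Field order as first posted: a 4-byte field between 8-byte fields. */
struct sketch_entry_before {
	uint32_t dev;
	uint64_t ino;
	uint32_t gen;
	int64_t  ctime_s;
	uint32_t ctime_ns;
};

/* Reordered: 8-byte fields first, 4-byte fields grouped after them. */
struct sketch_entry_after {
	uint64_t ino;
	int64_t  ctime_s;
	uint32_t dev;
	uint32_t gen;
	uint32_t ctime_ns;
};

/* Grouping the small fields can only shrink the record, never grow it. */
_Static_assert(sizeof(struct sketch_entry_after) <=
	       sizeof(struct sketch_entry_before),
	       "reordered layout should not be larger");
```

On a typical LP64 target such as x86-64 I'd expect the first layout to take 40
bytes and the second 32, since each u32 that precedes an 8-byte field drags 4
bytes of padding along with it.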

> > +   __field(time64_t,   ctime_s)
> > +   __field(u32,ctime_ns)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->dev= inode->i_sb->s_dev;
> > +   __entry->ino= inode->i_ino;
> > +   __entry->gen= inode->i_generation;
> > +   __entry->ctime_s= ctime->tv_sec;
> > +   __entry->ctime_ns   = ctime->tv_nsec;
> > +   ),
> > +
> > +   TP_printk("ino=%d:%d:%ld:%u ctime=%lld.%u",
> > +   MAJOR(__entry->dev), MINOR(__entry->dev), __entry-
> > >ino, __entry->gen,
> > +   __entry->ctime_s, __entry->ctime_ns
> > +   )
> > +);
> > +
> > +TRACE_EVENT(ctime_ns_xchg,
> > +   TP_PROTO(struct inode *inode,
> > +u32 old,
> > +u32 new,
> > +u32 cur),
> > +
> > +   TP_ARGS(inode, old, new, cur),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(dev_t,  dev)
> > +   __field(ino_t,  ino)
> > +   __field(u32,gen)
> > +   __field(u32,old)
> > +   __field(u32,new)
> > +   __field(u32,cur)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->dev= inode->i_sb->s_dev;
> > +   __entry->ino= inode->i_ino;
> > +   __entry->gen= inode->i_generation;
> > +   __entry->old= old;
> > +   __entry->new= new;
> > +   __entry->cur= cur;
> > +   ),
> > +
> > +   TP_printk("ino=%d:%d:%ld:%u old=%u:%c new=%u cur=%u:%c",
> > +   MAJOR(__entry->dev), MINOR(__entry->dev), __entry-
> > >ino, __entry->gen,
> > +   __entry->old & ~I_CTIME_QUERIED, __entry->old &
> > I_CTIME_QUERIED ? 'Q' : '-',
> > +   __entry->new,
> > +   __entry->cur & ~I_CTIME_QUERIED, __entry->cur &
> > I_CTIME_QUERIED ? 'Q' : '-'
> > +   )
> > +);
> > +
> > +TRACE_EVENT(fill_mg_cmtime,
> > +   TP_PROTO(struct inode *inode,
> > +struct timespec64 *ctime,
> > +struct timespec64 *mtime),
> > +
> > +   TP_ARGS(inode, ctime, mtime),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(dev_t,  dev)
> > +   __field(ino_t,  ino)
> > +   __field(u32,gen)
> 
> Same here.
> 
> -- Steve
> 
> > +   __field(time64_t,   ctime_s)
> > +   __field(time64_t,   mtime_s)
> > +   __field(u32,ctime_ns)
> > +   __field(u32,mtime_ns)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->dev= inode->i_sb->s_dev;
> > +   __entry->ino= inode->i_ino;
> > +   __entry->gen= inode->i_generation;
> > +   __entry->ctime_s= ctime->tv_sec;
> > +   __entry->mtime_s= mtime->tv_sec;
> > +   __entry->ctime_ns   = ctime->tv_nsec;
> > +   __entry->mtime_ns   = mtime->tv_nsec;
> > +   ),
> > +
> > +   TP_printk("ino=%d:%d:%ld:%u ctime=%lld.%u mtime=%lld.%u",
> > +   MAJOR(__entry->dev), MINOR(__entry->dev), __entry-
> > >ino, __entry->gen,
> > +   __entry->ctime_s, __entry->ctime_ns,
> > +   __entry->mtime_s, __entry->mtime_ns
> > +   )
> > +);
> > +#endif /* _TRACE_TIMESTAMP_H */
> > +
> > +/* This part must be outside protection */
> > +#include 
> > 
> 

-- 
Jeff Layton 



[PATCH v3 9/9] tmpfs: add support for multigrain timestamps

2024-07-05 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.
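
As a rough illustration of "an apparent change whenever it has been written
after being observed", here is a single-threaded userspace sketch. It assumes
one clock domain, passes clock readings in as arguments, and uses a plain bool
instead of a flag bit. None of the sketch_* names exist in the kernel, and the
real code does this with atomic swaps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_inode {
	int64_t ctime_ns;
	bool queried;		/* set when the ctime has been observed */
};

/* Latest fine-grained time handed out to any inode. */
static int64_t sketch_floor_ns;

/* Observe the ctime, as a getattr would, and remember that we did. */
static int64_t sketch_getattr(struct sketch_inode *inode)
{
	inode->queried = true;
	return inode->ctime_ns;
}

/* Stamp the inode after a change and return the value used. */
static int64_t sketch_update_ctime(struct sketch_inode *inode,
				   int64_t coarse_now, int64_t fine_now)
{
	int64_t now;

	assert(fine_now >= coarse_now);

	/* Start from the later of the coarse clock and the floor. */
	now = coarse_now > sketch_floor_ns ? coarse_now : sketch_floor_ns;

	/* Observed, and the coarse value would not show a change? Go fine. */
	if (inode->queried && now <= inode->ctime_ns) {
		if (fine_now > sketch_floor_ns)
			sketch_floor_ns = fine_now;
		now = sketch_floor_ns;
	}

	inode->ctime_ns = now;
	inode->queried = false;
	return now;
}
```

An inode that nobody has looked at keeps getting the cheap coarse value, while
one that was just queried gets a value that visibly differs from what the
observer saw.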

tmpfs only requires the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 440e2a9d8726..6dc817064140 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4649,7 +4649,7 @@ static struct file_system_type shmem_fs_type = {
.parameters = shmem_fs_parameters,
 #endif
.kill_sb= kill_litter_super,
-   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 
 void __init shmem_init(void)

-- 
2.45.2




[PATCH v3 8/9] btrfs: convert to multigrain timestamps

2024-07-05 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have it overwrite the timestamps unconditionally.

Note that this also drops the IS_I_VERSION check and unconditionally
bumps the change attribute, since SB_I_VERSION is always set on btrfs.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/super.c |  3 ++-
 2 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d90138683a0a..409628c0c3cc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1120,26 +1120,6 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
btrfs_drew_write_unlock(&inode->root->snapshot_lock);
 }
 
-static void update_time_for_write(struct inode *inode)
-{
-   struct timespec64 now, ts;
-
-   if (IS_NOCMTIME(inode))
-       return;
-
-   now = current_time(inode);
-   ts = inode_get_mtime(inode);
-   if (!timespec64_equal(&ts, &now))
-       inode_set_mtime_to_ts(inode, now);
-
-   ts = inode_get_ctime(inode);
-   if (!timespec64_equal(&ts, &now))
-       inode_set_ctime_to_ts(inode, now);
-
-   if (IS_I_VERSION(inode))
-       inode_inc_iversion(inode);
-}
-
 static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 size_t count)
 {
@@ -1171,7 +1151,10 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 * need to start yet another transaction to update the inode as we will
 * update the inode when we finish writing whatever data we write.
 */
-   update_time_for_write(inode);
+   if (!IS_NOCMTIME(inode)) {
+       inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+       inode_inc_iversion(inode);
+   }
 
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f05cce7c8b8d..1cd50293b98d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2173,7 +2173,8 @@ static struct file_system_type btrfs_fs_type = {
.init_fs_context= btrfs_init_fs_context,
.parameters = btrfs_fs_parameters,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+ FS_ALLOW_IDMAP | FS_MGTIME,
  };
 
 MODULE_ALIAS_FS("btrfs");

-- 
2.45.2




[PATCH v3 7/9] ext4: switch to multigrain timestamps

2024-07-05 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

For ext4, we only need to enable the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index eb899628e121..95d4d7c0957a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -7294,7 +7294,7 @@ static struct file_system_type ext4_fs_type = {
.init_fs_context= ext4_init_fs_context,
.parameters = ext4_param_specs,
.kill_sb= ext4_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("ext4");
 

-- 
2.45.2




[PATCH v3 6/9] xfs: switch to multigrain timestamps

2024-07-05 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.

Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
should give us better semantics now.

Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_trans_inode.c |  6 +++---
 fs/xfs/xfs_iops.c   | 10 +++---
 fs/xfs/xfs_super.c  |  2 +-
 3 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 69fc5b981352..1f3639bbf5f0 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -62,12 +62,12 @@ xfs_trans_ichgtime(
ASSERT(tp);
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
 
-   tv = current_time(inode);
+   /* If the mtime changes, then ctime must also change */
+   ASSERT(flags & XFS_ICHGTIME_CHG);
 
+   tv = inode_set_ctime_current(inode);
    if (flags & XFS_ICHGTIME_MOD)
        inode_set_mtime_to_ts(inode, tv);
-   if (flags & XFS_ICHGTIME_CHG)
-       inode_set_ctime_to_ts(inode, tv);
    if (flags & XFS_ICHGTIME_CREATE)
        ip->i_crtime = tv;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a00dcbc77e12..d25872f818fa 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -592,8 +592,9 @@ xfs_vn_getattr(
stat->gid = vfsgid_into_kgid(vfsgid);
stat->ino = ip->i_ino;
stat->atime = inode_get_atime(inode);
-   stat->mtime = inode_get_mtime(inode);
-   stat->ctime = inode_get_ctime(inode);
+
+   fill_mg_cmtime(stat, request_mask, inode);
+
stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
 
if (xfs_has_v3inodes(mp)) {
@@ -603,11 +604,6 @@ xfs_vn_getattr(
}
}
 
-   if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
-   stat->change_cookie = inode_query_iversion(inode);
-   stat->result_mask |= STATX_CHANGE_COOKIE;
-   }
-
/*
 * Note: If you add another clause to set an attribute flag, please
 * update attributes_mask below.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..210481b03fdb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
.init_fs_context= xfs_init_fs_context,
.parameters = xfs_fs_parameters,
.kill_sb= xfs_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("xfs");
 

-- 
2.45.2




[PATCH v3 5/9] Documentation: add a new file documenting multigrain timestamps

2024-07-05 Thread Jeff Layton
Add a high-level document that describes how multigrain timestamps work,
rationale for them, and some info about implementation and tradeoffs.

Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/multigrain-ts.rst | 120 
 1 file changed, 120 insertions(+)

diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index ..70d36955bb83
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+============
+Historically, the kernel has always used coarse time values to stamp
+inodes. This value is updated on every jiffy, so any change that happens
+within that jiffy will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first gets
+the current time and then compares it to the existing timestamp(s) to see
+whether anything will change. If nothing changed, then it can avoid updating
+the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since they
+reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen in a
+jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps can
+make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, causing the NFS server to substitute
+the ctime in many cases.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the current
+coarse-grained time does not cause a change.
+
+Inode Timestamps
+================
+There are currently 3 timestamps in the inode that are updated to the current
+wallclock time on different activity:
+
+ctime:
+  The inode change time. This is stamped with the current time whenever
+  the inode's metadata is changed. Note that this value is not settable
+  from userland.
+
+mtime:
+  The inode modification time. This is stamped with the current time
+  any time a file's contents change.
+
+atime:
+  The inode access time. This is stamped whenever an inode's contents are
+  read. Widely considered to be a terrible mistake. Usually avoided with
+  options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
+not affected and always use the coarse-grained value (subject to the floor).
+
+Inode Timestamp Ordering
+========================
+
+In addition to just providing info about changes to individual files, file
+timestamps also serve an important purpose in applications like "make". These
+programs measure timestamps in order to determine whether source files might be
+newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first. The
+same is true if two threads are issuing similar operations that do not overlap
+in time.
+
+If however, two threads have racing syscalls that overlap in time, then there
+is no such guarantee, and the second file may appear to have been modified
+before, after or at the same time as the first, regardless of which one was
+submitted first.
+
+Multigrain Timestamps
+=====================
+Multigrain timestamps are aimed at ensuring that changes to a single file are
+always recognizable, without violating the ordering guarantees when multiple
+different files are modified. This affects the mtime and the ctime, but the
+atime will always use coarse-grained timestamps.
+
+It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
+or ctime has been queried. If either or both have, then the kernel takes
+special care to ensure the next timestamp update will display a visible change.
+This ensures tight cache coherency for use-cases like NFS, without sacrificing
+the benefits of reduced metadata updates when files aren't being watched.
+
+The Ctime Floor Value
+=====================
+It's not sufficient to simply use fine or coarse-grained timestamps based on
+whether the mtime or ctime has been queried. A file could get a fine grained
+timestamp, and 

[PATCH v3 4/9] fs: have setattr_copy handle multigrain timestamps appropriately

2024-07-05 Thread Jeff Layton
The setattr codepath is still using coarse-grained timestamps, even on
multigrain filesystems. To fix this, we need to fetch the timestamp for
ctime updates later, at the point where the assignment occurs in
setattr_copy.

On a multigrain inode, ignore the ia_ctime in the attrs, and always
update the ctime to the current clock value. Update the atime and mtime
with the same value (if needed) unless they are being set to other
specific values, à la utimes().

Note that we don't want to do this universally however, as some
filesystems (e.g. most networked fs) want to do an explicit update
elsewhere before updating the local inode.

Signed-off-by: Jeff Layton 
---
 fs/attr.c | 52 ++--
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 825007d5cda4..e03ea6951864 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -271,6 +271,42 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
 }
 EXPORT_SYMBOL(inode_newsize_ok);
 
+/**
+ * setattr_copy_mgtime - update timestamps for mgtime inodes
+ * @inode: inode timestamps to be updated
+ * @attr: attrs for the update
+ *
+ * With multigrain timestamps, we need to take more care to prevent races
+ * when updating the ctime. Always update the ctime to the very latest
+ * using the standard mechanism, and use that to populate the atime and
+ * mtime appropriately (unless we're setting those to specific values).
+ */
+static void setattr_copy_mgtime(struct inode *inode, const struct iattr *attr)
+{
+   unsigned int ia_valid = attr->ia_valid;
+   struct timespec64 now;
+
+   /*
+    * If the ctime isn't being updated then nothing else should be
+    * either.
+    */
+   if (!(ia_valid & ATTR_CTIME)) {
+       WARN_ON_ONCE(ia_valid & (ATTR_ATIME|ATTR_MTIME));
+       return;
+   }
+
+   now = inode_set_ctime_current(inode);
+   if (ia_valid & ATTR_ATIME_SET)
+       inode_set_atime_to_ts(inode, attr->ia_atime);
+   else if (ia_valid & ATTR_ATIME)
+       inode_set_atime_to_ts(inode, now);
+
+   if (ia_valid & ATTR_MTIME_SET)
+       inode_set_mtime_to_ts(inode, attr->ia_mtime);
+   else if (ia_valid & ATTR_MTIME)
+       inode_set_mtime_to_ts(inode, now);
+}
+
 /**
  * setattr_copy - copy simple metadata updates into the generic inode
  * @idmap: idmap of the mount the inode was found from
@@ -303,12 +339,6 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
 
    i_uid_update(idmap, attr, inode);
    i_gid_update(idmap, attr, inode);
-   if (ia_valid & ATTR_ATIME)
-       inode_set_atime_to_ts(inode, attr->ia_atime);
-   if (ia_valid & ATTR_MTIME)
-       inode_set_mtime_to_ts(inode, attr->ia_mtime);
-   if (ia_valid & ATTR_CTIME)
-       inode_set_ctime_to_ts(inode, attr->ia_ctime);
    if (ia_valid & ATTR_MODE) {
        umode_t mode = attr->ia_mode;
        if (!in_group_or_capable(idmap, inode,
@@ -316,6 +346,16 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
            mode &= ~S_ISGID;
        inode->i_mode = mode;
    }
+
+   if (is_mgtime(inode))
+       return setattr_copy_mgtime(inode, attr);
+
+   if (ia_valid & ATTR_ATIME)
+       inode_set_atime_to_ts(inode, attr->ia_atime);
+   if (ia_valid & ATTR_MTIME)
+       inode_set_mtime_to_ts(inode, attr->ia_mtime);
+   if (ia_valid & ATTR_CTIME)
+       inode_set_ctime_to_ts(inode, attr->ia_ctime);
 }
 EXPORT_SYMBOL(setattr_copy);
 

-- 
2.45.2




[PATCH v3 3/9] fs: add percpu counters to count fine vs. coarse timestamps

2024-07-05 Thread Jeff Layton
Keep a set of percpu counters so we can track what proportion of
timestamps is fine-grained.
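
For illustration, a minimal userspace sketch of how the debugfs output
could be consumed. The field order (ctime_updates, ctime_swaps,
floor_swaps) follows the seq_printf() call in this patch. The struct and
helper names are hypothetical, and treating floor_swaps / ctime_updates
as the fine-grained proportion is an assumption, not something the patch
defines.

```c
#include <stdio.h>

/* Hypothetical container for the three values printed by mgts_show(). */
struct demo_mgts {
	unsigned long long ctime_updates;
	unsigned long long ctime_swaps;
	unsigned long long floor_swaps;
};

/* Parse one "updates swaps floor_swaps" line; returns 0 on success. */
static int demo_parse_mgts(const char *line, struct demo_mgts *out)
{
	int n = sscanf(line, "%llu %llu %llu", &out->ctime_updates,
		       &out->ctime_swaps, &out->floor_swaps);

	return n == 3 ? 0 : -1;
}

/*
 * Assumed interpretation: each successful floor swap corresponds to one
 * fine-grained timestamp, so this ratio approximates the fine-grained
 * share of all ctime updates.
 */
static double demo_fine_fraction(const struct demo_mgts *c)
{
	if (c->ctime_updates == 0)
		return 0.0;
	return (double)c->floor_swaps / (double)c->ctime_updates;
}
```

A caller would read the line from the multigrain_timestamps file in the
debugfs root and pass it to demo_parse_mgts().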

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 49 -
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index 4ab7aee3558c..2e5610ebb205 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #define CREATE_TRACE_POINTS
 #include 
@@ -69,6 +71,11 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
  * converted to the realtime clock on an as-needed basis.
  */
 static __cacheline_aligned_in_smp ktime_t ctime_floor;
+
+static struct percpu_counter mg_ctime_updates;
+static struct percpu_counter mg_floor_swaps;
+static struct percpu_counter mg_ctime_swaps;
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2662,11 +2669,14 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
 * as good, so keep it.
 */
old = cmpxchg(&ctime_floor, floor, fine);
-   if (old != floor)
+   if (old == floor)
+   percpu_counter_inc(&mg_floor_swaps);
+   else
fine = old;
now = ktime_mono_to_real(fine);
}
}
+   percpu_counter_inc(&mg_ctime_updates);
now_ts = ktime_to_timespec64(now);
 retry:
/* Try to swap the nsec value into place. */
@@ -2676,6 +2686,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
/* If swap occurred, then we're (mostly) done */
if (cur == cns) {
inode->i_ctime_sec = now_ts.tv_sec;
+   percpu_counter_inc(&mg_ctime_swaps);
} else {
/*
 * Was the change due to someone marking the old ctime QUERIED?
@@ -2745,3 +2756,39 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap,
return mode & ~S_ISGID;
 }
 EXPORT_SYMBOL(mode_strip_sgid);
+
+static int mgts_show(struct seq_file *s, void *p)
+{
+	u64 ctime_updates = percpu_counter_sum(&mg_ctime_updates);
+	u64 floor_swaps = percpu_counter_sum(&mg_floor_swaps);
+	u64 ctime_swaps = percpu_counter_sum(&mg_ctime_swaps);
+
+	seq_printf(s, "%llu %llu %llu\n", ctime_updates, ctime_swaps, floor_swaps);
+	return 0;
+}
+
+DEFINE_SHOW_ATTRIBUTE(mgts);
+
+static int __init mg_debugfs_init(void)
+{
+	int ret = percpu_counter_init(&mg_ctime_updates, 0, GFP_KERNEL);
+
+	if (ret)
+		return ret;
+
+	ret = percpu_counter_init(&mg_floor_swaps, 0, GFP_KERNEL);
+	if (ret) {
+		percpu_counter_destroy(&mg_ctime_updates);
+		return ret;
+	}
+
+	ret = percpu_counter_init(&mg_ctime_swaps, 0, GFP_KERNEL);
+	if (ret) {
+		percpu_counter_destroy(&mg_floor_swaps);
+		percpu_counter_destroy(&mg_ctime_updates);
+		return ret;
+	}
+	debugfs_create_file("multigrain_timestamps", S_IFREG | S_IRUGO, NULL, NULL, &mgts_fops);
+	return 0;
+}
+late_initcall(mg_debugfs_init);

-- 
2.45.2




[PATCH v3 2/9] fs: tracepoints around multigrain timestamp events

2024-07-05 Thread Jeff Layton
Add some tracepoints around various multigrain timestamp events.

Signed-off-by: Jeff Layton 
---
 fs/inode.c   |   5 ++
 fs/stat.c|   3 ++
 include/trace/events/timestamp.h | 109 +++
 3 files changed, 117 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 844ff0750959..4ab7aee3558c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,9 @@
 #include 
 #include 
 #include 
+#define CREATE_TRACE_POINTS
+#include 
+
 #include "internal.h"
 
 /*
@@ -2570,6 +2573,7 @@ struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 t
 {
inode->i_ctime_sec = ts.tv_sec;
inode->i_ctime_nsec = ts.tv_nsec & ~I_CTIME_QUERIED;
+   trace_inode_set_ctime_to_ts(inode, &ts);
return ts;
 }
 EXPORT_SYMBOL(inode_set_ctime_to_ts);
@@ -2667,6 +2671,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
 retry:
/* Try to swap the nsec value into place. */
cur = cmpxchg(&inode->i_ctime_nsec, cns, now_ts.tv_nsec);
+   trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec, cur);
 
/* If swap occurred, then we're (mostly) done */
if (cur == cns) {
diff --git a/fs/stat.c b/fs/stat.c
index df7fdd3afed9..552dfd67688b 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 
+#include 
+
 #include "internal.h"
 #include "mount.h"
 
@@ -49,6 +51,7 @@ void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
stat->mtime = inode_get_mtime(inode);
stat->ctime.tv_sec = inode->i_ctime_sec;
stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn)) & ~I_CTIME_QUERIED;
+   trace_fill_mg_cmtime(inode, &stat->ctime, &stat->mtime);
 }
 EXPORT_SYMBOL(fill_mg_cmtime);
 
diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
new file mode 100644
index ..a004e5572673
--- /dev/null
+++ b/include/trace/events/timestamp.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM timestamp
+
+#if !defined(_TRACE_TIMESTAMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TIMESTAMP_H
+
+#include 
+#include 
+
+TRACE_EVENT(inode_set_ctime_to_ts,
+   TP_PROTO(struct inode *inode,
+struct timespec64 *ctime),
+
+   TP_ARGS(inode, ctime),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(u32,gen)
+   __field(time64_t,   ctime_s)
+   __field(u32,ctime_ns)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->gen= inode->i_generation;
+   __entry->ctime_s= ctime->tv_sec;
+   __entry->ctime_ns   = ctime->tv_nsec;
+   ),
+
+   TP_printk("ino=%d:%d:%ld:%u ctime=%lld.%u",
+   MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, __entry->gen,
+   __entry->ctime_s, __entry->ctime_ns
+   )
+);
+
+TRACE_EVENT(ctime_ns_xchg,
+   TP_PROTO(struct inode *inode,
+u32 old,
+u32 new,
+u32 cur),
+
+   TP_ARGS(inode, old, new, cur),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(u32,gen)
+   __field(u32,old)
+   __field(u32,new)
+   __field(u32,cur)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->gen= inode->i_generation;
+   __entry->old= old;
+   __entry->new= new;
+   __entry->cur= cur;
+   ),
+
+   TP_printk("ino=%d:%d:%ld:%u old=%u:%c new=%u cur=%u:%c",
+   MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, __entry->gen,
+   __entry->old & ~I_CTIME_QUERIED, __entry->old & I_CTIME_QUERIED ? 'Q' : '-',
+   __entry->new,
+   __entry->cur & ~I_CTIME_QUERIED, __entry->cur & I_CTIME_QUERIED ? 'Q' : '-'
+   )
+);
+
+TRACE_EVENT(fill_mg_cmtime,
+   TP_PROTO(struct inode *inode,
+struct timespec64 *ctime,
+struct timespec6

[PATCH v3 1/9] fs: add infrastructure for multigrain timestamps

2024-07-05 Thread Jeff Layton
The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot of metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide when to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g. backup
applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
as a flag that indicates whether the current timestamps have been
queried via stat() or the like. When it's set, we allow the kernel to
use a fine-grained timestamp iff it's necessary to make the ctime show
a different value.
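
To make the flag idea concrete, here is a minimal userspace sketch using
C11 atomics. It relies on the fact that a valid nsec value is at most
999,999,999, which fits in 30 bits, so the top bit of a 32-bit field is
free. The DEMO_* names and the struct are hypothetical, and the flag
value (bit 31) is an assumption rather than something taken from this
posting.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag position: the top bit of the 32-bit nsec field. */
#define DEMO_CTIME_QUERIED (1U << 31)

/* The largest valid nsec value leaves bits 30 and 31 unused. */
_Static_assert(999999999U < (1U << 30), "nsec must fit in 30 bits");

struct demo_inode {
	int64_t ctime_sec;
	_Atomic uint32_t ctime_nsec;	/* nsec value plus the QUERIED flag */
};

/*
 * Reader side (a stat()-like path): fetch the nsec value and, in the
 * same atomic operation, mark the timestamp as having been observed.
 */
static uint32_t demo_query_nsec(struct demo_inode *inode)
{
	uint32_t old = atomic_fetch_or(&inode->ctime_nsec, DEMO_CTIME_QUERIED);

	return old & ~DEMO_CTIME_QUERIED;
}

/* Writer side: has anyone looked at the ctime since it was last set? */
static bool demo_was_queried(struct demo_inode *inode)
{
	return (atomic_load(&inode->ctime_nsec) & DEMO_CTIME_QUERIED) != 0;
}
```

Because the flag travels in the same word as the nsec value, a writer
can test it and replace the timestamp with a single compare-and-swap.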

This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible for a
file being changed to get a fine-grained timestamp. A file that is
altered just a bit later can then get a coarse-grained one that appears
older than the earlier fine-grained time. This violates timestamp
ordering guarantees.

To remedy this, keep a global monotonic ktime_t value that acts as a
timestamp floor.  When we go to stamp a file, we first get the later of
the current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it with
that value.

If it has been queried, then first see whether the current coarse time
is later than the existing ctime. If it is, then we accept that value.
If it isn't, then we get a fine-grained time and try to swap that into
the global floor. Whether that succeeds or fails, we take the resulting
floor time, convert it to realtime and try to swap that into the ctime.

We take the result of the ctime swap whether it succeeds or fails, since
either is just as valid.
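
The steps above can be sketched in userspace roughly as follows. This is
a simplified model under stated assumptions, not the kernel code: the
clocks are passed in as plain nanosecond counts so the behavior is
deterministic, the monotonic-to-realtime conversion is omitted, and the
retry that the real code performs when only the QUERIED flag changed is
left out. All demo_* and DEMO_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_QUERIED      (1U << 31)	/* assumed flag bit in nsec */
#define DEMO_NSEC_PER_SEC 1000000000LL

/* Global floor: the latest fine-grained time handed out, in ns. */
static _Atomic int64_t demo_floor;

struct demo_inode {
	int64_t sec;
	_Atomic uint32_t nsec;	/* top bit doubles as the QUERIED flag */
};

/* Reader side (stat): mark the current ctime as having been observed. */
static void demo_query(struct demo_inode *inode)
{
	atomic_fetch_or(&inode->nsec, DEMO_QUERIED);
}

/*
 * Stamp the inode. "coarse" and "fine" stand in for the coarse-grained
 * and fine-grained clock readings. Returns the resulting ctime in ns.
 */
static int64_t demo_stamp(struct demo_inode *inode, int64_t coarse, int64_t fine)
{
	int64_t floor = atomic_load(&demo_floor);
	/* Start with the later of the floor and the coarse time. */
	int64_t now = coarse > floor ? coarse : floor;
	uint32_t cns = atomic_load(&inode->nsec);

	if (cns & DEMO_QUERIED) {
		int64_t ctime = inode->sec * DEMO_NSEC_PER_SEC +
				(cns & ~DEMO_QUERIED);

		/* Coarse time would not show a change: go fine-grained. */
		if (now <= ctime) {
			int64_t seen = floor;

			if (atomic_compare_exchange_strong(&demo_floor, &seen, fine))
				now = fine;	/* we advanced the floor */
			else
				now = seen;	/* someone else's floor is as good */
		}
	}

	/* Try to swap the new nsec value in; this also clears the flag. */
	uint32_t expected = cns;
	if (atomic_compare_exchange_strong(&inode->nsec, &expected,
					   (uint32_t)(now % DEMO_NSEC_PER_SEC))) {
		inode->sec = now / DEMO_NSEC_PER_SEC;
		return now;
	}
	/* Lost the race: accept whatever value is there now. */
	return inode->sec * DEMO_NSEC_PER_SEC + (expected & ~DEMO_QUERIED);
}
```

With this model, a file stamped after another file received a
fine-grained time gets at least the floor value, so the two may share a
timestamp but cannot appear out of order.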

Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same floor
value as multigrain filesystems).

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 172 -
 fs/stat.c  |  36 ++-
 include/linux/fs.h |  34 ---
 3 files changed, 205 insertions(+), 37 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index f356fe2ec2b6..844ff0750959 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -60,6 +60,12 @@ static unsigned int i_hash_shift __ro_after_init;
 static struct hlist_head *inode_hashtable __ro_after_init;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
+/*
+ * This represents the latest time that we have handed out as a
+ * timestamp on the system. Tracked as a MONOTONIC value, and
+ * converted to the realtime clock on an as-needed basis.
+ */
+static __cacheline_aligned_in_smp ktime_t ctime_floor;
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2127,19 +2133,72 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);
 
+/**
+ * coarse_ctime - return the current coarse-grained time
+ * @floor: current ctime_floor value
+ *
+ * Get the coarse-grained time, and then determine whether to
+ * return it or the current floor value. Returns the later of the
+ * floor and coarse grained timestamps, converted to realtime
+ * clock value.
+ */
+static ktime_t coarse_ctime(ktime_t floor)
+{
+	ktime_t coarse = ktime_get_coarse();
+
+	/* If coarse time is already newer, return that */
+	if (!ktime_after(floor, coarse))
+		return ktime_mono_to_real(coarse);
+	return ktime_mono_to_real(floor);
+}
+
+/**
+ * current_time - Return FS time (possibly fine-grained)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
+ * as having been QUERIED, get a fine-grained timestamp.
+ */
+struct timespec64 current_time(struct inode *inode)
+{
+   ktime_t floor = smp_load_acquire(&ctime_floor);
+   ktime_t now = coarse_ctime(floor);
+   struct timespec64 now_ts = ktime_to_timespec64(now);
+   u32 cns;
+
+   if (!is_mgtime(inode))
+   goto out;
+
+   /* If nothing has queried it, then coarse time is fine */
+   cns = smp_load_acquire(&inod

[PATCH v3 0/9] fs: multigrain timestamp redux

2024-07-05 Thread Jeff Layton
tl;dr for those who have been following along:

There are several changes in this version. The conversion of ctime to
be a ktime_t value has been dropped, and we now use an unused bit in
the nsec field as the QUERIED flag (like the earlier patchset did).

The floor value is now tracked as a monotonic clock value, and is
converted to a realtime value on an as-needed basis. This eliminates the
problem of trying to detect when the realtime clock jumps backward.

Longer patch description for those just joining in:

At LSF/MM this year, we had a discussion about the inode change
attribute. At the time I mentioned that I thought I could salvage the
multigrain timestamp work that had to be reverted last year [1].

That version had to be reverted because it was possible for a file to
get a coarse grained timestamp that appeared to be earlier than another
file that had recently gotten a fine-grained stamp.

This version corrects the problem by establishing a global
ctime_floor value that should prevent this from occurring. In the above
situation, the two files might end up with the same timestamp value, but
they won't appear to have been modified in the wrong order.

That problem was discovered by the test-stat-time gnulib test. Note that
that test still fails on multigrain timestamps, but that's because its
method of determining the minimum delay that will show a timestamp
change will no longer work with multigrain timestamps. I have a patch to
change the testcase to use a different method that is in the process of
being merged.

The testing I've done seems to show performance parity with multigrain
timestamps enabled vs. disabled, but it's hard to rule this out
regressing some workload.

This set is based on top of Christian's vfs.misc branch (which has the
earlier change to track inode timestamps as discrete integers). If there
are no major objections, I'd like to let this soak in linux-next for a
bit to see if any problems shake out.

[1]: https://lore.kernel.org/linux-fsdevel/20230807-mgctime-v7-0-d1dec143a...@kernel.org/

To: Alexander Viro 
To: Christian Brauner 
To: Jan Kara 
To: Steven Rostedt 
To: Masami Hiramatsu 
To: Mathieu Desnoyers 
To: Chandan Babu R 
To: Darrick J. Wong 
To: Theodore Ts'o 
To: Andreas Dilger 
To: Chris Mason 
To: Josef Bacik 
To: David Sterba 
To: Hugh Dickins 
To: Andrew Morton 
To: Jonathan Corbet 
Cc: Dave Chinner 
Cc: Andi Kleen 
Cc: Christoph Hellwig 
Cc: kernel-t...@fb.com
Cc: linux-fsde...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-trace-ker...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-e...@vger.kernel.org
Cc: linux-btrfs@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Signed-off-by: Jeff Layton 

Changes in v3:
- Drop the conversion of i_ctime fields to ktime_t, and use an unused bit
  of the i_ctime_nsec field as QUERIED flag.
- Better tracepoints for tracking floor and ctime updates
- Reworked percpu counters to be more useful
- Track floor as monotonic value, which eliminates clock-jump problem

Changes in v2:
- Added Documentation file
- Link to v1: https://lore.kernel.org/r/20240626-mgtime-v1-0-a189352d0...@kernel.org

---
Jeff Layton (9):
  fs: add infrastructure for multigrain timestamps
  fs: tracepoints around multigrain timestamp events
  fs: add percpu counters to count fine vs. coarse timestamps
  fs: have setattr_copy handle multigrain timestamps appropriately
  Documentation: add a new file documenting multigrain timestamps
  xfs: switch to multigrain timestamps
  ext4: switch to multigrain timestamps
  btrfs: convert to multigrain timestamps
  tmpfs: add support for multigrain timestamps

 Documentation/filesystems/multigrain-ts.rst | 120 +++
 fs/attr.c   |  52 ++-
 fs/btrfs/file.c |  25 +---
 fs/btrfs/super.c|   3 +-
 fs/ext4/super.c |   2 +-
 fs/inode.c  | 224 
 fs/stat.c   |  39 -
 fs/xfs/libxfs/xfs_trans_inode.c |   6 +-
 fs/xfs/xfs_iops.c   |  10 +-
 fs/xfs/xfs_super.c  |   2 +-
 include/linux/fs.h  |  34 -
 include/trace/events/timestamp.h| 109 ++
 mm/shmem.c  |   2 +-
 13 files changed, 550 insertions(+), 78 deletions(-)
---
base-commit: cc8223373449ecbd4c18932820714235db6006c4
change-id: 20240626-mgtime-5cd80b18d810

Best regards,
-- 
Jeff Layton 




Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t

2024-07-02 Thread Jeff Layton
On Tue, 2024-07-02 at 08:12 -0700, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 08:21:42AM -0400, Jeff Layton wrote:
> > Many of the existing callers of inode_ctime_to_ts are in void return
> > functions. They're just copying data from an internal representation to
> > struct inode and assume it always succeeds. For those we'll probably
> > have to catch bad ctime values earlier.
> > 
> > So, I think I'll probably have to roll bespoke error handling in all of
> > the relevant filesystems if we go this route. There are also
> > differences between filesystems -- does it make sense to refuse to load
> > an inode with a bogus ctime on NFS or AFS? Probably not.
> > 
> > Hell, it may be simpler to just ditch this patch and reimplement
> > mgtimes using the nanosecond fields like the earlier versions did.
> 
> That does for sure sound simpler.  What is the big advantage of the
> ktime_t?  Smaller size?
> 

Yeah, mostly. We shrink struct inode by 8 bytes with that patch, and we
(probably) get a better cache footprint, since i_version ends up in the
same cacheline as the ctime. That's really a separate issue though, so
I'm not too worked up about dropping that patch.

As a bonus, leaving it split across separate fields means that we can
use unused bits in the nsec field for the flag, so we don't need to
sacrifice any timestamp granularity either.

I've got a draft rework that does this that I'm testing now. Assuming
it works OK, I'll resend in a few days.

Thanks for the feedback!
-- 
Jeff Layton 



Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t

2024-07-02 Thread Jeff Layton
On Tue, 2024-07-02 at 05:15 -0700, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 08:09:46AM -0400, Jeff Layton wrote:
> > > > corrupt timestamps like this?
> > > 
> > > inode_set_ctime_to_ts should return an error if things are out of
> > > range.
> > 
> > Currently it just returns the timespec64 we're setting it to (which
> > makes it easy to do several assignments), so we'd need to change its
> > prototype to handle this case, and fix up the callers to recognize the
> > error.
> > 
> > Alternately it may be easier to just add in a test for when
> > __i_ctime == KTIME_MAX in the appropriate callers and have them error
> > out. I'll have a look and see what makes sense.
> 
> That seems like a more awkward interface vs one that explicitly checks.
> 

Many of the existing callers of inode_ctime_to_ts are in void return
functions. They're just copying data from an internal representation to
struct inode and assume it always succeeds. For those we'll probably
have to catch bad ctime values earlier.

So, I think I'll probably have to roll bespoke error handling in all of
the relevant filesystems if we go this route. There are also
differences between filesystems -- does it make sense to refuse to load
an inode with a bogus ctime on NFS or AFS? Probably not.

Hell, it may be simpler to just ditch this patch and reimplement
mgtimes using the nanosecond fields like the earlier versions did.

I'll need to study this a bit and figure out what's best.

> > 
> > > How do we currently catch this when it comes from userland?
> > > 
> > 
> > Not sure I understand this question. ctime values should never come
> > from userland. They should only ever come from the system clock.
> 
> Ah, yes, utimes only changes mtime.

Yep. That's the main reason I see the ctime as very different from the
atime or mtime.
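
A small userspace sketch of that difference, assuming a POSIX system
with utimensat(): the mtime can be set to an arbitrary value, while the
ctime still ends up reflecting the system clock at the time of the call.
demo_backdate_mtime is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Set the mtime of "path" to "when" (seconds since the epoch), leave
 * the atime untouched, and report the resulting attributes in *st.
 * Returns 0 on success, -1 on failure.
 */
static int demo_backdate_mtime(const char *path, time_t when, struct stat *st)
{
	struct timespec times[2] = {
		{ .tv_sec = 0, .tv_nsec = UTIME_OMIT },	/* atime: unchanged */
		{ .tv_sec = when, .tv_nsec = 0 },	/* mtime: caller's value */
	};

	if (utimensat(AT_FDCWD, path, times, 0) != 0)
		return -1;
	return stat(path, st);
}
```

After a successful call, st->st_mtime holds the caller's value, and
st->st_ctime is expected to be close to the current time, since there is
no argument through which a ctime could be supplied.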
-- 
Jeff Layton 



Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t

2024-07-02 Thread Jeff Layton
On Tue, 2024-07-02 at 05:04 -0700, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 07:44:19AM -0400, Jeff Layton wrote:
> > Complaining about it is fairly simple. We could just throw a pr_warn in
> > inode_set_ctime_to_ts when the time comes back as KTIME_MAX. This might
> > also be what we need to do for filesystems like NFS, where a future
> > ctime on the server is not necessarily a problem for the client.
> > 
> > Refusing to load the inode on disk-based filesystems is harder, but is
> > probably possible. There are ~90 calls to inode_set_ctime_to_ts in the
> > kernel, so we'd need to vet the places that are loading times from disk
> > images or the like and fix them to return errors in this situation.
> > 
> > Is warning acceptable, or do we really need to reject inodes that have
> > corrupt timestamps like this?
> 
> inode_set_ctime_to_ts should return an error if things are out of range.

Currently it just returns the timespec64 we're setting it to (which
makes it easy to do several assignments), so we'd need to change its
prototype to handle this case, and fix up the callers to recognize the
error.

Alternately it may be easier to just add in a test for when
__i_ctime == KTIME_MAX in the appropriate callers and have them error
out. I'll have a look and see what makes sense.

> How do we currently catch this when it comes from userland?
> 

Not sure I understand this question. ctime values should never come
from userland. They should only ever come from the system clock.
-- 
Jeff Layton 



Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t

2024-07-02 Thread Jeff Layton
On Tue, 2024-07-02 at 12:19 +0200, Jan Kara wrote:
> On Tue 02-07-24 05:56:37, Jeff Layton wrote:
> > On Tue, 2024-07-02 at 00:37 -0700, Christoph Hellwig wrote:
> > > On Mon, Jul 01, 2024 at 08:22:07PM -0400, Jeff Layton wrote:
> > > > 2) the filesystem has been altered (fuzzing? deliberate doctoring?).
> > > > 
> > > > None of these seem like legitimate use cases so I'm arguing that we
> > > > shouldn't worry about them.
> > > 
> > > Not worry seems like the wrong answer here.  Either we decide they
> > > are legitimate enough and we preserve them, or we decide they are
> > > bogus and refuse reading the inode.  But we'll need to consciously
> > > deal with the case.
> > > 
> > 
> > Is there a problem with consciously dealing with it by clamping the
> > time at KTIME_MAX? If I had a fs with corrupt timestamps, the last
> > thing I'd want is the filesystem refusing to let me at my data because
> > of them.
> 
> Well, you could also view it differently: If I have a fs that corrupts time
> stamps, the last thing I'd like is that the kernel silently accepts it
> without telling me about it :)
> 

Fair enough.

> But more seriously, my filesystem development experience shows that if the
> kernel silently tries to accept and fixup the breakage, it is nice in the
> short term (no complaining users) but it tends to get ugly in the long term
> (where people tend to come up with nasty cases where it was wrong to fix it
> up). So I think Christoph's idea of refusing to load inodes with ctimes out
> of range makes sense. Or at least complain about it if nothing else (which
> has some precedent in the year 2038 problem).

Complaining about it is fairly simple. We could just throw a pr_warn in
inode_set_ctime_to_ts when the time comes back as KTIME_MAX. This might
also be what we need to do for filesystems like NFS, where a future
ctime on the server is not necessarily a problem for the client.

Refusing to load the inode on disk-based filesystems is harder, but is
probably possible. There are ~90 calls to inode_set_ctime_to_ts in the
kernel, so we'd need to vet the places that are loading times from disk
images or the like and fix them to return errors in this situation.

Is warning acceptable, or do we really need to reject inodes that have
corrupt timestamps like this?
-- 
Jeff Layton 



Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t

2024-07-02 Thread Jeff Layton
On Tue, 2024-07-02 at 00:37 -0700, Christoph Hellwig wrote:
> On Mon, Jul 01, 2024 at 08:22:07PM -0400, Jeff Layton wrote:
> > 2) the filesystem has been altered (fuzzing? deliberate doctoring?).
> > 
> > None of these seem like legitimate use cases so I'm arguing that we
> > shouldn't worry about them.
> 
> Not worry seems like the wrong answer here.  Either we decide they
> are legitimate enough and we preserve them, or we decide they are
> bogus and refuse reading the inode.  But we'll need to consciously
> deal with the case.
> 

Is there a problem with consciously dealing with it by clamping the
time at KTIME_MAX? If I had a fs with corrupt timestamps, the last
thing I'd want is the filesystem refusing to let me at my data because
of them.
-- 
Jeff Layton 



Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t

2024-07-01 Thread Jeff Layton
On Mon, 2024-07-01 at 15:49 -0700, Darrick J. Wong wrote:
> On Wed, Jun 26, 2024 at 09:00:21PM -0400, Jeff Layton wrote:
> > The ctime is not settable to arbitrary values. It always comes from the
> > system clock, so we'll never stamp an inode with a value that can't be
> > represented there. If we disregard people setting their system clock
> > past the year 2262, there is no reason we can't replace the ctime fields
> > with a ktime_t.
> > 
> > Switch the ctime fields to a single ktime_t. Move the i_generation down
> > above i_fsnotify_mask and then move the i_version into the resulting 8
> > byte hole. This shrinks struct inode by 8 bytes total, and should
> > improve the cache footprint as the i_version and ctime are usually
> > updated together.
> > 
> > The one downside I can see to switching to a ktime_t is that if someone
> > has a filesystem with files on it that has ctimes outside the ktime_t
> > range (before ~1678 AD or after ~2262 AD), we won't be able to display
> > them properly in stat() without some special treatment in the
> > filesystem. The operating assumption here is that that is not a
> > practical problem.
> 
> What happens with a filesystem that can store ctimes beyond whatever
> ktime_t supports (AFAICT 2^63-1 nanoseconds on either side of the Unix
> epoch)?  I think the behavior with your patch is that ktime_set clamps
> the ctime on iget because the kernel can't handle it?
> 
> It's a little surprising that the ctime will suddenly jump back in time
> to 2262, but maybe you're right that nobody will notice or care? ;)
> 
> 

Yeah, it'd be clamped at KTIME_MAX when we pull in the inode from disk,
a'la ktime_set.
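
As a rough illustration of where the 2262 limit comes from, here is a
userspace sketch modelled on the upper-bound clamp in ktime_set(). The
DEMO_* and demo_* names are hypothetical stand-ins, the year estimate
uses an average Gregorian year of 31,556,952 seconds, and only the upper
bound is modelled.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC  1000000000LL
#define DEMO_KTIME_MAX     INT64_MAX	/* 2^63 - 1 nanoseconds */
#define DEMO_KTIME_SEC_MAX (DEMO_KTIME_MAX / DEMO_NSEC_PER_SEC)

/* Convert (secs, nsecs) to a signed ns count, clamping at the maximum. */
static int64_t demo_ktime_set(int64_t secs, unsigned long nsecs)
{
	if (secs >= DEMO_KTIME_SEC_MAX)
		return DEMO_KTIME_MAX;
	return secs * DEMO_NSEC_PER_SEC + (int64_t)nsecs;
}

/* Approximate calendar year in which the representable range ends. */
static int demo_ktime_max_year(void)
{
	const int64_t secs_per_year = 31556952;	/* 365.2425 days */

	return 1970 + (int)(DEMO_KTIME_SEC_MAX / secs_per_year);
}
```

2^63 - 1 nanoseconds is roughly 292 years, so any on-disk ctime later
than about 2262 would collapse to the same maximum value.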

I think it's important to note that the ctime is not settable from
userland, so if an on-disk ctime is outside of the ktime_t range, there
are only two possibilities:

1) the system clock was set to some time (far) in the future when the
file's metadata was last altered (bad clock? time traveling fs?).

...or...

2) the filesystem has been altered (fuzzing? deliberate doctoring?).

None of these seem like legitimate use cases so I'm arguing that we
shouldn't worry about them.

(...ok maybe the time travel one could be legit, but someone needs to
step up and make a case for it, if so.)

> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  include/linux/fs.h | 26 +++---
> >  1 file changed, 11 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 5ff362277834..5139dec085f2 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -662,11 +662,10 @@ struct inode {
> > loff_t  i_size;
> > time64_ti_atime_sec;
> > time64_ti_mtime_sec;
> > -   time64_ti_ctime_sec;
> > u32 i_atime_nsec;
> > u32 i_mtime_nsec;
> > -   u32 i_ctime_nsec;
> > -   u32 i_generation;
> > +   ktime_t __i_ctime;
> > +   atomic64_t  i_version;
> > spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
> > unsigned short  i_bytes;
> > u8  i_blkbits;
> > @@ -701,7 +700,6 @@ struct inode {
> > struct hlist_head   i_dentry;
> > struct rcu_head i_rcu;
> > };
> > -   atomic64_t  i_version;
> > atomic64_t  i_sequence; /* see futex */
> > atomic_ti_count;
> > atomic_ti_dio_count;
> > @@ -724,6 +722,8 @@ struct inode {
> > };
> >  
> >  
> > +   u32 i_generation;
> > +
> >  #ifdef CONFIG_FSNOTIFY
> > __u32   i_fsnotify_mask; /* all events this inode cares 
> > about */
> > /* 32-bit hole reserved for expanding i_fsnotify_mask */
> > @@ -1608,29 +1608,25 @@ static inline struct timespec64 
> > inode_set_mtime(struct inode *inode,
> > return inode_set_mtime_to_ts(inode, ts);
> >  }
> >  
> > -static inline time64_t inode_get_ctime_sec(const struct inode *inode)
> > +static inline struct timespec64 inode_get_ctime(const struct inode *inode)
> >  {
> > -   return inode->i_ctime_sec;
> > +   return ktime_to_timespec64(inode->__i_ctime);
> >  }
> >  
> > -static inline long inode_get_ctime_nsec(const struct inode *inode)
> > +static inline time64_t inode_get_ctime_sec(const st

Re: [PATCH v2 00/11] fs: multigrain timestamp redux

2024-07-01 Thread Jeff Layton
On Mon, 2024-07-01 at 09:53 -0400, Josef Bacik wrote:
> On Mon, Jul 01, 2024 at 06:26:36AM -0400, Jeff Layton wrote:
> > This set is essentially unchanged from the last one, aside from the
> > new file in Documentation/. I had a review comment from Andi Kleen
> > suggesting that the ctime_floor should be per time_namespace, but I
> > think that's incorrect as the realtime clock is not namespaced.
> > 
> > At LSF/MM this year, we had a discussion about the inode change
> > attribute. At the time I mentioned that I thought I could salvage the
> > multigrain timestamp work that had to be reverted last year [1].  That
> > version had to be reverted because it was possible for a file to get a
> > coarse grained timestamp that appeared to be earlier than another file
> > that had recently gotten a fine-grained stamp.
> > 
> > This version corrects the problem by establishing a per-time_namespace
> > ctime_floor value that should prevent this from occurring. In the above
> > situation that was problematic before, the two files might end up with
> > the same timestamp value, but they won't appear to have been modified in
> > the wrong order.
> > 
> > That problem was discovered by the test-stat-time gnulib test. Note that
> > that test still fails on multigrain timestamps, but that's because its
> > method of determining the minimum delay that will show a timestamp
> > change will no longer work with multigrain timestamps. I have a patch to
> > change the testcase to use a different method that I've posted to the
> > bug-gnulib mailing list.
> > 
> > The big question with this set is whether the performance will be
> > suitable. The testing I've done seems to show performance parity with
> > multigrain timestamps enabled, but it's hard to rule this out regressing
> > some workload.
> > 
> > This set is based on top of Christian's vfs.misc branch (which has the
> > earlier change to track inode timestamps as discrete integers). If there
> > are no major objections, I'd like to let this soak in linux-next for a
> > bit to see if any problems shake out.
> > 
> > [1]: 
> > https://lore.kernel.org/linux-fsdevel/20230807-mgctime-v7-0-d1dec143a...@kernel.org/
> > 
> > Signed-off-by: Jeff Layton 
> 
> I have a few nits that need to be addressed, but you can add
> 
> Reviewed-by: Josef Bacik 
> 
> to the series once they're addressed.  Thanks,
> 

Thanks! Fixed them up in my tree. I left the IS_I_VERSION check out as
well, and added a note to the changelog on the btrfs patch.
-- 
Jeff Layton 



Re: [PATCH v2 09/11] btrfs: convert to multigrain timestamps

2024-07-01 Thread Jeff Layton
On Mon, 2024-07-01 at 09:49 -0400, Josef Bacik wrote:
> On Mon, Jul 01, 2024 at 06:26:45AM -0400, Jeff Layton wrote:
> > Enable multigrain timestamps, which should ensure that there is an
> > apparent change to the timestamp whenever it has been written after
> > being actively observed via getattr.
> > 
> > Beyond enabling the FS_MGTIME flag, this patch eliminates
> > update_time_for_write, which goes to great pains to avoid in-memory
> > stores. Just have it overwrite the timestamps unconditionally.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/btrfs/file.c  | 25 -
> >  fs/btrfs/super.c |  3 ++-
> >  2 files changed, 6 insertions(+), 22 deletions(-)
> > 
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index d90138683a0a..409628c0c3cc 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1120,26 +1120,6 @@ void btrfs_check_nocow_unlock(struct
> > btrfs_inode *inode)
> >     btrfs_drew_write_unlock(&inode->root->snapshot_lock);
> >  }
> >  
> > -static void update_time_for_write(struct inode *inode)
> > -{
> > -   struct timespec64 now, ts;
> > -
> > -   if (IS_NOCMTIME(inode))
> > -   return;
> > -
> > -   now = current_time(inode);
> > -   ts = inode_get_mtime(inode);
> > -   if (!timespec64_equal(&ts, &now))
> > -   inode_set_mtime_to_ts(inode, now);
> > -
> > -   ts = inode_get_ctime(inode);
> > -   if (!timespec64_equal(&ts, &now))
> > -   inode_set_ctime_to_ts(inode, now);
> > -
> > -   if (IS_I_VERSION(inode))
> > -   inode_inc_iversion(inode);
> > -}
> > -
> >  static int btrfs_write_check(struct kiocb *iocb, struct iov_iter
> > *from,
> >      size_t count)
> >  {
> > @@ -1171,7 +1151,10 @@ static int btrfs_write_check(struct kiocb
> > *iocb, struct iov_iter *from,
> >  * need to start yet another transaction to update the
> > inode as we will
> >  * update the inode when we finish writing whatever data
> > we write.
> >  */
> > -   update_time_for_write(inode);
> > +   if (!IS_NOCMTIME(inode)) {
> > +   inode_set_mtime_to_ts(inode,
> > inode_set_ctime_current(inode));
> > +   inode_inc_iversion(inode);
> 
> You've dropped the
> 
> if (IS_I_VERSION(inode))
> 
> check here, and it doesn't appear to be in inode_inc_iversion.  Is
> there a
> reason for this?  Thanks,
> 

AFAICT, btrfs always sets SB_I_VERSION. Are there any cases where it
isn't? If so, then I can put this check back. I'll make a note about it
in the changelog if not.

-- 
Jeff Layton 



[PATCH v2 11/11] Documentation: add a new file documenting multigrain timestamps

2024-07-01 Thread Jeff Layton
Add a high-level document that describes how multigrain timestamps work,
rationale for them, and some info about implementation and tradeoffs.

Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/multigrain-ts.rst | 126 
 1 file changed, 126 insertions(+)

diff --git a/Documentation/filesystems/multigrain-ts.rst 
b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index 000000000000..beef7f79108c
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,126 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+============
+Historically, the kernel has always used coarse time values to stamp
+inodes. This value is updated on every jiffy, so any change that happens
+within that jiffy will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first gets
+the current time and then compares it to the existing timestamp(s) to see
+whether anything will change. If nothing changed, then it can avoid updating
+the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since they
+reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen in a
+jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps can
+make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, and many just populate this with
+the ctime.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the current
+coarse-grained time does not cause a change.
+
+Inode Timestamps
+================
+There are currently 3 timestamps in the inode that are updated to the current
+wallclock time on different activity:
+
+ctime:
+  The inode change time. This is stamped with the current time whenever
+  the inode's metadata is changed. Note that this value is not settable
+  from userland.
+
+mtime:
+  The inode modification time. This is stamped with the current time
+  any time a file's contents change.
+
+atime:
+  The inode access time. This is stamped whenever an inode's contents are
+  read. Widely considered to be a terrible mistake. Usually avoided with
+  options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
+not affected and always use the coarse-grained value (subject to the floor).
+
+Inode Timestamp Ordering
+========================
+
+In addition to just providing info about changes to individual files, file
+timestamps also serve an important purpose in applications like "make". These
+programs measure timestamps in order to determine whether source files might be
+newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first. The
+same is true if two threads are issuing similar operations that do not overlap
+in time.
+
+If however, two threads have racing syscalls that overlap in time, then there
+is no such guarantee, and the second file may appear to have been modified
+before, after or at the same time as the first, regardless of which one was
+submitted first.
+
+Multigrain Timestamps
+=====================
+Multigrain timestamps are aimed at ensuring that changes to a single file are
+always recognizable, without violating the ordering guarantees when multiple
+different files are modified. This affects the mtime and the ctime, but the
+atime will always use coarse-grained timestamps.
+
+It uses the lowest-order bit in the timestamp as a flag that indicates whether
+the mtime or ctime have been queried. If either or both have, then the kernel
+takes special care to ensure the next timestamp update will display a visible
+change. This ensures tight cache coherency for use-cases like NFS, without
+sacrificing the benefits of reduced metadata updates when files aren't being
+watched.
+
+The ctime Floor Value
+=====================
+It's not sufficient to simply use fine or coarse-grained timestamps based on
+whether the mtime or ctime has been queried. A file could get a fine grained
+timestamp, and 

[PATCH v2 10/11] tmpfs: add support for multigrain timestamps

2024-07-01 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

tmpfs only requires the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 8cdd27db042b..60a8e05eed34 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4653,7 +4653,7 @@ static struct file_system_type shmem_fs_type = {
.parameters = shmem_fs_parameters,
 #endif
.kill_sb= kill_litter_super,
-   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 
 void __init shmem_init(void)

-- 
2.45.2




[PATCH v2 09/11] btrfs: convert to multigrain timestamps

2024-07-01 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have it overwrite the timestamps unconditionally.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/super.c |  3 ++-
 2 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d90138683a0a..409628c0c3cc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1120,26 +1120,6 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
btrfs_drew_write_unlock(&inode->root->snapshot_lock);
 }
 
-static void update_time_for_write(struct inode *inode)
-{
-   struct timespec64 now, ts;
-
-   if (IS_NOCMTIME(inode))
-   return;
-
-   now = current_time(inode);
-   ts = inode_get_mtime(inode);
-   if (!timespec64_equal(&ts, &now))
-   inode_set_mtime_to_ts(inode, now);
-
-   ts = inode_get_ctime(inode);
-   if (!timespec64_equal(&ts, &now))
-   inode_set_ctime_to_ts(inode, now);
-
-   if (IS_I_VERSION(inode))
-   inode_inc_iversion(inode);
-}
-
 static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 size_t count)
 {
@@ -1171,7 +1151,10 @@ static int btrfs_write_check(struct kiocb *iocb, struct 
iov_iter *from,
 * need to start yet another transaction to update the inode as we will
 * update the inode when we finish writing whatever data we write.
 */
-   update_time_for_write(inode);
+   if (!IS_NOCMTIME(inode)) {
+   inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+   inode_inc_iversion(inode);
+   }
 
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f05cce7c8b8d..1cd50293b98d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2173,7 +2173,8 @@ static struct file_system_type btrfs_fs_type = {
.init_fs_context= btrfs_init_fs_context,
.parameters = btrfs_fs_parameters,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | 
FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+ FS_ALLOW_IDMAP | FS_MGTIME,
  };
 
 MODULE_ALIAS_FS("btrfs");

-- 
2.45.2




[PATCH v2 08/11] ext4: switch to multigrain timestamps

2024-07-01 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

For ext4, we only need to enable the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c682fb927b64..9ae48763f81f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -7310,7 +7310,7 @@ static struct file_system_type ext4_fs_type = {
.init_fs_context= ext4_init_fs_context,
.parameters = ext4_param_specs,
.kill_sb= ext4_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("ext4");
 

-- 
2.45.2




[PATCH v2 07/11] xfs: switch to multigrain timestamps

2024-07-01 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.

Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_trans_inode.c | 6 +++---
 fs/xfs/xfs_iops.c   | 6 --
 fs/xfs/xfs_super.c  | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 69fc5b981352..1f3639bbf5f0 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -62,12 +62,12 @@ xfs_trans_ichgtime(
ASSERT(tp);
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
 
-   tv = current_time(inode);
+   /* If the mtime changes, then ctime must also change */
+   ASSERT(flags & XFS_ICHGTIME_CHG);
 
+   tv = inode_set_ctime_current(inode);
if (flags & XFS_ICHGTIME_MOD)
inode_set_mtime_to_ts(inode, tv);
-   if (flags & XFS_ICHGTIME_CHG)
-   inode_set_ctime_to_ts(inode, tv);
if (flags & XFS_ICHGTIME_CREATE)
ip->i_crtime = tv;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff222827e550..ed6e6d9507df 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -590,10 +590,12 @@ xfs_vn_getattr(
stat->gid = vfsgid_into_kgid(vfsgid);
stat->ino = ip->i_ino;
stat->atime = inode_get_atime(inode);
-   stat->mtime = inode_get_mtime(inode);
-   stat->ctime = inode_get_ctime(inode);
+
+   fill_mg_cmtime(stat, request_mask, inode);
+
stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
 
+
if (xfs_has_v3inodes(mp)) {
if (request_mask & STATX_BTIME) {
stat->result_mask |= STATX_BTIME;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..210481b03fdb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
.init_fs_context= xfs_init_fs_context,
.parameters = xfs_fs_parameters,
.kill_sb= xfs_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("xfs");
 

-- 
2.45.2




[PATCH v2 06/11] fs: have setattr_copy handle multigrain timestamps appropriately

2024-07-01 Thread Jeff Layton
The setattr codepath is still using coarse-grained timestamps, even on
multigrain filesystems. To fix this, we need to fetch the timestamp for
ctime updates later, at the point where the assignment occurs in
setattr_copy.

On a multigrain inode, ignore the ia_ctime in the attrs, and always
update the ctime to the current clock value. Update the atime and mtime
with the same value (if needed) unless they are being set to other
specific values, a la utimes().

Note that we don't want to do this universally however, as some
filesystems (e.g. most networked fs) want to do an explicit update
elsewhere before updating the local inode.

Signed-off-by: Jeff Layton 
---
 fs/attr.c | 52 ++--
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 825007d5cda4..e03ea6951864 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -271,6 +271,42 @@ int inode_newsize_ok(const struct inode *inode, loff_t 
offset)
 }
 EXPORT_SYMBOL(inode_newsize_ok);
 
+/**
+ * setattr_copy_mgtime - update timestamps for mgtime inodes
+ * @inode: inode timestamps to be updated
+ * @attr: attrs for the update
+ *
+ * With multigrain timestamps, we need to take more care to prevent races
+ * when updating the ctime. Always update the ctime to the very latest
+ * using the standard mechanism, and use that to populate the atime and
+ * mtime appropriately (unless we're setting those to specific values).
+ */
+static void setattr_copy_mgtime(struct inode *inode, const struct iattr *attr)
+{
+   unsigned int ia_valid = attr->ia_valid;
+   struct timespec64 now;
+
+   /*
+* If the ctime isn't being updated then nothing else should be
+* either.
+*/
+   if (!(ia_valid & ATTR_CTIME)) {
+   WARN_ON_ONCE(ia_valid & (ATTR_ATIME|ATTR_MTIME));
+   return;
+   }
+
+   now = inode_set_ctime_current(inode);
+   if (ia_valid & ATTR_ATIME_SET)
+   inode_set_atime_to_ts(inode, attr->ia_atime);
+   else if (ia_valid & ATTR_ATIME)
+   inode_set_atime_to_ts(inode, now);
+
+   if (ia_valid & ATTR_MTIME_SET)
+   inode_set_mtime_to_ts(inode, attr->ia_mtime);
+   else if (ia_valid & ATTR_MTIME)
+   inode_set_mtime_to_ts(inode, now);
+}
+
 /**
  * setattr_copy - copy simple metadata updates into the generic inode
  * @idmap: idmap of the mount the inode was found from
@@ -303,12 +339,6 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode 
*inode,
 
i_uid_update(idmap, attr, inode);
i_gid_update(idmap, attr, inode);
-   if (ia_valid & ATTR_ATIME)
-   inode_set_atime_to_ts(inode, attr->ia_atime);
-   if (ia_valid & ATTR_MTIME)
-   inode_set_mtime_to_ts(inode, attr->ia_mtime);
-   if (ia_valid & ATTR_CTIME)
-   inode_set_ctime_to_ts(inode, attr->ia_ctime);
if (ia_valid & ATTR_MODE) {
umode_t mode = attr->ia_mode;
if (!in_group_or_capable(idmap, inode,
@@ -316,6 +346,16 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode 
*inode,
mode &= ~S_ISGID;
inode->i_mode = mode;
}
+
+   if (is_mgtime(inode))
+   return setattr_copy_mgtime(inode, attr);
+
+   if (ia_valid & ATTR_ATIME)
+   inode_set_atime_to_ts(inode, attr->ia_atime);
+   if (ia_valid & ATTR_MTIME)
+   inode_set_mtime_to_ts(inode, attr->ia_mtime);
+   if (ia_valid & ATTR_CTIME)
+   inode_set_ctime_to_ts(inode, attr->ia_ctime);
 }
 EXPORT_SYMBOL(setattr_copy);
 

-- 
2.45.2




[PATCH v2 05/11] fs: add percpu counters to count fine vs. coarse timestamps

2024-07-01 Thread Jeff Layton
Keep a pair of percpu counters so we can track what proportion of
timestamps is fine-grained.
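
For anyone wanting to watch the ratio from userspace, a minimal sketch
follows. It assumes debugfs is mounted at its usual location, so that the
file created by this patch shows up as
/sys/kernel/debug/multigrain_timestamps and contains a single
"<fine> <coarse>" line as printed by mgts_show(). The sketch_* helper
names are made up for illustration and are not part of the patch.

```c
#include <stdio.h>

/*
 * Parse one "<fine> <coarse>" line, the layout mgts_show() emits.
 * Returns 0 on success, -1 if the line does not hold two counters.
 * (Hypothetical helper, not part of the patch.)
 */
static int sketch_parse_mgts(const char *line, unsigned long long *fine,
			     unsigned long long *coarse)
{
	return sscanf(line, "%llu %llu", fine, coarse) == 2 ? 0 : -1;
}

/* Fraction of timestamp updates that needed a fine-grained value. */
static double sketch_fine_fraction(unsigned long long fine,
				   unsigned long long coarse)
{
	unsigned long long total = fine + coarse;

	return total ? (double)fine / (double)total : 0.0;
}
```

Reading the file with fgets() and feeding the line to sketch_parse_mgts()
is enough to log the fraction before and after a test workload.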

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 12790a26102c..5b5a1a8c0bb7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #define CREATE_TRACE_POINTS
 #include 
@@ -64,6 +66,10 @@ static __cacheline_aligned_in_smp 
DEFINE_SPINLOCK(inode_hash_lock);
 
 /* Don't send out a ctime lower than this (modulo backward clock jumps). */
 static __cacheline_aligned_in_smp ktime_t ctime_floor;
+
+static struct percpu_counter mg_fine_ts;
+static struct percpu_counter mg_coarse_ts;
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2636,6 +2642,9 @@ struct timespec64 inode_set_ctime_current(struct inode 
*inode)
trace_ctime_floor_update(inode, floor, now, old);
if (old != floor)
now = old;
+   percpu_counter_inc(&mg_fine_ts);
+   } else {
+   percpu_counter_inc(&mg_coarse_ts);
}
 retry:
/* Try to swap the ctime into place. */
@@ -2711,3 +2720,32 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap,
return mode & ~S_ISGID;
 }
 EXPORT_SYMBOL(mode_strip_sgid);
+
+static int mgts_show(struct seq_file *s, void *p)
+{
+   u64 fine = percpu_counter_sum(&mg_fine_ts);
+   u64 coarse = percpu_counter_sum(&mg_coarse_ts);
+
+   seq_printf(s, "%llu %llu\n", fine, coarse);
+   return 0;
+}
+
+DEFINE_SHOW_ATTRIBUTE(mgts);
+
+static int __init mg_debugfs_init(void)
+{
+   int ret = percpu_counter_init(&mg_fine_ts, 0, GFP_KERNEL);
+
+   if (ret)
+   return ret;
+
+   ret = percpu_counter_init(&mg_coarse_ts, 0, GFP_KERNEL);
+   if (ret) {
+   percpu_counter_destroy(&mg_fine_ts);
+   return ret;
+   }
+
+   debugfs_create_file("multigrain_timestamps", S_IFREG | S_IRUGO, NULL, 
NULL, &mgts_fops);
+   return 0;
+}
+late_initcall(mg_debugfs_init);

-- 
2.45.2




[PATCH v2 04/11] fs: add infrastructure for multigrain timestamps

2024-07-01 Thread Jeff Layton
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per jiffy,
even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g. backup
applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. Now that the ctime is stored as a ktime_t, we
can sacrifice the lowest bit in the word to act as a flag marking
whether the current timestamp has been queried via stat() or the like.
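
As a rough userspace illustration of the flag-bit idea (this is not the
kernel code: the ctime is modeled as a plain int64_t nanosecond count, and
the SKETCH_* and sketch_* names are invented for the example):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag name: the lowest bit of the nanosecond count. */
#define SKETCH_CTIME_QUERIED ((int64_t)1)

/* Store a new timestamp; the flag bit always starts out clear. */
static inline int64_t sketch_ctime_store(int64_t ns)
{
	return ns & ~SKETCH_CTIME_QUERIED;
}

/* What a stat()-like reader does: note the query, return the clean value. */
static inline int64_t sketch_ctime_query(int64_t *raw)
{
	*raw |= SKETCH_CTIME_QUERIED;
	return *raw & ~SKETCH_CTIME_QUERIED;
}

/* Has anyone looked at this timestamp since it was last stored? */
static inline bool sketch_ctime_was_queried(int64_t raw)
{
	return (raw & SKETCH_CTIME_QUERIED) != 0;
}
```

The cost is one nanosecond of resolution: a stored value of 1001 is
presented as 1000, which is the same trade the last paragraph of this
changelog describes for on-disk ctimes.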

This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible for a
file being changed to get a fine-grained timestamp and then a file that
was altered later to get a coarse-grained one that appears older than
the earlier fine-grained time. To remedy this, keep a global ktime_t
value that acts as a timestamp floor.

When we go to stamp a file, we first get the later of the current floor
value and the current coarse-grained time (call this "now"). If the
current inode ctime hasn't been queried then we just attempt to stamp it
with that value using a cmpxchg() operation.

If it has been queried, then first see whether the current coarse time
appears later than what we have. If it does, then we accept that value.
If it doesn't, then we get a fine-grained time and try to swap that into
the global floor. Whether that succeeds or fails, we take the resulting
floor time and try to swap that into the ctime.
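
The two paragraphs above can be sketched in userspace with C11 atomics.
This is only an approximation under stated assumptions: the coarse and fine
clock readings are passed in as parameters so the behaviour is
deterministic, timestamps are plain int64_t nanosecond counts, the
sketch_* names are made up, and the retry that the real code performs when
a racing reader merely sets the queried flag is left out.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_QUERIED ((int64_t)1)	/* hypothetical flag bit */

static _Atomic int64_t sketch_floor;	/* global floor, starts at 0 */

/* Stamp *ctime and return the value that ended up in place. */
static int64_t sketch_stamp(_Atomic int64_t *ctime, int64_t coarse,
			    int64_t fine)
{
	int64_t floor = atomic_load(&sketch_floor);
	int64_t cur = atomic_load(ctime);
	/* "now" is the later of the floor and the coarse clock */
	int64_t now = (coarse > floor ? coarse : floor) & ~SKETCH_QUERIED;

	if ((cur & SKETCH_QUERIED) && now <= (cur & ~SKETCH_QUERIED)) {
		/* Observed, and "now" would not show a change: go fine. */
		int64_t want = fine & ~SKETCH_QUERIED;
		int64_t seen = floor;

		if (atomic_compare_exchange_strong(&sketch_floor, &seen, want))
			now = want;
		else
			now = seen;	/* someone else moved the floor */
	}

	/* Try to swap it in; on a lost race, accept the other value. */
	if (!atomic_compare_exchange_strong(ctime, &cur, now))
		return cur & ~SKETCH_QUERIED;
	return now;
}
```

With a coarse reading of 2000 and a queried ctime of 2000, a call with a
fine reading of 2700 stamps 2700 and raises the floor to 2700. A later
stamp of a different, unqueried inode with the coarse clock still at 2000
then also gets 2700, so it cannot appear older than the first file.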

There is still one remaining problem:

All of this works as long as the realtime clock is monotonically
increasing. If the clock ever jumps backwards, then we could end up in a
situation where the floor value is "stuck" far in advance of the clock.

To remedy this, sanity check the floor value and if it's more than 6ms
(~2 jiffies) ahead of the current coarse-grained clock, disregard the
floor value, and just accept the current coarse-grained clock.
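
A small userspace sketch of that sanity check, assuming nanosecond counts
in int64_t and taking the coarse clock reading as a parameter instead of
reading a real clock. The sketch_* names are illustrative only; the 6ms
limit is the one given above.

```c
#include <stdint.h>

/* 6ms expressed in nanoseconds, per the description above */
#define SKETCH_FLOOR_MAX_NS INT64_C(6000000)

/*
 * Return the later of the floor and the coarse clock, unless the floor
 * is implausibly far ahead, in which case assume the clock jumped
 * backward and ignore the floor.
 */
static inline int64_t sketch_coarse_ctime(int64_t floor, int64_t coarse_now)
{
	/* Coarse time already newer than the floor: use it. */
	if (floor < coarse_now)
		return coarse_now;

	/* Floor too far in the future: disregard it. */
	if (floor > coarse_now + SKETCH_FLOOR_MAX_NS)
		return coarse_now;

	return floor;
}
```

A floor 3ms ahead of the coarse clock is honoured, while one 7ms ahead is
treated as stale and dropped.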

Filesystems opt into this by setting the FS_MGTIME fstype flag.  One
caveat: those that do will always present ctimes that have the lowest
bit unset, even when the on-disk ctime has it set.

Signed-off-by: Jeff Layton 
---
 fs/inode.c   | 168 +--
 fs/stat.c|  39 -
 include/linux/fs.h   |  30 +++
 include/trace/events/timestamp.h |  97 ++
 4 files changed, 306 insertions(+), 28 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 5d2b0dfe48c3..12790a26102c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -62,6 +62,8 @@ static unsigned int i_hash_shift __ro_after_init;
 static struct hlist_head *inode_hashtable __ro_after_init;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
+/* Don't send out a ctime lower than this (modulo backward clock jumps). */
+static __cacheline_aligned_in_smp ktime_t ctime_floor;
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2077,19 +2079,86 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);
 
+/*
+ * The coarse-grained clock ticks once per jiffy (every 2ms or so). If the
+ * current floor is >6ms in the future, assume that the clock has jumped
+ * backward.
+ */
+#define CTIME_FLOOR_MAX_NS 6000000
+
+/**
+ * coarse_ctime - return the current coarse-grained time
+ * @floor: current ctime_floor value
+ *
+ * Get the coarse-grained time, and then determine whether to
+ * return it or the current floor value. Returns the later of the
+ * floor and coarse grained time, unless the floor value is too
+ * far into the future. If that happens, assume the clock has jumped
+ * backward, and that the floor should be ignored.
+ */
+static ktime_t coarse_ctime(ktime_t floor)
+{
+   ktime_t now = ktime_get_coarse_real() & ~I_CTIME_QUERIED;
+
+   /* If coarse time is already newer, return that */
+   if (ktime_before(floor, now))
+   return now;
+
+   /* Ensure floor is not _too_ far in the future */
+   if (ktime_after(floor, now + CTIME_FLOOR_MAX_NS))
+   return now;
+
+   return floor;
+}
+
+/**
+ * current_time - Return FS

[PATCH v2 03/11] fs: tracepoints for inode_needs_update_time and inode_set_ctime_to_ts

2024-07-01 Thread Jeff Layton
Add a new tracepoint for when we're testing whether the timestamps need
updating, and around the update itself.

Signed-off-by: Jeff Layton 
---
 fs/inode.c   |  4 +++
 include/trace/events/timestamp.h | 76 
 2 files changed, 80 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 7b0a73ed499d..5d2b0dfe48c3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,8 @@
 #include 
 #include 
 #include 
+#define CREATE_TRACE_POINTS
+#include 
 #include "internal.h"
 
 /*
@@ -2096,6 +2098,7 @@ static int inode_needs_update_time(struct inode *inode)
if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
sync_it |= S_VERSION;
 
+   trace_inode_needs_update_time(inode, &now, &ts, sync_it);
return sync_it;
 }
 
@@ -2522,6 +2525,7 @@ EXPORT_SYMBOL(inode_get_ctime);
 struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 
ts)
 {
inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
+   trace_inode_set_ctime_to_ts(inode, &ts);
return ts;
 }
 EXPORT_SYMBOL(inode_set_ctime_to_ts);
diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
new file mode 100644
index 000000000000..35ff875d3800
--- /dev/null
+++ b/include/trace/events/timestamp.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM timestamp
+
+#if !defined(_TRACE_TIMESTAMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TIMESTAMP_H
+
+#include 
+#include 
+
+TRACE_EVENT(inode_needs_update_time,
+   TP_PROTO(struct inode *inode,
+struct timespec64 *now,
+struct timespec64 *ctime,
+int sync_it),
+
+   TP_ARGS(inode, now, ctime, sync_it),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(time64_t,   now_sec)
+   __field(time64_t,   ctime_sec)
+   __field(long,   now_nsec)
+   __field(long,   ctime_nsec)
+   __field(int,sync_it)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->sync_it= sync_it;
+   __entry->now_sec= now->tv_sec;
+   __entry->ctime_sec  = ctime->tv_sec;
+   __entry->now_nsec   = now->tv_nsec;
+   __entry->ctime_nsec = ctime->tv_nsec;
+   __entry->sync_it= sync_it;
+   ),
+
+   TP_printk("ino=%d:%d:%ld sync_it=%d now=%llu.%ld ctime=%llu.%lu",
+   MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
+   __entry->sync_it,
+   __entry->now_sec, __entry->now_nsec,
+   __entry->ctime_sec, __entry->ctime_nsec
+   )
+);
+
+TRACE_EVENT(inode_set_ctime_to_ts,
+   TP_PROTO(struct inode *inode,
+struct timespec64 *ts),
+
+   TP_ARGS(inode, ts),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(time64_t,   ts_sec)
+   __field(long,   ts_nsec)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->ts_sec = ts->tv_sec;
+   __entry->ts_nsec= ts->tv_nsec;
+   ),
+
+   TP_printk("ino=%d:%d:%ld ts=%llu.%lu",
+   MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
+   __entry->ts_sec, __entry->ts_nsec
+   )
+);
+#endif /* _TRACE_TIMESTAMP_H */
+
+/* This part must be outside protection */
+#include 

-- 
2.45.2




[PATCH v2 02/11] fs: uninline inode_get_ctime and inode_set_ctime_to_ts

2024-07-01 Thread Jeff Layton
Move both functions to fs/inode.c as they have grown a little large for
inlining.

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 25 +
 include/linux/fs.h | 13 ++---
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index e0815acc5abb..7b0a73ed499d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2501,6 +2501,31 @@ struct timespec64 current_time(struct inode *inode)
 }
 EXPORT_SYMBOL(current_time);
 
+/**
+ * inode_get_ctime - fetch the current ctime from the inode
+ * @inode: inode from which to fetch ctime
+ *
+ * Grab the current ctime tv_nsec field from the inode, mask off the
+ * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
+ * internal consumers of the ctime that aren't concerned with ensuring a
+ * fine-grained update on the next change (e.g. when preparing to store
+ * the value in the backing store for later retrieval).
+ */
+struct timespec64 inode_get_ctime(const struct inode *inode)
+{
+   ktime_t ctime = inode->__i_ctime;
+
+   return ktime_to_timespec64(ctime);
+}
+EXPORT_SYMBOL(inode_get_ctime);
+
+struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 
ts)
+{
+   inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
+   return ts;
+}
+EXPORT_SYMBOL(inode_set_ctime_to_ts);
+
 /**
  * inode_set_ctime_current - set the ctime to current_time
  * @inode: inode
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7110d6dc9aab..8e271c9e4a00 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1608,10 +1608,8 @@ static inline struct timespec64 inode_set_mtime(struct 
inode *inode,
return inode_set_mtime_to_ts(inode, ts);
 }
 
-static inline struct timespec64 inode_get_ctime(const struct inode *inode)
-{
-   return ktime_to_timespec64(inode->__i_ctime);
-}
+struct timespec64 inode_get_ctime(const struct inode *inode);
+struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 
ts);
 
 static inline time64_t inode_get_ctime_sec(const struct inode *inode)
 {
@@ -1623,13 +1621,6 @@ static inline long inode_get_ctime_nsec(const struct 
inode *inode)
return inode_get_ctime(inode).tv_nsec;
 }
 
-static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
- struct timespec64 ts)
-{
-   inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
-   return ts;
-}
-
 /**
  * inode_set_ctime - set the ctime in the inode
  * @inode: inode in which to set the ctime

-- 
2.45.2




[PATCH v2 01/11] fs: turn inode ctime fields into a single ktime_t

2024-07-01 Thread Jeff Layton
The ctime is not settable to arbitrary values. It always comes from the
system clock, so we'll never stamp an inode with a value that can't be
represented there. If we disregard people setting their system clock
past the year 2262, there is no reason we can't replace the ctime fields
with a ktime_t.

Switch the ctime fields to a single ktime_t. Move the i_generation down
above i_fsnotify_mask and then move the i_version into the resulting 8
byte hole. This shrinks struct inode by 8 bytes total, and should
improve the cache footprint as the i_version and ctime are usually
updated together.

The one downside I can see to switching to a ktime_t is that if someone
has a filesystem with files on it that has ctimes outside the ktime_t
range (before ~1678 AD or after ~2262 AD), we won't be able to display
them properly in stat() without some special treatment in the
filesystem. The operating assumption here is that that is not a
practical problem.
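
The ~1678 and ~2262 bounds fall out of simple arithmetic on a signed
64-bit nanosecond count measured from 1970. A quick sketch of that
calculation, assuming a mean Gregorian year of 365.2425 days (the SKETCH_*
names are made up for the example):

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC INT64_C(1000000000)
#define SKETCH_SEC_PER_YEAR INT64_C(31556952)	/* 365.2425 days */

/* Whole years representable on either side of the epoch. */
static inline int64_t sketch_ktime_span_years(void)
{
	return INT64_MAX / SKETCH_NSEC_PER_SEC / SKETCH_SEC_PER_YEAR;
}
```

This yields 292 whole years, hence roughly 1970 - 292 = 1678 and
1970 + 292 = 2262.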

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 26 +++---
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2fa06a4d197a..7110d6dc9aab 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -662,11 +662,10 @@ struct inode {
loff_t  i_size;
time64_ti_atime_sec;
time64_ti_mtime_sec;
-   time64_ti_ctime_sec;
u32 i_atime_nsec;
u32 i_mtime_nsec;
-   u32 i_ctime_nsec;
-   u32 i_generation;
+   ktime_t __i_ctime;
+   atomic64_t  i_version;
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
u8  i_blkbits;
@@ -701,7 +700,6 @@ struct inode {
struct hlist_head   i_dentry;
struct rcu_head i_rcu;
};
-   atomic64_t  i_version;
atomic64_t  i_sequence; /* see futex */
atomic_ti_count;
atomic_ti_dio_count;
@@ -724,6 +722,8 @@ struct inode {
};
 
 
+   u32 i_generation;
+
 #ifdef CONFIG_FSNOTIFY
__u32   i_fsnotify_mask; /* all events this inode cares 
about */
/* 32-bit hole reserved for expanding i_fsnotify_mask */
@@ -1608,29 +1608,25 @@ static inline struct timespec64 inode_set_mtime(struct 
inode *inode,
return inode_set_mtime_to_ts(inode, ts);
 }
 
-static inline time64_t inode_get_ctime_sec(const struct inode *inode)
+static inline struct timespec64 inode_get_ctime(const struct inode *inode)
 {
-   return inode->i_ctime_sec;
+   return ktime_to_timespec64(inode->__i_ctime);
 }
 
-static inline long inode_get_ctime_nsec(const struct inode *inode)
+static inline time64_t inode_get_ctime_sec(const struct inode *inode)
 {
-   return inode->i_ctime_nsec;
+   return inode_get_ctime(inode).tv_sec;
 }
 
-static inline struct timespec64 inode_get_ctime(const struct inode *inode)
+static inline long inode_get_ctime_nsec(const struct inode *inode)
 {
-   struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
-.tv_nsec = inode_get_ctime_nsec(inode) };
-
-   return ts;
+   return inode_get_ctime(inode).tv_nsec;
 }
 
 static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
  struct timespec64 ts)
 {
-   inode->i_ctime_sec = ts.tv_sec;
-   inode->i_ctime_nsec = ts.tv_nsec;
+   inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
return ts;
 }
 

-- 
2.45.2
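
For readers wondering where the "~1678 AD to ~2262 AD" range in the commit
message above comes from: it follows from a signed 64-bit nanosecond count
centred on the Unix epoch. The sketch below is plain userspace C, not kernel
code. The helper names and the average Gregorian year length are assumptions
made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Average Gregorian year (365.2425 days) in seconds; an approximation. */
#define SECONDS_PER_YEAR 31556952LL
#define NSEC_PER_SEC     1000000000LL

/*
 * Whole years that a signed 64-bit nanosecond counter can cover on either
 * side of the Unix epoch (1970-01-01).
 */
int64_t ktime_span_years(void)
{
	int64_t max_seconds = INT64_MAX / NSEC_PER_SEC;
	int64_t years = max_seconds / SECONDS_PER_YEAR;

	assert(years > 0);
	return years;
}

/* Earliest and latest calendar years that are fully representable. */
int64_t ktime_first_year(void)
{
	return 1970 - ktime_span_years();
}

int64_t ktime_last_year(void)
{
	return 1970 + ktime_span_years();
}
```

Calling ktime_first_year() and ktime_last_year() gives the two boundary years
quoted in the commit message, which is why on-disk ctimes outside that window
would need special handling in the filesystem.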




[PATCH v2 00/11] fs: multigrain timestamp redux

2024-07-01 Thread Jeff Layton
This set is essentially unchanged from the last one, aside from the
new file in Documentation/. I had a review comment from Andi Kleen
suggesting that the ctime_floor should be per time_namespace, but I
think that's incorrect as the realtime clock is not namespaced.

At LSF/MM this year, we had a discussion about the inode change
attribute. At the time I mentioned that I thought I could salvage the
multigrain timestamp work that had to be reverted last year [1].  That
version had to be reverted because it was possible for a file to get a
coarse grained timestamp that appeared to be earlier than another file
that had recently gotten a fine-grained stamp.

This version corrects the problem by establishing a global
ctime_floor value that should prevent this from occurring. In the above
situation that was problematic before, the two files might end up with
the same timestamp value, but they won't appear to have been modified in
the wrong order.

That problem was discovered by the test-stat-time gnulib test. Note that
that test still fails on multigrain timestamps, but that's because its
method of determining the minimum delay that will show a timestamp
change will no longer work with multigrain timestamps. I have a patch to
change the testcase to use a different method that I've posted to the
bug-gnulib mailing list.

The big question with this set is whether the performance will be
suitable. The testing I've done seems to show performance parity with
multigrain timestamps enabled, but it's hard to rule out this regressing
some workload.

This set is based on top of Christian's vfs.misc branch (which has the
earlier change to track inode timestamps as discrete integers). If there
are no major objections, I'd like to let this soak in linux-next for a
bit to see if any problems shake out.

[1]: https://lore.kernel.org/linux-fsdevel/20230807-mgctime-v7-0-d1dec143a...@kernel.org/

Signed-off-by: Jeff Layton 
---
Changes in v2:
- Added Documentation file
- Link to v1: https://lore.kernel.org/r/20240626-mgtime-v1-0-a189352d0...@kernel.org

---
Jeff Layton (11):
  fs: turn inode ctime fields into a single ktime_t
  fs: uninline inode_get_ctime and inode_set_ctime_to_ts
  fs: tracepoints for inode_needs_update_time and inode_set_ctime_to_ts
  fs: add infrastructure for multigrain timestamps
  fs: add percpu counters to count fine vs. coarse timestamps
  fs: have setattr_copy handle multigrain timestamps appropriately
  xfs: switch to multigrain timestamps
  ext4: switch to multigrain timestamps
  btrfs: convert to multigrain timestamps
  tmpfs: add support for multigrain timestamps
  Documentation: add a new file documenting multigrain timestamps

 Documentation/filesystems/multigrain-ts.rst | 126 
 fs/attr.c   |  52 ++-
 fs/btrfs/file.c |  25 +---
 fs/btrfs/super.c|   3 +-
 fs/ext4/super.c |   2 +-
 fs/inode.c  | 221 +---
 fs/stat.c   |  39 -
 fs/xfs/libxfs/xfs_trans_inode.c |   6 +-
 fs/xfs/xfs_iops.c   |   6 +-
 fs/xfs/xfs_super.c  |   2 +-
 include/linux/fs.h  |  61 +---
 include/trace/events/timestamp.h| 173 ++
 mm/shmem.c  |   2 +-
 13 files changed, 639 insertions(+), 79 deletions(-)
---
base-commit: 2e8c78ef85682671dae2ac3a5aa039b07be0fc0b
change-id: 20240626-mgtime-5cd80b18d810

Best regards,
-- 
Jeff Layton 




Re: [PATCH 04/10] fs: add infrastructure for multigrain timestamps

2024-06-27 Thread Jeff Layton
On Thu, 2024-06-27 at 11:02 -0400, Chuck Lever wrote:
> On Wed, Jun 26, 2024 at 09:00:24PM -0400, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamps when updating the ctime
> > and mtime after a change. This has the benefit of allowing filesystems
> > to optimize away a lot of metadata updates, down to around 1 per jiffy,
> > even when a file is under heavy writes.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. A lot of changes
> > can happen in a jiffy, so timestamps aren't sufficient to help the
> > client decide to invalidate the cache. Even with NFSv4, a lot of
> > exported filesystems don't properly support a change attribute and are
> > subject to the same problems with timestamp granularity. Other
> > applications have similar issues with timestamps (e.g. backup
> > applications).
> > 
> > If we were to always use fine-grained timestamps, that would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem would have to log a lot more metadata updates.
> > 
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried. Now that the ctime is stored as a ktime_t, we
> > can sacrifice the lowest bit in the word to act as a flag marking
> > whether the current timestamp has been queried via stat() or the like.
> > 
> > This solves the problem of being able to distinguish the timestamp
> > between updates, but introduces a new problem: it's now possible for a
> > file being changed to get a fine-grained timestamp and then a file that
> > was altered later to get a coarse-grained one that appears older than
> > the earlier fine-grained time. To remedy this, keep a global ktime_t
> > value that acts as a timestamp floor.
> > 
> > When we go to stamp a file, we first get the later of the current floor
> > value and the current coarse-grained time (call this "now"). If the
> > current inode ctime hasn't been queried then we just attempt to stamp it
> > with that value using a cmpxchg() operation.
> > 
> > If it has been queried, then first see whether the current coarse time
> > appears later than what we have. If it does, then we accept that value.
> > If it doesn't, then we get a fine-grained time and try to swap that into
> > the global floor. Whether that succeeds or fails, we take the resulting
> > floor time and try to swap that into the ctime.
> > 
> > There is still one remaining problem:
> > 
> > All of this works as long as the realtime clock is monotonically
> > increasing. If the clock ever jumps backwards, then we could end up in a
> > situation where the floor value is "stuck" far in advance of the clock.
> > 
> > To remedy this, sanity check the floor value and if it's more than 6ms
> > (~2 jiffies) ahead of the current coarse-grained clock, disregard the
> > floor value, and just accept the current coarse-grained clock.
> > 
> > Filesystems opt into this by setting the FS_MGTIME fstype flag.  One
> > caveat: those that do will always present ctimes that have the lowest
> > bit unset, even when the on-disk ctime has it set.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/inode.c   | 168 
> > +--
> >  fs/stat.c    |  39 -
> >  include/linux/fs.h   |  30 +++
> >  include/trace/events/timestamp.h |  97 ++
> >  4 files changed, 306 insertions(+), 28 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 5d2b0dfe48c3..12790a26102c 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -62,6 +62,8 @@ static unsigned int i_hash_shift __ro_after_init;
> >  static struct hlist_head *inode_hashtable __ro_after_init;
> >  static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
> >  
> > +/* Don't send out a ctime lower than this (modulo backward clock jumps). */
> > +static __cacheline_aligned_in_smp ktime_t ctime_floor;
> 
> This is piece of memory that will be hit pretty hard (and you
> obviously recognize that because of the alignment attribute).
> 
> Would it be of any benefit to keep a distinct ctime_floor in each
> super block instead?
> 

Good question. Dave Chinner suggested the same thing, but I think it's
a potential problem:

The first series had to be reverted because inodes that had been
modified in order could appear to be modi

[PATCH 10/10] tmpfs: add support for multigrain timestamps

2024-06-26 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

tmpfs only requires the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ff7c756a7d02..d650f48444e0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4653,7 +4653,7 @@ static struct file_system_type shmem_fs_type = {
.parameters = shmem_fs_parameters,
 #endif
.kill_sb= kill_litter_super,
-   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 
 void __init shmem_init(void)

-- 
2.45.2
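
The behaviour described in the commit message above ("an apparent change to
the timestamp whenever it has been written after being actively observed via
getattr") can be probed from userspace. The following is only a rough sketch:
the helper names are made up for illustration, and what it reports depends on
the kernel and filesystem it runs on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Three-way comparison of two timespecs: <0, 0 or >0. */
int ts_cmp(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/*
 * Write, observe the ctime with fstat(), write again, observe again.
 * Returns 1 if the second ctime is later than the first, 0 if it is
 * not, and -1 on any I/O error.
 */
int ctime_advances_after_observed_write(const char *path)
{
	struct stat before, after;
	int ret = -1;
	int fd;

	assert(path != NULL);

	fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	if (write(fd, "a", 1) == 1 && fstat(fd, &before) == 0 &&
	    write(fd, "b", 1) == 1 && fstat(fd, &after) == 0)
		ret = ts_cmp(&after.st_ctim, &before.st_ctim) > 0;

	close(fd);
	return ret;
}
```

On a filesystem without FS_MGTIME, both writes will often land in the same
jiffy and the helper is likely to return 0. With multigrain timestamps the
intent of this series is that it returns 1, since the first fstat() marks the
ctime as queried.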




[PATCH 09/10] btrfs: convert to multigrain timestamps

2024-06-26 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have btrfs_write_check overwrite the timestamps unconditionally.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/super.c |  3 ++-
 2 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e764ac3f22e2..89b3c200c374 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1120,26 +1120,6 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
btrfs_drew_write_unlock(&inode->root->snapshot_lock);
 }
 
-static void update_time_for_write(struct inode *inode)
-{
-   struct timespec64 now, ts;
-
-   if (IS_NOCMTIME(inode))
-   return;
-
-   now = current_time(inode);
-   ts = inode_get_mtime(inode);
-   if (!timespec64_equal(&ts, &now))
-   inode_set_mtime_to_ts(inode, now);
-
-   ts = inode_get_ctime(inode);
-   if (!timespec64_equal(&ts, &now))
-   inode_set_ctime_to_ts(inode, now);
-
-   if (IS_I_VERSION(inode))
-   inode_inc_iversion(inode);
-}
-
 static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 size_t count)
 {
@@ -1171,7 +1151,10 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 * need to start yet another transaction to update the inode as we will
 * update the inode when we finish writing whatever data we write.
 */
-   update_time_for_write(inode);
+   if (!IS_NOCMTIME(inode)) {
+   inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+   inode_inc_iversion(inode);
+   }
 
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f05cce7c8b8d..1cd50293b98d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2173,7 +2173,8 @@ static struct file_system_type btrfs_fs_type = {
.init_fs_context= btrfs_init_fs_context,
.parameters = btrfs_fs_parameters,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+ FS_ALLOW_IDMAP | FS_MGTIME,
  };
 
 MODULE_ALIAS_FS("btrfs");

-- 
2.45.2




[PATCH 08/10] ext4: switch to multigrain timestamps

2024-06-26 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

For ext4, we only need to enable the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c682fb927b64..9ae48763f81f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -7310,7 +7310,7 @@ static struct file_system_type ext4_fs_type = {
.init_fs_context= ext4_init_fs_context,
.parameters = ext4_param_specs,
.kill_sb= ext4_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("ext4");
 

-- 
2.45.2




[PATCH 07/10] xfs: switch to multigrain timestamps

2024-06-26 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.

Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_trans_inode.c | 6 +++---
 fs/xfs/xfs_iops.c   | 6 --
 fs/xfs/xfs_super.c  | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 69fc5b981352..1f3639bbf5f0 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -62,12 +62,12 @@ xfs_trans_ichgtime(
ASSERT(tp);
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
 
-   tv = current_time(inode);
+   /* If the mtime changes, then ctime must also change */
+   ASSERT(flags & XFS_ICHGTIME_CHG);
 
+   tv = inode_set_ctime_current(inode);
if (flags & XFS_ICHGTIME_MOD)
inode_set_mtime_to_ts(inode, tv);
-   if (flags & XFS_ICHGTIME_CHG)
-   inode_set_ctime_to_ts(inode, tv);
if (flags & XFS_ICHGTIME_CREATE)
ip->i_crtime = tv;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff222827e550..ed6e6d9507df 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -590,10 +590,12 @@ xfs_vn_getattr(
stat->gid = vfsgid_into_kgid(vfsgid);
stat->ino = ip->i_ino;
stat->atime = inode_get_atime(inode);
-   stat->mtime = inode_get_mtime(inode);
-   stat->ctime = inode_get_ctime(inode);
+
+   fill_mg_cmtime(stat, request_mask, inode);
+
stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
 
+
if (xfs_has_v3inodes(mp)) {
if (request_mask & STATX_BTIME) {
stat->result_mask |= STATX_BTIME;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..210481b03fdb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
.init_fs_context= xfs_init_fs_context,
.parameters = xfs_fs_parameters,
.kill_sb= xfs_kill_sb,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("xfs");
 

-- 
2.45.2




[PATCH 06/10] fs: have setattr_copy handle multigrain timestamps appropriately

2024-06-26 Thread Jeff Layton
The setattr codepath is still using coarse-grained timestamps, even on
multigrain filesystems. To fix this, we need to fetch the timestamp for
ctime updates later, at the point where the assignment occurs in
setattr_copy.

On a multigrain inode, ignore the ia_ctime in the attrs, and always
update the ctime to the current clock value. Update the atime and mtime
with the same value (if needed) unless they are being set to other
specific values, à la utimes().

Note that we don't want to do this universally however, as some
filesystems (e.g. most networked fs) want to do an explicit update
elsewhere before updating the local inode.

Signed-off-by: Jeff Layton 
---
 fs/attr.c | 52 ++--
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 825007d5cda4..e03ea6951864 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -271,6 +271,42 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
 }
 EXPORT_SYMBOL(inode_newsize_ok);
 
+/**
+ * setattr_copy_mgtime - update timestamps for mgtime inodes
+ * @inode: inode timestamps to be updated
+ * @attr: attrs for the update
+ *
+ * With multigrain timestamps, we need to take more care to prevent races
+ * when updating the ctime. Always update the ctime to the very latest
+ * using the standard mechanism, and use that to populate the atime and
+ * mtime appropriately (unless we're setting those to specific values).
+ */
+static void setattr_copy_mgtime(struct inode *inode, const struct iattr *attr)
+{
+   unsigned int ia_valid = attr->ia_valid;
+   struct timespec64 now;
+
+   /*
+* If the ctime isn't being updated then nothing else should be
+* either.
+*/
+   if (!(ia_valid & ATTR_CTIME)) {
+   WARN_ON_ONCE(ia_valid & (ATTR_ATIME|ATTR_MTIME));
+   return;
+   }
+
+   now = inode_set_ctime_current(inode);
+   if (ia_valid & ATTR_ATIME_SET)
+   inode_set_atime_to_ts(inode, attr->ia_atime);
+   else if (ia_valid & ATTR_ATIME)
+   inode_set_atime_to_ts(inode, now);
+
+   if (ia_valid & ATTR_MTIME_SET)
+   inode_set_mtime_to_ts(inode, attr->ia_mtime);
+   else if (ia_valid & ATTR_MTIME)
+   inode_set_mtime_to_ts(inode, now);
+}
+
 /**
  * setattr_copy - copy simple metadata updates into the generic inode
  * @idmap: idmap of the mount the inode was found from
@@ -303,12 +339,6 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
 
i_uid_update(idmap, attr, inode);
i_gid_update(idmap, attr, inode);
-   if (ia_valid & ATTR_ATIME)
-   inode_set_atime_to_ts(inode, attr->ia_atime);
-   if (ia_valid & ATTR_MTIME)
-   inode_set_mtime_to_ts(inode, attr->ia_mtime);
-   if (ia_valid & ATTR_CTIME)
-   inode_set_ctime_to_ts(inode, attr->ia_ctime);
if (ia_valid & ATTR_MODE) {
umode_t mode = attr->ia_mode;
if (!in_group_or_capable(idmap, inode,
@@ -316,6 +346,16 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
mode &= ~S_ISGID;
inode->i_mode = mode;
}
+
+   if (is_mgtime(inode))
+   return setattr_copy_mgtime(inode, attr);
+
+   if (ia_valid & ATTR_ATIME)
+   inode_set_atime_to_ts(inode, attr->ia_atime);
+   if (ia_valid & ATTR_MTIME)
+   inode_set_mtime_to_ts(inode, attr->ia_mtime);
+   if (ia_valid & ATTR_CTIME)
+   inode_set_ctime_to_ts(inode, attr->ia_ctime);
 }
 EXPORT_SYMBOL(setattr_copy);
 

-- 
2.45.2




[PATCH 05/10] fs: add percpu counters to count fine vs. coarse timestamps

2024-06-26 Thread Jeff Layton
Keep a pair of percpu counters so we can track what proportion of
timestamps is fine-grained.

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 12790a26102c..18a9d1398773 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #define CREATE_TRACE_POINTS
 #include 
@@ -64,6 +66,11 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
 /* Don't send out a ctime lower than this (modulo backward clock jumps). */
 static __cacheline_aligned_in_smp ktime_t ctime_floor;
+
+/* Keep track of the number of fine vs. coarse timestamp fetches */
+static struct percpu_counter mg_fine_ts;
+static struct percpu_counter mg_coarse_ts;
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2636,6 +2643,9 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
trace_ctime_floor_update(inode, floor, now, old);
if (old != floor)
now = old;
+   percpu_counter_inc(&mg_fine_ts);
+   } else {
+   percpu_counter_inc(&mg_coarse_ts);
}
 retry:
/* Try to swap the ctime into place. */
@@ -2711,3 +2721,32 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap,
return mode & ~S_ISGID;
 }
 EXPORT_SYMBOL(mode_strip_sgid);
+
+static int mgts_show(struct seq_file *s, void *p)
+{
+   u64 fine = percpu_counter_sum(&mg_fine_ts);
+   u64 coarse = percpu_counter_sum(&mg_coarse_ts);
+
+   seq_printf(s, "%llu %llu\n", fine, coarse);
+   return 0;
+}
+
+DEFINE_SHOW_ATTRIBUTE(mgts);
+
+static int __init mg_debugfs_init(void)
+{
+   int ret = percpu_counter_init(&mg_fine_ts, 0, GFP_KERNEL);
+
+   if (ret)
+   return ret;
+
+   ret = percpu_counter_init(&mg_coarse_ts, 0, GFP_KERNEL);
+   if (ret) {
+   percpu_counter_destroy(&mg_fine_ts);
+   return ret;
+   }
+
+   debugfs_create_file("multigrain_timestamps", S_IFREG | S_IRUGO, NULL, NULL, &mgts_fops);
+   return 0;
+}
+late_initcall(mg_debugfs_init);

-- 
2.45.2
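
The debugfs file added by the patch above prints two numbers, fine then
coarse, separated by a space. A userspace consumer might look roughly like
the sketch below. The function names are hypothetical, and the usual debugfs
mount point (/sys/kernel/debug) is an assumption that depends on the system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Parse one "<fine> <coarse>" line; returns 0 on success, -1 otherwise. */
int parse_mg_counts(const char *line, unsigned long long *fine,
		    unsigned long long *coarse)
{
	assert(line != NULL && fine != NULL && coarse != NULL);
	return sscanf(line, "%llu %llu", fine, coarse) == 2 ? 0 : -1;
}

/* Fraction of timestamp fetches that were fine-grained (0.0 if none). */
double fine_fraction(unsigned long long fine, unsigned long long coarse)
{
	unsigned long long total = fine + coarse;

	return total ? (double)fine / (double)total : 0.0;
}

/*
 * Read the counters from a file, for example
 * "/sys/kernel/debug/multigrain_timestamps" when debugfs is mounted in
 * its usual place (assumed). Returns 0 on success, -1 on error.
 */
int read_mg_counts(const char *path, unsigned long long *fine,
		   unsigned long long *coarse)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_mg_counts(buf, fine, coarse);
}
```

Sampling the file before and after a workload and feeding the deltas to
fine_fraction() gives the proportion the commit message talks about.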




[PATCH 04/10] fs: add infrastructure for multigrain timestamps

2024-06-26 Thread Jeff Layton
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per jiffy,
even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g. backup
applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. Now that the ctime is stored as a ktime_t, we
can sacrifice the lowest bit in the word to act as a flag marking
whether the current timestamp has been queried via stat() or the like.
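
As an illustration of the flag-in-the-lowest-bit idea from the paragraph
above, here is a userspace model using C11 atomics. It is only a sketch: the
struct, the helper names and the value chosen for the flag are assumptions,
and the real kernel code works on inode->__i_ctime with I_CTIME_QUERIED.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for I_CTIME_QUERIED: the lowest bit of the word. */
#define DEMO_CTIME_QUERIED ((int64_t)1)

/* Hypothetical miniature inode holding only a nanosecond ctime. */
struct demo_inode {
	_Atomic int64_t ctime_ns;
};

/* stat()-like read: mark the value as queried, return it without the flag. */
int64_t demo_ctime_query(struct demo_inode *inode)
{
	int64_t old;

	assert(inode != NULL);
	old = atomic_fetch_or(&inode->ctime_ns, DEMO_CTIME_QUERIED);
	return old & ~DEMO_CTIME_QUERIED;
}

/* Has anyone looked at the current value since it was stored? */
bool demo_ctime_was_queried(struct demo_inode *inode)
{
	return (atomic_load(&inode->ctime_ns) & DEMO_CTIME_QUERIED) != 0;
}

/* Store a new timestamp; this always clears the flag. */
void demo_ctime_store(struct demo_inode *inode, int64_t ns)
{
	atomic_store(&inode->ctime_ns, ns & ~DEMO_CTIME_QUERIED);
}
```

Note that the stored value loses its lowest bit, which matches the caveat
later in this message that presented ctimes always have that bit unset.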

This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible for a
file being changed to get a fine-grained timestamp and then a file that
was altered later to get a coarse-grained one that appears older than
the earlier fine-grained time. To remedy this, keep a global ktime_t
value that acts as a timestamp floor.

When we go to stamp a file, we first get the later of the current floor
value and the current coarse-grained time (call this "now"). If the
current inode ctime hasn't been queried then we just attempt to stamp it
with that value using a cmpxchg() operation.

If it has been queried, then first see whether the current coarse time
appears later than what we have. If it does, then we accept that value.
If it doesn't, then we get a fine-grained time and try to swap that into
the global floor. Whether that succeeds or fails, we take the resulting
floor time and try to swap that into the ctime.
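
The two paragraphs above can be modelled in userspace roughly as follows.
This is a simplified sketch and not the kernel implementation: the clocks are
passed in as plain numbers so the behaviour is deterministic, all names are
hypothetical, the retry loop around the final swap is reduced to a single
attempt, and the backward-clock-jump check is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for I_CTIME_QUERIED: the lowest bit of the word. */
#define DEMO_QUERIED ((int64_t)1)

/* Caller-supplied clock readings in nanoseconds (not a kernel type). */
struct demo_clock {
	int64_t coarse_ns;
	int64_t fine_ns;
};

/* Models the global ctime_floor. */
static _Atomic int64_t demo_floor;

/* Stamp *ctime and return the value that ended up in it (flag cleared). */
int64_t demo_stamp(_Atomic int64_t *ctime, const struct demo_clock *clk)
{
	int64_t floor, now, cur;

	assert(ctime != NULL && clk != NULL);

	/* "now" is the later of the floor and the coarse-grained time. */
	floor = atomic_load(&demo_floor);
	now = clk->coarse_ns & ~DEMO_QUERIED;
	if (floor > now)
		now = floor;

	cur = atomic_load(ctime);
	if ((cur & DEMO_QUERIED) && now <= (cur & ~DEMO_QUERIED)) {
		/*
		 * The old value was observed and the coarse time would not
		 * show a change: fetch a fine-grained time and try to
		 * install it as the new floor.
		 */
		int64_t fine = clk->fine_ns & ~DEMO_QUERIED;
		int64_t expected = floor;

		if (atomic_compare_exchange_strong(&demo_floor, &expected, fine))
			now = fine;
		else
			now = expected;	/* someone else moved the floor */
	}

	/* Try to swap the chosen value into the ctime (single attempt). */
	if (atomic_compare_exchange_strong(ctime, &cur, now))
		return now;

	/* Lost a race: cur now holds whatever the other writer stored. */
	return cur & ~DEMO_QUERIED;
}
```

With this model, a file that is stamped right after another file received a
fine-grained time picks up the floor value rather than the older coarse
time, so the two files may share a timestamp but never appear out of order.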

There is still one remaining problem:

All of this works as long as the realtime clock is monotonically
increasing. If the clock ever jumps backwards, then we could end up in a
situation where the floor value is "stuck" far in advance of the clock.

To remedy this, sanity check the floor value and if it's more than 6ms
(~2 jiffies) ahead of the current coarse-grained clock, disregard the
floor value, and just accept the current coarse-grained clock.

Filesystems opt into this by setting the FS_MGTIME fstype flag.  One
caveat: those that do will always present ctimes that have the lowest
bit unset, even when the on-disk ctime has it set.

Signed-off-by: Jeff Layton 
---
 fs/inode.c   | 168 +--
 fs/stat.c|  39 -
 include/linux/fs.h   |  30 +++
 include/trace/events/timestamp.h |  97 ++
 4 files changed, 306 insertions(+), 28 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 5d2b0dfe48c3..12790a26102c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -62,6 +62,8 @@ static unsigned int i_hash_shift __ro_after_init;
 static struct hlist_head *inode_hashtable __ro_after_init;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
+/* Don't send out a ctime lower than this (modulo backward clock jumps). */
+static __cacheline_aligned_in_smp ktime_t ctime_floor;
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2077,19 +2079,86 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);
 
+/*
+ * The coarse-grained clock ticks once per jiffy (every 2ms or so). If the
+ * current floor is >6ms in the future, assume that the clock has jumped
+ * backward.
+ */
+#define CTIME_FLOOR_MAX_NS 6000000
+
+/**
+ * coarse_ctime - return the current coarse-grained time
+ * @floor: current ctime_floor value
+ *
+ * Get the coarse-grained time, and then determine whether to
+ * return it or the current floor value. Returns the later of the
+ * floor and coarse grained time, unless the floor value is too
+ * far into the future. If that happens, assume the clock has jumped
+ * backward, and that the floor should be ignored.
+ */
+static ktime_t coarse_ctime(ktime_t floor)
+{
+   ktime_t now = ktime_get_coarse_real() & ~I_CTIME_QUERIED;
+
+   /* If coarse time is already newer, return that */
+   if (ktime_before(floor, now))
+   return now;
+
+   /* Ensure floor is not _too_ far in the future */
+   if (ktime_after(floor, now + CTIME_FLOOR_MAX_NS))
+   return now;
+
+   return floor;
+}
+
+/**
+ * current_time - Return FS

[PATCH 03/10] fs: tracepoints for inode_needs_update_time and inode_set_ctime_to_ts

2024-06-26 Thread Jeff Layton
Add new tracepoints: one for when we're testing whether the timestamps
need updating, and one for the update itself.

Signed-off-by: Jeff Layton 
---
 fs/inode.c   |  4 +++
 include/trace/events/timestamp.h | 76 
 2 files changed, 80 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 7b0a73ed499d..5d2b0dfe48c3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,8 @@
 #include 
 #include 
 #include 
+#define CREATE_TRACE_POINTS
+#include 
 #include "internal.h"
 
 /*
@@ -2096,6 +2098,7 @@ static int inode_needs_update_time(struct inode *inode)
if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
sync_it |= S_VERSION;
 
+   trace_inode_needs_update_time(inode, &now, &ts, sync_it);
return sync_it;
 }
 
@@ -2522,6 +2525,7 @@ EXPORT_SYMBOL(inode_get_ctime);
 struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
 {
inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
+   trace_inode_set_ctime_to_ts(inode, &ts);
return ts;
 }
 EXPORT_SYMBOL(inode_set_ctime_to_ts);
diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
new file mode 100644
index ..35ff875d3800
--- /dev/null
+++ b/include/trace/events/timestamp.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM timestamp
+
+#if !defined(_TRACE_TIMESTAMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TIMESTAMP_H
+
+#include 
+#include 
+
+TRACE_EVENT(inode_needs_update_time,
+   TP_PROTO(struct inode *inode,
+struct timespec64 *now,
+struct timespec64 *ctime,
+int sync_it),
+
+   TP_ARGS(inode, now, ctime, sync_it),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(time64_t,   now_sec)
+   __field(time64_t,   ctime_sec)
+   __field(long,   now_nsec)
+   __field(long,   ctime_nsec)
+   __field(int,sync_it)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->sync_it= sync_it;
+   __entry->now_sec= now->tv_sec;
+   __entry->ctime_sec  = ctime->tv_sec;
+   __entry->now_nsec   = now->tv_nsec;
+   __entry->ctime_nsec = ctime->tv_nsec;
+   __entry->sync_it= sync_it;
+   ),
+
+   TP_printk("ino=%d:%d:%ld sync_it=%d now=%llu.%ld ctime=%llu.%lu",
+   MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
+   __entry->sync_it,
+   __entry->now_sec, __entry->now_nsec,
+   __entry->ctime_sec, __entry->ctime_nsec
+   )
+);
+
+TRACE_EVENT(inode_set_ctime_to_ts,
+   TP_PROTO(struct inode *inode,
+struct timespec64 *ts),
+
+   TP_ARGS(inode, ts),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev)
+   __field(ino_t,  ino)
+   __field(time64_t,   ts_sec)
+   __field(long,   ts_nsec)
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->ts_sec = ts->tv_sec;
+   __entry->ts_nsec= ts->tv_nsec;
+   ),
+
+   TP_printk("ino=%d:%d:%ld ts=%llu.%lu",
+   MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
+   __entry->ts_sec, __entry->ts_nsec
+   )
+);
+#endif /* _TRACE_TIMESTAMP_H */
+
+/* This part must be outside protection */
+#include 

-- 
2.45.2




[PATCH 02/10] fs: uninline inode_get_ctime and inode_set_ctime_to_ts

2024-06-26 Thread Jeff Layton
Move both functions to fs/inode.c as they have grown a little large for
inlining.

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 25 +
 include/linux/fs.h | 13 ++---
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index e0815acc5abb..7b0a73ed499d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2501,6 +2501,31 @@ struct timespec64 current_time(struct inode *inode)
 }
 EXPORT_SYMBOL(current_time);
 
+/**
+ * inode_get_ctime - fetch the current ctime from the inode
+ * @inode: inode from which to fetch ctime
+ *
+ * Grab the current ctime tv_nsec field from the inode, mask off the
+ * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
+ * internal consumers of the ctime that aren't concerned with ensuring a
+ * fine-grained update on the next change (e.g. when preparing to store
+ * the value in the backing store for later retrieval).
+ */
+struct timespec64 inode_get_ctime(const struct inode *inode)
+{
+   ktime_t ctime = inode->__i_ctime;
+
+   return ktime_to_timespec64(ctime);
+}
+EXPORT_SYMBOL(inode_get_ctime);
+
+struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
+{
+   inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
+   return ts;
+}
+EXPORT_SYMBOL(inode_set_ctime_to_ts);
+
 /**
  * inode_set_ctime_current - set the ctime to current_time
  * @inode: inode
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5139dec085f2..4b10db12725d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1608,10 +1608,8 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
return inode_set_mtime_to_ts(inode, ts);
 }
 
-static inline struct timespec64 inode_get_ctime(const struct inode *inode)
-{
-   return ktime_to_timespec64(inode->__i_ctime);
-}
+struct timespec64 inode_get_ctime(const struct inode *inode);
+struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts);
 
 static inline time64_t inode_get_ctime_sec(const struct inode *inode)
 {
@@ -1623,13 +1621,6 @@ static inline long inode_get_ctime_nsec(const struct inode *inode)
return inode_get_ctime(inode).tv_nsec;
 }
 
-static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
- struct timespec64 ts)
-{
-   inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
-   return ts;
-}
-
 /**
  * inode_set_ctime - set the ctime in the inode
  * @inode: inode in which to set the ctime

-- 
2.45.2




[PATCH 01/10] fs: turn inode ctime fields into a single ktime_t

2024-06-26 Thread Jeff Layton
The ctime is not settable to arbitrary values. It always comes from the
system clock, so we'll never stamp an inode with a value that can't be
represented there. If we disregard people setting their system clock
past the year 2262, there is no reason we can't replace the ctime fields
with a ktime_t.

Switch the ctime fields to a single ktime_t. Move the i_generation down
above i_fsnotify_mask and then move the i_version into the resulting 8
byte hole. This shrinks struct inode by 8 bytes total, and should
improve the cache footprint as the i_version and ctime are usually
updated together.

The one downside I can see to switching to a ktime_t is that if someone
has a filesystem with files on it that have ctimes outside the ktime_t
range (before ~1678 AD or after ~2262 AD), we won't be able to display
them properly in stat() without some special treatment in the
filesystem. The operating assumption here is that that is not a
practical problem.
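
As a rough check on those bounds, here is a minimal userspace sketch. It
assumes ktime_t behaves as a signed 64-bit count of nanoseconds relative
to 1970; the helper names are made up for illustration and are not kernel
functions.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000LL
#define SECS_PER_YEAR	31556952LL	/* mean Gregorian year: 365.2425 days */

/* Whole years representable on either side of the 1970 epoch. */
static int ktime_span_years(void)
{
	return (int)(INT64_MAX / NSEC_PER_SEC / SECS_PER_YEAR);
}

/*
 * Conservative check (illustrative only): returns 1 if any (sec, nsec)
 * pair with this sec value fits in a signed 64-bit nanosecond count.
 */
static int fits_in_ktime(int64_t sec, long nsec)
{
	assert(nsec >= 0 && nsec < NSEC_PER_SEC);
	if (sec > INT64_MAX / NSEC_PER_SEC - 1 || sec < INT64_MIN / NSEC_PER_SEC)
		return 0;
	return 1;
}
```

With these assumptions the span works out to 292 whole years, which gives
1970 - 292 = 1678 and 1970 + 292 = 2262, matching the range quoted above.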

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 26 +++---
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5ff362277834..5139dec085f2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -662,11 +662,10 @@ struct inode {
loff_t  i_size;
time64_t i_atime_sec;
time64_t i_mtime_sec;
-   time64_t i_ctime_sec;
u32 i_atime_nsec;
u32 i_mtime_nsec;
-   u32 i_ctime_nsec;
-   u32 i_generation;
+   ktime_t __i_ctime;
+   atomic64_t  i_version;
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
u8  i_blkbits;
@@ -701,7 +700,6 @@ struct inode {
struct hlist_head   i_dentry;
struct rcu_head i_rcu;
};
-   atomic64_t  i_version;
atomic64_t  i_sequence; /* see futex */
atomic_t i_count;
atomic_t i_dio_count;
@@ -724,6 +722,8 @@ struct inode {
};
 
 
+   u32 i_generation;
+
 #ifdef CONFIG_FSNOTIFY
__u32   i_fsnotify_mask; /* all events this inode cares about */
/* 32-bit hole reserved for expanding i_fsnotify_mask */
@@ -1608,29 +1608,25 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
return inode_set_mtime_to_ts(inode, ts);
 }
 
-static inline time64_t inode_get_ctime_sec(const struct inode *inode)
+static inline struct timespec64 inode_get_ctime(const struct inode *inode)
 {
-   return inode->i_ctime_sec;
+   return ktime_to_timespec64(inode->__i_ctime);
 }
 
-static inline long inode_get_ctime_nsec(const struct inode *inode)
+static inline time64_t inode_get_ctime_sec(const struct inode *inode)
 {
-   return inode->i_ctime_nsec;
+   return inode_get_ctime(inode).tv_sec;
 }
 
-static inline struct timespec64 inode_get_ctime(const struct inode *inode)
+static inline long inode_get_ctime_nsec(const struct inode *inode)
 {
-   struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
-.tv_nsec = inode_get_ctime_nsec(inode) };
-
-   return ts;
+   return inode_get_ctime(inode).tv_nsec;
 }
 
 static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
  struct timespec64 ts)
 {
-   inode->i_ctime_sec = ts.tv_sec;
-   inode->i_ctime_nsec = ts.tv_nsec;
+   inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
return ts;
 }
 

-- 
2.45.2




[PATCH 00/10] fs: multigrain timestamp redux

2024-06-26 Thread Jeff Layton
At LSF/MM this year, we had a discussion about the inode change
attribute. At the time I mentioned that I thought I could salvage the
multigrain timestamp work that had to be reverted last year [1].  That
version had to be reverted because it was possible for a file to get a
coarse grained timestamp that appeared to be earlier than another file
that had recently gotten a fine-grained stamp.

This version corrects the problem by establishing a global ctime_floor
value that should prevent this from occurring. In the above situation
that was problematic before, the two files might end up with the same
timestamp value, but they won't appear to have been modified in the
wrong order.
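
To illustrate the idea (this is not the implementation in this series),
here is a simplified single-threaded model of a floor. Only the
ctime_floor name comes from the description above; stamp_fine() and
stamp_coarse() are hypothetical helpers, and the model assumes the precise
clock is always at or ahead of the tick-based one.

```c
#include <assert.h>
#include <stdint.h>

/* Latest fine-grained value handed out so far (model only, no locking). */
static int64_t ctime_floor_ns;

/* Fine-grained stamp: take the precise clock and raise the floor to it. */
static int64_t stamp_fine(int64_t fine_now_ns)
{
	if (fine_now_ns > ctime_floor_ns)
		ctime_floor_ns = fine_now_ns;
	return ctime_floor_ns;
}

/* Coarse stamp: use the cheap tick-based clock, but never go below the floor. */
static int64_t stamp_coarse(int64_t coarse_now_ns)
{
	int64_t ts = coarse_now_ns > ctime_floor_ns ? coarse_now_ns : ctime_floor_ns;

	assert(ts >= ctime_floor_ns);	/* never earlier than a prior fine stamp */
	return ts;
}
```

In this model, a file stamped with stamp_fine(1000500) followed by another
stamped with stamp_coarse(1000000) ends up with equal timestamps rather
than the second one appearing older.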

That problem was discovered by the test-stat-time gnulib test. Note that
that test still fails on multigrain timestamps, but that's because its
method of determining the minimum delay that will show a timestamp
change will no longer work with multigrain timestamps. I have a patch to
change the testcase to use a different method that I will post soon.

The big question with this set is whether the performance will be
suitable. The testing I've done seems to show performance parity with
multigrain timestamps enabled, but it's hard to rule out a regression
in some workload.

This set is based on top of Christian's vfs.misc branch (which has the
earlier change to track inode timestamps as discrete integers). If there
are no major objections, I'd like to let this soak in linux-next for a
bit to see if any problems shake out.

[1]: https://lore.kernel.org/linux-fsdevel/20230807-mgctime-v7-0-d1dec143a...@kernel.org/

Signed-off-by: Jeff Layton 
---
Jeff Layton (10):
  fs: turn inode ctime fields into a single ktime_t
  fs: uninline inode_get_ctime and inode_set_ctime_to_ts
  fs: tracepoints for inode_needs_update_time and inode_set_ctime_to_ts
  fs: add infrastructure for multigrain timestamps
  fs: add percpu counters to count fine vs. coarse timestamps
  fs: have setattr_copy handle multigrain timestamps appropriately
  xfs: switch to multigrain timestamps
  ext4: switch to multigrain timestamps
  btrfs: convert to multigrain timestamps
  tmpfs: add support for multigrain timestamps

 fs/attr.c|  52 +++--
 fs/btrfs/file.c  |  25 +
 fs/btrfs/super.c |   3 +-
 fs/ext4/super.c  |   2 +-
 fs/inode.c   | 222 +++
 fs/stat.c|  39 ++-
 fs/xfs/libxfs/xfs_trans_inode.c  |   6 +-
 fs/xfs/xfs_iops.c|   6 +-
 fs/xfs/xfs_super.c   |   2 +-
 include/linux/fs.h   |  61 +++
 include/trace/events/timestamp.h | 173 ++
 mm/shmem.c   |   2 +-
 12 files changed, 514 insertions(+), 79 deletions(-)
---
base-commit: 33b321ac3a51e590225585f41c7412b86e987a0d
change-id: 20240626-mgtime-5cd80b18d810

Best regards,
-- 
Jeff Layton 




Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace

2018-08-01 Thread Jeff Layton
On Tue, 2018-07-31 at 14:10 -0700, Mark Fasheh wrote:
> /proc/locks does not always print the correct inode number/device pair.
> Update lock_get_status() to use vfs_map_unique_ino_dev() to get the real,
> unique values for userspace.
> 
> Signed-off-by: Mark Fasheh 
> ---
>  fs/locks.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index db7b6917d9c5..3a012df87fd8 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2621,6 +2621,7 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
>   loff_t id, char *pfx)
>  {
>   struct inode *inode = NULL;
> + struct dentry *dentry;
>   unsigned int fl_pid;
>   struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>  
> @@ -2633,8 +2634,10 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
>   if (fl_pid == 0)
>   return;
>  
> - if (fl->fl_file != NULL)
> + if (fl->fl_file != NULL) {
>   inode = locks_inode(fl->fl_file);
> + dentry = file_dentry(fl->fl_file);
> + }
>  
>   seq_printf(f, "%lld:%s ", id, pfx);
>   if (IS_POSIX(fl)) {
> @@ -2681,10 +2684,13 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
>  : (fl->fl_type == F_WRLCK) ? "WRITE" : "READ ");
>   }
>   if (inode) {
> + __u64 ino;
> + dev_t dev;
> +
> + vfs_map_unique_ino_dev(dentry, &ino, &dev);

This code is under a spinlock (blocked_locks_lock or ctx->flc_lock). I
don't think it'll be ok to call ->getattr while holding a spinlock.

>   /* userspace relies on this representation of dev_t */
>   seq_printf(f, "%d %02x:%02x:%ld ", fl_pid,
> - MAJOR(inode->i_sb->s_dev),
> - MINOR(inode->i_sb->s_dev), inode->i_ino);
> + MAJOR(dev), MINOR(dev), inode->i_ino);
>   } else {
>   seq_printf(f, "%d :0 ", fl_pid);
>   }

-- 
Jeff Layton 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] iversion: make inode_cmp_iversion{+raw} return bool instead of s64

2018-01-31 Thread Jeff Layton
On Wed, 2018-01-31 at 08:46 -0800, Linus Torvalds wrote:
> On Wed, Jan 31, 2018 at 4:29 AM, Jeff Layton  wrote:
> > 
> > Do you mind just taking it directly? I don't have anything else queued
> > up for this cycle.
> 
> Done.
> 

Thanks...and also many thanks for spotting the original issue. I agree
that this makes it much harder for the callers to get things wrong (and
is probably much more efficient on some arches, as Ted pointed out).

> I wonder if "false for same, true for different" calling convention
> makes much sense, but it matches the old "0 for same" so obviously
> makes for a smaller diff.
> 
> If it ever ends up confusing people, maybe the sense of that function
> should be reversed, and the name changed to something like
> "same_inode_version()" or something.
> 
> But at least for now the situation seems ok to me,
> 

G. Baroncelli suggested changing the name too, so maybe we should just
go ahead and do it. Let me think on what the best approach is and I may
try to send another patch or PR before the end of the merge window.

Cheers,
-- 
Jeff Layton 


Re: [PATCH v2] iversion: make inode_cmp_iversion{+raw} return bool instead of s64

2018-01-31 Thread Jeff Layton
On Tue, 2018-01-30 at 12:53 -0800, Linus Torvalds wrote:
> Ack. Should I expect this in a future pull request, or take it directly?
> 
> There's no hurry about this, since none of the existing users of that
> function actually do anything but test the return value against zero,
> and nobody saves it into anything but a "bool" (which has magical
> casting properties and does not lose upper bits).
> 
>   Linus

Do you mind just taking it directly? I don't have anything else queued
up for this cycle.

Thanks,
-- 
Jeff Layton 


[PATCH v2] iversion: make inode_cmp_iversion{+raw} return bool instead of s64

2018-01-30 Thread Jeff Layton
From: Jeff Layton 

As Linus points out:

The inode_cmp_iversion{+raw}() functions are pure and utter crap.

Why?

You say that they return 0/negative/positive, but they do so in a
completely broken manner. They return that ternary value as the
sequence number difference in a 's64', which means that if you
actually care about that ternary value, and do the *sane* thing that
the kernel-doc of the function implies is the right thing, you would
do

int cmp = inode_cmp_iversion(inode, old);
if (cmp < 0 ...

and as a result you get code that looks sane, but that doesn't
actually *WORK* right.

Since none of the callers actually care about the ternary value here,
convert the inode_cmp_iversion{+raw} functions to just return a boolean
value (false for matching, true for non-matching).

This matches the existing use of these functions just fine, and makes it
simple to convert them to return a ternary value in the future if we
grow callers that need it.

With this change we can also reimplement inode_cmp_iversion in a simpler
way using inode_peek_iversion.
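
To make the failure mode concrete, here is a userspace sketch. old_cmp(),
new_cmp() and caller_view() are illustrative stand-ins rather than the
kernel helpers, and narrowing an out-of-range s64 to int is
implementation-defined in C, although common compilers simply keep the low
32 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(int) < sizeof(int64_t),
	      "sketch assumes int is narrower than s64");

/* Stand-in for the old helper: signed difference of the two counters. */
static int64_t old_cmp(uint64_t cur, uint64_t old)
{
	/* unsigned subtraction keeps this well defined in ISO C */
	return (int64_t)(cur - old);
}

/* Stand-in for the new helper: false if equal, true if different. */
static bool new_cmp(uint64_t cur, uint64_t old)
{
	return cur != old;
}

/* What a caller following the old kernel-doc might have written. */
static int caller_view(uint64_t cur, uint64_t old)
{
	int cmp = old_cmp(cur, old);	/* drops the upper 32 bits */

	return cmp;
}
```

With counters that differ by exactly 2^32, old_cmp() is non-zero but
caller_view() typically sees 0, i.e. "unchanged". The boolean version has
no such trap because there is nothing to truncate.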

Signed-off-by: Jeff Layton 
---
 include/linux/iversion.h | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 858463fca249..3d2fd06495ec 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -309,13 +309,13 @@ inode_query_iversion(struct inode *inode)
  * @inode: inode to check
  * @old: old value to check against its i_version
  *
- * Compare the current raw i_version counter with a previous one. Returns 0 if
- * they are the same or non-zero if they are different.
+ * Compare the current raw i_version counter with a previous one. Returns false
+ * if they are the same or true if they are different.
  */
-static inline s64
+static inline bool
 inode_cmp_iversion_raw(const struct inode *inode, u64 old)
 {
-   return (s64)inode_peek_iversion_raw(inode) - (s64)old;
+   return inode_peek_iversion_raw(inode) != old;
 }
 
 /**
@@ -323,19 +323,15 @@ inode_cmp_iversion_raw(const struct inode *inode, u64 old)
  * @inode: inode to check
  * @old: old value to check against its i_version
  *
- * Compare an i_version counter with a previous one. Returns 0 if they are
- * the same, a positive value if the one in the inode appears newer than @old,
- * and a negative value if @old appears to be newer than the one in the
- * inode.
+ * Compare an i_version counter with a previous one. Returns false if they are
+ * the same, and true if they are different.
  *
  * Note that we don't need to set the QUERIED flag in this case, as the value
  * in the inode is not being recorded for later use.
  */
-
-static inline s64
+static inline bool
 inode_cmp_iversion(const struct inode *inode, u64 old)
 {
-   return (s64)(inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) -
-  (s64)(old << I_VERSION_QUERIED_SHIFT);
+   return inode_peek_iversion(inode) != old;
 }
 #endif
-- 
2.14.3



Re: [PATCH] iversion: make inode_cmp_iversion{+raw} return bool instead of s64

2018-01-30 Thread Jeff Layton
On Tue, 2018-01-30 at 17:50 +, Trond Myklebust wrote:
> On Tue, 2018-01-30 at 12:31 -0500, Jeff Layton wrote:
> > From: Jeff Layton 
> > 
> > As Linus points out:
> > 
> > The inode_cmp_iversion{+raw}() functions are pure and utter crap.
> > 
> > Why?
> > 
> > You say that they return 0/negative/positive, but they do so in a
> > completely broken manner. They return that ternary value as the
> > sequence number difference in a 's64', which means that if you
> > actually care about that ternary value, and do the *sane* thing
> > that
> > the kernel-doc of the function implies is the right thing, you
> > would
> > do
> > 
> > int cmp = inode_cmp_iversion(inode, old);
> > if (cmp < 0 ...
> > 
> > and as a result you get code that looks sane, but that doesn't
> > actually *WORK* right.
> > 
> > Since none of the callers actually care about the ternary value here,
> > convert the inode_cmp_iversion{+raw} functions to just return a
> > boolean
> > value (false for matching, true for non-matching).
> > 
> > This matches the existing use of these functions just fine, and makes
> > it
> > simple to convert them to return a ternary value in the future if we
> > grow callers that need it.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  include/linux/iversion.h | 20 +---
> >  1 file changed, 9 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 858463fca249..ace32775c5f0 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -309,13 +309,13 @@ inode_query_iversion(struct inode *inode)
> >   * @inode: inode to check
> >   * @old: old value to check against its i_version
> >   *
> > - * Compare the current raw i_version counter with a previous one. Returns 0 if
> > - * they are the same or non-zero if they are different.
> > + * Compare the current raw i_version counter with a previous one. Returns false
> > + * if they are the same or true if they are different.
> >   */
> > -static inline s64
> > +static inline bool
> >  inode_cmp_iversion_raw(const struct inode *inode, u64 old)
> >  {
> > -   return (s64)inode_peek_iversion_raw(inode) - (s64)old;
> > +   return inode_peek_iversion_raw(inode) != old;
> >  }
> >  
> >  /**
> > @@ -323,19 +323,17 @@ inode_cmp_iversion_raw(const struct inode *inode, u64 old)
> >   * @inode: inode to check
> >   * @old: old value to check against its i_version
> >   *
> > - * Compare an i_version counter with a previous one. Returns 0 if they are
> > - * the same, a positive value if the one in the inode appears newer than @old,
> > - * and a negative value if @old appears to be newer than the one in the
> > - * inode.
> > + * Compare an i_version counter with a previous one. Returns false if they are
> > + * the same, and true if they are different.
> >   *
> >   * Note that we don't need to set the QUERIED flag in this case, as the value
> >   * in the inode is not being recorded for later use.
> >   */
> >  
> > -static inline s64
> > +static inline bool
> >  inode_cmp_iversion(const struct inode *inode, u64 old)
> >  {
> > -   return (s64)(inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) -
> > -  (s64)(old << I_VERSION_QUERIED_SHIFT);
> > +   return (inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) !=
> > +   (old << I_VERSION_QUERIED_SHIFT);
> >  }
> 
> Is there any reason why this couldn't just use inode_peek_iversion()
> instead of having to both mask the output from
> inode_peek_iversion_raw() and shift 'old'?

None at all. I'll send a v2.
-- 
Jeff Layton 


[PATCH] iversion: make inode_cmp_iversion{+raw} return bool instead of s64

2018-01-30 Thread Jeff Layton
From: Jeff Layton 

As Linus points out:

The inode_cmp_iversion{+raw}() functions are pure and utter crap.

Why?

You say that they return 0/negative/positive, but they do so in a
completely broken manner. They return that ternary value as the
sequence number difference in a 's64', which means that if you
actually care about that ternary value, and do the *sane* thing that
the kernel-doc of the function implies is the right thing, you would
do

int cmp = inode_cmp_iversion(inode, old);
if (cmp < 0 ...

and as a result you get code that looks sane, but that doesn't
actually *WORK* right.

Since none of the callers actually care about the ternary value here,
convert the inode_cmp_iversion{+raw} functions to just return a boolean
value (false for matching, true for non-matching).

This matches the existing use of these functions just fine, and makes it
simple to convert them to return a ternary value in the future if we
grow callers that need it.

Signed-off-by: Jeff Layton 
---
 include/linux/iversion.h | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 858463fca249..ace32775c5f0 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -309,13 +309,13 @@ inode_query_iversion(struct inode *inode)
  * @inode: inode to check
  * @old: old value to check against its i_version
  *
- * Compare the current raw i_version counter with a previous one. Returns 0 if
- * they are the same or non-zero if they are different.
+ * Compare the current raw i_version counter with a previous one. Returns false
+ * if they are the same or true if they are different.
  */
-static inline s64
+static inline bool
 inode_cmp_iversion_raw(const struct inode *inode, u64 old)
 {
-   return (s64)inode_peek_iversion_raw(inode) - (s64)old;
+   return inode_peek_iversion_raw(inode) != old;
 }
 
 /**
@@ -323,19 +323,17 @@ inode_cmp_iversion_raw(const struct inode *inode, u64 old)
  * @inode: inode to check
  * @old: old value to check against its i_version
  *
- * Compare an i_version counter with a previous one. Returns 0 if they are
- * the same, a positive value if the one in the inode appears newer than @old,
- * and a negative value if @old appears to be newer than the one in the
- * inode.
+ * Compare an i_version counter with a previous one. Returns false if they are
+ * the same, and true if they are different.
  *
  * Note that we don't need to set the QUERIED flag in this case, as the value
  * in the inode is not being recorded for later use.
  */
 
-static inline s64
+static inline bool
 inode_cmp_iversion(const struct inode *inode, u64 old)
 {
-   return (s64)(inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) -
-  (s64)(old << I_VERSION_QUERIED_SHIFT);
+   return (inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) !=
+   (old << I_VERSION_QUERIED_SHIFT);
 }
 #endif
-- 
2.14.3



Re: [GIT PULL] inode->i_version rework for v4.16

2018-01-30 Thread Jeff Layton
On Mon, 2018-01-29 at 13:50 -0800, Linus Torvalds wrote:
> On Mon, Jan 29, 2018 at 4:26 AM, Jeff Layton  wrote:
> > 
> > This pile of patches is a rework of the inode->i_version field. We have
> > traditionally incremented that field on every inode data or metadata
> > change. Typically this increment needs to be logged on disk even when
> > nothing else has changed, which is rather expensive.
> 
> Hmm. I have pulled this, but it is really really broken in one place,
> to the degree that I always went "no, I won't pull this garbage".
> 
> But the breakage is potential, not actual, and can be fixed trivially,
> so I'll let it slide - but I do require it to be fixed. And I require
> people to *think* about it.
> 
> So what's so horribly horribly wrong?
> 
> The inode_cmp_iversion{+raw}() functions are pure and utter crap.
> 
> Why?
> 
> You say that they return 0/negative/positive, but they do so in a
> completely broken manner. They return that ternary value as the
> sequence number difference in a 's64', which means that if you
> actually care about that ternary value, and do the *sane* thing that
> the kernel-doc of the function implies is the right thing, you would
> do
> 
> int cmp = inode_cmp_iversion(inode, old);
> 
> if (cmp < 0 ...
> 
> and as a result you get code that looks sane, but that doesn't
> actually *WORK* right.
> 

My intent here was to have this handle wraparound using the same sort of
method that the time_before/time_after macros do. Obviously, I didn't
document that well.
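
For reference, the time_after()-style trick can be sketched in userspace
like this. counter_after() is a hypothetical helper written for this
example, not something in the series, and it relies on the usual
two's-complement conversion from u64 to s64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper modelled on time_after(): true if a is later than b. */
static bool counter_after(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) < 0;
}

/* A few cases showing where the trick works and where it stops working. */
static void counter_after_examples(void)
{
	assert(counter_after(5, 3));			/* ordinary case */
	assert(!counter_after(3, 5));
	assert(counter_after(2, UINT64_MAX - 1));	/* a wrapped past zero */
	/* More than 2^63 apart: the ordering can no longer be inferred. */
	assert(!counter_after((1ULL << 63) + 1, 0));
}
```

The comparison only stays meaningful while the full 64-bit difference is
kept; storing the raw difference in a narrower type loses that property.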

I want to make sure I understand what's actually broken here though. Is
it only broken when the two values are more than 2**63 apart, or is
there something else more fundamentally wrong here?

> To make it even worse, it will actually work in practice by accident
> in 99.9% of all cases, so now you have
> 
>  (a) subtly buggy code
>  (b) that looks fine
>  (c) and that works in testing
> 
> which is just about the worst possible case for any code. The
> interface is simply garbage that encourages bugs.
> 
> And the bug wouldn't be in the user, the bug would be in this code you
> just sent me. The interface is simply wrong.
> 
> So this absolutely needs to be fixed. I see two fixes:
> 
>  - just return a boolean. That's all that any current user actually
> wants, so the ternary value seems pointless.
> 
>  - make it return an 'int', and not just any int, but -1/0/1. That way
> there is no worry about uses, and if somebody *really* cares about the
> ternary value, they can now use a "switch" statement to get it
> (alternatively, make it return an enum, but whatever).
> 
> That "ternary" function that has 18446744069414584320 incorrect return
> values really is unacceptable.
> 

I think I'll just make it return a boolean value like you suggested
first. I'll send a patch to fix it once I've done some basic testing
with it.

Many thanks,
--
Jeff Layton 


[GIT PULL] inode->i_version rework for v4.16

2018-01-29 Thread Jeff Layton
The following changes since commit 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36:

  Linux 4.15-rc3 (2017-12-10 17:56:26 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git tags/iversion-v4.16-1

for you to fetch changes up to f02a9ad1f15daf4378afeda025a53455f72645dd:

  fs: handle inode->i_version more efficiently (2018-01-29 06:42:21 -0500)


Hi Linus,

This pile of patches is a rework of the inode->i_version field. We have
traditionally incremented that field on every inode data or metadata
change. Typically this increment needs to be logged on disk even when
nothing else has changed, which is rather expensive.

It turns out though that none of the consumers of that field actually
require this behavior. The only real requirement for all of them is that
it be different iff the inode has changed since the last time the field
was checked.

Given that, we can optimize away most of the i_version increments and
avoid dirtying inode metadata when the only change is to the i_version 
and no one is querying it. Queries of the i_version field are rather
rare, so we can help write performance under many common workloads.

This patch series converts existing accesses of the i_version field to a
new API, and then converts all of the in-kernel filesystems to use it.
The last patch in the series then converts the backend implementation to
a scheme that optimizes away a large portion of the metadata updates
when no one is looking at it.
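
As a rough picture of that last step, here is a userspace model using C11
atomics. It is a sketch under assumptions, not the kernel code: the
struct, the helper names and the exact bit layout (a "queried" flag in the
low bit, counter in the remaining bits) are chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUERIED_SHIFT	1
#define QUERIED		(1ULL << (QUERIED_SHIFT - 1))	/* value was observed */
#define INCREMENT	(1ULL << QUERIED_SHIFT)

struct model_inode {
	_Atomic uint64_t i_version;	/* (counter << 1) | QUERIED flag */
};

/* Bump the counter only if forced or if someone queried the current value. */
static bool maybe_inc(struct model_inode *in, bool force)
{
	uint64_t cur = atomic_load(&in->i_version);

	for (;;) {
		uint64_t new;

		if (!force && !(cur & QUERIED))
			return false;	/* nobody looked: skip the update */
		new = (cur & ~QUERIED) + INCREMENT;
		if (atomic_compare_exchange_weak(&in->i_version, &cur, new))
			return true;
		/* the failed exchange reloaded cur; try again */
	}
}

/* Read the counter for later comparison and mark it as queried. */
static uint64_t query(struct model_inode *in)
{
	return atomic_fetch_or(&in->i_version, QUERIED) >> QUERIED_SHIFT;
}

/* Read the counter without marking it. */
static uint64_t peek(struct model_inode *in)
{
	return atomic_load(&in->i_version) >> QUERIED_SHIFT;
}

/* Walk-through: an update is skipped until somebody queries the value. */
static void model_walkthrough(void)
{
	struct model_inode in = { 0 };

	assert(!maybe_inc(&in, false));	/* unobserved: no bump needed */
	assert(query(&in) == 0);	/* observer records 0, flag now set */
	assert(maybe_inc(&in, false));	/* next change must be visible */
	assert(peek(&in) == 1);
	assert(!maybe_inc(&in, false));	/* the bump cleared the flag */
	assert(maybe_inc(&in, true));	/* a forced bump always happens */
	assert(peek(&in) == 2);
}
```

The point of the model is that a writer which returns false from
maybe_inc() has nothing new to log, which is where the savings in metadata
updates would come from.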

In my own testing this series significantly helps performance with small
I/O sizes. I also got this email for Christmas this year from the kernel
test robot (a 244% r/w bandwidth improvement with XFS over DAX, with 4k
writes):

https://lkml.org/lkml/2017/12/25/8

A few of the earlier patches in this pile are also flowing to you via
other trees (mm, integrity, and nfsd trees in particular), so there may
be some minor merge conflicts here. Hopefully they're trivial to
resolve, but let me know if there are problems.

Thanks!
----
Jeff Layton (21):
  lustre: don't set f_version in ll_readdir
  ntfs: remove i_version handling
  fs: new API for handling inode->i_version
  fs: don't take the i_lock in inode_inc_iversion
  fat: convert to new i_version API
  affs: convert to new i_version API
  afs: convert to new i_version API
  btrfs: convert to new i_version API
  exofs: switch to new i_version API
  ext2: convert to new i_version API
  ext4: convert to new i_version API
  nfs: convert to new i_version API
  nfsd: convert to new i_version API
  ocfs2: convert to new i_version API
  ufs: use new i_version API
  xfs: convert to new i_version API
  IMA: switch IMA over to new i_version API
  fs: only set S_VERSION when updating times if necessary
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
  btrfs: only dirty the inode in btrfs_update_time if something was changed
  fs: handle inode->i_version more efficiently

Sascha Hauer (1):
  ima: Use i_version only when filesystem supports it

 drivers/staging/lustre/lustre/llite/dir.c |   3 -
 fs/affs/amigaffs.c|   5 +-
 fs/affs/dir.c |   5 +-
 fs/affs/super.c   |   3 +-
 fs/afs/fsclient.c |   3 +-
 fs/afs/inode.c|   5 +-
 fs/btrfs/delayed-inode.c  |   7 +-
 fs/btrfs/file.c   |   1 +
 fs/btrfs/inode.c  |  12 +-
 fs/btrfs/ioctl.c  |   1 +
 fs/btrfs/tree-log.c   |   4 +-
 fs/btrfs/xattr.c  |   1 +
 fs/exofs/dir.c|   9 +-
 fs/exofs/super.c  |   3 +-
 fs/ext2/dir.c |   9 +-
 fs/ext2/super.c   |   5 +-
 fs/ext4/dir.c |   9 +-
 fs/ext4/inline.c  |   7 +-
 fs/ext4/inode.c   |  13 +-
 fs/ext4/ioctl.c   |   3 +-
 fs/ext4/namei.c   |   5 +-
 fs/ext4/super.c   |   3 +-
 fs/ext4/xattr.c   |   5 +-
 fs/fat/dir.c  |   3 +-
 fs/fat/inode.c|   9 +-
 fs/fat/namei_msdos.c  |   7 +-
 fs/fat/namei_vfat.c   |  22 +-
 fs/inode.c|  11 +-
 fs/nfs/delegation.c   |   3 +-
 fs/nfs/fscache-index.c|   5 +-
 fs/nfs/inode.c|  18 +-
 fs/nfs/nfs4proc.c |  10 +-
 fs/nfs/nfstrace.h 

Re: [PATCH v5 02/19] fs: don't take the i_lock in inode_inc_iversion

2018-01-19 Thread Jeff Layton
On Thu, 2018-01-18 at 16:45 -0500, J. Bruce Fields wrote:
> On Tue, Jan 09, 2018 at 09:10:42AM -0500, Jeff Layton wrote:
> > From: Jeff Layton 
> > 
> > The rationale for taking the i_lock when incrementing this value is
> > lost in antiquity. The readers of the field don't take it (at least
> > not universally), so my assumption is that it was only done here to
> > serialize incrementors.
> > 
> > If that is indeed the case, then we can drop the i_lock from this
> > codepath and treat it as an atomic64_t for the purposes of
> > incrementing it. This allows us to use inode_inc_iversion without
> > any danger of lock inversion.
> > 
> > Note that the read side is not fetched atomically with this change.
> > The assumption here is that that is not a critical issue since the
> > i_version is not fully synchronized with anything else anyway.
> 
> So I guess it's theoretically possible that e.g. if you read while it's
> incrementing from 2^32-1 to 2^32 you could read 0, 1, or 2^32+1?
>
> If so then you could see an i_version value reused and incorrectly
> decide that a file hadn't changed.
> 
> But it's such a tiny case, and I think you convert this to atomic64_t
> later anyway, so, whatever.
> 
> --b.
> 

Shrug...we have that problem with the spinlock in place too. The bottom
line is that reads of this value are not serialized with the increment
at all.
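
To picture the kind of torn read being discussed, here is a small model.
It assumes a machine that loads the 64-bit counter as two independent
32-bit halves while a writer updates it in between; torn_read() is a
made-up helper, and the values a real architecture could expose may
differ.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the high half of one snapshot with the low half of another. */
static uint64_t torn_read(uint64_t hi_from, uint64_t lo_from)
{
	return (hi_from & 0xffffffff00000000ULL) | (lo_from & 0xffffffffULL);
}

/* Counter going from 2^32 - 1 to 2^32 while a reader loads both halves. */
static void torn_read_examples(void)
{
	const uint64_t before = 0xffffffffULL;
	const uint64_t after = 0x100000000ULL;

	assert(torn_read(before, before) == before);	/* consistent read */
	assert(torn_read(after, after) == after);	/* consistent read */
	assert(torn_read(before, after) == 0);		/* old high, new low */
	assert(torn_read(after, before) == 0x1ffffffffULL); /* new high, old low */
}
```

Under this model a reader can see a value that was never stored, including
one (0) that an earlier observer may already have recorded.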

I'm not 100% thrilled with this patch, but I think it's probably better
not to add the i_lock all over the place, even as an interim step in
cleaning this stuff up.

The good news here (as you mention) is that this nastiness gets cleaned
up in the last patch when we convert the thing to an atomic64_t.


> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  include/linux/iversion.h | 7 ---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index d09cc3a08740..5ad9eaa3a9b0 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -104,12 +104,13 @@ inode_set_iversion_queried(struct inode *inode, u64 new)
> >  static inline bool
> >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> >  {
> > -   spin_lock(&inode->i_lock);
> > -   inode->i_version++;
> > -   spin_unlock(&inode->i_lock);
> > +   atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > +
> > +   atomic64_inc(ivp);
> > return true;
> >  }
> >  
> > +
> >  /**
> >   * inode_inc_iversion - forcibly increment i_version
> >   * @inode: inode that needs to be updated
> > -- 
> > 2.14.3

-- 
Jeff Layton 


Re: [PATCH v5 01/19] fs: new API for handling inode->i_version

2018-01-18 Thread Jeff Layton
On Thu, 2018-01-18 at 16:38 -0500, J. Bruce Fields wrote:
> On Tue, Jan 09, 2018 at 09:10:41AM -0500, Jeff Layton wrote:
> > --- /dev/null
> > +++ b/include/linux/iversion.h
> > @@ -0,0 +1,236 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_IVERSION_H
> > +#define _LINUX_IVERSION_H
> > +
> > +#include 
> > +
> > +/*
> > + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > + * appear different to observers if there was a change to the inode's data or
> > + * metadata since it was last queried.
> > + *
> > + * Observers see the i_version as a 64-bit number that never changes.
> 
> I don't understand that sentence.
> 

That's because it's utter nonsense. I noticed that the other day and
fixed it in my tree. It now reads:

* Observers see the i_version as a 64-bit number that never decreases.

> > If it
> > + * remains the same since it was last checked, then nothing has changed in the
> > + * inode. If it's different then something has changed. Observers cannot infer
> > + * anything about the nature or magnitude of the changes from the value, only
> > + * that the inode has changed in some fashion.
> 
> As we've discussed before, there may be brief windows where the first
> two statements aren't quite correct.  I think that would be worth a
> mention if we can keep it concise.  Maybe add something like this?:
> 
>   It may be impractical for filesystems to keep i_version updates
>   atomic with respect to the changes that cause them.  They
>   should, however, guarantee that i_version updates are never
>   visible before the changes that caused them.  Also, i_version
>   updates should never be delayed longer than it takes the
>   original change to reach disk.

That makes sense. I added it in pretty much verbatim. I think we mostly
follow the latter already.

> Or maybe those details are best left to documentation on the relevant
> parts of the api below (maybe inode_maybe_inc_iversion?).
> 
> I dunno if it's also worth mentioning that nfsd doesn't actually use the
> raw i_version--it mixes it with ctime to prevent i_version reuse after
> reboot.  Presumably that doesn't matter to IMA since it doesn't compare
> i_version across reboots.
> 

I think I won't document that here. nfsd is a consumer of i_version.
What it does with it is sort of its own business. Might be good to have
a comment blurb in the nfsd code about it though.

> The documentation here is all very helpful, thanks.

Thanks for all of the suggestions so far!
-- 
Jeff Layton 
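
The observer contract discussed in this message can be illustrated with a
small user-space sketch. It is only an illustration under stated
assumptions: toy_inode, toy_observer and the toy_* helpers are hypothetical
names, not part of the kernel API, and a plain uint64_t stands in for the
real counter. What it shows is that an observer may only test the value it
saved for equality; it learns nothing about what changed or by how much.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode; not the kernel's struct inode. */
struct toy_inode {
	uint64_t i_version;
};

/* Hypothetical observer state: the value seen at the last query. */
struct toy_observer {
	uint64_t last_seen;
};

/* Record the current value so that later changes can be detected. */
static void toy_observe(struct toy_observer *obs, const struct toy_inode *inode)
{
	obs->last_seen = inode->i_version;
}

/*
 * Return true if the inode changed since the last toy_observe().
 * Only equality is tested: the size of the difference carries no meaning.
 */
static bool toy_changed(const struct toy_observer *obs,
			const struct toy_inode *inode)
{
	return obs->last_seen != inode->i_version;
}

/* Small self-check that walks through the contract. */
static void toy_demo(void)
{
	struct toy_inode inode = { .i_version = 7 };
	struct toy_observer obs;

	toy_observe(&obs, &inode);
	assert(!toy_changed(&obs, &inode));	/* same value: nothing changed */

	inode.i_version += 5;			/* some change(s) happened */
	assert(toy_changed(&obs, &inode));	/* different: something changed */

	toy_observe(&obs, &inode);		/* re-query resets the baseline */
	assert(!toy_changed(&obs, &inode));
}
```

Because the observer never looks at the difference between two values, a
filesystem is free to skip increments that nobody could have seen.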
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 00/19] fs: rework and optimize i_version handling in filesystems

2018-01-12 Thread Jeff Layton
On Thu, 2018-01-11 at 12:23 -0800, Liu Bo wrote:
> On Tue, Jan 09, 2018 at 09:10:40AM -0500, Jeff Layton wrote:
> > From: Jeff Layton 
> > 
> > v5:
> > - don't corrupt refcounts stashed in i_version of ext4 xattr inodes
> > - add raw variants of inc and cmp functions, and have nfs use them
> > 
> > v4:
> > - fix SB_LAZYTIME handling in generic_update_time
> > - add memory barriers to patch to convert i_version field to atomic64_t
> > 
> > v3:
> > - move i_version handling functions to new header file
> > - document that the kernel-managed i_version implementation will appear to
> >   increase over time
> > - fix inode_cmp_iversion to handle wraparound correctly
> > 
> > v2:
> > - xfs should use inode_peek_iversion instead of inode_peek_iversion_raw
> > - rework file_update_time patch
> > - don't dirty inode when only S_ATIME is set and SB_LAZYTIME is enabled
> > - better comments and documentation
> > 
> > I think this is now approaching merge readiness.
> > 
> > Special thanks to Jan Kara and Dave Chinner who helped me tighten up the
> > memory barriers in the final patch, and Krzysztof Kozlowski for help in
> > tracking down a set of bugs in the NFS client patch.
> > 
> > tl;dr: I think we can greatly reduce the cost of the inode->i_version
> > counter, by exploiting the fact that we don't need to increment it if no
> > one is looking at it. We can also clean up the code to prepare to
> > eventually expose this value via statx().
> > 
> > Note that this set relies on a few patches that are in other trees. The
> > full stack that I've been testing with is here:
> > 
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=iversion
> > 
> > The inode->i_version field is supposed to be a value that changes
> > whenever there is any data or metadata change to the inode. Some
> > filesystems use it internally to detect directory changes during
> > readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA
> > will also use it to optimize away some remeasurement if it's available.
> > NFS and AFS just use it to store an opaque change attribute from the
> > server.
> > 
> > Only btrfs, ext4, and xfs increment it for data changes. Because of
> > this, these filesystems must log the inode to disk whenever the
> > i_version counter changes. That has a non-zero performance impact,
> > especially on write-heavy workloads, because we end up dirtying the
> > inode metadata on every write, not just when the times change.
> > 
> > It turns out though that none of these users of i_version require that
> > it change on every change to the file. The only real requirement is that
> > it be different if something changed since the last time we queried for
> > it.
> > 
> > If we keep track of when something queries the value, we can avoid
> > bumping the counter and an on-disk update when nothing else has changed
> > if no one has queried it since it was last incremented.
> > 
> > This patchset changes the code to only bump the i_version counter when
> > it's strictly necessary, or when we're updating the inode metadata
> > anyway (e.g. when times change).
> > 
> > It takes the approach of converting the existing accessors of i_version
> > to use a new API, while leaving the underlying implementation mostly the
> > same.  The last patch then converts the existing implementation to keep
> > track of whether the value has been queried since it was last
> > incremented. It then uses that to avoid incrementing the counter when
> > it can.
> > 
> > With this, we reduce inode metadata updates across all 3 filesystems
> > down to roughly the frequency of the timestamp granularity, particularly
> > when it's not being queried (the vastly common case).
> > 
> > I can see measurable performance gains on xfs and ext4 with iversion
> > enabled, when streaming small (4k) I/Os.
> > 
> > btrfs shows some slight gain in testing, but not quite the magnitude
> > that xfs and ext4 show. I'm not sure why yet and would appreciate some
> > input from btrfs folks.
> > 
> 
> Thanks for the patchset.
> 
> Not sure about how you tested the performance, but in terms of
> write+fsync or synchronous write, btrfs's fsync doesn't check if no
> timestamp/iversion has been changed, instead only checks if inode has
> been logged by some btrfs internal flags and counters, probably
> because by default every write is cow and every write 

[PATCH v5 00/19] fs: rework and optimize i_version handling in filesystems

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

v5:
- don't corrupt refcounts stashed in i_version of ext4 xattr inodes
- add raw variants of inc and cmp functions, and have nfs use them

v4:
- fix SB_LAZYTIME handling in generic_update_time
- add memory barriers to patch to convert i_version field to atomic64_t

v3:
- move i_version handling functions to new header file
- document that the kernel-managed i_version implementation will appear to
  increase over time
- fix inode_cmp_iversion to handle wraparound correctly

v2:
- xfs should use inode_peek_iversion instead of inode_peek_iversion_raw
- rework file_update_time patch
- don't dirty inode when only S_ATIME is set and SB_LAZYTIME is enabled
- better comments and documentation

I think this is now approaching merge readiness.

Special thanks to Jan Kara and Dave Chinner who helped me tighten up the
memory barriers in the final patch, and Krzysztof Kozlowski for help in
tracking down a set of bugs in the NFS client patch.

tl;dr: I think we can greatly reduce the cost of the inode->i_version
counter, by exploiting the fact that we don't need to increment it if no
one is looking at it. We can also clean up the code to prepare to
eventually expose this value via statx().

Note that this set relies on a few patches that are in other trees. The
full stack that I've been testing with is here:


https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=iversion

The inode->i_version field is supposed to be a value that changes
whenever there is any data or metadata change to the inode. Some
filesystems use it internally to detect directory changes during
readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA
will also use it to optimize away some remeasurement if it's available.
NFS and AFS just use it to store an opaque change attribute from the
server.

Only btrfs, ext4, and xfs increment it for data changes. Because of
this, these filesystems must log the inode to disk whenever the
i_version counter changes. That has a non-zero performance impact,
especially on write-heavy workloads, because we end up dirtying the
inode metadata on every write, not just when the times change.

It turns out though that none of these users of i_version require that
it change on every change to the file. The only real requirement is that
it be different if something changed since the last time we queried for
it.

If we keep track of when something queries the value, we can avoid
bumping the counter and an on-disk update when nothing else has changed
if no one has queried it since it was last incremented.

This patchset changes the code to only bump the i_version counter when
it's strictly necessary, or when we're updating the inode metadata
anyway (e.g. when times change).

It takes the approach of converting the existing accessors of i_version
to use a new API, while leaving the underlying implementation mostly the
same.  The last patch then converts the existing implementation to keep
track of whether the value has been queried since it was last
incremented. It then uses that to avoid incrementing the counter when
it can.
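
The approach described above (bump the counter only if someone has looked
at it since the last bump) might be sketched in user space roughly as
follows. This is a sketch under assumptions, not the code from the series:
the text above does not say how the "queried" state is stored, so the
sketch assumes it lives in the low bit of the same 64-bit word, with the
counter in the upper 63 bits, and it uses C11 atomics in place of the
kernel's atomic64_t. All toy_* and TOY_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_QUERIED	1ULL	/* assumed: low bit means "someone looked" */
#define TOY_INCREMENT	2ULL	/* assumed: counter sits above the flag bit */

/* Hypothetical stand-in for an inode; not the kernel's struct inode. */
struct toy_inode {
	_Atomic uint64_t i_version;
};

/* Observer side: return the counter and mark it as queried. */
static uint64_t toy_query_iversion(struct toy_inode *inode)
{
	uint64_t cur = atomic_load(&inode->i_version);

	/* Set the flag unless it is already set; retry if we lose a race. */
	while (!(cur & TOY_QUERIED)) {
		if (atomic_compare_exchange_weak(&inode->i_version, &cur,
						 cur | TOY_QUERIED))
			break;
	}
	return cur >> 1;	/* strip the flag bit */
}

/*
 * Writer side: bump the counter only when forced, or when the value has
 * been queried since the last bump. Returns true if a bump happened, in
 * which case the caller would also need to log the inode.
 */
static bool toy_maybe_inc_iversion(struct toy_inode *inode, bool force)
{
	uint64_t cur = atomic_load(&inode->i_version);
	uint64_t new;

	do {
		if (!force && !(cur & TOY_QUERIED))
			return false;	/* nobody looked: skip the bump */
		/* Clear the flag and advance the counter by one. */
		new = (cur & ~TOY_QUERIED) + TOY_INCREMENT;
	} while (!atomic_compare_exchange_weak(&inode->i_version, &cur, new));

	return true;
}
```

A writer that gets false back can skip both the increment and the on-disk
inode update, which is where the savings on write-heavy workloads would
come from.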

With this, we reduce inode metadata updates across all 3 filesystems
down to roughly the frequency of the timestamp granularity, particularly
when it's not being queried (the vastly common case).

I can see measurable performance gains on xfs and ext4 with iversion
enabled, when streaming small (4k) I/Os.

btrfs shows some slight gain in testing, but not quite the magnitude
that xfs and ext4 show. I'm not sure why yet and would appreciate some
input from btrfs folks.

My goal is to get this into linux-next fairly soon. If it shows no
problems then we can look at merging it for 4.16, or 4.17 if all of the
prerequisite patches are not yet merged.

Jeff Layton (19):
  fs: new API for handling inode->i_version
  fs: don't take the i_lock in inode_inc_iversion
  fat: convert to new i_version API
  affs: convert to new i_version API
  afs: convert to new i_version API
  btrfs: convert to new i_version API
  exofs: switch to new i_version API
  ext2: convert to new i_version API
  ext4: convert to new i_version API
  nfs: convert to new i_version API
  nfsd: convert to new i_version API
  ocfs2: convert to new i_version API
  ufs: use new i_version API
  xfs: convert to new i_version API
  IMA: switch IMA over to new i_version API
  fs: only set S_VERSION when updating times if necessary
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
    incrementing
  btrfs: only dirty the inode in btrfs_update_time if something was
    changed
  fs: handle inode->i_version more efficiently

 fs/affs/amigaffs.c       |   5 +-
 fs/affs/dir.c            |   5 +-
 fs/affs/super.c          |   3 +-
 fs/afs/fsclient.c        |   3 +-
 fs/afs/inode.c           |   5 +-
 fs/btrfs/delayed-inode.c |   7 +-
 fs/btrfs/file.c          | 

[PATCH v5 01/19] fs: new API for handling inode->i_version

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
---
 fs/btrfs/file.c  |   1 +
 fs/btrfs/inode.c |   1 +
 fs/btrfs/ioctl.c |   1 +
 fs/btrfs/xattr.c |   1 +
 fs/ext4/inode.c  |   1 +
 fs/ext4/namei.c  |   1 +
 fs/inode.c   |   1 +
 include/linux/fs.h   |  15 ---
 include/linux/iversion.h | 236 +++
 9 files changed, 243 insertions(+), 15 deletions(-)
 create mode 100644 include/linux/iversion.h

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index eb1bac7c8553..c95d7b2efefb 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include <linux/iversion.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e1a7f3cb5be9..27f008b33fc1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include <linux/iversion.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2ef8acaac688..aa452c9e2eff 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include <linux/iversion.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c
index 2c7e53f9ff1b..5258c1714830 100644
--- a/fs/btrfs/xattr.c
+++ b/fs/btrfs/xattr.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <linux/iversion.h>
 #include "ctree.h"
 #include "btrfs_inode.h"
 #include "transaction.h"
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7df2c5644e59..fa5d8bc52d2d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include <linux/iversion.h>
 
 #include "ext4_jbd2.h"
 #include "xattr.h"
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 798b3ac680db..bcf0dff517be 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include <linux/iversion.h>
 #include "ext4.h"
 #include "ext4_jbd2.h"
 
diff --git a/fs/inode.c b/fs/inode.c
index 03102d6ef044..19e72f500f71 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -18,6 +18,7 @@
 #include  /* for inode_has_buffers */
 #include 
 #include 
+#include <linux/iversion.h>
 #include 
 #include "internal.h"
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaabf624..76382c24e9d0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode *inode)
mark_inode_dirty(inode);
 }
 
-/**
- * inode_inc_iversion - increments i_version
- * @inode: inode that need to be updated
- *
- * Every time the inode is modified, the i_version field will be incremented.
- * The filesystem has to be mounted with i_version flag
- */
-
-static inline void inode_inc_iversion(struct inode *inode)
-{
-   spin_lock(&inode->i_lock);
-   inode->i_version++;
-   spin_unlock(&inode->i_lock);
-}
-
 enum file_time_flags {
S_ATIME = 1,
S_MTIME = 2,
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
new file mode 100644
index ..d09cc3a08740
--- /dev/null
+++ b/include/linux/iversion.h
@@ -0,0 +1,236 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_IVERSION_H
+#define _LINUX_IVERSION_H
+
+#include 
+
+/*
+ * The change attribute (i_version) is mandated by NFSv4 and is mostly for
+ * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
+ * appear different to observers if there was a change to the inode's data or
+ * metadata since it was last queried.
+ *
+ * Observers see the i_version as a 64-bit number that never changes. If it
+ * remains the same since it was last checked, then nothing has changed in the
+ * inode. If it's different then something has changed. Observers cannot infer
+ * anything about the nature or magnitude of the changes from the value, only
+ * that the inode has changed in some fashion.
+ *
+ * Not all filesystems properly implement the i_version counter. Subsystems that
+ * want to use i_version field on an inode should first check whether the
+ * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
+ *
+ * Those that set SB_

[PATCH v5 02/19] fs: don't take the i_lock in inode_inc_iversion

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

The rationale for taking the i_lock when incrementing this value is
lost in antiquity. The readers of the field don't take it (at least
not universally), so my assumption is that it was only done here to
serialize incrementors.

If that is indeed the case, then we can drop the i_lock from this
codepath and treat it as an atomic64_t for the purposes of
incrementing it. This allows us to use inode_inc_iversion without
any danger of lock inversion.

Note that the read side is not fetched atomically with this change.
The assumption here is that that is not a critical issue since the
i_version is not fully synchronized with anything else anyway.

Signed-off-by: Jeff Layton 
---
 include/linux/iversion.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index d09cc3a08740..5ad9eaa3a9b0 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -104,12 +104,13 @@ inode_set_iversion_queried(struct inode *inode, u64 new)
 static inline bool
 inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
-   spin_lock(&inode->i_lock);
-   inode->i_version++;
-   spin_unlock(&inode->i_lock);
+   atomic64_t *ivp = (atomic64_t *)&inode->i_version;
+
+   atomic64_inc(ivp);
return true;
 }
 
+
 /**
  * inode_inc_iversion - forcibly increment i_version
  * @inode: inode that needs to be updated
-- 
2.14.3
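
For readers who want to see the shape of this change outside the kernel,
here is a minimal user-space sketch of a lock-free increment. It is not
the patch's code: the patch casts the existing i_version field to
atomic64_t and leaves readers non-atomic, whereas this sketch assumes the
field is declared atomic from the start and uses C11 <stdatomic.h>. The
toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode; not the kernel's struct inode. */
struct toy_inode {
	_Atomic uint64_t i_version;
};

/*
 * Increment without taking a lock. The read-modify-write is a single
 * atomic operation, so concurrent incrementors cannot lose updates and
 * there is no lock ordering to worry about.
 */
static void toy_inc_iversion(struct toy_inode *inode)
{
	atomic_fetch_add(&inode->i_version, 1);
}

/* Read the current value (atomic here, unlike the read side in the patch). */
static uint64_t toy_peek_iversion(struct toy_inode *inode)
{
	return atomic_load(&inode->i_version);
}

/* Small self-check: two increments are both visible to a later read. */
static void toy_demo(void)
{
	struct toy_inode inode;

	atomic_init(&inode.i_version, 1);
	toy_inc_iversion(&inode);
	toy_inc_iversion(&inode);
	assert(toy_peek_iversion(&inode) == 3);
}
```

The trade-off is the one the commit message names: incrementors are
serialized by the atomic operation itself, while nothing orders the
counter against other inode fields.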



[PATCH v5 03/19] fat: convert to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/fat/dir.c         |  3 ++-
 fs/fat/inode.c       |  9 +++++----
 fs/fat/namei_msdos.c |  7 ++++---
 fs/fat/namei_vfat.c  | 22 +++++++++++-----------
 4 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index b833ffeee1e1..8e100c3bf72c 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include <linux/iversion.h>
 #include "fat.h"
 
 /*
@@ -1055,7 +1056,7 @@ int fat_remove_entries(struct inode *dir, struct fat_slot_info *sinfo)
brelse(bh);
if (err)
return err;
-   dir->i_version++;
+   inode_inc_iversion(dir);
 
if (nr_slots) {
/*
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 20a0a89eaca5..ffbbf0520d9e 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/iversion.h>
 #include "fat.h"
 
 #ifndef CONFIG_FAT_DEFAULT_IOCHARSET
@@ -507,7 +508,7 @@ int fat_fill_inode(struct inode *inode, struct msdos_dir_entry *de)
MSDOS_I(inode)->i_pos = 0;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_generation = get_seconds();
 
if ((de->attr & ATTR_DIR) && !IS_FREE(de->name)) {
@@ -590,7 +591,7 @@ struct inode *fat_build_inode(struct super_block *sb,
goto out;
}
inode->i_ino = iunique(sb, MSDOS_ROOT_INO);
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
err = fat_fill_inode(inode, de);
if (err) {
iput(inode);
@@ -1377,7 +1378,7 @@ static int fat_read_root(struct inode *inode)
MSDOS_I(inode)->i_pos = MSDOS_ROOT_INO;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_generation = 0;
inode->i_mode = fat_make_mode(sbi, ATTR_DIR, S_IRWXUGO);
inode->i_op = sbi->dir_ops;
@@ -1828,7 +1829,7 @@ int fat_fill_super(struct super_block *sb, void *data, int silent, int isvfat,
if (!root_inode)
goto out_fail;
root_inode->i_ino = MSDOS_ROOT_INO;
-   root_inode->i_version = 1;
+   inode_set_iversion(root_inode, 1);
error = fat_read_root(root_inode);
if (error < 0) {
iput(root_inode);
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index d24d2758a363..582ca731a6c9 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -7,6 +7,7 @@
  */
 
 #include 
+#include <linux/iversion.h>
 #include "fat.h"
 
 /* Characters that are undesirable in an MS-DOS file name */
@@ -480,7 +481,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
} else
mark_inode_dirty(old_inode);
 
-   old_dir->i_version++;
+   inode_inc_iversion(old_dir);
   old_dir->i_ctime = old_dir->i_mtime = current_time(old_dir);
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
@@ -508,7 +509,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
goto out;
new_i_pos = sinfo.i_pos;
}
-   new_dir->i_version++;
+   inode_inc_iversion(new_dir);
 
fat_detach(old_inode);
fat_attach(old_inode, new_i_pos);
@@ -540,7 +541,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
old_sinfo.bh = NULL;
if (err)
goto error_dotdot;
-   old_dir->i_version++;
+   inode_inc_iversion(old_dir);
old_dir->i_ctime = old_dir->i_mtime = ts;
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 02c03a3a..cefea792cde8 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -20,7 +20,7 @@
 #include 
 #include 
 #include 
-
+#include <linux/iversion.h>
 #include "fat.h"
 
 static inline unsigned long vfat_d_version(struct dentry *dentry)
@@ -46,7 +46,7 @@ static int vfat_revalidate_shortname(struct dentry *dentry)
 {
int ret = 1;
spin_lock(&dentry->d_lock);
-   if (vfat_d_version(dentry) != d_inode(dentry->d_parent)->i_version)
+   if (inode_cmp_iversion(d_inode(dentry->d_parent), vfat_d_version(dentry)))
ret = 0;
spin_unlock(&dentry->d_lock);
return ret;
@@ -759,7 +759,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry,
 out:
mutex_unlock(&MSDOS_SB(sb)->s_lock);
if (!inode)
-   vfat_d_version_set(dentry, dir-&g
