Re: rfc: [patch] change attribute for ext3
On Wed, 2006-12-13 at 20:52 -0500, J. Bruce Fields wrote: What kind of requirements does NFSv4 place on the version? Monotonic is probably a good bet. The only requirement is that it be unique (assuming a file is never modified 2^64 times). Clients can't compare them except for equality. The other requirement is that they be updated in more or less any situation where you would normally see a 'ctime' update. In other words any time when the file metadata or data changes, and any time when the ACL changes. (NB: I'm not sure what we should do w.r.t. xattr changes since those are not really covered by RFC3530.) Atomicity is not a hard requirement, however the server is required to know whether or not the update was atomic. If the update is atomic, a careful client may perform certain optimisations based upon it knowing that no other changes to the inode have raced with this one. For instance, if it knows that a file creation atomically updated the change attribute of the directory, then it can determine that it does not need to check for other changes to that directory. Does it need to be global for the filesystem Nope. or is a per-inode version sufficient? Yes. Yes. If your filesystem wants to support Solaris or Reiser4-like subfiles, then it is expected that each subfile should have its own change attribute (whereas changes to the subfile 'directory' will be reflected by the parent inode's change attribute. Change attribute values may be reused if the inode number is reused (as long as the filesystem has something like a generation counter that allows it to distinguish between different instances of the same inode number). What functionality of NFSv4 needs the version? Clients use it to revalidate their caches. Yup. It is used to detect changes made on the NFS server itself (possibly by other NFS clients, possibly by local processes on the server), so that the client can flush out any stale cached data. Cheers Trond - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rfc: [patch] change attribute for ext3
On Wed, Sep 13, 2006 at 01:31:30PM -0600, Andreas Dilger wrote: On Sep 13, 2006 18:42 +0200, Alexandre Ratchov wrote: the change attribute is a simple counter that is reset to zero on inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). To start, I'm supportive of this concept, my comments are only to get the most efficient implementation. This appears to be very similar to the i_version field that is already in the inode (which is also modified only by ext3), so instead of increasing the size of the inode further we could use that field and just make it persistent on disk. The i_version field is incremented in mkdir/rmdir/create/rename. For Lustre it would also be desirable to modify the version field for regular files when they are modified (e.g. setattr, write), and it appears NFS v4 also wants the same (assumed from use of file_update_time()). The question is whether this should be handled internal to the filesystem (as ext3 does now) or if it should be part of the VFS. hmm..., i_version is currently used for directory entries validation; i've browsed the ext{2,3,4} sources and i don't see any drawbacks in merging i_version and i_chattr. IMHO, the natural place to do this stuff is the VFS, because it can be common to all file-systems supporting this feature. Currently it's the same with ctime, mtime and atime. These are in the VFS even if there are file-systems that don't support all of them. The patch also adds a new ``st_change_attribute'' field in the stat structure, and modifies the stat(2) syscall accordingly. Currently the change is only visible on i386 and x86_64 archs. Is this really necessary for knfsd? @@ -511,6 +511,7 @@ static struct inode *bm_get_inode(struct inode-i_blocks = 0; inode-i_atime = inode-i_mtime = inode-i_ctime = current_fs_time(inode-i_sb); + inode-i_change_attribute = 0; Initializing to zero is more dangerous than any non-zero number, since this is the most likely outcome of corruption... The current ext3 code initializes i_version to a random number, and we can use comparisons similar to jiffies as long as we don't expect 2^31 changes between comparisons. it's ok for me; +++ fs/inode.c 13 Sep 2006 18:15:43 - 1.1.1.3.2.1 @@ -1232,16 +1232,10 @@ void file_update_time(struct file *file) return; now = current_fs_time(inode-i_sb); - if (!timespec_equal(inode-i_mtime, now)) - sync_it = 1; inode-i_mtime = now; - - if (!timespec_equal(inode-i_ctime, now)) - sync_it = 1; inode-i_ctime = now; - - if (sync_it) - mark_inode_dirty_sync(inode); + inode-i_change_attribute++; + mark_inode_dirty_sync(inode); Ugh, this would definitely hurt performance, because ext3_dirty_inode() packs-for-disk the whole inode each time. I believe Stephen had patches at one time to do the inode packing at transaction commit time instead of when it is changed, so we only do the packing once. Having a generic mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty inode to the buffer) would also be useful for other things like doing the inode or group descriptor checksum only once per transaction... yes, that part is ugly. I've thought about another solution, but i don't know if this would work: afaik, for an open file, there is always a copy of the inode in memory, because there is a reference to it in the file structure. So, in principle, we shouldn't need to make dirty the inode. I don't know if this is feasable perhaps am i missing something here. Index: include/linux/ext3_fs.h === RCS file: /home/ratchova/cvs/linux/include/linux/ext3_fs.h,v retrieving revision 1.1.1.3 retrieving revision 1.1.1.3.2.1 diff -u -p -r1.1.1.3 -r1.1.1.3.2.1 --- include/linux/ext3_fs.h 13 Sep 2006 17:45:20 - 1.1.1.3 +++ include/linux/ext3_fs.h 13 Sep 2006 18:15:43 - 1.1.1.3.2.1 @@ -286,7 +286,7 @@ struct ext3_inode { __u16 i_pad1; __le16 l_i_uid_high; /* these 2 fields*/ __le16 l_i_gid_high; /* were reserved2[0] */ - __u32 l_i_reserved2; + __le32 l_i_change_attribute; There was some other use of the reserved fields for ext4 64-bit-blocks support. One was for i_file_acl_hi (I think this is using the i_pad1 above), one was for i_blocks_hi (I believe this was proposed to use the i_frag and i_fsize bytes). Is this conflicting with anything else? There were a lot of proposals for these fields, and I don't recall which ones are still out there. i haven't noticed any conflicts here. @@ -280,6 +280,7 @@ typedef void (dio_iodone_t)(struct kiocb +#define ATTR_CHANGE_ATTRIBUTE 16384 Do
Re: rfc: [patch] change attribute for ext3
On Sep 14, 2006 15:21 +0200, Alexandre Ratchov wrote: IMHO, the natural place to do this stuff is the VFS, because it can be common to all file-systems supporting this feature. Currently it's the same with ctime, mtime and atime. These are in the VFS even if there are file-systems that don't support all of them. Well, that is only partly true. I see lots of places in ext3 that are setting i_ctime and i_mtime... Ugh, this would definitely hurt performance, because ext3_dirty_inode() packs-for-disk the whole inode each time. I believe Stephen had patches at one time to do the inode packing at transaction commit time instead of when it is changed, so we only do the packing once. Having a generic mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty inode to the buffer) would also be useful for other things like doing the inode or group descriptor checksum only once per transaction... afaik, for an open file, there is always a copy of the inode in memory, because there is a reference to it in the file structure. So, in principle, we shouldn't need to make dirty the inode. I don't know if this is feasable perhaps am i missing something here. The in-memory inode needs to be copied into the buffer so that it is part of the transaction being committed to disk, or updates are lost. This was a common bug with early ext3 - marking the inode dirty and then changing a field in the in-core inode - which would not be saved to disk. In other filesystems this is only a few-cycle race, but in ext3 the in-core inode is not written to disk unless the inode is again marked dirty. The potential benefit of making this a callback from the JBD layer is it avoids copying the inode for EVERY dirty, and only doing it once per transaction. Add a list of callbacks hooked onto the transaction to be called before it is committed, and the callback data is the inode pointer which does a single ext3_do_update_inode() call if the inode is still marked dirty. it's not strictly necessary; it's not more necessary that the interface to ctime or other attributes. It's here for completness, in my opinion the change attribute is the same as the ctime time-stamp Then makes sense to just improve the ctime mechanism instead of adding new code and interfaces... Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
rfc: [patch] change attribute for ext3
hello, here is a small patch that adds the change attribute for ext3 file-systems; the change attribute is a simple counter that is reset to zero on inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). Its purpose is to make possible to watch a file for changes, as in the following pseudo-code: stat(filename, oldsb); ... /* do something */ stat(filename, newsb); if (oldsb-st_ctime != newsb-st_ctime || oldsb-st_change_attribute != newsb-st_changeattribute) { /* file changed */ } else { /* file didn't change */ } The patch also adds a new ``st_change_attribute'' field in the stat structure, and modifies the stat(2) syscall accordingly. Currently the change is only visible on i386 and x86_64 archs. In order to test the patch, there's a trivial utility that displays the value of the change attribute: http://www.bullopensource.org/ext4/20060913/chinfo.tar.gz Comments? cheers, -- Alexandre Signed-off-by: Celine Bourde Signed-off-by: Alexandre Ratchov Index: fs/attr.c === RCS file: /home/ratchova/cvs/linux/fs/attr.c,v retrieving revision 1.1.1.1 retrieving revision 1.1.1.1.10.1 diff -u -p -r1.1.1.1 -r1.1.1.1.10.1 --- fs/attr.c 28 Jun 2006 18:41:20 - 1.1.1.1 +++ fs/attr.c 13 Sep 2006 18:15:43 - 1.1.1.1.10.1 @@ -88,6 +88,9 @@ int inode_setattr(struct inode * inode, if (ia_valid ATTR_CTIME) inode-i_ctime = timespec_trunc(attr-ia_ctime, inode-i_sb-s_time_gran); + if (ia_valid ATTR_CHANGE_ATTRIBUTE) + inode-i_change_attribute = attr-ia_change_attribute; + if (ia_valid ATTR_MODE) { umode_t mode = attr-ia_mode; @@ -111,7 +114,7 @@ int notify_change(struct dentry * dentry mode = inode-i_mode; now = current_fs_time(inode-i_sb); - + inode-i_change_attribute++; attr-ia_ctime = now; if (!(ia_valid ATTR_ATIME_SET)) attr-ia_atime = now; Index: fs/bad_inode.c === RCS file: /home/ratchova/cvs/linux/fs/bad_inode.c,v retrieving revision 1.1.1.2 retrieving revision 1.1.1.2.6.1 diff -u -p -r1.1.1.2 -r1.1.1.2.6.1 --- fs/bad_inode.c 28 Jun 2006 18:42:06 - 1.1.1.2 +++ fs/bad_inode.c 13 Sep 2006 18:15:43 - 1.1.1.2.6.1 @@ -97,6 +97,7 @@ void make_bad_inode(struct inode * inode inode-i_mode = S_IFREG; inode-i_atime = inode-i_mtime = inode-i_ctime = current_fs_time(inode-i_sb); + inode-i_change_attribute++; inode-i_op = bad_inode_ops; inode-i_fop = bad_file_ops; } Index: fs/binfmt_misc.c === RCS file: /home/ratchova/cvs/linux/fs/binfmt_misc.c,v retrieving revision 1.1.1.3 retrieving revision 1.1.1.3.2.1 diff -u -p -r1.1.1.3 -r1.1.1.3.2.1 --- fs/binfmt_misc.c13 Sep 2006 17:45:04 - 1.1.1.3 +++ fs/binfmt_misc.c13 Sep 2006 18:15:43 - 1.1.1.3.2.1 @@ -511,6 +511,7 @@ static struct inode *bm_get_inode(struct inode-i_blocks = 0; inode-i_atime = inode-i_mtime = inode-i_ctime = current_fs_time(inode-i_sb); + inode-i_change_attribute = 0; } return inode; } Index: fs/eventpoll.c === RCS file: /home/ratchova/cvs/linux/fs/eventpoll.c,v retrieving revision 1.1.1.3 retrieving revision 1.1.1.3.2.1 diff -u -p -r1.1.1.3 -r1.1.1.3.2.1 --- fs/eventpoll.c 13 Sep 2006 17:45:04 - 1.1.1.3 +++ fs/eventpoll.c 13 Sep 2006 18:15:43 - 1.1.1.3.2.1 @@ -1590,6 +1590,7 @@ static struct inode *ep_eventpoll_inode( inode-i_uid = current-fsuid; inode-i_gid = current-fsgid; inode-i_atime = inode-i_mtime = inode-i_ctime = CURRENT_TIME; + inode-i_change_attribute = 0; inode-i_blksize = PAGE_SIZE; return inode; Index: fs/inode.c === RCS file: /home/ratchova/cvs/linux/fs/inode.c,v retrieving revision 1.1.1.3 retrieving revision 1.1.1.3.2.1 diff -u -p -r1.1.1.3 -r1.1.1.3.2.1 --- fs/inode.c 13 Sep 2006 17:45:04 - 1.1.1.3 +++ fs/inode.c 13 Sep 2006 18:15:43 - 1.1.1.3.2.1 @@ -1232,16 +1232,10 @@ void file_update_time(struct file *file) return; now = current_fs_time(inode-i_sb); - if (!timespec_equal(inode-i_mtime, now)) - sync_it = 1; inode-i_mtime = now; - - if (!timespec_equal(inode-i_ctime, now)) - sync_it = 1; inode-i_ctime = now; - - if (sync_it) - mark_inode_dirty_sync(inode); +
Re: rfc: [patch] change attribute for ext3
On Sep 13, 2006 18:42 +0200, Alexandre Ratchov wrote: the change attribute is a simple counter that is reset to zero on inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). To start, I'm supportive of this concept, my comments are only to get the most efficient implementation. This appears to be very similar to the i_version field that is already in the inode (which is also modified only by ext3), so instead of increasing the size of the inode further we could use that field and just make it persistent on disk. The i_version field is incremented in mkdir/rmdir/create/rename. For Lustre it would also be desirable to modify the version field for regular files when they are modified (e.g. setattr, write), and it appears NFS v4 also wants the same (assumed from use of file_update_time()). The question is whether this should be handled internal to the filesystem (as ext3 does now) or if it should be part of the VFS. The patch also adds a new ``st_change_attribute'' field in the stat structure, and modifies the stat(2) syscall accordingly. Currently the change is only visible on i386 and x86_64 archs. Is this really necessary for knfsd? @@ -511,6 +511,7 @@ static struct inode *bm_get_inode(struct inode-i_blocks = 0; inode-i_atime = inode-i_mtime = inode-i_ctime = current_fs_time(inode-i_sb); + inode-i_change_attribute = 0; Initializing to zero is more dangerous than any non-zero number, since this is the most likely outcome of corruption... The current ext3 code initializes i_version to a random number, and we can use comparisons similar to jiffies as long as we don't expect 2^31 changes between comparisons. +++ fs/inode.c13 Sep 2006 18:15:43 - 1.1.1.3.2.1 @@ -1232,16 +1232,10 @@ void file_update_time(struct file *file) return; now = current_fs_time(inode-i_sb); - if (!timespec_equal(inode-i_mtime, now)) - sync_it = 1; inode-i_mtime = now; - - if (!timespec_equal(inode-i_ctime, now)) - sync_it = 1; inode-i_ctime = now; - - if (sync_it) - mark_inode_dirty_sync(inode); + inode-i_change_attribute++; + mark_inode_dirty_sync(inode); Ugh, this would definitely hurt performance, because ext3_dirty_inode() packs-for-disk the whole inode each time. I believe Stephen had patches at one time to do the inode packing at transaction commit time instead of when it is changed, so we only do the packing once. Having a generic mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty inode to the buffer) would also be useful for other things like doing the inode or group descriptor checksum only once per transaction... Index: include/linux/ext3_fs.h === RCS file: /home/ratchova/cvs/linux/include/linux/ext3_fs.h,v retrieving revision 1.1.1.3 retrieving revision 1.1.1.3.2.1 diff -u -p -r1.1.1.3 -r1.1.1.3.2.1 --- include/linux/ext3_fs.h 13 Sep 2006 17:45:20 - 1.1.1.3 +++ include/linux/ext3_fs.h 13 Sep 2006 18:15:43 - 1.1.1.3.2.1 @@ -286,7 +286,7 @@ struct ext3_inode { __u16 i_pad1; __le16 l_i_uid_high; /* these 2 fields*/ __le16 l_i_gid_high; /* were reserved2[0] */ - __u32 l_i_reserved2; + __le32 l_i_change_attribute; There was some other use of the reserved fields for ext4 64-bit-blocks support. One was for i_file_acl_hi (I think this is using the i_pad1 above), one was for i_blocks_hi (I believe this was proposed to use the i_frag and i_fsize bytes). Is this conflicting with anything else? There were a lot of proposals for these fields, and I don't recall which ones are still out there. @@ -280,6 +280,7 @@ typedef void (dio_iodone_t)(struct kiocb +#define ATTR_CHANGE_ATTRIBUTE 16384 Do you need a setattr interface for this, or is it sufficient to use the i_version field from the inode, and let the filesystem manage i_version updates as it is doing now? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html