Re: rfc: [patch] change attribute for ext3

2006-12-14 Thread Trond Myklebust
On Wed, 2006-12-13 at 20:52 -0500, J. Bruce Fields wrote:
  What kind of requirements does NFSv4 place on the version?  Monotonic is
  probably a good bet.
 
 The only requirement is that it be unique (assuming a file is never
 modified 2^64 times).  Clients can't compare them except for equality.

The other requirement is that they be updated in more or less any
situation where you would normally see a 'ctime' update. In other words
any time when the file metadata or data changes, and any time when the
ACL changes.

(NB: I'm not sure what we should do w.r.t. xattr changes since those are
not really covered by RFC3530.)

Atomicity is not a hard requirement, however the server is required to
know whether or not the update was atomic. If the update is atomic, a
careful client may perform certain optimisations based upon it knowing
that no other changes to the inode have raced with this one. For
instance, if it knows that a file creation atomically updated the change
attribute of the directory, then it can determine that it does not need
to check for other changes to that directory.

  Does it need to be global for the filesystem
 
 Nope.
 
  or is a per-inode version sufficient?
 
 Yes.

Yes. If your filesystem wants to support Solaris or Reiser4-like
subfiles, then it is expected that each subfile should have its own
change attribute (whereas changes to the subfile 'directory' will be
reflected by the parent inode's change attribute.

Change attribute values may be reused if the inode number is reused (as
long as the filesystem has something like a generation counter that
allows it to distinguish between different instances of the same inode
number).

  What functionality of NFSv4 needs the version?
 
 Clients use it to revalidate their caches.

Yup. It is used to detect changes made on the NFS server itself
(possibly by other NFS clients, possibly by local processes on the
server), so that the client can flush out any stale cached data.

Cheers
  Trond

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rfc: [patch] change attribute for ext3

2006-09-14 Thread Alexandre Ratchov
On Wed, Sep 13, 2006 at 01:31:30PM -0600, Andreas Dilger wrote:
 On Sep 13, 2006  18:42 +0200, Alexandre Ratchov wrote:
  the change attribute is a simple counter that is reset to zero on
  inode creation and that is incremented every time the inode data is
  modified (similarly to the ctime time-stamp).
 
 To start, I'm supportive of this concept, my comments are only to
 get the most efficient implementation.
 
 This appears to be very similar to the i_version field that is already
 in the inode (which is also modified only by ext3), so instead of
 increasing the size of the inode further we could use that field and
 just make it persistent on disk.  The i_version field is incremented
 in mkdir/rmdir/create/rename.  For Lustre it would also be desirable
 to modify the version field for regular files when they are modified
 (e.g. setattr, write), and it appears NFS v4 also wants the same
 (assumed from use of file_update_time()).  The question is whether
 this should be handled internal to the filesystem (as ext3 does now)
 or if it should be part of the VFS.

hmm..., i_version is currently used for directory entries validation; i've
browsed the ext{2,3,4} sources and i don't see any drawbacks in merging
i_version and i_chattr.

IMHO, the natural place to do this stuff is the VFS, because it can be
common to all file-systems supporting this feature. Currently it's the same
with ctime, mtime and atime. These are in the VFS even if there are
file-systems that don't support all of them.

  The patch also adds a new ``st_change_attribute'' field in the stat
  structure, and modifies the stat(2) syscall accordingly. Currently the
  change is only visible on i386 and x86_64 archs.
 
 Is this really necessary for knfsd?
 
  @@ -511,6 +511,7 @@ static struct inode *bm_get_inode(struct
  inode-i_blocks = 0;
  inode-i_atime = inode-i_mtime = inode-i_ctime =
  current_fs_time(inode-i_sb);
  +   inode-i_change_attribute = 0;
 
 Initializing to zero is more dangerous than any non-zero number,
 since this is the most likely outcome of corruption...  The current
 ext3 code initializes i_version to a random number, and we can use
 comparisons similar to jiffies as long as we don't expect  2^31
 changes between comparisons.
 

it's ok for me; 

  +++ fs/inode.c  13 Sep 2006 18:15:43 -  1.1.1.3.2.1
  @@ -1232,16 +1232,10 @@ void file_update_time(struct file *file)
  return;
   
  now = current_fs_time(inode-i_sb);
  -   if (!timespec_equal(inode-i_mtime, now))
  -   sync_it = 1;
  inode-i_mtime = now;
  -
  -   if (!timespec_equal(inode-i_ctime, now))
  -   sync_it = 1;
  inode-i_ctime = now;
  -
  -   if (sync_it)
  -   mark_inode_dirty_sync(inode);
  +   inode-i_change_attribute++;
  +   mark_inode_dirty_sync(inode);
 
 Ugh, this would definitely hurt performance, because ext3_dirty_inode()
 packs-for-disk the whole inode each time.  I believe Stephen had patches
 at one time to do the inode packing at transaction commit time instead
 of when it is changed, so we only do the packing once.  Having a generic
 mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty
 inode to the buffer) would also be useful for other things like doing
 the inode or group descriptor checksum only once per transaction...
 

yes, that part is ugly. I've thought about another solution, but i don't
know if this would work:

afaik, for an open file, there is always a copy of the inode in memory,
because there is a reference to it in the file structure. So, in principle,
we shouldn't need to make dirty the inode. I don't know if this is feasable
perhaps am i missing something here.

  Index: include/linux/ext3_fs.h
  ===
  RCS file: /home/ratchova/cvs/linux/include/linux/ext3_fs.h,v
  retrieving revision 1.1.1.3
  retrieving revision 1.1.1.3.2.1
  diff -u -p -r1.1.1.3 -r1.1.1.3.2.1
  --- include/linux/ext3_fs.h 13 Sep 2006 17:45:20 -  1.1.1.3
  +++ include/linux/ext3_fs.h 13 Sep 2006 18:15:43 -  1.1.1.3.2.1
  @@ -286,7 +286,7 @@ struct ext3_inode {
  __u16   i_pad1;
  __le16  l_i_uid_high;   /* these 2 fields*/
  __le16  l_i_gid_high;   /* were reserved2[0] */
  -   __u32   l_i_reserved2;
  +   __le32  l_i_change_attribute;
 
 There was some other use of the reserved fields for ext4 64-bit-blocks
 support.  One was for i_file_acl_hi (I think this is using the i_pad1
 above), one was for i_blocks_hi (I believe this was proposed to use the
 i_frag and i_fsize bytes).  Is this conflicting with anything else?
 There were a lot of proposals for these fields, and I don't recall which
 ones are still out there.

i haven't noticed any conflicts here.

 
  @@ -280,6 +280,7 @@ typedef void (dio_iodone_t)(struct kiocb
  +#define ATTR_CHANGE_ATTRIBUTE  16384
 
 Do 

Re: rfc: [patch] change attribute for ext3

2006-09-14 Thread Andreas Dilger
On Sep 14, 2006  15:21 +0200, Alexandre Ratchov wrote:
 IMHO, the natural place to do this stuff is the VFS, because it can be
 common to all file-systems supporting this feature. Currently it's the same
 with ctime, mtime and atime. These are in the VFS even if there are
 file-systems that don't support all of them.

Well, that is only partly true.  I see lots of places in ext3 that are
setting i_ctime and i_mtime...

  Ugh, this would definitely hurt performance, because ext3_dirty_inode()
  packs-for-disk the whole inode each time.  I believe Stephen had patches
  at one time to do the inode packing at transaction commit time instead
  of when it is changed, so we only do the packing once.  Having a generic
  mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty
  inode to the buffer) would also be useful for other things like doing
  the inode or group descriptor checksum only once per transaction...
 
 afaik, for an open file, there is always a copy of the inode in memory,
 because there is a reference to it in the file structure. So, in principle,
 we shouldn't need to make dirty the inode. I don't know if this is feasable
 perhaps am i missing something here.

The in-memory inode needs to be copied into the buffer so that it is
part of the transaction being committed to disk, or updates are lost.
This was a common bug with early ext3 - marking the inode dirty and
then changing a field in the in-core inode - which would not be saved
to disk.  In other filesystems this is only a few-cycle race, but in
ext3 the in-core inode is not written to disk unless the inode is
again marked dirty.

The potential benefit of making this a callback from the JBD layer is
it avoids copying the inode for EVERY dirty, and only doing it once
per transaction.  Add a list of callbacks hooked onto the transaction
to be called before it is committed, and the callback data is the
inode pointer which does a single ext3_do_update_inode() call if the
inode is still marked dirty.

 it's not strictly necessary; it's not more necessary that the interface to
 ctime or other attributes. It's here for completness, in my opinion the
 change attribute is the same as the ctime time-stamp

Then makes sense to just improve the ctime mechanism instead of adding
new code and interfaces...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


rfc: [patch] change attribute for ext3

2006-09-13 Thread Alexandre Ratchov
hello, 

here is a small patch that adds the change attribute for ext3
file-systems;

the change attribute is a simple counter that is reset to zero on
inode creation and that is incremented every time the inode data is
modified (similarly to the ctime time-stamp).

Its purpose is to make possible to watch a file for changes, as
in the following pseudo-code:

stat(filename, oldsb);

... /* do something */ 

stat(filename, newsb);
if (oldsb-st_ctime != newsb-st_ctime ||
oldsb-st_change_attribute != newsb-st_changeattribute) {

/* file changed */
} else {

/* file didn't change */
}

The patch also adds a new ``st_change_attribute'' field in the stat
structure, and modifies the stat(2) syscall accordingly. Currently the
change is only visible on i386 and x86_64 archs.

In order to test the patch, there's a trivial utility that displays the
value of the change attribute:

http://www.bullopensource.org/ext4/20060913/chinfo.tar.gz

Comments?

cheers,

-- Alexandre


Signed-off-by: Celine Bourde
Signed-off-by: Alexandre Ratchov

Index: fs/attr.c
===
RCS file: /home/ratchova/cvs/linux/fs/attr.c,v
retrieving revision 1.1.1.1
retrieving revision 1.1.1.1.10.1
diff -u -p -r1.1.1.1 -r1.1.1.1.10.1
--- fs/attr.c   28 Jun 2006 18:41:20 -  1.1.1.1
+++ fs/attr.c   13 Sep 2006 18:15:43 -  1.1.1.1.10.1
@@ -88,6 +88,9 @@ int inode_setattr(struct inode * inode, 
if (ia_valid  ATTR_CTIME)
inode-i_ctime = timespec_trunc(attr-ia_ctime,
inode-i_sb-s_time_gran);
+   if (ia_valid  ATTR_CHANGE_ATTRIBUTE)
+   inode-i_change_attribute = attr-ia_change_attribute;
+   
if (ia_valid  ATTR_MODE) {
umode_t mode = attr-ia_mode;
 
@@ -111,7 +114,7 @@ int notify_change(struct dentry * dentry
 
mode = inode-i_mode;
now = current_fs_time(inode-i_sb);
-
+   inode-i_change_attribute++;
attr-ia_ctime = now;
if (!(ia_valid  ATTR_ATIME_SET))
attr-ia_atime = now;
Index: fs/bad_inode.c
===
RCS file: /home/ratchova/cvs/linux/fs/bad_inode.c,v
retrieving revision 1.1.1.2
retrieving revision 1.1.1.2.6.1
diff -u -p -r1.1.1.2 -r1.1.1.2.6.1
--- fs/bad_inode.c  28 Jun 2006 18:42:06 -  1.1.1.2
+++ fs/bad_inode.c  13 Sep 2006 18:15:43 -  1.1.1.2.6.1
@@ -97,6 +97,7 @@ void make_bad_inode(struct inode * inode
inode-i_mode = S_IFREG;
inode-i_atime = inode-i_mtime = inode-i_ctime =
current_fs_time(inode-i_sb);
+   inode-i_change_attribute++;
inode-i_op = bad_inode_ops;   
inode-i_fop = bad_file_ops;   
 }
Index: fs/binfmt_misc.c
===
RCS file: /home/ratchova/cvs/linux/fs/binfmt_misc.c,v
retrieving revision 1.1.1.3
retrieving revision 1.1.1.3.2.1
diff -u -p -r1.1.1.3 -r1.1.1.3.2.1
--- fs/binfmt_misc.c13 Sep 2006 17:45:04 -  1.1.1.3
+++ fs/binfmt_misc.c13 Sep 2006 18:15:43 -  1.1.1.3.2.1
@@ -511,6 +511,7 @@ static struct inode *bm_get_inode(struct
inode-i_blocks = 0;
inode-i_atime = inode-i_mtime = inode-i_ctime =
current_fs_time(inode-i_sb);
+   inode-i_change_attribute = 0;
}
return inode;
 }
Index: fs/eventpoll.c
===
RCS file: /home/ratchova/cvs/linux/fs/eventpoll.c,v
retrieving revision 1.1.1.3
retrieving revision 1.1.1.3.2.1
diff -u -p -r1.1.1.3 -r1.1.1.3.2.1
--- fs/eventpoll.c  13 Sep 2006 17:45:04 -  1.1.1.3
+++ fs/eventpoll.c  13 Sep 2006 18:15:43 -  1.1.1.3.2.1
@@ -1590,6 +1590,7 @@ static struct inode *ep_eventpoll_inode(
inode-i_uid = current-fsuid;
inode-i_gid = current-fsgid;
inode-i_atime = inode-i_mtime = inode-i_ctime = CURRENT_TIME;
+   inode-i_change_attribute = 0;
inode-i_blksize = PAGE_SIZE;
return inode;
 
Index: fs/inode.c
===
RCS file: /home/ratchova/cvs/linux/fs/inode.c,v
retrieving revision 1.1.1.3
retrieving revision 1.1.1.3.2.1
diff -u -p -r1.1.1.3 -r1.1.1.3.2.1
--- fs/inode.c  13 Sep 2006 17:45:04 -  1.1.1.3
+++ fs/inode.c  13 Sep 2006 18:15:43 -  1.1.1.3.2.1
@@ -1232,16 +1232,10 @@ void file_update_time(struct file *file)
return;
 
now = current_fs_time(inode-i_sb);
-   if (!timespec_equal(inode-i_mtime, now))
-   sync_it = 1;
inode-i_mtime = now;
-
-   if (!timespec_equal(inode-i_ctime, now))
-   sync_it = 1;
inode-i_ctime = now;
-
-   if (sync_it)
-   mark_inode_dirty_sync(inode);
+   

Re: rfc: [patch] change attribute for ext3

2006-09-13 Thread Andreas Dilger
On Sep 13, 2006  18:42 +0200, Alexandre Ratchov wrote:
 the change attribute is a simple counter that is reset to zero on
 inode creation and that is incremented every time the inode data is
 modified (similarly to the ctime time-stamp).

To start, I'm supportive of this concept, my comments are only to
get the most efficient implementation.

This appears to be very similar to the i_version field that is already
in the inode (which is also modified only by ext3), so instead of
increasing the size of the inode further we could use that field and
just make it persistent on disk.  The i_version field is incremented
in mkdir/rmdir/create/rename.  For Lustre it would also be desirable
to modify the version field for regular files when they are modified
(e.g. setattr, write), and it appears NFS v4 also wants the same
(assumed from use of file_update_time()).  The question is whether
this should be handled internal to the filesystem (as ext3 does now)
or if it should be part of the VFS.

 The patch also adds a new ``st_change_attribute'' field in the stat
 structure, and modifies the stat(2) syscall accordingly. Currently the
 change is only visible on i386 and x86_64 archs.

Is this really necessary for knfsd?

 @@ -511,6 +511,7 @@ static struct inode *bm_get_inode(struct
   inode-i_blocks = 0;
   inode-i_atime = inode-i_mtime = inode-i_ctime =
   current_fs_time(inode-i_sb);
 + inode-i_change_attribute = 0;

Initializing to zero is more dangerous than any non-zero number,
since this is the most likely outcome of corruption...  The current
ext3 code initializes i_version to a random number, and we can use
comparisons similar to jiffies as long as we don't expect  2^31
changes between comparisons.

 +++ fs/inode.c13 Sep 2006 18:15:43 -  1.1.1.3.2.1
 @@ -1232,16 +1232,10 @@ void file_update_time(struct file *file)
   return;
  
   now = current_fs_time(inode-i_sb);
 - if (!timespec_equal(inode-i_mtime, now))
 - sync_it = 1;
   inode-i_mtime = now;
 -
 - if (!timespec_equal(inode-i_ctime, now))
 - sync_it = 1;
   inode-i_ctime = now;
 -
 - if (sync_it)
 - mark_inode_dirty_sync(inode);
 + inode-i_change_attribute++;
 + mark_inode_dirty_sync(inode);

Ugh, this would definitely hurt performance, because ext3_dirty_inode()
packs-for-disk the whole inode each time.  I believe Stephen had patches
at one time to do the inode packing at transaction commit time instead
of when it is changed, so we only do the packing once.  Having a generic
mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty
inode to the buffer) would also be useful for other things like doing
the inode or group descriptor checksum only once per transaction...

 Index: include/linux/ext3_fs.h
 ===
 RCS file: /home/ratchova/cvs/linux/include/linux/ext3_fs.h,v
 retrieving revision 1.1.1.3
 retrieving revision 1.1.1.3.2.1
 diff -u -p -r1.1.1.3 -r1.1.1.3.2.1
 --- include/linux/ext3_fs.h   13 Sep 2006 17:45:20 -  1.1.1.3
 +++ include/linux/ext3_fs.h   13 Sep 2006 18:15:43 -  1.1.1.3.2.1
 @@ -286,7 +286,7 @@ struct ext3_inode {
   __u16   i_pad1;
   __le16  l_i_uid_high;   /* these 2 fields*/
   __le16  l_i_gid_high;   /* were reserved2[0] */
 - __u32   l_i_reserved2;
 + __le32  l_i_change_attribute;

There was some other use of the reserved fields for ext4 64-bit-blocks
support.  One was for i_file_acl_hi (I think this is using the i_pad1
above), one was for i_blocks_hi (I believe this was proposed to use the
i_frag and i_fsize bytes).  Is this conflicting with anything else?
There were a lot of proposals for these fields, and I don't recall which
ones are still out there.

 @@ -280,6 +280,7 @@ typedef void (dio_iodone_t)(struct kiocb
 +#define ATTR_CHANGE_ATTRIBUTE  16384

Do you need a setattr interface for this, or is it sufficient to use
the i_version field from the inode, and let the filesystem manage
i_version updates as it is doing now?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html