Re: rfc: [patch] change attribute for ext3

2006-09-13 Thread Andreas Dilger
On Sep 13, 2006  18:42 +0200, Alexandre Ratchov wrote:
 the change attribute is a simple counter that is reset to zero on
 inode creation and that is incremented every time the inode data is
 modified (similarly to the ctime time-stamp).

To start, I'm supportive of this concept, my comments are only to
get the most efficient implementation.

This appears to be very similar to the i_version field that is already
in the inode (which is also modified only by ext3), so instead of
increasing the size of the inode further we could use that field and
just make it persistent on disk.  The i_version field is incremented
in mkdir/rmdir/create/rename.  For Lustre it would also be desirable
to modify the version field for regular files when they are modified
(e.g. setattr, write), and it appears NFS v4 also wants the same
(assumed from use of file_update_time()).  The question is whether
this should be handled internal to the filesystem (as ext3 does now)
or if it should be part of the VFS.

 The patch also adds a new ``st_change_attribute'' field in the stat
 structure, and modifies the stat(2) syscall accordingly. Currently the
 change is only visible on i386 and x86_64 archs.

Is this really necessary for knfsd?

 @@ -511,6 +511,7 @@ static struct inode *bm_get_inode(struct
   inode-i_blocks = 0;
   inode-i_atime = inode-i_mtime = inode-i_ctime =
   current_fs_time(inode-i_sb);
 + inode-i_change_attribute = 0;

Initializing to zero is more dangerous than any non-zero number,
since this is the most likely outcome of corruption...  The current
ext3 code initializes i_version to a random number, and we can use
comparisons similar to jiffies as long as we don't expect  2^31
changes between comparisons.

 +++ fs/inode.c13 Sep 2006 18:15:43 -  1.1.1.3.2.1
 @@ -1232,16 +1232,10 @@ void file_update_time(struct file *file)
   return;
  
   now = current_fs_time(inode-i_sb);
 - if (!timespec_equal(inode-i_mtime, now))
 - sync_it = 1;
   inode-i_mtime = now;
 -
 - if (!timespec_equal(inode-i_ctime, now))
 - sync_it = 1;
   inode-i_ctime = now;
 -
 - if (sync_it)
 - mark_inode_dirty_sync(inode);
 + inode-i_change_attribute++;
 + mark_inode_dirty_sync(inode);

Ugh, this would definitely hurt performance, because ext3_dirty_inode()
packs-for-disk the whole inode each time.  I believe Stephen had patches
at one time to do the inode packing at transaction commit time instead
of when it is changed, so we only do the packing once.  Having a generic
mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty
inode to the buffer) would also be useful for other things like doing
the inode or group descriptor checksum only once per transaction...

 Index: include/linux/ext3_fs.h
 ===
 RCS file: /home/ratchova/cvs/linux/include/linux/ext3_fs.h,v
 retrieving revision 1.1.1.3
 retrieving revision 1.1.1.3.2.1
 diff -u -p -r1.1.1.3 -r1.1.1.3.2.1
 --- include/linux/ext3_fs.h   13 Sep 2006 17:45:20 -  1.1.1.3
 +++ include/linux/ext3_fs.h   13 Sep 2006 18:15:43 -  1.1.1.3.2.1
 @@ -286,7 +286,7 @@ struct ext3_inode {
   __u16   i_pad1;
   __le16  l_i_uid_high;   /* these 2 fields*/
   __le16  l_i_gid_high;   /* were reserved2[0] */
 - __u32   l_i_reserved2;
 + __le32  l_i_change_attribute;

There was some other use of the reserved fields for ext4 64-bit-blocks
support.  One was for i_file_acl_hi (I think this is using the i_pad1
above), one was for i_blocks_hi (I believe this was proposed to use the
i_frag and i_fsize bytes).  Is this conflicting with anything else?
There were a lot of proposals for these fields, and I don't recall which
ones are still out there.

 @@ -280,6 +280,7 @@ typedef void (dio_iodone_t)(struct kiocb
 +#define ATTR_CHANGE_ATTRIBUTE  16384

Do you need a setattr interface for this, or is it sufficient to use
the i_version field from the inode, and let the filesystem manage
i_version updates as it is doing now?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rfc: [patch] change attribute for ext3

2006-09-14 Thread Andreas Dilger
On Sep 14, 2006  15:21 +0200, Alexandre Ratchov wrote:
 IMHO, the natural place to do this stuff is the VFS, because it can be
 common to all file-systems supporting this feature. Currently it's the same
 with ctime, mtime and atime. These are in the VFS even if there are
 file-systems that don't support all of them.

Well, that is only partly true.  I see lots of places in ext3 that are
setting i_ctime and i_mtime...

  Ugh, this would definitely hurt performance, because ext3_dirty_inode()
  packs-for-disk the whole inode each time.  I believe Stephen had patches
  at one time to do the inode packing at transaction commit time instead
  of when it is changed, so we only do the packing once.  Having a generic
  mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty
  inode to the buffer) would also be useful for other things like doing
  the inode or group descriptor checksum only once per transaction...
 
 afaik, for an open file, there is always a copy of the inode in memory,
 because there is a reference to it in the file structure. So, in principle,
 we shouldn't need to make dirty the inode. I don't know if this is feasable
 perhaps am i missing something here.

The in-memory inode needs to be copied into the buffer so that it is
part of the transaction being committed to disk, or updates are lost.
This was a common bug with early ext3 - marking the inode dirty and
then changing a field in the in-core inode - which would not be saved
to disk.  In other filesystems this is only a few-cycle race, but in
ext3 the in-core inode is not written to disk unless the inode is
again marked dirty.

The potential benefit of making this a callback from the JBD layer is
it avoids copying the inode for EVERY dirty, and only doing it once
per transaction.  Add a list of callbacks hooked onto the transaction
to be called before it is committed, and the callback data is the
inode pointer which does a single ext3_do_update_inode() call if the
inode is still marked dirty.

 it's not strictly necessary; it's not more necessary that the interface to
 ctime or other attributes. It's here for completness, in my opinion the
 change attribute is the same as the ctime time-stamp

Then makes sense to just improve the ctime mechanism instead of adding
new code and interfaces...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext2/3 create large filesystem takes too much time; solutions

2006-09-17 Thread Andreas Dilger
On Sep 16, 2006  16:06 -0400, Theodore Tso wrote:
 On Sat, Sep 16, 2006 at 05:56:02PM +0300, Pavel Mironchik wrote:
  I agree with you and I would prefer to send something more
  serious on that list than those previous patches - I like your
  idea with counters. Btw I assume crc is more preferable than
  just control sum for block group descriptors
 
 Yes, when I said checksum I meant a cyclic redundancy checksum, and
 not an additive checksum...  (and one of the things we can do is to
 build in the superblock UUID into the CRC, so that if the filesystem
 gets recreated we can distinguish an old inode from a new one).

Just to avoid duplication of effort, I'm attaching the current
work-in-progress patches for the uninitialized groups (kernel + e2fsprogs).
They are really at the barely compile stage (if that), but at least
people can look at them and start improving them instead of starting from
scratch.  The patches are based on work done by Anshu Goel
[EMAIL PROTECTED], but have been reworked a fair amount since they
were given to me (i.e. bugs added are mine).  I've been sitting on them
for too long and they should see the light of day instead of continuing
to stagnate.

I also just incorporated Ted's suggestion to include the filesystem UUID
into the checksum.  I previously had added in the group number, so that
if the block is written out to the wrong location it wouldn't verify
correctly.

Things that need to be done:
- the kernel block/inode allocation needs to be reworked:
  - initialize a whole block worth of inodes at one time instead
of single inodes.
  - I don't think we need to zero out the unused inodes - the kernel
should already be doing this if the inode block is unused
  - find a happy medium between using existing groups (inodes/blocks)
and initializing new ones
- we likely need to verify the checksum in more places in e2fsck before
  trusting the UNINIT flags

I won't be able to work more on this for a while, so have at it :-).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Index: linux-stage/fs/ext3/mballoc.c
===
--- linux-stage.orig/fs/ext3/mballoc.c  2006-09-17 00:59:44.0 -0600
+++ linux-stage/fs/ext3/mballoc.c   2006-09-17 00:59:44.0 -0600
@@ -198,7 +198,8 @@
 void ext3_mb_release_blocks(struct super_block *, int);
 void ext3_mb_poll_new_transaction(struct super_block *, handle_t *);
 void ext3_mb_free_committed_blocks(struct super_block *);
-
+unsigned long free_blks_after_init(struct super_block *,
+   struct ext3_group_desc *, int);
 #if BITS_PER_LONG == 64
 #define mb_correct_addr_and_bit(bit,addr)  \
 {  \
@@ -527,7 +528,12 @@
unlock_buffer(bh[i]);
continue;
}
-
+   if (desc-bg_flags  EXT3_BG_BLOCK_UNINIT) {
+   init_block_bitmap(sb, bh[i], desc, first_group + i);
+   set_buffer_uptodate(bh[i]);
+   unlock_buffer(bh[i]);
+   continue;
+   }
get_bh(bh[i]);
bh[i]-b_end_io = end_buffer_read_sync;
submit_bh(READ, bh[i]);
@@ -1525,9 +1531,16 @@
mb_set_bits(bitmap_bh-b_data, ac.ac_b_ex.fe_start, ac.ac_b_ex.fe_len);
 
spin_lock(sb_bgl_lock(sbi, ac.ac_b_ex.fe_group));
+   if (gdp-bg_flags  EXT3_BG_BLOCK_UNINIT) {
+   gdp-bg_flags = ~EXT3_BG_BLOCK_UNINIT;
+   gdp-bg_free_blocks_count =
+   cpu_to_le16(free_blks_after_init(sb, gdp,
+   ac.ac_b_ex.fe_group));
+   }
gdp-bg_free_blocks_count =
cpu_to_le16(le16_to_cpu(gdp-bg_free_blocks_count)
- ac.ac_b_ex.fe_len);
+   gdp-bg_checksum = ext3_group_desc_csum(es, gdp, ac.ac_b_ex.fe_group);
spin_unlock(sb_bgl_lock(sbi, ac.ac_b_ex.fe_group));
percpu_counter_mod(sbi-s_freeblocks_counter, - ac.ac_b_ex.fe_len);
 
@@ -2377,6 +2390,7 @@
spin_lock(sb_bgl_lock(sbi, block_group));
gdp-bg_free_blocks_count =
cpu_to_le16(le16_to_cpu(gdp-bg_free_blocks_count) + count);
+   gdp-bg_checksum = ext3_group_desc_csum(es, gdp, block_group);
spin_unlock(sb_bgl_lock(sbi, block_group));
percpu_counter_mod(sbi-s_freeblocks_counter, count);
 
Index: linux-stage/fs/ext3/ialloc.c
===
--- linux-stage.orig/fs/ext3/ialloc.c   2006-09-17 00:59:44.0 -0600
+++ linux-stage/fs/ext3/ialloc.c2006-09-17 00:59:44.0 -0600
@@ -23,7 +23,6 @@
 #include linux/buffer_head.h
 #include linux/random.h
 #include linux/bitops.h
-
 #include asm/byteorder.h
 
 #include xattr.h
@@ -59,8 +58,14 @@
desc

Re: [RFC] [PATCH] Documentation/filesystems/ext4.txt

2006-10-11 Thread Andreas Dilger
On Oct 10, 2006  15:02 -0500, Dave Kleikamp wrote:
 +  - It's still mke2fs -j /dev/hda1

I would suggest mke2fs -j -O dir_index -I 256 /dev/XXX to be more
representative of what will be used in the future.

 +programs:http://e2fsprogs.sourceforge.net/
 + http://ext2resize.sourceforge.net

You should likely remove ext2resize from this list, it hasn't got any
support for extent-mapped files.


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Design alternatives for fragments/file tail support in ext4

2006-10-13 Thread Andreas Dilger
On Oct 13, 2006  08:23 -0400, Theodore Tso wrote:
 On Fri, Oct 13, 2006 at 02:56:46PM +0400, Alex Tomas wrote:
   Theodore Tso (TT) writes:
   TT I suggest this be tunable by superblock field, and not by a /proc
   TT tunable.  This is the sort of thing which might be different
   TT per-filesystem, and the algorithm will be most effective if the
   TT filesystem always use the same cluster size from the time when it was
   TT first created.  I'd be happy to assign a superblock field for this
   TT purpose, and add the appropriate tune2fs support if we have general
   TT agreement on this point.
  
  that would be good. there is even a stride option to mke2fs?
 
 Yes, there is.  And just as we have -E stride=stripe-size and -E
 resize=max-online-resize, we can also -E cluster-size=bytes parameter
 in mke2fs.  It would also make sense to make this be something that
 can be defaulted in /etc/mke2fs.conf, since even for IDE or SATA disks
 it probably makes sense to make the cluster size be 16k or 32k or
 maybe even higher.  We probably need to do some benchmarks to see
 whether or not this makes sense.

I think what Alex meant is that the mke2fs -E stride= value should
just be put into the superblock.  This would allow tuning the mballoc code
to match the RAID alignment, and would also make life easier for resizers
so they can continue the RAID-stepping of bitmaps that mke2fs does
without having to extrapolate the stride value from the bitmap locations.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ixt3

2006-10-17 Thread Andreas Dilger
On Oct 17, 2006  10:36 +0200, Jure Pečar wrote:
 http://www.cs.wisc.edu/adsl/Publications/iron-sosp05.pdf
 
 I think I got this link from OLS papers or from fs workshop ... and I really 
 like what it says.
 
 How many of ixt3 features are going to be implemented in ext4? 

None, yet.  The journal checksum code has been brought into a basically
ready-to-go state, and there is also work to add checksums to the group
descriptors, so there is some chance that this will be added before ext4
is declared closed.

Nothing done yet to checksum inodes, superblock, file index
or extent blocks.  None of the retry code has been looked at.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ixt3

2006-10-17 Thread Andreas Dilger
On Oct 17, 2006  11:06 +0200, Jure Pečar wrote:
 On Tue, 17 Oct 2006 02:51:30 -0600
 Andreas Dilger [EMAIL PROTECTED] wrote:
  None of the retry code has been looked at.
 
 This is what I'm mostly interested in ... predictable, well behaving
 and tested error handling  recovery. Is ext4 going to be any better
 here than ext3?

Not unless someone massages the ixt3 code into a form usable in the
kernel.  I'd suggest you contact the authors if you are interested in
doing this work.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH, E2FSPROGS] On-disk format for large number of subdirectors

2006-10-18 Thread Andreas Dilger
On Oct 18, 2006  02:24 -0400, Theodore Ts'o wrote:
 On-disk format for large number of subdirectories
 
 - EXT4_FEATURE_RO_COMPAT_DIR_NLINK (0x0020?) - allow directories to have
65000 subdirectories (i_nlinks) by setting i_nlinks = 1 for such
   directories.  RO_COMPAT protects old filesystems from unlinking such
   directories incorrectly and losing all files therein.  
 
 Signed-off-by: Theodore Ts'o [EMAIL PROTECTED]

Looks good.  Will try to get a new patch out with this support.

 Index: e2fsprogs/lib/ext2fs/ext2_fs.h
 ===
 --- e2fsprogs.orig/lib/ext2fs/ext2_fs.h   2006-10-18 01:42:48.0 
 -0400
 +++ e2fsprogs/lib/ext2fs/ext2_fs.h2006-10-18 01:49:51.0 -0400
 @@ -71,7 +71,7 @@
  /*
   * Maximal count of links to a file
   */
 -#define EXT2_LINK_MAX32000
 +#define EXT2_LINK_MAX65000
  
  /*
   * Macro-instructions used to manage several block sizes
 @@ -606,6 +606,7 @@
  /* #define EXT2_FEATURE_RO_COMPAT_BTREE_DIR  0x0004 not used */
  #define EXT4_FEATURE_RO_COMPAT_HUGE_FILE 0x0008
  #define EXT4_FEATURE_RO_COMPAT_GDT_CSUM  0x0010
 +#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
  
  #define EXT2_FEATURE_INCOMPAT_COMPRESSION0x0001
  #define EXT2_FEATURE_INCOMPAT_FILETYPE   0x0002

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

2006-10-18 Thread Andreas Dilger
On Oct 18, 2006  16:56 -0500, Eric Sandeen wrote:
 Andreas Dilger wrote:
  The directory leaf data is kept in
  the page cache and there is a helper function ext2_check_page() to mark
  the page checked.  That means the page only needs to be checked once
  after being read from disk, instead of each time through readdir.
 
 ah, sure.  Hm...  well, this might be a bit of a performance hit if it's
 checking cached data... let me think on that.

Well, having something like ext3_dir_bread() that verifies the leaf block
once if (!uptodate()) would be almost the same as ext2 with fairly little
effort.  It would help performance in several places, at the slight risk
of not handling in-memory corruption after the block is read...

  I'm not sure whether this is a win or not.  It means that if there is ever
  a directory with a bad leaf block any entries beyond that block are not
  accessible anymore.  
 
 I'm amazed at how hard ext3 works to cope with bad blocks ;-)

It would fail all of your tests otherwise, right?  That is one virtue of
ext2 having grown up in the days when bad blocks existed.  Those days are
(sadly) coming back again, hence desire for fs-level checksums, etc.

  The existing !bh case already marks the filesystem in
  error.  Maybe as a special case we can check in if (!bh) if i_size and
  i_blocks make sense.  Something like:
  
  if (!bh) {
  :
  :
  +   if (filp-f_pos  inode-i_blocks  9) {
  +   break;
  filp-f_pos += sb-s_blocksize - offset;
  continue;
  }
  
  This obviously won't help if the whole inode is bogus, but then nothing
  will catch all errors.
 
 Yep, I'd thought maybe a size vs. blocks test might make sense; I think
 there can never legitimately be a sparse directory?

Not currently, though there was some desire to allow this during htree
development, to allow shrinking large-but-empty directories.  Since this
already provokes an ext3_error() (which might be a panic()) to hit a hole
we can assume that this needs to be carefully implmemented.

 I guess if the intent is to soldier on in the face of adversity, it
 doesn't matter if it's an umappable offset or an IO error; ext3 wants to
 go ahead  try the next one block anyway.  So the size test probably
 makes sense as a stopping point.

Well, it would also be possible to look into inode-i_blocks to see what
blocks exist past this offset, but that is complicated by the introduction

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH, E2FSPROGS] On-disk format for inode extra size control inode size

2006-10-18 Thread Andreas Dilger
On Oct 18, 2006  18:16 -0400, Theodore Tso wrote:
 On Wed, Oct 18, 2006 at 01:34:38PM -0600, Andreas Dilger wrote:
  There was some discussion about moving the i_ctime_extra field into the
  small inode, for use as a version field by NFSV4.  The proposed field was
  l_i_reserved2 in the original Bull patch.
 
 The potential problem with this is what if program makes a huge number
 of changes to inodes, such that we end up having to bump the ctime
 field into the future?

Well, if you are changing the inode billions of times/second for any
extended amount of time, then yes that might be possible.  Normally
setting it to the current ctime and incrementing the nsec timestamp
on the rare cases two threads are updating it at once is sufficient.
Note that in this version is only critical to compare for a single
inode, and not filesystem-wide.

 But I have another question --- why does this the change attribute
 counter have to be a persistent value that takes up space in the inode
 at all?  This is strictly a NFSv4 only hack, and NFSv4 simply requires
 a 64-bit increasing counter when any inode attributes have changed, so
 it can determine whether client-side caches are refreshed.

Well, Lustre has a need for this also, in order to know if a particular
modification to an inode has been done or not.  Using the nsec ctime is
a way of saving space in the inode, while at the same time allowing two
operations to be ordered relative to each other.  This needs to be
persistent across a server reboot for both NFSv4 and Lustre.

Since i_ctime_extra needs to be stored on disk anyways, I don't see why
there would be an objection to having it serve a dual purpose.  For
Lustre we don't actually care whether it is in the core inode, since we
always format filesystems with large inodes anyways, but in consideration
of NFSv4 exporting existing ext3 filesystems I suggested i_ctime_extra go
into the core inode.

 So let the high 32-bits of the attribute value be the time in seconds
 that the system was booted (or the filesystem was mounted; I don't
 care), and let the low 32-bit values be a value which is stored in the
 in-core inode which is set to a global counter value which is
 incremented whenever inode is changed.  If the in-core inode gets
 dropped and then later reloaded, it will get a newer global counter
 value, which may forced some clients to reload their cache when they
 may not have needed to --- and the same would happen if the server
 gets rebooted, since the high 32-bits would get bumped to the new
 mount time.

This would work, I suppose.  It doesn't allow this value to be exported
to userspace without additional API changes as the i_ctime_extra one can.
The initial Bull patch consumed an extra field in struct stat, stat64,
struct inode, and struct iattr in order to allow userspace access also.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

2006-10-19 Thread Andreas Dilger
On Oct 18, 2006  19:26 -0500, Eric Sandeen wrote:
 Well, it would also be possible to look into inode-i_blocks to see what
 blocks exist past this offset, but that is complicated by the introduction
  eom
 introduction of ...?  :)

Sorry - introduction of extents.  So we can't just look into the i_blocks
{d,t,}indirect blocks to work out the maximum reasonable size for an inode
without adding decoding of extents into this code.  Maybe if SEEK_DATA
is added to ext3 (patch was proposed this past week) then we could seek
past the hole efficiently.  For now I'm happy to assume i_blocks * 512 is
a safe upper limit on the file size.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

2006-10-19 Thread Andreas Dilger
On Oct 19, 2006  17:43 -0500, Eric Sandeen wrote:
 Eric Sandeen wrote:
  How about just tweaking the existing ext3_bread so that it lets the
  caller know whether or not it found an uptodate buffer?  Seems
  conceivable that more than just the dir code might want to do a data
  sanity check, based on if this is a fresh read or not.
  
  Could maybe even change the *err argument to a *retval; negative on
  errors, else 0 == not read (found uptodate), 1 == fresh read (not found
  uptodate).  Or is that too much overloading...
 
 I played around with this a little bit today, and it seems to have some
 tangible results.  A fairly unsophisticated test of running find over
 my whole root filesystem 10 times :) with and without re-checking cached
 directory entries, yielded about a 10% speedup when skipping the re-checks.
 
 Is this something we want to do?  Are we comfortable with only checking
 directory entries the first time they are read from disk?

Well, we already do this on ext2 without noticable problems.  As you say,
if we are getting memory corruption we are in for a world of hurt in other
areas.  The only case that might be worth checking inside the loop is if
rec_len == 0, so that we don't spin on a bad entry forever.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Andreas Dilger
On Oct 23, 2006  18:31 +0400, Alex Tomas wrote:
 isn't that a kernel responsbility to find/allocate target blocks?
 wouldn't it better to specify desirable target group and minimal
 acceptable chunk of free blocks?

In some cases this is useful (e.g. if file has small fragments after
being written in small pieces or in a fragmented free space).  In other
cases the user tool HAS to be able to specify the new mapping in order
to make progress.

Consider if there are two very large fragmented files and user-space
defrag tool wants to make contiguous free space.  If kernel is left to do
allocation it will always consume the largest chunk of free space first,
even if it is not yet optimal (e.g. large 1MB aligned extent).

I would make this interface optionally allow the target extent to be
specified, but if target block == 0 then the kernel is free to do its
own allocation.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Andreas Dilger
On Oct 23, 2006  10:16 -0400, Theodore Tso wrote:
 As a suggestion, I would pass the inode number and inode generation
 number into the ext3_file_mode_data array:
 
 struct ext3_file_move_data {
   int extents;
   struct ext3_reloc_extent __user *ext_array;
 };
 
 This will be much more efficient for the userspace relocator, since it
 won't need to translate from an inode number to a pathname, and then
 try to open the file before relocating it.
 
 I'd also use an explicit 64-bit block numbers type so that we don't
 have to worry about the ABI changing when we support 64-bit block
 numbers.

I would in fact go so far as to allow only a single extent to be specified
per call.  This is to avoid the passing of any pointers as part of the
interface (hello ioctl police :-), and also makes the kernel code simpler.
I don't think the syscall/ioctl overhead is significant compared to the
journal and IO overhead.

Also, I would specify both the source extent and the target extent in
the inode.  This first allows defragmenting only part of the file
instead of (it appears) requiring the whole file to be relocated.  That
would be a killer if the file being defragmented is larger than free
space.  It secondly provides a level of insurance that what the kernel
is relocating matches what userspace thinks it is doing.  It would
protect against problems if the kernel ever does block relocation
itself (e.g. merge fragments into a single extent on (re)write, or for
snapshot/COW).

 The other problem I see with this patch is that there will be cache
 coherency problems between the buffer cache and the page cache.  I
 think you will want to pull the data blocks of the file into the page
 cache, and then write them out from the page cache, and only *then*
 update the indirect blocks and commit the transaction.

Alternately (maybe even better) is to treat it as O_DIRECT and ensure
the page cache is flushed.  This also avoids polluting the whole page
cache while running a defragmenter on the filesystem.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3: bogus i_mode errors with 2.6.18.1

2006-10-23 Thread Andreas Dilger
On Oct 23, 2006  16:45 +0200, Andre Noll wrote:
 stress tests on a 6.3T ext3 filesystem which runs on top of software
 raid 6 revealed the following:
 
 [663594.224641] init_special_inode: bogus i_mode (4412)
 [663596.355652] init_special_inode: bogus i_mode (5123)
 [663596.355832] init_special_inode: bogus i_mode (71562)

This would appear to be inode table corruption.

 [663763.319514] EXT3-fs error (device md0): ext3_new_block: Allocating block 
 in system zone - blocks from 517570560, length 1

This is bitmap corruption.

 [663763.339423] EXT3-fs error (device md0): ext3_free_blocks: Freeing blocks 
 in system zones - Block = 517570560, count = 1

Hmm, this would appear to be a buglet in error handling.  If the block just
allocated above is in the system zone it should be marked in-use in the
bitmap but otherwise ignored.  We definitely should NOT be freeing it on
error.

Yikes!  It seems a patch I submitted to 2.4 that fixed the behaviour
of ext3_new_block() so that if we detect this block shouldn't be
allocated it is skipped instead of corrupting the filesystem if it
is running with errors=continue...

It looks like ext3_free_blocks() needs a similar fix - i.e. report an
error and don't actually free those blocks.

 This system is currently not in production use, so I can use it for tests ATM.

I would suggest that you give this a try on RAID5, just for starters.
I'm not aware of any problems in the existing ext3 code for anything
below 8TB.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3: bogus i_mode errors with 2.6.18.1

2006-10-24 Thread Andreas Dilger
On Oct 24, 2006  11:14 +0200, Andre Noll wrote:
 On 14:02, Andreas Dilger wrote:
 Something like the this? (only compile tested). And no, I do _not_ know,
 what I'm doing ;)

Don't worry, everyone starts out not knowing what they are doing.
The ext3_free_blocks() part looks OK from a cursory review.

 @@ -1372,12 +1370,24 @@ allocated:
   in_range(ret_block, le32_to_cpu(gdp-bg_inode_table),
 EXT3_SB(sb)-s_itb_per_group) ||
   in_range(ret_block + num - 1, le32_to_cpu(gdp-bg_inode_table),
 +   EXT3_SB(sb)-s_itb_per_group)) {
 + int j;
 + ext3_error(sb, __FUNCTION__,
   Allocating block in system zone - 
   blocks from E3FSBLK, length %lu,
ret_block, num);
 -
 + /* Note: This will potentially use up one of the handle's
 +  * buffer credits.  Normally we have way too many credits,
 +  * so that is OK.  In _very_ rare cases it might not be OK.
 +  * We will trigger an assertion if we run out of credits,
 +  * and we will have to do a full fsck of the filesystem -
 +  * better than randomly corrupting filesystem metadata.
 +  */
 + j = find_next_usable_block(-1, gdp, EXT3_BLOCKS_PER_GROUP(sb));

I'm not sure why the find_next_usable_block() part is in here?  At this
point we KNOW that ret_block is not a block we should be allocating, yet
it is marked free in the bitmap.  So we should just mark the block(s) in-use
in the bitmap and look for a different block(s).

 + if (j = 0)
 + ext3_set_bit(j, gdp_bh-b_data);

Note that we need to loop over num blocks and set any bits that should
be set, as is done in the ext3_free_blocks() code.  This function has
changed since 2.4 in that it can now potentially allocate multiple blocks.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-26 Thread Andreas Dilger
On Oct 25, 2006  16:54 +0200, Jan Kara wrote:
 I've just not yet decided how to handle indirect
 blocks in case of relocation in the middle of the file. Should they be
 relocated or shouldn't they? Probably they should be relocated at least
 in case they are fully contained in relocated interval or maybe better
 said when all the blocks they reference to are also in the interval
 (this handles also the case of EOF). But still if you would like to
 relocate the file by parts this is not quite what you want (you won't be
 able to relocate indirect blocks in the boundary of intervals) :(.

I suspect that the natural choice for metadata blocks is to keep the
block which has the most metadata unchanged.  For example, if you are
doing a full-file relocation then you would naturally keep all of the
new {dt}indirect blocks.  If you are relocating a small chunk of the
file you would keep the old {dt}indirect blocks and just copy a few
block pointers over.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3: bogus i_mode errors with 2.6.18.1

2006-10-26 Thread Andreas Dilger
On Oct 25, 2006  11:44 +0200, Andre Noll wrote:
 Are you saying that ext3_set_bit() should simply be called with
 ret_block as its first argument? If yes, that is what the revised
 patch below does.

You might need to call ext3_set_bit_atomic() (as claim_block() does,
not sure.

 @@ -1372,12 +1370,21 @@ allocated:
   in_range(ret_block, le32_to_cpu(gdp-bg_inode_table),
 EXT3_SB(sb)-s_itb_per_group) ||
   in_range(ret_block + num - 1, le32_to_cpu(gdp-bg_inode_table),
 +   EXT3_SB(sb)-s_itb_per_group)) {
 + ext3_error(sb, __FUNCTION__,
   Allocating block in system zone - 
   blocks from E3FSBLK, length %lu,
ret_block, num);
 + /* Note: This will potentially use up one of the handle's
 +  * buffer credits.  Normally we have way too many credits,
 +  * so that is OK.  In _very_ rare cases it might not be OK.
 +  * We will trigger an assertion if we run out of credits,
 +  * and we will have to do a full fsck of the filesystem -
 +  * better than randomly corrupting filesystem metadata.
 +  */
 + ext3_set_bit(ret_block, gdp_bh-b_data);
 + goto repeat;
 + }

The other issue is that you need to potentially set num bits in the
bitmap here, if those all overlap metadata.  In fact, it might just
make more sense at this stage to walk all of the bits in the bitmaps,
the inode table and the backup superblock and group descriptor to see
if they need fixing also.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3: bogus i_mode errors with 2.6.18.1

2006-10-28 Thread Andreas Dilger
On Oct 27, 2006  17:34 +0200, Andre Noll wrote:
 On 12:01, Andreas Dilger wrote:
  Well, since we know at least one bit needs fixing and results in the block
  being written to disk then setting the bits for all of the other metadata
  blocks in this group has no extra IO cost (only a tiny amount of CPU).
  Re-setting the bits if they are already set is not harmful.
 
 I.e, something like
 
 int i;
 ext3_fsblk_t bit;
 unsigned long gdblocks = EXT3_SB(sb)-s_gdb_count;
 
 for (i = 0, bit = 1; i  gdblocks; i++, bit++)
 ext3_set_bit(bit, gdp_bh-b_data);
 
 Is that correct?

Well, it needs to also handle backup superblock, bitmaps, inode table:

if (ext3_bg_has_super())
ext3_set_bit(0, gdp_bh-b_data);
gdblocks = ext3_bg_num_gdb(sb, group);
for (i = 0, bit = 1; i  gdblocks; i++, bit++)
/* actually a bit more complex for INCOMPAT_META_BG fs */
ext3_set_bit(i, gdp_bh-b_data);
ext3_set_bit(gdp-bg_inode_bitmap % EXT3_BLOCKS_PER_GROUP(sb), ...);
ext3_set_bit(gdp-bg_block_bitmap % EXT3_BLOCKS_PER_GROUP(sb), ...);
for (i = 0, bit = gdp-bg_inode_table % EXT3_BLOCKS_PER_GROUP(sb);
 i  EXT3_SB(sb)-s_itb_per_group; i++, bit++)
ext3_set_bit(i, gdp_bh-b_data);

(or something close to this).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Shred mount option for ext4?

2006-11-01 Thread Andreas Dilger
On Oct 31, 2006  15:14 -0500, Nikolai Joukov wrote:
 1. One of the patches performs N overwrites with configurable patterns
 (can comply with NIST and NISPOM standards).  Because of the transaction
 compaction we had to separately add overwriting as separate transactions.
 Fortunately, the whole procedure is still atomic due to the orphan list.
 The problem that we have right now is per-file syncing of dirty data
 buffers between overwrites.  We sync the whole device at the moment.

Did anyone discuss doing this with crypto instead of actually overwriting
the whole file?  It would be pretty easy to store a per-file crypto key
in each inode as an EA, then to delete the file all that would be
needed would be to erase the key in a secure matter (which is a great
deal easier because inodes don't move around on disk).

The drawback is there is a runtime overhead to encrypt/decrypt the file
data, but honestly, if people care about secure deletion don't they also
care about security of the undeleted data also?  By having an (unknown
to the user) per-file crypto key then if the file is deleted the user
can also plausibly deny the ability to recover the file data even if
they are forced to surrender their key.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext2 readdir/lookup/check_page behavior

2006-11-14 Thread Andreas Dilger
On Nov 14, 2006  13:38 -0600, Eric Sandeen wrote:
 Andreas Dilger wrote:
  It would make sense to fix ext2 in the same way.
  I'd suggest bailing out early == min(i_size  blocksize, i_blocks).
  The i_blocks count is an upper limit, because it includes the overhead of
  indirect blocks.  Directories cannot be sparse.
 
 so we could either a) keep processing pages based on i_size, until we
 have passed i_blocks, or b) if i_size  i_blocks don't match,
 immediately bail out because we know we have found a corrupted inode
 (vs. a normal unreadable block...)

Do we already ext3_error() in this case?  That allows the admin to determine
the behaviour already.  If it is errors=continue or errors=remount-ro then
we should continue I think.  We might consider the inode fatally corrupted

if (i_blocks  9  i_size ||
i_blocks  i_size  (blockbits - 8) + /* blocks */
i_size  (blockbits * 2 - 8 - 2) + /* indirect */
i_size  (blockbits * 3 - 8 - 2) + /* dindirect */
i_size  (blockbits * 4 - 8 - 2))  /* tindirect */

I think... Trying to account for indirect blocks.  It is already given a
100% margin (-8 instead of -9) to cover rounding, EA blocks, some small
bugs in block counting, extents format, etc.  FYI, the -2 is 4 bytes/addr.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 0/4] BIG_BG: support of large block groups

2006-12-01 Thread Andreas Dilger
On Nov 30, 2006  14:41 -0500, Theodore Tso wrote:
 * We ignore the problem, and accept that there are some kinds of
 filesystem corruptions which e2fsck will not be able to fix --- or at
 least not without adding complexity which would allow it to relocate
 data blocks in order to make a contiguous range of blocks to be used
 for the allocation bitmaps.  
 
 The last alternative sounds horrible, but if we assume that some other
 layer (i.e., the hard drive's bad block replacement pool) provides us
 the illusion of a flawless storage media, and CRC to protect metadata
 will prevent us from relying on an corrupted bitmap block, maybe it is
 acceptable that e2fsck may not be able to fix certain types of
 filesystem corruption.

I'd agree that even with media errors, the bad-block replacement pool
is almost certainly available to handle this case.  Even if there are
media errors on the read of the bitmap, they will generally go away
if the bitmap is rewritten (because of relocation).  At worst, we
would no longer allow new blocks/inodes to be allocated that are
tracked by that block, and if we are past 256TB then the sacrifice
of 128MB of space is not fatal.  It wouldn't even have to impact any
files that are already allocated in that space.

 without any of these protections, I'd want to keep the block group
 size under 32k so we can avoid dealing with these issues for as long
 as possible.  Even if we assume laptop drives will double in size
 every 12 months, we still have a good 10+ years before we're in danger
 of seeing a 512TB laptop drives.  :-)

Agreed, I think there isn't any reason to increase the group size
unless it is really needed, or it is specified with mke2fs -g {blocks}
or the number of inodes requires it.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [patch 2/3] change attribute for ext4: ext4 specific code

2006-12-14 Thread Andreas Dilger
On Dec 14, 2006  11:03 -0500, Theodore Tso wrote:
 There was discussion on yesterday's call about whether or not 32-bit
 was enough for NFSv4, or whether it also requried 64-bits of change
 notification in the RFC's.  So one of the questions is whether this is
 something that would justify requiring 64-bits --- and if so, maybe we
 need to require that big inodes be used and store the entire 64-bit
 value beyond 128 bytes.  This would mean that NFSv4 cache management
 couldn't be fully implemented without big inodes, or we'd have to make
 do by using the inode ctime as a partial substitute.

Per Trond and Bruce Field's reply to my email it seems that NFSv4 only
needs the version to compare for inequality.  If the change numbers are
sequential for a given inode it can OPTIONALLY extract additional
information about the server (i.e. it still has an up-to-date cache
because it was the only one that did an update on a given file).

So, I think for basic NFSv4 setups that 2^32 is sufficient (per Bull's
original patch) but 2^64 is desirable to avoid collisions and allow the
sequential updates logic to work properly for long-lived files.

So, I think a 32-bit field in the small inode, and an additional 32-bit
field in the large inode would be perfect.  It allows this functionality
to work with existing ext3 filesystems, if not quite optimally.

In addition, for Lustre, could we get a 64-bit field in the superblock
which contains the fs-wide version number.

I'm proposing that, per the original Bull patch, l_i_reserved1 be changed
to be i_version for linux, and we add i_version_hi after cr_time_extra in
the large inode.  The disk i_version would be stored in the vfs_inode
i_version (which is already used for this same purpose).  It would be good
for NFSv4 if the i_version field could be expanded to 64 bits to avoid
the need for it to have fs-specific operations, but failing that we can
put the high word into ext4_inode_info and NFS can access it via
export_operations I think.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 0/3] ext4 online defrag (ver 0.2)

2007-01-19 Thread Andreas Dilger
On Jan 19, 2007  14:19 +0900, Takashi Sato wrote:
 On Jan 16, 2007  21:03 +0900, [EMAIL PROTECTED] wrote:
 1. Add new ioctl(EXT4_IOC_DEFRAG) which returns the first physical
block number of the specified file.  With this ioctl, a command
gets the specified directory's.
 
 Maybe I don't understand, but how is this different from the long-time
 FIBMAP ioctl?
 
 I can use FIBMAP instead of my new ioctl.
 You are right.  I should have used FIBMAP ioctl...
 
 I have to get the physical block number of the specified directory.
 But FIBMAP is available only for a regular file, not for a directory.
 So I will use my new ioctl.

Though it might make sense to implement FIBMAP for a directory, to keep
it consistent and allow user-space tools like filefrag to work on
directories also.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [patch 0/3] i_version update for ext4

2007-01-23 Thread Andreas Dilger
On Jan 23, 2007  18:23 +0100, Cordenner jean noel wrote:
 I've updated what was previously the change attribute patch for ext4 
 initially posted by Alexandre Ratchov. The previous patch was 
 introducing a change_attribute field, now it uses the i_version field of 
 the inode.
 
 The i_version field is a counter that is set on every inode creation and 
 that is incremented every time the inode data is modified (similarly to 
 the ctime time-stamp).
 The aim is to fulfill NFSv4 requirements for rfc3530.
 For the moent, the counter is only a 32bit value but it is planned to be 
 64bit as required.
 
 The patch is divided into 3 parts, the vfs layer, the ext4 specific code 
 and an user part to check i_version changes via stat.

Have you had a chance to look at the performance impact of this change
(possible with oprofile)?  Always marking the inodes dirty for ext3
may have some noticable overhead.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [patch 3/3] i_version update for ext4: user interface

2007-01-23 Thread Andreas Dilger
On Jan 23, 2007  18:24 +0100, Cordenner jean noel wrote:
 This patch adds a  ``st_i_version'' field in the stat
 structure, and modifies the stat(2) syscall accordingly. Currently the
 change is only visible on i386 and x86_64 archs.

What is the need for exporting i_version to userspace?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/2] e2fsprogs: user selectable dup block handling in fsck

2007-01-31 Thread Andreas Dilger
On Jan 31, 2007  08:22 -0800, Jim Garlick wrote:
 It also adds a check to make sure only one -E option is passed
 on the command line as -E option parsing is not cumulative.

 @@ -633,6 +639,8 @@ static errcode_t PRS(int argc, char *arg
   case 'E':
 + if (extended_opts)
 + fatal_error(ctx, _(-E must only be 
 specified once));
   extended_opts = optarg;

In such cases I've usually just changed the code to do the parsing
as the option is passed.  Otherwise, it isn't possible to override
previously-specified options. This sometimes is needed if you have an
alias or script that is passing a bunch of options, and in some rare
cases you don't want the default, e.g.

alias mye2fsck=e2fsck -f -p -E clone=dup

# mye2fsck -y -E clone=zero /dev/really-broken

Ted, is there a reason that the call to parse_extended_opts() can't
just be moved in place of saving the options in extended_opts?  I
can't see anything in -E (yet) that depends on other options that
might not be set yet.

Also, it looks like that function leaks the duplicated string in buf,
since that variable goes out of scope without freeing the allocation.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [BUG -mm] ext3_orphan_add() accessing corrupted list on a corrupted ext3fs

2007-02-01 Thread Andreas Dilger
 to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Resubmit][PATCH 5/5] Secure Deletion and Trash-Bin Support for Ext4

2007-02-01 Thread Andreas Dilger
On Feb 01, 2007  12:41 -0800, Mingming Cao wrote:
 On Thu, 2007-02-01 at 19:32 +, Dave Kleikamp wrote:
   Unfortunately, we need some file system-specific code to access per-file
   secure deletion and per-file trash bit attributes.  These attributes are
   supported only by some file systems and in different ways.  
 
 The check for fs specific attributes has to be underlying fs code.  But
 the code the handling the secure delete and trash bin (although now is
 only two functions being called) are identical for all fs, could be move
 to VFS layer.
 
  Yeah, I did see that.  I wonder adding some inode or file operation just
  to query the existence of those attributes (or something more generic)
  would be too ugly.
 
 I gave a brief thought on that yesterday, it was not very pretty:)

Actually, the major filesystems (ext3, reiserfs, jfs, xfs) all use the
same lsattr/chattr ioctl as ext2 (EXT2_IOC_GETFLAGS).  Maybe this code
can just do an ioctl inside the kernel?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH 1/1] nanosecond timestamps

2007-02-07 Thread Andreas Dilger
On Feb 05, 2007  23:09 -0500, Theodore Tso wrote:
 On Fri, Feb 02, 2007 at 08:09:40PM +0530, Kalpak Shah wrote:
  This patch is a spinoff of the old nanosecond patches. It includes some
  cleanups and addition of a creation timestamp. The
  EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with
  s_{min, want}_extra_isize fields in struct ext3_super_block. 
 
 Thanks for sending it.  I haven't had a chance to go over it in detail
 yet, but one quick comment; the patch looks like it got line-wrapped
 by your mail agent (looks like you're using Evolution 2.0).  Could you
 send it as a text/plain attachment, or otherwise fix your mailer to
 not wrap your patches?  
 
 It might be nice if the patch had a way of adjusting the granularity
 by masking off bits of the nanoseconds field, so we can benchmark how
 much overhead constantly updating the ctime field is going to cost us.  

Can anyone suggest a benchmark which will test this area?  Bull had done
some testing with the inode version patch (it also forces an update for
every change to an inode) and reported no noticable performance loss.
That could have been because of CPU headroom available to do repeat copies
of the in-core inode to the on-disk inode, which may hurt in a more CPU
constrained environment (other server code, multiple filesystems, etc).

Before we go to changing the patch, we may as well start by just testing
before vs. after patch (still using large inodes, for consistency).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH 1/1] Nanosecond timestamps

2007-02-07 Thread Andreas Dilger
On Feb 06, 2007  16:12 +0100, Johann Lombardi wrote:
  +   if (sbi-s_inode_size  EXT3_GOOD_OLD_INODE_SIZE) {
  +   EXT3_SB(sb)-s_want_extra_isize = sizeof(struct ext3_inode) - 
  EXT3_GOOD_OLD_INODE_SIZE;

 Maybe EXT3_SB(sb)- could be replaced by sbi- here and in the lines below.

Yes, this should definitely be done.  It also increases clarity between
sbi-s_want_extra_isize and es-s_want_extra_isize.

  +   if (EXT3_SB(sb)-s_want_extra_isize 
  +   le32_to_cpu(es-s_min_extra_isize))
 ^^
  +   EXT3_SB(sb)-s_want_extra_isize =
  +   le32_to_cpu(es-s_min_extra_isize);
 ^^
 Since es-s_{min,want}_extra_isize are both __u16 (BTW, shouldn't it be
 __le16?), I think you should use le16_to_cpu() instead of le32_to_cpu().

You are right - this works fine on little endian systems, but fails on
big endian systems where you will get the other half of the word.

This has been a bug in several places already, and I wonder if the
le*_to_cpu() and cpu_to_le*() macros shouldn't do some type checking
instead of just casting the variable to the specified type?

The only problem is if casting constants it would be a bit of a pain
to have to cast them explicitly, though we could have something like:

#define le16_to_cpu(var) (__builtin_constant(var) || !typecheck(__u16, var) ? \
  __constant_cpu_to_le16(var) : __le16_to_cpu(var))
   
The only question is whether typecheck adds extra variables on the stack
or if the compiler will always optimize them away.

  +   /* Check if enough inode space is available */
  +   if (EXT3_GOOD_OLD_INODE_SIZE + EXT3_SB(sb)-s_want_extra_isize 
  +   sbi-s_inode_size) {
  +   EXT3_SB(sb)-s_want_extra_isize = sizeof(struct ext3_inode) 
  - EXT3_GOOD_OLD_INODE_SIZE;
  +   printk(KERN_INFO EXT3-fs: required extra inode space not
  +   available.\n);
  +   }
 
 If the inode size is EXT3_GOOD_OLD_INODE_SIZE, sbi-s_want_extra_isize won't
 be initialized. However, it should not be an issue because the ext3_sb_info
 is set to zero in ext3_fill_super().

So I'm not sure I understand if you have an objection or if this is just a
comment.  sbi-s_want_extra_isize will be zero and it is not possible for
sbi-s_inode_size  EXT3_GOOD_OLD_INODE_SIZE so this case won't be hit.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Testing ext4 persistent preallocation patches for 64 bit features

2007-02-07 Thread Andreas Dilger
On Feb 07, 2007  16:06 +0530, Suparna Bhattacharya wrote:
 On Wed, Feb 07, 2007 at 12:25:50AM -0800, Mingming Cao wrote:
  - disable preallocation if the filesystem free blocks is under some low
  watermarks, to save space for near future real block allocation?
 
 A policy decision like this is probably worth a discussion during today's 
 call.
 
  - is de-preallocation something worth doing?

As discussed in the call - I don't think we can remove preallocations.
The whole point of database preallocation is to guarantee that this space
is available in the filesystem when writing into a file at random offsets
(which would otherwise be sparse).

Similarly, persistent preallocation shouldn't be considered differently
than an efficient way of doing zero filling of blocks.  At least that is
my understanding...  Is this code implementing the uninitialized extents
for databases (via explicit preallocation via fallocate/ioctl) so that
they don't have to zero-fill large files, or is there also automatic
preallocation of space to files (e.g. for O_APPEND files)?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] e2fsck journal recovery can corrupt all superblock backups

2007-02-08 Thread Andreas Dilger
Ted,
this was sent with the first patch, and it looks like it is a very
serious problem.

Looking through the e2fsck code it would also seem possible to move the
setting of EXT2_FLAG_MASTER_SB_ONLY before the journal replay.  That
is the second patch.  I'm not sure which one is better.

Jim Garlick wrote:
 When fsck replays the journal, it clears the EXT3_FEATURE_INCOMPAT_RECOVER
 feature, dirties the superblock, and closes the file system.
 Unfortunately, the file system EXT2_FLAG_MASTER_SB_ONLY flag is not set
 at this time, so it copies the primary superblock and group descriptors
 over all the backups.  Then fsck restarts and checks the superblock for
 consistancy.  If the superblock or group descriptors are then found to
 be bad, all your backups are now also bad.

Index: e2fsprogs+chaos/lib/ext2fs/openfs.c
===
--- e2fsprogs+chaos.orig/lib/ext2fs/openfs.c
+++ e2fsprogs+chaos/lib/ext2fs/openfs.c
@@ -101,6 +101,8 @@ errcode_t ext2fs_open2(const char *name,
memset(fs, 0, sizeof(struct struct_ext2_filsys));
fs-magic = EXT2_ET_MAGIC_EXT2FS_FILSYS;
fs-flags = flags;
+   /* don't overwrite sb backups unless flag is explicitly cleared */
+   fs-flags |= EXT2_FLAG_MASTER_SB_ONLY; 
fs-umask = 022;
retval = ext2fs_get_mem(strlen(name)+1, fs-device_name);
if (retval)

---

Index: e2fsprogs/e2fsck/unix.c
===
--- e2fsprogs.orig/e2fsck/unix.c2006-12-27 17:12:23.0 -0700
+++ e2fsprogs/e2fsck/unix.c 2007-02-08 22:05:13.0 -0700
@@ -1153,6 +1153,15 @@ restart:
}
 
/*
+* We only update the master superblock because (a) paranoia;
+* we don't want to corrupt the backup superblocks, and (b) we
+* don't need to update the mount count and last checked
+* fields in the backup superblock (the kernel doesn't
+* update the backup superblocks anyway).
+*/
+   fs-flags |= EXT2_FLAG_MASTER_SB_ONLY;
+
+   /*
 * Check to see if we need to do ext3-style recovery.  If so,
 * do it, and then restart the fsck.
 */
@@ -1227,15 +1236,6 @@ restart:
!(ctx-options  E2F_OPT_READONLY))
ext2fs_mark_super_dirty(fs);
 
-   /*
-* We only update the master superblock because (a) paranoia;
-* we don't want to corrupt the backup superblocks, and (b) we
-* don't need to update the mount count and last checked
-* fields in the backup superblock (the kernel doesn't
-* update the backup superblocks anyway).
-*/
-   fs-flags |= EXT2_FLAG_MASTER_SB_ONLY;
-
ehandler_init(fs-io);
 
if (ctx-superblock)

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e2fsprogs coverity patch cid-4.diff

2007-02-12 Thread Andreas Dilger
On Feb 09, 2007  18:11 -0800, Brian D. Behlendorf wrote:
 This check is unnecessary since fs_type is guaranteed to be set earlier in the
 function.

I'm not positive this is the right fix.  If we are creating a journal device,
we shouldn't be using the normal defaults based on the size of the journal
device because this will often be much smaller than the actual filesystem.

I'd argue that we set fs_type=journal earlier when fs_type is being first
set, if INCOMPAT_JOURNAL_DEV is set.

 Index: e2fsprogs+chaos/misc/mke2fs.c
 ===
 --- e2fsprogs+chaos.orig/misc/mke2fs.c
 +++ e2fsprogs+chaos/misc/mke2fs.c
 @@ -1310,8 +1310,6 @@ static void PRS(int argc, char *argv[])
  
   if (fs_param.s_feature_incompat  
   EXT3_FEATURE_INCOMPAT_JOURNAL_DEV) {
 - if (!fs_type)
 - fs_type = journal;
   reserved_ratio = 0;
   fs_param.s_feature_incompat = EXT3_FEATURE_INCOMPAT_JOURNAL_DEV;
   fs_param.s_feature_compat = 0;

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: booked-page-flag.patch

2007-02-16 Thread Andreas Dilger
On Feb 16, 2007  00:06 -0800, Andrew Morton wrote:
 On Fri, 16 Feb 2007 10:56:55 +0300 Alex Tomas [EMAIL PROTECTED] wrote:
  well, even so we need to reserve not that block ony, but
  also needed metadata (for the worst case).
 
 Right.  Reserve all the blocks for the page and all the metadata for a page
 at that file offset.  Worst case.

This only really becomes an issue when the filesystem (+reservations)
is close to being full.  If we are close enough to worry about running
out of space we can also refuse to do delayed allocation.
 
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: booked-page-flag.patch

2007-02-16 Thread Andreas Dilger
On Feb 15, 2007  09:18 -0800, Eric Sandeen wrote:
 This was for the delayed allocation patchset, right; and by managing 
 this at the page level that means we can't do this for block size  page 
 size, I think...

We could always disable delayed allocation in the PAGE_SIZE != blocksize
case.  It would be no worse than what we have now, and for IO performance
you need larger blocksize anyways.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: data=journal busted

2007-02-16 Thread Andreas Dilger
On Feb 16, 2007  15:42 -0800, Andrew Morton wrote:
 On Fri, 16 Feb 2007 16:31:09 -0700
 Andreas Dilger [EMAIL PROTECTED] wrote:
  We have a patch that we use for Lustre testing which allows you to set a
  block device readonly (silently discarding all writes), without the
  filesystem immediately keeling over dead like set_disk_ro.  The readonly
  state persists until the the last reference on the block device is dropped,
  so there are no races w.r.t. VFS cleanup of inodes and flushing buffers
  after the filesystem is unmounted.
 
 Not sure I understand all that.

Actually, the patch was originally based on your just-posted readonly code.
The problem with that code (unlike ours) is that it doesn't do the explicit
ro clearing in ext3_put_super(), but rather not until ALL inodes/buffers/etc
are cleared out for the block device by the VFS.  Otherwise, it is possible
to re-enable rw on the block device, and there still be dirty buffers that
are being flushed out to disk, guaranteeing filesystem inconsistency.

It allows multiple devices to be marked ro at the same time without problem.

It also works on all block devices instead of just IDE.

 For this application, we *want* to expose VFS races, errors in handling
 EIO, errors in handling lost writes, etc.  It's another form of for-developers
 fault injection, not a thing-for-production.

I understand that part, and I agree our patch doesn't include the random
part of the set-readonly code you have.  Both of the patches do have the
drawback that they don't return errors like set_disk_ro(), so that isn't
exercising the error-handling code in ext3/jbd as much as it could.

When we first worked on this in 2.4 the ext3/jbd code was far to fragile to
handle -EROFS under load without oopsing so we chose the silent failure
mode to allow testing w/o crashing all the time.  Maybe today it is better
to just call set_dev_ro() and fix the (hopefully few) -EROFS problems.

 The reason I prefer doing it from the timer interrupt is to toss more
 randomness in there, avoid the possibility of getting synchronised
 with application or kernel activity in some fashion.

Definitely this has its place too.  We chose the opposite route and have
fault-injection calls sprinkled throughout our code, so that we can create
complex regression tests that need specific series of events to be hit.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH 1/1] nanosecond timestamps

2007-02-26 Thread Andreas Dilger
On Feb 25, 2007  02:36 -0800, Andrew Morton wrote:
  On Fri, 02 Feb 2007 20:09:40 +0530 Kalpak Shah [EMAIL PROTECTED] wrote:
  +#define EXT3_INODE_GET_XTIME(xtime, extra_xtime, inode, raw_inode)
  \
  +do {
  \
  +   (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);
  \
  + \
  +   if (offsetof(typeof(*raw_inode), extra_xtime) -
  \
  +   offsetof(typeof(*raw_inode), i_extra_isize) +
  \
  +   sizeof((raw_inode)-extra_xtime) =
  \
  +   le16_to_cpu((raw_inode)-i_extra_isize)) {
  \
  +   if (sizeof((inode)-xtime.tv_sec)  4)\
  +   (inode)-xtime.tv_sec |=  \
  +   (__u64)(le32_to_cpu((raw_inode)-extra_xtime)\
  +   EXT3_EPOCH_MASK)  32;   \
  +   (inode)-xtime.tv_nsec =  \
  +   (le32_to_cpu((raw_inode)-extra_xtime)   \
  +   EXT3_NSEC_MASK)  2; \
  +   }
  \
  +} while (0)
 
 ow, my eyes.  Can we find a way to do this in C rather than in cpp?

The macro is working on the field names of the inode in most places
(e.g. i_mtime, i_ctime, etc) rather than the values.  You _could_ do it
in C, but it would mean moving all of the offsetof() into the caller
(basically putting the whole macro in-line at the caller) or doing math
on the pointer addresses at runtime instead of compile time.

It boils down to a check whether the particular nanosecond field is
inside the reserved space in the inode or not, so it ends up a comparison
against a constant.  For ctime:

if (4 = le32_to_cpu((raw_inode)-i_extra_isize) {

And the second if decides whether to save bits  32 in the seconds
field for 64-bit architectures, so it is also evaluated at compile time.

Better to have this in a macro than in the code itself.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] added sanity check for xattr validation

2007-02-26 Thread Andreas Dilger
Ted,
the attached patch adds an extra validity test in check_ext_attr().  If
an attribute's e_value_size is zero the current code does not allocate
a region for it and as a result the e_value_offs value is not verified.
However, if e_value_offs is very large then the later call to
ext2fs_ext_attr_hash_entry() can dereference bad memory and crash e2fsck.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Jim Garlick [EMAIL PROTECTED]

=== e_value_offs.patch 

Index: e2fsprogs-1.39/e2fsck/pass1.c
===
--- e2fsprogs-1.39.orig/e2fsck/pass1.c  2007-02-10 15:43:36.0 +0530
+++ e2fsprogs-1.39/e2fsck/pass1.c   2007-02-10 15:45:36.0 +0530
@@ -1346,6 +1347,11 @@
if (fix_problem(ctx, PR_1_EA_BAD_VALUE, pctx))
goto clear_extattr;
}
+   if (entry-e_value_offs + entry-e_value_size  fs-blocksize) {
+   if (fix_problem(ctx, PR_1_EA_BAD_VALUE, pctx))
+   goto clear_extattr;
+   break;
+   }
if (entry-e_value_size 
region_allocate(region, entry-e_value_offs,
EXT2_EXT_ATTR_SIZE(entry-e_value_size))) {
=== e_value_offs.patch 

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH 1/1] nanosecond timestamps

2007-02-26 Thread Andreas Dilger
On Feb 26, 2007  17:20 -0600, Dave Kleikamp wrote:
 I wanted to see if I could clean this up, and this is what I came up
 with.  I also did some macro magic to make these macros take one less
 argument.  I ended up breaking two macros up into three smaller macros
 and two inline functions.  Does this look any better?  (This is
 incremental against Kalpak's patch.)
 
 Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]

Looks like an improvement to me, if only to give a name and clarity to
the EXT3_FITS_IN_INODE() part of the function.

One other note - there was a problem with the original patch because
i_crtime is also in the extended part of the inode so it needs to
be validated against i_extra_isize so we need to submit a new patch
for that anyways and may as well include this cleanup in that patch.

Thanks Dave.

 diff -Nurp linux.orig/fs/ext3/inode.c linux/fs/ext3/inode.c
 --- linux.orig/fs/ext3/inode.c2007-02-26 17:10:22.0 -0600
 +++ linux/fs/ext3/inode.c 2007-02-26 17:02:28.0 -0600
 @@ -2674,10 +2674,10 @@ void ext3_read_inode(struct inode * inod
   inode-i_nlink = le16_to_cpu(raw_inode-i_links_count);
   inode-i_size = le32_to_cpu(raw_inode-i_size);
  
 - EXT3_INODE_GET_XTIME(i_ctime, i_ctime_extra, inode, raw_inode);
 - EXT3_INODE_GET_XTIME(i_mtime, i_mtime_extra, inode, raw_inode);
 - EXT3_INODE_GET_XTIME(i_atime, i_atime_extra, inode, raw_inode);
 - EXT3_INODE_GET_XTIME(i_crtime, i_crtime_extra, ei, raw_inode);
 + EXT3_INODE_GET_XTIME(i_ctime, inode, raw_inode);
 + EXT3_INODE_GET_XTIME(i_mtime, inode, raw_inode);
 + EXT3_INODE_GET_XTIME(i_atime, inode, raw_inode);
 + EXT3_INODE_GET_XTIME(i_crtime, ei, raw_inode);
  
   ei-i_state = 0;
   ei-i_dir_start_lookup = 0;
 @@ -2829,10 +2829,10 @@ static int ext3_do_update_inode(handle_t
   }
   raw_inode-i_links_count = cpu_to_le16(inode-i_nlink);
   raw_inode-i_size = cpu_to_le32(ei-i_disksize);
 - EXT3_INODE_SET_XTIME(i_ctime, i_ctime_extra, inode, raw_inode);
 - EXT3_INODE_SET_XTIME(i_mtime, i_mtime_extra, inode, raw_inode);
 - EXT3_INODE_SET_XTIME(i_atime, i_atime_extra, inode, raw_inode);
 - EXT3_INODE_SET_XTIME(i_crtime, i_crtime_extra, ei, raw_inode);
 + EXT3_INODE_SET_XTIME(i_ctime, inode, raw_inode);
 + EXT3_INODE_SET_XTIME(i_mtime, inode, raw_inode);
 + EXT3_INODE_SET_XTIME(i_atime, inode, raw_inode);
 + EXT3_INODE_SET_XTIME(i_crtime, ei, raw_inode);
  
   raw_inode-i_blocks = cpu_to_le32(inode-i_blocks);
   raw_inode-i_dtime = cpu_to_le32(ei-i_dtime);
 diff -Nurp linux.orig/include/linux/ext3_fs.h linux/include/linux/ext3_fs.h
 --- linux.orig/include/linux/ext3_fs.h2007-02-26 17:10:22.0 
 -0600
 +++ linux/include/linux/ext3_fs.h 2007-02-26 17:07:20.0 -0600
 @@ -331,37 +331,42 @@ struct ext3_inode {
  #define EXT3_EPOCH_MASK ((1  EXT3_EPOCH_BITS) - 1)
  #define EXT3_NSEC_MASK  (~0UL  EXT3_EPOCH_BITS)
  
 +#define EXT3_FITS_IN_INODE(ext3_inode, field)\
 + ((offsetof(typeof(*ext3_inode), field) -\
 +   offsetof(typeof(*ext3_inode), i_extra_isize) +\
 +   sizeof((ext3_inode)-field)) =   \
 +  le16_to_cpu((ext3_inode)-i_extra_isize))
 +
 +static inline __le32 ext3_encode_extra_time(struct timespec *time)
 +{
 + return cpu_to_le32((sizeof(time-tv_sec)  4 ? time-tv_sec  32 : 0) |
 +((time-tv_nsec  2)  EXT3_NSEC_MASK));

I think this should be:

return cpu_to_le32((sizeof(time-tv_sec)  4 ?
(time-tv_sec  32)  EXT3_EPOCH_MASK : 0) |
   (time-tv_nsec  EXT3_EPOCH_BITS));

We should use EXT3_EPOCH_BITS, and don't really need  EXT3_NSEC_MASK
unless there is some reason to believe that tv_nsec is  1e9.  Even then
it would be truncated at 32 bits by the (__u32) cast in cpu_to_le32().

 +static inline void ext3_decode_extra_time(struct timespec *time, __le32 
 extra) {
 + if (sizeof(time-tv_sec)  4)
 + time-tv_sec |= (__u64)(le32_to_cpu(extra)  EXT3_EPOCH_MASK)
 +  32;
 + time-tv_nsec = (le32_to_cpu(extra)  EXT3_NSEC_MASK)  2;

And this should be:

time-tv_nsec = (le32_to_cpu(extra)  EXT3_NSEC_MASK) EXT3_EPOCH_BITS;

We could again skip the  EXT3_NSEC_MASK because le32_to_cpu(extra) will
truncate the input at 32 bits, so:

time-tv_nsec = le32_to_cpu(extra)  EXT3_EPOCH_BITS;

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH take3 1/1] nanosecond timestamps

2007-02-27 Thread Andreas Dilger
On Feb 27, 2007  15:14 +0530, Kalpak Shah wrote:
 +#define EXT4_EPOCH_BITS 2
 +
 +static inline __le32 ext4_encode_extra_time(struct timespec *time)
 +{
 +   return cpu_to_le32((sizeof(time-tv_sec)  4 ?
 +time-tv_sec  32 : 0) |
 +((time-tv_nsec  2)  EXT4_NSEC_MASK));

This should be (time-tv_nsec  EXT4_EPOCH_BITS).  We don't strictly
need the EXT4_NSEC_MASK because cpu_to_le32() will truncate the field to
32 bits anyways.

 +static inline void ext4_decode_extra_time(struct timespec *time, __le32 
 extra) {
 +   if (sizeof(time-tv_sec)  4)
 +time-tv_sec |= (__u64)(le32_to_cpu(extra)  EXT4_EPOCH_MASK)
 + 32;
 +   time-tv_nsec = (le32_to_cpu(extra)  EXT4_NSEC_MASK)  2;

This should be  EXT4_EPOCH_BITS.  Similarly, le32_to_cpu() will truncate
extra to 32 bits and we shift away the epoch part of the field, so we don't
need the EXT4_NSEC_MASK at all.

 +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)
\
 +do {\
 + if (EXT4_FITS_IN_INODE(raw_inode, xtime))  \
 + (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);   \

We might consider doing this in the caller for the crtime field.  That is
only read/written in a single place, and it is the only one that needs the
extra check to determine if the seconds field is in the body of the inode.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e2fsck and human intervention

2007-03-06 Thread Andreas Dilger
On Mar 06, 2007  09:27 -0500, Daniel Drake wrote:
 OK. Given that write caching may be required for performance reasons or
 there might be other possible reasons which would result in
 preen-unrepairable fs corruption on power loss, my question is now: Is
 it a really bad idea to run e2fsck -y on every boot?

If your primary concern is not halting the boot, then yes.  99% of people
only know to answer y to e2fsck anyways.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: EXT3-fs warning (device sdd2): dx_probe: Unrecognised inode hash code 232

2007-03-09 Thread Andreas Dilger
On Mar 09, 2007  07:40 -0800, Andrew Morton wrote:
 EXT3-fs warning (device sdd2): dx_probe: Unrecognised inode hash code 232
 Assertion failure in dx_probe() at fs/ext3/namei.c:384: 
 dx_get_limit(entries) 
 == dx_root_limit(dir, root-info.info_length)

Looks like corrupted on-disk or in-memory htree metadata.  We should
probably handle this more gracefully, if indeed it is corrupted on disk.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] improving ext4 stat/unlink performance

2007-03-09 Thread Andreas Dilger
In recent discussions on #linuxfs with Coly Li (I hope I have the email
correct), we have ressurected the idea of improving the stat and unlink
performance of ext3 htree directories.  Coly seems interested in working
on fixing this problem, so I thought I'd start an email thread in the
list to get wider input, and also keep a record of the discussions.


This problem was exposed widely in an old thread started by Andrew Morton:
http://marc.theaimsgroup.com/?l=linux-ext4m=105711174128574w=4
and can be summarized as htree directories have random readdir
order compared to inode number (and vice versa).

And my email in that thread explains the solution relatively cleanly:
http://marc.theaimsgroup.com/?l=linux-ext4m=105713438820870w=4
and can be summarized as allocate new inodes in a manner that takes
the name hash into account.


Apparently, this has become a hot topic for the kernel.org systems
because Git is fond of creating huge directories.


It should be possible to make a patch that improves the performance
of readdir+stat or readdir+unlink (for directories created under this
scheme) but does not affect the on-disk format.

Ted has also proposed the LD_PRELOAD library for doing sorting of the
readdir output in userspace before returning it to the application.
This can be helpful, but until this approach is widely available (e.g.
built into glibc or something) it will have limited benefit to users.
This is complementary with any approach taken here.



The basic goal, as discussed in my above email is to map the filename
to a goal inode number in (growing) range of inodes/blocks in the itable.
The resulting inode order is piece-wise in approximately hash order.  The
goal selection part is relatively straight forward, especially for new
directories, where we could keep a count of the number of inodes in
a directory and scale the itable window size as a function of the number
of inodes.  We would likely reserve the itable blocks for a given directory
in the same manner (and possibly even the same code) that ext3 reservation
works today.

For existing directories, the number of inodes in the directory (and hence
the itable window size) can also be approximated by the current size of
the directory divided by an average filename length.  Whether the directory
is full or not is less important, if we assume that past usage of the
directory is an indication of future usage.  It is in fact better to know
the directory size in advance because we get fewer pieces in the hash to
itable mapping.


What remains open for discussion (hopefully someone will have a good answer)
with existing directories is how to find the previous itable allocation
window(s) across a remount.  For directories with more than tens of
thousands of inodes in them the inodes by necessity must span multiple
block groups, and there is no guarantee that the itable ranges are
contiguous (no point in reserving a range of itable blocks that are full).

One possibility is to scan the directory leaf block that will ultimately
hold the new inode and try to allocate in the same itable block(s) that
the other entries in this leaf are using (preferring the block that is
used closest hashes in that leaf).  Drawbacks of this are that renamed
entries will have non-local inode numbers (which is a drawback of the
whole scheme, actually) and we don't store hashes with each entry.

We could also persistently store a hint for the itable allocation window
in the directory somewhere (e.g. the htree header by increasing the 
info_length).



Another possibility, first suggested by Ted in the same thread above,
and something I've been thinking about since the meeting in Portland
last year, is to change the directory format so that the directory leaf
blocks are actually the inode table itself.  It is complex enough that
it should go into a separate email, and is only worth considering if
we are going to have another incompatible change to the on-disk format
(though we could bundle a number of inode- and directory- specific
changes together).


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qla2xxx BUG: workqueue leaked lock or atomic

2007-03-13 Thread Andreas Dilger
On Mar 12, 2007  16:22 +0100, Valerie Clement wrote:
 Mingming Cao wrote:
 IBM has done some testing (dbench, fsstress, fsx, tiobench, iozone etc)
 on 10TB ext3, I think RedHat and BULL have done similar test on 8TB
 ext3 too.
 
 Is there not a problem of backward-compatibility with old kernels?
 Doesn't we need to handle a new INCOMPAT flag in e2fsprogs and kernel
 before allowing ext3 filesystems greater than 8T?

No, it really depends on the kernel.  There were some bugs that caused
problems with  8TB because of signed 32-bit int problems, so it isn't
really recommended to use  8TB unless you know this is fixed in your
kernel (and any older kernel you might have to downgrade to).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e2fsprogs coverity patch cid-1.diff

2007-03-18 Thread Andreas Dilger
On Mar 18, 2007  09:40 -0400, Theodore Tso wrote:
 On Fri, Feb 09, 2007 at 06:08:16PM -0800, Brian D. Behlendorf wrote:
  Coverity ID: 2: Checked Return
 
 Would I be right in assuming this should be Coverity ID 1, and not 2?
 (cid-2.diff is also labelled as Coverity ID 1).
 
 What is Coverity ID's, anyway?  I assume they are something you
 locally assigned?

Yes, this is just the error number within a given software project
at a particular site.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: cleaned up ext4 patch series

2007-03-19 Thread Andreas Dilger
On Mar 19, 2007  09:15 -0800, Mingming Cao wrote:
 I wonder if we should create two branches: one branch for patches that
 are well discussed and tested, which Andrew could trust and pull to mm
 tree; and create another branch to store patches that are still under
 discussion and likely to be rewriten based on the review feedback.

Yes, that makes a lot of sense.  It would avoid issues like Andrew finding
problems with the nanosecond patches because they hadn't been widely tested
yet.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] 5/19 e2fsprogs f_extents_ee_len test

2007-03-19 Thread Andreas Dilger
Regression test for a bad ee_len field in an extent file.
Binary patch file.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_extents_ee_len.patch
Description: Binary data


[PATCH] 3/19 e2fsprogs f_extents_bad_blk test

2007-03-19 Thread Andreas Dilger
Regression test for a bad block in an extent file.
Binary patch file.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_extents_bad_blk.patch
Description: Binary data


[PATCH] 4/19 e2fsprogs f_extents_ee_block test

2007-03-19 Thread Andreas Dilger
Regression test for a bad ee_block field in an extent file.
Binary patch file.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_extents_ee_block.patch
Description: Binary data


[PATCH] 6/19 e2fsprogs f_extents_ee_start test

2007-03-19 Thread Andreas Dilger
Regression test for a bad ee_start field in an extent file.
Binary patch file.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_extents_ee_start.patch
Description: Binary data


[PATCH] 7/19 e2fsprogs f_extents_eh_depth test

2007-03-19 Thread Andreas Dilger
Regression test for a bad eh_depth field in an extent file.
Binary patch file.

This test still fails with the current extents patch.
It takes 2 runs of e2fsck to correct the problem.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_extents_eh_depth.patch
Description: Binary data


[PATCH] 8/19 e2fsprogs f_extents_eh_entries test

2007-03-19 Thread Andreas Dilger
Regression test for a bad eh_entries field in an extent file.
Binary patch file.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_extents_eh_entries.patch
Description: Binary data


[PATCH] 12/19 e2fsprogs f_extents_ei_leaf test

2007-03-19 Thread Andreas Dilger
Regression test for a bad ei_leaf field in an extent file.
Binary patch file.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_extents_ei_leaf.patch
Description: Binary data


[PATCH] 14/19 e2fsprogs f_extents_inrlevel-incons test

2007-03-19 Thread Andreas Dilger
Regression test for an inter-level index depth inconsistency in an extent file.
Binary patch file.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_extents_inrlevel-incons.patch
Description: Binary data


[PATCH] 15/19 e2fsprogs f_extents_orphan_blks test

2007-03-19 Thread Andreas Dilger
Regression test for orphaned blocks in an extent filesystem.
Binary patch file.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_extents_orphan_blks.patch
Description: Binary data


Re: [PATCH] 18/19 e2fsprogs f_extents_shrd_blk test

2007-03-19 Thread Andreas Dilger
Regression test for unsorted extents within a file.
Binary patch file.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_extents_unsorted.patch
Description: Binary data


[PATCH] 18/19 e2fsprogs f_extents_shrd_blk test

2007-03-19 Thread Andreas Dilger
Regression test for two extent files sharing the same blocks.
Binary patch file.

This test is known not to work currently.  The problem is that there is not
enough space in the index to split the extent and clone all of the blocks.
It is hoped that Ted's change of e2fsprogs to be extent based instead of
block iterator based will allow this problem to be fixed.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_extents_shrd_blk.patch
Description: Binary data


Re: 2.6.20 ext3 rename() returns success but doesn't unlink the source

2007-03-23 Thread Andreas Dilger
On Mar 24, 2007  01:18 +0200, Timo Sirainen wrote:
 On Fri, 2007-03-23 at 23:24 +0100, Guillaume Chazarain wrote:
  Timo Sirainen a écrit :
  
 rename(/mnt/Maildir/.Trash/new/1174635781.P25986Q0M341350.hurina,
/mnt/Maildir/.Trash/cur/1174635781.P25986Q0M341350.hurina:2,);
  
  $ ls -li /mnt/Maildir/.Trash/new/1174635781.P25986Q0M341350.hurina 
  /mnt/Maildir/.Trash/cur/1174635781.P25986Q0M341350.hurina:2,
  122481 -rw--- 2 1000 1000 3091 Mar 23 08:43 
  /mnt/Maildir/.Trash/cur/1174635781.P25986Q0M341350.hurina:2,
  122481 -rw--- 2 1000 1000 3091 Mar 23 08:43 
  /mnt/Maildir/.Trash/new/1174635781.P25986Q0M341350.hurina
  
  All your files in the list are hard linked.
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=fs/namei.c;h=ee60cc4d3453209723d6f70982e7083d7cb39477;hb=HEAD#l2447
  seems to handle your case as you see it:
  if (old_dentry-d_inode == new_dentry-d_inode)
  return 0;
 
 Hmm. Oh. So this is this intentional? First I ever heard of it, and also
 goes against my logic. I'll go complain to man pages people then. This
 will make my error handling difficult..

Yes, it is a very counter-intuitive requirement in the POSIX spec.  The
GNU user tools do what is expected (unlink the source file).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


group descriptor contents and LAZY_BG

2007-03-24 Thread Andreas Dilger
Ted,
the LAZY_BG feature was originally written to test large filesystems and
avoid the need to format (or even allocate, if sparse) large parts of the
inode table and bitmaps.  To make the feature COMPAT the bg_free_blocks_count
and bg_free_inodes_count are initialized to zero, and the kernel will skip
these groups entirely without checking the bitmap.

For the GDT_CSUM feature, we are using the LAZY_BG feature to not initialize
the bitmaps and inode table.  However, setting the bg_free_*_count to zero
is troublesome because it means we always need to check the GDT_CSUM feature
flag to know whether a group actually has free blocks/inodes, and e2fsck
is also getting confused about the number of free blocks/inodes in the groups
and filesystem (we have to have a double accounting for real free blocks
and uninitialized free blocks.

What do you think for the GDT_CSUM feature that we initialize the group
descriptors as if the group had actually been formatted?  I think our
use of the LAZY_BG feature is actually misguided - while they share some
components (e.g. UNINIT flags, mke2fs code, some e2fsck code), since the
GDT_CSUM feature is RO_COMPAT there isn't much reason to even enable
LAZY_BG at format time...

We have to replace all uses of bg_free_*_count with the below macros:

+/* Macro-instructions used to calculate Free inodes and blocks count. */
+#define EXT3_BG_INODES_FREE(sb,gr,gdp) ((EXT3_HAS_RO_COMPAT_FEATURE(sb,   \
+ EXT4_FEATURE_RO_COMPAT_GDT_CSUM)   \
+ (gdp)-bg_flags \
+ cpu_to_le16(EXT3_BG_INODE_UNINIT)) ? \
+  EXT3_INODES_PER_GROUP(sb) : \
+  le16_to_cpu((gdp)-bg_itable_unused) +  \
+  le16_to_cpu((gdp)-bg_free_inodes_count))
+#define EXT3_BG_BLOCKS_FREE(sb,gr,gdp) ((EXT3_HAS_RO_COMPAT_FEATURE(sb,   \
+ EXT4_FEATURE_RO_COMPAT_GDT_CSUM)   \
+ (gdp)-bg_flags \
+ cpu_to_le16(EXT3_BG_BLOCK_UNINIT)) ? \
+  ext3_free_blocks_after_init(sb,gr,gdp) :\
+  le16_to_cpu((gdp)-bg_free_blocks_count))

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext4 benchmarks

2007-03-30 Thread Andreas Dilger
On Mar 30, 2007  10:43 +0200, Johann Lombardi wrote:
 On Wed, Mar 28, 2007 at 09:20:50AM -0500, Jose R. Santos wrote:
  While it may be to late for the purposes of your OLS paper, one thing 
  that doesn't seem to be getting much attention is the performance of a 
  file system while doing many meta-data operations or throughput testing 
  during heavy journal log activity.
 
 Back in 2005, Alex sent a patch to monitor journal activity through procfs:
 http://marc.info/?l=linux-ext4m=113538565128617w=2
 
 We have been using this patch for more than 1 year and it is very useful
 for investigating performance issues.
 Does anybody know why this patch hasn't found its way into the mainline?
 Maybe it could be merged in jbd2?

Yes, we use this functionality failry often, so having it in jbd2 would
be quite useful.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Correction to check_filetype()

2007-03-31 Thread Andreas Dilger
On Mar 30, 2007  20:44 -0400, Theodore Tso wrote:
 On Wed, Feb 21, 2007 at 02:45:59PM +0530, Kalpak Shah wrote:
  If the mode of a directory gets corrupted, check_filetype() makes
  wrong decisions for all its sub-directories. For example, using
  debugfs we can corrupt the mode of a directory to 0140755 (i.e. a
  socket). e2fsck will set the filetype of all its subdirectories as 6
  (filetype for socket). All the subdirectories would be moved to
  lost+found, and in second run of e2fsck their filetype would be set
  back to 2.
 
 Um, I'm not seeing this.  Using stock e2fsprogs, given the following
 test image, I'm not seeing the behavior you describe.  

The log of the e2fsck (on a test filesystem with deliberately introduced
corruption) is available at:

https://bugzilla.lustre.org/show_bug.cgi?id=11645

I've also added a testcase (created by making a parent directory with several
subdirs, then using debugfs to change the mode of the parent directory).

https://bugzilla.lustre.org/attachment.cgi?id=9958
(also attached here).

It doesn't exhibit the filetype breakage in the upstream e2fsck because the
test is run with the patch applied, but it does still show the size is wrong
on second e2fsck problem you observed.  The test case is created as if that
problem was also fixed already (i.e. second e2fsck is clean).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



e2fsprogs-tests-f_check_filetype.patch
Description: Binary data


Re: [PATCH] Add i_version_hi for 64-bit version

2007-04-02 Thread Andreas Dilger
On Apr 02, 2007  14:47 -0700, Mingming Cao wrote:
 On Sun, 2007-03-18 at 15:43 +0530, Kalpak Shah wrote:
  This patch adds a 32-bit i_version_hi field to ext4_inode, which can be 
  used for 64-bit inode versions. This field will store the higher 32 bits of 
  the version, while Jean Noel's patch has added support to store the lower 
  32-bits in osd1.linux1.l_i_version.
  
  Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
  Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
  
  Index: linux-2.6.20/include/linux/ext4_fs.h
  ===
  --- linux-2.6.20.orig/include/linux/ext4_fs.h
  +++ linux-2.6.20/include/linux/ext4_fs.h
  @@ -336,6 +336,7 @@ struct ext4_inode {
  __le32  i_atime_extra;  /* extra Access time  (nsec  2 | 
  epoch) */
  __le32  i_crtime;   /* File Creation time */
  __le32  i_crtime_extra; /* extra File Creation time (nsec  2 | 
  epoch) */
  +   __u32   i_version_hi;   /* high 32 bits for 64-bit version */
   };
  
 
 Any reason not using __le32 for the i_version_hi?

Not really.  I've also asked Kalpak to run sparse with Shaggy's flags
against the other patches to fix up any similar issues.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Correction to check_filetype()

2007-04-03 Thread Andreas Dilger
On Mar 31, 2007  10:39 -0400, Theodore Tso wrote:
 I'm going to let this one soak for a bit to make sure we don't pick up
 any fase positives or negatives in the hueristics.
 
 @@ -133,11 +133,10 @@ int e2fsck_pass1_check_device_inode(ext2
 +  * If the index flag is set, then this is a bogus
 +  * device/fifo/socket
*/
 - if ((ext2fs_inode_data_blocks(fs, inode) != 0) ||
 - (inode-i_flags  EXT2_INDEX_FL))
 + if (inode-i_flags  EXT2_INDEX_FL)
   return 0;

There were ancient versions of the kernel that left EXT2_INDEX_FL set
on all files, instead of just directories...  I'm not sure if those
were in actual released kernels, or just in patches.

 +static void check_is_really_dir(e2fsck_t ctx, struct problem_context *pctx,
 + char *buf)
 +{
 + if (ext2fs_read_dir_block(ctx-fs, inode-i_block[0], buf))
 + return;

Do we call check_blocks() on this inode shortly thereafter?  If we do then
the overhead of reading the first block is offset by not reading it again
later.  Otherwise, this could slow things down.

 + dirent = (struct ext2_dir_entry *) buf;
 + if (((dirent-name_len  0xFF) != 1) ||
 + (dirent-name[0] != '.') ||
 + (dirent-inode != pctx-ino) ||
 + (dirent-rec_len  12) ||
 + (dirent-rec_len % 4) ||
 + (dirent-rec_len = ctx-fs-blocksize - 12))
 + return;
 +
 + dirent = (struct ext2_dir_entry *) (buf + dirent-rec_len);
 + if (((dirent-name_len  0xFF) != 2) ||
 + (dirent-name[0] != '.') ||
 + (dirent-name[1] != '.') ||
 + (dirent-rec_len  12) ||
 + (dirent-rec_len % 4))
 + return;
 +
 + if (fix_problem(ctx, PR_1_TREAT_AS_DIRECTORY, pctx)) {
 + inode-i_mode = (inode-i_mode  0) | LINUX_S_IFDIR;
 + e2fsck_write_inode_full(ctx, pctx-ino, inode, 
 + EXT2_INODE_SIZE(ctx-fs-super), 
 + check_is_really_dir);
   }

The one worry I have here (though I don't think it is necessarily IN
the code you propose) is that someone could create a regular file which
looks like a directory and somehow get it linked into the filesystem
tree, giving them escalated access (e.g. device files owned by them,
suid executables, links to otherwise unreadable files, etc).

It would seem that this is only a danger if the mode on the file is
corrupted, which shouldn't really be doable by a regular user, but
it is definitely something to consider.

I take it that this code fixes the test image I previously posted?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][take 2] e2fsprogs: Add ext4migrate

2007-04-03 Thread Andreas Dilger
On Apr 03, 2007  15:37 +0530, Aneesh Kumar K.V wrote:
 The extent insert code is derived out of the latest ext4 kernel
 source. I have tried to keep the code as close as possible to the
 kernel sources. This makes sure that any fixes for the tree building
 code in kernel should be easily applied to ext4migrate.  The ext3_ext
 naming convention instead of ext4_ext found in kernel is to make sure
 we are in sync with rest of e2fsprogs source.

Of course, the other way to do this would be to temporarily mount the
filesystem as ext4, copy non-extent files via cp (can use lsattr to
check for extent flag) and then rename new file over old one.  Care
must be taken to not mount filesystem on visible mountpoint, so that
users cannot be changing the filesystem while copy is being done.

This can be done to convert an ext4 filesystem back to ext3 also, if
the ext4 filesystem is mounted with noextents (to disable creation
of new files with extent mapping).

The only minor issue is that the inode numbers of the files will change.

 The inode modification is done only at the last stage. This is to make
 sure that if we fail at any intermediate stage, we exit without touching
 the disk.
 
 The inode update is done as below
 a) Walk the extent index blocks and write them to the disk. If failed exit
 b) Write the inode. if failed exit.
 c) Write the updated block bitmap. if failed exit ( This could be a problem
because we have already updated the inode i_block filed to point to new
blocks.). But such inconsistancy between inode i_block and block bitmap
can be fixed by fsck IIUC.

Why not mark all the relevant blocks in use (for both exent- and block-mapped
copies) until the copy is done, then write everything out, and only mark the
block-mapped file blocks free after the inode is written to disk?  This avoids
the danger that the new extent-mapped file's blocks are marked free and get
double-allocated (corrupting the file data, possibly the whole filesystem).

I don't think there is a guarantee that an impatient user will run a lengthy
e2fsck after interrupting the migrate.  Also, you should mark the filesystem
unclean at first change unless everything completes successfully.  That way
e2fsck will at least run automatically on the next boot.


Other general notes:
- wrap lines at 80 columns
- would be good to have a -R mode that walked the whole filesystem,
  since startup time is very long for large filesystems
- also allow specifying multiple files on the command-line
- changing the operation to be multi-file allows avoiding sync of bitmaps
  two times (once after extents are allocated and inode written, once after
  indirect blocks are freed).  There only needs to be one sync per file.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: tune2fs -l stale info

2007-04-03 Thread Andreas Dilger
On Apr 03, 2007  15:55 +0200, Vincent Caron wrote:
   BTW, if ondisk superblocks are not updated until specific events occur
 (umount, statfs), what is the consequence of a system crash ? Does
 journalization come into play (superblock=metadata?), does fsck fixes
 figures from other ondisk structures ? Just being curious...

The free blocks/inodes count in the superblock aren't really used for
anything these days.  At mount time the kernel sums up the free inode
and block counts from the group descriptors to put into the percpu
counters, and only the group descriptors are kept uptodate.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext4 benchmarks

2007-04-04 Thread Andreas Dilger
On Apr 04, 2007  19:06 +0200, Cordenner jean noel wrote:
 here is the first results of the round:
 http://www.bullopensource.org/ext4/20070404/

Jean Noel,
thank you for the test results.  It is always nice to see that ext4 is
doing so well compared to ext3 and XFS.

Ming Ming,
it should be possible to just include the mballoc+delalloc patches that
Jean Noel used into the upstream ext4 patch series.  When Alex or Christoph
get a chance to do the VFS delalloc rewrite we can move to that new patch,
but until then it seems pointless to not include this functionality which
improves the performance so much.

Also, if we include those patches the mballoc and delalloc features (along
with extents) should be enabled by default if INCOMPAT_EXTENTS is in the
superblock unless:
- noextents, nomballoc, or nodelalloc mount options are given
- delalloc needs to be disabled if blocksize != PAGE_SIZE

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Add a norecovery option to ext3/4?

2007-04-09 Thread Andreas Dilger
On Apr 08, 2007  22:24 -0500, Eric Sandeen wrote:
 Samuel Thibault wrote:
 Distribution installers usually try to probe OSes for building a suited
 grub menu.  Unfortunately, mounting an ext3 partition, even in read-only
 mode, does perform some operations on the filesystem (log recovery).
 This is not a good idea since it may silently garbage data.  
 
 Can you elaborate?  Under what circumstances is log replay going to harm 
 data?  Do you mean that the installer mounts partitions, looking for 
 what OS is installed?  How is that harmful?

If that disk was actually in use on another system but just exported
via a SAN to this node you've potentially corrupted the filesystem.

It's a bad idea to just go ahead and mount filesystems that you aren't
told to mount.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3, BKL, journal replay, multiple non-bind mounts of same device

2007-04-10 Thread Andreas Dilger
On Apr 10, 2007  20:49 -0400, John Anthony Kazos Jr. wrote:
 Since it is possible for the same block device to be mounted multiple 
 times concurrently by the same filesystem, and since ext3 explicitly 
 disables the BKL during its fill_super operation which would prevent this, 
 what is the result of mounting it multiple times this way? Especially if 
 the filesystem is dirty and a journal is replayed. (In any case, what 
 operation is being performed by ext3/ext4 that requires the BKL to be 
 dropped? What's the need to even consider the BKL during fill_super?)
 
 And in general, how does a filesystem deal with being mounted multiple 
 times in this way? In my testing and exploration so far, everything seems 
 to generally work, but I haven't tried deliberately using different 
 instances of the mount concurrently. Do we end up with locks not being 
 held properly on the superblock because the super_block structure 
 instances don't know about each other? Has dealing with this behavior of 
 bd_claim really been considered before, and if so, what's the general 
 scheme for handling it?

It is a myth (that actually frightened me quite a bit when I first did it)
that the filesystem is mounted twice in this case.  The truth of the matter
is if you mount -t ext3 /dev/ /mnt/1 and ... /mnt/2 you actually
get the equivalent of a bind mount for this block device on the two mount
points.  You can see this easily because e.g. you don't get two kjournald
threads for the two mounts, and it doesn't completely blow up.

If, on the other hand, you tried one mount with ext3 and another with ext4
it will fail the second with -EBUSY.

As for the BKL changes, your best bet is to go back through GIT and/or BK
or search the mailing lists to see when and why that was added.  It appears
to have  been 2.6.11, but I don't know why.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e2fsprogs - pass1c terminates early if hard links

2007-04-11 Thread Andreas Dilger
On Apr 11, 2007  06:42 -0400, Theodore Tso wrote:
 We don't currently have a way to modify indirect blocks
 directly (although I may create that soon or patches would be
 greatfully accepted :-), and of course there is a similar issue with
 extents and extended attributes blocks.

Note that as part of the i_extra_isize e2fsck support we are adding
functions to manage EAs to libext2fs that could be used as the basis
for this.

 At the moment I'm assuming that e2fsprogs block allocation algorithms
 (which are not as sophisticated as what's in the kernel) aren't going
 to be changing.

I thought as much myself.

 If they change, you're right, they could break the test.
 At the minimum I should document where the numbers are coming
 from in the test.

At least the test would break and any change that caused it could be
seen immediately.

 I could imagine a debugfs extension where we could do things like:
 
 set_var $shared_blk /dir4/quux block[0]
 set_inode_field /dir/foo block[0] $shared_blk
 set_inode_field /dir2/bar block[0] $shared_blk
 set_inode_field /dir3/baz block[0] $shared_blk
 
 Of course the temptation then would be to start adding conditions,
 functions, maybe an entire tcl interpreter   :-)

Yes, I thought of this too, and also the let's build a full interpreted
language into debugfs conclusion.  It wouldn't be fatal though - in
some cases it would be nice to have debugfs loop over inodes to fix some
unusual corruption.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: debugfs fill_bgs command?

2007-04-11 Thread Andreas Dilger
On Apr 11, 2007  15:17 -0500, Eric Sandeen wrote:
 I was wondering if others think it would be useful to have a fill_bgs
 command in debugfs; this would (minimally) mark the lower X bgs as full
 for both inodes  blocks (possibly either/or), to allow testing the
 higher block groups.  I'm using a hacked up version of this to do a
 little ext3 testing above 8T.  Given that this would really only be a
 testing option in nature, would it be accepted into e2fsprogs?  I'd
 probably need some sort of unfill command as well, to put the bg
 counters back where they should be; probably by actually reading the
 bitmaps.  This way fsck would still find a consistent filesystem...
 
 I had originally written a set_bg_field function too, to go with inode 
 sb variants, though for marking the first few thousand bg's it was going
 to get a bit tedious... :)

Try mke2fs -O lazy_bg XXX and be happy.  Can be used on any kernel as
it is a COMPAT feature and marks all but first and last groups as full.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-12 Thread Andreas Dilger
 to allow this interface to be mapped
onto XFS_IOC_BMAP internally (or vice versa).  Even for block-mapped
filesystems, they can at least improve over the -bmap() case by skipping
holes in files that cover [dt]indirect blocks (saving thousands of calls).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-12 Thread Andreas Dilger
On Apr 12, 2007  12:22 +0100, Anton Altaparmakov wrote:
 On 12 Apr 2007, at 12:05, Andreas Dilger wrote:
 I'm interested in getting input for implementing an ioctl to  
 efficiently map file extents  holes (FIEMAP) instead of looping
 over FIBMAP a billion times.  We already have customers with single
 files in the 10TB range and we additionally need to get the mapping
 over the network so it needs to be efficient in terms of how data
 is passed, and how easily it can be extracted from the filesystem.
 
 struct fibmap_extent {
  __u64 fe_start; /* starting offset in bytes */
  __u64 fe_len;   /* length in bytes */
 }
 
 struct fibmap {
  struct fibmap_extent fm_start;  /* offset, length of desired mapping */
  __u32 fm_extent_count;  /* number of extents in array */
  __u32 fm_flags; /* flags for input request */
  XFS_IOC_GETBMAP) */
  __u64 unused;
  struct fibmap_extent fm_extents[0];
 }
 
 #define FIEMAP_LEN_MASK  0xff
 #define FIEMAP_LEN_HOLE  0x01
 #define FIEMAP_LEN_UNWRITTEN 0x02
 
 Sound good but I would add:
 
 #define FIEMAP_LEN_NO_DIRECT_ACCESS
 
 This would say that the offset on disk can move at any time or that  
 the data is compressed or encrypted on disk thus the data is not  
 useful for direct disk access.

This makes sense.  Even for Reiserfs the same is true with packed tails,
and I believe if FIBMAP is called on a tail it will migrate the tail into
a block because this is might be a sign that the file is a kernel that
LILO wants to boot.

I'd rather not have any such feature in FIEMAP, and just return the
on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me.
My main reason for FIEMAP is being able to investigate allocation patterns
of files.

By no means is my flag list exhaustive, just the ones that I thought would
be needed to implement this for ext4 and Lustre.

 Also why are you not using 0xff00, i.e. two more zeroes  
 at the end?  Seems unnecessary to drop an extra 8 bits of  
 significance from the byte size...

It was actually just a typo (this was the first time I'd written the
structs and flags down, it is just at the discussion stage).  I'd meant
for it to be 2^56 bytes for the file size as I wrote later in the email.
That said, I think that 2^48 bytes is probably sufficient for most uses,
so that we get 16 bits for flags.  As it is this email already discusses
5 flags, and that would give little room for expansion in the future.

Remember, this is the mapping for a single file (which can't practially
be beyond 2^64 bytes as yet) so it wouldn't be hard for the filesystem to
return a few separate extents which are actually contiguous (assuming that
there will actually be files in filesystems with  2^48 bytes of contiguous
space).  Since the API is that it will return the extent that contains the
requested start byte, the kernel will be able to detect this case also,
since it won't be able to specify a length for the extent that contains the
start byte.

At most we'd have to call the ioctl() 65536 times for a completely
contiguous 2^64 byte file if the buffer was only large enough for a
single extent.  In reality, I expect any file to have some discontinuities
and the buffer to be large enough for a thousand or more entries so the
corner case is not very bad.

 Finally please make sure that the file system can return in one way  
 or another errors for example when it fails to determine the extents  
 because the system ran out of memory, there was an i/o error,  
 whatever...  It may even be useful to be able to say here is an  
 extent of size X bytes but we do not know where it is on disk because  
 there was an error determining this particular extent's on-disk  
 location for some reason or other...

Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and
FIEMAP_LEN_ERROR.  Consider FIEMAP on a file that was migrated
to tape and currently has no blocks allocated in the filesystem.  We
want to return some indication that there is actual file data and not
just a hole, but at the same time we don't want this to actually return
the file from tape just to generate block mappings for it.

This concept is also present in XFS_IOC_GETBMAPX - BMV_IF_NO_DMAPI_READ,
but this needs to be specified on input to prevent the file being mapped
and I'd rather the opposite (not getting file from tape) be the default,
by principle of least surprise.


 block-aligned/sized allocations (e.g. tail packing).  The  
 fm_extents array
 returned contains the packed list of allocation extents for the file,
 including entries for holes (which have fe_start == 0, and a flag).
 
 Why the fe_start == 0?  Surely just the flag is sufficient...  On  
 NTFS it is perfectly valid to have fe_start == 0 and to have that not  
 be sparse (normally the $Boot system file is stored in the first 8  
 sectors of the volume)...

I thought

Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4

2007-04-15 Thread Andreas Dilger
Just a quick note before I forget.  I thought there was a call in ext4
to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has
more than 2^32 blocks?  I don't see that anywhere in the 2.6.20 ext4.
Is that in the upstream git repo?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Copy i_flags to ext3 inode flags on write

2007-04-16 Thread Andreas Dilger
On Apr 16, 2007  19:05 +0200, Jan Kara wrote:
   attached is a patch that stores inode flags such as S_IMMUTABLE,
 S_APPEND, etc. from i_flags to EXT3_I(inode)-i_flags when inode is
 written to disk. The same thing is done on GETFLAGS ioctl.
   Quota code changes these flags on quota files (to make it harder for
 sysadmin to screw himself) and these changes were not correctly
 propagated into the filesystem (especially, lsattr did not show them and
 users were wondering...).
  
 +/* Propagate flags from i_flags to EXT3_I(inode)-i_flags */
 +void ext3_get_inode_flags(struct inode *inode)
 +{
 + unsigned int flags = inode-i_flags;
 +
 + EXT3_I(inode)-i_flags = ~(EXT3_SYNC_FL|EXT3_APPEND_FL|
 + EXT3_IMMUTABLE_FL|EXT3_NOATIME_FL|EXT3_DIRSYNC_FL);
 + if (flags  S_SYNC)
 + EXT3_I(inode)-i_flags |= EXT3_SYNC_FL;
 + if (flags  S_APPEND)
 + EXT3_I(inode)-i_flags |= EXT3_APPEND_FL;
 + if (flags  S_IMMUTABLE)
 + EXT3_I(inode)-i_flags |= EXT3_IMMUTABLE_FL;
 + if (flags  S_NOATIME)
 + EXT3_I(inode)-i_flags |= EXT3_NOATIME_FL;
 + if (flags  S_DIRSYNC)
 + EXT3_I(inode)-i_flags |= EXT3_DIRSYNC_FL;
 +}

It probably makes sense to pass struct ext3_inode_info *ei to this
function, which is available at both callsites and avoids some pointer
math for each access.

  void ext3_read_inode(struct inode * inode)
  {
   struct ext3_iloc iloc;
 @@ -2736,6 +2755,7 @@ static int ext3_do_update_inode(handle_t
   if (ei-i_state  EXT3_STATE_NEW)
   memset(raw_inode, 0, EXT3_SB(inode-i_sb)-s_inode_size);
  
 + ext3_get_inode_flags(inode);
   raw_inode-i_mode = cpu_to_le16(inode-i_mode);
   if(!(test_opt(inode-i_sb, NO_UID32))) {
   raw_inode-i_uid_low = cpu_to_le16(low_16_bits(inode-i_uid));
 diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/ioctl.c 
 linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c
 --- linux-2.6.21-rc6/fs/ext3/ioctl.c  2007-02-07 12:03:23.0 +0100
 +++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c  2007-04-12 
 18:22:54.0 +0200
 @@ -28,6 +28,7 @@ int ext3_ioctl (struct inode * inode, st
  
   switch (cmd) {
   case EXT3_IOC_GETFLAGS:
 + ext3_get_inode_flags(inode);
   flags = ei-i_flags  EXT3_FL_USER_VISIBLE;
   return put_user(flags, (int __user *) arg);
   case EXT3_IOC_SETFLAGS: {

Looks fine otherwise - there is already ext3_set_inode_flags() which sets the
VFS inode flags properly in ext3_read_inode().

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Andreas Dilger
On Apr 16, 2007  18:01 +1000, Timothy Shimmin wrote:
 --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger [EMAIL PROTECTED] 
 wrote:
 struct fiemap_extent {
  __u64 fe_start; /* starting offset in bytes */
  __u64 fe_len;   /* length in bytes */
 }
 
 struct fiemap {
  struct fiemap_extent fm_start;  /* offset, length of desired mapping 
  */
  __u32 fm_extent_count;  /* number of extents in array */
  __u32 fm_flags; /* flags (similar to 
  XFS_IOC_GETBMAP) */
  __u64 unused;
  struct fiemap_extent fm_extents[0];
 }
 
 # define FIEMAP_LEN_MASK 0xff
 # define FIEMAP_LEN_HOLE 0x01
 # define FIEMAP_LEN_UNWRITTEN0x02
 
 All offsets are in bytes to allow cases where filesystems are not going
 block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
 returned contains the packed list of allocation extents for the file,
 including entries for holes (which have fe_start == 0, and a flag).
 
 The -fm_extents[] array includes all of the holes in addition to
 allocated extents because this avoids the need to return both the logical
 and physical address for every extent and does not make processing any
 harder.
 
 Well, that's what stood out for me. I was wondering where the fe_block 
 field had gone - the physical address.
 So is your fe_start; /* starting offset */ actually the disk location
 (not a logical file offset)
 _except_ in the header (fiemap) where it is the desired logical offset.

Correct.  The fm_extent in the request contains the logical start offset
and length in bytes of the requested fiemap region.  In the returned header
it represents the logical start offset of the extent that contained the
requested start offset, and the logical length of all the returned extents.
I haven't decided whether the returned length should be until EOF, or have
the virtual hole at the end of the file.  I think EOF makes more sense.

The fe_start + fe_len in the fm_extents represent the physical location on
the block device for that extent.  fm_extent[i].fe_start (per Anton) is
undefined if FIEMAP_LEN_HOLE is set, and .fe_len is the length of the hole.

 Okay, looking at your example use below that's what it looks like.
 And when you refer to fm_start below, you mean fm_start.fe_start?
 Sorry, I realise this is just an approximation but this part confused me.

Right, I'll write up a new RFC based on feedback here, and correcting the
various errors in the original proposal.

 So you get rid of all the logical file offsets in the extents because we
 report holes explicitly (and we know everything is contiguous if you
 include the holes).

Correct.  It saves space in the common case.

 Caller works something like:
 
  char buf[4096];
  struct fiemap *fm = (struct fiemap *)buf;
  int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
  
  fm-fm_start.fe_start = 0; /* start of file */
  fm-fm_start.fe_len = -1;   /* end of file */
  fm-fm_extent_count = count; /* max extents in fm_extents[] array */
  fm-fm_flags = 0;   /* maybe no DMAPI, etc like XFS */
 
  fd = open(path, O_RDONLY);
  printf(logical\t\tphysical\t\tbytes\n);
 
  /* The last entry will have less extents than the maximum */
  while (fm-fm_extent_count == count) {
  rc = ioctl(fd, FIEMAP, fm);
  if (rc)
  break;
 
  /* kernel filled in fm_extents[] array, set fm_extent_count
   * to be actual number of extents returned, leaves
   * fm_start.fe_start alone (unlike XFS_IOC_GETBMAP). */
 
  for (i = 0; i  fm-fm_extent_count; i++) {
  __u64 len = fm-fm_extents[i].fe_len  
  FIEMAP_LEN_MASK;
  __u64 fm_next = fm-fm_start.fe_start + len;
  int hole = fm-fm_extents[i].fe_len  
  FIEMAP_LEN_HOLE;
  int unwr = fm-fm_extents[i].fe_len  
  FIEMAP_LEN_UNWRITTEN;
 
  printf(%llu-%llu\t%llu-%llu\t%llu\t%s%s\n,
  fm-fm_start.fe_start, fm_next - 1,
  hole ? 0 : fm-fm_extents[i].fe_start,
  hole ? 0 : fm-fm_extents[i].fe_start +
 fm-fm_extents[i].fe_len - 1,
  len, hole ? (hole)  : ,
  unwr ? (unwritten)  : );
 
  /* get ready for printing next extent, or next ioctl 
  */
  fm-fm_start.fe_start = fm_next;
  }
  }
 

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Andreas Dilger
On Apr 16, 2007  21:22 +1000, David Chinner wrote:
 On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
  struct fiemap_extent {
  __u64 fe_start; /* starting offset in bytes */
  __u64 fe_len;   /* length in bytes */
  }
  
  struct fiemap {
  struct fiemap_extent fm_start;  /* offset, length of desired mapping */
  __u32 fm_extent_count;  /* number of extents in array */
  __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
  __u64 unused;
  struct fiemap_extent fm_extents[0];
  }
  
  #define FIEMAP_LEN_MASK 0xff
  #define FIEMAP_LEN_HOLE 0x01
  #define FIEMAP_LEN_UNWRITTEN0x02
 
 I'm not sure I like stealing bits from the length to use a flags -
 I'd prefer an explicit field per fiemap_extent for this.

Christoph expressed the same concern.  I'm not dead set against having an
extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may
mean the need for 50% more ioctls if the file is large.


Below is an aggregation of the comments in this thread:

struct fiemap_extent {
__u64 fe_start; /* starting offset in bytes */
__u64 fe_len;   /* length in bytes */
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_lun;   /* logical storage device number in array */
}

struct fiemap {
__u64 fm_start; /* logical start offset of mapping (in/out) */
__u64 fm_len;   /* logical length of mapping (in/out) */
__u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
__u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
__u64 fm_unused;
struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC0x0001  /* flush delalloc data to disk*/
#define FIEMAP_FLAG_HSM_READ0x0002  /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT0xff00  /* must understand these flags*/

/* flags for the returned extents */
#define FIEMAP_EXTENT_HOLE  0x0001  /* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN 0x0002  /* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN   0x0004  /* in use, location unknown */
#define FIEMAP_EXTENT_ERROR 0x0008  /* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT 0x0010  /* no direct data access */



SUMMARY OF CHANGES
==
- use fm_* fields directly in request instead of making it a fiemap_extent
  (though they are layed out identically)

- separate flags word for fm_flags:
  - FIEMAP_FLAG_SYNC = range should be synced to disk before returning
mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise
  - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified
(this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag)
  - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future
if there is agreement on whether that is desirable to have or if it is
better to call ioctl(FIEMAP) on an XATTR fd.
  - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel
must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we
don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it

- __u64 fm_unused does not take up an extra space on all power-of-two buffer
  sizes (would otherwise be at end of buffer), and may be handy in the future.

- add separate fe_flags word with flags from various suggestions:
  - FIEMAP_EXTENT_HOLE = extent has no space allocation
  - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data
  - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
(e.g. HSM, delalloc awaiting sync, etc)
  - FIEMAP_EXTENT_ERROR = error mapping extent.  Should fe_lun == errno?
  - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data
encrypted, compressed, etc), may want separate flags for these?

- add new fe_lun word per extent for filesystems that manage multiple devices
  (e.g. OCFS, GFS, ZFS, Lustre).  This would otherwise have been unused.


 Given that xfs_bmap uses extra information from the filesystem
 (geometry) to display extra (and frequently used) information
 about the alignment of extents. ie:
 
 chook 681% xfs_bmap -vv fred
 fred:
  EXT: FILE-OFFSET  BLOCK-RANGE  AG AG-OFFSET  TOTAL FLAGS
0: [0..151]:288444888..288445039  8 (1696536..1696687)   152 00010
  FLAG Values:
 01 Unwritten preallocated extent
 001000 Doesn't begin on stripe unit
 000100 Doesn't end   on stripe unit
 10 Doesn't begin on stripe width
 01 Doesn't end   on stripe width

Can you clarify the terminology here?  What is a stripe unit and what is
a stripe width?  Are there N * stripe_unit = stripe_width in e.g. a
RAID

Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4

2007-04-19 Thread Andreas Dilger
On Apr 19, 2007  12:15 -0700, Mingming Cao wrote:
 On Sun, 2007-04-15 at 10:16 -0600, Andreas Dilger wrote:
  Just a quick note before I forget.  I thought there was a call in ext4
  to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has
  more than 2^32 blocks?
 
 Question about the online resize case. If the fs is increased to more
 than 2^32 blocks, we should set this JBD2_FEATURE_INCOMPAT_64BIT in the
 journal. What about existing transactions that still stores 32 bit block
 numbers?  I guess the journal need to commit them all so that revoke
 will not get confused about the bits for block numbers later.  After
 that done then JBD2 can set this feature safely.

Well, there are two options here:
1) refuse resizing filesystems beyond 16TB
   - this is required if they were not formatted as ext4 to start with, as
 the group descriptors will not be large enough to handle the _hi
 word in the bitmap/inode table locations
   - this is also a problem for block-mapped files that need to allocate
 blocks beyond 16TB (though this could just fail on those files with
 e.g. ENOSPC or EFBIG or something similar)
2) flush the journal (like ext4_write_super_lockfs()) while resizing beyond
   16TB.  This would also require changing over to META_BG at some point,
   because there cannot be enough reserved group descriptor blocks (the
   resize_inode is set up for a maximum of 2TB filesystems I think)
   
For now I'd be happy with just setting the JBD2_*_64BIT flag at mount for
filesystems  16TB, and refusing resize across 16TB.  We can fix it later.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4

2007-04-19 Thread Andreas Dilger
On Apr 19, 2007  17:41 -0700, Mingming Cao wrote:
 Any concerns about turn on META_BG by default for all new ext4 fs?
 Initially I thought we only need META_BG for support 256TB, so there is
 no rush to turn it on for all the new fs. But it appears there are
 multiple benefits to enable META_BG by default:

I would prefer not to have it default for the first 1TB or so of the
filesystem or so.  One reason is that using META_BG for all of the groups
give us only 2 backups of each group descriptor, and those are relatively
close together.  In the first 1TB we would get 17 backups of the group
descriptors, which should be plenty.

 - enable online resize 2TB

Actually, I don't think the current online resize support for META_BG.
There was a patch last year by Glauber de Oliveira Costa which added
support for online resizing with META_BG, which would need to be updated
to work with ext4.  Also, the usage of s_first_meta_bg in that patch is
incorrect.

 - support 256TB fs 

True, though not exactly pressing, and filesystems can be changed
to add META_BG support at any point.

 - Since metadatas(bitmaps, group descriptors etc) are not put at the
   beginning of each block group anymore, the 128MB limit(block group size
   with 4k block size) that used to limit an extent size is removed. 
 - Speed up fsck since metadata are placed closely. 

That isn't really true, even though descriptions of META_BG say this.
There will still be block and inode bitmaps and the inode table.
The ext3 code was missing support for moving the bitmaps/itable outside
their respective groups, and that has not been fixed yet in ext4.

The problem is that ext4_check_descriptors() in the kernel was never
changed to support META_BG, so it does not allow the bitmaps or inode
table to be outside the group.  Similarly, ext2fs_group_first_block()
and ext2fs_group_last_block() in lib/ext2fs also don't take META_BG
into account.

Also, since the extent format supports at most 2^15 blocks (128MB) it
doesn't really make much difference in that regard, though it does help
the allocator somewhat because it has more contiguous space to allocate
from.

 So I am wondering why not make it default?

It wouldn't be too hard to add in support for this I think, and there
is definitely some benefit.  Since neither e2fsprogs nor the kernel
handle this correctly, the placement of bitmaps and inode tables outside
of their respective groups may as well be a separate feature.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix up lazy_bg bitmap initialization at mkfs time

2007-04-20 Thread Andreas Dilger
On Apr 20, 2007  09:57 -0500, Eric Sandeen wrote:
 Anyway, how about something like this for calculating journal size in 
 the face of lazy_bg.  I know the last group may be smaller... but I figure
 this is just a heuristic anyway.
 
 Signed-off-by: Eric Sandeen [EMAIL PROTECTED]
 
 Index: e2fsprogs-1.39_ext4_hg/misc/util.c
 ===
 --- e2fsprogs-1.39_ext4_hg.orig/misc/util.c
 +++ e2fsprogs-1.39_ext4_hg/misc/util.c
 @@ -252,8 +252,16 @@ void parse_journal_opts(const char *opts
  int figure_journal_size(int size, ext2_filsys fs)
  {
   blk_t j_blocks;
 + blk_t fs_size;
  
 - if (fs-super-s_blocks_count  2048) {
 + if (EXT2_HAS_COMPAT_FEATURE(fs-super, 
 + EXT2_FEATURE_COMPAT_LAZY_BG)) {
 + fs_size = fs-super-s_blocks_per_group * 2; 
 + } else {
 + fs_size = fs-super-s_blocks_count;
 + }

This should also check for !RO_COMPAT_UNINIT_GROUPS, since LAZY_BG
can be used in conjunction with UNINIT_GROUPS to mean don't zero
uninitialized groups but doesn't set bg_free_blocks_count == 0.

I was going to suggest using s_free_blocks_count, but that might
lead to confusion if e.g. e2fsck deletes a journal on a nearly-full
fs and then tune2fs recreates it much smaller than before.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix ext2 allocator overflows above 31 bit blocks

2007-04-20 Thread Andreas Dilger
On Apr 20, 2007  12:10 -0500, Eric Sandeen wrote:
 If ext3 can do 16T, ext2 probably should be able to as well.
 There are still int block containers in the block allocation path
 that need to be fixed up.

Yeah, but who wants to do 16TB e2fsck on every boot?  I think there
needs to be some limits imposed for the sake of usability.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Andreas Dilger
On Apr 27, 2007  08:30 -0700, Linus Torvalds wrote:
 On a good filesystem, when you do fsync() on a file, nothing at all 
 happens to any other files. On ext3, it seems to sync the global journal, 
 which means that just about *everything* that writes even a single byte 
 (well, at least anything journalled, which would be all the normal 
 directory ops etc) to disk will just *stop* dead cold!
 
 It's horrid. And it really is ext3, not fsync().
 
 I used to run reiserfs, and it had its problems, but this was the 
 feature of ext3 that I've disliked most. If you run a MUA with local 
 mail, it will do fsync's for most things, and things really hickup if you 
 are doing some other writes at the same time. In contrast, with reiser, if 
 you did a big untar or some other big write, if somebody fsync'ed a small 
 file, it wasn't even a blip on the radar - the fsync would sync just that 
 small thing.

It's true that this is a feature of ext3 with data=ordered (the default),
but I suspect the same thing is now true in reiserfs too.  The reason is
that if a journal commit doesn't flush the data as well then a crash will
result in garbage (from old deleted files) being visible in the newly
allocated file.  People used to complain about this with reiserfs all the
time having corrupt data in new files after a crash, which is why I believe
it was fixed.

There definitely are some problems with the ext3 journal commit though.
If the journal is full it will cause the whole journal to checkpoint out
to the filesystem synchronously even if just space for a small transaction
is needed.  That is doubly bad if you have a very large journal.  I believe
Alex has a patch to have it checkpoint much smaller chunks to the fs.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Large File Deletion Comparison (ext3, ext4, XFS)

2007-04-27 Thread Andreas Dilger
On Apr 27, 2007  14:33 -0400, Theodore Tso wrote:
  Here are the results obtained with a not very fragmented 100-GB file:
  
   | ext3   ext4 + extents  xfs
  
   nb of fragments | 796 798 15
   elapsed time|  2m0.306s0m11.127s   0m0.553s
   |
   blks read   |  2066006416352
   blks written|   13592   13064104
  
 
 The metablockgroups feature should help the file fragmentation level
 with extents.  It's easy enough to enable this for ext4 (we just need
 to remove some checks in ext4_check_descriptors), so we should just do
 it.

I agree in this case that the META_BG feature would help here (100GB / 128MB
is in fact the 800 fragments shown), I don't think that is the major
performance hit.

The fact that we need to read 6000 blocks and write 13000 blocks is the
more serious part.  I assume that since there are only 800 fragments
there should be only 800 extents.  We can fit (4096 / 12 - 1) = 340
extents into each block, and 4 index blocks into the inode, so this
should allow all 800 extents in only 3 index blocks.  It would be useful
to know where those 6416 block reads are going in the extent case.

I suspect that is because the tail first truncation mechanism of ext3
causes it to zero out FAR more blocks than needed.  With extents and a
default 128MB journal we should be able to truncate + unlink a file with
only writes to the inode and the 800 bitmap + gdt blocks.  The reads should
also be limited to the bitmap blocks and extent indexes (gdt being read at
mount time).

What is needed is for truncate to walk the inode block tree (extents or
indirect blocks) and count the bitmaps + gdt blocks dirtied, and then try
and do the whole truncate under a single transaction.  That avoids any need
for truncate to be restartable and then there is no need to zero out the
indirect blocks from the end one-at-a-time.

Doing the bitmap read/write will definitely be more efficient with META_BG,
but that doesn't explain the other 19k blocks undergoing IO.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Large File Deletion Comparison (ext3, ext4, XFS)

2007-04-27 Thread Andreas Dilger
On Apr 27, 2007  15:41 +0200, Valerie Clement wrote:
 As asked by Alex, I included in the test results the file fragmentation 
 level and the number of I/Os done during the file deletion.
 
 Here are the results obtained with a not very fragmented 100-GB file:
 
  | ext3   ext4 + extents  xfs
 
  nb of fragments | 796 798 15
  elapsed time|  2m0.306s0m11.127s   0m0.553s
  |
  blks read   |  2066006416352
  blks written|   13592   13064104
 
 
 
 And with a more fragmented 100-GB file:
 
  | ext3   ext4 + extents   xfs
 
  nb of fragments |   20297   19841234
  elapsed time| 2m18.914s0m27.429s  0m0.892s
  |
  blks read   |  225624   25432592
  blks written|   52120   50664872
 
 
 
 More details on our web site:
 http://www.bullopensource.org/ext4/20070404/FileDeletion.html

Ah, one thing that is only mentioned in the URL is that the IO count is
in units of 512-byte sectors.  In the case of XFS doing logical journaling
this avoids a huge amount of double writes to the journal and then to the
filesystem.  I still think ext4 could do better than it currently does.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext2/3 block remapping tool

2007-05-01 Thread Andreas Dilger
On Apr 30, 2007  08:09 -0400, Theodore Tso wrote:
 On Fri, Apr 27, 2007 at 12:09:42PM -0600, Andreas Dilger wrote:
  I'd prefer that such functionality be integrated with Takashi's online
  defrag tool, since it needs virtually the same functionality.  For that
  matter, this is also very similar to the block-mapped - extents tool
  from Aneesh.  It doesn't make sense to have so many separate tools for
  users, especially if they start interfering with each other (i.e. defrag
  undoes the remapping done by your tool).
 
 While we're at it, someone want to start thinking about on-line
 shrinking of ext4 filesystems?  Again, the same block remapping
 interfaces for defrag and file access optimizations should also be
 useful for shrinking filesystems (even if some of the files that need
 to be relocated are being actively used).  If not, that probably means
 we got the interface wrong.

Except one other issue with online shrinking is that we need to move
inodes on occasion and this poses a bunch of other problems over just
remapping the data blocks.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e2fsprogs-1.39-tyt3

2007-05-01 Thread Andreas Dilger
On Apr 30, 2007  11:22 -0400, Theodore Ts'o wrote:
 One concern I still have is the fact that we're exposing a lot of
 interfaces in libext2fs.so which are very specifically tied to the
 current 48-bit physical/32-bit logical on-disk extent data structure.
 If/when we add support for the 64/64-bit extent data structure, and some
 kind of compressed/bit-packed extent format, those interfaces will have
 to be extended.

Some other comments:
- it appears you have all of extents.c copied into both block.c and
  in extent.c
- it appears you are missing a couple of extent sanity checks for headers
  and such that were added more recently, but those might also have been
  removed by you
- do the patches as they stand actually fix the duplicate block test
  cases yet?  It seems all of the BLOCK_CHANGED handling was removed
  from block_iterate_extents().
- it would be easier to read if the patches were generated with diff -up
  (can be set in .quiltrc via 'export QUILT_DIFF_OPTS=-up'.  I also like
  export QUILT_NO_DIFF_TIMESTAMPS=1 to avoid tons of gratuitous changes
  to patches when they are refreshed.

 Another problem is that while external extent blocks are getting byte
 swapped, there is no byte swapping going on for the extents stored in
 the inode.  I haven't tried it on a big endian system, such as Power
 machine, but I'm pretty sure it's going to blow up spectacularly.

Ah, interesting.  We have 64-bit machines for testing, but no big-endian
ones.

 The patches can be found at:
 
 ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/e2fsprogs-1.39-tyt3

What else I think is important to get into this patch series is a change
to mkfs.ext4 so that the default inode size is increased to 256.  I'm
not sure how/if this can be done via mke2fs.conf.

I also wouldn't object to changing the default inode_ratio on large
filesystems to be a lot fewer than 1/8192 bytes.  At a minimum, allowing
an inode_ratio  1/4MB should be allowed.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e2fsprogs-1.39-tyt3

2007-05-01 Thread Andreas Dilger
On May 01, 2007  11:05 -0400, Theodore Tso wrote:
 Speaking of SCM's, how wedded is Clustrefs to mercurial?  I've been
 considering migrating e2fsprogs development to git, but one of the
 reasons I haven't is because I was concerned that might be incovenient
 for you folks.

Not at all, we just generate a 1.39 to 1.39-WIP patch, and then use
quilt + CVS to manage the resulting patches.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext2/3 block remapping tool

2007-05-01 Thread Andreas Dilger
On May 01, 2007  11:28 -0400, Theodore Tso wrote:
 On Tue, May 01, 2007 at 12:01:42AM -0600, Andreas Dilger wrote:
  Except one other issue with online shrinking is that we need to move
  inodes on occasion and this poses a bunch of other problems over just
  remapping the data blocks.
 
 Well, I did say necessary, and not sufficient.  But yes, moving
 inodes, especially if the inode is currently open gets interesting.  I
 don't think there are that many user space applications that would
 notice or care if the st_ino of an open file changed out from under
 them, but there are obviously userspace applications, such as tar,
 that would most definitely care.

I think rm -r does a LOT of this kind of operation, like:

stat(.); stat(foo); chdir(foo); stat(.); unlink(*); chdir(..); stat(.)

I think find does the same to avoid security problems with malicious
path manipulation.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-03 Thread Andreas Dilger
On May 02, 2007  20:57 +1000, David Chinner wrote:
 On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote:
  HSM_READ is definitely _NOT_ required because all  
  it means is if the file is OFFLINE, bring it ONLINE and then return  
  the extent map.
 
 You've got the definition of HSM_READ wrong. If the flag is *not*
 set, then we bring everything back online and return the full extent
 map.
 
 Specifying the flag indicates that we do *not* want the offline
 extents brought back online.  i.e. it is a HSM or a datamover
 (e.g. backup program) that is querying the extents and we want to
 known *exactly* what the current state of the file is right now.
 
 So, if the HSM_READ flag is set, then the application is
 expecting the filesytem to be part of a HSM. Hence if it's not,
 it should return an error because somebody has done something wrong.

In my original proposal I specifically pointed out that the
FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX
BMV_IF_NO_DMAPI_READ flag.  Data is retrieved from HSM only if the
HSM_READ flag is set.  That's why the flag is called HSM_READ instead
of HSM_NO_READ.

The reason is that it seems bad if the default behaviour for calling
ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is
only disabled by specifying a flag.  It makes a lot more sense to just
leave the data as it is and return the extent mapping by default (i.e.
this is the principle of least surprise).  It would probably be equally
surprising and undesirable if the default behaviour was to force all
data out to HSM.



For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should
even be a part of this interface?  I have no problem with returning a
flag that reports if the data is migrated to HSM and whether it is UNMAPPED.

Having FIEMAP force the retrieval of data from HSM strikes me as something
that should be a part of a separate HSM interface, which also needs to be
able to do things like push specific files or parts thereof out to HSM,
set the aging policy, and return information like where does the HSM
file live and how many copies are there.

Do you know the reasoning behind including this into XFS_IOC_GETBMAPX?
Looking at the bmap.c comments it appears it is simply because the API
isn't able to return something like UNMAPPED|HSM_RESIDENT to indicate
there is data in HSM but it has no blocks allocated in the filesystem.

I don't think it makes the operation significantly more efficient than
say ioctl(DMAPI_FORCE_READ); ioctl(FIEMAP) if an application actually
needs the data to be present instead of just returning mapping info that
includes UNMAPPED.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] fallocate system call

2007-05-03 Thread Andreas Dilger
On May 02, 2007  18:23 +0530, Amit K. Arora wrote:
 On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
  On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:
  
   For FA_ALLOCATE, it's supposed to change the file size if we
   allocate past EOF, right?
  
  I would argue no.  Use truncate for that.
 
 The patch I posted for ext4 *does* change the filesize after
 preallocation, if required (i.e. when preallocation is after EOF).
 I may have to change that, if we decide on not doing this.

I think I'd agree - it may be useful to allow preallocation beyond EOF
for some kinds of applications (e.g. PVR preallocating live TV in 10
minute segments or something, but not knowing in advance how long the
show will actually be recorded or the final encoded size).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: JBD: ext2online wants too many credits (744 256)

2007-05-07 Thread Andreas Dilger
On May 06, 2007  21:40 -0700, Andrew Morton wrote:
 On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen [EMAIL PROTECTED] 
 wrote:
 
  2.6.20.6, FC4:
  
  I created a 91248k ext3 fs with 4k blocksize:
  
  Next, I tried to resize it to about 3G using ext2online while mounted:
  
  At that time the kernel said:
  
  |JBD: ext2online wants too many credits (744  256)
  
  What is the limitation I should be aware of? Has it something to do with
  the journal log size?

Yes, for very small filesystems the default journal size is only 4MB, and
the maximum transaction size is 1MB (256 blocks).

If the filesystem was 128MB or larger in the initial mke2fs then the
journal would be 16MB and the resize would get up to 1024 blocks for
the transaction.

I'm not sure what the right solution is for this.  If you know you will
be resizing the fs you could increase the initial journal size at mke2fs
time (-J size=16).  Alternately, it is possible resize to an intermediate
size and then delete and recreate the journal via tune2fs (which would
be the larger size by default) but that can only be done offline.  

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 03, 2007  21:31 -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
  + * ext4_fallocate:
  + * preallocate space for a file
  + * mode is for future use, e.g. for unallocating preallocated blocks etc.
  + */
 
 This description is rather thin.  What is the filesystem's actual behaviour
 here?  If the file is using extents then the implementation will do
 something.  If the file is using bitmaps then we will do something else.
 
 But what?   Here is where it should be described.

My understanding is that glibc will handle zero-filling of files for
filesystems that do not support fallocate().

  +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
  len)
  +{
  +   handle_t *handle;
  +   ext4_fsblk_t block, max_blocks;
  +   int ret, ret2, nblocks = 0, retries = 0;
  +   struct buffer_head map_bh;
  +   unsigned int credits, blkbits = inode-i_blkbits;
  +
  +   /* Currently supporting (pre)allocate mode _only_ */
  +   if (mode != FA_ALLOCATE)
  +   return -EOPNOTSUPP;
  +
  +   if (!(EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL))
  +   return -ENOTTY;
 
 So we don't implement fallocate on bitmap-based files!  Well that's huge
 news.  The changelog would be an appropriate place to communicate this,
 along with reasons why, or a description of the plan to fix it.
 
 Also, posix says nothing about fallocate() returning ENOTTY.

I _think_ this is to convince glibc to do the zero-filling in userspace,
but I'm not up on the API specifics.

  +   block = offset  blkbits;
  +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits)  blkbits)
  +- block;
  +   mutex_lock(EXT4_I(inode)-truncate_mutex);
  +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
  +   mutex_unlock(EXT4_I(inode)-truncate_mutex);
 
 Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
 space, and that this disk space will require an arbitrary amount of
 metadata, how can we work out how much journal space we'll be needing
 without at least looking at `len'?

Good question.

The uninitialized extent can cover up to 128MB with a single entry.
If @path isn't specified, then ext4_ext_calc_credits_for_insert()
function returns the maximum number of extents needed to insert a leaf,
including splitting all of the index blocks.  That would allow up to 43GB
(340 extents/block * 128MB) to be preallocated, but it still needs to take
the size of the preallocation into account (adding 3 blocks per 43GB - a
leaf block, a bitmap block and a group descriptor).

Also, since @path is not being given then truncate_mutex is not needed.

  +   ret = ext4_ext_get_blocks(handle, inode, block,
  + max_blocks, map_bh,
  + EXT4_CREATE_UNINITIALIZED_EXT, 0);
  +   BUG_ON(!ret);
 
 BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
 ext4_error() would be safer and more useful here.

Ouch, not very friendly error handling.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][take 4] e2fsprogs: Add ext4migrate

2007-05-07 Thread Andreas Dilger
On May 07, 2007  19:16 +0530, Aneesh Kumar K.V wrote:
 Theodore Tso wrote:
 Actually, changing the inode size is actually going to be more useful
 to a larger number of people, since they can use it today to support
 in-inode EA's.  In addition, changing the inode size must be done
 off-line, whereas it's not so clear that off-line conversion to
 extents is the best way to go (more on that below).
 
 That way we don't need to add kernel code that convert ext3 inode to 
 ext4 inode while defragmentation. Last time i looked at the online 
 defrag code, i was not able to figure out a easy way to take indirect 
 map inode as input, then defrag the same and write the result in extent 
 map.

I think the online defrag code _should_ be fairly easily made to handle
defrag of block-mapped files, so long as it is always creating extent-
mapped files in the end.  All that should be necessary is to have the
kernel read data from the block-mapped file and write it into the newly
mapped blocks.  That functionality needs to exist for a long time anyways
to support upgrading the filesystem so it shouldn't be a huge burden.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: JBD: ext2online wants too many credits (744 256)

2007-05-07 Thread Andreas Dilger
On May 07, 2007  11:50 -0400, Theodore Tso wrote:
 On Mon, May 07, 2007 at 07:46:38AM -0700, [EMAIL PROTECTED] wrote:
  On Mon, 7 May 2007, Theodore Tso wrote:
  We're going to have to rethink the defaults of the journal size as it
  relates to on-line resizing, but there might not be a lot of great
  answers here.  We'll also have to look at the on-line resizing code
  and see if there's a way to break up the resize operation into smaller
  transactions as a way of avoiding this problem --- but that would
  still leave the user stuck with a pathetically small 4M journal on a
  3G filesystem.
 
 Clearly the right thing for resize2fs to do when it's doing an
 off-line resize is to just adjust the journal inode and make it
 bigger.  I'll add that to my todo list, but of course, patches to do
 that would be gratefully accepted...

For that matter, there was a paper from U. Wisconsin showing that
having the journal in the middle of the filesystem gave noticably
better performance due to lower average seek times.  If anyone is
looking at messing with the journal that is probably also a good
and easy place to start (e.g. may be as easy as just giving a goal
block of s_blocks_count / 2 to the journal create code).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults

2007-05-07 Thread Andreas Dilger
On May 07, 2007  14:52 -0500, Eric Sandeen wrote:
 Theodore Tso wrote:
  How likely do you think the case will be that mke2fs.conf would be
  missing?  I'm trying to figure out how high priority of an item this
  really is.
 
 Well, not too likely, although for some reason I guess it happened in
 the installer root in FC6 or so.  That's what raised the issue.

Ah, good point - there are probably lots of installers and rescue disks
that grab mke2fs but don't know anything about /etc/mke2fs.conf.

  We could enhance the profile code so that it could read in the profile
  from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs,
  but that adds bloat --- the question is how necessary do we think that
  really is?
 
 I guess it doesn't really sound *necessary* - it's just that if we have
 2 different defaults and they drift, it can be confusing...

Since the shipped mke2fs.conf is only on the order of a few hundred bytes
I don't think it is a huge issue to include it.  I like the idea of it
being compiled into the code and yet consistent with the default install.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 07, 2007  19:02 -0400, Jeff Garzik wrote:
 Andreas Dilger wrote:
 Actually, this is a non-issue.  The reason that it is handled for 
 extent-only is that this is the only way to allocate space in the
 filesystem without doing the explicit zeroing.
 
 Precisely /how/ do you avoid the zeroing issue, for extents?
 
 If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
 otherwise the implementation is broken.

In ext4 (as in XFS) there is a flag stored in the extent that tells if
the extent is initialized or not.  Reads from uninitialized extents will
return zero-filled data, and writes that don't span the whole extent
will cause the uninitialized extent to be split into a regular extent
and one or two uninitialized extents (depending where the write is).

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Creating a 32bit blocks filesystem.

2007-05-09 Thread Andreas Dilger
On May 09, 2007  16:57 +0200, Valerie Clement wrote:
 Jose R. Santos wrote:
 I think this has more to do with the fact that I'm on a 32bit
 architecture and there are still a couple places where blocks are
 represented using unsigned long.  I'm trying to get access to a 64bit
 arch to confirm this.

 Oh, I didn't catch that you use a 32-bit system.
 On 32-bit architectures, the page cache index size imposes a 16TB limit 
 on the filesystem size (with 4KB blocksize). So you need a 64-bit system 
 for your test.

The mke2fs code should warn the user in this case that the filesystem will
not be usable on 32-bit systems.  I believe mke2fs already checks PAGE_SIZE
in order to validate blocksize  4096 filesystem requests, and a simple
check for sizeof(long) to see if it is a 32-bit system would be enough.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: commit only journal entries older than commit period?

2007-05-09 Thread Andreas Dilger
On May 09, 2007  15:31 +0200, Leon Woestenberg wrote:
 this is something I have long wondered about but have been afraid to ask.
 
 When my system is chewing away on builds, the disk I/O write access
 pattern of my ext3 root filesystem (using CFQ, Intel SATA controller,
 hard disk) when visualized by GNOME System Monitor clearly shows a
 repetitive landscape of large peaks, 5 seconds apart, which not much
 activity inbetween.
 
 I understand that's due to the ex3 journal commit interval (defaults
 to 5 seconds).
 
 But why isn't the filesystem continuously committing only that part of
 the journal that is older than 5 seconds?
 
 I would then expect the write requests to be smoothened over time,
 which can only be good in terms of performance and low latency.

The journal cannot start checkpoint of blocks to the filesystem until
the transaction is committed, otherwise a crash would leave the filesystem
inconsistent.  After the transaction is committed the blocks are written
into the fs and I'd guess that this can happen quickly enough that it
only takes a fraction of a second.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Andreas Dilger
On May 09, 2007  21:31 +0530, Amit K. Arora wrote:
 2) For FA_UNALLOCATE mode, should the file system allow unallocation
of normal (non-preallocated) blocks (blocks allocated via
regular write/truncate operations) also (i.e. work as punch()) ?
- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
  we need to finalize on the convention here as a general guideline
  to all the filesystems that implement fallocate.

I would only allow this on FA_ALLOCATE extents.  That means it won't be
possible to do this for filesystems that don't understand unwritten
extents unless there are blocks allocated beyond EOF.

 3) If above is true, the file size will need to be changed
for unallocation when block holding the EOF gets unallocated.
- If we do not unallocate normal (non-preallocated) blocks and we
  do not change the file size on preallocation, then this is a
  non-issue.

Not necessarily.  That will just make the file sparse.  If FA_ALLOCATE
does not change the file size, why should FA_UNALLOCATE.

 4) Should we update mtime  ctime on a successfull allocation/
unallocation ?

I would say yes.  If glibc does the fallback fallocate via write() the
mtime/ctime will be updated, so it makes sense to be consistent for
both methods.  Also, it just makes sense from the this file was modified
point of view.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext4 devel interlock meeting minutes (May 7, 2007)

2007-05-09 Thread Andreas Dilger
On May 09, 2007  10:02 -0700, Avantika Mathur wrote:
 Metablock Groups:
 
 - Mingming mentioned that in the metablock group feature, the inode 
 metadata is not moved with the rest of the metadata.  Ted will check 
 this, as he thought this had been implemented.
 
 - Online resize does not use metablock groups, and is currently limited 
 to 2TB.
 
 - Also need to remove sanity checks in the mount code
 
 - Once these issues are fixed, the metablock group feature can be turned 
 on by default for ext4 filesysytems

In my recent investigation of META_BG, it appears that BOTH the kernel
and e2fsprogs do NOT handle group metadata (bitmaps, itable) outside the
group.

In the kernel ext[234]_check_descriptors() requires the bitmaps and itable
to be inside the group even though META_BG is supported for a long time.

In e2fsck/super.c check_super_block() also requires that the bitmaps and
itable be inside the group even though e2fsck has claimed META_BG for a
long time.

That means it is virtually impossible to allow the group metadata to move
under the META_BG feature.  It might be possible to predicate this change
under both 64BIT and META_BG, though this is somewhat non-standard.

Alternately, we could allow the relocation of bitmaps and itables under
BIG_BG, which is itself covers the relocation aspects and also allows
the ability to change the group size.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   >