Re: booked-page-flag.patch

2007-02-16 Thread Andrew Morton
On Fri, 16 Feb 2007 10:56:55 +0300 Alex Tomas [EMAIL PROTECTED] wrote:

  Andrew Morton (AM) writes:
 
  AM Well, one could just assume that the page has no disk mapping and go and
  AM make the space reservation.  Things will work out OK when we come to do
  AM writepage().
 
  AM Or one could do both: call get_block() only if the page was inside 
 i_size.
 
 well, even so we need to reserve not that block ony, but
 also needed metadata (for the worst case).

Right.  Reserve all the blocks for the page and all the metadata for a page
at that file offset.  Worst case.

 probably this
 is work for get_block or some different method?

umm, when I did this I think I added a new -reservepage address_space op
and called that from within the VFS's delalloc_prepare_write().

 anyway,
 we have to call it if the page is being written partial. 

Not necessarily.  If we're operating in nobh mode that page is brought
uptodate in prepare_write().

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: booked-page-flag.patch

2007-02-16 Thread Andreas Dilger
On Feb 16, 2007  00:06 -0800, Andrew Morton wrote:
 On Fri, 16 Feb 2007 10:56:55 +0300 Alex Tomas [EMAIL PROTECTED] wrote:
  well, even so we need to reserve not that block ony, but
  also needed metadata (for the worst case).
 
 Right.  Reserve all the blocks for the page and all the metadata for a page
 at that file offset.  Worst case.

This only really becomes an issue when the filesystem (+reservations)
is close to being full.  If we are close enough to worry about running
out of space we can also refuse to do delayed allocation.
 
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: booked-page-flag.patch

2007-02-16 Thread Andreas Dilger
On Feb 15, 2007  09:18 -0800, Eric Sandeen wrote:
 This was for the delayed allocation patchset, right; and by managing 
 this at the page level that means we can't do this for block size  page 
 size, I think...

We could always disable delayed allocation in the PAGE_SIZE != blocksize
case.  It would be no worse than what we have now, and for IO performance
you need larger blocksize anyways.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: i_version_1_vfs_layer

2007-02-16 Thread Cordenner jean noel

Andrew Morton a écrit :

This:

ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/2.6.20-ext4-1/broken-out/i_version_1_vfs_layer

significantly deoptimises file_update_time() for major filesystems (eg, ext3).

The changelog offers no reason for this alteration.  In fact the changelog
basically says nothing at all and needs a lot of work.

What is this patch doing and why does it need to make that change?



Actually, this set of patches are still in progress. I recently profile 
them. It shows that the CPU usage is really huge, so it has to be improved.


The i_version field is a counter that is set on every inode creation and
that is incremented every time the inode data is modified (similarly to
the ctime time-stamp).
The aim is to fulfill NFSv4 requirements for rfc3530.
For the moment, the counter is only a 32bit value but it is planned to 
be 64bit as required.


The patch is divided into 3 parts, the vfs layer, the ext4 specific code
and an user part to check i_version changes via stat.

(cf http://permalink.gmane.org/gmane.comp.file-systems.ext4/923)

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: booked-page-flag.patch

2007-02-16 Thread Dave Kleikamp
On Thu, 2007-02-15 at 23:46 -0800, Andrew Morton wrote:
 On Fri, 16 Feb 2007 10:30:39 +0300 Alex Tomas [EMAIL PROTECTED] wrote:
  hmm. I thought it has to call get_block() at least to know whether
  the block is already allocated. and I was going to reserve space
  in prepare_write for which some fs-specific method is needed becase
  only fs knows how much metadata it'd need.
 
 Well, one could just assume that the page has no disk mapping and go and
 make the space reservation.  Things will work out OK when we come to do
 writepage().
 
 Or one could do both: call get_block() only if the page was inside i_size.

Or call get_block() with create = 0.  Or replace the create argument
with a flags field that can take either GET_BLOCK_CREATE or
GET_BLOCK_RESERVE.
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] free blocks in case of error

2007-02-16 Thread Alex Tomas

Good day,

when ext4_ext_insert_extent() fails to insert new blocks
we should free just allocated blocks. please, consider
for review.

thanks, Alex


Signed-off-by: Alex Tomas [EMAIL PROTECTED]

Index: linux-2.6.20/fs/ext4/extents.c
===
--- linux-2.6.20.orig/fs/ext4/extents.c 2007-02-11 23:16:58.0 +0300
+++ linux-2.6.20/fs/ext4/extents.c  2007-02-17 00:34:32.0 +0300
@@ -2044,8 +2044,12 @@
ext4_ext_store_pblock(newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
err = ext4_ext_insert_extent(handle, inode, path, newex);
-   if (err)
+   if (err) {
+   /* free data blocks we just allocated */
+   ext4_free_blocks(handle, inode, ext_pblock(newex),
+   le16_to_cpu(newex.ee_len));
goto out2;
+   }
 
if (extend_disksize  inode-i_size  EXT4_I(inode)-i_disksize)
EXT4_I(inode)-i_disksize = inode-i_size;
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] sanity checks for on-disk data in extents code

2007-02-16 Thread Alex Tomas

Good day,

with the patch all headers are checked. the code should become
more resistant to on-disk corruptions. needless BUG_ON() have
been removed. please, review for inclusion.

thanks, Alex


Signed-off-by: Alex Tomas [EMAIL PROTECTED]

Index: linux-2.6.20/fs/ext4/extents.c
===
--- linux-2.6.20.orig/fs/ext4/extents.c 2007-02-17 00:34:32.0 +0300
+++ linux-2.6.20/fs/ext4/extents.c  2007-02-17 00:50:24.0 +0300
@@ -92,36 +92,6 @@
ix-ei_leaf_hi = cpu_to_le16((unsigned long) ((pb  31)  1)  
0x);
 }
 
-static int ext4_ext_check_header(const char *function, struct inode *inode,
-   struct ext4_extent_header *eh)
-{
-   const char *error_msg = NULL;
-
-   if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) {
-   error_msg = invalid magic;
-   goto corrupted;
-   }
-   if (unlikely(eh-eh_max == 0)) {
-   error_msg = invalid eh_max;
-   goto corrupted;
-   }
-   if (unlikely(le16_to_cpu(eh-eh_entries)  le16_to_cpu(eh-eh_max))) {
-   error_msg = invalid eh_entries;
-   goto corrupted;
-   }
-   return 0;
-
-corrupted:
-   ext4_error(inode-i_sb, function,
-   bad header in inode #%lu: %s - magic %x, 
-   entries %u, max %u, depth %u,
-   inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic),
-   le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max),
-   le16_to_cpu(eh-eh_depth));
-
-   return -EIO;
-}
-
 static handle_t *ext4_ext_journal_restart(handle_t *handle, int needed)
 {
int err;
@@ -270,6 +240,70 @@
return size;
 }
 
+static inline int
+ext4_ext_max_entries(struct inode *inode, int depth)
+{
+   int max;
+
+   if (depth == ext_depth(inode)) {
+   if (depth == 0)
+   max = ext4_ext_space_root(inode);
+   else
+   max = ext4_ext_space_root_idx(inode);
+   } else {
+   if (depth == 0)
+   max = ext4_ext_space_block(inode);
+   else
+   max = ext4_ext_space_block_idx(inode);
+   }
+
+   return max;
+}
+
+static int __ext4_ext_check_header(const char *function, struct inode *inode,
+   struct ext4_extent_header *eh,
+   int depth)
+{
+   const char *error_msg = NULL;
+   int max = 0;
+
+   if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) {
+   error_msg = invalid magic;
+   goto corrupted;
+   }
+   if (unlikely(le16_to_cpu(eh-eh_depth) != depth)) {
+   error_msg = unexpected eh_depth;
+   goto corrupted;
+   }
+   if (unlikely(eh-eh_max == 0)) {
+   error_msg = invalid eh_max;
+   goto corrupted;
+   }
+   max = ext4_ext_max_entries(inode, depth);
+   if (unlikely(le16_to_cpu(eh-eh_max)  max)) {
+   error_msg = too large eh_max;
+   goto corrupted;
+   }
+   if (unlikely(le16_to_cpu(eh-eh_entries)  le16_to_cpu(eh-eh_max))) {
+   error_msg = invalid eh_entries;
+   goto corrupted;
+   }
+   return 0;
+
+corrupted:
+   ext4_error(inode-i_sb, function,
+   bad header in inode #%lu: %s - magic %x, 
+   entries %u, max %u(%u), depth %u(%u),
+   inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic),
+   le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max),
+   max, le16_to_cpu(eh-eh_depth), depth);
+
+   return -EIO;
+}
+
+#define ext4_ext_check_header(inode,eh,depth)  \
+   __ext4_ext_check_header(__FUNCTION__,inode,eh,depth)
+
 #ifdef EXT_DEBUG
 static void ext4_ext_show_path(struct inode *inode, struct ext4_ext_path *path)
 {
@@ -330,6 +364,7 @@
 /*
  * ext4_ext_binsearch_idx:
  * binary search for the closest index of the given block
+ * the header must be checked before calling this
  */
 static void
 ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int 
block)
@@ -337,9 +372,6 @@
struct ext4_extent_header *eh = path-p_hdr;
struct ext4_extent_idx *r, *l, *m;
 
-   BUG_ON(eh-eh_magic != EXT4_EXT_MAGIC);
-   BUG_ON(le16_to_cpu(eh-eh_entries)  le16_to_cpu(eh-eh_max));
-   BUG_ON(le16_to_cpu(eh-eh_entries) = 0);
 
ext_debug(binsearch for %d(idx):  , block);
 
@@ -389,6 +421,7 @@
 /*
  * ext4_ext_binsearch:
  * binary search for closest extent of the given block
+ * the header must be checked before calling this
  */
 static void
 ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block)
@@ -396,9 +429,6 @@
struct ext4_extent_header *eh = path-p_hdr;
struct ext4_extent *r, *l, *m;
 
-   

Re: data=journal busted

2007-02-16 Thread Randy Dunlap
On Thu, 15 Feb 2007 20:44:45 -0800 Andrew Morton wrote:

 
 I have a report from a google person who just did some basic
 power-it-off-during-a-write testing on 2.6.20's ext3.  ordered-data is OK,
 but data=journal came back with crap in the file data.
 
 Is anyone doing any formal recovery stress-testing?
 
 I suspect we should resurrect and formalise my old
 make-the-disk-stop-accepting-writes-when-a-timer-goes-off thing.  It was
 very useful for stress-testing recovery.

Where is your make-troubles patch, please sir?

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: data=journal busted

2007-02-16 Thread Andrew Morton
On Fri, 16 Feb 2007 14:50:22 -0800
Randy Dunlap [EMAIL PROTECTED] wrote:

 On Thu, 15 Feb 2007 20:44:45 -0800 Andrew Morton wrote:
 
  
  I have a report from a google person who just did some basic
  power-it-off-during-a-write testing on 2.6.20's ext3.  ordered-data is OK,
  but data=journal came back with crap in the file data.
  
  Is anyone doing any formal recovery stress-testing?
  
  I suspect we should resurrect and formalise my old
  make-the-disk-stop-accepting-writes-when-a-timer-goes-off thing.  It was
  very useful for stress-testing recovery.
 
 Where is your make-troubles patch, please sir?

gee, it's all buried in the ext3 debug patch

resurrects, makes it compile


But that's just the concept.  I think the core should be formalisation of the
concept, so that other journalling filesystems can use it.

And that concept is: make a device stop accepting writes when an interrupt
goes off.  I think using a timer is appropriate, and I think setting that
timer at [re]mount-time is suitable.  So it'd be a matter of hooking that
into the block layer (or the drivers) at the appropriate point, exposing
the APIs to the filesystems and then wiring up ext4.

Perhaps it has some synergy with the new fault-injection code, dunno.



otoh, perhaps filesystems don't need to be involved at all.  Just do

echo 42  /sys/block/sda3/reject-writes-after-this-many-milliseconds

while [ $(cat /sys/block/sda3/reject...) -gt 0 ]
do
sleep 1
done

kill test app
unmount
mount
unmount
fsck
mount
check contents of files
repeat




 drivers/ide/ide-io.c |   23 ++
 fs/Kconfig   |4 
 fs/Makefile  |2 
 fs/buffer.c  |  103 -
 fs/direct-io.c   |1 
 fs/ext3/inode.c  |7 
 fs/ext3/namei.c  |   40 -
 fs/ext3/super.c  |   82 ++
 fs/ext3/xattr.c  |4 
 fs/jbd-kernel.c  |  255 +
 fs/jbd/commit.c  |   17 +-
 fs/jbd/transaction.c |   45 +++--
 include/linux/buffer-trace.h |   83 ++
 include/linux/buffer_head.h  |4 
 include/linux/jbd.h  |   24 +--
 15 files changed, 650 insertions(+), 44 deletions(-)

diff -puN drivers/ide/ide-io.c~ext3-debug drivers/ide/ide-io.c
--- a/drivers/ide/ide-io.c~ext3-debug
+++ a/drivers/ide/ide-io.c
@@ -999,6 +999,29 @@ static ide_startstop_t start_request (id
HWIF(drive)-name, (unsigned long) rq);
 #endif
 
+#ifdef CONFIG_JBD_DEBUG
+   {
+   /*
+* Silently stop writing to this disk to simulate a crash.
+*/
+   extern struct block_device *journal_no_write[2];
+   int i;
+
+   if (!(rq-cmd_flags  REQ_RW))
+   goto its_a_read;
+
+   for (i = 0; i  2; i++) {
+   struct block_device *bdev = journal_no_write[i];
+   if (bdev  bdev_get_queue(bdev) == rq-q) {
+   printk(%s: write suppressed\n, __FUNCTION__);
+   ide_end_request(drive, 1, rq-nr_sectors);
+   return ide_stopped;
+   }
+   }
+   }
+its_a_read:
+   ;
+#endif
/* bail early if we've exceeded max_failures */
if (drive-max_failures  (drive-failures  drive-max_failures)) {
goto kill_rq;
diff -puN fs/Kconfig~ext3-debug fs/Kconfig
--- a/fs/Kconfig~ext3-debug
+++ a/fs/Kconfig
@@ -265,6 +265,10 @@ config JBD2_DEBUG
  generated.  To turn debugging off again, do
  echo 0  /proc/sys/fs/jbd2-debug.
 
+config BUFFER_DEBUG
+   bool buffer-layer tracing
+   depends on JBD
+
 config FS_MBCACHE
 # Meta block cache for Extended Attributes (ext2/ext3/ext4)
tristate
diff -puN fs/Makefile~ext3-debug fs/Makefile
--- a/fs/Makefile~ext3-debug
+++ a/fs/Makefile
@@ -31,6 +31,8 @@ obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout
 obj-$(CONFIG_BINFMT_EM86)  += binfmt_em86.o
 obj-$(CONFIG_BINFMT_MISC)  += binfmt_misc.o
 
+obj-$(CONFIG_BUFFER_DEBUG) += jbd-kernel.o
+
 # binfmt_script is always there
 obj-y  += binfmt_script.o
 
diff -puN fs/buffer.c~ext3-debug fs/buffer.c
--- a/fs/buffer.c~ext3-debug
+++ a/fs/buffer.c
@@ -35,6 +35,8 @@
 #include linux/hash.h
 #include linux/suspend.h
 #include linux/buffer_head.h
+#include linux/buffer-trace.h
+#include linux/jbd.h
 #include linux/task_io_accounting_ops.h
 #include linux/bio.h
 #include linux/notifier.h
@@ -53,6 +55,7 @@ init_buffer(struct buffer_head *bh, bh_e
 {
bh-b_end_io = handler;
bh-b_private = private;
+   buffer_trace_init(bh-b_history);
 }
 
 static int sync_buffer(void *word)
@@ -61,6 +64,7 @@ static int sync_buffer(void *word)
struct buffer_head *bh
 

Re: data=journal busted

2007-02-16 Thread Andrew Morton
On Fri, 16 Feb 2007 16:31:09 -0700
Andreas Dilger [EMAIL PROTECTED] wrote:

  I suspect we should resurrect and formalise my old
  make-the-disk-stop-accepting-writes-when-a-timer-goes-off thing.  It was
  very useful for stress-testing recovery.
 
 We have a patch that we use for Lustre testing which allows you to set a
 block device readonly (silently discarding all writes), without the
 filesystem immediately keeling over dead like set_disk_ro.  The readonly
 state persists until the the last reference on the block device is dropped,
 so there are no races w.r.t. VFS cleanup of inodes and flushing buffers
 after the filesystem is unmounted.

Not sure I understand all that.

For this application, we *want* to expose VFS races, errors in handling
EIO, errors in handling lost writes, etc.  It's another form of for-developers
fault injection, not a thing-for-production.

The reason I prefer doing it from the timer interrupt is to toss more
randomness in there, avoid the possibility of getting synchronised
with application or kernel activity in some fashion.

I don't know if there's much value in that, but it provides peace-of-mind.


I'm now seeing reports that ordered-data is corrupting data and metadata,
as is data=writeback.  I'm hoping that the blame lies with the
allededly-battery-backed RAID controller, but it could be ext3.  Has anyone
actually done any decent recovery testing in the past half decade?
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: data=journal busted

2007-02-16 Thread Andreas Dilger
On Feb 16, 2007  15:42 -0800, Andrew Morton wrote:
 On Fri, 16 Feb 2007 16:31:09 -0700
 Andreas Dilger [EMAIL PROTECTED] wrote:
  We have a patch that we use for Lustre testing which allows you to set a
  block device readonly (silently discarding all writes), without the
  filesystem immediately keeling over dead like set_disk_ro.  The readonly
  state persists until the the last reference on the block device is dropped,
  so there are no races w.r.t. VFS cleanup of inodes and flushing buffers
  after the filesystem is unmounted.
 
 Not sure I understand all that.

Actually, the patch was originally based on your just-posted readonly code.
The problem with that code (unlike ours) is that it doesn't do the explicit
ro clearing in ext3_put_super(), but rather not until ALL inodes/buffers/etc
are cleared out for the block device by the VFS.  Otherwise, it is possible
to re-enable rw on the block device, and there still be dirty buffers that
are being flushed out to disk, guaranteeing filesystem inconsistency.

It allows multiple devices to be marked ro at the same time without problem.

It also works on all block devices instead of just IDE.

 For this application, we *want* to expose VFS races, errors in handling
 EIO, errors in handling lost writes, etc.  It's another form of for-developers
 fault injection, not a thing-for-production.

I understand that part, and I agree our patch doesn't include the random
part of the set-readonly code you have.  Both of the patches do have the
drawback that they don't return errors like set_disk_ro(), so that isn't
exercising the error-handling code in ext3/jbd as much as it could.

When we first worked on this in 2.4 the ext3/jbd code was far to fragile to
handle -EROFS under load without oopsing so we chose the silent failure
mode to allow testing w/o crashing all the time.  Maybe today it is better
to just call set_dev_ro() and fix the (hopefully few) -EROFS problems.

 The reason I prefer doing it from the timer interrupt is to toss more
 randomness in there, avoid the possibility of getting synchronised
 with application or kernel activity in some fashion.

Definitely this has its place too.  We chose the opposite route and have
fault-injection calls sprinkled throughout our code, so that we can create
complex regression tests that need specific series of events to be hit.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html