Re: booked-page-flag.patch
On Fri, 16 Feb 2007 10:56:55 +0300 Alex Tomas [EMAIL PROTECTED] wrote: Andrew Morton (AM) writes: AM Well, one could just assume that the page has no disk mapping and go and AM make the space reservation. Things will work out OK when we come to do AM writepage(). AM Or one could do both: call get_block() only if the page was inside i_size. well, even so we need to reserve not that block ony, but also needed metadata (for the worst case). Right. Reserve all the blocks for the page and all the metadata for a page at that file offset. Worst case. probably this is work for get_block or some different method? umm, when I did this I think I added a new -reservepage address_space op and called that from within the VFS's delalloc_prepare_write(). anyway, we have to call it if the page is being written partial. Not necessarily. If we're operating in nobh mode that page is brought uptodate in prepare_write(). - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: booked-page-flag.patch
On Feb 16, 2007 00:06 -0800, Andrew Morton wrote: On Fri, 16 Feb 2007 10:56:55 +0300 Alex Tomas [EMAIL PROTECTED] wrote: well, even so we need to reserve not that block ony, but also needed metadata (for the worst case). Right. Reserve all the blocks for the page and all the metadata for a page at that file offset. Worst case. This only really becomes an issue when the filesystem (+reservations) is close to being full. If we are close enough to worry about running out of space we can also refuse to do delayed allocation. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: booked-page-flag.patch
On Feb 15, 2007 09:18 -0800, Eric Sandeen wrote: This was for the delayed allocation patchset, right; and by managing this at the page level that means we can't do this for block size page size, I think... We could always disable delayed allocation in the PAGE_SIZE != blocksize case. It would be no worse than what we have now, and for IO performance you need larger blocksize anyways. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: i_version_1_vfs_layer
Andrew Morton a écrit : This: ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/2.6.20-ext4-1/broken-out/i_version_1_vfs_layer significantly deoptimises file_update_time() for major filesystems (eg, ext3). The changelog offers no reason for this alteration. In fact the changelog basically says nothing at all and needs a lot of work. What is this patch doing and why does it need to make that change? Actually, this set of patches are still in progress. I recently profile them. It shows that the CPU usage is really huge, so it has to be improved. The i_version field is a counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). The aim is to fulfill NFSv4 requirements for rfc3530. For the moment, the counter is only a 32bit value but it is planned to be 64bit as required. The patch is divided into 3 parts, the vfs layer, the ext4 specific code and an user part to check i_version changes via stat. (cf http://permalink.gmane.org/gmane.comp.file-systems.ext4/923) - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: booked-page-flag.patch
On Thu, 2007-02-15 at 23:46 -0800, Andrew Morton wrote: On Fri, 16 Feb 2007 10:30:39 +0300 Alex Tomas [EMAIL PROTECTED] wrote: hmm. I thought it has to call get_block() at least to know whether the block is already allocated. and I was going to reserve space in prepare_write for which some fs-specific method is needed becase only fs knows how much metadata it'd need. Well, one could just assume that the page has no disk mapping and go and make the space reservation. Things will work out OK when we come to do writepage(). Or one could do both: call get_block() only if the page was inside i_size. Or call get_block() with create = 0. Or replace the create argument with a flags field that can take either GET_BLOCK_CREATE or GET_BLOCK_RESERVE. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] free blocks in case of error
Good day, when ext4_ext_insert_extent() fails to insert new blocks we should free just allocated blocks. please, consider for review. thanks, Alex Signed-off-by: Alex Tomas [EMAIL PROTECTED] Index: linux-2.6.20/fs/ext4/extents.c === --- linux-2.6.20.orig/fs/ext4/extents.c 2007-02-11 23:16:58.0 +0300 +++ linux-2.6.20/fs/ext4/extents.c 2007-02-17 00:34:32.0 +0300 @@ -2044,8 +2044,12 @@ ext4_ext_store_pblock(newex, newblock); newex.ee_len = cpu_to_le16(allocated); err = ext4_ext_insert_extent(handle, inode, path, newex); - if (err) + if (err) { + /* free data blocks we just allocated */ + ext4_free_blocks(handle, inode, ext_pblock(newex), + le16_to_cpu(newex.ee_len)); goto out2; + } if (extend_disksize inode-i_size EXT4_I(inode)-i_disksize) EXT4_I(inode)-i_disksize = inode-i_size; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] sanity checks for on-disk data in extents code
Good day, with the patch all headers are checked. the code should become more resistant to on-disk corruptions. needless BUG_ON() have been removed. please, review for inclusion. thanks, Alex Signed-off-by: Alex Tomas [EMAIL PROTECTED] Index: linux-2.6.20/fs/ext4/extents.c === --- linux-2.6.20.orig/fs/ext4/extents.c 2007-02-17 00:34:32.0 +0300 +++ linux-2.6.20/fs/ext4/extents.c 2007-02-17 00:50:24.0 +0300 @@ -92,36 +92,6 @@ ix-ei_leaf_hi = cpu_to_le16((unsigned long) ((pb 31) 1) 0x); } -static int ext4_ext_check_header(const char *function, struct inode *inode, - struct ext4_extent_header *eh) -{ - const char *error_msg = NULL; - - if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) { - error_msg = invalid magic; - goto corrupted; - } - if (unlikely(eh-eh_max == 0)) { - error_msg = invalid eh_max; - goto corrupted; - } - if (unlikely(le16_to_cpu(eh-eh_entries) le16_to_cpu(eh-eh_max))) { - error_msg = invalid eh_entries; - goto corrupted; - } - return 0; - -corrupted: - ext4_error(inode-i_sb, function, - bad header in inode #%lu: %s - magic %x, - entries %u, max %u, depth %u, - inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic), - le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max), - le16_to_cpu(eh-eh_depth)); - - return -EIO; -} - static handle_t *ext4_ext_journal_restart(handle_t *handle, int needed) { int err; @@ -270,6 +240,70 @@ return size; } +static inline int +ext4_ext_max_entries(struct inode *inode, int depth) +{ + int max; + + if (depth == ext_depth(inode)) { + if (depth == 0) + max = ext4_ext_space_root(inode); + else + max = ext4_ext_space_root_idx(inode); + } else { + if (depth == 0) + max = ext4_ext_space_block(inode); + else + max = ext4_ext_space_block_idx(inode); + } + + return max; +} + +static int __ext4_ext_check_header(const char *function, struct inode *inode, + struct ext4_extent_header *eh, + int depth) +{ + const char *error_msg = NULL; + int max = 0; + + if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) { + error_msg = invalid magic; + goto corrupted; + } + if (unlikely(le16_to_cpu(eh-eh_depth) != depth)) { + error_msg = unexpected eh_depth; + goto corrupted; + } + if (unlikely(eh-eh_max == 0)) { + error_msg = invalid eh_max; + goto corrupted; + } + max = ext4_ext_max_entries(inode, depth); + if (unlikely(le16_to_cpu(eh-eh_max) max)) { + error_msg = too large eh_max; + goto corrupted; + } + if (unlikely(le16_to_cpu(eh-eh_entries) le16_to_cpu(eh-eh_max))) { + error_msg = invalid eh_entries; + goto corrupted; + } + return 0; + +corrupted: + ext4_error(inode-i_sb, function, + bad header in inode #%lu: %s - magic %x, + entries %u, max %u(%u), depth %u(%u), + inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic), + le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max), + max, le16_to_cpu(eh-eh_depth), depth); + + return -EIO; +} + +#define ext4_ext_check_header(inode,eh,depth) \ + __ext4_ext_check_header(__FUNCTION__,inode,eh,depth) + #ifdef EXT_DEBUG static void ext4_ext_show_path(struct inode *inode, struct ext4_ext_path *path) { @@ -330,6 +364,7 @@ /* * ext4_ext_binsearch_idx: * binary search for the closest index of the given block + * the header must be checked before calling this */ static void ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int block) @@ -337,9 +372,6 @@ struct ext4_extent_header *eh = path-p_hdr; struct ext4_extent_idx *r, *l, *m; - BUG_ON(eh-eh_magic != EXT4_EXT_MAGIC); - BUG_ON(le16_to_cpu(eh-eh_entries) le16_to_cpu(eh-eh_max)); - BUG_ON(le16_to_cpu(eh-eh_entries) = 0); ext_debug(binsearch for %d(idx): , block); @@ -389,6 +421,7 @@ /* * ext4_ext_binsearch: * binary search for closest extent of the given block + * the header must be checked before calling this */ static void ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block) @@ -396,9 +429,6 @@ struct ext4_extent_header *eh = path-p_hdr; struct ext4_extent *r, *l, *m; -
Re: data=journal busted
On Thu, 15 Feb 2007 20:44:45 -0800 Andrew Morton wrote: I have a report from a google person who just did some basic power-it-off-during-a-write testing on 2.6.20's ext3. ordered-data is OK, but data=journal came back with crap in the file data. Is anyone doing any formal recovery stress-testing? I suspect we should resurrect and formalise my old make-the-disk-stop-accepting-writes-when-a-timer-goes-off thing. It was very useful for stress-testing recovery. Where is your make-troubles patch, please sir? --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: data=journal busted
On Fri, 16 Feb 2007 14:50:22 -0800 Randy Dunlap [EMAIL PROTECTED] wrote: On Thu, 15 Feb 2007 20:44:45 -0800 Andrew Morton wrote: I have a report from a google person who just did some basic power-it-off-during-a-write testing on 2.6.20's ext3. ordered-data is OK, but data=journal came back with crap in the file data. Is anyone doing any formal recovery stress-testing? I suspect we should resurrect and formalise my old make-the-disk-stop-accepting-writes-when-a-timer-goes-off thing. It was very useful for stress-testing recovery. Where is your make-troubles patch, please sir? gee, it's all buried in the ext3 debug patch resurrects, makes it compile But that's just the concept. I think the core should be formalisation of the concept, so that other journalling filesystems can use it. And that concept is: make a device stop accepting writes when an interrupt goes off. I think using a timer is appropriate, and I think setting that timer at [re]mount-time is suitable. So it'd be a matter of hooking that into the block layer (or the drivers) at the appropriate point, exposing the APIs to the filesystems and then wiring up ext4. Perhaps it has some synergy with the new fault-injection code, dunno. otoh, perhaps filesystems don't need to be involved at all. Just do echo 42 /sys/block/sda3/reject-writes-after-this-many-milliseconds while [ $(cat /sys/block/sda3/reject...) -gt 0 ] do sleep 1 done kill test app unmount mount unmount fsck mount check contents of files repeat drivers/ide/ide-io.c | 23 ++ fs/Kconfig |4 fs/Makefile |2 fs/buffer.c | 103 - fs/direct-io.c |1 fs/ext3/inode.c |7 fs/ext3/namei.c | 40 - fs/ext3/super.c | 82 ++ fs/ext3/xattr.c |4 fs/jbd-kernel.c | 255 + fs/jbd/commit.c | 17 +- fs/jbd/transaction.c | 45 +++-- include/linux/buffer-trace.h | 83 ++ include/linux/buffer_head.h |4 include/linux/jbd.h | 24 +-- 15 files changed, 650 insertions(+), 44 deletions(-) diff -puN drivers/ide/ide-io.c~ext3-debug drivers/ide/ide-io.c --- a/drivers/ide/ide-io.c~ext3-debug +++ a/drivers/ide/ide-io.c @@ -999,6 +999,29 @@ static ide_startstop_t start_request (id HWIF(drive)-name, (unsigned long) rq); #endif +#ifdef CONFIG_JBD_DEBUG + { + /* +* Silently stop writing to this disk to simulate a crash. +*/ + extern struct block_device *journal_no_write[2]; + int i; + + if (!(rq-cmd_flags REQ_RW)) + goto its_a_read; + + for (i = 0; i 2; i++) { + struct block_device *bdev = journal_no_write[i]; + if (bdev bdev_get_queue(bdev) == rq-q) { + printk(%s: write suppressed\n, __FUNCTION__); + ide_end_request(drive, 1, rq-nr_sectors); + return ide_stopped; + } + } + } +its_a_read: + ; +#endif /* bail early if we've exceeded max_failures */ if (drive-max_failures (drive-failures drive-max_failures)) { goto kill_rq; diff -puN fs/Kconfig~ext3-debug fs/Kconfig --- a/fs/Kconfig~ext3-debug +++ a/fs/Kconfig @@ -265,6 +265,10 @@ config JBD2_DEBUG generated. To turn debugging off again, do echo 0 /proc/sys/fs/jbd2-debug. +config BUFFER_DEBUG + bool buffer-layer tracing + depends on JBD + config FS_MBCACHE # Meta block cache for Extended Attributes (ext2/ext3/ext4) tristate diff -puN fs/Makefile~ext3-debug fs/Makefile --- a/fs/Makefile~ext3-debug +++ a/fs/Makefile @@ -31,6 +31,8 @@ obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout obj-$(CONFIG_BINFMT_EM86) += binfmt_em86.o obj-$(CONFIG_BINFMT_MISC) += binfmt_misc.o +obj-$(CONFIG_BUFFER_DEBUG) += jbd-kernel.o + # binfmt_script is always there obj-y += binfmt_script.o diff -puN fs/buffer.c~ext3-debug fs/buffer.c --- a/fs/buffer.c~ext3-debug +++ a/fs/buffer.c @@ -35,6 +35,8 @@ #include linux/hash.h #include linux/suspend.h #include linux/buffer_head.h +#include linux/buffer-trace.h +#include linux/jbd.h #include linux/task_io_accounting_ops.h #include linux/bio.h #include linux/notifier.h @@ -53,6 +55,7 @@ init_buffer(struct buffer_head *bh, bh_e { bh-b_end_io = handler; bh-b_private = private; + buffer_trace_init(bh-b_history); } static int sync_buffer(void *word) @@ -61,6 +64,7 @@ static int sync_buffer(void *word) struct buffer_head *bh
Re: data=journal busted
On Fri, 16 Feb 2007 16:31:09 -0700 Andreas Dilger [EMAIL PROTECTED] wrote: I suspect we should resurrect and formalise my old make-the-disk-stop-accepting-writes-when-a-timer-goes-off thing. It was very useful for stress-testing recovery. We have a patch that we use for Lustre testing which allows you to set a block device readonly (silently discarding all writes), without the filesystem immediately keeling over dead like set_disk_ro. The readonly state persists until the the last reference on the block device is dropped, so there are no races w.r.t. VFS cleanup of inodes and flushing buffers after the filesystem is unmounted. Not sure I understand all that. For this application, we *want* to expose VFS races, errors in handling EIO, errors in handling lost writes, etc. It's another form of for-developers fault injection, not a thing-for-production. The reason I prefer doing it from the timer interrupt is to toss more randomness in there, avoid the possibility of getting synchronised with application or kernel activity in some fashion. I don't know if there's much value in that, but it provides peace-of-mind. I'm now seeing reports that ordered-data is corrupting data and metadata, as is data=writeback. I'm hoping that the blame lies with the allededly-battery-backed RAID controller, but it could be ext3. Has anyone actually done any decent recovery testing in the past half decade? - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: data=journal busted
On Feb 16, 2007 15:42 -0800, Andrew Morton wrote: On Fri, 16 Feb 2007 16:31:09 -0700 Andreas Dilger [EMAIL PROTECTED] wrote: We have a patch that we use for Lustre testing which allows you to set a block device readonly (silently discarding all writes), without the filesystem immediately keeling over dead like set_disk_ro. The readonly state persists until the the last reference on the block device is dropped, so there are no races w.r.t. VFS cleanup of inodes and flushing buffers after the filesystem is unmounted. Not sure I understand all that. Actually, the patch was originally based on your just-posted readonly code. The problem with that code (unlike ours) is that it doesn't do the explicit ro clearing in ext3_put_super(), but rather not until ALL inodes/buffers/etc are cleared out for the block device by the VFS. Otherwise, it is possible to re-enable rw on the block device, and there still be dirty buffers that are being flushed out to disk, guaranteeing filesystem inconsistency. It allows multiple devices to be marked ro at the same time without problem. It also works on all block devices instead of just IDE. For this application, we *want* to expose VFS races, errors in handling EIO, errors in handling lost writes, etc. It's another form of for-developers fault injection, not a thing-for-production. I understand that part, and I agree our patch doesn't include the random part of the set-readonly code you have. Both of the patches do have the drawback that they don't return errors like set_disk_ro(), so that isn't exercising the error-handling code in ext3/jbd as much as it could. When we first worked on this in 2.4 the ext3/jbd code was far to fragile to handle -EROFS under load without oopsing so we chose the silent failure mode to allow testing w/o crashing all the time. Maybe today it is better to just call set_dev_ro() and fix the (hopefully few) -EROFS problems. The reason I prefer doing it from the timer interrupt is to toss more randomness in there, avoid the possibility of getting synchronised with application or kernel activity in some fashion. Definitely this has its place too. We chose the opposite route and have fault-injection calls sprinkled throughout our code, so that we can create complex regression tests that need specific series of events to be hit. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html