Re: rfc: [patch] change attribute for ext3
On Sep 13, 2006 18:42 +0200, Alexandre Ratchov wrote: the change attribute is a simple counter that is reset to zero on inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). To start, I'm supportive of this concept, my comments are only to get the most efficient implementation. This appears to be very similar to the i_version field that is already in the inode (which is also modified only by ext3), so instead of increasing the size of the inode further we could use that field and just make it persistent on disk. The i_version field is incremented in mkdir/rmdir/create/rename. For Lustre it would also be desirable to modify the version field for regular files when they are modified (e.g. setattr, write), and it appears NFS v4 also wants the same (assumed from use of file_update_time()). The question is whether this should be handled internal to the filesystem (as ext3 does now) or if it should be part of the VFS. The patch also adds a new ``st_change_attribute'' field in the stat structure, and modifies the stat(2) syscall accordingly. Currently the change is only visible on i386 and x86_64 archs. Is this really necessary for knfsd? @@ -511,6 +511,7 @@ static struct inode *bm_get_inode(struct inode-i_blocks = 0; inode-i_atime = inode-i_mtime = inode-i_ctime = current_fs_time(inode-i_sb); + inode-i_change_attribute = 0; Initializing to zero is more dangerous than any non-zero number, since this is the most likely outcome of corruption... The current ext3 code initializes i_version to a random number, and we can use comparisons similar to jiffies as long as we don't expect 2^31 changes between comparisons. +++ fs/inode.c13 Sep 2006 18:15:43 - 1.1.1.3.2.1 @@ -1232,16 +1232,10 @@ void file_update_time(struct file *file) return; now = current_fs_time(inode-i_sb); - if (!timespec_equal(inode-i_mtime, now)) - sync_it = 1; inode-i_mtime = now; - - if (!timespec_equal(inode-i_ctime, now)) - sync_it = 1; inode-i_ctime = now; - - if (sync_it) - mark_inode_dirty_sync(inode); + inode-i_change_attribute++; + mark_inode_dirty_sync(inode); Ugh, this would definitely hurt performance, because ext3_dirty_inode() packs-for-disk the whole inode each time. I believe Stephen had patches at one time to do the inode packing at transaction commit time instead of when it is changed, so we only do the packing once. Having a generic mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty inode to the buffer) would also be useful for other things like doing the inode or group descriptor checksum only once per transaction... Index: include/linux/ext3_fs.h === RCS file: /home/ratchova/cvs/linux/include/linux/ext3_fs.h,v retrieving revision 1.1.1.3 retrieving revision 1.1.1.3.2.1 diff -u -p -r1.1.1.3 -r1.1.1.3.2.1 --- include/linux/ext3_fs.h 13 Sep 2006 17:45:20 - 1.1.1.3 +++ include/linux/ext3_fs.h 13 Sep 2006 18:15:43 - 1.1.1.3.2.1 @@ -286,7 +286,7 @@ struct ext3_inode { __u16 i_pad1; __le16 l_i_uid_high; /* these 2 fields*/ __le16 l_i_gid_high; /* were reserved2[0] */ - __u32 l_i_reserved2; + __le32 l_i_change_attribute; There was some other use of the reserved fields for ext4 64-bit-blocks support. One was for i_file_acl_hi (I think this is using the i_pad1 above), one was for i_blocks_hi (I believe this was proposed to use the i_frag and i_fsize bytes). Is this conflicting with anything else? There were a lot of proposals for these fields, and I don't recall which ones are still out there. @@ -280,6 +280,7 @@ typedef void (dio_iodone_t)(struct kiocb +#define ATTR_CHANGE_ATTRIBUTE 16384 Do you need a setattr interface for this, or is it sufficient to use the i_version field from the inode, and let the filesystem manage i_version updates as it is doing now? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rfc: [patch] change attribute for ext3
On Sep 14, 2006 15:21 +0200, Alexandre Ratchov wrote: IMHO, the natural place to do this stuff is the VFS, because it can be common to all file-systems supporting this feature. Currently it's the same with ctime, mtime and atime. These are in the VFS even if there are file-systems that don't support all of them. Well, that is only partly true. I see lots of places in ext3 that are setting i_ctime and i_mtime... Ugh, this would definitely hurt performance, because ext3_dirty_inode() packs-for-disk the whole inode each time. I believe Stephen had patches at one time to do the inode packing at transaction commit time instead of when it is changed, so we only do the packing once. Having a generic mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty inode to the buffer) would also be useful for other things like doing the inode or group descriptor checksum only once per transaction... afaik, for an open file, there is always a copy of the inode in memory, because there is a reference to it in the file structure. So, in principle, we shouldn't need to make dirty the inode. I don't know if this is feasable perhaps am i missing something here. The in-memory inode needs to be copied into the buffer so that it is part of the transaction being committed to disk, or updates are lost. This was a common bug with early ext3 - marking the inode dirty and then changing a field in the in-core inode - which would not be saved to disk. In other filesystems this is only a few-cycle race, but in ext3 the in-core inode is not written to disk unless the inode is again marked dirty. The potential benefit of making this a callback from the JBD layer is it avoids copying the inode for EVERY dirty, and only doing it once per transaction. Add a list of callbacks hooked onto the transaction to be called before it is committed, and the callback data is the inode pointer which does a single ext3_do_update_inode() call if the inode is still marked dirty. it's not strictly necessary; it's not more necessary that the interface to ctime or other attributes. It's here for completness, in my opinion the change attribute is the same as the ctime time-stamp Then makes sense to just improve the ctime mechanism instead of adding new code and interfaces... Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext2/3 create large filesystem takes too much time; solutions
On Sep 16, 2006 16:06 -0400, Theodore Tso wrote: On Sat, Sep 16, 2006 at 05:56:02PM +0300, Pavel Mironchik wrote: I agree with you and I would prefer to send something more serious on that list than those previous patches - I like your idea with counters. Btw I assume crc is more preferable than just control sum for block group descriptors Yes, when I said checksum I meant a cyclic redundancy checksum, and not an additive checksum... (and one of the things we can do is to build in the superblock UUID into the CRC, so that if the filesystem gets recreated we can distinguish an old inode from a new one). Just to avoid duplication of effort, I'm attaching the current work-in-progress patches for the uninitialized groups (kernel + e2fsprogs). They are really at the barely compile stage (if that), but at least people can look at them and start improving them instead of starting from scratch. The patches are based on work done by Anshu Goel [EMAIL PROTECTED], but have been reworked a fair amount since they were given to me (i.e. bugs added are mine). I've been sitting on them for too long and they should see the light of day instead of continuing to stagnate. I also just incorporated Ted's suggestion to include the filesystem UUID into the checksum. I previously had added in the group number, so that if the block is written out to the wrong location it wouldn't verify correctly. Things that need to be done: - the kernel block/inode allocation needs to be reworked: - initialize a whole block worth of inodes at one time instead of single inodes. - I don't think we need to zero out the unused inodes - the kernel should already be doing this if the inode block is unused - find a happy medium between using existing groups (inodes/blocks) and initializing new ones - we likely need to verify the checksum in more places in e2fsck before trusting the UNINIT flags I won't be able to work more on this for a while, so have at it :-). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. Index: linux-stage/fs/ext3/mballoc.c === --- linux-stage.orig/fs/ext3/mballoc.c 2006-09-17 00:59:44.0 -0600 +++ linux-stage/fs/ext3/mballoc.c 2006-09-17 00:59:44.0 -0600 @@ -198,7 +198,8 @@ void ext3_mb_release_blocks(struct super_block *, int); void ext3_mb_poll_new_transaction(struct super_block *, handle_t *); void ext3_mb_free_committed_blocks(struct super_block *); - +unsigned long free_blks_after_init(struct super_block *, + struct ext3_group_desc *, int); #if BITS_PER_LONG == 64 #define mb_correct_addr_and_bit(bit,addr) \ { \ @@ -527,7 +528,12 @@ unlock_buffer(bh[i]); continue; } - + if (desc-bg_flags EXT3_BG_BLOCK_UNINIT) { + init_block_bitmap(sb, bh[i], desc, first_group + i); + set_buffer_uptodate(bh[i]); + unlock_buffer(bh[i]); + continue; + } get_bh(bh[i]); bh[i]-b_end_io = end_buffer_read_sync; submit_bh(READ, bh[i]); @@ -1525,9 +1531,16 @@ mb_set_bits(bitmap_bh-b_data, ac.ac_b_ex.fe_start, ac.ac_b_ex.fe_len); spin_lock(sb_bgl_lock(sbi, ac.ac_b_ex.fe_group)); + if (gdp-bg_flags EXT3_BG_BLOCK_UNINIT) { + gdp-bg_flags = ~EXT3_BG_BLOCK_UNINIT; + gdp-bg_free_blocks_count = + cpu_to_le16(free_blks_after_init(sb, gdp, + ac.ac_b_ex.fe_group)); + } gdp-bg_free_blocks_count = cpu_to_le16(le16_to_cpu(gdp-bg_free_blocks_count) - ac.ac_b_ex.fe_len); + gdp-bg_checksum = ext3_group_desc_csum(es, gdp, ac.ac_b_ex.fe_group); spin_unlock(sb_bgl_lock(sbi, ac.ac_b_ex.fe_group)); percpu_counter_mod(sbi-s_freeblocks_counter, - ac.ac_b_ex.fe_len); @@ -2377,6 +2390,7 @@ spin_lock(sb_bgl_lock(sbi, block_group)); gdp-bg_free_blocks_count = cpu_to_le16(le16_to_cpu(gdp-bg_free_blocks_count) + count); + gdp-bg_checksum = ext3_group_desc_csum(es, gdp, block_group); spin_unlock(sb_bgl_lock(sbi, block_group)); percpu_counter_mod(sbi-s_freeblocks_counter, count); Index: linux-stage/fs/ext3/ialloc.c === --- linux-stage.orig/fs/ext3/ialloc.c 2006-09-17 00:59:44.0 -0600 +++ linux-stage/fs/ext3/ialloc.c2006-09-17 00:59:44.0 -0600 @@ -23,7 +23,6 @@ #include linux/buffer_head.h #include linux/random.h #include linux/bitops.h - #include asm/byteorder.h #include xattr.h @@ -59,8 +58,14 @@ desc
Re: [RFC] [PATCH] Documentation/filesystems/ext4.txt
On Oct 10, 2006 15:02 -0500, Dave Kleikamp wrote: + - It's still mke2fs -j /dev/hda1 I would suggest mke2fs -j -O dir_index -I 256 /dev/XXX to be more representative of what will be used in the future. +programs:http://e2fsprogs.sourceforge.net/ + http://ext2resize.sourceforge.net You should likely remove ext2resize from this list, it hasn't got any support for extent-mapped files. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Design alternatives for fragments/file tail support in ext4
On Oct 13, 2006 08:23 -0400, Theodore Tso wrote: On Fri, Oct 13, 2006 at 02:56:46PM +0400, Alex Tomas wrote: Theodore Tso (TT) writes: TT I suggest this be tunable by superblock field, and not by a /proc TT tunable. This is the sort of thing which might be different TT per-filesystem, and the algorithm will be most effective if the TT filesystem always use the same cluster size from the time when it was TT first created. I'd be happy to assign a superblock field for this TT purpose, and add the appropriate tune2fs support if we have general TT agreement on this point. that would be good. there is even a stride option to mke2fs? Yes, there is. And just as we have -E stride=stripe-size and -E resize=max-online-resize, we can also -E cluster-size=bytes parameter in mke2fs. It would also make sense to make this be something that can be defaulted in /etc/mke2fs.conf, since even for IDE or SATA disks it probably makes sense to make the cluster size be 16k or 32k or maybe even higher. We probably need to do some benchmarks to see whether or not this makes sense. I think what Alex meant is that the mke2fs -E stride= value should just be put into the superblock. This would allow tuning the mballoc code to match the RAID alignment, and would also make life easier for resizers so they can continue the RAID-stepping of bitmaps that mke2fs does without having to extrapolate the stride value from the bitmap locations. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ixt3
On Oct 17, 2006 10:36 +0200, Jure Pečar wrote: http://www.cs.wisc.edu/adsl/Publications/iron-sosp05.pdf I think I got this link from OLS papers or from fs workshop ... and I really like what it says. How many of ixt3 features are going to be implemented in ext4? None, yet. The journal checksum code has been brought into a basically ready-to-go state, and there is also work to add checksums to the group descriptors, so there is some chance that this will be added before ext4 is declared closed. Nothing done yet to checksum inodes, superblock, file index or extent blocks. None of the retry code has been looked at. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ixt3
On Oct 17, 2006 11:06 +0200, Jure Pečar wrote: On Tue, 17 Oct 2006 02:51:30 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: None of the retry code has been looked at. This is what I'm mostly interested in ... predictable, well behaving and tested error handling recovery. Is ext4 going to be any better here than ext3? Not unless someone massages the ixt3 code into a form usable in the kernel. I'd suggest you contact the authors if you are interested in doing this work. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH, E2FSPROGS] On-disk format for large number of subdirectors
On Oct 18, 2006 02:24 -0400, Theodore Ts'o wrote: On-disk format for large number of subdirectories - EXT4_FEATURE_RO_COMPAT_DIR_NLINK (0x0020?) - allow directories to have 65000 subdirectories (i_nlinks) by setting i_nlinks = 1 for such directories. RO_COMPAT protects old filesystems from unlinking such directories incorrectly and losing all files therein. Signed-off-by: Theodore Ts'o [EMAIL PROTECTED] Looks good. Will try to get a new patch out with this support. Index: e2fsprogs/lib/ext2fs/ext2_fs.h === --- e2fsprogs.orig/lib/ext2fs/ext2_fs.h 2006-10-18 01:42:48.0 -0400 +++ e2fsprogs/lib/ext2fs/ext2_fs.h2006-10-18 01:49:51.0 -0400 @@ -71,7 +71,7 @@ /* * Maximal count of links to a file */ -#define EXT2_LINK_MAX32000 +#define EXT2_LINK_MAX65000 /* * Macro-instructions used to manage several block sizes @@ -606,6 +606,7 @@ /* #define EXT2_FEATURE_RO_COMPAT_BTREE_DIR 0x0004 not used */ #define EXT4_FEATURE_RO_COMPAT_HUGE_FILE 0x0008 #define EXT4_FEATURE_RO_COMPAT_GDT_CSUM 0x0010 +#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020 #define EXT2_FEATURE_INCOMPAT_COMPRESSION0x0001 #define EXT2_FEATURE_INCOMPAT_FILETYPE 0x0002 Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption
On Oct 18, 2006 16:56 -0500, Eric Sandeen wrote: Andreas Dilger wrote: The directory leaf data is kept in the page cache and there is a helper function ext2_check_page() to mark the page checked. That means the page only needs to be checked once after being read from disk, instead of each time through readdir. ah, sure. Hm... well, this might be a bit of a performance hit if it's checking cached data... let me think on that. Well, having something like ext3_dir_bread() that verifies the leaf block once if (!uptodate()) would be almost the same as ext2 with fairly little effort. It would help performance in several places, at the slight risk of not handling in-memory corruption after the block is read... I'm not sure whether this is a win or not. It means that if there is ever a directory with a bad leaf block any entries beyond that block are not accessible anymore. I'm amazed at how hard ext3 works to cope with bad blocks ;-) It would fail all of your tests otherwise, right? That is one virtue of ext2 having grown up in the days when bad blocks existed. Those days are (sadly) coming back again, hence desire for fs-level checksums, etc. The existing !bh case already marks the filesystem in error. Maybe as a special case we can check in if (!bh) if i_size and i_blocks make sense. Something like: if (!bh) { : : + if (filp-f_pos inode-i_blocks 9) { + break; filp-f_pos += sb-s_blocksize - offset; continue; } This obviously won't help if the whole inode is bogus, but then nothing will catch all errors. Yep, I'd thought maybe a size vs. blocks test might make sense; I think there can never legitimately be a sparse directory? Not currently, though there was some desire to allow this during htree development, to allow shrinking large-but-empty directories. Since this already provokes an ext3_error() (which might be a panic()) to hit a hole we can assume that this needs to be carefully implmemented. I guess if the intent is to soldier on in the face of adversity, it doesn't matter if it's an umappable offset or an IO error; ext3 wants to go ahead try the next one block anyway. So the size test probably makes sense as a stopping point. Well, it would also be possible to look into inode-i_blocks to see what blocks exist past this offset, but that is complicated by the introduction Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH, E2FSPROGS] On-disk format for inode extra size control inode size
On Oct 18, 2006 18:16 -0400, Theodore Tso wrote: On Wed, Oct 18, 2006 at 01:34:38PM -0600, Andreas Dilger wrote: There was some discussion about moving the i_ctime_extra field into the small inode, for use as a version field by NFSV4. The proposed field was l_i_reserved2 in the original Bull patch. The potential problem with this is what if program makes a huge number of changes to inodes, such that we end up having to bump the ctime field into the future? Well, if you are changing the inode billions of times/second for any extended amount of time, then yes that might be possible. Normally setting it to the current ctime and incrementing the nsec timestamp on the rare cases two threads are updating it at once is sufficient. Note that in this version is only critical to compare for a single inode, and not filesystem-wide. But I have another question --- why does this the change attribute counter have to be a persistent value that takes up space in the inode at all? This is strictly a NFSv4 only hack, and NFSv4 simply requires a 64-bit increasing counter when any inode attributes have changed, so it can determine whether client-side caches are refreshed. Well, Lustre has a need for this also, in order to know if a particular modification to an inode has been done or not. Using the nsec ctime is a way of saving space in the inode, while at the same time allowing two operations to be ordered relative to each other. This needs to be persistent across a server reboot for both NFSv4 and Lustre. Since i_ctime_extra needs to be stored on disk anyways, I don't see why there would be an objection to having it serve a dual purpose. For Lustre we don't actually care whether it is in the core inode, since we always format filesystems with large inodes anyways, but in consideration of NFSv4 exporting existing ext3 filesystems I suggested i_ctime_extra go into the core inode. So let the high 32-bits of the attribute value be the time in seconds that the system was booted (or the filesystem was mounted; I don't care), and let the low 32-bit values be a value which is stored in the in-core inode which is set to a global counter value which is incremented whenever inode is changed. If the in-core inode gets dropped and then later reloaded, it will get a newer global counter value, which may forced some clients to reload their cache when they may not have needed to --- and the same would happen if the server gets rebooted, since the high 32-bits would get bumped to the new mount time. This would work, I suppose. It doesn't allow this value to be exported to userspace without additional API changes as the i_ctime_extra one can. The initial Bull patch consumed an extra field in struct stat, stat64, struct inode, and struct iattr in order to allow userspace access also. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption
On Oct 18, 2006 19:26 -0500, Eric Sandeen wrote: Well, it would also be possible to look into inode-i_blocks to see what blocks exist past this offset, but that is complicated by the introduction eom introduction of ...? :) Sorry - introduction of extents. So we can't just look into the i_blocks {d,t,}indirect blocks to work out the maximum reasonable size for an inode without adding decoding of extents into this code. Maybe if SEEK_DATA is added to ext3 (patch was proposed this past week) then we could seek past the hole efficiently. For now I'm happy to assume i_blocks * 512 is a safe upper limit on the file size. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption
On Oct 19, 2006 17:43 -0500, Eric Sandeen wrote: Eric Sandeen wrote: How about just tweaking the existing ext3_bread so that it lets the caller know whether or not it found an uptodate buffer? Seems conceivable that more than just the dir code might want to do a data sanity check, based on if this is a fresh read or not. Could maybe even change the *err argument to a *retval; negative on errors, else 0 == not read (found uptodate), 1 == fresh read (not found uptodate). Or is that too much overloading... I played around with this a little bit today, and it seems to have some tangible results. A fairly unsophisticated test of running find over my whole root filesystem 10 times :) with and without re-checking cached directory entries, yielded about a 10% speedup when skipping the re-checks. Is this something we want to do? Are we comfortable with only checking directory entries the first time they are read from disk? Well, we already do this on ext2 without noticable problems. As you say, if we are getting memory corruption we are in for a world of hurt in other areas. The only case that might be worth checking inside the loop is if rec_len == 0, so that we don't spin on a bad entry forever. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Oct 23, 2006 18:31 +0400, Alex Tomas wrote: isn't that a kernel responsbility to find/allocate target blocks? wouldn't it better to specify desirable target group and minimal acceptable chunk of free blocks? In some cases this is useful (e.g. if file has small fragments after being written in small pieces or in a fragmented free space). In other cases the user tool HAS to be able to specify the new mapping in order to make progress. Consider if there are two very large fragmented files and user-space defrag tool wants to make contiguous free space. If kernel is left to do allocation it will always consume the largest chunk of free space first, even if it is not yet optimal (e.g. large 1MB aligned extent). I would make this interface optionally allow the target extent to be specified, but if target block == 0 then the kernel is free to do its own allocation. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Oct 23, 2006 10:16 -0400, Theodore Tso wrote: As a suggestion, I would pass the inode number and inode generation number into the ext3_file_mode_data array: struct ext3_file_move_data { int extents; struct ext3_reloc_extent __user *ext_array; }; This will be much more efficient for the userspace relocator, since it won't need to translate from an inode number to a pathname, and then try to open the file before relocating it. I'd also use an explicit 64-bit block numbers type so that we don't have to worry about the ABI changing when we support 64-bit block numbers. I would in fact go so far as to allow only a single extent to be specified per call. This is to avoid the passing of any pointers as part of the interface (hello ioctl police :-), and also makes the kernel code simpler. I don't think the syscall/ioctl overhead is significant compared to the journal and IO overhead. Also, I would specify both the source extent and the target extent in the inode. This first allows defragmenting only part of the file instead of (it appears) requiring the whole file to be relocated. That would be a killer if the file being defragmented is larger than free space. It secondly provides a level of insurance that what the kernel is relocating matches what userspace thinks it is doing. It would protect against problems if the kernel ever does block relocation itself (e.g. merge fragments into a single extent on (re)write, or for snapshot/COW). The other problem I see with this patch is that there will be cache coherency problems between the buffer cache and the page cache. I think you will want to pull the data blocks of the file into the page cache, and then write them out from the page cache, and only *then* update the indirect blocks and commit the transaction. Alternately (maybe even better) is to treat it as O_DIRECT and ensure the page cache is flushed. This also avoids polluting the whole page cache while running a defragmenter on the filesystem. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3: bogus i_mode errors with 2.6.18.1
On Oct 23, 2006 16:45 +0200, Andre Noll wrote: stress tests on a 6.3T ext3 filesystem which runs on top of software raid 6 revealed the following: [663594.224641] init_special_inode: bogus i_mode (4412) [663596.355652] init_special_inode: bogus i_mode (5123) [663596.355832] init_special_inode: bogus i_mode (71562) This would appear to be inode table corruption. [663763.319514] EXT3-fs error (device md0): ext3_new_block: Allocating block in system zone - blocks from 517570560, length 1 This is bitmap corruption. [663763.339423] EXT3-fs error (device md0): ext3_free_blocks: Freeing blocks in system zones - Block = 517570560, count = 1 Hmm, this would appear to be a buglet in error handling. If the block just allocated above is in the system zone it should be marked in-use in the bitmap but otherwise ignored. We definitely should NOT be freeing it on error. Yikes! It seems a patch I submitted to 2.4 that fixed the behaviour of ext3_new_block() so that if we detect this block shouldn't be allocated it is skipped instead of corrupting the filesystem if it is running with errors=continue... It looks like ext3_free_blocks() needs a similar fix - i.e. report an error and don't actually free those blocks. This system is currently not in production use, so I can use it for tests ATM. I would suggest that you give this a try on RAID5, just for starters. I'm not aware of any problems in the existing ext3 code for anything below 8TB. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3: bogus i_mode errors with 2.6.18.1
On Oct 24, 2006 11:14 +0200, Andre Noll wrote: On 14:02, Andreas Dilger wrote: Something like the this? (only compile tested). And no, I do _not_ know, what I'm doing ;) Don't worry, everyone starts out not knowing what they are doing. The ext3_free_blocks() part looks OK from a cursory review. @@ -1372,12 +1370,24 @@ allocated: in_range(ret_block, le32_to_cpu(gdp-bg_inode_table), EXT3_SB(sb)-s_itb_per_group) || in_range(ret_block + num - 1, le32_to_cpu(gdp-bg_inode_table), + EXT3_SB(sb)-s_itb_per_group)) { + int j; + ext3_error(sb, __FUNCTION__, Allocating block in system zone - blocks from E3FSBLK, length %lu, ret_block, num); - + /* Note: This will potentially use up one of the handle's + * buffer credits. Normally we have way too many credits, + * so that is OK. In _very_ rare cases it might not be OK. + * We will trigger an assertion if we run out of credits, + * and we will have to do a full fsck of the filesystem - + * better than randomly corrupting filesystem metadata. + */ + j = find_next_usable_block(-1, gdp, EXT3_BLOCKS_PER_GROUP(sb)); I'm not sure why the find_next_usable_block() part is in here? At this point we KNOW that ret_block is not a block we should be allocating, yet it is marked free in the bitmap. So we should just mark the block(s) in-use in the bitmap and look for a different block(s). + if (j = 0) + ext3_set_bit(j, gdp_bh-b_data); Note that we need to loop over num blocks and set any bits that should be set, as is done in the ext3_free_blocks() code. This function has changed since 2.4 in that it can now potentially allocate multiple blocks. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Oct 25, 2006 16:54 +0200, Jan Kara wrote: I've just not yet decided how to handle indirect blocks in case of relocation in the middle of the file. Should they be relocated or shouldn't they? Probably they should be relocated at least in case they are fully contained in relocated interval or maybe better said when all the blocks they reference to are also in the interval (this handles also the case of EOF). But still if you would like to relocate the file by parts this is not quite what you want (you won't be able to relocate indirect blocks in the boundary of intervals) :(. I suspect that the natural choice for metadata blocks is to keep the block which has the most metadata unchanged. For example, if you are doing a full-file relocation then you would naturally keep all of the new {dt}indirect blocks. If you are relocating a small chunk of the file you would keep the old {dt}indirect blocks and just copy a few block pointers over. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3: bogus i_mode errors with 2.6.18.1
On Oct 25, 2006 11:44 +0200, Andre Noll wrote: Are you saying that ext3_set_bit() should simply be called with ret_block as its first argument? If yes, that is what the revised patch below does. You might need to call ext3_set_bit_atomic() (as claim_block() does, not sure. @@ -1372,12 +1370,21 @@ allocated: in_range(ret_block, le32_to_cpu(gdp-bg_inode_table), EXT3_SB(sb)-s_itb_per_group) || in_range(ret_block + num - 1, le32_to_cpu(gdp-bg_inode_table), + EXT3_SB(sb)-s_itb_per_group)) { + ext3_error(sb, __FUNCTION__, Allocating block in system zone - blocks from E3FSBLK, length %lu, ret_block, num); + /* Note: This will potentially use up one of the handle's + * buffer credits. Normally we have way too many credits, + * so that is OK. In _very_ rare cases it might not be OK. + * We will trigger an assertion if we run out of credits, + * and we will have to do a full fsck of the filesystem - + * better than randomly corrupting filesystem metadata. + */ + ext3_set_bit(ret_block, gdp_bh-b_data); + goto repeat; + } The other issue is that you need to potentially set num bits in the bitmap here, if those all overlap metadata. In fact, it might just make more sense at this stage to walk all of the bits in the bitmaps, the inode table and the backup superblock and group descriptor to see if they need fixing also. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3: bogus i_mode errors with 2.6.18.1
On Oct 27, 2006 17:34 +0200, Andre Noll wrote: On 12:01, Andreas Dilger wrote: Well, since we know at least one bit needs fixing and results in the block being written to disk then setting the bits for all of the other metadata blocks in this group has no extra IO cost (only a tiny amount of CPU). Re-setting the bits if they are already set is not harmful. I.e, something like int i; ext3_fsblk_t bit; unsigned long gdblocks = EXT3_SB(sb)-s_gdb_count; for (i = 0, bit = 1; i gdblocks; i++, bit++) ext3_set_bit(bit, gdp_bh-b_data); Is that correct? Well, it needs to also handle backup superblock, bitmaps, inode table: if (ext3_bg_has_super()) ext3_set_bit(0, gdp_bh-b_data); gdblocks = ext3_bg_num_gdb(sb, group); for (i = 0, bit = 1; i gdblocks; i++, bit++) /* actually a bit more complex for INCOMPAT_META_BG fs */ ext3_set_bit(i, gdp_bh-b_data); ext3_set_bit(gdp-bg_inode_bitmap % EXT3_BLOCKS_PER_GROUP(sb), ...); ext3_set_bit(gdp-bg_block_bitmap % EXT3_BLOCKS_PER_GROUP(sb), ...); for (i = 0, bit = gdp-bg_inode_table % EXT3_BLOCKS_PER_GROUP(sb); i EXT3_SB(sb)-s_itb_per_group; i++, bit++) ext3_set_bit(i, gdp_bh-b_data); (or something close to this). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Shred mount option for ext4?
On Oct 31, 2006 15:14 -0500, Nikolai Joukov wrote: 1. One of the patches performs N overwrites with configurable patterns (can comply with NIST and NISPOM standards). Because of the transaction compaction we had to separately add overwriting as separate transactions. Fortunately, the whole procedure is still atomic due to the orphan list. The problem that we have right now is per-file syncing of dirty data buffers between overwrites. We sync the whole device at the moment. Did anyone discuss doing this with crypto instead of actually overwriting the whole file? It would be pretty easy to store a per-file crypto key in each inode as an EA, then to delete the file all that would be needed would be to erase the key in a secure matter (which is a great deal easier because inodes don't move around on disk). The drawback is there is a runtime overhead to encrypt/decrypt the file data, but honestly, if people care about secure deletion don't they also care about security of the undeleted data also? By having an (unknown to the user) per-file crypto key then if the file is deleted the user can also plausibly deny the ability to recover the file data even if they are forced to surrender their key. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext2 readdir/lookup/check_page behavior
On Nov 14, 2006 13:38 -0600, Eric Sandeen wrote: Andreas Dilger wrote: It would make sense to fix ext2 in the same way. I'd suggest bailing out early == min(i_size blocksize, i_blocks). The i_blocks count is an upper limit, because it includes the overhead of indirect blocks. Directories cannot be sparse. so we could either a) keep processing pages based on i_size, until we have passed i_blocks, or b) if i_size i_blocks don't match, immediately bail out because we know we have found a corrupted inode (vs. a normal unreadable block...) Do we already ext3_error() in this case? That allows the admin to determine the behaviour already. If it is errors=continue or errors=remount-ro then we should continue I think. We might consider the inode fatally corrupted if (i_blocks 9 i_size || i_blocks i_size (blockbits - 8) + /* blocks */ i_size (blockbits * 2 - 8 - 2) + /* indirect */ i_size (blockbits * 3 - 8 - 2) + /* dindirect */ i_size (blockbits * 4 - 8 - 2)) /* tindirect */ I think... Trying to account for indirect blocks. It is already given a 100% margin (-8 instead of -9) to cover rounding, EA blocks, some small bugs in block counting, extents format, etc. FYI, the -2 is 4 bytes/addr. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 0/4] BIG_BG: support of large block groups
On Nov 30, 2006 14:41 -0500, Theodore Tso wrote: * We ignore the problem, and accept that there are some kinds of filesystem corruptions which e2fsck will not be able to fix --- or at least not without adding complexity which would allow it to relocate data blocks in order to make a contiguous range of blocks to be used for the allocation bitmaps. The last alternative sounds horrible, but if we assume that some other layer (i.e., the hard drive's bad block replacement pool) provides us the illusion of a flawless storage media, and CRC to protect metadata will prevent us from relying on an corrupted bitmap block, maybe it is acceptable that e2fsck may not be able to fix certain types of filesystem corruption. I'd agree that even with media errors, the bad-block replacement pool is almost certainly available to handle this case. Even if there are media errors on the read of the bitmap, they will generally go away if the bitmap is rewritten (because of relocation). At worst, we would no longer allow new blocks/inodes to be allocated that are tracked by that block, and if we are past 256TB then the sacrifice of 128MB of space is not fatal. It wouldn't even have to impact any files that are already allocated in that space. without any of these protections, I'd want to keep the block group size under 32k so we can avoid dealing with these issues for as long as possible. Even if we assume laptop drives will double in size every 12 months, we still have a good 10+ years before we're in danger of seeing a 512TB laptop drives. :-) Agreed, I think there isn't any reason to increase the group size unless it is really needed, or it is specified with mke2fs -g {blocks} or the number of inodes requires it. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [patch 2/3] change attribute for ext4: ext4 specific code
On Dec 14, 2006 11:03 -0500, Theodore Tso wrote: There was discussion on yesterday's call about whether or not 32-bit was enough for NFSv4, or whether it also requried 64-bits of change notification in the RFC's. So one of the questions is whether this is something that would justify requiring 64-bits --- and if so, maybe we need to require that big inodes be used and store the entire 64-bit value beyond 128 bytes. This would mean that NFSv4 cache management couldn't be fully implemented without big inodes, or we'd have to make do by using the inode ctime as a partial substitute. Per Trond and Bruce Field's reply to my email it seems that NFSv4 only needs the version to compare for inequality. If the change numbers are sequential for a given inode it can OPTIONALLY extract additional information about the server (i.e. it still has an up-to-date cache because it was the only one that did an update on a given file). So, I think for basic NFSv4 setups that 2^32 is sufficient (per Bull's original patch) but 2^64 is desirable to avoid collisions and allow the sequential updates logic to work properly for long-lived files. So, I think a 32-bit field in the small inode, and an additional 32-bit field in the large inode would be perfect. It allows this functionality to work with existing ext3 filesystems, if not quite optimally. In addition, for Lustre, could we get a 64-bit field in the superblock which contains the fs-wide version number. I'm proposing that, per the original Bull patch, l_i_reserved1 be changed to be i_version for linux, and we add i_version_hi after cr_time_extra in the large inode. The disk i_version would be stored in the vfs_inode i_version (which is already used for this same purpose). It would be good for NFSv4 if the i_version field could be expanded to 64 bits to avoid the need for it to have fs-specific operations, but failing that we can put the high word into ext4_inode_info and NFS can access it via export_operations I think. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 0/3] ext4 online defrag (ver 0.2)
On Jan 19, 2007 14:19 +0900, Takashi Sato wrote: On Jan 16, 2007 21:03 +0900, [EMAIL PROTECTED] wrote: 1. Add new ioctl(EXT4_IOC_DEFRAG) which returns the first physical block number of the specified file. With this ioctl, a command gets the specified directory's. Maybe I don't understand, but how is this different from the long-time FIBMAP ioctl? I can use FIBMAP instead of my new ioctl. You are right. I should have used FIBMAP ioctl... I have to get the physical block number of the specified directory. But FIBMAP is available only for a regular file, not for a directory. So I will use my new ioctl. Though it might make sense to implement FIBMAP for a directory, to keep it consistent and allow user-space tools like filefrag to work on directories also. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [patch 0/3] i_version update for ext4
On Jan 23, 2007 18:23 +0100, Cordenner jean noel wrote: I've updated what was previously the change attribute patch for ext4 initially posted by Alexandre Ratchov. The previous patch was introducing a change_attribute field, now it uses the i_version field of the inode. The i_version field is a counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). The aim is to fulfill NFSv4 requirements for rfc3530. For the moent, the counter is only a 32bit value but it is planned to be 64bit as required. The patch is divided into 3 parts, the vfs layer, the ext4 specific code and an user part to check i_version changes via stat. Have you had a chance to look at the performance impact of this change (possible with oprofile)? Always marking the inodes dirty for ext3 may have some noticable overhead. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [patch 3/3] i_version update for ext4: user interface
On Jan 23, 2007 18:24 +0100, Cordenner jean noel wrote: This patch adds a ``st_i_version'' field in the stat structure, and modifies the stat(2) syscall accordingly. Currently the change is only visible on i386 and x86_64 archs. What is the need for exporting i_version to userspace? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] e2fsprogs: user selectable dup block handling in fsck
On Jan 31, 2007 08:22 -0800, Jim Garlick wrote: It also adds a check to make sure only one -E option is passed on the command line as -E option parsing is not cumulative. @@ -633,6 +639,8 @@ static errcode_t PRS(int argc, char *arg case 'E': + if (extended_opts) + fatal_error(ctx, _(-E must only be specified once)); extended_opts = optarg; In such cases I've usually just changed the code to do the parsing as the option is passed. Otherwise, it isn't possible to override previously-specified options. This sometimes is needed if you have an alias or script that is passing a bunch of options, and in some rare cases you don't want the default, e.g. alias mye2fsck=e2fsck -f -p -E clone=dup # mye2fsck -y -E clone=zero /dev/really-broken Ted, is there a reason that the call to parse_extended_opts() can't just be moved in place of saving the options in extended_opts? I can't see anything in -E (yet) that depends on other options that might not be set yet. Also, it looks like that function leaks the duplicated string in buf, since that variable goes out of scope without freeing the allocation. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [BUG -mm] ext3_orphan_add() accessing corrupted list on a corrupted ext3fs
to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Resubmit][PATCH 5/5] Secure Deletion and Trash-Bin Support for Ext4
On Feb 01, 2007 12:41 -0800, Mingming Cao wrote: On Thu, 2007-02-01 at 19:32 +, Dave Kleikamp wrote: Unfortunately, we need some file system-specific code to access per-file secure deletion and per-file trash bit attributes. These attributes are supported only by some file systems and in different ways. The check for fs specific attributes has to be underlying fs code. But the code the handling the secure delete and trash bin (although now is only two functions being called) are identical for all fs, could be move to VFS layer. Yeah, I did see that. I wonder adding some inode or file operation just to query the existence of those attributes (or something more generic) would be too ugly. I gave a brief thought on that yesterday, it was not very pretty:) Actually, the major filesystems (ext3, reiserfs, jfs, xfs) all use the same lsattr/chattr ioctl as ext2 (EXT2_IOC_GETFLAGS). Maybe this code can just do an ioctl inside the kernel? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [PATCH 1/1] nanosecond timestamps
On Feb 05, 2007 23:09 -0500, Theodore Tso wrote: On Fri, Feb 02, 2007 at 08:09:40PM +0530, Kalpak Shah wrote: This patch is a spinoff of the old nanosecond patches. It includes some cleanups and addition of a creation timestamp. The EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with s_{min, want}_extra_isize fields in struct ext3_super_block. Thanks for sending it. I haven't had a chance to go over it in detail yet, but one quick comment; the patch looks like it got line-wrapped by your mail agent (looks like you're using Evolution 2.0). Could you send it as a text/plain attachment, or otherwise fix your mailer to not wrap your patches? It might be nice if the patch had a way of adjusting the granularity by masking off bits of the nanoseconds field, so we can benchmark how much overhead constantly updating the ctime field is going to cost us. Can anyone suggest a benchmark which will test this area? Bull had done some testing with the inode version patch (it also forces an update for every change to an inode) and reported no noticable performance loss. That could have been because of CPU headroom available to do repeat copies of the in-core inode to the on-disk inode, which may hurt in a more CPU constrained environment (other server code, multiple filesystems, etc). Before we go to changing the patch, we may as well start by just testing before vs. after patch (still using large inodes, for consistency). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [PATCH 1/1] Nanosecond timestamps
On Feb 06, 2007 16:12 +0100, Johann Lombardi wrote: + if (sbi-s_inode_size EXT3_GOOD_OLD_INODE_SIZE) { + EXT3_SB(sb)-s_want_extra_isize = sizeof(struct ext3_inode) - EXT3_GOOD_OLD_INODE_SIZE; Maybe EXT3_SB(sb)- could be replaced by sbi- here and in the lines below. Yes, this should definitely be done. It also increases clarity between sbi-s_want_extra_isize and es-s_want_extra_isize. + if (EXT3_SB(sb)-s_want_extra_isize + le32_to_cpu(es-s_min_extra_isize)) ^^ + EXT3_SB(sb)-s_want_extra_isize = + le32_to_cpu(es-s_min_extra_isize); ^^ Since es-s_{min,want}_extra_isize are both __u16 (BTW, shouldn't it be __le16?), I think you should use le16_to_cpu() instead of le32_to_cpu(). You are right - this works fine on little endian systems, but fails on big endian systems where you will get the other half of the word. This has been a bug in several places already, and I wonder if the le*_to_cpu() and cpu_to_le*() macros shouldn't do some type checking instead of just casting the variable to the specified type? The only problem is if casting constants it would be a bit of a pain to have to cast them explicitly, though we could have something like: #define le16_to_cpu(var) (__builtin_constant(var) || !typecheck(__u16, var) ? \ __constant_cpu_to_le16(var) : __le16_to_cpu(var)) The only question is whether typecheck adds extra variables on the stack or if the compiler will always optimize them away. + /* Check if enough inode space is available */ + if (EXT3_GOOD_OLD_INODE_SIZE + EXT3_SB(sb)-s_want_extra_isize + sbi-s_inode_size) { + EXT3_SB(sb)-s_want_extra_isize = sizeof(struct ext3_inode) - EXT3_GOOD_OLD_INODE_SIZE; + printk(KERN_INFO EXT3-fs: required extra inode space not + available.\n); + } If the inode size is EXT3_GOOD_OLD_INODE_SIZE, sbi-s_want_extra_isize won't be initialized. However, it should not be an issue because the ext3_sb_info is set to zero in ext3_fill_super(). So I'm not sure I understand if you have an objection or if this is just a comment. sbi-s_want_extra_isize will be zero and it is not possible for sbi-s_inode_size EXT3_GOOD_OLD_INODE_SIZE so this case won't be hit. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Testing ext4 persistent preallocation patches for 64 bit features
On Feb 07, 2007 16:06 +0530, Suparna Bhattacharya wrote: On Wed, Feb 07, 2007 at 12:25:50AM -0800, Mingming Cao wrote: - disable preallocation if the filesystem free blocks is under some low watermarks, to save space for near future real block allocation? A policy decision like this is probably worth a discussion during today's call. - is de-preallocation something worth doing? As discussed in the call - I don't think we can remove preallocations. The whole point of database preallocation is to guarantee that this space is available in the filesystem when writing into a file at random offsets (which would otherwise be sparse). Similarly, persistent preallocation shouldn't be considered differently than an efficient way of doing zero filling of blocks. At least that is my understanding... Is this code implementing the uninitialized extents for databases (via explicit preallocation via fallocate/ioctl) so that they don't have to zero-fill large files, or is there also automatic preallocation of space to files (e.g. for O_APPEND files)? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] e2fsck journal recovery can corrupt all superblock backups
Ted, this was sent with the first patch, and it looks like it is a very serious problem. Looking through the e2fsck code it would also seem possible to move the setting of EXT2_FLAG_MASTER_SB_ONLY before the journal replay. That is the second patch. I'm not sure which one is better. Jim Garlick wrote: When fsck replays the journal, it clears the EXT3_FEATURE_INCOMPAT_RECOVER feature, dirties the superblock, and closes the file system. Unfortunately, the file system EXT2_FLAG_MASTER_SB_ONLY flag is not set at this time, so it copies the primary superblock and group descriptors over all the backups. Then fsck restarts and checks the superblock for consistancy. If the superblock or group descriptors are then found to be bad, all your backups are now also bad. Index: e2fsprogs+chaos/lib/ext2fs/openfs.c === --- e2fsprogs+chaos.orig/lib/ext2fs/openfs.c +++ e2fsprogs+chaos/lib/ext2fs/openfs.c @@ -101,6 +101,8 @@ errcode_t ext2fs_open2(const char *name, memset(fs, 0, sizeof(struct struct_ext2_filsys)); fs-magic = EXT2_ET_MAGIC_EXT2FS_FILSYS; fs-flags = flags; + /* don't overwrite sb backups unless flag is explicitly cleared */ + fs-flags |= EXT2_FLAG_MASTER_SB_ONLY; fs-umask = 022; retval = ext2fs_get_mem(strlen(name)+1, fs-device_name); if (retval) --- Index: e2fsprogs/e2fsck/unix.c === --- e2fsprogs.orig/e2fsck/unix.c2006-12-27 17:12:23.0 -0700 +++ e2fsprogs/e2fsck/unix.c 2007-02-08 22:05:13.0 -0700 @@ -1153,6 +1153,15 @@ restart: } /* +* We only update the master superblock because (a) paranoia; +* we don't want to corrupt the backup superblocks, and (b) we +* don't need to update the mount count and last checked +* fields in the backup superblock (the kernel doesn't +* update the backup superblocks anyway). +*/ + fs-flags |= EXT2_FLAG_MASTER_SB_ONLY; + + /* * Check to see if we need to do ext3-style recovery. If so, * do it, and then restart the fsck. */ @@ -1227,15 +1236,6 @@ restart: !(ctx-options E2F_OPT_READONLY)) ext2fs_mark_super_dirty(fs); - /* -* We only update the master superblock because (a) paranoia; -* we don't want to corrupt the backup superblocks, and (b) we -* don't need to update the mount count and last checked -* fields in the backup superblock (the kernel doesn't -* update the backup superblocks anyway). -*/ - fs-flags |= EXT2_FLAG_MASTER_SB_ONLY; - ehandler_init(fs-io); if (ctx-superblock) Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e2fsprogs coverity patch cid-4.diff
On Feb 09, 2007 18:11 -0800, Brian D. Behlendorf wrote: This check is unnecessary since fs_type is guaranteed to be set earlier in the function. I'm not positive this is the right fix. If we are creating a journal device, we shouldn't be using the normal defaults based on the size of the journal device because this will often be much smaller than the actual filesystem. I'd argue that we set fs_type=journal earlier when fs_type is being first set, if INCOMPAT_JOURNAL_DEV is set. Index: e2fsprogs+chaos/misc/mke2fs.c === --- e2fsprogs+chaos.orig/misc/mke2fs.c +++ e2fsprogs+chaos/misc/mke2fs.c @@ -1310,8 +1310,6 @@ static void PRS(int argc, char *argv[]) if (fs_param.s_feature_incompat EXT3_FEATURE_INCOMPAT_JOURNAL_DEV) { - if (!fs_type) - fs_type = journal; reserved_ratio = 0; fs_param.s_feature_incompat = EXT3_FEATURE_INCOMPAT_JOURNAL_DEV; fs_param.s_feature_compat = 0; Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: booked-page-flag.patch
On Feb 16, 2007 00:06 -0800, Andrew Morton wrote: On Fri, 16 Feb 2007 10:56:55 +0300 Alex Tomas [EMAIL PROTECTED] wrote: well, even so we need to reserve not that block ony, but also needed metadata (for the worst case). Right. Reserve all the blocks for the page and all the metadata for a page at that file offset. Worst case. This only really becomes an issue when the filesystem (+reservations) is close to being full. If we are close enough to worry about running out of space we can also refuse to do delayed allocation. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: booked-page-flag.patch
On Feb 15, 2007 09:18 -0800, Eric Sandeen wrote: This was for the delayed allocation patchset, right; and by managing this at the page level that means we can't do this for block size page size, I think... We could always disable delayed allocation in the PAGE_SIZE != blocksize case. It would be no worse than what we have now, and for IO performance you need larger blocksize anyways. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: data=journal busted
On Feb 16, 2007 15:42 -0800, Andrew Morton wrote: On Fri, 16 Feb 2007 16:31:09 -0700 Andreas Dilger [EMAIL PROTECTED] wrote: We have a patch that we use for Lustre testing which allows you to set a block device readonly (silently discarding all writes), without the filesystem immediately keeling over dead like set_disk_ro. The readonly state persists until the the last reference on the block device is dropped, so there are no races w.r.t. VFS cleanup of inodes and flushing buffers after the filesystem is unmounted. Not sure I understand all that. Actually, the patch was originally based on your just-posted readonly code. The problem with that code (unlike ours) is that it doesn't do the explicit ro clearing in ext3_put_super(), but rather not until ALL inodes/buffers/etc are cleared out for the block device by the VFS. Otherwise, it is possible to re-enable rw on the block device, and there still be dirty buffers that are being flushed out to disk, guaranteeing filesystem inconsistency. It allows multiple devices to be marked ro at the same time without problem. It also works on all block devices instead of just IDE. For this application, we *want* to expose VFS races, errors in handling EIO, errors in handling lost writes, etc. It's another form of for-developers fault injection, not a thing-for-production. I understand that part, and I agree our patch doesn't include the random part of the set-readonly code you have. Both of the patches do have the drawback that they don't return errors like set_disk_ro(), so that isn't exercising the error-handling code in ext3/jbd as much as it could. When we first worked on this in 2.4 the ext3/jbd code was far to fragile to handle -EROFS under load without oopsing so we chose the silent failure mode to allow testing w/o crashing all the time. Maybe today it is better to just call set_dev_ro() and fix the (hopefully few) -EROFS problems. The reason I prefer doing it from the timer interrupt is to toss more randomness in there, avoid the possibility of getting synchronised with application or kernel activity in some fashion. Definitely this has its place too. We chose the opposite route and have fault-injection calls sprinkled throughout our code, so that we can create complex regression tests that need specific series of events to be hit. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [PATCH 1/1] nanosecond timestamps
On Feb 25, 2007 02:36 -0800, Andrew Morton wrote: On Fri, 02 Feb 2007 20:09:40 +0530 Kalpak Shah [EMAIL PROTECTED] wrote: +#define EXT3_INODE_GET_XTIME(xtime, extra_xtime, inode, raw_inode) \ +do { \ + (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime); \ + \ + if (offsetof(typeof(*raw_inode), extra_xtime) - \ + offsetof(typeof(*raw_inode), i_extra_isize) + \ + sizeof((raw_inode)-extra_xtime) = \ + le16_to_cpu((raw_inode)-i_extra_isize)) { \ + if (sizeof((inode)-xtime.tv_sec) 4)\ + (inode)-xtime.tv_sec |= \ + (__u64)(le32_to_cpu((raw_inode)-extra_xtime)\ + EXT3_EPOCH_MASK) 32; \ + (inode)-xtime.tv_nsec = \ + (le32_to_cpu((raw_inode)-extra_xtime) \ + EXT3_NSEC_MASK) 2; \ + } \ +} while (0) ow, my eyes. Can we find a way to do this in C rather than in cpp? The macro is working on the field names of the inode in most places (e.g. i_mtime, i_ctime, etc) rather than the values. You _could_ do it in C, but it would mean moving all of the offsetof() into the caller (basically putting the whole macro in-line at the caller) or doing math on the pointer addresses at runtime instead of compile time. It boils down to a check whether the particular nanosecond field is inside the reserved space in the inode or not, so it ends up a comparison against a constant. For ctime: if (4 = le32_to_cpu((raw_inode)-i_extra_isize) { And the second if decides whether to save bits 32 in the seconds field for 64-bit architectures, so it is also evaluated at compile time. Better to have this in a macro than in the code itself. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] added sanity check for xattr validation
Ted, the attached patch adds an extra validity test in check_ext_attr(). If an attribute's e_value_size is zero the current code does not allocate a region for it and as a result the e_value_offs value is not verified. However, if e_value_offs is very large then the later call to ext2fs_ext_attr_hash_entry() can dereference bad memory and crash e2fsck. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Jim Garlick [EMAIL PROTECTED] === e_value_offs.patch Index: e2fsprogs-1.39/e2fsck/pass1.c === --- e2fsprogs-1.39.orig/e2fsck/pass1.c 2007-02-10 15:43:36.0 +0530 +++ e2fsprogs-1.39/e2fsck/pass1.c 2007-02-10 15:45:36.0 +0530 @@ -1346,6 +1347,11 @@ if (fix_problem(ctx, PR_1_EA_BAD_VALUE, pctx)) goto clear_extattr; } + if (entry-e_value_offs + entry-e_value_size fs-blocksize) { + if (fix_problem(ctx, PR_1_EA_BAD_VALUE, pctx)) + goto clear_extattr; + break; + } if (entry-e_value_size region_allocate(region, entry-e_value_offs, EXT2_EXT_ATTR_SIZE(entry-e_value_size))) { === e_value_offs.patch Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [PATCH 1/1] nanosecond timestamps
On Feb 26, 2007 17:20 -0600, Dave Kleikamp wrote: I wanted to see if I could clean this up, and this is what I came up with. I also did some macro magic to make these macros take one less argument. I ended up breaking two macros up into three smaller macros and two inline functions. Does this look any better? (This is incremental against Kalpak's patch.) Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] Looks like an improvement to me, if only to give a name and clarity to the EXT3_FITS_IN_INODE() part of the function. One other note - there was a problem with the original patch because i_crtime is also in the extended part of the inode so it needs to be validated against i_extra_isize so we need to submit a new patch for that anyways and may as well include this cleanup in that patch. Thanks Dave. diff -Nurp linux.orig/fs/ext3/inode.c linux/fs/ext3/inode.c --- linux.orig/fs/ext3/inode.c2007-02-26 17:10:22.0 -0600 +++ linux/fs/ext3/inode.c 2007-02-26 17:02:28.0 -0600 @@ -2674,10 +2674,10 @@ void ext3_read_inode(struct inode * inod inode-i_nlink = le16_to_cpu(raw_inode-i_links_count); inode-i_size = le32_to_cpu(raw_inode-i_size); - EXT3_INODE_GET_XTIME(i_ctime, i_ctime_extra, inode, raw_inode); - EXT3_INODE_GET_XTIME(i_mtime, i_mtime_extra, inode, raw_inode); - EXT3_INODE_GET_XTIME(i_atime, i_atime_extra, inode, raw_inode); - EXT3_INODE_GET_XTIME(i_crtime, i_crtime_extra, ei, raw_inode); + EXT3_INODE_GET_XTIME(i_ctime, inode, raw_inode); + EXT3_INODE_GET_XTIME(i_mtime, inode, raw_inode); + EXT3_INODE_GET_XTIME(i_atime, inode, raw_inode); + EXT3_INODE_GET_XTIME(i_crtime, ei, raw_inode); ei-i_state = 0; ei-i_dir_start_lookup = 0; @@ -2829,10 +2829,10 @@ static int ext3_do_update_inode(handle_t } raw_inode-i_links_count = cpu_to_le16(inode-i_nlink); raw_inode-i_size = cpu_to_le32(ei-i_disksize); - EXT3_INODE_SET_XTIME(i_ctime, i_ctime_extra, inode, raw_inode); - EXT3_INODE_SET_XTIME(i_mtime, i_mtime_extra, inode, raw_inode); - EXT3_INODE_SET_XTIME(i_atime, i_atime_extra, inode, raw_inode); - EXT3_INODE_SET_XTIME(i_crtime, i_crtime_extra, ei, raw_inode); + EXT3_INODE_SET_XTIME(i_ctime, inode, raw_inode); + EXT3_INODE_SET_XTIME(i_mtime, inode, raw_inode); + EXT3_INODE_SET_XTIME(i_atime, inode, raw_inode); + EXT3_INODE_SET_XTIME(i_crtime, ei, raw_inode); raw_inode-i_blocks = cpu_to_le32(inode-i_blocks); raw_inode-i_dtime = cpu_to_le32(ei-i_dtime); diff -Nurp linux.orig/include/linux/ext3_fs.h linux/include/linux/ext3_fs.h --- linux.orig/include/linux/ext3_fs.h2007-02-26 17:10:22.0 -0600 +++ linux/include/linux/ext3_fs.h 2007-02-26 17:07:20.0 -0600 @@ -331,37 +331,42 @@ struct ext3_inode { #define EXT3_EPOCH_MASK ((1 EXT3_EPOCH_BITS) - 1) #define EXT3_NSEC_MASK (~0UL EXT3_EPOCH_BITS) +#define EXT3_FITS_IN_INODE(ext3_inode, field)\ + ((offsetof(typeof(*ext3_inode), field) -\ + offsetof(typeof(*ext3_inode), i_extra_isize) +\ + sizeof((ext3_inode)-field)) = \ + le16_to_cpu((ext3_inode)-i_extra_isize)) + +static inline __le32 ext3_encode_extra_time(struct timespec *time) +{ + return cpu_to_le32((sizeof(time-tv_sec) 4 ? time-tv_sec 32 : 0) | +((time-tv_nsec 2) EXT3_NSEC_MASK)); I think this should be: return cpu_to_le32((sizeof(time-tv_sec) 4 ? (time-tv_sec 32) EXT3_EPOCH_MASK : 0) | (time-tv_nsec EXT3_EPOCH_BITS)); We should use EXT3_EPOCH_BITS, and don't really need EXT3_NSEC_MASK unless there is some reason to believe that tv_nsec is 1e9. Even then it would be truncated at 32 bits by the (__u32) cast in cpu_to_le32(). +static inline void ext3_decode_extra_time(struct timespec *time, __le32 extra) { + if (sizeof(time-tv_sec) 4) + time-tv_sec |= (__u64)(le32_to_cpu(extra) EXT3_EPOCH_MASK) + 32; + time-tv_nsec = (le32_to_cpu(extra) EXT3_NSEC_MASK) 2; And this should be: time-tv_nsec = (le32_to_cpu(extra) EXT3_NSEC_MASK) EXT3_EPOCH_BITS; We could again skip the EXT3_NSEC_MASK because le32_to_cpu(extra) will truncate the input at 32 bits, so: time-tv_nsec = le32_to_cpu(extra) EXT3_EPOCH_BITS; Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH take3 1/1] nanosecond timestamps
On Feb 27, 2007 15:14 +0530, Kalpak Shah wrote: +#define EXT4_EPOCH_BITS 2 + +static inline __le32 ext4_encode_extra_time(struct timespec *time) +{ + return cpu_to_le32((sizeof(time-tv_sec) 4 ? +time-tv_sec 32 : 0) | +((time-tv_nsec 2) EXT4_NSEC_MASK)); This should be (time-tv_nsec EXT4_EPOCH_BITS). We don't strictly need the EXT4_NSEC_MASK because cpu_to_le32() will truncate the field to 32 bits anyways. +static inline void ext4_decode_extra_time(struct timespec *time, __le32 extra) { + if (sizeof(time-tv_sec) 4) +time-tv_sec |= (__u64)(le32_to_cpu(extra) EXT4_EPOCH_MASK) + 32; + time-tv_nsec = (le32_to_cpu(extra) EXT4_NSEC_MASK) 2; This should be EXT4_EPOCH_BITS. Similarly, le32_to_cpu() will truncate extra to 32 bits and we shift away the epoch part of the field, so we don't need the EXT4_NSEC_MASK at all. +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode) \ +do {\ + if (EXT4_FITS_IN_INODE(raw_inode, xtime)) \ + (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime); \ We might consider doing this in the caller for the crtime field. That is only read/written in a single place, and it is the only one that needs the extra check to determine if the seconds field is in the body of the inode. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e2fsck and human intervention
On Mar 06, 2007 09:27 -0500, Daniel Drake wrote: OK. Given that write caching may be required for performance reasons or there might be other possible reasons which would result in preen-unrepairable fs corruption on power loss, my question is now: Is it a really bad idea to run e2fsck -y on every boot? If your primary concern is not halting the boot, then yes. 99% of people only know to answer y to e2fsck anyways. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: EXT3-fs warning (device sdd2): dx_probe: Unrecognised inode hash code 232
On Mar 09, 2007 07:40 -0800, Andrew Morton wrote: EXT3-fs warning (device sdd2): dx_probe: Unrecognised inode hash code 232 Assertion failure in dx_probe() at fs/ext3/namei.c:384: dx_get_limit(entries) == dx_root_limit(dir, root-info.info_length) Looks like corrupted on-disk or in-memory htree metadata. We should probably handle this more gracefully, if indeed it is corrupted on disk. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] improving ext4 stat/unlink performance
In recent discussions on #linuxfs with Coly Li (I hope I have the email correct), we have ressurected the idea of improving the stat and unlink performance of ext3 htree directories. Coly seems interested in working on fixing this problem, so I thought I'd start an email thread in the list to get wider input, and also keep a record of the discussions. This problem was exposed widely in an old thread started by Andrew Morton: http://marc.theaimsgroup.com/?l=linux-ext4m=105711174128574w=4 and can be summarized as htree directories have random readdir order compared to inode number (and vice versa). And my email in that thread explains the solution relatively cleanly: http://marc.theaimsgroup.com/?l=linux-ext4m=105713438820870w=4 and can be summarized as allocate new inodes in a manner that takes the name hash into account. Apparently, this has become a hot topic for the kernel.org systems because Git is fond of creating huge directories. It should be possible to make a patch that improves the performance of readdir+stat or readdir+unlink (for directories created under this scheme) but does not affect the on-disk format. Ted has also proposed the LD_PRELOAD library for doing sorting of the readdir output in userspace before returning it to the application. This can be helpful, but until this approach is widely available (e.g. built into glibc or something) it will have limited benefit to users. This is complementary with any approach taken here. The basic goal, as discussed in my above email is to map the filename to a goal inode number in (growing) range of inodes/blocks in the itable. The resulting inode order is piece-wise in approximately hash order. The goal selection part is relatively straight forward, especially for new directories, where we could keep a count of the number of inodes in a directory and scale the itable window size as a function of the number of inodes. We would likely reserve the itable blocks for a given directory in the same manner (and possibly even the same code) that ext3 reservation works today. For existing directories, the number of inodes in the directory (and hence the itable window size) can also be approximated by the current size of the directory divided by an average filename length. Whether the directory is full or not is less important, if we assume that past usage of the directory is an indication of future usage. It is in fact better to know the directory size in advance because we get fewer pieces in the hash to itable mapping. What remains open for discussion (hopefully someone will have a good answer) with existing directories is how to find the previous itable allocation window(s) across a remount. For directories with more than tens of thousands of inodes in them the inodes by necessity must span multiple block groups, and there is no guarantee that the itable ranges are contiguous (no point in reserving a range of itable blocks that are full). One possibility is to scan the directory leaf block that will ultimately hold the new inode and try to allocate in the same itable block(s) that the other entries in this leaf are using (preferring the block that is used closest hashes in that leaf). Drawbacks of this are that renamed entries will have non-local inode numbers (which is a drawback of the whole scheme, actually) and we don't store hashes with each entry. We could also persistently store a hint for the itable allocation window in the directory somewhere (e.g. the htree header by increasing the info_length). Another possibility, first suggested by Ted in the same thread above, and something I've been thinking about since the meeting in Portland last year, is to change the directory format so that the directory leaf blocks are actually the inode table itself. It is complex enough that it should go into a separate email, and is only worth considering if we are going to have another incompatible change to the on-disk format (though we could bundle a number of inode- and directory- specific changes together). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qla2xxx BUG: workqueue leaked lock or atomic
On Mar 12, 2007 16:22 +0100, Valerie Clement wrote: Mingming Cao wrote: IBM has done some testing (dbench, fsstress, fsx, tiobench, iozone etc) on 10TB ext3, I think RedHat and BULL have done similar test on 8TB ext3 too. Is there not a problem of backward-compatibility with old kernels? Doesn't we need to handle a new INCOMPAT flag in e2fsprogs and kernel before allowing ext3 filesystems greater than 8T? No, it really depends on the kernel. There were some bugs that caused problems with 8TB because of signed 32-bit int problems, so it isn't really recommended to use 8TB unless you know this is fixed in your kernel (and any older kernel you might have to downgrade to). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e2fsprogs coverity patch cid-1.diff
On Mar 18, 2007 09:40 -0400, Theodore Tso wrote: On Fri, Feb 09, 2007 at 06:08:16PM -0800, Brian D. Behlendorf wrote: Coverity ID: 2: Checked Return Would I be right in assuming this should be Coverity ID 1, and not 2? (cid-2.diff is also labelled as Coverity ID 1). What is Coverity ID's, anyway? I assume they are something you locally assigned? Yes, this is just the error number within a given software project at a particular site. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: cleaned up ext4 patch series
On Mar 19, 2007 09:15 -0800, Mingming Cao wrote: I wonder if we should create two branches: one branch for patches that are well discussed and tested, which Andrew could trust and pull to mm tree; and create another branch to store patches that are still under discussion and likely to be rewriten based on the review feedback. Yes, that makes a lot of sense. It would avoid issues like Andrew finding problems with the nanosecond patches because they hadn't been widely tested yet. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] 5/19 e2fsprogs f_extents_ee_len test
Regression test for a bad ee_len field in an extent file. Binary patch file. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_extents_ee_len.patch Description: Binary data
[PATCH] 3/19 e2fsprogs f_extents_bad_blk test
Regression test for a bad block in an extent file. Binary patch file. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_extents_bad_blk.patch Description: Binary data
[PATCH] 4/19 e2fsprogs f_extents_ee_block test
Regression test for a bad ee_block field in an extent file. Binary patch file. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_extents_ee_block.patch Description: Binary data
[PATCH] 6/19 e2fsprogs f_extents_ee_start test
Regression test for a bad ee_start field in an extent file. Binary patch file. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_extents_ee_start.patch Description: Binary data
[PATCH] 7/19 e2fsprogs f_extents_eh_depth test
Regression test for a bad eh_depth field in an extent file. Binary patch file. This test still fails with the current extents patch. It takes 2 runs of e2fsck to correct the problem. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_extents_eh_depth.patch Description: Binary data
[PATCH] 8/19 e2fsprogs f_extents_eh_entries test
Regression test for a bad eh_entries field in an extent file. Binary patch file. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_extents_eh_entries.patch Description: Binary data
[PATCH] 12/19 e2fsprogs f_extents_ei_leaf test
Regression test for a bad ei_leaf field in an extent file. Binary patch file. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_extents_ei_leaf.patch Description: Binary data
[PATCH] 14/19 e2fsprogs f_extents_inrlevel-incons test
Regression test for an inter-level index depth inconsistency in an extent file. Binary patch file. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_extents_inrlevel-incons.patch Description: Binary data
[PATCH] 15/19 e2fsprogs f_extents_orphan_blks test
Regression test for orphaned blocks in an extent filesystem. Binary patch file. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_extents_orphan_blks.patch Description: Binary data
Re: [PATCH] 18/19 e2fsprogs f_extents_shrd_blk test
Regression test for unsorted extents within a file. Binary patch file. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_extents_unsorted.patch Description: Binary data
[PATCH] 18/19 e2fsprogs f_extents_shrd_blk test
Regression test for two extent files sharing the same blocks. Binary patch file. This test is known not to work currently. The problem is that there is not enough space in the index to split the extent and clone all of the blocks. It is hoped that Ted's change of e2fsprogs to be extent based instead of block iterator based will allow this problem to be fixed. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_extents_shrd_blk.patch Description: Binary data
Re: 2.6.20 ext3 rename() returns success but doesn't unlink the source
On Mar 24, 2007 01:18 +0200, Timo Sirainen wrote: On Fri, 2007-03-23 at 23:24 +0100, Guillaume Chazarain wrote: Timo Sirainen a écrit : rename(/mnt/Maildir/.Trash/new/1174635781.P25986Q0M341350.hurina, /mnt/Maildir/.Trash/cur/1174635781.P25986Q0M341350.hurina:2,); $ ls -li /mnt/Maildir/.Trash/new/1174635781.P25986Q0M341350.hurina /mnt/Maildir/.Trash/cur/1174635781.P25986Q0M341350.hurina:2, 122481 -rw--- 2 1000 1000 3091 Mar 23 08:43 /mnt/Maildir/.Trash/cur/1174635781.P25986Q0M341350.hurina:2, 122481 -rw--- 2 1000 1000 3091 Mar 23 08:43 /mnt/Maildir/.Trash/new/1174635781.P25986Q0M341350.hurina All your files in the list are hard linked. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=fs/namei.c;h=ee60cc4d3453209723d6f70982e7083d7cb39477;hb=HEAD#l2447 seems to handle your case as you see it: if (old_dentry-d_inode == new_dentry-d_inode) return 0; Hmm. Oh. So this is this intentional? First I ever heard of it, and also goes against my logic. I'll go complain to man pages people then. This will make my error handling difficult.. Yes, it is a very counter-intuitive requirement in the POSIX spec. The GNU user tools do what is expected (unlink the source file). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
group descriptor contents and LAZY_BG
Ted, the LAZY_BG feature was originally written to test large filesystems and avoid the need to format (or even allocate, if sparse) large parts of the inode table and bitmaps. To make the feature COMPAT the bg_free_blocks_count and bg_free_inodes_count are initialized to zero, and the kernel will skip these groups entirely without checking the bitmap. For the GDT_CSUM feature, we are using the LAZY_BG feature to not initialize the bitmaps and inode table. However, setting the bg_free_*_count to zero is troublesome because it means we always need to check the GDT_CSUM feature flag to know whether a group actually has free blocks/inodes, and e2fsck is also getting confused about the number of free blocks/inodes in the groups and filesystem (we have to have a double accounting for real free blocks and uninitialized free blocks. What do you think for the GDT_CSUM feature that we initialize the group descriptors as if the group had actually been formatted? I think our use of the LAZY_BG feature is actually misguided - while they share some components (e.g. UNINIT flags, mke2fs code, some e2fsck code), since the GDT_CSUM feature is RO_COMPAT there isn't much reason to even enable LAZY_BG at format time... We have to replace all uses of bg_free_*_count with the below macros: +/* Macro-instructions used to calculate Free inodes and blocks count. */ +#define EXT3_BG_INODES_FREE(sb,gr,gdp) ((EXT3_HAS_RO_COMPAT_FEATURE(sb, \ + EXT4_FEATURE_RO_COMPAT_GDT_CSUM) \ + (gdp)-bg_flags \ + cpu_to_le16(EXT3_BG_INODE_UNINIT)) ? \ + EXT3_INODES_PER_GROUP(sb) : \ + le16_to_cpu((gdp)-bg_itable_unused) + \ + le16_to_cpu((gdp)-bg_free_inodes_count)) +#define EXT3_BG_BLOCKS_FREE(sb,gr,gdp) ((EXT3_HAS_RO_COMPAT_FEATURE(sb, \ + EXT4_FEATURE_RO_COMPAT_GDT_CSUM) \ + (gdp)-bg_flags \ + cpu_to_le16(EXT3_BG_BLOCK_UNINIT)) ? \ + ext3_free_blocks_after_init(sb,gr,gdp) :\ + le16_to_cpu((gdp)-bg_free_blocks_count)) Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext4 benchmarks
On Mar 30, 2007 10:43 +0200, Johann Lombardi wrote: On Wed, Mar 28, 2007 at 09:20:50AM -0500, Jose R. Santos wrote: While it may be to late for the purposes of your OLS paper, one thing that doesn't seem to be getting much attention is the performance of a file system while doing many meta-data operations or throughput testing during heavy journal log activity. Back in 2005, Alex sent a patch to monitor journal activity through procfs: http://marc.info/?l=linux-ext4m=113538565128617w=2 We have been using this patch for more than 1 year and it is very useful for investigating performance issues. Does anybody know why this patch hasn't found its way into the mainline? Maybe it could be merged in jbd2? Yes, we use this functionality failry often, so having it in jbd2 would be quite useful. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Correction to check_filetype()
On Mar 30, 2007 20:44 -0400, Theodore Tso wrote: On Wed, Feb 21, 2007 at 02:45:59PM +0530, Kalpak Shah wrote: If the mode of a directory gets corrupted, check_filetype() makes wrong decisions for all its sub-directories. For example, using debugfs we can corrupt the mode of a directory to 0140755 (i.e. a socket). e2fsck will set the filetype of all its subdirectories as 6 (filetype for socket). All the subdirectories would be moved to lost+found, and in second run of e2fsck their filetype would be set back to 2. Um, I'm not seeing this. Using stock e2fsprogs, given the following test image, I'm not seeing the behavior you describe. The log of the e2fsck (on a test filesystem with deliberately introduced corruption) is available at: https://bugzilla.lustre.org/show_bug.cgi?id=11645 I've also added a testcase (created by making a parent directory with several subdirs, then using debugfs to change the mode of the parent directory). https://bugzilla.lustre.org/attachment.cgi?id=9958 (also attached here). It doesn't exhibit the filetype breakage in the upstream e2fsck because the test is run with the patch applied, but it does still show the size is wrong on second e2fsck problem you observed. The test case is created as if that problem was also fixed already (i.e. second e2fsck is clean). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. e2fsprogs-tests-f_check_filetype.patch Description: Binary data
Re: [PATCH] Add i_version_hi for 64-bit version
On Apr 02, 2007 14:47 -0700, Mingming Cao wrote: On Sun, 2007-03-18 at 15:43 +0530, Kalpak Shah wrote: This patch adds a 32-bit i_version_hi field to ext4_inode, which can be used for 64-bit inode versions. This field will store the higher 32 bits of the version, while Jean Noel's patch has added support to store the lower 32-bits in osd1.linux1.l_i_version. Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Kalpak Shah [EMAIL PROTECTED] Index: linux-2.6.20/include/linux/ext4_fs.h === --- linux-2.6.20.orig/include/linux/ext4_fs.h +++ linux-2.6.20/include/linux/ext4_fs.h @@ -336,6 +336,7 @@ struct ext4_inode { __le32 i_atime_extra; /* extra Access time (nsec 2 | epoch) */ __le32 i_crtime; /* File Creation time */ __le32 i_crtime_extra; /* extra File Creation time (nsec 2 | epoch) */ + __u32 i_version_hi; /* high 32 bits for 64-bit version */ }; Any reason not using __le32 for the i_version_hi? Not really. I've also asked Kalpak to run sparse with Shaggy's flags against the other patches to fix up any similar issues. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Correction to check_filetype()
On Mar 31, 2007 10:39 -0400, Theodore Tso wrote: I'm going to let this one soak for a bit to make sure we don't pick up any fase positives or negatives in the hueristics. @@ -133,11 +133,10 @@ int e2fsck_pass1_check_device_inode(ext2 + * If the index flag is set, then this is a bogus + * device/fifo/socket */ - if ((ext2fs_inode_data_blocks(fs, inode) != 0) || - (inode-i_flags EXT2_INDEX_FL)) + if (inode-i_flags EXT2_INDEX_FL) return 0; There were ancient versions of the kernel that left EXT2_INDEX_FL set on all files, instead of just directories... I'm not sure if those were in actual released kernels, or just in patches. +static void check_is_really_dir(e2fsck_t ctx, struct problem_context *pctx, + char *buf) +{ + if (ext2fs_read_dir_block(ctx-fs, inode-i_block[0], buf)) + return; Do we call check_blocks() on this inode shortly thereafter? If we do then the overhead of reading the first block is offset by not reading it again later. Otherwise, this could slow things down. + dirent = (struct ext2_dir_entry *) buf; + if (((dirent-name_len 0xFF) != 1) || + (dirent-name[0] != '.') || + (dirent-inode != pctx-ino) || + (dirent-rec_len 12) || + (dirent-rec_len % 4) || + (dirent-rec_len = ctx-fs-blocksize - 12)) + return; + + dirent = (struct ext2_dir_entry *) (buf + dirent-rec_len); + if (((dirent-name_len 0xFF) != 2) || + (dirent-name[0] != '.') || + (dirent-name[1] != '.') || + (dirent-rec_len 12) || + (dirent-rec_len % 4)) + return; + + if (fix_problem(ctx, PR_1_TREAT_AS_DIRECTORY, pctx)) { + inode-i_mode = (inode-i_mode 0) | LINUX_S_IFDIR; + e2fsck_write_inode_full(ctx, pctx-ino, inode, + EXT2_INODE_SIZE(ctx-fs-super), + check_is_really_dir); } The one worry I have here (though I don't think it is necessarily IN the code you propose) is that someone could create a regular file which looks like a directory and somehow get it linked into the filesystem tree, giving them escalated access (e.g. device files owned by them, suid executables, links to otherwise unreadable files, etc). It would seem that this is only a danger if the mode on the file is corrupted, which shouldn't really be doable by a regular user, but it is definitely something to consider. I take it that this code fixes the test image I previously posted? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 2] e2fsprogs: Add ext4migrate
On Apr 03, 2007 15:37 +0530, Aneesh Kumar K.V wrote: The extent insert code is derived out of the latest ext4 kernel source. I have tried to keep the code as close as possible to the kernel sources. This makes sure that any fixes for the tree building code in kernel should be easily applied to ext4migrate. The ext3_ext naming convention instead of ext4_ext found in kernel is to make sure we are in sync with rest of e2fsprogs source. Of course, the other way to do this would be to temporarily mount the filesystem as ext4, copy non-extent files via cp (can use lsattr to check for extent flag) and then rename new file over old one. Care must be taken to not mount filesystem on visible mountpoint, so that users cannot be changing the filesystem while copy is being done. This can be done to convert an ext4 filesystem back to ext3 also, if the ext4 filesystem is mounted with noextents (to disable creation of new files with extent mapping). The only minor issue is that the inode numbers of the files will change. The inode modification is done only at the last stage. This is to make sure that if we fail at any intermediate stage, we exit without touching the disk. The inode update is done as below a) Walk the extent index blocks and write them to the disk. If failed exit b) Write the inode. if failed exit. c) Write the updated block bitmap. if failed exit ( This could be a problem because we have already updated the inode i_block filed to point to new blocks.). But such inconsistancy between inode i_block and block bitmap can be fixed by fsck IIUC. Why not mark all the relevant blocks in use (for both exent- and block-mapped copies) until the copy is done, then write everything out, and only mark the block-mapped file blocks free after the inode is written to disk? This avoids the danger that the new extent-mapped file's blocks are marked free and get double-allocated (corrupting the file data, possibly the whole filesystem). I don't think there is a guarantee that an impatient user will run a lengthy e2fsck after interrupting the migrate. Also, you should mark the filesystem unclean at first change unless everything completes successfully. That way e2fsck will at least run automatically on the next boot. Other general notes: - wrap lines at 80 columns - would be good to have a -R mode that walked the whole filesystem, since startup time is very long for large filesystems - also allow specifying multiple files on the command-line - changing the operation to be multi-file allows avoiding sync of bitmaps two times (once after extents are allocated and inode written, once after indirect blocks are freed). There only needs to be one sync per file. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: tune2fs -l stale info
On Apr 03, 2007 15:55 +0200, Vincent Caron wrote: BTW, if ondisk superblocks are not updated until specific events occur (umount, statfs), what is the consequence of a system crash ? Does journalization come into play (superblock=metadata?), does fsck fixes figures from other ondisk structures ? Just being curious... The free blocks/inodes count in the superblock aren't really used for anything these days. At mount time the kernel sums up the free inode and block counts from the group descriptors to put into the percpu counters, and only the group descriptors are kept uptodate. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext4 benchmarks
On Apr 04, 2007 19:06 +0200, Cordenner jean noel wrote: here is the first results of the round: http://www.bullopensource.org/ext4/20070404/ Jean Noel, thank you for the test results. It is always nice to see that ext4 is doing so well compared to ext3 and XFS. Ming Ming, it should be possible to just include the mballoc+delalloc patches that Jean Noel used into the upstream ext4 patch series. When Alex or Christoph get a chance to do the VFS delalloc rewrite we can move to that new patch, but until then it seems pointless to not include this functionality which improves the performance so much. Also, if we include those patches the mballoc and delalloc features (along with extents) should be enabled by default if INCOMPAT_EXTENTS is in the superblock unless: - noextents, nomballoc, or nodelalloc mount options are given - delalloc needs to be disabled if blocksize != PAGE_SIZE Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Add a norecovery option to ext3/4?
On Apr 08, 2007 22:24 -0500, Eric Sandeen wrote: Samuel Thibault wrote: Distribution installers usually try to probe OSes for building a suited grub menu. Unfortunately, mounting an ext3 partition, even in read-only mode, does perform some operations on the filesystem (log recovery). This is not a good idea since it may silently garbage data. Can you elaborate? Under what circumstances is log replay going to harm data? Do you mean that the installer mounts partitions, looking for what OS is installed? How is that harmful? If that disk was actually in use on another system but just exported via a SAN to this node you've potentially corrupted the filesystem. It's a bad idea to just go ahead and mount filesystems that you aren't told to mount. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3, BKL, journal replay, multiple non-bind mounts of same device
On Apr 10, 2007 20:49 -0400, John Anthony Kazos Jr. wrote: Since it is possible for the same block device to be mounted multiple times concurrently by the same filesystem, and since ext3 explicitly disables the BKL during its fill_super operation which would prevent this, what is the result of mounting it multiple times this way? Especially if the filesystem is dirty and a journal is replayed. (In any case, what operation is being performed by ext3/ext4 that requires the BKL to be dropped? What's the need to even consider the BKL during fill_super?) And in general, how does a filesystem deal with being mounted multiple times in this way? In my testing and exploration so far, everything seems to generally work, but I haven't tried deliberately using different instances of the mount concurrently. Do we end up with locks not being held properly on the superblock because the super_block structure instances don't know about each other? Has dealing with this behavior of bd_claim really been considered before, and if so, what's the general scheme for handling it? It is a myth (that actually frightened me quite a bit when I first did it) that the filesystem is mounted twice in this case. The truth of the matter is if you mount -t ext3 /dev/ /mnt/1 and ... /mnt/2 you actually get the equivalent of a bind mount for this block device on the two mount points. You can see this easily because e.g. you don't get two kjournald threads for the two mounts, and it doesn't completely blow up. If, on the other hand, you tried one mount with ext3 and another with ext4 it will fail the second with -EBUSY. As for the BKL changes, your best bet is to go back through GIT and/or BK or search the mailing lists to see when and why that was added. It appears to have been 2.6.11, but I don't know why. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] e2fsprogs - pass1c terminates early if hard links
On Apr 11, 2007 06:42 -0400, Theodore Tso wrote: We don't currently have a way to modify indirect blocks directly (although I may create that soon or patches would be greatfully accepted :-), and of course there is a similar issue with extents and extended attributes blocks. Note that as part of the i_extra_isize e2fsck support we are adding functions to manage EAs to libext2fs that could be used as the basis for this. At the moment I'm assuming that e2fsprogs block allocation algorithms (which are not as sophisticated as what's in the kernel) aren't going to be changing. I thought as much myself. If they change, you're right, they could break the test. At the minimum I should document where the numbers are coming from in the test. At least the test would break and any change that caused it could be seen immediately. I could imagine a debugfs extension where we could do things like: set_var $shared_blk /dir4/quux block[0] set_inode_field /dir/foo block[0] $shared_blk set_inode_field /dir2/bar block[0] $shared_blk set_inode_field /dir3/baz block[0] $shared_blk Of course the temptation then would be to start adding conditions, functions, maybe an entire tcl interpreter :-) Yes, I thought of this too, and also the let's build a full interpreted language into debugfs conclusion. It wouldn't be fatal though - in some cases it would be nice to have debugfs loop over inodes to fix some unusual corruption. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: debugfs fill_bgs command?
On Apr 11, 2007 15:17 -0500, Eric Sandeen wrote: I was wondering if others think it would be useful to have a fill_bgs command in debugfs; this would (minimally) mark the lower X bgs as full for both inodes blocks (possibly either/or), to allow testing the higher block groups. I'm using a hacked up version of this to do a little ext3 testing above 8T. Given that this would really only be a testing option in nature, would it be accepted into e2fsprogs? I'd probably need some sort of unfill command as well, to put the bg counters back where they should be; probably by actually reading the bitmaps. This way fsck would still find a consistent filesystem... I had originally written a set_bg_field function too, to go with inode sb variants, though for marking the first few thousand bg's it was going to get a bit tedious... :) Try mke2fs -O lazy_bg XXX and be happy. Can be used on any kernel as it is a COMPAT feature and marks all but first and last groups as full. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] add FIEMAP ioctl to efficiently map file allocation
to allow this interface to be mapped onto XFS_IOC_BMAP internally (or vice versa). Even for block-mapped filesystems, they can at least improve over the -bmap() case by skipping holes in files that cover [dt]indirect blocks (saving thousands of calls). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote: On 12 Apr 2007, at 12:05, Andreas Dilger wrote: I'm interested in getting input for implementing an ioctl to efficiently map file extents holes (FIEMAP) instead of looping over FIBMAP a billion times. We already have customers with single files in the 10TB range and we additionally need to get the mapping over the network so it needs to be efficient in terms of how data is passed, and how easily it can be extracted from the filesystem. struct fibmap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ } struct fibmap { struct fibmap_extent fm_start; /* offset, length of desired mapping */ __u32 fm_extent_count; /* number of extents in array */ __u32 fm_flags; /* flags for input request */ XFS_IOC_GETBMAP) */ __u64 unused; struct fibmap_extent fm_extents[0]; } #define FIEMAP_LEN_MASK 0xff #define FIEMAP_LEN_HOLE 0x01 #define FIEMAP_LEN_UNWRITTEN 0x02 Sound good but I would add: #define FIEMAP_LEN_NO_DIRECT_ACCESS This would say that the offset on disk can move at any time or that the data is compressed or encrypted on disk thus the data is not useful for direct disk access. This makes sense. Even for Reiserfs the same is true with packed tails, and I believe if FIBMAP is called on a tail it will migrate the tail into a block because this is might be a sign that the file is a kernel that LILO wants to boot. I'd rather not have any such feature in FIEMAP, and just return the on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me. My main reason for FIEMAP is being able to investigate allocation patterns of files. By no means is my flag list exhaustive, just the ones that I thought would be needed to implement this for ext4 and Lustre. Also why are you not using 0xff00, i.e. two more zeroes at the end? Seems unnecessary to drop an extra 8 bits of significance from the byte size... It was actually just a typo (this was the first time I'd written the structs and flags down, it is just at the discussion stage). I'd meant for it to be 2^56 bytes for the file size as I wrote later in the email. That said, I think that 2^48 bytes is probably sufficient for most uses, so that we get 16 bits for flags. As it is this email already discusses 5 flags, and that would give little room for expansion in the future. Remember, this is the mapping for a single file (which can't practially be beyond 2^64 bytes as yet) so it wouldn't be hard for the filesystem to return a few separate extents which are actually contiguous (assuming that there will actually be files in filesystems with 2^48 bytes of contiguous space). Since the API is that it will return the extent that contains the requested start byte, the kernel will be able to detect this case also, since it won't be able to specify a length for the extent that contains the start byte. At most we'd have to call the ioctl() 65536 times for a completely contiguous 2^64 byte file if the buffer was only large enough for a single extent. In reality, I expect any file to have some discontinuities and the buffer to be large enough for a thousand or more entries so the corner case is not very bad. Finally please make sure that the file system can return in one way or another errors for example when it fails to determine the extents because the system ran out of memory, there was an i/o error, whatever... It may even be useful to be able to say here is an extent of size X bytes but we do not know where it is on disk because there was an error determining this particular extent's on-disk location for some reason or other... Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and FIEMAP_LEN_ERROR. Consider FIEMAP on a file that was migrated to tape and currently has no blocks allocated in the filesystem. We want to return some indication that there is actual file data and not just a hole, but at the same time we don't want this to actually return the file from tape just to generate block mappings for it. This concept is also present in XFS_IOC_GETBMAPX - BMV_IF_NO_DMAPI_READ, but this needs to be specified on input to prevent the file being mapped and I'd rather the opposite (not getting file from tape) be the default, by principle of least surprise. block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag). Why the fe_start == 0? Surely just the flag is sufficient... On NTFS it is perfectly valid to have fe_start == 0 and to have that not be sparse (normally the $Boot system file is stored in the first 8 sectors of the volume)... I thought
Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4
Just a quick note before I forget. I thought there was a call in ext4 to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has more than 2^32 blocks? I don't see that anywhere in the 2.6.20 ext4. Is that in the upstream git repo? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Copy i_flags to ext3 inode flags on write
On Apr 16, 2007 19:05 +0200, Jan Kara wrote: attached is a patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc. from i_flags to EXT3_I(inode)-i_flags when inode is written to disk. The same thing is done on GETFLAGS ioctl. Quota code changes these flags on quota files (to make it harder for sysadmin to screw himself) and these changes were not correctly propagated into the filesystem (especially, lsattr did not show them and users were wondering...). +/* Propagate flags from i_flags to EXT3_I(inode)-i_flags */ +void ext3_get_inode_flags(struct inode *inode) +{ + unsigned int flags = inode-i_flags; + + EXT3_I(inode)-i_flags = ~(EXT3_SYNC_FL|EXT3_APPEND_FL| + EXT3_IMMUTABLE_FL|EXT3_NOATIME_FL|EXT3_DIRSYNC_FL); + if (flags S_SYNC) + EXT3_I(inode)-i_flags |= EXT3_SYNC_FL; + if (flags S_APPEND) + EXT3_I(inode)-i_flags |= EXT3_APPEND_FL; + if (flags S_IMMUTABLE) + EXT3_I(inode)-i_flags |= EXT3_IMMUTABLE_FL; + if (flags S_NOATIME) + EXT3_I(inode)-i_flags |= EXT3_NOATIME_FL; + if (flags S_DIRSYNC) + EXT3_I(inode)-i_flags |= EXT3_DIRSYNC_FL; +} It probably makes sense to pass struct ext3_inode_info *ei to this function, which is available at both callsites and avoids some pointer math for each access. void ext3_read_inode(struct inode * inode) { struct ext3_iloc iloc; @@ -2736,6 +2755,7 @@ static int ext3_do_update_inode(handle_t if (ei-i_state EXT3_STATE_NEW) memset(raw_inode, 0, EXT3_SB(inode-i_sb)-s_inode_size); + ext3_get_inode_flags(inode); raw_inode-i_mode = cpu_to_le16(inode-i_mode); if(!(test_opt(inode-i_sb, NO_UID32))) { raw_inode-i_uid_low = cpu_to_le16(low_16_bits(inode-i_uid)); diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/ioctl.c linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c --- linux-2.6.21-rc6/fs/ext3/ioctl.c 2007-02-07 12:03:23.0 +0100 +++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c 2007-04-12 18:22:54.0 +0200 @@ -28,6 +28,7 @@ int ext3_ioctl (struct inode * inode, st switch (cmd) { case EXT3_IOC_GETFLAGS: + ext3_get_inode_flags(inode); flags = ei-i_flags EXT3_FL_USER_VISIBLE; return put_user(flags, (int __user *) arg); case EXT3_IOC_SETFLAGS: { Looks fine otherwise - there is already ext3_set_inode_flags() which sets the VFS inode flags properly in ext3_read_inode(). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Apr 16, 2007 18:01 +1000, Timothy Shimmin wrote: --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger [EMAIL PROTECTED] wrote: struct fiemap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ } struct fiemap { struct fiemap_extent fm_start; /* offset, length of desired mapping */ __u32 fm_extent_count; /* number of extents in array */ __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ __u64 unused; struct fiemap_extent fm_extents[0]; } # define FIEMAP_LEN_MASK 0xff # define FIEMAP_LEN_HOLE 0x01 # define FIEMAP_LEN_UNWRITTEN0x02 All offsets are in bytes to allow cases where filesystems are not going block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag). The -fm_extents[] array includes all of the holes in addition to allocated extents because this avoids the need to return both the logical and physical address for every extent and does not make processing any harder. Well, that's what stood out for me. I was wondering where the fe_block field had gone - the physical address. So is your fe_start; /* starting offset */ actually the disk location (not a logical file offset) _except_ in the header (fiemap) where it is the desired logical offset. Correct. The fm_extent in the request contains the logical start offset and length in bytes of the requested fiemap region. In the returned header it represents the logical start offset of the extent that contained the requested start offset, and the logical length of all the returned extents. I haven't decided whether the returned length should be until EOF, or have the virtual hole at the end of the file. I think EOF makes more sense. The fe_start + fe_len in the fm_extents represent the physical location on the block device for that extent. fm_extent[i].fe_start (per Anton) is undefined if FIEMAP_LEN_HOLE is set, and .fe_len is the length of the hole. Okay, looking at your example use below that's what it looks like. And when you refer to fm_start below, you mean fm_start.fe_start? Sorry, I realise this is just an approximation but this part confused me. Right, I'll write up a new RFC based on feedback here, and correcting the various errors in the original proposal. So you get rid of all the logical file offsets in the extents because we report holes explicitly (and we know everything is contiguous if you include the holes). Correct. It saves space in the common case. Caller works something like: char buf[4096]; struct fiemap *fm = (struct fiemap *)buf; int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent); fm-fm_start.fe_start = 0; /* start of file */ fm-fm_start.fe_len = -1; /* end of file */ fm-fm_extent_count = count; /* max extents in fm_extents[] array */ fm-fm_flags = 0; /* maybe no DMAPI, etc like XFS */ fd = open(path, O_RDONLY); printf(logical\t\tphysical\t\tbytes\n); /* The last entry will have less extents than the maximum */ while (fm-fm_extent_count == count) { rc = ioctl(fd, FIEMAP, fm); if (rc) break; /* kernel filled in fm_extents[] array, set fm_extent_count * to be actual number of extents returned, leaves * fm_start.fe_start alone (unlike XFS_IOC_GETBMAP). */ for (i = 0; i fm-fm_extent_count; i++) { __u64 len = fm-fm_extents[i].fe_len FIEMAP_LEN_MASK; __u64 fm_next = fm-fm_start.fe_start + len; int hole = fm-fm_extents[i].fe_len FIEMAP_LEN_HOLE; int unwr = fm-fm_extents[i].fe_len FIEMAP_LEN_UNWRITTEN; printf(%llu-%llu\t%llu-%llu\t%llu\t%s%s\n, fm-fm_start.fe_start, fm_next - 1, hole ? 0 : fm-fm_extents[i].fe_start, hole ? 0 : fm-fm_extents[i].fe_start + fm-fm_extents[i].fe_len - 1, len, hole ? (hole) : , unwr ? (unwritten) : ); /* get ready for printing next extent, or next ioctl */ fm-fm_start.fe_start = fm_next; } } Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Apr 16, 2007 21:22 +1000, David Chinner wrote: On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: struct fiemap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ } struct fiemap { struct fiemap_extent fm_start; /* offset, length of desired mapping */ __u32 fm_extent_count; /* number of extents in array */ __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ __u64 unused; struct fiemap_extent fm_extents[0]; } #define FIEMAP_LEN_MASK 0xff #define FIEMAP_LEN_HOLE 0x01 #define FIEMAP_LEN_UNWRITTEN0x02 I'm not sure I like stealing bits from the length to use a flags - I'd prefer an explicit field per fiemap_extent for this. Christoph expressed the same concern. I'm not dead set against having an extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may mean the need for 50% more ioctls if the file is large. Below is an aggregation of the comments in this thread: struct fiemap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ __u32 fe_lun; /* logical storage device number in array */ } struct fiemap { __u64 fm_start; /* logical start offset of mapping (in/out) */ __u64 fm_len; /* logical length of mapping (in/out) */ __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ __u64 fm_unused; struct fiemap_extent fm_extents[0]; } /* flags for the fiemap request */ #define FIEMAP_FLAG_SYNC0x0001 /* flush delalloc data to disk*/ #define FIEMAP_FLAG_HSM_READ0x0002 /* retrieve data from HSM */ #define FIEMAP_FLAG_INCOMPAT0xff00 /* must understand these flags*/ /* flags for the returned extents */ #define FIEMAP_EXTENT_HOLE 0x0001 /* no space allocated */ #define FIEMAP_EXTENT_UNWRITTEN 0x0002 /* uninitialized space */ #define FIEMAP_EXTENT_UNKNOWN 0x0004 /* in use, location unknown */ #define FIEMAP_EXTENT_ERROR 0x0008 /* error mapping space */ #define FIEMAP_EXTENT_NO_DIRECT 0x0010 /* no direct data access */ SUMMARY OF CHANGES == - use fm_* fields directly in request instead of making it a fiemap_extent (though they are layed out identically) - separate flags word for fm_flags: - FIEMAP_FLAG_SYNC = range should be synced to disk before returning mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified (this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag) - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future if there is agreement on whether that is desirable to have or if it is better to call ioctl(FIEMAP) on an XATTR fd. - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it - __u64 fm_unused does not take up an extra space on all power-of-two buffer sizes (would otherwise be at end of buffer), and may be handy in the future. - add separate fe_flags word with flags from various suggestions: - FIEMAP_EXTENT_HOLE = extent has no space allocation - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown (e.g. HSM, delalloc awaiting sync, etc) - FIEMAP_EXTENT_ERROR = error mapping extent. Should fe_lun == errno? - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data encrypted, compressed, etc), may want separate flags for these? - add new fe_lun word per extent for filesystems that manage multiple devices (e.g. OCFS, GFS, ZFS, Lustre). This would otherwise have been unused. Given that xfs_bmap uses extra information from the filesystem (geometry) to display extra (and frequently used) information about the alignment of extents. ie: chook 681% xfs_bmap -vv fred fred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..151]:288444888..288445039 8 (1696536..1696687) 152 00010 FLAG Values: 01 Unwritten preallocated extent 001000 Doesn't begin on stripe unit 000100 Doesn't end on stripe unit 10 Doesn't begin on stripe width 01 Doesn't end on stripe width Can you clarify the terminology here? What is a stripe unit and what is a stripe width? Are there N * stripe_unit = stripe_width in e.g. a RAID
Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4
On Apr 19, 2007 12:15 -0700, Mingming Cao wrote: On Sun, 2007-04-15 at 10:16 -0600, Andreas Dilger wrote: Just a quick note before I forget. I thought there was a call in ext4 to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has more than 2^32 blocks? Question about the online resize case. If the fs is increased to more than 2^32 blocks, we should set this JBD2_FEATURE_INCOMPAT_64BIT in the journal. What about existing transactions that still stores 32 bit block numbers? I guess the journal need to commit them all so that revoke will not get confused about the bits for block numbers later. After that done then JBD2 can set this feature safely. Well, there are two options here: 1) refuse resizing filesystems beyond 16TB - this is required if they were not formatted as ext4 to start with, as the group descriptors will not be large enough to handle the _hi word in the bitmap/inode table locations - this is also a problem for block-mapped files that need to allocate blocks beyond 16TB (though this could just fail on those files with e.g. ENOSPC or EFBIG or something similar) 2) flush the journal (like ext4_write_super_lockfs()) while resizing beyond 16TB. This would also require changing over to META_BG at some point, because there cannot be enough reserved group descriptor blocks (the resize_inode is set up for a maximum of 2TB filesystems I think) For now I'd be happy with just setting the JBD2_*_64BIT flag at mount for filesystems 16TB, and refusing resize across 16TB. We can fix it later. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4
On Apr 19, 2007 17:41 -0700, Mingming Cao wrote: Any concerns about turn on META_BG by default for all new ext4 fs? Initially I thought we only need META_BG for support 256TB, so there is no rush to turn it on for all the new fs. But it appears there are multiple benefits to enable META_BG by default: I would prefer not to have it default for the first 1TB or so of the filesystem or so. One reason is that using META_BG for all of the groups give us only 2 backups of each group descriptor, and those are relatively close together. In the first 1TB we would get 17 backups of the group descriptors, which should be plenty. - enable online resize 2TB Actually, I don't think the current online resize support for META_BG. There was a patch last year by Glauber de Oliveira Costa which added support for online resizing with META_BG, which would need to be updated to work with ext4. Also, the usage of s_first_meta_bg in that patch is incorrect. - support 256TB fs True, though not exactly pressing, and filesystems can be changed to add META_BG support at any point. - Since metadatas(bitmaps, group descriptors etc) are not put at the beginning of each block group anymore, the 128MB limit(block group size with 4k block size) that used to limit an extent size is removed. - Speed up fsck since metadata are placed closely. That isn't really true, even though descriptions of META_BG say this. There will still be block and inode bitmaps and the inode table. The ext3 code was missing support for moving the bitmaps/itable outside their respective groups, and that has not been fixed yet in ext4. The problem is that ext4_check_descriptors() in the kernel was never changed to support META_BG, so it does not allow the bitmaps or inode table to be outside the group. Similarly, ext2fs_group_first_block() and ext2fs_group_last_block() in lib/ext2fs also don't take META_BG into account. Also, since the extent format supports at most 2^15 blocks (128MB) it doesn't really make much difference in that regard, though it does help the allocator somewhat because it has more contiguous space to allocate from. So I am wondering why not make it default? It wouldn't be too hard to add in support for this I think, and there is definitely some benefit. Since neither e2fsprogs nor the kernel handle this correctly, the placement of bitmaps and inode tables outside of their respective groups may as well be a separate feature. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix up lazy_bg bitmap initialization at mkfs time
On Apr 20, 2007 09:57 -0500, Eric Sandeen wrote: Anyway, how about something like this for calculating journal size in the face of lazy_bg. I know the last group may be smaller... but I figure this is just a heuristic anyway. Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Index: e2fsprogs-1.39_ext4_hg/misc/util.c === --- e2fsprogs-1.39_ext4_hg.orig/misc/util.c +++ e2fsprogs-1.39_ext4_hg/misc/util.c @@ -252,8 +252,16 @@ void parse_journal_opts(const char *opts int figure_journal_size(int size, ext2_filsys fs) { blk_t j_blocks; + blk_t fs_size; - if (fs-super-s_blocks_count 2048) { + if (EXT2_HAS_COMPAT_FEATURE(fs-super, + EXT2_FEATURE_COMPAT_LAZY_BG)) { + fs_size = fs-super-s_blocks_per_group * 2; + } else { + fs_size = fs-super-s_blocks_count; + } This should also check for !RO_COMPAT_UNINIT_GROUPS, since LAZY_BG can be used in conjunction with UNINIT_GROUPS to mean don't zero uninitialized groups but doesn't set bg_free_blocks_count == 0. I was going to suggest using s_free_blocks_count, but that might lead to confusion if e.g. e2fsck deletes a journal on a nearly-full fs and then tune2fs recreates it much smaller than before. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix ext2 allocator overflows above 31 bit blocks
On Apr 20, 2007 12:10 -0500, Eric Sandeen wrote: If ext3 can do 16T, ext2 probably should be able to as well. There are still int block containers in the block allocation path that need to be fixed up. Yeah, but who wants to do 16TB e2fsck on every boot? I think there needs to be some limits imposed for the sake of usability. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Apr 27, 2007 08:30 -0700, Linus Torvalds wrote: On a good filesystem, when you do fsync() on a file, nothing at all happens to any other files. On ext3, it seems to sync the global journal, which means that just about *everything* that writes even a single byte (well, at least anything journalled, which would be all the normal directory ops etc) to disk will just *stop* dead cold! It's horrid. And it really is ext3, not fsync(). I used to run reiserfs, and it had its problems, but this was the feature of ext3 that I've disliked most. If you run a MUA with local mail, it will do fsync's for most things, and things really hickup if you are doing some other writes at the same time. In contrast, with reiser, if you did a big untar or some other big write, if somebody fsync'ed a small file, it wasn't even a blip on the radar - the fsync would sync just that small thing. It's true that this is a feature of ext3 with data=ordered (the default), but I suspect the same thing is now true in reiserfs too. The reason is that if a journal commit doesn't flush the data as well then a crash will result in garbage (from old deleted files) being visible in the newly allocated file. People used to complain about this with reiserfs all the time having corrupt data in new files after a crash, which is why I believe it was fixed. There definitely are some problems with the ext3 journal commit though. If the journal is full it will cause the whole journal to checkpoint out to the filesystem synchronously even if just space for a small transaction is needed. That is doubly bad if you have a very large journal. I believe Alex has a patch to have it checkpoint much smaller chunks to the fs. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Large File Deletion Comparison (ext3, ext4, XFS)
On Apr 27, 2007 14:33 -0400, Theodore Tso wrote: Here are the results obtained with a not very fragmented 100-GB file: | ext3 ext4 + extents xfs nb of fragments | 796 798 15 elapsed time| 2m0.306s0m11.127s 0m0.553s | blks read | 2066006416352 blks written| 13592 13064104 The metablockgroups feature should help the file fragmentation level with extents. It's easy enough to enable this for ext4 (we just need to remove some checks in ext4_check_descriptors), so we should just do it. I agree in this case that the META_BG feature would help here (100GB / 128MB is in fact the 800 fragments shown), I don't think that is the major performance hit. The fact that we need to read 6000 blocks and write 13000 blocks is the more serious part. I assume that since there are only 800 fragments there should be only 800 extents. We can fit (4096 / 12 - 1) = 340 extents into each block, and 4 index blocks into the inode, so this should allow all 800 extents in only 3 index blocks. It would be useful to know where those 6416 block reads are going in the extent case. I suspect that is because the tail first truncation mechanism of ext3 causes it to zero out FAR more blocks than needed. With extents and a default 128MB journal we should be able to truncate + unlink a file with only writes to the inode and the 800 bitmap + gdt blocks. The reads should also be limited to the bitmap blocks and extent indexes (gdt being read at mount time). What is needed is for truncate to walk the inode block tree (extents or indirect blocks) and count the bitmaps + gdt blocks dirtied, and then try and do the whole truncate under a single transaction. That avoids any need for truncate to be restartable and then there is no need to zero out the indirect blocks from the end one-at-a-time. Doing the bitmap read/write will definitely be more efficient with META_BG, but that doesn't explain the other 19k blocks undergoing IO. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Large File Deletion Comparison (ext3, ext4, XFS)
On Apr 27, 2007 15:41 +0200, Valerie Clement wrote: As asked by Alex, I included in the test results the file fragmentation level and the number of I/Os done during the file deletion. Here are the results obtained with a not very fragmented 100-GB file: | ext3 ext4 + extents xfs nb of fragments | 796 798 15 elapsed time| 2m0.306s0m11.127s 0m0.553s | blks read | 2066006416352 blks written| 13592 13064104 And with a more fragmented 100-GB file: | ext3 ext4 + extents xfs nb of fragments | 20297 19841234 elapsed time| 2m18.914s0m27.429s 0m0.892s | blks read | 225624 25432592 blks written| 52120 50664872 More details on our web site: http://www.bullopensource.org/ext4/20070404/FileDeletion.html Ah, one thing that is only mentioned in the URL is that the IO count is in units of 512-byte sectors. In the case of XFS doing logical journaling this avoids a huge amount of double writes to the journal and then to the filesystem. I still think ext4 could do better than it currently does. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext2/3 block remapping tool
On Apr 30, 2007 08:09 -0400, Theodore Tso wrote: On Fri, Apr 27, 2007 at 12:09:42PM -0600, Andreas Dilger wrote: I'd prefer that such functionality be integrated with Takashi's online defrag tool, since it needs virtually the same functionality. For that matter, this is also very similar to the block-mapped - extents tool from Aneesh. It doesn't make sense to have so many separate tools for users, especially if they start interfering with each other (i.e. defrag undoes the remapping done by your tool). While we're at it, someone want to start thinking about on-line shrinking of ext4 filesystems? Again, the same block remapping interfaces for defrag and file access optimizations should also be useful for shrinking filesystems (even if some of the files that need to be relocated are being actively used). If not, that probably means we got the interface wrong. Except one other issue with online shrinking is that we need to move inodes on occasion and this poses a bunch of other problems over just remapping the data blocks. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e2fsprogs-1.39-tyt3
On Apr 30, 2007 11:22 -0400, Theodore Ts'o wrote: One concern I still have is the fact that we're exposing a lot of interfaces in libext2fs.so which are very specifically tied to the current 48-bit physical/32-bit logical on-disk extent data structure. If/when we add support for the 64/64-bit extent data structure, and some kind of compressed/bit-packed extent format, those interfaces will have to be extended. Some other comments: - it appears you have all of extents.c copied into both block.c and in extent.c - it appears you are missing a couple of extent sanity checks for headers and such that were added more recently, but those might also have been removed by you - do the patches as they stand actually fix the duplicate block test cases yet? It seems all of the BLOCK_CHANGED handling was removed from block_iterate_extents(). - it would be easier to read if the patches were generated with diff -up (can be set in .quiltrc via 'export QUILT_DIFF_OPTS=-up'. I also like export QUILT_NO_DIFF_TIMESTAMPS=1 to avoid tons of gratuitous changes to patches when they are refreshed. Another problem is that while external extent blocks are getting byte swapped, there is no byte swapping going on for the extents stored in the inode. I haven't tried it on a big endian system, such as Power machine, but I'm pretty sure it's going to blow up spectacularly. Ah, interesting. We have 64-bit machines for testing, but no big-endian ones. The patches can be found at: ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/e2fsprogs-1.39-tyt3 What else I think is important to get into this patch series is a change to mkfs.ext4 so that the default inode size is increased to 256. I'm not sure how/if this can be done via mke2fs.conf. I also wouldn't object to changing the default inode_ratio on large filesystems to be a lot fewer than 1/8192 bytes. At a minimum, allowing an inode_ratio 1/4MB should be allowed. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e2fsprogs-1.39-tyt3
On May 01, 2007 11:05 -0400, Theodore Tso wrote: Speaking of SCM's, how wedded is Clustrefs to mercurial? I've been considering migrating e2fsprogs development to git, but one of the reasons I haven't is because I was concerned that might be incovenient for you folks. Not at all, we just generate a 1.39 to 1.39-WIP patch, and then use quilt + CVS to manage the resulting patches. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext2/3 block remapping tool
On May 01, 2007 11:28 -0400, Theodore Tso wrote: On Tue, May 01, 2007 at 12:01:42AM -0600, Andreas Dilger wrote: Except one other issue with online shrinking is that we need to move inodes on occasion and this poses a bunch of other problems over just remapping the data blocks. Well, I did say necessary, and not sufficient. But yes, moving inodes, especially if the inode is currently open gets interesting. I don't think there are that many user space applications that would notice or care if the st_ino of an open file changed out from under them, but there are obviously userspace applications, such as tar, that would most definitely care. I think rm -r does a LOT of this kind of operation, like: stat(.); stat(foo); chdir(foo); stat(.); unlink(*); chdir(..); stat(.) I think find does the same to avoid security problems with malicious path manipulation. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On May 02, 2007 20:57 +1000, David Chinner wrote: On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: HSM_READ is definitely _NOT_ required because all it means is if the file is OFFLINE, bring it ONLINE and then return the extent map. You've got the definition of HSM_READ wrong. If the flag is *not* set, then we bring everything back online and return the full extent map. Specifying the flag indicates that we do *not* want the offline extents brought back online. i.e. it is a HSM or a datamover (e.g. backup program) that is querying the extents and we want to known *exactly* what the current state of the file is right now. So, if the HSM_READ flag is set, then the application is expecting the filesytem to be part of a HSM. Hence if it's not, it should return an error because somebody has done something wrong. In my original proposal I specifically pointed out that the FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the HSM_READ flag is set. That's why the flag is called HSM_READ instead of HSM_NO_READ. The reason is that it seems bad if the default behaviour for calling ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is only disabled by specifying a flag. It makes a lot more sense to just leave the data as it is and return the extent mapping by default (i.e. this is the principle of least surprise). It would probably be equally surprising and undesirable if the default behaviour was to force all data out to HSM. For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should even be a part of this interface? I have no problem with returning a flag that reports if the data is migrated to HSM and whether it is UNMAPPED. Having FIEMAP force the retrieval of data from HSM strikes me as something that should be a part of a separate HSM interface, which also needs to be able to do things like push specific files or parts thereof out to HSM, set the aging policy, and return information like where does the HSM file live and how many copies are there. Do you know the reasoning behind including this into XFS_IOC_GETBMAPX? Looking at the bmap.c comments it appears it is simply because the API isn't able to return something like UNMAPPED|HSM_RESIDENT to indicate there is data in HSM but it has no blocks allocated in the filesystem. I don't think it makes the operation significantly more efficient than say ioctl(DMAPI_FORCE_READ); ioctl(FIEMAP) if an application actually needs the data to be present instead of just returning mapping info that includes UNMAPPED. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] fallocate system call
On May 02, 2007 18:23 +0530, Amit K. Arora wrote: On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote: On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote: For FA_ALLOCATE, it's supposed to change the file size if we allocate past EOF, right? I would argue no. Use truncate for that. The patch I posted for ext4 *does* change the filesize after preallocation, if required (i.e. when preallocation is after EOF). I may have to change that, if we decide on not doing this. I think I'd agree - it may be useful to allow preallocation beyond EOF for some kinds of applications (e.g. PVR preallocating live TV in 10 minute segments or something, but not knowing in advance how long the show will actually be recorded or the final encoded size). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 256)
On May 06, 2007 21:40 -0700, Andrew Morton wrote: On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen [EMAIL PROTECTED] wrote: 2.6.20.6, FC4: I created a 91248k ext3 fs with 4k blocksize: Next, I tried to resize it to about 3G using ext2online while mounted: At that time the kernel said: |JBD: ext2online wants too many credits (744 256) What is the limitation I should be aware of? Has it something to do with the journal log size? Yes, for very small filesystems the default journal size is only 4MB, and the maximum transaction size is 1MB (256 blocks). If the filesystem was 128MB or larger in the initial mke2fs then the journal would be 16MB and the resize would get up to 1024 blocks for the transaction. I'm not sure what the right solution is for this. If you know you will be resizing the fs you could increase the initial journal size at mke2fs time (-J size=16). Alternately, it is possible resize to an intermediate size and then delete and recreate the journal via tune2fs (which would be the larger size by default) but that can only be done offline. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On May 03, 2007 21:31 -0700, Andrew Morton wrote: On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: + * ext4_fallocate: + * preallocate space for a file + * mode is for future use, e.g. for unallocating preallocated blocks etc. + */ This description is rather thin. What is the filesystem's actual behaviour here? If the file is using extents then the implementation will do something. If the file is using bitmaps then we will do something else. But what? Here is where it should be described. My understanding is that glibc will handle zero-filling of files for filesystems that do not support fallocate(). +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + handle_t *handle; + ext4_fsblk_t block, max_blocks; + int ret, ret2, nblocks = 0, retries = 0; + struct buffer_head map_bh; + unsigned int credits, blkbits = inode-i_blkbits; + + /* Currently supporting (pre)allocate mode _only_ */ + if (mode != FA_ALLOCATE) + return -EOPNOTSUPP; + + if (!(EXT4_I(inode)-i_flags EXT4_EXTENTS_FL)) + return -ENOTTY; So we don't implement fallocate on bitmap-based files! Well that's huge news. The changelog would be an appropriate place to communicate this, along with reasons why, or a description of the plan to fix it. Also, posix says nothing about fallocate() returning ENOTTY. I _think_ this is to convince glibc to do the zero-filling in userspace, but I'm not up on the API specifics. + block = offset blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) blkbits) +- block; + mutex_lock(EXT4_I(inode)-truncate_mutex); + credits = ext4_ext_calc_credits_for_insert(inode, NULL); + mutex_unlock(EXT4_I(inode)-truncate_mutex); Now I'm mystified. Given that we're allocating an arbitrary amount of disk space, and that this disk space will require an arbitrary amount of metadata, how can we work out how much journal space we'll be needing without at least looking at `len'? Good question. The uninitialized extent can cover up to 128MB with a single entry. If @path isn't specified, then ext4_ext_calc_credits_for_insert() function returns the maximum number of extents needed to insert a leaf, including splitting all of the index blocks. That would allow up to 43GB (340 extents/block * 128MB) to be preallocated, but it still needs to take the size of the preallocation into account (adding 3 blocks per 43GB - a leaf block, a bitmap block and a group descriptor). Also, since @path is not being given then truncate_mutex is not needed. + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + BUG_ON(!ret); BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and ext4_error() would be safer and more useful here. Ouch, not very friendly error handling. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 4] e2fsprogs: Add ext4migrate
On May 07, 2007 19:16 +0530, Aneesh Kumar K.V wrote: Theodore Tso wrote: Actually, changing the inode size is actually going to be more useful to a larger number of people, since they can use it today to support in-inode EA's. In addition, changing the inode size must be done off-line, whereas it's not so clear that off-line conversion to extents is the best way to go (more on that below). That way we don't need to add kernel code that convert ext3 inode to ext4 inode while defragmentation. Last time i looked at the online defrag code, i was not able to figure out a easy way to take indirect map inode as input, then defrag the same and write the result in extent map. I think the online defrag code _should_ be fairly easily made to handle defrag of block-mapped files, so long as it is always creating extent- mapped files in the end. All that should be necessary is to have the kernel read data from the block-mapped file and write it into the newly mapped blocks. That functionality needs to exist for a long time anyways to support upgrading the filesystem so it shouldn't be a huge burden. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 256)
On May 07, 2007 11:50 -0400, Theodore Tso wrote: On Mon, May 07, 2007 at 07:46:38AM -0700, [EMAIL PROTECTED] wrote: On Mon, 7 May 2007, Theodore Tso wrote: We're going to have to rethink the defaults of the journal size as it relates to on-line resizing, but there might not be a lot of great answers here. We'll also have to look at the on-line resizing code and see if there's a way to break up the resize operation into smaller transactions as a way of avoiding this problem --- but that would still leave the user stuck with a pathetically small 4M journal on a 3G filesystem. Clearly the right thing for resize2fs to do when it's doing an off-line resize is to just adjust the journal inode and make it bigger. I'll add that to my todo list, but of course, patches to do that would be gratefully accepted... For that matter, there was a paper from U. Wisconsin showing that having the journal in the middle of the filesystem gave noticably better performance due to lower average seek times. If anyone is looking at messing with the journal that is probably also a good and easy place to start (e.g. may be as easy as just giving a goal block of s_blocks_count / 2 to the journal create code). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults
On May 07, 2007 14:52 -0500, Eric Sandeen wrote: Theodore Tso wrote: How likely do you think the case will be that mke2fs.conf would be missing? I'm trying to figure out how high priority of an item this really is. Well, not too likely, although for some reason I guess it happened in the installer root in FC6 or so. That's what raised the issue. Ah, good point - there are probably lots of installers and rescue disks that grab mke2fs but don't know anything about /etc/mke2fs.conf. We could enhance the profile code so that it could read in the profile from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs, but that adds bloat --- the question is how necessary do we think that really is? I guess it doesn't really sound *necessary* - it's just that if we have 2 different defaults and they drift, it can be confusing... Since the shipped mke2fs.conf is only on the order of a few hundred bytes I don't think it is a huge issue to include it. I like the idea of it being compiled into the code and yet consistent with the default install. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On May 07, 2007 19:02 -0400, Jeff Garzik wrote: Andreas Dilger wrote: Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. Precisely /how/ do you avoid the zeroing issue, for extents? If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, otherwise the implementation is broken. In ext4 (as in XFS) there is a flag stored in the extent that tells if the extent is initialized or not. Reads from uninitialized extents will return zero-filled data, and writes that don't span the whole extent will cause the uninitialized extent to be split into a regular extent and one or two uninitialized extents (depending where the write is). My comment was just that the extent doesn't have to be explicitly zero filled on the disk, by virtue of the fact that the uninitialized flag will cause reads to return zero. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Creating a 32bit blocks filesystem.
On May 09, 2007 16:57 +0200, Valerie Clement wrote: Jose R. Santos wrote: I think this has more to do with the fact that I'm on a 32bit architecture and there are still a couple places where blocks are represented using unsigned long. I'm trying to get access to a 64bit arch to confirm this. Oh, I didn't catch that you use a 32-bit system. On 32-bit architectures, the page cache index size imposes a 16TB limit on the filesystem size (with 4KB blocksize). So you need a 64-bit system for your test. The mke2fs code should warn the user in this case that the filesystem will not be usable on 32-bit systems. I believe mke2fs already checks PAGE_SIZE in order to validate blocksize 4096 filesystem requests, and a simple check for sizeof(long) to see if it is a 32-bit system would be enough. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: commit only journal entries older than commit period?
On May 09, 2007 15:31 +0200, Leon Woestenberg wrote: this is something I have long wondered about but have been afraid to ask. When my system is chewing away on builds, the disk I/O write access pattern of my ext3 root filesystem (using CFQ, Intel SATA controller, hard disk) when visualized by GNOME System Monitor clearly shows a repetitive landscape of large peaks, 5 seconds apart, which not much activity inbetween. I understand that's due to the ex3 journal commit interval (defaults to 5 seconds). But why isn't the filesystem continuously committing only that part of the journal that is older than 5 seconds? I would then expect the write requests to be smoothened over time, which can only be good in terms of performance and low latency. The journal cannot start checkpoint of blocks to the filesystem until the transaction is committed, otherwise a crash would leave the filesystem inconsistent. After the transaction is committed the blocks are written into the fs and I'd guess that this can happen quickly enough that it only takes a fraction of a second. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On May 09, 2007 21:31 +0530, Amit K. Arora wrote: 2) For FA_UNALLOCATE mode, should the file system allow unallocation of normal (non-preallocated) blocks (blocks allocated via regular write/truncate operations) also (i.e. work as punch()) ? - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still we need to finalize on the convention here as a general guideline to all the filesystems that implement fallocate. I would only allow this on FA_ALLOCATE extents. That means it won't be possible to do this for filesystems that don't understand unwritten extents unless there are blocks allocated beyond EOF. 3) If above is true, the file size will need to be changed for unallocation when block holding the EOF gets unallocated. - If we do not unallocate normal (non-preallocated) blocks and we do not change the file size on preallocation, then this is a non-issue. Not necessarily. That will just make the file sparse. If FA_ALLOCATE does not change the file size, why should FA_UNALLOCATE. 4) Should we update mtime ctime on a successfull allocation/ unallocation ? I would say yes. If glibc does the fallback fallocate via write() the mtime/ctime will be updated, so it makes sense to be consistent for both methods. Also, it just makes sense from the this file was modified point of view. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext4 devel interlock meeting minutes (May 7, 2007)
On May 09, 2007 10:02 -0700, Avantika Mathur wrote: Metablock Groups: - Mingming mentioned that in the metablock group feature, the inode metadata is not moved with the rest of the metadata. Ted will check this, as he thought this had been implemented. - Online resize does not use metablock groups, and is currently limited to 2TB. - Also need to remove sanity checks in the mount code - Once these issues are fixed, the metablock group feature can be turned on by default for ext4 filesysytems In my recent investigation of META_BG, it appears that BOTH the kernel and e2fsprogs do NOT handle group metadata (bitmaps, itable) outside the group. In the kernel ext[234]_check_descriptors() requires the bitmaps and itable to be inside the group even though META_BG is supported for a long time. In e2fsck/super.c check_super_block() also requires that the bitmaps and itable be inside the group even though e2fsck has claimed META_BG for a long time. That means it is virtually impossible to allow the group metadata to move under the META_BG feature. It might be possible to predicate this change under both 64BIT and META_BG, though this is somewhat non-standard. Alternately, we could allow the relocation of bitmaps and itables under BIG_BG, which is itself covers the relocation aspects and also allows the ability to change the group size. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html