Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
--On 18 April 2007 6:21:39 PM -0600 Andreas Dilger [EMAIL PROTECTED] wrote:

Below is an aggregation of the comments in this thread:

struct fiemap_extent {
	__u64 fe_start;		/* starting offset in bytes */
	__u64 fe_len;		/* length in bytes */
	__u32 fe_flags;		/* FIEMAP_EXTENT_* flags for this extent */
	__u32 fe_lun;		/* logical storage device number in array */
};

struct fiemap {
	__u64 fm_start;		/* logical start offset of mapping (in/out) */
	__u64 fm_len;		/* logical length of mapping (in/out) */
	__u32 fm_flags;		/* FIEMAP_FLAG_* flags for request (in/out) */
	__u32 fm_extent_count;	/* number of extents in fm_extents (in/out) */
	__u64 fm_unused;
	struct fiemap_extent fm_extents[0];
};

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC	0x0001	/* flush delalloc data to disk */
#define FIEMAP_FLAG_HSM_READ	0x0002	/* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT	0xff00	/* must understand these flags */

/* flags for the returned extents */
#define FIEMAP_EXTENT_HOLE	0x0001	/* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN	0x0002	/* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN	0x0004	/* in use, location unknown */
#define FIEMAP_EXTENT_ERROR	0x0008	/* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT	0x0010	/* no direct data access */

SUMMARY OF CHANGES
==================
- use fm_* fields directly in request instead of making it a fiemap_extent
  (though they are laid out identically)

I much prefer that - it makes it a lot clearer to me to have fiemap_extent just for fm_extents (no different meanings now). (Don't like the word offset in the comment without physical or some such, but whatever ;-) I also prefer the flags as separate fields too :) --Tim
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7
On Thu, Apr 19 2007, Valerie Clement wrote: Jens Axboe wrote: Please tell me how you are running ffsb, and also please include a dmesg from a booted system. Hi, our mails crossed! Please see my response to Andrew. You could reproduce the problem with the dd command as suggested; it's easier. I'm sending you the dmesg info. For my tests I used the scsi sdc device. Thanks, it does. Can you try one thing for me? If you run the test on sdc, try doing: # echo 64 > /sys/block/sdc/queue/iosched/quantum and repeat the test. -- Jens Axboe
[PATCH] e2fsprogs: Offsets of EAs in inode need not be sorted
Hi, this patch removes a code snippet from check_ea_in_inode() in pass1 which checks whether the EA values in the inode are sorted. The comments in fs/ext*/xattr.c state that the EA values in the external EA block are sorted, but those in the inode need not be. I have also attached a test image which has unsorted EAs in the inodes; the current e2fsck wrongly clears the EAs in the inode.

Signed-off-by: Kalpak Shah [EMAIL PROTECTED]

Index: e2fsprogs-1.40/e2fsck/pass1.c
===================================================================
--- e2fsprogs-1.40.orig/e2fsck/pass1.c
+++ e2fsprogs-1.40/e2fsck/pass1.c
@@ -246,7 +246,7 @@ static void check_ea_in_inode(e2fsck_t c
 	struct ext2_inode_large *inode;
 	struct ext2_ext_attr_entry *entry;
 	char *start, *end;
-	unsigned int storage_size, remain, offs;
+	unsigned int storage_size, remain;
 	int problem = 0;
 
 	inode = (struct ext2_inode_large *) pctx->inode;
@@ -261,7 +261,6 @@ static void check_ea_in_inode(e2fsck_t c
 
 	/* take finish entry 0UL into account */
 	remain = storage_size - sizeof(__u32);
-	offs = end - start;
 
 	while (!EXT2_EXT_IS_LAST_ENTRY(entry)) {
@@ -285,15 +284,6 @@ static void check_ea_in_inode(e2fsck_t c
 			goto fix;
 		}
 
-		/* check value placement */
-		if (entry->e_value_offs +
-		    EXT2_XATTR_SIZE(entry->e_value_size) != offs) {
-			printf("(entry->e_value_offs + entry->e_value_size: %d, offs: %d)\n", entry->e_value_offs + entry->e_value_size, offs);
-			pctx->num = entry->e_value_offs;
-			problem = PR_1_ATTR_VALUE_OFFSET;
-			goto fix;
-		}
-
 		/* e_value_block must be 0 in inode's ea */
 		if (entry->e_value_block != 0) {
 			pctx->num = entry->e_value_block;
@@ -309,7 +299,6 @@ static void check_ea_in_inode(e2fsck_t c
 		}
 
 		remain -= entry->e_value_size;
-		offs -= EXT2_XATTR_SIZE(entry->e_value_size);
 
 		entry = EXT2_EXT_ATTR_NEXT(entry);
 	}

Thanks, Kalpak Shah. [EMAIL PROTECTED] foo.img.gz Description: GNU Zip compressed data
Re: Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7
On Thu, Apr 19 2007, Jens Axboe wrote: On Thu, Apr 19 2007, Valerie Clement wrote: Jens Axboe wrote: Please tell me how you are running ffsb, and also please include a dmesg from a booted system. Hi, our mails crossed! Please see my response to Andrew. You could reproduce the problem with the dd command as suggested; it's easier. I'm sending you the dmesg info. For my tests I used the scsi sdc device. Thanks, it does. Can you try one thing for me? If you run the test on sdc, try doing: # echo 64 > /sys/block/sdc/queue/iosched/quantum and repeat the test.

And then try this one as well (and don't tweak quantum for that kernel):

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b6491c0..9e37971 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -986,9 +986,9 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if ((!cfq_cfqq_sync(cfqq) &&
+	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqd->dispatch_slice >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
-	    cfq_class_idle(cfqq)) {
+	    cfq_class_idle(cfqq))) {
 		cfqq->slice_end = jiffies + 1;
 		cfq_slice_expired(cfqd, 0, 0);
 	}
@@ -1051,19 +1051,21 @@ cfq_dispatch_requests(request_queue_t *q, int force)
 	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
 		int max_dispatch;
 
-		/*
-		 * Don't repeat dispatch from the previous queue.
-		 */
-		if (prev_cfqq == cfqq)
-			break;
+		if (cfqd->busy_queues > 1) {
+			/*
+			 * Don't repeat dispatch from the previous queue.
+			 */
+			if (prev_cfqq == cfqq)
+				break;
 
-		/*
-		 * So we have dispatched before in this round, if the
-		 * next queue has idling enabled (must be sync), don't
-		 * allow it service until the previous have continued.
-		 */
-		if (cfqd->rq_in_driver && cfq_cfqq_idle_window(cfqq))
-			break;
+			/*
+			 * So we have dispatched before in this round, if the
+			 * next queue has idling enabled (must be sync), don't
+			 * allow it service until the previous have continued.
+			 */
+			if (cfqd->rq_in_driver && cfq_cfqq_idle_window(cfqq))
+				break;
+		}
 
 		cfq_clear_cfqq_must_dispatch(cfqq);
 		cfq_clear_cfqq_wait_request(cfqq);
@@ -1370,7 +1372,9 @@ retry:
 		atomic_set(&cfqq->ref, 0);
 		cfqq->cfqd = cfqd;
 
-		cfq_mark_cfqq_idle_window(cfqq);
+		if (key != CFQ_KEY_ASYNC)
+			cfq_mark_cfqq_idle_window(cfqq);
+
 		cfq_mark_cfqq_prio_changed(cfqq);
 		cfq_mark_cfqq_queue_new(cfqq);
 		cfq_init_prio_data(cfqq);

-- Jens Axboe
Re: Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7
Jens Axboe wrote: On Thu, Apr 19 2007, Valerie Clement wrote: Jens Axboe wrote: Please tell me how you are running ffsb, and also please include a dmesg from a booted system. Hi, our mails crossed! Please see my response to Andrew. You could reproduce the problem with the dd command as suggested; it's easier. I'm sending you the dmesg info. For my tests I used the scsi sdc device. Thanks, it does. Can you try one thing for me? If you run the test on sdc, try doing: # echo 64 > /sys/block/sdc/queue/iosched/quantum and repeat the test. OK, that's done. With the change of quantum, the throughput scores are now a little bit better in 2.6.21 than in 2.6.20. Valérie
Re: Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7
On Thu, Apr 19 2007, Valerie Clement wrote: Jens Axboe wrote: On Thu, Apr 19 2007, Valerie Clement wrote: Jens Axboe wrote: Please tell me how you are running ffsb, and also please include a dmesg from a booted system. Hi, our mails crossed! Please see my response to Andrew. You could reproduce the problem with the dd command as suggested; it's easier. I'm sending you the dmesg info. For my tests I used the scsi sdc device. Thanks, it does. Can you try one thing for me? If you run the test on sdc, try doing: # echo 64 > /sys/block/sdc/queue/iosched/quantum and repeat the test. OK, that's done. With the change of quantum, the throughput scores are now a little bit better in 2.6.21 than in 2.6.20. Wonderful, now try the patch I sent in the next mail and repeat the test. -- Jens Axboe
Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4
On Sun, 2007-04-15 at 10:16 -0600, Andreas Dilger wrote: Just a quick note before I forget. I thought there was a call in ext4 to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has more than 2^32 blocks? A question about the online resize case: if the fs is grown to more than 2^32 blocks, we should set JBD2_FEATURE_INCOMPAT_64BIT in the journal. What about existing transactions that still store 32-bit block numbers? I guess the journal needs to commit them all so that revoke will not get confused about the bits for block numbers later. After that is done, JBD2 can set this feature safely. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
[PATCH] fix up lazy_bg bitmap initialization at mkfs time
While trying out the -O lazy_bg option, I ran into some trouble on my big filesystem. The journal size was larger than the free blocks in the first block group, so it spilled into the next bg with available blocks. Since we are using lazy_bg here, that -should- have been the last block group. But when setup_lazy_bg() marks block groups as UNINIT, it doesn't do anything with the bitmaps (as designed). However, the block allocation routine simply searches the bitmap for the next available blocks, and finds them in the 2nd bg despite it being marked UNINIT - the summaries aren't checked during allocation. This also caused the 1st group's free block counts to get out of whack, as we start subtracting from zero:

Group 0: block bitmap at 1025, inode bitmap at 1026, inode table at 1027
         0 free blocks, 16373 free inodes, 2 used directories
Group 1: block bitmap at 33793, inode bitmap at 33794, inode table at 33795
         63957 free blocks, 0 free inodes, 0 used directories
         [Inode not init, Block not init]
Group 2: block bitmap at 65536, inode bitmap at 65537, inode table at 65538
         0 free blocks, 0 free inodes, 0 used directories
         [Inode not init, Block not init]

The following patch seems to fix this up for me: just mark the in-memory bitmaps as full for any bgs we flag as UNINIT. The bitmaps aren't marked as dirty, so they won't be written out. When the bitmaps are re-read on the next invocation of debugfs, etc., the UNINIT flag will be found, and again the in-memory bitmaps will be marked as full. This has the somewhat interesting, but correct, result of making the journal blocks land in both the first and last bgs of a 16T filesystem :)

BLOCKS:
(0-11):1520-1531, (IND):1532, (12-1035):1533-2556, ... (IND):4194272286, (31756-32768):4194272287-419427329

Unfortunately it also increases mkfs time a bit, as it must search a huge string of unavailable blocks if it has to allocate in the last bg. Ah well...
Thanks, -Eric

Signed-off-by: Eric Sandeen [EMAIL PROTECTED]

Index: e2fsprogs-1.39_ext4_hg/misc/mke2fs.c
===================================================================
--- e2fsprogs-1.39_ext4_hg.orig/misc/mke2fs.c
+++ e2fsprogs-1.39_ext4_hg/misc/mke2fs.c
@@ -450,16 +450,22 @@ static void setup_lazy_bg(ext2_filsys fs
 	int blks;
 	struct ext2_super_block *sb = fs->super;
 	struct ext2_group_desc *bg = fs->group_desc;
+	char *block_bitmap = fs->block_map->bitmap;
+	char *inode_bitmap = fs->inode_map->bitmap;
+	int block_nbytes = (int) EXT2_BLOCKS_PER_GROUP(fs->super) / 8;
+	int inode_nbytes = (int) EXT2_INODES_PER_GROUP(fs->super) / 8;
 
 	if (EXT2_HAS_COMPAT_FEATURE(fs->super, EXT2_FEATURE_COMPAT_LAZY_BG)) {
 		for (i = 0; i < fs->group_desc_count; i++, bg++) {
 			if ((i == 0) || (i == fs->group_desc_count-1))
-				continue;
+				goto skip;
 			if (bg->bg_free_inodes_count == sb->s_inodes_per_group) {
 				bg->bg_free_inodes_count = 0;
+				/* NB: set in mem only, see also read_bitmaps */
+				memset(inode_bitmap, 0xff, inode_nbytes);
 				bg->bg_flags |= EXT2_BG_INODE_UNINIT;
 				sb->s_free_inodes_count -= sb->s_inodes_per_group;
@@ -467,9 +473,13 @@ static void setup_lazy_bg(ext2_filsys fs
 			blks = ext2fs_super_and_bgd_loc(fs, i, 0, 0, 0, 0);
 			if (bg->bg_free_blocks_count == blks) {
 				bg->bg_free_blocks_count = 0;
+				memset(block_bitmap, 0xff, block_nbytes);
 				bg->bg_flags |= EXT2_BG_BLOCK_UNINIT;
 				sb->s_free_blocks_count -= blks;
 			}
+skip:
+			block_bitmap += block_nbytes;
+			inode_bitmap += inode_nbytes;
 		}
 	}
 }
Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4
On Apr 19, 2007 12:15 -0700, Mingming Cao wrote: On Sun, 2007-04-15 at 10:16 -0600, Andreas Dilger wrote: Just a quick note before I forget. I thought there was a call in ext4 to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has more than 2^32 blocks? A question about the online resize case: if the fs is grown to more than 2^32 blocks, we should set JBD2_FEATURE_INCOMPAT_64BIT in the journal. What about existing transactions that still store 32-bit block numbers? I guess the journal needs to commit them all so that revoke will not get confused about the bits for block numbers later. After that is done, JBD2 can set this feature safely.

Well, there are two options here:

1) refuse resizing filesystems beyond 16TB
   - this is required if they were not formatted as ext4 to start with, as the group descriptors will not be large enough to handle the _hi word in the bitmap/inode table locations
   - this is also a problem for block-mapped files that need to allocate blocks beyond 16TB (though this could just fail on those files with e.g. ENOSPC or EFBIG or something similar)

2) flush the journal (like ext4_write_super_lockfs()) while resizing beyond 16TB. This would also require changing over to META_BG at some point, because there cannot be enough reserved group descriptor blocks (the resize_inode is set up for a maximum of 2TB filesystems I think)

For now I'd be happy with just setting the JBD2_*_64BIT flag at mount for filesystems > 16TB, and refusing resize across 16TB. We can fix it later. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4
On Apr 19, 2007 17:41 -0700, Mingming Cao wrote: Any concerns about turning on META_BG by default for all new ext4 filesystems? Initially I thought we only needed META_BG to support > 256TB, so there was no rush to turn it on for all new filesystems. But it appears there are multiple benefits to enabling META_BG by default:

I would prefer not to have it default for the first 1TB or so of the filesystem. One reason is that using META_BG for all of the groups gives us only 2 backups of each group descriptor, and those are relatively close together. In the first 1TB we would get 17 backups of the group descriptors, which should be plenty.

- enable online resize > 2TB

Actually, I don't think the current online resize code supports META_BG. There was a patch last year by Glauber de Oliveira Costa which added support for online resizing with META_BG, which would need to be updated to work with ext4. Also, the usage of s_first_meta_bg in that patch is incorrect.

- support > 256TB fs

True, though not exactly pressing, and filesystems can be changed to add META_BG support at any point.

- Since metadata (bitmaps, group descriptors, etc.) is no longer placed at the beginning of each block group, the 128MB limit (block group size with 4k block size) that used to limit an extent's size is removed.

- Speed up fsck since metadata is placed closely together.

That isn't really true, even though descriptions of META_BG say this. There will still be block and inode bitmaps and the inode table. The ext3 code was missing support for moving the bitmaps/itable outside their respective groups, and that has not been fixed yet in ext4. The problem is that ext4_check_descriptors() in the kernel was never changed to support META_BG, so it does not allow the bitmaps or inode table to be outside the group. Similarly, ext2fs_group_first_block() and ext2fs_group_last_block() in lib/ext2fs also don't take META_BG into account.
Also, since the extent format supports at most 2^15 blocks (128MB) per extent, it doesn't really make much difference in that regard, though it does help the allocator somewhat because it has more contiguous space to allocate from.

So I am wondering, why not make it the default?

It wouldn't be too hard to add in support for this, I think, and there is definitely some benefit. Since neither e2fsprogs nor the kernel handles this correctly, the placement of bitmaps and inode tables outside of their respective groups may as well be a separate feature. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.