Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-19 Thread Timothy Shimmin

--On 18 April 2007 6:21:39 PM -0600 Andreas Dilger [EMAIL PROTECTED] wrote:


> Below is an aggregation of the comments in this thread:
>
> struct fiemap_extent {
>     __u64 fe_start; /* starting offset in bytes */
>     __u64 fe_len;   /* length in bytes */
>     __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
>     __u32 fe_lun;   /* logical storage device number in array */
> };
>
> struct fiemap {
>     __u64 fm_start; /* logical start offset of mapping (in/out) */
>     __u64 fm_len;   /* logical length of mapping (in/out) */
>     __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
>     __u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
>     __u64 fm_unused;
>     struct fiemap_extent fm_extents[0];
> };
>
> /* flags for the fiemap request */
> #define FIEMAP_FLAG_SYNC         0x0001  /* flush delalloc data to disk */
> #define FIEMAP_FLAG_HSM_READ     0x0002  /* retrieve data from HSM */
> #define FIEMAP_FLAG_INCOMPAT     0xff00  /* must understand these flags */
>
> /* flags for the returned extents */
> #define FIEMAP_EXTENT_HOLE       0x0001  /* no space allocated */
> #define FIEMAP_EXTENT_UNWRITTEN  0x0002  /* uninitialized space */
> #define FIEMAP_EXTENT_UNKNOWN    0x0004  /* in use, location unknown */
> #define FIEMAP_EXTENT_ERROR      0x0008  /* error mapping space */
> #define FIEMAP_EXTENT_NO_DIRECT  0x0010  /* no direct data access */
>
> SUMMARY OF CHANGES
> ==================
> - use fm_* fields directly in request instead of making it a fiemap_extent
>   (though they are laid out identically)


I much prefer that - it makes it a lot clearer to me to have fiemap_extent
just for fm_extents (no different meanings now).
(I don't like the word "offset" in the comment without "physical" or some
such, but whatever ;-)
I also prefer the flags as separate fields too :)
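
For illustration only, here is a minimal sketch of how a caller might size a request and walk the returned extents under the draft layout quoted above. The structure definitions are copied from the RFC text, not from any merged kernel header; uint64_t/uint32_t stand in for __u64/__u32, and fiemap_request_size() and fiemap_allocated_bytes() are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

/* Draft layout from the RFC text (an assumption, not a merged header). */
struct fiemap_extent {
	uint64_t fe_start;	/* starting offset in bytes */
	uint64_t fe_len;	/* length in bytes */
	uint32_t fe_flags;	/* FIEMAP_EXTENT_* flags for this extent */
	uint32_t fe_lun;	/* logical storage device number in array */
};

struct fiemap {
	uint64_t fm_start;	/* logical start offset of mapping */
	uint64_t fm_len;	/* logical length of mapping */
	uint32_t fm_flags;	/* FIEMAP_FLAG_* flags for request */
	uint32_t fm_extent_count; /* number of extents in fm_extents */
	uint64_t fm_unused;
	struct fiemap_extent fm_extents[]; /* trailing extent array */
};

#define FIEMAP_EXTENT_HOLE	0x0001	/* no space allocated */

/* Hypothetical helper: bytes needed for a request holding n extents. */
static size_t fiemap_request_size(uint32_t n)
{
	return sizeof(struct fiemap) + (size_t)n * sizeof(struct fiemap_extent);
}

/* Hypothetical helper: sum the lengths of all extents that are not holes. */
static uint64_t fiemap_allocated_bytes(const struct fiemap_extent *ext,
				       uint32_t count)
{
	uint64_t total = 0;
	uint32_t i;

	for (i = 0; i < count; i++) {
		if (ext[i].fe_flags & FIEMAP_EXTENT_HOLE)
			continue;	/* holes have no space allocated */
		total += ext[i].fe_len;
	}
	return total;
}
```

With the request fields kept in struct fiemap, the extent array carries only per-extent data, so a loop like this never has to special-case the first element.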

--Tim
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7

2007-04-19 Thread Jens Axboe
On Thu, Apr 19 2007, Valerie Clement wrote:
> Jens Axboe wrote:
> > Please tell me how you are running ffsb, and also please include a
> > dmesg from a booted system.
>
> Hi,
> our mails crossed! Please see my response to Andrew.
> You could reproduce the problem with the dd command as suggested; it's
> easier.
> I'm sending you the dmesg info. For my tests I used the SCSI sdc device.

Thanks, it does. Can you try one thing for me? If you run the test on
sdc, try doing:

# echo 64 > /sys/block/sdc/queue/iosched/quantum

and repeat the test.

-- 
Jens Axboe



[PATCH] e2fsprogs: Offsets of EAs in inode need not be sorted

2007-04-19 Thread Kalpak Shah
Hi,

This patch removes a code snippet from check_ea_in_inode() in pass1 which 
checks if the EA values in the inode are sorted or not. The comments in 
fs/ext*/xattr.c state that the EA values in the external EA block are sorted 
but those in the inode need not be sorted. I have also attached a test image 
which has unsorted EAs in the inodes. The current e2fsck wrongly clears the EAs 
in the inode.
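
To illustrate the point, here is a minimal sketch of a per-entry check that does not depend on the order of the values. It is not the e2fsck code: struct ea_value and ea_values_in_bounds() are hypothetical, simplified stand-ins, and the only property checked is that each value lies inside the in-inode EA area.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an in-inode EA entry. */
struct ea_value {
	uint32_t offs;	/* value offset from the start of the EA area */
	uint32_t size;	/* value size in bytes */
};

/*
 * Return 1 if every value lies inside an EA area of storage_size bytes,
 * whatever order the values are stored in; return 0 otherwise.
 */
static int ea_values_in_bounds(const struct ea_value *v, size_t n,
			       uint32_t storage_size)
{
	size_t i;

	for (i = 0; i < n; i++) {
		/* Written this way so offs + size cannot overflow. */
		if (v[i].offs > storage_size ||
		    v[i].size > storage_size - v[i].offs)
			return 0;
	}
	return 1;
}
```

A running-offset check like the one removed by the patch accepts only one particular ordering, whereas a bounds check of this kind accepts any placement that stays inside the area.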

Signed-off-by: Kalpak Shah [EMAIL PROTECTED]

Index: e2fsprogs-1.40/e2fsck/pass1.c
===================================================================
--- e2fsprogs-1.40.orig/e2fsck/pass1.c
+++ e2fsprogs-1.40/e2fsck/pass1.c
@@ -246,7 +246,7 @@ static void check_ea_in_inode(e2fsck_t c
 	struct ext2_inode_large *inode;
 	struct ext2_ext_attr_entry *entry;
 	char *start, *end;
-	unsigned int storage_size, remain, offs;
+	unsigned int storage_size, remain;
 	int problem = 0;
 
 	inode = (struct ext2_inode_large *) pctx->inode;
@@ -261,7 +261,6 @@ static void check_ea_in_inode(e2fsck_t c
 
 	/* take finish entry 0UL into account */
 	remain = storage_size - sizeof(__u32);
-	offs = end - start;
 
 	while (!EXT2_EXT_IS_LAST_ENTRY(entry)) {
 
@@ -285,15 +284,6 @@ static void check_ea_in_inode(e2fsck_t c
 			goto fix;
 		}
 
-		/* check value placement */
-		if (entry->e_value_offs +
-		    EXT2_XATTR_SIZE(entry->e_value_size) != offs) {
-			printf("(entry->e_value_offs + entry->e_value_size: %d, offs: %d)\n", entry->e_value_offs + entry->e_value_size, offs);
-			pctx->num = entry->e_value_offs;
-			problem = PR_1_ATTR_VALUE_OFFSET;
-			goto fix;
-		}
-
 		/* e_value_block must be 0 in inode's ea */
 		if (entry->e_value_block != 0) {
 			pctx->num = entry->e_value_block;
@@ -309,7 +299,6 @@ static void check_ea_in_inode(e2fsck_t c
 		}
 
 		remain -= entry->e_value_size;
-		offs -= EXT2_XATTR_SIZE(entry->e_value_size);
 
 		entry = EXT2_EXT_ATTR_NEXT(entry);
 	}

Thanks,
Kalpak Shah.
[EMAIL PROTECTED]


foo.img.gz
Description: GNU Zip compressed data


Re: Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7

2007-04-19 Thread Jens Axboe
On Thu, Apr 19 2007, Jens Axboe wrote:
> On Thu, Apr 19 2007, Valerie Clement wrote:
> > Jens Axboe wrote:
> > > Please tell me how you are running ffsb, and also please include a
> > > dmesg from a booted system.
> >
> > Hi,
> > our mails crossed! Please see my response to Andrew.
> > You could reproduce the problem with the dd command as suggested; it's
> > easier.
> > I'm sending you the dmesg info. For my tests I used the SCSI sdc device.
>
> Thanks, it does. Can you try one thing for me? If you run the test on
> sdc, try doing:
>
> # echo 64 > /sys/block/sdc/queue/iosched/quantum
>
> and repeat the test.

And, then try this one as well (and don't tweak quantum for that
kernel):

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b6491c0..9e37971 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -986,9 +986,9 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if ((!cfq_cfqq_sync(cfqq) &&
+	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqd->dispatch_slice >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
-	    cfq_class_idle(cfqq)) {
+	    cfq_class_idle(cfqq))) {
 		cfqq->slice_end = jiffies + 1;
 		cfq_slice_expired(cfqd, 0, 0);
 	}
@@ -1051,19 +1051,21 @@ cfq_dispatch_requests(request_queue_t *q, int force)
 	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
 		int max_dispatch;
 
-		/*
-		 * Don't repeat dispatch from the previous queue.
-		 */
-		if (prev_cfqq == cfqq)
-			break;
+		if (cfqd->busy_queues > 1) {
+			/*
+			 * Don't repeat dispatch from the previous queue.
+			 */
+			if (prev_cfqq == cfqq)
+				break;
 
-		/*
-		 * So we have dispatched before in this round, if the
-		 * next queue has idling enabled (must be sync), don't
-		 * allow it service until the previous have continued.
-		 */
-		if (cfqd->rq_in_driver && cfq_cfqq_idle_window(cfqq))
-			break;
+			/*
+			 * So we have dispatched before in this round, if the
+			 * next queue has idling enabled (must be sync), don't
+			 * allow it service until the previous have continued.
+			 */
+			if (cfqd->rq_in_driver && cfq_cfqq_idle_window(cfqq))
+				break;
+		}
 
 		cfq_clear_cfqq_must_dispatch(cfqq);
 		cfq_clear_cfqq_wait_request(cfqq);
@@ -1370,7 +1372,9 @@ retry:
 		atomic_set(&cfqq->ref, 0);
 		cfqq->cfqd = cfqd;
 
-		cfq_mark_cfqq_idle_window(cfqq);
+		if (key != CFQ_KEY_ASYNC)
+			cfq_mark_cfqq_idle_window(cfqq);
+
 		cfq_mark_cfqq_prio_changed(cfqq);
 		cfq_mark_cfqq_queue_new(cfqq);
 		cfq_init_prio_data(cfqq);

-- 
Jens Axboe



Re: Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7

2007-04-19 Thread Valerie Clement

Jens Axboe wrote:
> On Thu, Apr 19 2007, Valerie Clement wrote:
> > Jens Axboe wrote:
> > > Please tell me how you are running ffsb, and also please include a
> > > dmesg from a booted system.
> >
> > Hi,
> > our mails crossed! Please see my response to Andrew.
> > You could reproduce the problem with the dd command as suggested; it's
> > easier.
> > I'm sending you the dmesg info. For my tests I used the SCSI sdc device.
>
> Thanks, it does. Can you try one thing for me? If you run the test on
> sdc, try doing:
>
> # echo 64 > /sys/block/sdc/queue/iosched/quantum
>
> and repeat the test.

OK, that's done.

With the change of quantum, the throughput scores are now a little bit 
better in 2.6.21 than in 2.6.20.


  Valérie



Re: Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7

2007-04-19 Thread Jens Axboe
On Thu, Apr 19 2007, Valerie Clement wrote:
> Jens Axboe wrote:
> > On Thu, Apr 19 2007, Valerie Clement wrote:
> > > Jens Axboe wrote:
> > > > Please tell me how you are running ffsb, and also please include a
> > > > dmesg from a booted system.
> > >
> > > Hi,
> > > our mails crossed! Please see my response to Andrew.
> > > You could reproduce the problem with the dd command as suggested; it's
> > > easier.
> > > I'm sending you the dmesg info. For my tests I used the SCSI sdc device.
> >
> > Thanks, it does. Can you try one thing for me? If you run the test on
> > sdc, try doing:
> >
> > # echo 64 > /sys/block/sdc/queue/iosched/quantum
> >
> > and repeat the test.
>
> OK, that's done.
>
> With the change of quantum, the throughput scores are now a little bit
> better in 2.6.21 than in 2.6.20.

Wonderful, now try the patch I sent in the next mail and repeat the
test.

-- 
Jens Axboe



Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4

2007-04-19 Thread Mingming Cao
On Sun, 2007-04-15 at 10:16 -0600, Andreas Dilger wrote:
> Just a quick note before I forget.  I thought there was a call in ext4
> to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has
> more than 2^32 blocks?

Question about the online resize case. If the fs is increased to more
than 2^32 blocks, we should set this JBD2_FEATURE_INCOMPAT_64BIT in the
journal. What about existing transactions that still store 32-bit block
numbers?  I guess the journal needs to commit them all so that revoke
will not get confused about the bits for block numbers later.  After
that is done, JBD2 can set this feature safely.


> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
 


[PATCH] fix up lazy_bg bitmap initialization at mkfs time

2007-04-19 Thread Eric Sandeen
While trying out the -O lazy_bg option, I ran into some trouble on my
big filesystem.  The journal size was > free blocks in the first block group,
so it spilled into the next bg with available blocks.  Since we are using
lazy_bg here, that -should- have been the last block group.  But, when
setup_lazy_bg() marks block groups as UNINIT, it doesn't do anything with
the bitmaps (as designed).  However, the block allocation routine simply
searches the bitmap for next available blocks, and finds them in the 2nd
bg, despite it being marked UNINIT - the summaries aren't checked during
allocation.  This also caused the 1st group free block numbers to get
out of whack, as we start subtracting from zero:

Group  0: block bitmap at 1025, inode bitmap at 1026, inode table at 1027
  0 free blocks, 16373 free inodes, 2 used directories
Group  1: block bitmap at 33793, inode bitmap at 33794, inode table at 33795
  63957 free blocks, 0 free inodes, 0 used directories
  [Inode not init, Block not init]
Group  2: block bitmap at 65536, inode bitmap at 65537, inode table at 65538
  0 free blocks, 0 free inodes, 0 used directories
  [Inode not init, Block not init]

The following patch seems to fix this up for me; just mark the in-memory
bitmaps as full for any bg's we flag as UNINIT.  The bitmaps aren't marked
as dirty, so they won't be written out.  When bitmaps are re-read on the next
invocation of debugfs, etc, the UNINIT flag will be found, and again
the in-memory bitmaps will be marked as full.
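
The effect can be sketched with a toy first-fit search over an in-memory bitmap. This is an illustration under simplifying assumptions, not the libext2fs allocator: find_first_zero() and mark_group_full() are hypothetical helpers, and bits_per_group is assumed to be a multiple of 8.

```c
#include <string.h>

/* Return the index of the first clear bit, or -1 if all nbits are set. */
static long find_first_zero(const unsigned char *map, long nbits)
{
	long i;

	for (i = 0; i < nbits; i++)
		if (!(map[i / 8] & (1u << (i % 8))))
			return i;
	return -1;
}

/* Mark one group's slice of the in-memory bitmap as fully in use. */
static void mark_group_full(unsigned char *map, long group,
			    long bits_per_group)
{
	long nbytes = bits_per_group / 8;

	memset(map + group * nbytes, 0xff, nbytes);
}
```

A search that only looks at the bitmap lands in the first group with a clear bit, so filling the slices of the UNINIT groups is what pushes the allocation on to the last group.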

This has the somewhat interesting, but correct, result of making the
journal blocks land in both the first and last bgs of a 16T filesystem:  :)

BLOCKS:
(0-11):1520-1531, (IND):1532, (12-1035):1533-2556, ... 
(IND):4194272286, (31756-32768):4194272287-419427329

Unfortunately it also increases mkfs time a bit, as it must search
a huge string of unavailable blocks if it has to allocate in the 
last bg.  Ah well...

Thanks,
-Eric

Signed-off-by: Eric Sandeen [EMAIL PROTECTED]

Index: e2fsprogs-1.39_ext4_hg/misc/mke2fs.c
===================================================================
--- e2fsprogs-1.39_ext4_hg.orig/misc/mke2fs.c
+++ e2fsprogs-1.39_ext4_hg/misc/mke2fs.c
@@ -450,16 +450,22 @@ static void setup_lazy_bg(ext2_filsys fs
 	int blks;
 	struct ext2_super_block *sb = fs->super;
 	struct ext2_group_desc *bg = fs->group_desc;
+	char *block_bitmap = fs->block_map->bitmap;
+	char *inode_bitmap = fs->inode_map->bitmap;
+	int block_nbytes = (int) EXT2_BLOCKS_PER_GROUP(fs->super) / 8;
+	int inode_nbytes = (int) EXT2_INODES_PER_GROUP(fs->super) / 8;
 
 	if (EXT2_HAS_COMPAT_FEATURE(fs->super,
 				    EXT2_FEATURE_COMPAT_LAZY_BG)) {
 		for (i = 0; i < fs->group_desc_count; i++, bg++) {
 			if ((i == 0) ||
 			    (i == fs->group_desc_count-1))
-				continue;
+				goto skip;
 			if (bg->bg_free_inodes_count ==
 			    sb->s_inodes_per_group) {
 				bg->bg_free_inodes_count = 0;
+				/* NB: set in mem only, see also read_bitmaps */
+				memset(inode_bitmap, 0xff, inode_nbytes);
 				bg->bg_flags |= EXT2_BG_INODE_UNINIT;
 				sb->s_free_inodes_count -=
 					sb->s_inodes_per_group;
@@ -467,9 +473,13 @@ static void setup_lazy_bg(ext2_filsys fs
 			blks = ext2fs_super_and_bgd_loc(fs, i, 0, 0, 0, 0);
 			if (bg->bg_free_blocks_count == blks) {
 				bg->bg_free_blocks_count = 0;
+				memset(block_bitmap, 0xff, block_nbytes);
 				bg->bg_flags |= EXT2_BG_BLOCK_UNINIT;
 				sb->s_free_blocks_count -= blks;
 			}
+skip:
+			block_bitmap += block_nbytes;
+			inode_bitmap += inode_nbytes;
 		}
 	}
 }




Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4

2007-04-19 Thread Andreas Dilger
On Apr 19, 2007  12:15 -0700, Mingming Cao wrote:
> On Sun, 2007-04-15 at 10:16 -0600, Andreas Dilger wrote:
> > Just a quick note before I forget.  I thought there was a call in ext4
> > to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has
> > more than 2^32 blocks?
>
> Question about the online resize case. If the fs is increased to more
> than 2^32 blocks, we should set this JBD2_FEATURE_INCOMPAT_64BIT in the
> journal. What about existing transactions that still store 32-bit block
> numbers?  I guess the journal needs to commit them all so that revoke
> will not get confused about the bits for block numbers later.  After
> that is done, JBD2 can set this feature safely.

Well, there are two options here:
1) refuse resizing filesystems beyond 16TB
   - this is required if they were not formatted as ext4 to start with, as
     the group descriptors will not be large enough to handle the _hi
     word in the bitmap/inode table locations
   - this is also a problem for block-mapped files that need to allocate
     blocks beyond 16TB (though this could just fail on those files with
     e.g. ENOSPC or EFBIG or something similar)
2) flush the journal (like ext4_write_super_lockfs()) while resizing beyond
   16TB.  This would also require changing over to META_BG at some point,
   because there cannot be enough reserved group descriptor blocks (the
   resize_inode is set up for a maximum of 2TB filesystems I think)

For now I'd be happy with just setting the JBD2_*_64BIT flag at mount for
filesystems > 16TB, and refusing resize across 16TB.  We can fix it later.
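
As a rough sketch of that interim policy, assuming the threshold is the "more than 2^32 blocks" figure quoted earlier in the thread (16TB with 4KB blocks): needs_64bit_journal() and resize_allowed() are hypothetical names, not ext4 or JBD2 functions.

```c
#include <stdint.h>

/* Block numbers stop fitting in 32 bits beyond this count (2^32). */
#define BLOCKS_32BIT_LIMIT	((uint64_t)1 << 32)

/* Hypothetical: would the journal need 64-bit block numbers? */
static int needs_64bit_journal(uint64_t blocks_count)
{
	return blocks_count > BLOCKS_32BIT_LIMIT;
}

/*
 * Hypothetical: refuse an online resize that would cross the limit.
 * A filesystem already above the limit is assumed to have had the
 * flag set at mount time, so growing it further is allowed here.
 */
static int resize_allowed(uint64_t old_blocks, uint64_t new_blocks)
{
	return needs_64bit_journal(old_blocks) ||
	       !needs_64bit_journal(new_blocks);
}
```

With 4096-byte blocks, 2^32 blocks is 2^44 bytes, which is where the 16TB figure comes from.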

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4

2007-04-19 Thread Andreas Dilger
On Apr 19, 2007  17:41 -0700, Mingming Cao wrote:
> Any concerns about turning on META_BG by default for all new ext4 fs?
> Initially I thought we only need META_BG to support 256TB, so there is
> no rush to turn it on for all the new fs. But it appears there are
> multiple benefits to enabling META_BG by default:

I would prefer not to have it default for the first 1TB or so of the
filesystem.  One reason is that using META_BG for all of the groups
gives us only 2 backups of each group descriptor, and those are relatively
close together.  In the first 1TB we would get 17 backups of the group
descriptors, which should be plenty.

> - enable online resize >2TB

Actually, I don't think the current online resize supports META_BG.
There was a patch last year by Glauber de Oliveira Costa which added
support for online resizing with META_BG, which would need to be updated
to work with ext4.  Also, the usage of s_first_meta_bg in that patch is
incorrect.

> - support 256TB fs

True, though not exactly pressing, and filesystems can be changed
to add META_BG support at any point.

> - Since metadata (bitmaps, group descriptors etc.) is not put at the
>   beginning of each block group anymore, the 128MB limit (block group size
>   with 4k block size) that used to limit an extent size is removed.
> - Speed up fsck since metadata is placed closely together.

That isn't really true, even though descriptions of META_BG say this.
There will still be block and inode bitmaps and the inode table.
The ext3 code was missing support for moving the bitmaps/itable outside
their respective groups, and that has not been fixed yet in ext4.

The problem is that ext4_check_descriptors() in the kernel was never
changed to support META_BG, so it does not allow the bitmaps or inode
table to be outside the group.  Similarly, ext2fs_group_first_block()
and ext2fs_group_last_block() in lib/ext2fs also don't take META_BG
into account.

Also, since the extent format supports at most 2^15 blocks (128MB) it
doesn't really make much difference in that regard, though it does help
the allocator somewhat because it has more contiguous space to allocate
from.
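
The two 128MB figures can be checked with a few lines of arithmetic. This is a sketch that assumes one block bitmap block per group (so 8 * block_size blocks per group) and the 2^15-block extent length stated above; the function names are made up for the example.

```c
#include <stdint.h>

/* One bitmap block holds one bit per block in the group. */
static uint64_t blocks_per_group(uint32_t block_size)
{
	return (uint64_t)block_size * 8;
}

/* Size in bytes of one block group. */
static uint64_t group_bytes(uint32_t block_size)
{
	return blocks_per_group(block_size) * block_size;
}

/* Largest extent in bytes, taking 2^15 blocks as the per-extent limit. */
static uint64_t max_extent_bytes(uint32_t block_size)
{
	return ((uint64_t)1 << 15) * block_size;
}
```

With 4096-byte blocks both values come out to 32768 * 4096 = 134217728 bytes, i.e. 128MB, which is why lifting the block group limit alone does not allow larger extents.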

> So I am wondering why not make it default?

It wouldn't be too hard to add in support for this I think, and there
is definitely some benefit.  Since neither e2fsprogs nor the kernel
handle this correctly, the placement of bitmaps and inode tables outside
of their respective groups may as well be a separate feature.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
