Re: [RFC] Ext3 online defrag

2006-10-26 Thread David Chinner
On Wed, Oct 25, 2006 at 11:33:16PM -0400, Theodore Tso wrote:
 On Thu, Oct 26, 2006 at 11:40:20AM +1000, David Chinner wrote:
  We don't need to expose anything filesystem specific to userspace to
  implement this.  Online data movement (i.e. the defrag mechanism)
  becomes something like:
  
  do {
  get_free_list(dst_fd, location, len, list)
  /* select extent to use */
  alloc_from_list(dst_fd, list[X], off, len)
  } while (ENOALLOC)
  move_data(src_fd, dst_fd, off, len);
  
  And this would work on any filesystem type that implemented these
  interfaces. Hence tools like a startup file optimiser would
  only need to be written once, rather than needing a different
  tool for every different filesystem type.
 
 Yeah, but that's simply not enough. 

Not enough for what?

 A good defragger needs to know

Oh, we're back to defrag again. :/

 about a filesystem's allocation policies, and move files so they are
 optimally located, given the filesystem layout.  For example, in
 ext2/3/4 we will want to move blocks so they are in the same block group
 as the inode.  That's filesystem specific information; other
 filesystems will require different policies.

Of which a good chunk of policies will be common. The above policy
has been around for many, many years and is implemented in many, many
filesystems (even XFS).

  get_free_list(dst_fd, location, len, list)

location == allocation policy. e.g: give me a list of free blocks:

- anywhere (default filesystem policy applies)
- near block number X
- at block X
- in block/allocation group Y
- of the largest contiguous regions in (one of the above)
- at least N blocks in length
- near inode src_fd
- in storage tier 3

Then you select one of the regions that was returned and attempt
to allocate it.
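
As a rough userspace sketch of the "select extent to use" step: nothing below is an existing kernel or VFS interface. The names loc_policy, loc_hint, free_extent and select_extent() are made up for illustration, and only three of the policies listed above are modelled. It assumes get_free_list() has already filled in the array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placement hints, mirroring part of the list above. */
enum loc_policy { LOC_ANYWHERE, LOC_NEAR_BLOCK, LOC_IN_GROUP };

struct free_extent {
	unsigned long start;	/* first free block */
	unsigned long len;	/* number of free blocks */
};

struct loc_hint {
	enum loc_policy policy;
	unsigned long arg;	/* block number or group number */
	unsigned long min_len;	/* "at least N blocks in length" */
};

/* Distance in blocks from blk to the nearest block of extent e. */
static unsigned long dist(const struct free_extent *e, unsigned long blk)
{
	if (blk < e->start)
		return e->start - blk;
	if (blk >= e->start + e->len)
		return blk - (e->start + e->len - 1);
	return 0;
}

/*
 * Pick one entry from list[0..n). Returns its index, or -1 if no
 * extent satisfies the hint. LOC_NEAR_BLOCK prefers the closest
 * extent; the other policies prefer the largest one.
 */
static int select_extent(const struct free_extent *list, size_t n,
			 const struct loc_hint *hint,
			 unsigned long blocks_per_group)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct free_extent *e = &list[i];

		if (e->len < hint->min_len)
			continue;
		if (hint->policy == LOC_IN_GROUP &&
		    e->start / blocks_per_group != hint->arg)
			continue;
		if (best < 0) {
			best = (int)i;
			continue;
		}
		if (hint->policy == LOC_NEAR_BLOCK) {
			if (dist(e, hint->arg) < dist(&list[best], hint->arg))
				best = (int)i;
		} else if (e->len > list[best].len) {
			best = (int)i;
		}
	}
	return best;
}
```

The caller would pass the chosen entry to alloc_from_list() and go around the loop again if the allocation fails because somebody else got there first.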

You can put whatever filesystem-specific stuff you need around this
to arrive at the decision of where to put the file, but you've got
to allocate the new blocks, move the data to them, and swap them
over. Every defragger needs to do this, regardless of the filesystem
type. So why not provide a framework for it, especially as the
framework is useful for far more than just as the data movement part
of a defrag application?

  Remember, I'm not just talking about defrag - I'm talking about
  an interface that is actually useful to apps that might care
  about how data is laid out on disk but the application writers
  don't know anything about how filesystem X or Y or Z is
  implemented. Putting the burden of learning about filesystem
  internals on application developers is not the correct solution.
 
 Unfortunately, if you want to do a good job, a defragger *has* to know
 about some very low-level filesystem specific information, if it wants
 to do a good job.

Back to defrag. Again. Bigger picture, guys, bigger picture.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-26 Thread Andreas Dilger
On Oct 25, 2006  16:54 +0200, Jan Kara wrote:
 I've just not yet decided how to handle indirect
 blocks in case of relocation in the middle of the file. Should they be
 relocated or shouldn't they? Probably they should be relocated at least
 in case they are fully contained in relocated interval or maybe better
 said when all the blocks they reference to are also in the interval
 (this handles also the case of EOF). But still if you would like to
 relocate the file by parts this is not quite what you want (you won't be
 able to relocate indirect blocks in the boundary of intervals) :(.

I suspect that the natural choice for metadata blocks is to keep the
block which has the most metadata unchanged.  For example, if you are
doing a full-file relocation then you would naturally keep all of the
new {dt}indirect blocks.  If you are relocating a small chunk of the
file you would keep the old {dt}indirect blocks and just copy a few
block pointers over.
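
To make that heuristic concrete, here is a small sketch. It only illustrates the rule of thumb; choose_indirect() and its parameters are hypothetical names, not existing ext3 code. For one indirect block it counts how many of the pointers in use fall inside the relocated interval and keeps whichever copy needs fewer pointers patched.

```c
#include <assert.h>

enum keep_choice { KEEP_OLD_INDIRECT, KEEP_NEW_INDIRECT };

/* Number of blocks common to [a_start, a_end) and [b_start, b_end). */
static unsigned long overlap(unsigned long a_start, unsigned long a_end,
			     unsigned long b_start, unsigned long b_end)
{
	unsigned long lo = a_start > b_start ? a_start : b_start;
	unsigned long hi = a_end < b_end ? a_end : b_end;

	return hi > lo ? hi - lo : 0;
}

/*
 * ind_first: logical block mapped by the first pointer of the
 *            indirect block
 * ind_used:  number of pointers in use (may be short at EOF)
 * rel_start, rel_len: logical interval being relocated
 *
 * If most pointers change, keep the newly written indirect block and
 * copy the few untouched pointers into it; otherwise keep the old
 * one and patch in the relocated pointers. A tie keeps the old block.
 */
static enum keep_choice choose_indirect(unsigned long ind_first,
					unsigned long ind_used,
					unsigned long rel_start,
					unsigned long rel_len)
{
	unsigned long moved = overlap(ind_first, ind_first + ind_used,
				      rel_start, rel_start + rel_len);
	unsigned long kept = ind_used - moved;

	return moved > kept ? KEEP_NEW_INDIRECT : KEEP_OLD_INDIRECT;
}
```

Because ind_used is the number of pointers actually in use, a partially filled indirect block at EOF is handled the same way as a full one.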

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: ext3: bogus i_mode errors with 2.6.18.1

2006-10-26 Thread Andreas Dilger
On Oct 25, 2006  11:44 +0200, Andre Noll wrote:
 Are you saying that ext3_set_bit() should simply be called with
 ret_block as its first argument? If yes, that is what the revised
 patch below does.

You might need to call ext3_set_bit_atomic() (as claim_block() does);
I'm not sure.

 @@ -1372,12 +1370,21 @@ allocated:
   in_range(ret_block, le32_to_cpu(gdp->bg_inode_table),
 EXT3_SB(sb)->s_itb_per_group) ||
   in_range(ret_block + num - 1, le32_to_cpu(gdp->bg_inode_table),
 +   EXT3_SB(sb)->s_itb_per_group)) {
 + ext3_error(sb, __FUNCTION__,
   "Allocating block in system zone - "
   "blocks from "E3FSBLK", length %lu",
   ret_block, num);
 + /* Note: This will potentially use up one of the handle's
 +  * buffer credits.  Normally we have way too many credits,
 +  * so that is OK.  In _very_ rare cases it might not be OK.
 +  * We will trigger an assertion if we run out of credits,
 +  * and we will have to do a full fsck of the filesystem -
 +  * better than randomly corrupting filesystem metadata.
 +  */
 + ext3_set_bit(ret_block, gdp_bh->b_data);
 + goto repeat;
 + }

The other issue is that you need to potentially set num bits in the
bitmap here, if those all overlap metadata.  In fact, it might just
make more sense at this stage to walk all of the bits in the bitmaps,
the inode table and the backup superblock and group descriptor to see
if they need fixing also.
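
A userspace sketch of the "set num bits" part, to show what I mean. This is not the ext3 code: mark_zone_overlap() and the two bit helpers are hypothetical stand-ins, the bitmap is a plain byte array indexed relative to the first block of the group, and journal credits are ignored entirely.

```c
#include <assert.h>
#include <limits.h>

static void set_bit_ul(unsigned long nr, unsigned char *bitmap)
{
	bitmap[nr / CHAR_BIT] |= (unsigned char)(1u << (nr % CHAR_BIT));
}

static int test_bit_ul(unsigned long nr, const unsigned char *bitmap)
{
	return (bitmap[nr / CHAR_BIT] >> (nr % CHAR_BIT)) & 1;
}

/*
 * Mark every block of the allocation [ret_block, ret_block + num)
 * that lies inside the metadata zone [zone_start, zone_start +
 * zone_len) as in use. Returns how many bits had to be fixed.
 * Assumes all blocks involved belong to the group whose first
 * block is group_first.
 */
static unsigned long mark_zone_overlap(unsigned char *bitmap,
				       unsigned long group_first,
				       unsigned long ret_block,
				       unsigned long num,
				       unsigned long zone_start,
				       unsigned long zone_len)
{
	unsigned long lo = ret_block > zone_start ? ret_block : zone_start;
	unsigned long end_a = ret_block + num;
	unsigned long end_z = zone_start + zone_len;
	unsigned long hi = end_a < end_z ? end_a : end_z;
	unsigned long blk, fixed = 0;

	for (blk = lo; blk < hi; blk++) {
		if (!test_bit_ul(blk - group_first, bitmap)) {
			set_bit_ul(blk - group_first, bitmap);
			fixed++;
		}
	}
	return fixed;
}
```

Calling this once for the inode table and once each for the block and inode bitmap blocks would cover the zones that the in_range() checks test.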

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [RFC] Ext3 online defrag

2006-10-26 Thread Jan Kara
 On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
  On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
   On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
   So how do you then get the generic interface to allocate blocks
   specified by userspace race free?
  
  As has been repeatedly stated, there is no generic.  There MUST be
  filesystem-specific knowledge during these operations.
 
 What information? All we need to know is where the free disk space
 is, and have a method to attempt to allocate from it. That's _easy_
 to abstract into a common interface via the VFS
 
Further, in the case being discussed in this thread, ext2meta has
already been proven a workable solution.
   
   Sure, but that's not a generic solution to a problem common to
   all filesystems
  
  You clearly don't know what I'm talking about.  ext2meta is an example
  of a filesystem-specific metadata access method, applicable to tasks
  such as online optimization.
 
 I know exactly what ext2meta is. I said it's not a generic solution
 and you say its a filesystem specific solution.  I think we're
 agreeing here. ;)
 
 We don't need to expose anything filesystem specific to userspace to
 implement this.  Online data movement (i.e. the defrag mechanism)
 becomes something like:
 
   do {
   get_free_list(dst_fd, location, len, list)
   /* select extent to use */
  Up to this point I can imagine we can be perfectly generic.

   alloc_from_list(dst_fd, list[X], off, len)
   } while (ENOALLOC)
   move_data(src_fd, dst_fd, off, len);
  With these two it's not clear how well we can do with just a generic
interface. Every filesystem needs to have some additional metadata to
keep list of data blocks. In case of ext2/ext3/reiserfs this is not
a negligible amount of space and placement of these metadata is important
for performance. So either we focus only on data blocks and let
implementation of alloc_from_list() allocate metadata wherever it wants
(but then we get suboptimal performance because there need not be space
for indirect blocks close before our provided extent) or we allocate
metadata from the provided list, but then we need some knowledge of fs
to know how much should we expect to spend on metadata and where these
metadata should be placed. For example if you know that indirect block
for your interval is at block B, then you'd like to allocate somewhere
close after this point or to relocate that indirect block (and all the
data it references to). But for that you need to know you have something
like indirect blocks = filesystem knowledge.
  So I think that to get this working, we also need some way to tell
the program that if it wants to allocate some data, it also needs to
account for this amount of metadata, some of which is already allocated
in given blocks...
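
  To give an idea of "this amount of metadata" for the ext2/ext3 case, a small sketch. It assumes the classic layout of 12 direct pointers in the inode followed by one single-, one double- and one triple-indirect block, with 4-byte block pointers. The function name is made up, it counts mapping blocks for a file without holes, and it does not check the maximum file size.

```c
#include <assert.h>

#define NDIR_BLOCKS 12UL	/* direct pointers held in the inode */

static unsigned long div_round_up(unsigned long a, unsigned long b)
{
	return (a + b - 1) / b;
}

/*
 * Number of indirect (mapping) blocks needed for a file of nblocks
 * data blocks, assuming no holes and 4-byte block pointers.
 */
static unsigned long indirect_blocks_needed(unsigned long nblocks,
					    unsigned long block_size)
{
	unsigned long ppb = block_size / 4;	/* pointers per block */
	unsigned long n;

	if (nblocks <= NDIR_BLOCKS)
		return 0;
	n = nblocks - NDIR_BLOCKS;

	/* Single indirect: one block maps up to ppb data blocks. */
	if (n <= ppb)
		return 1;
	n -= ppb;

	/* Double indirect: one top block plus one leaf per ppb blocks. */
	if (n <= ppb * ppb)
		return 1 + 1 + div_round_up(n, ppb);
	n -= ppb * ppb;

	/* Triple indirect: top block, middle blocks, leaf blocks. */
	return 1 + (1 + ppb) + 1 + div_round_up(n, ppb * ppb) +
	       div_round_up(n, ppb);
}
```

  With 4 KiB blocks, a file that just fills the double-indirect tree already carries 1026 mapping blocks, so their placement is clearly not negligible.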

 I see substantial benefit moving forward from having filesystem
 independent interfaces. Many features that filesystems implement
 are common, and as time goes on the common feature set of the
 different filesystems gets larger. So why shouldn't we be
 trying to make common operations generic so that every filesystem
 can benefit from the latest and greatest tool?
  So you prefer to handle only data blocks part of the problem and let
filesystem sort out metadata?

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-26 Thread Theodore Tso
On Thu, Oct 26, 2006 at 04:36:48PM +1000, David Chinner wrote:
   Remember, I'm not just talking about defrag - I'm talking about
   an interface that is actually useful to apps that might care
   about how data is laid out on disk but the application writers
   don't know anything about how filesystem X or Y or Z is
   implemented. Putting the burden of learning about filesystem
   internals on application developers is not the correct solution.

If all you want is something for application developers, about all you
can do is to tell the filesystem, create the file so that it will be
quickly accessed after accessing this file or this directory.  I
really don't see the point of having the application specify block
numbers if you're also claiming the application isn't going to know
anything about the filesystem layout --- or even the RAID layout of
the filesystem.  I don't think it's at **all** useful to be
half-pregnant on this score.

- Ted


Re: [RFC] Ext3 online defrag

2006-10-26 Thread Dave Kleikamp
On Thu, 2006-10-26 at 09:37 -0400, Theodore Tso wrote:
 On Thu, Oct 26, 2006 at 04:36:48PM +1000, David Chinner wrote:
  Remember, I'm not just talking about defrag - I'm talking about
  an interface that is actually useful to apps that might care
  about how data is laid out on disk but the application writers
  don't know anything about how filesystem X or Y or Z is
  implemented. Putting the burden of learning about filesystem
  internals on application developers is not the correct solution.
 
 If all you want is something for application developers, about all you
 can do is to tell the filesystem, create the file so that it will be
 quickly accessed after accessing this file or this directory.  I
 really don't see the point of having the application specify block
 numbers if you're also claiming the application isn't going to know
 anything about the filesystem layout --- or even the RAID layout of
 the filesystem.  I don't think it's at **all** useful to be
 half-pregnant on this score.

I think a utility such as a defragmenter should know about the
filesystem layout.  I also think that it would be a good thing to have a
consistent interface so that every filesystem isn't implementing a
completely different one.
-- 
David Kleikamp
IBM Linux Technology Center



Re: [RFC] Ext3 online defrag

2006-10-26 Thread Jörn Engel
On Wed, 25 October 2006 14:41:18 -0400, Jeff Garzik wrote:
 On Wed, Oct 25, 2006 at 08:36:56PM +0200, Jan Kara wrote:
  Yes, but there's a question of the interface to this operation. How to
  specify which indirect block I mean? Obviously we could introduce
  separate call for remapping indirect blocks but I find this solution
  kind of clumsy...
 
 Agreed...  that gets nasty real quick.

Logfs has a similar problem and I introduced a level.  Without going
into all the gory details, data blocks reside on level 0, indirect
blocks on level 1, doubly indirect blocks on level 2, etc.  With this,
the tuple of (ino, pos, level) can specify any block on the
filesystem, provided it is used for some inode.

Logfs needs this for Garbage Collection, which is a fairly similar
problem.
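
A sketch of the addressing idea, not the Logfs code: the struct and function names are made up, pos is counted in data blocks, every block at any level is assumed to hold the same number of pointers, and I assume an indirect block is named by the position of the first data block it covers.

```c
#include <assert.h>
#include <stdint.h>

struct block_ref {
	uint64_t ino;
	uint64_t pos;		/* position, in units of data blocks */
	unsigned level;		/* 0 = data, 1 = indirect, 2 = doubly... */
};

/* Number of data blocks covered by one block at the given level. */
static uint64_t span(unsigned level, uint64_t ptrs)
{
	uint64_t s = 1;

	while (level--)
		s *= ptrs;
	return s;
}

/*
 * Reference to the block one level up that points at child: same
 * inode, level + 1, pos rounded down to the start of the range the
 * parent covers.
 */
static struct block_ref parent_of(struct block_ref child, uint64_t ptrs)
{
	struct block_ref p = child;

	p.level = child.level + 1;
	p.pos = child.pos - child.pos % span(p.level, ptrs);
	return p;
}
```

With 1024 pointers per block, data block (7, 5000, 0) has parent (7, 4096, 1) and grandparent (7, 0, 2), so a single remap call taking such a tuple can name indirect blocks as easily as data blocks.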

Jörn

-- 
Joern's library part 3:
http://inst.eecs.berkeley.edu/~cs152/fa05/handouts/clark-test.pdf


Re: [RFC] Ext3 online defrag

2006-10-26 Thread David Chinner
On Thu, Oct 26, 2006 at 01:37:22PM +0200, Jan Kara wrote:
  On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
  We don't need to expose anything filesystem specific to userspace to
  implement this.  Online data movement (i.e. the defrag mechanism)
  becomes something like:
  
  do {
  get_free_list(dst_fd, location, len, list)
  /* select extent to use */
   Up to this point I can imagine we can be perfectly generic.
 
  alloc_from_list(dst_fd, list[X], off, len)
  } while (ENOALLOC)
  move_data(src_fd, dst_fd, off, len);
   With these two it's not clear how well we can do with just a generic
 interface. Every filesystem needs to have some additional metadata to
 keep list of data blocks. In case of ext2/ext3/reiserfs this is not
 a negligible amount of space and placement of these metadata is important
 for performance.

Yes, the same can be said for XFS. However, XFS's extent btree implementation
uses readahead to hide a lot of the latency involved in reading the extent
map, and it only needs to read it once per inode lifecycle.

 So either we focus only on data blocks and let
 implementation of alloc_from_list() allocate metadata wherever it wants
 (but then we get suboptimal performance because there need not be space
 for indirect blocks close before our provided extent)

I think the first step would be to focus on data blocks using something
like the above. There are many steps to full filesystem defragmentation,
but data fragmentation is typically the most common symptom of
fragmentation that we see.

 or we allocate
 metadata from the provided list, but then we need some knowledge of fs
 to know how much should we expect to spend on metadata and where these
 metadata should be placed.

That's the second step, I think. For example, we could count the metadata blocks
used in a metadata structure (say, a block list), allocate a new chunk
like above, and then execute a move_metadata() type of operation,
which the filesystem does internally in a transactionally safe
manner. Once again: generic interface, filesystem-specific implementations.
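
The shape I have in mind is the usual ops-table one. Purely a sketch: struct reloc_ops, generic_move_metadata() and the toy implementation below are hypothetical, not an existing VFS interface, and a real implementation would do the move inside a transaction.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Per-filesystem hooks behind the generic calls. */
struct reloc_ops {
	const char *fs_name;
	int (*move_data)(int src_fd, int dst_fd,
			 unsigned long off, unsigned long len);
	/* Optional: NULL if the filesystem cannot move metadata. */
	int (*move_metadata)(int fd, unsigned long dst_start,
			     unsigned long nblocks);
};

/* Generic entry point: dispatch to the filesystem, or refuse. */
static int generic_move_metadata(const struct reloc_ops *ops, int fd,
				 unsigned long dst_start,
				 unsigned long nblocks)
{
	if (!ops || !ops->move_metadata)
		return -EOPNOTSUPP;
	return ops->move_metadata(fd, dst_start, nblocks);
}

/* Toy implementation standing in for a real filesystem. */
static unsigned long toy_moved;

static int toy_move_metadata(int fd, unsigned long dst_start,
			     unsigned long nblocks)
{
	(void)fd;
	(void)dst_start;
	toy_moved += nblocks;	/* just record how much was "moved" */
	return 0;
}
```

A tool written against the generic call gets -EOPNOTSUPP from filesystems that do not implement the hook and can simply fall back to moving data blocks only.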

 For example if you know that indirect block
 for your interval is at block B, then you'd like to allocate somewhere
 close after this point or to relocate that indirect block (and all the
 data it references to). But for that you need to know you have something
 like indirect blocks = filesystem knowledge.

*nod*

This is far less of a problem with extent based filesystems -
coalescing all the fragments into a single extent removes the need
for indirect blocks and you get the extent list for free when you
read the inode.  When we do have a fragmented file, XFS uses
readahead to speed btree searching and reading, so it hides a lot of
the latency overhead that fragmented metadata can cause.

Either way, these lists can still be optimised by allocating a
set of contiguous blocks, copying the metadata into them and
updating the pointers to the new blocks. It can be done separately
from the data moving and really should be done after the data has
been defragmented.

   So I think that to get this working, we also need some way to tell
 the program that if it wants to allocate some data, it also needs to
 account for this amount of metadata, some of which is already allocated
 in given blocks...

If you want to do it all in one step.

However, it's not quite that simple for something like XFS. An
allocation may require a btree split (or three, actually) and the
number of blocks required is dependent on the height of the btrees.
So we don't know how many blocks we'll need ahead of time, and we'd
have to reach deep into the allocator and abuse it badly to do
anything like this. It's not something I want to even contemplate
doing. :/

Also, we don't want to be mingling global metadata with inode
specific metadata so we don't want to put most of the new metadata
blocks near the extent we are putting the data into.

That means I'd prefer to be able to optimise metadata objects
separately. e.g. rewrite a btree into a single contiguous extent
with the btree blocks laid out so the readahead patterns result
in sequential I/O. The kernel would need to do this in XFS because
we'd have to lock the entire btree a block at a time, copy it
and then issue a swap btree transaction. Most other journalling
filesystems will have similar requirements, I think, for doing
this online.

That's a very similar concept to the move_data() interface...

  I see substantial benefit moving forward from having filesystem
  independent interfaces. Many features that filesystems implement
  are common, and as time goes on the common feature set of the
  different filesystems gets larger. So why shouldn't we be
  trying to make common operations generic so that every filesystem
  can benefit from the latest and greatest tool?
   So you prefer to handle only data blocks part of the problem and let
 filesystem sort out metadata?

The filesystem