Re: [RFC] Ext3 online defrag

2006-11-15 Thread Takashi Sato

Hi Alex,

Thank you for your information.
I have sent patches that implement defragmentation for an extent-based
file on ext3, built on your multi-block allocation patches.
I would be happy if you had time to review them.
[RFC][PATCH 0/3] Extent base online defrag
http://marc.theaimsgroup.com/?l=linux-ext4&m=116307062907075&w=2

And I'd like to start considering the defragmentation for ext4.
Do you have a plan to update your patches for ext4?


I've been reworking mballoc with a few new features:

1) in-core preallocation
  like the existing reservation, but can preallocate a few pieces for a file

2) locality groups
  to maintain groups of related files and flush them together.
  say, two users are unpacking a kernel. with delayed allocation
  we've got a bunch of files from both of them in cache. then we flush
  the first set (a few MBs) of files from one user, then from the other.
  this way write I/Os will be large enough to achieve good
  throughput and files are still localized enough to be read later
  at a good rate.

3) scalable reservation
  required for delayed allocation to avoid -ENOSPC at flush time.
  current version uses per-sb spinlock.

probably we could add something for defragmentation?
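
The locality-group idea in item 2 can be sketched in a few lines. This is only an illustration, not mballoc code: the function name flush_order, the (group, name, size) tuples and the batch_bytes threshold are all assumptions made for the example. Dirty files are bucketed by group (say, one group per user), cut into batches of roughly batch_bytes, and the batches are then flushed alternately so that each write is large while each group's files stay together.

```python
from collections import defaultdict
from itertools import zip_longest


def flush_order(dirty_files, batch_bytes):
    """Return flush batches as (group, [file names]), alternating between groups.

    dirty_files: list of (group, name, size_bytes) in the order they were dirtied.
    """
    by_group = defaultdict(list)
    for group, name, size in dirty_files:
        by_group[group].append((name, size))

    per_group = []
    for group, files in by_group.items():
        batches, current, current_size = [], [], 0
        for name, size in files:
            current.append(name)
            current_size += size
            if current_size >= batch_bytes:   # batch is big enough for one large write
                batches.append((group, current))
                current, current_size = [], 0
        if current:
            batches.append((group, current))
        per_group.append(batches)

    # Take one batch from each group in turn.
    ordered = []
    for round_ in zip_longest(*per_group):
        ordered.extend(batch for batch in round_ if batch is not None)
    return ordered
```

With two users interleaving small writes, the result is one batch from the first user, one from the second, and so on, rather than a file-by-file interleave.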


Cheers, Takashi
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-27 Thread Eric Sandeen

Alex Tomas wrote:

3) scalable reservation
   required for delayed allocation to avoid -ENOSPC at flush time.
   current version uses per-sb spinlock.


Can you elaborate on this issue?  Shouldn't delayed allocation decrement free 
space immediately, and only the actual block location choice is delayed?  Or is 
this due to potential extra metadata space required as blocks are allocated?


Thanks,

-Eric


Re: [RFC] Ext3 online defrag

2006-10-27 Thread Alex Tomas
 Eric Sandeen (ES) writes:

 ES Alex Tomas wrote:
  3) scalable reservation
  required for delayed allocation to avoid -ENOSPC at flush time.
  current version uses per-sb spinlock.

 ES Can you elaborate on this issue?  Shouldn't delayed allocation
 ES decrement free space immediately, and only the actual block location
 ES choice is delayed?  Or is this due to potential extra metadata space
 ES required as blocks are allocated?

exactly. in this case, reservation has nothing to do with allocation
or preallocation of real blocks. this is just a *per-sb counter* of
blocks reserved for allocation at flush time. it includes all
not-yet-allocated blocks and the metadata needed to allocate them (bitmaps,
group descriptors, extent tree blocks, etc). the previous version
of mballoc has reservation, but it doesn't scale very well, being
a single global counter protected by a spinlock. at least, in many
regular loads I observed the reservation function in the top 30 of oprofile.
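
A minimal userspace sketch of the counter described above, assuming a single lock in place of the per-sb spinlock. The class and method names (BlockReservation, reserve, flush) are hypothetical and are not taken from mballoc. The point it shows is that -ENOSPC is returned at write time, when the reservation is taken, so the flush itself cannot run out of space.

```python
import errno
import threading


class BlockReservation:
    """Toy per-superblock reservation counter guarded by one lock."""

    def __init__(self, free_blocks):
        self._lock = threading.Lock()   # stands in for the per-sb spinlock
        self.free = free_blocks         # blocks not yet allocated on disk
        self.reserved = 0               # blocks promised to delayed writes

    def reserve(self, data_blocks, metadata_blocks):
        """Called at write time; fails early instead of at flush time."""
        need = data_blocks + metadata_blocks
        with self._lock:
            if self.reserved + need > self.free:
                return -errno.ENOSPC
            self.reserved += need
            return 0

    def flush(self, reserved_blocks, used_blocks):
        """Called at flush time: drop the reservation, consume what was used."""
        with self._lock:
            self.reserved -= reserved_blocks
            self.free -= used_blocks
```

Every writer takes the same lock here, which is the scaling problem mentioned above. Splitting the counter into per-CPU pools is one common way to reduce that contention; whether the reworked mballoc does this is not stated in the thread.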



thanks, Alex


Re: [RFC] Ext3 online defrag

2006-10-26 Thread David Chinner
On Wed, Oct 25, 2006 at 11:33:16PM -0400, Theodore Tso wrote:
 On Thu, Oct 26, 2006 at 11:40:20AM +1000, David Chinner wrote:
  We don't need to expose anything filesystem specific to userspace to
  implement this.  Online data movement (i.e. the defrag mechanism)
  becomes something like:
  
  do {
  get_free_list(dst_fd, location, len, list)
  /* select extent to use */
  alloc_from_list(dst_fd, list[X], off, len)
  } while (ENOALLOC)
  move_data(src_fd, dst_fd, off, len);
  
  And this would work on any filesystem type that implemented these
  interfaces. Hence tools like a startup file optimiser would
  only need to be written once, rather than needing a different
  tool for every different filesystem type.
 
 Yeah, but that's simply not enough. 

Not enough for what?

 A good defragger needs to know

Oh, we're back to defrag again. :/

 about a filesystem's allocation policies, and move files so they are
 optimally located, given the filesystem layout.  For example, in
 ext2/3/4 we will want to move blocks so they are in the same block group
 as the inode.  That's filesystem specific information; other
 filesystems will require different policies.

Of which a good chunk of policies will be common. The above policy
has been around for many, many years and is implemented in many, many
filesystems (even XFS).

  get_free_list(dst_fd, location, len, list)

location == allocation policy. e.g: give me a list of free blocks:

- anywhere (default filesystem policy applies)
- near block number X
- at block X
- in block/allocation group Y
- of the largest contiguous regions in (one of the above)
- at least N blocks in length
- near inode src_fd
- in storage tier 3

Then you select one of the regions that was returned and attempt
to allocate it.

You can put whatever filesystem-specific stuff you need around this
to arrive at the decision of where to put the file, but you've got
to allocate the new blocks, move the data to them, and swap them
over. Every defragger needs to do this, regardless of the filesystem
type. So why not provide a framework for it, especially as the
framework is useful for far more than just as the data movement part
of a defrag application.
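
The loop proposed above can be modelled in userspace as follows. This is a toy under stated assumptions, not a kernel interface: ToyFs, its block-number sets and the `near` policy argument are invented for the example, and only the three operation names come from the pseudocode in this thread. alloc_from_list() returns False when the chosen extent is no longer free, which plays the role of ENOALLOC.

```python
class ToyFs:
    def __init__(self, nblocks):
        self.free = set(range(nblocks))
        self.files = {}   # fd -> {logical offset: physical block}
        self.data = {}    # physical block -> payload

    def get_free_list(self, near, length):
        """Return free (start, length) extents, closest to `near` first."""
        extents = [(s, length) for s in sorted(self.free)
                   if all(b in self.free for b in range(s, s + length))]
        return sorted(extents, key=lambda e: abs(e[0] - near))

    def alloc_from_list(self, fd, extent, off):
        start, length = extent
        blocks = range(start, start + length)
        if not all(b in self.free for b in blocks):
            return False                      # lost the race (ENOALLOC)
        self.free.difference_update(blocks)
        for i, b in enumerate(blocks):
            self.files.setdefault(fd, {})[off + i] = b
        return True

    def move_data(self, src_fd, dst_fd, off, length):
        """Copy data into dst's blocks, then swap them into src."""
        for i in range(off, off + length):
            old = self.files[src_fd][i]
            new = self.files[dst_fd].pop(i)
            self.data[new] = self.data.pop(old)
            self.files[src_fd][i] = new
            self.free.add(old)


def relocate(fs, src_fd, dst_fd, off, length, near):
    while True:
        candidates = fs.get_free_list(near, length)
        if not candidates:
            return False
        if fs.alloc_from_list(dst_fd, candidates[0], off):
            break                             # allocation succeeded
    fs.move_data(src_fd, dst_fd, off, length)
    return True
```

The caller never looks at filesystem metadata; it only picks among the extents it is offered and retries if the allocation fails.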

  Remember, I'm not just talking about defrag - I'm talking about
  an interface that is actually useful to apps that might care
  about how data is laid out on disk but the application writers
  don't know anything about how filesystem X or Y or Z is
  implemented. Putting the burden of learning about filesystem
  internals on application developers is not the correct solution.
 
 Unfortunately, if you want to do a good job, a defragger *has* to know
 about some very low-level filesystem specific information, if it wants
 to do a good job.

Back to defrag. Again. Bigger picture, guys, bigger picture.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] Ext3 online defrag

2006-10-26 Thread Andreas Dilger
On Oct 25, 2006  16:54 +0200, Jan Kara wrote:
 I've just not yet decided how to handle indirect
 blocks in case of relocation in the middle of the file. Should they be
 relocated or shouldn't they? Probably they should be relocated at least
 in case they are fully contained in relocated interval or maybe better
 said when all the blocks they reference to are also in the interval
 (this handles also the case of EOF). But still if you would like to
 relocate the file by parts this is not quite what you want (you won't be
 able to relocate indirect blocks in the boundary of intervals) :(.

I suspect that the natural choice for metadata blocks is to keep the
block which has the most metadata unchanged.  For example, if you are
doing a full-file relocation then you would naturally keep all of the
new {dt}indirect blocks.  If you are relocating a small chunk of the
file you would keep the old {dt}indirect blocks and just copy a few
block pointers over.
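
One way to read this rule is "keep whichever indirect block needs the fewest pointers copied into it". The helper below is a sketch of that reading only; the function name and the tie-break in favour of the old block are assumptions.

```python
def indirect_block_to_keep(total_pointers, relocated_pointers):
    """Return (which block to keep, how many pointers must be copied into it)."""
    if not 0 <= relocated_pointers <= total_pointers:
        raise ValueError("relocated_pointers out of range")
    copies_if_keep_old = relocated_pointers                    # copy new pointers in
    copies_if_keep_new = total_pointers - relocated_pointers   # copy untouched ones in
    if copies_if_keep_new < copies_if_keep_old:
        return ("new", copies_if_keep_new)
    return ("old", copies_if_keep_old)
```

A full-file relocation keeps the new block with nothing to copy, while relocating 8 of 1024 pointers keeps the old block and copies 8.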

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [RFC] Ext3 online defrag

2006-10-26 Thread Jan Kara
 On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
  On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
   On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
   So how do you then get the generic interface to allocate blocks
   specified by userspace race free?
  
  As has been repeatedly stated, there is no generic.  There MUST be
  filesystem-specific knowledge during these operations.
 
 What information? All we need to know is where the free disk space
 is, and have a method to attempt to allocate from it. That's _easy_
 to abstract into a common interface via the VFS
 
Further, in the case being discussed in this thread, ext2meta has
already been proven a workable solution.
   
   Sure, but that's not a generic solution to a problem common to
   all filesystems
  
  You clearly don't know what I'm talking about.  ext2meta is an example
  of a filesystem-specific metadata access method, applicable to tasks
  such as online optimization.
 
 I know exactly what ext2meta is. I said it's not a generic solution
 and you say it's a filesystem specific solution.  I think we're
 agreeing here. ;)
 
 We don't need to expose anything filesystem specific to userspace to
 implement this.  Online data movement (i.e. the defrag mechanism)
 becomes something like:
 
   do {
   get_free_list(dst_fd, location, len, list)
   /* select extent to use */
  Up to this point I can imagine we can be perfectly generic.

   alloc_from_list(dst_fd, list[X], off, len)
   } while (ENOALLOC)
   move_data(src_fd, dst_fd, off, len);
  With these two it's not clear how well we can do with just a generic
interface. Every filesystem needs some additional metadata to
keep the list of data blocks. In the case of ext2/ext3/reiserfs this is not
a negligible amount of space, and the placement of this metadata is important
for performance. So either we focus only on data blocks and let the
implementation of alloc_from_list() allocate metadata wherever it wants
(but then we get suboptimal performance because there need not be space
for indirect blocks just before our provided extent), or we allocate
metadata from the provided list, but then we need some knowledge of the fs
to know how much we should expect to spend on metadata and where this
metadata should be placed. For example, if you know that the indirect block
for your interval is at block B, then you'd like to allocate somewhere
close after this point or to relocate that indirect block (and all the
data it references). But for that you need to know you have something
like indirect blocks = filesystem knowledge.
  So I think that to get this working, we also need some way to tell
the program that if it wants to allocate some data, it also needs to
account for this amount of metadata and that some of it is already allocated
in given blocks...
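
To give a feel for how much metadata is involved, here is a small calculation for the classic ext2/ext3 indirect-block mapping: 12 direct pointers in the inode, then single, double and triple indirect blocks holding 4-byte block numbers. It assumes a dense file with no holes, and the function name is made up for the example.

```python
def indirect_blocks_needed(data_blocks, block_size=4096):
    """Count indirect blocks needed to map a dense file of `data_blocks` blocks."""
    ptrs = block_size // 4                  # 32-bit block numbers per indirect block
    remaining = max(0, data_blocks - 12)    # the first 12 blocks are direct
    meta = 0

    single = min(remaining, ptrs)
    if single:
        meta += 1                           # the single indirect block
    remaining -= single

    double = min(remaining, ptrs * ptrs)
    if double:
        meta += 1 + -(-double // ptrs)      # double indirect + its indirect blocks
    remaining -= double

    triple = min(remaining, ptrs ** 3)
    if triple:
        leaves = -(-triple // ptrs)         # indirect blocks at the bottom
        middles = -(-leaves // ptrs)        # double indirect blocks above them
        meta += 1 + middles + leaves
    remaining -= triple

    if remaining:
        raise ValueError("file too large for triple-indirect mapping")
    return meta
```

A userspace tool that wants to place metadata itself would need this kind of filesystem-specific arithmetic, which is the concern raised above.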

 I see substantial benefit moving forward from having filesystem
 independent interfaces. Many features that  filesystems implement
 are common, and as time goes on the common feature set of the
 different filesystems gets larger. So why shouldn't we be
 trying to make common operations generic so that every filesystem
 can benefit from the latest and greatest tool?
  So you prefer to handle only data blocks part of the problem and let
filesystem sort out metadata?

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-26 Thread Theodore Tso
On Thu, Oct 26, 2006 at 04:36:48PM +1000, David Chinner wrote:
   Remember, I'm not just talking about defrag - I'm talking about
   an interface that is actually useful to apps that might care
   about how data is laid out on disk but the application writers
   don't know anything about how filesystem X or Y or Z is
   implemented. Putting the burden of learning about filesystem
   internals on application developers is not the correct solution.

If all you want is something for application developers, about all you
can do is to tell the filesystem, create the file so that it will be
quickly accessed after accessing this file or this directory.  I
really don't see the point of having the application specify block
numbers if you're also claiming the application isn't going to know
anything about the filesystem layout --- or even the RAID layout of
the filesystem.  I don't think it's at **all** useful to be
half-pregnant on this score.

- Ted


Re: [RFC] Ext3 online defrag

2006-10-26 Thread Dave Kleikamp
On Thu, 2006-10-26 at 09:37 -0400, Theodore Tso wrote:
 On Thu, Oct 26, 2006 at 04:36:48PM +1000, David Chinner wrote:
Remember, I'm not just talking about defrag - I'm talking about
an interface that is actually useful to apps that might care
about how data is laid out on disk but the application writers
don't know anything about how filesystem X or Y or Z is
implemented. Putting the burden of learning about filesystem
internals on application developers is not the correct solution.
 
 If all you want is something for application developers, about all you
 can do is to tell the filesystem, create the file so that it will be
 quickly accessed after accessing this file or this directory.  I
 really don't see the point of having the application specify block
 numbers if you're also claiming the application isn't going to know
 anything about the filesystem layout --- or even the RAID layout of
 the filesystem.  I don't think it's at **all** useful to be
 half-pregnant on this score.

I think a utility such as a defragmenter should know about the
filesystem layout.  I also think that it would be a good thing to have a
consistent interface so that every filesystem isn't implementing a
completely different one.
-- 
David Kleikamp
IBM Linux Technology Center



Re: [RFC] Ext3 online defrag

2006-10-26 Thread Jörn Engel
On Wed, 25 October 2006 14:41:18 -0400, Jeff Garzik wrote:
 On Wed, Oct 25, 2006 at 08:36:56PM +0200, Jan Kara wrote:
Yes, but there's a question of the interface to this operation. How to
  specify which indirect block I mean? Obviously we could introduce
  separate call for remapping indirect blocks but I find this solution
  kind of clumsy...
 
 Agreed...  that gets nasty real quick.

Logfs has a similar problem and I introduced a level.  Without going
into all the gory details, data blocks reside on level 0, indirect
blocks on level 1, doubly indirect blocks on level 2, etc.  With this,
the tuple of (ino, pos, level) can specify any block on the
filesystem, provided it is used for some inode.

Logfs needs this for Garbage Collection, which is a fairly similar
problem.
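
A sketch of how such a tuple can name indirect blocks as well as data blocks. The convention used here, that a level-k block is identified by the position of the first data block it covers, is an assumption for illustration and not necessarily how Logfs defines pos. BlockRef, covering_blocks and the fanout parameter are likewise invented names.

```python
from typing import NamedTuple


class BlockRef(NamedTuple):
    ino: int     # inode number
    pos: int     # position in the file, in blocks
    level: int   # 0 = data, 1 = indirect, 2 = doubly indirect, ...


def covering_blocks(ino, pos, height, fanout):
    """Name the data block at `pos` and every indirect block above it."""
    refs = []
    span = 1                                  # data blocks covered at this level
    for level in range(height + 1):
        refs.append(BlockRef(ino, pos - pos % span, level))
        span *= fanout
    return refs
```

With this addressing, a relocation request for an indirect block looks the same as one for a data block, only with a non-zero level.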

Jörn

-- 
Joern's library part 3:
http://inst.eecs.berkeley.edu/~cs152/fa05/handouts/clark-test.pdf


Re: [RFC] Ext3 online defrag

2006-10-26 Thread David Chinner
On Thu, Oct 26, 2006 at 01:37:22PM +0200, Jan Kara wrote:
  On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
  We don't need to expose anything filesystem specific to userspace to
  implement this.  Online data movement (i.e. the defrag mechanism)
  becomes something like:
  
  do {
  get_free_list(dst_fd, location, len, list)
  /* select extent to use */
   Up to this point I can imagine we can be perfectly generic.
 
  alloc_from_list(dst_fd, list[X], off, len)
  } while (ENOALLOC)
  move_data(src_fd, dst_fd, off, len);
   With these two it's not clear how well can we do with just a generic
 interface. Every filesystem needs to have some additional metadata to
 keep list of data blocks. In case of ext2/ext3/reiserfs this is not
 a negligible amount of space and placement of these metadata is important
 for performance.

Yes, the same can be said for XFS. However, XFS's extent btree implementation
uses readahead to hide a lot of the latency involved with reading the extent
map, and it only needs to read it once per inode lifecycle.

 So either we focus only on data blocks and let
 implementation of alloc_from_list() allocate metadata wherever it wants
 (but then we get suboptimal performance because there need not be space
 for indirect blocks close before our provided extent)

I think the first step would be to focus on data blocks using something
like the above. There are many steps to full filesystem defragmentation,
but data fragmentation is typically the most common symptom of
fragmentation that we see.

 or we allocate
 metadata from the provided list, but then we need some knowledge of fs
 to know how much should we expect to spend on metadata and where these
 metadata should be placed.

That's the second step, I think. For example, we could count the metadata blocks
used in a metadata structure (say a block list), allocate a new chunk
like above, and then execute a move_metadata() type of operation,
which the filesystem does internally in a transactionally safe
manner. Once again, generic interface, filesystem specific implementations.

 For example if you know that indirect block
 for your interval is at block B, then you'd like to allocate somewhere
 close after this point or to relocate that indirect block (and all the
 data it references to). But for that you need to know you have something
 like indirect blocks = filesystem knowledge.

*nod*

This is far less of a problem with extent based filesystems -
coalescing all the fragments into a single extent removes the need
for indirect blocks and you get the extent list for free when you
read the inode.  When we do have a fragmented file, XFS uses
readahead to speed btree searching and reading, so it hides a lot of
the latency overhead that fragmented metadata can cause.

Either way, these lists can still be optimised by allocating a
set of contiguous blocks and copying the metadata into them and
updating the pointers to the new blocks. It can be done separately
to the data moving and really should be done after the data has
been defragmented.

   So I think that to get this working, we also need some way to tell
 the program that if it wants to allocate some data, it also needs to
 count with this amount of metadata and some of it is already allocated
 in given blocks...

If you want to do it all in one step.

However, it's not quite that simple for something like XFS. An
allocation may require a btree split (or three, actually) and the
number of blocks required is dependent on the height of the btrees.
So we don't know how many blocks we'll need ahead of time, and we'd
have to reach deep into the allocator and abuse it badly to do
anything like this. It's not something I want to even contemplate
doing. :/
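
Why the block count depends on tree height can be seen from the textbook B-tree bound: an insert that splits every node on the path from leaf to root allocates one new block per level, plus one for the new root. The helper below applies that bound to a list of tree heights. It is a generic worst-case estimate, not XFS's actual reservation calculation, and the function name is made up.

```python
def worst_case_new_blocks(btree_heights):
    """Upper bound on new blocks if every affected btree splits all the way up."""
    return sum(height + 1 for height in btree_heights)
```

For three trees of heights 2, 1 and 1 the bound is 7 blocks, and it changes whenever any tree grows a level, which is why the count cannot be fixed ahead of time.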

Also, we don't want to be mingling global metadata with inode
specific metadata so we don't want to put most of the new metadata
blocks near the extent we are putting the data into.

That means I'd prefer to be able to optimise metadata objects
separately. e.g. rewrite a btree into a single contiguous extent
with the btree blocks laid out so the readahead patterns result
in sequential I/O. The kernel would need to do this in XFS because
we'd have to lock the entire btree a block at a time, copy it
and then issue a swap btree transaction. Most other journalling
filesystems will have similar requirements, I think, for doing
this online.

That's a very similar concept to the move_data() interface...

  I see substantial benefit moving forward from having filesystem
  independent interfaces. Many features that  filesystems implement
  are common, and as time goes on the common feature set of the
  different filesystems gets larger. So why shouldn't we be
  trying to make common operations generic so that every filesystem
  can benefit from the latest and greatest tool?
   So you prefer to handle only data blocks part of the problem and let
 filesystem sort out metadata?

The filesystem 

Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
 On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
  On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:
   But it's a race that is _easily_ handled, and applications only need to
   implement one interface, not a different method for every
   filesystem that requires deep filesystem knowledge.
   
   Besides, you still have to handle the case where the block you want
   has already been allocated because reading the metadata from
   userspace doesn't prevent the kernel from allocating the block you
   want before you ask for it...
  
  The race is easily handled either way, by having the block move fail
  when you tell the kernel the destination blocks.
 
 So why are you arguing that an interface is no good because it
 is fundamentally racy? ;)

My point was that it is silly to introduce obviously racy code into the
kernel, when -- inside the kernel -- it could be handled race-free.

If you accept a racy solution, you might as well do it outside the
kernel, where you get the same results, but without adding silliness and
bloat to the kernel.


  Every major filesystem has a libfoofs library that makes it trivial to
  read the metadata, so all you need to do is use an existing lib.
 
 IOWs, you are advocating that any application that wants to use this
 special allocation technique needs to link against every different
 filesystem library and it then needs to implement filesystem
 specific searches through their metadata?  Nobody in their right
 mind would ever want to use an interface like this.

Online defrag is OBVIOUSLY highly filesystem specific.  You have to link
against filesystem specific code somewhere, whether it's inside the
kernel or outside the kernel.

Further, in the case being discussed in this thread, ext2meta has
already been proven a workable solution.

Jeff





Re: [RFC] Ext3 online defrag

2006-10-25 Thread David Chinner
On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
 On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
  On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
  So why are you arguing that an interface is no good because it
  is fundamentally racy? ;)
 
 My point was that it is silly to introduce obviously racy code into the
 kernel, when -- inside the kernel -- it could be handled race-free.

So how do you then get the generic interface to allocate blocks
specified by userspace race free?

   Every major filesystem has a libfoofs library that makes it trivial to
   read the metadata, so all you need to do is use an existing lib.
  
  IOWs, you are advocating that any application that wants to use this
  special allocation technique needs to link against every different
  filesystem library and it then needs to implement filesystem
  specific searches through their metadata?  Nobody in their right
  mind would ever want to use an interface like this.
 
 Online defrag is OBVIOUSLY highly filesystem specific. 

Parts of it are, but data movement and allocation hints need to be
provided by every filesystem that wants to implement this
efficiently. These features are also useful outside of defrag as
well - I can think of several applications that would benefit from
being able to direct where in the filesystem they want data to
reside. 

If userspace directed allocation requires deep knowledge of the
filesystem metadata (this is what you are saying they need to do,
right?), then these applications will never, ever make use of this
interface and we'll continue to have problems with them.

I guess my point is that we are going to implement features like
this in XFS and if other filesystems are going to be doing the same
thing then we should try to come up with generic solutions rather
than reinvent the wheel over and over again.

 Further, in the case being discussed in this thread, ext2meta has
 already been proven a workable solution.

Sure, but that's not a generic solution to a problem common to
all filesystems

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Oct 24, 2006  15:44 -0400, Theodore Tso wrote:
  First of all, we would need a way of allowing userspace to specify
  which blocks should be used in the preallocation.
 
 Presumably it could do this in the same way it will be specifying
 which blocks to relocate in the defragger - by passing an extent.
 You would be required to pass the file offset for which to preallocate,
 and optionally an extent for the on-disk allocation itself (if none is
 supplied the kernel will allocate the best extent it can).
 
  Secondly, we would need a way of marking blocks as preallocated but
  not pre-zeroed; otherwise we would have to zero out all of the blocks
  in order to assure security (don't want userspace programs seeing the
  previous contents of the data blocks), only to do the copy and the
  extents vector swap.
 
 This could be mitigated by having the preallocation be done (in the
 defragment case) against a temporary inode in the orphan list (as
 the initial patch did) so if there is a crash it will be released.
 The temporary inode will not be linked into the namespace so it cannot
 be read - only used to hold preallocation.  If this was a write-only
 file handle then we should be OK?
 
 For defragger purposes this would need:
 
 - allocate new temporary inode (VFS + fs, returns write-only fh if
fs can't properly handle uninitialized extents, or doesn't request
full-extent zeroing)
 
for each extent to defragment {
   - preallocate extents on temp inode (fs specific internals)
   - copy data from orig to temp at offset X (VFS, splice or
  e.g. sys_copyfile(src, dst, offset, count) which Linus agreed
  to at KS '05 for network filesystems)
   - migrate copied extent to original inode (fs specific internals)
}
 
 - free temporary inode (just close of temp fh, frees unmigrated extents).
  Yes, this sounds feasible. We could split the defrag ioctl into two
pieces (addition of given extent to a file and swapping of extents), which
can have generic interface... 
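
The cleanup property of the temporary-inode scheme quoted above can be sketched as a context manager: whatever is still attached to the temporary inode when it is closed goes back to the free pool, whether the loop finished or failed midway. TempInode and its methods are hypothetical names, extents are modelled as (start, length) pairs keyed by file offset, and the data copy step (splice or similar) is left out.

```python
class TempInode:
    """Unlinked scratch inode: any extent still attached at close is freed."""

    def __init__(self, free):
        self.free = free          # set of free physical block numbers
        self.extents = {}         # file offset -> (start, length)

    def preallocate(self, offset, start, length):
        blocks = set(range(start, start + length))
        if not blocks <= self.free:
            raise OSError("extent is not free")
        self.free.difference_update(blocks)
        self.extents[offset] = (start, length)

    def migrate(self, offset, orig_extents):
        """Hand the extent at `offset` to the original inode, free the old one."""
        old = orig_extents.get(offset)
        orig_extents[offset] = self.extents.pop(offset)
        if old is not None:
            self.free.update(range(old[0], old[0] + old[1]))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        for start, length in self.extents.values():   # unmigrated extents
            self.free.update(range(start, start + length))
        self.extents.clear()
        return False
```

Because the temporary inode is never linked into the namespace, a failure between preallocate() and migrate() leaks nothing and exposes no stale data.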

 I don't think this is much more work than implementing all of this
 functionality as part of a monolithic online defrag function, assuming
 we don't require full-file copies in order to do defrag.
  Yes, it's not more work than supporting swapping of extents in the
middle of the file. I've just not yet decided how to handle indirect
blocks in case of relocation in the middle of the file. Should they be
relocated or shouldn't they? Probably they should be relocated at least
in case they are fully contained in relocated interval or maybe better
said when all the blocks they reference to are also in the interval
(this handles also the case of EOF). But still if you would like to
relocate the file by parts this is not quite what you want (you won't be
able to relocate indirect blocks in the boundary of intervals) :(.
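
The containment rule just described, including the EOF case, can be written down directly. The sketch below assumes every bottom-level indirect block covers ptrs_per_block consecutive logical blocks after the direct ones; the function name and parameters are illustrative only.

```python
def relocatable_indirect_blocks(start, end, file_blocks, ptrs_per_block, direct=12):
    """Return the first logical block of each indirect block whose referenced
    blocks all lie in the relocated interval [start, end)."""
    result = []
    first = direct
    while first < file_blocks:
        last = min(first + ptrs_per_block, file_blocks)   # EOF shortens coverage
        if start <= first and last <= end:
            result.append(first)
        first += ptrs_per_block
    return result
```

With 4 pointers per block and a 22-block file, relocating [14, 22) moves the indirect blocks starting at 16 and 20 but not the one starting at 12, which straddles the interval boundary. That is the by-parts limitation mentioned above.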

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
 On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
  On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
   On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
   So why are you arguing that an interface is no good because it
   is fundamentally racy? ;)
  
  My point was that it is silly to introduce obviously racy code into the
  kernel, when -- inside the kernel -- it could be handled race-free.
 
 So how do you then get the generic interface to allocate blocks
 specified by userspace race free?

As has been repeatedly stated, there is no generic.  There MUST be
filesystem-specific knowledge during these operations.


 If userspace directed allocation requires deep knowledge of the
 filesystem metadata (this is what you are saying they need to do,
 right?), then these applications will never, ever make use of this
 interface and we'll continue to have problems with them.

Completely false assumptions.  There is no difference in handling of
knowledge, be it kernel space or userspace.


  Further, in the case being discussed in this thread, ext2meta has
  already been proven a workable solution.
 
 Sure, but that's not a generic solution to a problem common to
 all filesystems

You clearly don't know what I'm talking about.  ext2meta is an example
of a filesystem-specific metadata access method, applicable to tasks
such as online optimization.

Implement that tiny kernel module for each filesystem, and you have
everything you need, without races.  This was discussed years ago;
review the mailing lists.  Google for 'Alexander Viro' and 'ext2meta'.

Jeff





Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:
   Yes, this sounds feasible. We could split the defrag ioctl into two
 pieces (addition of given extent to a file and swapping of extents), which
 can have generic interface... 

An ioctl is UGLY.

This was discussed years ago.  Google for 'Alexander Viro' and
'ext2meta'.  That's a clean, flexible, extensible way to access metadata
online.  No need for ioctl binary translation across 32bit-64bit, or
any other ioctl issue.

Jeff





Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:
Yes, this sounds feasible. We could split the defrag ioctl into two
  pieces (addition of given extent to a file and swapping of extents), which
  can have generic interface... 
 
 An ioctl is UGLY.
  Agreed.

 This was discussed years ago.  Google for 'Alexander Viro' and
 'ext2meta'.  That's a clean, flexible, extensible way to access metadata
 online.  No need for ioctl binary translation across 32bit-64bit, or
 any other ioctl issue.
  I've briefly looked at this and this kind of interface has some
appeal. On the other hand it's not obvious to me, how to implement in
this interface *atomic* operation copy data from file F to given set of
blocks and rewrite pointers to original blocks with pointers to new
blocks. Something like this is needed for what we want to do...
Also if we'd like to implement operation like add this block to file F
at position P we have to make sure that all the necessary updates
(bitmap updates, inode updates, indirect block updates) go into one
transaction. Which basically means that either ext3meta has to have a way
how to do this in a single operation, or we have to give userspace a way
to start/stop transaction and that starts to be really a mess because of
various deadlocks and so on.
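
The all-or-nothing requirement can be illustrated with a toy transaction that snapshots the state and restores it if any step fails. This is a userspace model only: the dictionary layout and the names transaction and add_block_to_file are assumptions, and a real journal works very differently from a deep copy.

```python
import copy
from contextlib import contextmanager


@contextmanager
def transaction(state):
    """Apply every update made inside the block, or none of them."""
    snapshot = copy.deepcopy(state)
    try:
        yield state
    except Exception:
        state.clear()
        state.update(snapshot)     # roll back bitmap, inode and block map together
        raise


def add_block_to_file(state, ino, position, block):
    with transaction(state) as s:
        if block in s["bitmap"]:
            raise ValueError("block already allocated")
        s["bitmap"].add(block)                      # bitmap update
        s["inode_blocks"][ino] += 1                 # inode update
        if position in s["block_map"][ino]:
            raise ValueError("position already mapped")
        s["block_map"][ino][position] = block       # indirect block update
```

If the last step fails, the bitmap and inode changes made before it are undone as well, which is what a single kernel-side operation gives for free and what a userspace start/stop transaction interface would have to guarantee.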

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Wed, Oct 25, 2006 at 07:58:51PM +0200, Jan Kara wrote:
I've briefly looked at this and this kind of interface has some
  appeal. On the other hand it's not obvious to me, how to implement in
  this interface *atomic* operation copy data from file F to given set of
  blocks and rewrite pointers to original blocks with pointers to new
  blocks. Something like this is needed for what we want to do...
  Also if we'd like to implement operation like add this block to file F
  at position P we have to make sure that all the necessary updates
  (bitmap updates, inode updates, indirect block updates) go into one
  transaction. Which basically means that either ext3meta has to have a way
  how to do this in a single operation, or we have to give userspace a way
  to start/stop transaction and that starts to be really a mess because of
  various deadlocks and so on.
 
 Agreed, these issues exist.  But these issues exist independent of
 whether an ioctl or ext3meta is used.  It's all the responsibility
 of the implementor to define the interface.
 
 My contention is that ext3meta interface method would be much more
 robust than ioctl.  It's a namespace inside which you can define any
 inodes/dirents you wish, for the operations you desire.
  I see. So you mean that in our ext3meta filesystem we'd have a file
named add_this_extent_to_inode and a file reloc_inode_interval and
they'd be fed essentially the same info as the current ioctl interface and
do the same thing as we currently do. Hmm, I don't find it that nice any
more but yes, this would work.

 Heck, according to my sf.net/projects/gkernel CVS log, you offered
 some helpful review comments to me when I was implementing ext2meta ;-)
  Looking at those mails it was already quite some time ago so I
forgot about it  ;)
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 08:25:30PM +0200, Jan Kara wrote:
   I see. So you mean that in our ext3meta filesystem we'd have a file
 named add_this_extent_to_inode and a file reloc_inode_interval and
 they'd be fed essentially the same info as the current ioctl interface and
 do the same thing as we currently do. Hmm, I don't find it that nice any
 more but yes, this would work.

It depends on the operation.  ext2meta[1] works fine for online
defrag, just exporting metadata objects and providing read(2)
and write(2) operations on them.  Adding 'trigger' files (like your
add_this_extent_to_inode) may make sense for some operations, indeed,
but we need to see the whole picture before really understanding
whether that interface is optimal.

Jeff


[1] http://linux.yyz.us/misc/ext2meta.c


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Oct 23, 2006  18:03 +0200, Jan Kara wrote:
  Andreas Dilger wrote:
   I would in fact go so far as to allow only a single extent to be specified
   per call.  This is to avoid the passing of any pointers as part of the
   interface (hello ioctl police :-), and also makes the kernel code simpler.
   I don't think the syscall/ioctl overhead is significant compared to the
   journal and IO overhead.
 
  ...it makes it kind of
  harder to tell where indirect blocks would go - and it would be
  impossible for the defragmenter to force some unusual placement of
  indirect blocks...
 
 It would be possible to specify indirect block relocation in same manner
 as regular block relocation I think.  Allocate a new block, copy contents,
 flush block from cache, fix up reference (inode, dindirect), commit.
  Yes, but there's a question of the interface to this operation. How to
specify which indirect block I mean? Obviously we could introduce
separate call for remapping indirect blocks but I find this solution
kind of clumsy...

Bye
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 08:36:56PM +0200, Jan Kara wrote:
   Yes, but there's a question of the interface to this operation. How to
 specify which indirect block I mean? Obviously we could introduce
 separate call for remapping indirect blocks but I find this solution
 kind of clumsy...

Agreed...  that gets nasty real quick.

Jeff





Re: [RFC] Ext3 online defrag

2006-10-25 Thread David Chinner
On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
 On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
  On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
   On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
So why are you arguing that an interface is no good because it
is fundamentally racy? ;)
   
   My point was that it is silly to introduce obviously racy code into the
   kernel, when -- inside the kernel -- it could be handled race-free.
  
  So how do you then get the generic interface to allocate blocks
  specified by userspace race free?
 
 As has been repeatedly stated, there is no generic.  There MUST be
 filesystem-specific knowledge during these operations.

What information? All we need to know is where the free disk space
is, and have a method to attempt to allocate from it. That's _easy_
to abstract into a common interface via the VFS

   Further, in the case being discussed in this thread, ext2meta has
   already been proven a workable solution.
  
  Sure, but that's not a generic solution to a problem common to
  all filesystems
 
 You clearly don't know what I'm talking about.  ext2meta is an example
 of a filesystem-specific metadata access method, applicable to tasks
 such as online optimization.

I know exactly what ext2meta is. I said it's not a generic solution
and you say it's a filesystem-specific solution.  I think we're
agreeing here. ;)

We don't need to expose anything filesystem specific to userspace to
implement this.  Online data movement (i.e. the defrag mechanism)
becomes something like:

    do {
        get_free_list(dst_fd, location, len, list);
        /* select extent X from the list to use */
        error = alloc_from_list(dst_fd, list[X], off, len);
    } while (error == ENOALLOC);
    move_data(src_fd, dst_fd, off, len);

And this would work on any filesystem type that implemented these
interfaces. Hence tools like a startup file optimiser would
only need to be written once, rather than needing a different
tool for every different filesystem type.
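
The query-then-allocate loop above can be sketched against a toy in-memory bitmap. Everything here is an assumption made for illustration: the `toy_*` functions, the `ENOALLOC` value, and the `racer` callback that simulates another process grabbing a block between the two steps. None of these are existing kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 32
#define ENOALLOC 1  /* placeholder error code, mirroring the pseudocode */

static unsigned char in_use[NBLOCKS];  /* toy allocation bitmap */

/* Find the first free run of `len` blocks at or after `from`.
 * Returns its start or -1. The answer may be stale when it is used. */
int toy_get_free(int from, int len)
{
    for (int start = from; start + len <= NBLOCKS; start++) {
        int run = 0;
        while (run < len && !in_use[start + run])
            run++;
        if (run == len)
            return start;
    }
    return -1;
}

/* Try to claim [start, start+len); fails if any block was taken meanwhile. */
int toy_alloc(int start, int len)
{
    for (int i = 0; i < len; i++)
        if (in_use[start + i])
            return ENOALLOC;
    for (int i = 0; i < len; i++)
        in_use[start + i] = 1;
    return 0;
}

/* Simulated concurrent allocator used to provoke one lost race. */
void toy_steal_block0(void)
{
    in_use[0] = 1;
}

/* Query, then allocate, retrying when the free list turned out stale.
 * `racer` may be NULL. Returns the start block claimed, or -1. */
int toy_alloc_retry(int len, void (*racer)(void))
{
    for (;;) {
        int start = toy_get_free(0, len);
        if (start < 0)
            return -1;          /* no suitable free space at all */
        if (racer) {
            racer();
            racer = NULL;       /* interfere only once */
        }
        if (toy_alloc(start, len) == 0)
            return start;       /* claimed; data can now be moved */
        /* lost the race: refresh the free list and try again */
    }
}
```

The point of the sketch is that a stale free list costs one extra iteration rather than correctness, because the allocation step itself rejects blocks that are no longer free.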

Remember, I'm not just talking about defrag - I'm talking about
an interface that is actually useful to apps that might care
about how data is laid out on disk but the application writers
don't know anything about how filesystem X or Y or Z is
implemented. Putting the burden of learning about filesystem
internals on application developers is not the correct solution.

I see substantial benefit moving forward from having filesystem
independent interfaces. Many features that filesystems implement
are common, and as time goes on the common feature set of the
different filesystems gets larger. So why shouldn't we be
trying to make common operations generic so that every filesystem
can benefit from the latest and greatest tool?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Theodore Tso
On Thu, Oct 26, 2006 at 11:40:20AM +1000, David Chinner wrote:
 We don't need to expose anything filesystem specific to userspace to
 implement this.  Online data movement (i.e. the defrag mechanism)
 becomes something like:
 
   do {
   get_free_list(dst_fd, location, len, list)
   /* select extent to use */
   alloc_from_list(dst_fd, list[X], off, len)
   } while (ENOALLOC)
   move_data(src_fd, dst_fd, off, len);
 
 And this would work on any filesystem type that implemented these
 interfaces. Hence tools like a startup file optimiser would
 only need to be written once, rather than needing a different
 tool for every different filesystem type.

Yeah, but that's simply not enough.  A good defragger needs to know
about a filesystem's allocation policies, and move files so they are
optimally located, given the filesystem layout.  For example, in
ext2/3/4 we will want to move blocks so they are in the same block group
as the inode.  That's filesystem specific information; other
filesystems will require different policies.
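
As a concrete illustration of the ext2/3 locality rule above, the block group of an inode and of a data block can be computed from superblock geometry. The sketch below uses the standard ext2 formulas; the `ext_geom` structure and the helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Geometry values as stored in the ext2/ext3 superblock. */
struct ext_geom {
    uint32_t first_data_block;  /* s_first_data_block */
    uint32_t blocks_per_group;  /* s_blocks_per_group */
    uint32_t inodes_per_group;  /* s_inodes_per_group */
};

/* Inode numbers start at 1, hence the subtraction. */
uint32_t group_of_inode(const struct ext_geom *g, uint32_t ino)
{
    return (ino - 1) / g->inodes_per_group;
}

uint32_t group_of_block(const struct ext_geom *g, uint32_t blk)
{
    return (blk - g->first_data_block) / g->blocks_per_group;
}

/* Returns 1 if the data block lives in the same group as its inode. */
int block_is_local(const struct ext_geom *g, uint32_t ino, uint32_t blk)
{
    return group_of_inode(g, ino) == group_of_block(g, blk);
}
```

A defragmenter following this policy would treat blocks for which `block_is_local()` returns 0 as candidates for relocation; other filesystems would need a different predicate.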

 Remember, I'm not just talking about defrag - I'm talking about
 an interface that is actually useful to apps that might care
 about how data is laid out on disk but the application writers
 don't know anything about how filesystem X or Y or Z is
 implemented. Putting the burden of learning about filesystem
 internals on application developers is not the correct solution.

Unfortunately, a defragger *has* to know about some very low-level,
filesystem-specific information if it wants to do a good job.

- Ted


Re: [RFC] Ext3 online defrag

2006-10-24 Thread David Chinner
On Tue, Oct 24, 2006 at 12:14:33AM -0400, Jeff Garzik wrote:
 On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote:
  isn't that a kernel responsibility to find/allocate target blocks?
  wouldn't it be better to specify desirable target group and minimal
  acceptable chunk of free blocks?
 
 The kernel doesn't have enough knowledge to know whether or not the
 defragger prefers one blkdev location over another.
 
 When you are trying to consolidate blocks, you must specify the
 destination as well as source blocks.
 
 Certainly, to prevent corruption and other nastiness, you must fail if
 the destination isn't available...

That's the wrong way to look at it. if you want the userspace
process to specify a location, then you should preallocate it first
before doing anything else. There is no need to clutter a simple
data mover interface with all sorts of unnecessary error handling.

Once you've separated the destination allocation from the data
mover, the mover is basically a splice copy from source to
destination, an fsync and then an atomic swap blocks/extents operation.
Most of this code is generic, and a per-fs swap-extents vector
could be easily provided for the one bit that is not.

The allocation interface, OTOH, is anything but simple and is really
a filesystem specific interface. Seems logical to me to separate
the two. 

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-24 Thread Eric Sandeen
David Chinner wrote:
 The allocation interface, OTOH, is anything but simple and is really
 a filesystem specific interface. Seems logical to me to separate
 the two. 

And ext[234] preallocation would be a very nice feature in its own right.

-Eric


Re: [RFC] Ext3 online defrag

2006-10-24 Thread David Chinner
On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote:
 On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote:
  On Tue, Oct 24, 2006 at 12:14:33AM -0400, Jeff Garzik wrote:
   On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote:
isn't that a kernel responsibility to find/allocate target blocks?
wouldn't it be better to specify desirable target group and minimal
acceptable chunk of free blocks?
   
   The kernel doesn't have enough knowledge to know whether or not the
   defragger prefers one blkdev location over another.
   
   When you are trying to consolidate blocks, you must specify the
   destination as well as source blocks.
   
   Certainly, to prevent corruption and other nastiness, you must fail if
   the destination isn't available...
  
  That's the wrong way to look at it. if you want the userspace
  process to specify a location, then you should preallocate it first
  before doing anything else. There is no need to clutter a simple
  data mover interface with all sorts of unnecessary error handling.
 
 You are implying that the 2-step interface, creating a new inode then
 swapping the contents, is the only way to implement this.

No, it's not the only way to implement it, but it seems the cleanest
way to me when you have to consider crash recovery. With a temporary
inode, you can create it, hold a reference and then unlink it so
that any crash at that point will free the inode and any extents
it has on it.
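
The create, hold a reference, unlink sequence described above maps onto ordinary POSIX calls. The sketch below shows only the userspace half: the file has no name once `unlink()` returns, but stays usable through the descriptor. The `/tmp` location and the helper names are arbitrary choices for this example, and reclaiming such an inode after a crash is left to the filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a temporary file, then unlink it while keeping the descriptor.
 * Returns the open descriptor, or -1 on error. */
int open_unlinked_temp(void)
{
    char path[] = "/tmp/defrag-XXXXXX";  /* arbitrary location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (unlink(path) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Link count of an open file, or -1 on error. */
long link_count(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}
```

After `open_unlinked_temp()` succeeds, the blocks written through the descriptor belong to a file that no other process can open by name, and they are released when the last descriptor is closed.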

The only way I can see anything different working is having the
filesystem hold extents somewhere internally that provides us the
same recovery guarantees while we copy the data and insert the new
extents.  This is obviously a filesystem specific solution and is
more complex to implement than a swap extent transaction. It
probably also needs on-disk format changes to support properly.

  Once you've separated the destination allocation from the data
  mover, the mover is basically a splice copy from source to
  destination, an fsync and then an atomic swap blocks/extents operation.
  Most of this code is generic, and a per-fs swap-extents vector
  could be easily provided for the one bit that is not
 
 The benefit of having such a simple data mover is negated by moving the
 complexity into the allocator.

What complexity does it introduce that the allocator doesn't already
have or needs to provide for the single call interface to work?

 A single interface that would move a part of a file at a time has the
 advantage that a large file which is only fragmented in a few areas does
 not need to be completely moved.

And the two-step process can do exactly this as well - splice can
work on any offset within the file...

  The allocation interface, OTOH, is anything but simple and is really
  a filesystem specific interface. Seems logical to me to separate
  the two. 
 
 So what then is the benefit of having a simple generic data mover if
 every file system needs to implement its own interface to allocate a
 copy of the data?

I assume you meant allocate the space to store the copy of the data.

The allocation interface needs to be able to be extended
independently of the data mover interface. XFS already exposes
allocation ioctls to userspace for preallocation and we've got plans
to extend this further to allow userspace controlled allocation for
smart defrag tools for XFS. Tying allocation to the data mover
just makes the interface less flexible and harder to do anything
smart with.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-24 Thread Dave Kleikamp
On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote:
 On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote:
  On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote:
   That's the wrong way to look at it. if you want the userspace
   process to specify a location, then you should preallocate it first
   before doing anything else. There is no need to clutter a simple
   data mover interface with all sorts of unnecessary error handling.
  
  You are implying that the 2-step interface, creating a new inode then
  swapping the contents, is the only way to implement this.
 
 No, it's not the only way to implement it, but it seems the cleanest
 way to me when you have to consider crash recovery. With a temporary
 inode, you can create it, hold a reference and then unlink it so
 that any crash at that point will free the inode and any extents
 it has on it.
 
 The only way I can see anything different working is having the
 filesystem hold extents somewhere internally that provides us the
 same recovery guarantees while we copy the data and insert the new
 extents.  This is obviously a filesystem specific solution and is
 more complex to implement than a swap extent transaction. it
 probably also needs on disk format changes to support properly

This is definitely filesystem-dependent.  I would think allocating an
extent would be like any other allocation done by the filesystem, and
there are already recovery mechanisms for that.

   Once you've separated the destination allocation from the data
   mover, the mover is basically a splice copy from source to
   destination, an fsync and then an atomic swap blocks/extents operation.
   Most of this code is generic, and a per-fs swap-extents vector
   could be easily provided for the one bit that is not
  
  The benefit of having such a simple data mover is negated by moving the
  complexity into the allocator.
 
 What complexity does it introduce that the allocator doesn't already
 have or needs to provide for the single call interface to work?

I don't see it as any more or less complex than a single interface.

  A single interface that would move a part of a file at a time has the
  advantage that a large file which is only fragmented in a few areas does
  not need to be completely moved.
 
 And the two-step process can do exactly this as well - splice can
 work on any offset within the file...

I wasn't aware of that.  That makes your proposal sound a lot better.

   The allocation interface, OTOH, is anything but simple and is really
   a filesystem specific interface. Seems logical to me to separate
   the two. 
  
  So what then is the benefit of having a simple generic data mover if
  every file system needs to implement its own interface to allocate a
  copy of the data?
 
 I assume you meant allocate the space to store the copy of the data.

Yeah.

 The allocation interface needs to be able to be extended
 independently of the data mover interface. XFS already exposes
 allocation ioctls to userspace for preallocation and we've got plans
 to extend this further to allow userspace controlled allocation for
 smart defrag tools for XFS. Tying allocation to the data mover
 just makes the interface less flexible and harder to do anything
 smart with

Okay.  It would be nice to standardize the interface so we don't have
every filesystem introducing new ioctls.

 Cheers,
 
 Dave.
-- 
David Kleikamp
IBM Linux Technology Center



Re: [RFC] Ext3 online defrag

2006-10-24 Thread Theodore Tso
On Tue, Oct 24, 2006 at 11:59:28PM +1000, David Chinner wrote:
 That's the wrong way to look at it. if you want the userspace
 process to specify a location, then you should preallocate it first
 before doing anything else. There is no need to clutter a simple
 data mover interface with all sorts of unnecessary error handling.

This is doable, but it adds a huge amount of complexity before we
could implement on-line defragmentation.

First of all, we would need a way of allowing userspace to specify
which blocks should be used in the preallocation.

Secondly, we would need a way of marking blocks as preallocated but
not pre-zeroed; otherwise we would have to zero out all of the blocks
in order to assure security (don't want userspace programs seeing the
previous contents of the data blocks), only to do the copy and the
extents vector swap.

That's a huge amount of work, and while the above two features can be
useful for other things, it's not clear it's worth it to require this
as the only way to implement on-line defragging.  You're right that
it's a way of making things be more generic, but it means that each
filesystem needs to have a huge amount of additional complexity and
potential filesystem format changes before they could take advantage
of this general framework.  

(For example, you'd never be able to do this with the FAT filesystem,
or the ext2 or ext3 filesystems; it would work for ext4 only *after*
we implement the above mentioned new features and the associated
filesystem format changes.)

Regards,

- Ted


Re: [RFC] Ext3 online defrag

2006-10-24 Thread Russell Cattelan
On Tue, 2006-10-24 at 15:44 -0400, Theodore Tso wrote:
 On Tue, Oct 24, 2006 at 11:59:28PM +1000, David Chinner wrote:
  That's the wrong way to look at it. if you want the userspace
  process to specify a location, then you should preallocate it first
  before doing anything else. There is no need to clutter a simple
  data mover interface with all sorts of unnecessary error handling.
 
 This is doable, but it adds a huge amount of complexity before we
 could implement on-line defragmentation.
 
 First of all, we would need a way of allowing userspace to specify
 which blocks should be used in the preallocation.
 
 Secondly, we would need a way of marking blocks as preallocated but
 not pre-zeroed; otherwise we would have to zero out all of the blocks
 in order to assure security (don't want userspace programs seeing the
 previous contents of the data blocks), only to do the copy and the
 extents vector swap

Shouldn't Chris Mason's page placeholder work for direct I/O be
applicable to any preallocations?


-- 
Russell Cattelan [EMAIL PROTECTED]




Re: [RFC] Ext3 online defrag

2006-10-24 Thread David Chinner
On Tue, Oct 24, 2006 at 11:26:26AM -0500, Dave Kleikamp wrote:
 On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote:
  On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote:
   On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote:
That's the wrong way to look at it. if you want the userspace
process to specify a location, then you should preallocate it first
before doing anything else. There is no need to clutter a simple
data mover interface with all sorts of unnecessary error handling.
   
   You are implying that the 2-step interface, creating a new inode then
   swapping the contents, is the only way to implement this.
  
  No, it's not the only way to implement it, but it seems the cleanest
  way to me when you have to consider crash recovery. With a temporary
  inode, you can create it, hold a reference and then unlink it so
  that any crash at that point will free the inode and any extents
  it has on it.
  
  The only way I can see anything different working is having the
  filesystem hold extents somewhere internally that provides us the
  same recovery guarantees while we copy the data and insert the new
  extents.  This is obviously a filesystem specific solution and is
  more complex to implement than a swap extent transaction. it
  probably also needs on disk format changes to support properly
 
 This is definitely filesystem-dependent.  I would think allocating an
 extent would be like any other allocation done by the filesystem, and
 there are already recovery mechanisms for that.

Yes, the allocation would be the same, but that isn't the problem
I was talking about.

The problem is holding a reference to the extent once it has been
allocated while it is having the data copied into it (i.e. before it
is swapped with the original extents) and then holding the original
extents until they are freed.  These references need to be
persistent so they can be freed correctly during crash recovery
i.e. rollback the allocation if the extent swap has not been
logged, or free the original blocks if the extent swap has been
logged.

The obvious way to do this is to use an unlinked (orphan) inode
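
The recovery rule in the paragraph above comes down to one decision per in-flight relocation. A minimal sketch, assuming a hypothetical persistent record (the `reloc_record` layout and field names are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

struct extent {
    uint64_t start;  /* first physical block */
    uint32_t len;    /* length in blocks */
};

/* Hypothetical persistent record of one in-flight relocation. */
struct reloc_record {
    struct extent original;     /* blocks the file used before the move */
    struct extent replacement;  /* newly allocated destination blocks */
    int swap_logged;            /* nonzero once the swap reached the log */
};

/* Decide which extent crash recovery must release:
 * the new one if the swap never happened, the old one if it did. */
struct extent extent_to_free(const struct reloc_record *r)
{
    return r->swap_logged ? r->original : r->replacement;
}
```

Either way exactly one of the two extents stays attached to the file, so no blocks are leaked and none are freed while still referenced.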

Once you've separated the destination allocation from the data
mover, the mover is basically a splice copy from source to
destination, an fsync and then an atomic swap blocks/extents operation.
Most of this code is generic, and a per-fs swap-extents vector
could be easily provided for the one bit that is not
   
   The benefit of having such a simple data mover is negated by moving the
   complexity into the allocator.
  
  What complexity does it introduce that the allocator doesn't already
  have or needs to provide for the single call interface to work?
 
 I don't see it as any more or less complex than a single interface.

Ok, I thought I was missing something there.

  The allocation interface needs to be able to be extended
  independently of the data mover interface. XFS already exposes
  allocation ioctls to userspace for preallocation and we've got plans
  to extend this further to allow userspace controlled allocation for
  smart defrag tools for XFS. Tying allocation to the data mover
  just makes the interface less flexible and harder to do anything
  smart with
 
 Okay.  It would be nice to standardize the interface so we don't have
 every filesystem introducing new ioctls.

Well, that will be an interesting challenge. I'm sure that there
is a common subset that all filesystems can implement e.g. per
file preallocation (something like XFS's allocate/reserve/free space
ioctls) to provide kernel support for posix_fallocate(), etc.

However, we may end up exposing enough of XFS's current allocation
semantics to do things like telling the filesystem to allocate in
allocation group 6, near block number 0x32482 within the AG, falling
back to searching for the nearest match to the size requirement,
failing that look for something larger than the minimum size
specified, and then fail if you can't find a match in that AG.
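
To make the fallback sequence above concrete, here is a toy search over the free extents of a single allocation group. It is one possible reading of the description, not XFS's allocator: the three steps, the `free_ext` structure, and `pick_extent()` are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct free_ext {
    uint32_t start;  /* first free block, relative to the group */
    uint32_t len;    /* number of free blocks */
};

/* Pick a free extent within one allocation group.
 * Returns an index into `fe`, or -1 if nothing suitable exists. */
int pick_extent(const struct free_ext *fe, size_t n,
                uint32_t near, uint32_t want, uint32_t minlen)
{
    int best = -1;

    /* Step 1: a large-enough free extent that contains the hint block. */
    for (size_t i = 0; i < n; i++)
        if (fe[i].len >= want && near >= fe[i].start &&
            near - fe[i].start < fe[i].len)
            return (int)i;

    /* Step 2: the large-enough extent closest in size to the request. */
    for (size_t i = 0; i < n; i++)
        if (fe[i].len >= want && (best < 0 || fe[i].len < fe[best].len))
            best = (int)i;
    if (best >= 0)
        return best;

    /* Step 3: settle for the largest extent that meets the minimum. */
    for (size_t i = 0; i < n; i++)
        if (fe[i].len >= minlen && (best < 0 || fe[i].len > fe[best].len))
            best = (int)i;

    return best;  /* -1: fail rather than look in another group */
}
```

Parameters such as "which group" and "near which block" only mean something if the filesystem is organised into groups addressed this way, which is the portability concern raised in the surrounding discussion.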

That makes little sense to any filesystem but XFS, which is really
why I think that the smarter allocation interfaces are going to
remain filesystem specific.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] Ext3 online defrag

2006-10-24 Thread David Chinner
On Tue, Oct 24, 2006 at 03:44:16PM -0400, Theodore Tso wrote:
 On Tue, Oct 24, 2006 at 11:59:28PM +1000, David Chinner wrote:
  That's the wrong way to look at it. if you want the userspace
  process to specify a location, then you should preallocate it first
  before doing anything else. There is no need to clutter a simple
  data mover interface with all sorts of unnecessary error handling.
 
 This is doable, but it adds a huge amount of complexity before we
 could implement on-line defragmentation.
 
 First of all, we would need a way of allowing userspace to specify
 which blocks should be used in the preallocation.

Not initially. Create a file, and call posix_fallocate() on it.
Later, the filesystem can provide something that the defrag tool can
use for fine-grained control of where the preallocated blocks are on
disk.

 Secondly, we would need a way of marking blocks as preallocated but
 not pre-zeroed; otherwise we would have to zero out all of the blocks
 in order to assure security (don't want userspace programs seeing the
 previous contents of the data blocks), only to do the copy and the
 extents vector swap.

The unlinked inode method avoids this problem because no user space
process can see the inode to open it. Also, posix_fallocate() zeroes
the disk blocks so even this protects against data exposure.
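
A minimal userspace sketch of this initial approach: create an unlinked temporary file, preallocate it with `posix_fallocate()`, and check that the preallocated range reads back as zeros. The helper names and the `/tmp` path are arbitrary, and where the blocks end up on disk is entirely up to the filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create an unlinked temporary file with `len` bytes preallocated.
 * Returns the open descriptor, or -1 on error. */
int prealloc_temp(off_t len)
{
    char path[] = "/tmp/prealloc-XXXXXX";  /* arbitrary location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);  /* no other process can open it by name now */
    /* posix_fallocate() returns an error number, not -1, on failure. */
    if (posix_fallocate(fd, 0, len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Returns 1 if the first `len` bytes of `fd` read back as zero. */
int reads_as_zero(int fd, size_t len)
{
    unsigned char buf[4096];
    size_t done = 0;

    while (done < len) {
        size_t want = len - done < sizeof buf ? len - done : sizeof buf;
        ssize_t n = pread(fd, buf, want, (off_t)done);
        if (n <= 0)
            return 0;
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] != 0)
                return 0;
        done += (size_t)n;
    }
    return 1;
}
```

The zero check is what matters for the security concern raised earlier in the thread: stale contents of the newly allocated blocks are never visible through the file.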

So, now all that remains for an initial implementation is the swap
extents transaction and the data mover syscall.

For a smart, fast implementation, I agree that you need unwritten
extents (which XFS already has), then a fast filesystem
implementation of posix_fallocate() that utilises unwritten extents
(which XFS already has), and finally another interface that allows
you to allocate unwritten extents in an arbitrary location within
the filesystem (which no filesystem currently has).

 That's a huge amount of work, and while the above two features can be
 useful for other things, it's not clear it's worth it to require this
 as the only way to implement on-line defragging.  You're right that
 it's a way of making things be more generic, but it means that each
 filesystem needs to have a huge amount of additional complexity and
 potential filesystem format changes before they could take advantage
 of this general framework.  

I disagree - it's not a huge amount of work to get something
working and to solidify the generic interfaces, and the only format
change is a new transaction. Any filesystem that supports the swap
extent/blocks method would then work better than XFS's current
online defrag tool, which currently does not use preallocation,
nor does it use splice.

 (For example, you'd never be able to do this with the FAT filesystem,
 or the ext2 or ext3 filesystems; it would work for ext4 only *after*
 we implement the above mentioned new features and the associated
 filesystem format changes.)

Sure, but they can use the slow, unoptimised posix_fallocate() method
for allocating disk space.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


RE: [RFC] Ext3 online defrag

2006-10-24 Thread Barry Naujok
 

On Wed, 25 Oct 2006 11:19 AM, David Chinner wrote:
 On Tue, Oct 24, 2006 at 11:26:26AM -0500, Dave Kleikamp wrote:
  On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote:
   On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote:
   The allocation interface needs to be able to be extended
   independently of the data mover interface. XFS already exposes
   allocation ioctls to userspace for preallocation and we've got plans
   to extend this further to allow userspace controlled allocation for
   smart defrag tools for XFS. Tying allocation to the data mover
   just makes the interface less flexible and harder to do anything
   smart with
  
  Okay.  It would be nice to standardize the interface so we don't have
  every filesystem introducing new ioctls.
 
 Well, that will be an interesting challenge. I'm sure that there
 is a common subset that all filesystems can implement e.g. per
 file preallocation (something like XFS's allocate/reserve/free space
 ioctls) to provide kernel support for posix_fallocate(), etc.
 
 However, we may end up exposing enough of XFS's current allocation
 semantics to do things like telling the filesystem to allocate in
 allocation group 6, near block number 0x32482 within the AG, falling
 back to searching for the nearest match to the size requirement,
 failing that look for something larger than the minimum size
 specified, and then fail if you can't find a match in that AG.
 
 That makes little sense to any filesystem but XFS, which is really
 why I think that the smarter allocation interfaces are going to
 remain filesystem specific.

Could we have a more abstract method for asking the filesystem where the 
free blocks are and then using the same block addressing to tell the
fs where to allocate/move the file's data to?



Re: [RFC] Ext3 online defrag

2006-10-24 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 12:30:02PM +1000, Barry Naujok wrote:
 Could we have a more abstract method for asking the filesystem where the 
 free blocks are and then using the same block addressing to tell the
 fs where to allocate/move the file's data to?

That's fundamentally racy, so you might as well just read the
filesystem metadata from userspace.  No need to go through the kernel
for that.

Jeff





Re: [RFC] Ext3 online defrag

2006-10-24 Thread David Chinner
On Tue, Oct 24, 2006 at 10:42:57PM -0400, Jeff Garzik wrote:
 On Wed, Oct 25, 2006 at 12:30:02PM +1000, Barry Naujok wrote:
  Could we have a more abstract method for asking the filesystem where the 
  free blocks are and then using the same block addressing to tell the
  fs where to allocate/move the file's data to?
 
 That's fundamentally racy, so you might as well just read the
 filesystem metadata from userspace.  No need to go through the kernel
 for that.

But it is a race that is _easily_ handled, and applications only need to
implement one interface, not a different method for every
filesystem that requires deep filesystem knowledge.

Besides, you still have to handle the case where the block you want
has already been allocated because reading the metadata from
userspace doesn't prevent the kernel from allocating the block you
want before you ask for it...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] Ext3 online defrag

2006-10-24 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:
 But it is a race that is _easily_ handled, and applications only need to
 implement one interface, not a different method for every
 filesystem that requires deep filesystem knowledge.
 
 Besides, you still have to handle the case where the block you want
 has already been allocated because reading the metadata from
 userspace doesn't prevent the kernel from allocating the block you
 want before you ask for it...

The race is easily handled either way, by having the block move fail
when you tell the kernel the destination blocks.
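
A minimal sketch of that failure mode, using a toy in-memory block map
rather than any real ext3 structure (struct toy_fs and
toy_claim_extent() are made-up names). The check and the claim happen
in one call, so a stale view of free space in userspace only costs a
retry.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCKS 64   /* size of the toy filesystem, in blocks */

/* Toy allocation state: in_use[b] is true when block b is allocated. */
struct toy_fs {
    bool in_use[TOY_BLOCKS];
};

/* Try to claim blocks [start, start + len) as a move destination.
 * A destination that was allocated after userspace looked at the free
 * map makes the call fail with -EBUSY and leaves the map untouched.
 * A real kernel would hold the allocator lock around both loops. */
int toy_claim_extent(struct toy_fs *fs, uint64_t start, uint64_t len)
{
    assert(fs != NULL);

    if (len == 0 || start >= TOY_BLOCKS || len > TOY_BLOCKS - start)
        return -EINVAL;

    for (uint64_t b = start; b < start + len; b++)
        if (fs->in_use[b])
            return -EBUSY;

    for (uint64_t b = start; b < start + len; b++)
        fs->in_use[b] = true;
    return 0;
}
```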

The difference is that you don't unnecessarily bloat the kernel.

Every major filesystem has a libfoofs library that makes it trivial to
read the metadata, so all you need to do is use an existing lib.

Jeff





Re: [RFC] Ext3 online defrag

2006-10-24 Thread David Chinner
On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
 On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:
  But it is a race that is _easily_ handled, and applications only need to
  implement one interface, not a different method for every
  filesystem that requires deep filesystem knowledge.
  
  Besides, you still have to handle the case where the block you want
  has already been allocated because reading the metadata from
  userspace doesn't prevent the kernel from allocating the block you
  want before you ask for it...
 
 The race is easily handled either way, by having the block move fail
 when you tell the kernel the destination blocks.

So why are you arguing that an interface is no good because it
is fundamentally racy? ;)

 The difference is that you don't unnecessarily bloat the kernel.

By that argument, we should rip out the bmap interface (FIBMAP)
because you can get all that information by reading the metadata
from userspace.

 Every major filesystem has a libfoofs library that makes it trivial to
 read the metadata, so all you need to do is use an existing lib.

IOWs, you are advocating that any application that wants to use this
special allocation technique needs to link against every different
filesystem library and it then needs to implement filesystem
specific searches through their metadata?  Nobody in their right
mind would ever want to use an interface like this.

Also, this simply doesn't work for XFS because the cached metadata
is in a different address space to the block device. Hence it can be
tens of seconds between the kernel modifying a metadata buffer and
userspace being able to see that modification. You need to freeze
the filesystem for the XFS userspace tools to guarantee a
consistent view of an online filesystem from the block device.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Theodore Tso
On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
   Hello,
 
   I've written a simple patch implementing ext3 ioctl for file
 relocation. Basically you call ioctl on a file, give it list of blocks
 and it relocates the file into given blocks (provided they are still
 free). The idea is to use it as a kernel part of ext3 online
 defragmenter (or generally disk access optimizer). Now I don't have the
 user space part that finds larger runs of free blocks and so on so that
 it can really be used as a defragmenter. I just send this as a kind of
 proof-of-concept to hear some comments. Attached is also a simple
 program that demonstrates the use of the ioctl.

As a suggestion, I would pass the inode number and inode generation
number into the ext3_file_move_data array:

struct ext3_file_move_data {
  int extents;
  struct ext3_reloc_extent __user *ext_array;
};

This will be much more efficient for the userspace relocator, since it
won't need to translate from an inode number to a pathname, and then
try to open the file before relocating it.

I'd also use an explicit 64-bit block numbers type so that we don't
have to worry about the ABI changing when we support 64-bit block
numbers.
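
A rough sketch of what such a request could look like once the inode
number, generation, and 64-bit block numbers are added. The struct and
field names below are hypothetical and are not taken from the posted
patch; the widths are assumptions chosen so that the layout is the same
for 32-bit and 64-bit callers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical extent descriptor with explicit 64-bit block numbers. */
struct reloc_extent_sketch {
    uint64_t logical;   /* first logical block within the file */
    uint64_t physical;  /* first destination block on disk */
    uint64_t len;       /* number of blocks */
};

/* Hypothetical request header. Every field has a fixed width and the
 * user pointer is carried as a 64-bit integer. */
struct file_move_sketch {
    uint64_t ino;         /* inode number of the file to relocate */
    uint32_t generation;  /* inode generation, guards against inode reuse */
    uint32_t extents;     /* number of entries behind ext_array */
    uint64_t ext_array;   /* user address of struct reloc_extent_sketch[] */
};

/* The sizes do not depend on the pointer width of the caller. */
static_assert(sizeof(struct reloc_extent_sketch) == 24, "fixed layout");
static_assert(sizeof(struct file_move_sketch) == 24, "fixed layout");
```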


The other problem I see with this patch is that there will be cache
coherency problems between the buffer cache and the page cache.  I
think you will want to pull the data blocks of the file into the page
cache, and then write them out from the page cache, and only *then*
update the indirect blocks and commit the transaction.

So what needs to happen is the following:

1) Validate the inode and generation number.  Make sure the new
(destination) blocks passed in are valid and not in use.  Allocate
them to prevent anyone else from using those blocks.

2) Pull the blocks into the page cache (if they are not already
there), and then write them out to the new location on disk.  If any of
the I/O's fail, abort.

3) Update the indirect blocks or extent tree to point at the newly
allocated and copied data blocks.
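
The three steps above can be sketched against a toy in-memory disk.
Everything here (struct toy_disk, struct toy_inode, toy_relocate()) is
a made-up model for illustration, with memcpy() standing in for the
page cache I/O; it is not the code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCKS   32
#define TOY_BLKSIZE  16
#define TOY_MAXMAP   8

struct toy_disk {
    bool in_use[TOY_BLOCKS];
    unsigned char data[TOY_BLOCKS][TOY_BLKSIZE];
};

struct toy_inode {
    uint32_t generation;
    uint32_t nblocks;
    uint32_t map[TOY_MAXMAP];   /* logical block -> physical block */
};

/* Relocate every block of `ino` to dst[0..nblocks). */
int toy_relocate(struct toy_disk *d, struct toy_inode *ino,
                 uint32_t generation, const uint32_t *dst)
{
    uint32_t i, j;

    assert(d != NULL && ino != NULL && dst != NULL);

    /* Step 1: validate the generation and the destination blocks, then
     * allocate them so nobody else can take them. */
    if (ino->generation != generation)
        return -ESTALE;
    if (ino->nblocks > TOY_MAXMAP)
        return -EINVAL;
    for (i = 0; i < ino->nblocks; i++) {
        if (dst[i] >= TOY_BLOCKS)
            return -EINVAL;
        if (d->in_use[dst[i]])
            return -EBUSY;
        for (j = 0; j < i; j++)
            if (dst[j] == dst[i])
                return -EINVAL;
    }
    for (i = 0; i < ino->nblocks; i++)
        d->in_use[dst[i]] = true;

    /* Step 2: copy the data to the new location. Real I/O can fail; at
     * that point the destination blocks would be released and the call
     * aborted, with the old mapping still intact. */
    for (i = 0; i < ino->nblocks; i++)
        memcpy(d->data[dst[i]], d->data[ino->map[i]], TOY_BLKSIZE);

    /* Step 3: only now switch the block pointers and free the old
     * blocks. */
    for (i = 0; i < ino->nblocks; i++) {
        d->in_use[ino->map[i]] = false;
        ino->map[i] = dst[i];
    }
    return 0;
}
```

Because the mapping is switched last, a failure in step 1 or step 2
leaves the file pointing at its original, still valid blocks.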

In the current patch, it looks like you add the inode being relocated
to the orphan list, and then update the direct/indirect blocks first
--- and if you fail the inode gets truncated.  That's bad since we
don't want to lose any data if we crash in the middle of the defrag
operation.

Great to see that you're working on this problem!  I'd love to see
this functionality into ext4.

Regards,

- Ted

P.S.  There is also the question of whether we'll be able to get this
interface past the ioctl() police, but the atomicity requirements of
such an interface are a poster child for why we really, REALLY, can't
do this via a sysfs interface.  We might be forced to create a new
filesystem, or create a pseudo inode which we open via a magic
pathname, though.  That in my opinion is uglier than an ioctl, but the
ioctl police really don't like the problem of needing to maintain
32/64 bit translation functions, and this interface would surely cause
problems for the x86_64 and PPC platforms, since they have to support
32-bit and 64-bit system ABI's.




Re: [RFC] Ext3 online defrag

2006-10-23 Thread Jan Kara
  Hello,

I've written a simple patch implementing ext3 ioctl for file
  relocation. Basically you call ioctl on a file, give it list of blocks
  and it relocates the file into given blocks (provided they are still
  free). The idea is to use it as a kernel part of ext3 online
  defragmenter (or generally disk access optimizer). Now I don't have the
  user space part that finds larger runs of free blocks and so on so that
  it can really be used as a defragmenter. I just send this as a kind of
  proof-of-concept to hear some comments. Attached is also a simple
  program that demonstrates the use of the ioctl.
 
 As a suggestion, I would pass the inode number and inode generation
 number into the ext3_file_move_data array:
 
 struct ext3_file_move_data {
   int extents;
   struct ext3_reloc_extent __user *ext_array;
 };
 
 This will be much more efficient for the userspace relocator, since it
 won't need to translate from an inode number to a pathname, and then
 try to open the file before relocating it.
  Hmm, I was also thinking about it. Probably you're right. It just
seemed elegant to call ioctl on a file and *plop* it's relocated ;).

 I'd also use an explicit 64-bit block numbers type so that we don't
 have to worry about the ABI changing when we support 64-bit block
 numbers.
  Right, will fix.

 The other problem I see with this patch is that there will be cache
 coherency problems between the buffer cache and the page cache.  I
 think you will want to pull the data blocks of the file into the page
 cache, and then write them out from the page cache, and only *then*
 update the indirect blocks and commit the transaction.
  Hmm, I thought I got this right. We build a new tree, copy all data to
it (no writes happen so trees remain consistent), we switch block
pointers from inode. So from now on, any get_block() will correctly
return new block number and block will be read from disk (hmm, probably
I'm missing sync after writing out all the data). Now we call
invalidate_inode_pages2() so all buffers mapped to old blocks are freed
from memory. So there should not be problems with this... OTOH doing the
data copy via page-cache (of the temporarily set-up inode) should not be
a big problem either and we can avoid one sync which should be a win.

 So what needs to happen is the following:
 
 1) Validate the inode and generation number.  Make sure the new
 (destination) blocks passed in are valid and not in use.  Allocate
 them to prevent anyone else from using those blocks.
 
 2) Pull the blocks into the page cache (if they are not already
 there), and then write them out to the new location on disk.  If any of
 the I/O's fail, abort.
 
 3) Update the indirect blocks or extent tree to point at the newly
 allocated and copied data blocks.
 
 In the current patch, it looks like you add the inode being relocated
 to the orphan list, and then update the direct/indirect blocks first
  No, I create temporary inode that holds allocated blocks and that is
added to the orphan list. Hence if we crash in the middle of relocation,
all blocks are correctly freed.

 --- and if you fail the inode gets truncated.  That's bad since we
 don't want to lose any data if we crash in the middle of the defrag
 operation
 
 Great to see that you're working on this problem!  I'd love to see
 this functionality into ext4.
  Thanks for comments.

 P.S.  There is also the question of whether we'll be able to get this
 interface past the ioctl() police, but the atomicity requirements of
 such an interface are a poster child for why we really, REALLY, can't
 do this via a sysfs interface.  We might be forced to create a new
 filesystem, or create a pseudo inode which we open via a magic
 pathname, though.  That in my opinion is uglier than an ioctl, but the
 ioctl police really don't like the problem of needing to maintain
 32/64 bit translation functions, and this interface would surely cause
 problems for the x86_64 and PPC platforms, since they have to support
 32-bit and 64-bit system ABI's.
  Umm, yes. I'm open to suggestions with respect to which interface to
choose. ioctl() was just the easiest to code ;).

Bye
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Andreas Dilger
On Oct 23, 2006  18:31 +0400, Alex Tomas wrote:
 isn't that a kernel responsibility to find/allocate target blocks?
 wouldn't it be better to specify desirable target group and minimal
 acceptable chunk of free blocks?

In some cases this is useful (e.g. if file has small fragments after
being written in small pieces or in a fragmented free space).  In other
cases the user tool HAS to be able to specify the new mapping in order
to make progress.

Consider if there are two very large fragmented files and user-space
defrag tool wants to make contiguous free space.  If kernel is left to do
allocation it will always consume the largest chunk of free space first,
even if it is not yet optimal (e.g. large 1MB aligned extent).

I would make this interface optionally allow the target extent to be
specified, but if target block == 0 then the kernel is free to do its
own allocation.
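
A minimal sketch of that convention on a toy block map. The function
name, the first-fit policy, and the choice to treat block 0 as never
allocatable are all assumptions made for the example; a real allocator
would apply its own placement policy in the target == 0 case.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCKS 64

/* True when blocks [start, start + len) are all free. */
static bool range_free(const bool *in_use, uint64_t start, uint64_t len)
{
    for (uint64_t b = start; b < start + len; b++)
        if (in_use[b])
            return false;
    return true;
}

/* Claim `len` blocks. A non-zero `target` must be honoured exactly or
 * the call fails; target == 0 lets the allocator choose (first fit
 * here). Block 0 is never handed out, which is what frees the value 0
 * to mean "no preference". On success *out is the first claimed block. */
int toy_alloc_extent(bool *in_use, uint64_t target, uint64_t len,
                     uint64_t *out)
{
    uint64_t start;

    assert(in_use != NULL && out != NULL);

    if (len == 0 || len >= TOY_BLOCKS)
        return -EINVAL;

    if (target != 0) {
        if (target >= TOY_BLOCKS || len > TOY_BLOCKS - target)
            return -EINVAL;
        if (!range_free(in_use, target, len))
            return -EBUSY;
        start = target;
    } else {
        for (start = 1; start + len <= TOY_BLOCKS; start++)
            if (range_free(in_use, start, len))
                break;
        if (start + len > TOY_BLOCKS)
            return -ENOSPC;
    }

    for (uint64_t b = start; b < start + len; b++)
        in_use[b] = true;
    *out = start;
    return 0;
}
```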

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [RFC] Ext3 online defrag

2006-10-23 Thread Jan Kara
  Theodore Tso (TT) writes:
 
  TT On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
   Hello,
   
   I've written a simple patch implementing ext3 ioctl for file
   relocation. Basically you call ioctl on a file, give it list of blocks
   and it relocates the file into given blocks (provided they are still
   free). The idea is to use it as a kernel part of ext3 online
   defragmenter (or generally disk access optimizer). 
 
 isn't that a kernel responsibility to find/allocate target blocks?
 wouldn't it be better to specify desirable target group and minimal
 acceptable chunk of free blocks?
  The kernel definitely allocates those blocks (because that's the only
reasonably race-free way). The problem of finding those blocks is a bit
harder - it may be quite a complicated decision where to put the file
(also given that sometimes you may need to shift some file away to
make space for another one). Also, what I'm aiming for is that a
userspace defragmenter could be fed some access patterns and would
optimize the layout of several files to speed up startup (i.e. blocks of
those several files would be interleaved so that their sequence is close
to the one seen during start-up).
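
One possible shape for that planning step, as a sketch only: walk a
recorded access trace and hand out consecutive physical blocks in
first-access order, so blocks from different files end up interleaved
the way they were read. The trace format and all names here are
assumptions, and the quadratic duplicate check is kept for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of a recorded access trace (hypothetical format). */
struct access {
    uint32_t file;    /* file identifier */
    uint32_t block;   /* logical block within that file */
};

/* Planned destination for one logical block. */
struct placement {
    uint32_t file;
    uint32_t block;
    uint64_t physical;
};

/* Assign consecutive physical blocks, starting at `first_free`, in the
 * order in which each (file, block) pair first appears in the trace.
 * Returns the number of placements written to `out` (at most out_cap). */
size_t plan_layout(const struct access *trace, size_t n,
                   uint64_t first_free,
                   struct placement *out, size_t out_cap)
{
    size_t count = 0;

    assert(trace != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        int seen = 0;

        /* Skip blocks that already have a planned location. */
        for (size_t j = 0; j < count; j++) {
            if (out[j].file == trace[i].file &&
                out[j].block == trace[i].block) {
                seen = 1;
                break;
            }
        }
        if (seen)
            continue;
        if (count == out_cap)
            break;

        out[count].file = trace[i].file;
        out[count].block = trace[i].block;
        out[count].physical = first_free + count;
        count++;
    }
    return count;
}
```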

Honza


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Jan Kara
 On Oct 23, 2006  18:31 +0400, Alex Tomas wrote:
 I would make this interface optionally allow the target extent to be
 specified, but if target block == 0 then the kernel is free to do its
 own allocation.
  That's a good idea! I'll change the handling so that if block==0 we
just allocate blocks of given extent as we wish...

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Eric Sandeen

Alex Tomas wrote:

Theodore Tso (TT) writes:


 TT On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
  Hello,
  
  I've written a simple patch implementing ext3 ioctl for file

  relocation. Basically you call ioctl on a file, give it list of blocks
  and it relocates the file into given blocks (provided they are still
  free). The idea is to use it as a kernel part of ext3 online
  defragmenter (or generally disk access optimizer). 


isn't that a kernel responsibility to find/allocate target blocks?
wouldn't it be better to specify desirable target group and minimal
acceptable chunk of free blocks?


XFS does this by allocating new blocks for a temporary file (initiated from 
userspace, implemented in kernelspace of course), then just checks to see if the 
result is better than what we had before; if so, then swap the storage space and 
throw away the temporary file (which now has the original, more-fragmented file 
blocks).


see xfs_swapext() in xfs_dfrag.c for the extent swapping part of this.
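
The compare-then-swap decision can be sketched on a toy block map. This
is not the xfs_swapext() code; struct toy_file and both functions are
made up for illustration, "better" is assumed to mean "fewer extents",
and copying the data into the temporary file is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAXBLK 16

/* Toy file: map[i] is the physical block of logical block i. */
struct toy_file {
    size_t nblocks;
    uint64_t map[TOY_MAXBLK];
};

/* Count extents, i.e. runs of physically contiguous blocks. */
size_t toy_count_extents(const struct toy_file *f)
{
    size_t extents = 0;

    assert(f != NULL);
    for (size_t i = 0; i < f->nblocks; i++)
        if (i == 0 || f->map[i] != f->map[i - 1] + 1)
            extents++;
    return extents;
}

/* Swap the block maps only when the temporary file is less fragmented.
 * After a swap, `tmp` holds the old blocks and can simply be deleted. */
bool toy_swap_if_better(struct toy_file *file, struct toy_file *tmp)
{
    assert(file != NULL && tmp != NULL);

    if (tmp->nblocks != file->nblocks)
        return false;
    if (toy_count_extents(tmp) >= toy_count_extents(file))
        return false;

    struct toy_file old = *file;
    *file = *tmp;
    *tmp = old;
    return true;
}
```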

You probably want to avoid the page cache in all of this too, doing O_DIRECT IO 
if possible, I don't think there's any reason to churn the page cache while the 
defragmenter runs over a filesystem?


-Eric


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Andreas Dilger
On Oct 23, 2006  10:16 -0400, Theodore Tso wrote:
 As a suggestion, I would pass the inode number and inode generation
 number into the ext3_file_move_data array:
 
 struct ext3_file_move_data {
   int extents;
   struct ext3_reloc_extent __user *ext_array;
 };
 
 This will be much more efficient for the userspace relocator, since it
 won't need to translate from an inode number to a pathname, and then
 try to open the file before relocating it.
 
 I'd also use an explicit 64-bit block numbers type so that we don't
 have to worry about the ABI changing when we support 64-bit block
 numbers.

I would in fact go so far as to allow only a single extent to be specified
per call.  This is to avoid the passing of any pointers as part of the
interface (hello ioctl police :-), and also makes the kernel code simpler.
I don't think the syscall/ioctl overhead is significant compared to the
journal and IO overhead.

Also, I would specify both the source extent and the target extent in
the inode.  This first allows defragmenting only part of the file
instead of (it appears) requiring the whole file to be relocated.  That
would be a killer if the file being defragmented is larger than free
space.  It secondly provides a level of insurance that what the kernel
is relocating matches what userspace thinks it is doing.  It would
protect against problems if the kernel ever does block relocation
itself (e.g. merge fragments into a single extent on (re)write, or for
snapshot/COW).
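
A sketch of what a single-extent request carrying both source and target
might look like, with the source used as a sanity check before anything
is moved. The struct, its field names, and check_source_extent() are
hypothetical; returning -ESTALE for a mismatch is an assumption as well.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-extent request: fixed-width fields, no pointers. */
struct move_extent_sketch {
    uint64_t logical;  /* first logical block of the range to move */
    uint64_t source;   /* physical block userspace believes it is at */
    uint64_t target;   /* physical block to move it to */
    uint64_t len;      /* number of blocks */
};

/* Check that the file's current mapping still matches what userspace
 * saw. map[i] is the physical block of logical block i. Returns 0 when
 * the request is consistent, -EINVAL for a range outside the file, and
 * -ESTALE when the blocks have moved since userspace looked. */
int check_source_extent(const uint64_t *map, uint64_t nblocks,
                        const struct move_extent_sketch *req)
{
    assert(map != NULL && req != NULL);

    if (req->len == 0 || req->logical >= nblocks ||
        req->len > nblocks - req->logical)
        return -EINVAL;

    for (uint64_t i = 0; i < req->len; i++)
        if (map[req->logical + i] != req->source + i)
            return -ESTALE;
    return 0;
}
```

Since only the range [logical, logical + len) is named, free space for
that one extent is enough to make progress on a very large file.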

 The other problem I see with this patch is that there will be cache
 coherency problems between the buffer cache and the page cache.  I
 think you will want to pull the data blocks of the file into the page
 cache, and then write them out from the page cache, and only *then*
 update the indirect blocks and commit the transaction.

Alternately (maybe even better) is to treat it as O_DIRECT and ensure
the page cache is flushed.  This also avoids polluting the whole page
cache while running a defragmenter on the filesystem.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [RFC] Ext3 online defrag

2006-10-23 Thread Jan Kara
 On Oct 23, 2006  10:16 -0400, Theodore Tso wrote:
  As a suggestion, I would pass the inode number and inode generation
  number into the ext3_file_move_data array:
  
  struct ext3_file_move_data {
  int extents;
  struct ext3_reloc_extent __user *ext_array;
  };
  
  This will be much more efficient for the userspace relocator, since it
  won't need to translate from an inode number to a pathname, and then
  try to open the file before relocating it.
  
  I'd also use an explicit 64-bit block numbers type so that we don't
  have to worry about the ABI changing when we support 64-bit block
  numbers.
 
 I would in fact go so far as to allow only a single extent to be specified
 per call.  This is to avoid the passing of any pointers as part of the
 interface (hello ioctl police :-), and also makes the kernel code simpler.
 I don't think the syscall/ioctl overhead is significant compared to the
 journal and IO overhead.
  I'm not sure it makes the kernel code simpler - if we have to replace
just a part of the file, we have to rewrite references to blocks at
several places inside the indirect tree. If we relocate the whole file, we just
replace block pointers from inode. Furthermore it makes it kind of
harder to tell where indirect blocks would go - and it would be
impossible for the defragmenter to force some unusual placement of
indirect blocks... Currently blocks (including indirect ones) are just
being allocated in the DFS order from the given list.
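
Since indirect blocks are drawn from the same list as data blocks, the
list has to be longer than the file. A small sketch of the count,
assuming the usual ext2/ext3 layout of 12 direct pointers followed by
single, double, and triple indirect trees; the function names are made
up and `ppb` (pointers per indirect block) is 1024 for 4 KiB blocks.

```c
#include <assert.h>
#include <stdint.h>

#define NDIR 12   /* direct block pointers in an ext2/ext3 inode */

/* Blocks consumed by one indirect tree of the given depth that maps n
 * data blocks (n > 0): the tree root, all lower indirect blocks, and
 * the data blocks themselves. */
static uint64_t tree_blocks(uint64_t n, unsigned depth, uint64_t ppb)
{
    uint64_t child_cap = 1;
    uint64_t total = 1;             /* this indirect block */

    if (depth == 1)
        return 1 + n;

    for (unsigned i = 1; i < depth; i++)
        child_cap *= ppb;           /* data blocks one child can map */

    while (n > 0) {
        uint64_t take = n < child_cap ? n : child_cap;
        total += tree_blocks(take, depth - 1, ppb);
        n -= take;
    }
    return total;
}

/* Number of list entries needed to relocate a file with n data blocks.
 * Data beyond the triple indirect tree is not representable and is
 * ignored here. */
uint64_t blocks_needed(uint64_t n, uint64_t ppb)
{
    uint64_t total = n < NDIR ? n : NDIR;
    uint64_t cap = 1;

    assert(ppb > 0);
    n -= total;

    for (unsigned depth = 1; depth <= 3 && n > 0; depth++) {
        cap *= ppb;                 /* capacity of a tree of this depth */
        uint64_t take = n < cap ? n : cap;
        total += tree_blocks(take, depth, ppb);
        n -= take;
    }
    return total;
}
```

For example, with ppb = 1024 a 13-block file needs 14 entries: 12 direct
blocks, one indirect block, and the one data block it maps.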

 Also, I would specify both the source extent and the target extent in
 the inode.  This first allows defragmenting only part of the file
 instead of (it appears) requiring the whole file to be relocated.  That
 would be a killer if the file being defragmented is larger than free
 space.  It secondly provides a level of insurance that what the kernel
 is relocating matches what userspace thinks it is doing.  It would
 protect against problems if the kernel ever does block relocation
 itself (e.g. merge fragments into a single extent on (re)write, or for
 snapshot/COW).
  I agree that this is the positive side of your approach :).

  The other problem I see with this patch is that there will be cache
  coherency problems between the buffer cache and the page cache.  I
  think you will want to pull the data blocks of the file into the page
  cache, and then write them out from the page cache, and only *then*
  update the indirect blocks and commit the transaction.
 
 Alternately (maybe even better) is to treat it as O_DIRECT and ensure
 the page cache is flushed.  This also avoids polluting the whole page
 cache while running a defragmenter on the filesystem.
  That's what I'm trying to do (but maybe my code is buggy).

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Jeff Garzik
On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote:
 isn't that a kernel responsibility to find/allocate target blocks?
 wouldn't it be better to specify desirable target group and minimal
 acceptable chunk of free blocks?

The kernel doesn't have enough knowledge to know whether or not the
defragger prefers one blkdev location over another.

When you are trying to consolidate blocks, you must specify the
destination as well as source blocks.

Certainly, to prevent corruption and other nastiness, you must fail if
the destination isn't available...

(ext2meta did all this...)

Jeff

