Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 11:33:16PM -0400, Theodore Tso wrote:
> On Thu, Oct 26, 2006 at 11:40:20AM +1000, David Chinner wrote:
> > We don't need to expose anything filesystem specific to userspace
> > to implement this. Online data movement (i.e. the defrag
> > mechanism) becomes something like:
> >
> >     do {
> >         get_free_list(dst_fd, location, len, list)
> >         /* select extent to use */
> >         alloc_from_list(dst_fd, list[X], off, len)
> >     } while (ENOALLOC)
> >     move_data(src_fd, dst_fd, off, len);
> >
> > And this would work on any filesystem type that implemented these
> > interfaces. Hence tools like a startup file optimiser would only
> > need to be written once, rather than needing a different tool for
> > every different filesystem type.
>
> Yeah, but that's simply not enough.

Not enough for what?

> A good defragger needs to know

Oh, we're back to defrag again. :/

> about a filesystem's allocation policies, and move files so they
> are optimally located, given the filesystem layout. For example, in
> ext2/3/4 we will want to move blocks so they are in the same block
> group as the inode. That's filesystem specific information; other
> filesystems will require different policies.

Of which a good chunk of policies will be common. The above policy
has been around for many, many years and is implemented in many, many
filesystems (even XFS).

    get_free_list(dst_fd, location, len, list)

location == allocation policy. e.g.: give me a list of free blocks:

    - anywhere (default filesystem policy applies)
    - near block number X
    - at block X
    - in block/allocation group Y
    - of the largest contiguous regions in (one of the above)
    - at least N blocks in length
    - near inode src_fd
    - in storage tier 3

Then you select one of the regions that was returned and attempt to
allocate that.

You can put whatever filesystem specific stuff you need around this
to arrive at the decision of where to put the file, but you've got to
allocate the new blocks, move the data to them, and swap them over.
Every defragger needs to do this, regardless of the filesystem type.

So why not provide a framework for it, especially as the framework is
useful for far more than just the data movement part of a defrag
application?

Remember, I'm not just talking about defrag - I'm talking about an
interface that is actually useful to apps that might care about how
data is laid out on disk but whose writers don't know anything about
how filesystem X or Y or Z is implemented. Putting the burden of
learning about filesystem internals on application developers is not
the correct solution.

> Unfortunately, if you want to do a good job, a defragger *has* to
> know about some very low-level filesystem specific information, if
> it wants to do a good job.

Back to defrag. Again. Bigger picture, guys, bigger picture.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
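As a rough illustration of the get_free_list() / alloc_from_list() / move_data() loop proposed in the message above, here is a minimal userspace sketch in C. Everything in it is hypothetical: the `toy_` names, the in-memory bitmap standing in for a filesystem, and the "at or after block `near`" placement policy are assumptions made for the example, not an existing kernel or VFS interface. A real implementation would also re-query the free list after a failed allocation instead of walking a stale one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NBLOCKS 64

/* Hypothetical in-memory stand-in for a filesystem's block bitmap. */
struct toy_fs {
    unsigned char used[TOY_NBLOCKS];  /* 1 = block is allocated */
    unsigned char data[TOY_NBLOCKS];  /* one byte of payload per block */
};

struct toy_extent {
    size_t start;
    size_t len;
};

/* List free extents of at least `len` blocks at or after block `near`.
 * Returns how many extents were written to `list` (at most `max`). */
size_t toy_get_free_list(const struct toy_fs *fs, size_t near, size_t len,
                         struct toy_extent *list, size_t max)
{
    size_t n = 0, i = near;

    while (i < TOY_NBLOCKS && n < max) {
        size_t run = 0;

        while (i + run < TOY_NBLOCKS && !fs->used[i + run])
            run++;
        if (len > 0 && run >= len) {
            list[n].start = i;
            list[n].len = run;
            n++;
        }
        i += run ? run : 1;
    }
    return n;
}

/* Try to claim the first `len` blocks of `ext`. Returns 0 on success,
 * -1 if the extent is too small or a block was taken in the meantime
 * (the ENOALLOC case in the pseudocode). */
int toy_alloc_from_list(struct toy_fs *fs, const struct toy_extent *ext,
                        size_t len)
{
    size_t i;

    if (len > ext->len || ext->start + len > TOY_NBLOCKS)
        return -1;
    for (i = 0; i < len; i++)
        if (fs->used[ext->start + i])
            return -1;
    memset(&fs->used[ext->start], 1, len);
    return 0;
}

/* Copy the payload to the new blocks, then release the old ones. */
void toy_move_data(struct toy_fs *fs, size_t src, size_t dst, size_t len)
{
    assert(src + len <= TOY_NBLOCKS && dst + len <= TOY_NBLOCKS);
    memmove(&fs->data[dst], &fs->data[src], len);
    memset(&fs->used[src], 0, len);
}

/* Relocate `len` blocks starting at `src`. Returns the new start block,
 * or -1 if no candidate extent could be allocated. */
long toy_relocate(struct toy_fs *fs, size_t src, size_t len, size_t near)
{
    struct toy_extent list[8];
    size_t n, x;

    n = toy_get_free_list(fs, near, len, list, 8);
    for (x = 0; x < n; x++) {
        if (toy_alloc_from_list(fs, &list[x], len) == 0) {
            toy_move_data(fs, src, list[x].start, len);
            return (long)list[x].start;
        }
    }
    return -1;
}
```

The caller only chooses a policy (`near`) and a length. Which candidate extent is acceptable, and what happens when an allocation fails, stays behind the three calls, which is the split between generic interface and filesystem specific implementation being argued for.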
Re: [RFC] Ext3 online defrag
On Oct 25, 2006 16:54 +0200, Jan Kara wrote:
> I've just not yet decided how to handle indirect blocks in case of
> relocation in the middle of the file. Should they be relocated or
> shouldn't they? Probably they should be relocated at least in case
> they are fully contained in the relocated interval, or maybe better
> said, when all the blocks they reference are also in the interval
> (this also handles the case of EOF). But still, if you would like to
> relocate the file by parts this is not quite what you want (you
> won't be able to relocate indirect blocks on the boundary of
> intervals) :(.

I suspect that the natural choice for metadata blocks is to keep the
block which has the most metadata unchanged. For example, if you are
doing a full-file relocation then you would naturally keep all of the
new {dt}indirect blocks. If you are relocating a small chunk of the
file you would keep the old {dt}indirect blocks and just copy a few
block pointers over.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
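The "keep the block which has the most metadata unchanged" rule from the message above can be written down as a small decision function. The following C sketch is only one reading of that rule, under stated assumptions: the `toy_` names are hypothetical, an indirect block is modelled as covering a contiguous run of logical file blocks, and a tie is resolved in favour of keeping the old block.

```c
#include <assert.h>

enum toy_choice { TOY_KEEP_OLD, TOY_KEEP_NEW };

/* Number of logical blocks in [a_first, a_end) that also lie in
 * [b_first, b_end). */
unsigned long toy_overlap(unsigned long a_first, unsigned long a_end,
                          unsigned long b_first, unsigned long b_end)
{
    unsigned long lo = a_first > b_first ? a_first : b_first;
    unsigned long hi = a_end < b_end ? a_end : b_end;

    return hi > lo ? hi - lo : 0;
}

/*
 * Decide which copy of an indirect block to keep after relocating the
 * logical range [rel_first, rel_end) of a file.
 *
 * ind_first      first logical block mapped by this indirect block
 * ptrs_per_block number of block pointers an indirect block holds
 * file_blocks    file length in blocks (pointers past EOF are unused)
 */
enum toy_choice toy_pick_indirect(unsigned long ind_first,
                                  unsigned long ptrs_per_block,
                                  unsigned long file_blocks,
                                  unsigned long rel_first,
                                  unsigned long rel_end)
{
    unsigned long ind_end = ind_first + ptrs_per_block;
    unsigned long live, moved;

    if (ind_end > file_blocks)
        ind_end = file_blocks;          /* clip at EOF */
    assert(ind_first < ind_end);

    live = ind_end - ind_first;         /* pointers actually in use */
    moved = toy_overlap(ind_first, ind_end, rel_first, rel_end);

    /* Keep whichever copy needs fewer pointers patched into it. */
    return moved > live - moved ? TOY_KEEP_NEW : TOY_KEEP_OLD;
}
```

With 1024 pointers per block and an indirect block starting at logical block 12, a full-file relocation keeps the new block, while relocating ten blocks in the middle keeps the old one and copies ten pointers over. Clipping at EOF covers the case where every live pointer is inside the relocated interval even though the block is not full.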
Re: ext3: bogus i_mode errors with 2.6.18.1
On Oct 25, 2006 11:44 +0200, Andre Noll wrote:
> Are you saying that ext3_set_bit() should simply be called with
> ret_block as its first argument? If yes, that is what the revised
> patch below does.

You might need to call ext3_set_bit_atomic() (as claim_block() does),
not sure.

> @@ -1372,12 +1370,21 @@ allocated:
>       in_range(ret_block, le32_to_cpu(gdp->bg_inode_table),
>                EXT3_SB(sb)->s_itb_per_group) ||
>       in_range(ret_block + num - 1, le32_to_cpu(gdp->bg_inode_table),
> +              EXT3_SB(sb)->s_itb_per_group)) {
> +         ext3_error(sb, __FUNCTION__,
> +                    "Allocating block in system zone - blocks from "E3FSBLK", length %lu",
> +                    ret_block, num);
> +         /* Note: This will potentially use up one of the handle's
> +          * buffer credits. Normally we have way too many credits,
> +          * so that is OK. In _very_ rare cases it might not be OK.
> +          * We will trigger an assertion if we run out of credits,
> +          * and we will have to do a full fsck of the filesystem -
> +          * better than randomly corrupting filesystem metadata.
> +          */
> +         ext3_set_bit(ret_block, gdp_bh->b_data);
> +         goto repeat;
> + }

The other issue is that you need to potentially set "num" bits in the
bitmap here, if those all overlap metadata. In fact, it might just
make more sense at this stage to walk all of the bits in the bitmaps,
the inode table and the backup superblock and group descriptor to see
if they need fixing also.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
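To make the "set num bits" remark above concrete, here is a small userspace model in C. It is not the ext3 code: the `toy_` helpers are hypothetical stand-ins for in_range() and ext3_set_bit(), the bitmap is a plain byte array, and the sketch assumes that bitmap bits are numbered relative to the first block of the group. It only shows the loop over the whole allocated range instead of a single bit.

```c
#include <assert.h>
#include <limits.h>

/* Inclusive range test: is block b within [first, first + len - 1]? */
int toy_in_range(unsigned long b, unsigned long first, unsigned long len)
{
    return b >= first && b <= first + len - 1;
}

void toy_set_bit(unsigned long nr, unsigned char *map)
{
    map[nr / CHAR_BIT] |= (unsigned char)(1u << (nr % CHAR_BIT));
}

int toy_test_bit(unsigned long nr, const unsigned char *map)
{
    return (map[nr / CHAR_BIT] >> (nr % CHAR_BIT)) & 1;
}

/*
 * Mark as in use every block of the just-allocated range
 * [ret_block, ret_block + num) that falls inside the inode table
 * [itb_first, itb_first + itb_len). Returns how many blocks overlapped.
 *
 * Assumption of this sketch: bit numbers in `bitmap` are relative to
 * `group_first`, the first block of the block group.
 */
unsigned long toy_fix_system_zone(unsigned char *bitmap,
                                  unsigned long group_first,
                                  unsigned long ret_block,
                                  unsigned long num,
                                  unsigned long itb_first,
                                  unsigned long itb_len)
{
    unsigned long b, fixed = 0;

    assert(ret_block >= group_first);
    for (b = ret_block; b < ret_block + num; b++) {
        if (toy_in_range(b, itb_first, itb_len)) {
            toy_set_bit(b - group_first, bitmap);
            fixed++;
        }
    }
    return fixed;
}
```

For a group starting at block 1000 with an inode table at blocks 1004 to 1007, an allocation of five blocks starting at 1006 overlaps two inode table blocks. Both bits get set before the allocation is retried, where a single-bit fix would leave block 1007 marked free.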
Re: [RFC] Ext3 online defrag
> On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
> > On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
> > > On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
> > > > > So how do you then get the generic interface to allocate
> > > > > blocks specified by userspace race free?
> > > >
> > > > As has been repeatedly stated, there is no generic. There MUST
> > > > be filesystem-specific knowledge during these operations.
> > >
> > > What information? All we need to know is where the free disk
> > > space is, and have a method to attempt to allocate from it.
> > > That's _easy_ to abstract into a common interface via the VFS.
> > >
> > > > Further, in the case being discussed in this thread, ext2meta
> > > > has already been proven a workable solution.
> > >
> > > Sure, but that's not a generic solution to a problem common to
> > > all filesystems.
> >
> > You clearly don't know what I'm talking about. ext2meta is an
> > example of a filesystem-specific metadata access method,
> > applicable to tasks such as online optimization.
>
> I know exactly what ext2meta is. I said it's not a generic solution
> and you say it's a filesystem specific solution. I think we're
> agreeing here. ;)
>
> We don't need to expose anything filesystem specific to userspace
> to implement this. Online data movement (i.e. the defrag mechanism)
> becomes something like:
>
>     do {
>         get_free_list(dst_fd, location, len, list)
>         /* select extent to use */

  Up to this point I can imagine we can be perfectly generic.

>         alloc_from_list(dst_fd, list[X], off, len)
>     } while (ENOALLOC)
>     move_data(src_fd, dst_fd, off, len);

  With these two it's not clear how well we can do with just a generic
interface. Every filesystem needs to have some additional metadata to
keep the list of data blocks. In case of ext2/ext3/reiserfs this is not
a negligible amount of space, and placement of this metadata is
important for performance.

  So either we focus only on data blocks and let the implementation of
alloc_from_list() allocate metadata wherever it wants (but then we get
suboptimal performance because there need not be space for indirect
blocks close before our provided extent), or we allocate metadata from
the provided list, but then we need some knowledge of the fs to know
how much we should expect to spend on metadata and where this metadata
should be placed. For example, if you know that the indirect block for
your interval is at block B, then you'd like to allocate somewhere
close after this point or to relocate that indirect block (and all the
data it references). But for that you need to know you have something
like indirect blocks = filesystem knowledge.

  So I think that to get this working, we also need some way to tell
the program that if it wants to allocate some data, it also needs to
count with this amount of metadata and some of it is already allocated
in given blocks...

> I see substantial benefit moving forward from having filesystem
> independent interfaces. Many features that filesystems implement
> are common, and as time goes on the common feature set of the
> different filesystems gets larger. So why shouldn't we be trying to
> make common operations generic so that every filesystem can benefit
> from the latest and greatest tool?

  So you prefer to handle only the data blocks part of the problem and
let the filesystem sort out metadata?

								Honza
--
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
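The point above about needing to know how much to expect to spend on metadata can be made concrete for the classic ext2/ext3 block map. The C sketch below counts the mapping blocks behind a file of a given length, assuming 12 direct pointers in the inode followed by single-, double- and triple-indirect trees with `p` pointers per block (1024 for 4 KiB blocks and 4-byte block numbers). The `toy_` names are hypothetical, and the sketch assumes a file with no holes that fits within the triple-indirect range.

```c
#include <assert.h>

#define TOY_NDIRECT 12ULL   /* direct block pointers held in the inode */

unsigned long long toy_div_up(unsigned long long a, unsigned long long b)
{
    return (a + b - 1) / b;
}

/*
 * Number of indirect (mapping) blocks needed for a file of `nblocks`
 * data blocks when each indirect block holds `p` pointers.
 */
unsigned long long toy_indirect_overhead(unsigned long long nblocks,
                                         unsigned long long p)
{
    unsigned long long meta = 0, rest, span;

    assert(p > 0);
    if (nblocks <= TOY_NDIRECT)
        return 0;
    rest = nblocks - TOY_NDIRECT;

    meta += 1;                          /* the single-indirect block */
    if (rest <= p)
        return meta;
    rest -= p;

    /* double-indirect root plus one leaf per p data blocks */
    span = rest < p * p ? rest : p * p;
    meta += 1 + toy_div_up(span, p);
    rest -= span;
    if (rest == 0)
        return meta;

    /* triple-indirect root, its second-level blocks, and its leaves */
    meta += 1 + toy_div_up(rest, p * p) + toy_div_up(rest, p);
    return meta;
}
```

A generic caller cannot derive these numbers without knowing the mapping scheme, which is the "filesystem knowledge" at issue. An interface that reported such an estimate per filesystem would be one way to keep the caller generic, though nothing in the thread specifies one.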
Re: [RFC] Ext3 online defrag
On Thu, Oct 26, 2006 at 04:36:48PM +1000, David Chinner wrote:
> Remember, I'm not just talking about defrag - I'm talking about an
> interface that is actually useful to apps that might care about how
> data is laid out on disk but whose writers don't know anything about
> how filesystem X or Y or Z is implemented. Putting the burden of
> learning about filesystem internals on application developers is
> not the correct solution.

If all you want is something for application developers, about all you
can do is to tell the filesystem, "create the file so that it will be
quickly accessed after accessing this file or this directory."

I really don't see the point of having the application specify block
numbers if you're also claiming the application isn't going to know
anything about the filesystem layout --- or even the RAID layout of
the filesystem. I don't think it's at **all** useful to be
half-pregnant on this score.

						- Ted
Re: [RFC] Ext3 online defrag
On Thu, 2006-10-26 at 09:37 -0400, Theodore Tso wrote:
> On Thu, Oct 26, 2006 at 04:36:48PM +1000, David Chinner wrote:
> > Remember, I'm not just talking about defrag - I'm talking about an
> > interface that is actually useful to apps that might care about
> > how data is laid out on disk but whose writers don't know anything
> > about how filesystem X or Y or Z is implemented. Putting the
> > burden of learning about filesystem internals on application
> > developers is not the correct solution.
>
> If all you want is something for application developers, about all
> you can do is to tell the filesystem, "create the file so that it
> will be quickly accessed after accessing this file or this
> directory."
>
> I really don't see the point of having the application specify block
> numbers if you're also claiming the application isn't going to know
> anything about the filesystem layout --- or even the RAID layout of
> the filesystem. I don't think it's at **all** useful to be
> half-pregnant on this score.

I think a utility such as a defragmenter should know about the
filesystem layout. I also think that it would be a good thing to have
a consistent interface so that every filesystem isn't implementing a
completely different one.
--
David Kleikamp
IBM Linux Technology Center
Re: [RFC] Ext3 online defrag
On Wed, 25 October 2006 14:41:18 -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 08:36:56PM +0200, Jan Kara wrote:
> > Yes, but there's a question of the interface to this operation.
> > How to specify which indirect block I mean? Obviously we could
> > introduce a separate call for remapping indirect blocks but I find
> > this solution kind of clumsy...
>
> Agreed... that gets nasty real quick.

Logfs has a similar problem and I introduced a "level". Without going
into all the gory details, data blocks reside on level 0, indirect
blocks on level 1, doubly indirect blocks on level 2, etc. With this,
the tuple of (ino, pos, level) can specify any block on the
filesystem, provided it is used for some inode.

Logfs needs this for Garbage Collection, which is a fairly similar
problem.

Jörn

--
Joern's library part 3:
http://inst.eecs.berkeley.edu/~cs152/fa05/handouts/clark-test.pdf
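The (ino, pos, level) addressing described above can be illustrated with a short C sketch. This is not Logfs code and makes no claim about how Logfs numbers positions: the `toy_` names are hypothetical, and the convention that a level-N block is identified by the position of the first data block beneath it, with `p` pointers per indirect block and no direct-pointer offset, is an assumption made purely for the example.

```c
#include <assert.h>

/* One block of one inode: data (level 0) or an indirect block. */
struct toy_blockref {
    unsigned long ino;        /* owning inode */
    unsigned long long pos;   /* first data block position covered */
    unsigned level;           /* 0 = data, 1 = indirect, 2 = doubly ... */
};

/* Number of data blocks covered by one block at `level`: p^level. */
unsigned long long toy_span(unsigned level, unsigned long long p)
{
    unsigned long long s = 1;

    while (level--)
        s *= p;
    return s;
}

/* The block one level up that holds the pointer to `ref`. */
struct toy_blockref toy_parent(struct toy_blockref ref, unsigned long long p)
{
    struct toy_blockref up;
    unsigned long long span;

    assert(p > 1);
    span = toy_span(ref.level + 1, p);
    up.ino = ref.ino;
    up.level = ref.level + 1;
    up.pos = ref.pos - ref.pos % span;  /* align down to the parent's range */
    return up;
}
```

Under these assumptions a single "remap this block" call could take such a tuple for data and indirect blocks alike, which avoids the separate call for remapping indirect blocks that was described as clumsy.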
Re: [RFC] Ext3 online defrag
On Thu, Oct 26, 2006 at 01:37:22PM +0200, Jan Kara wrote:
> On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
> > We don't need to expose anything filesystem specific to userspace
> > to implement this. Online data movement (i.e. the defrag
> > mechanism) becomes something like:
> >
> >     do {
> >         get_free_list(dst_fd, location, len, list)
> >         /* select extent to use */
>
> Up to this point I can imagine we can be perfectly generic.
>
> >         alloc_from_list(dst_fd, list[X], off, len)
> >     } while (ENOALLOC)
> >     move_data(src_fd, dst_fd, off, len);
>
> With these two it's not clear how well we can do with just a
> generic interface. Every filesystem needs to have some additional
> metadata to keep the list of data blocks. In case of
> ext2/ext3/reiserfs this is not a negligible amount of space, and
> placement of this metadata is important for performance.

Yes, the same can be said for XFS. However, XFS's extent btree
implementation uses readahead to hide a lot of the latency involved
with reading the extent map, and it only needs to read it once per
inode lifecycle.

> So either we focus only on data blocks and let the implementation
> of alloc_from_list() allocate metadata wherever it wants (but then
> we get suboptimal performance because there need not be space for
> indirect blocks close before our provided extent)

I think the first step would be to focus on data blocks using
something like the above. There are many steps to full filesystem
defragmentation, but data fragmentation is typically the most common
symptom of fragmentation that we see.

> or we allocate metadata from the provided list, but then we need
> some knowledge of the fs to know how much we should expect to spend
> on metadata and where this metadata should be placed.

That's the second step, I think. For example, we could count the
metadata blocks used in a metadata structure (say a block list),
allocate a new chunk like above, and then execute a move_metadata()
type of operation, which the filesystem does internally in a
transactionally safe manner.

Once again, generic interface, filesystem specific implementations.

> For example, if you know that the indirect block for your interval
> is at block B, then you'd like to allocate somewhere close after
> this point or to relocate that indirect block (and all the data it
> references). But for that you need to know you have something like
> indirect blocks = filesystem knowledge.

*nod*

This is far less of a problem with extent based filesystems -
coalescing all the fragments into a single extent removes the need
for indirect blocks and you get the extent list for free when you
read the inode.

When we do have a fragmented file, XFS uses readahead to speed btree
searching and reading, so it hides a lot of the latency overhead that
fragmented metadata can cause.

Either way, these lists can still be optimised by allocating a set of
contiguous blocks, copying the metadata into them and updating the
pointers to the new blocks. It can be done separately from the data
moving and really should be done after the data has been defragmented.

> So I think that to get this working, we also need some way to tell
> the program that if it wants to allocate some data, it also needs
> to count with this amount of metadata and some of it is already
> allocated in given blocks...

If you want to do it all in one step. However, it's not quite that
simple for something like XFS. An allocation may require a btree
split (or three, actually) and the number of blocks required is
dependent on the height of the btrees. So we don't know how many
blocks we'll need ahead of time, and we'd have to reach deep into the
allocator and abuse it badly to do anything like this. It's not
something I want to even contemplate doing. :/

Also, we don't want to be mingling global metadata with inode
specific metadata, so we don't want to put most of the new metadata
blocks near the extent we are putting the data into.

That means I'd prefer to be able to optimise metadata objects
separately. e.g. rewrite a btree into a single contiguous extent with
the btree blocks laid out so the readahead patterns result in
sequential I/O. The kernel would need to do this in XFS because we'd
have to lock the entire btree a block at a time, copy it and then
issue a swap btree transaction. Most other journalling filesystems
will have similar requirements, I think, for doing this online.

That's a very similar concept to the move_data() interface...

> > I see substantial benefit moving forward from having filesystem
> > independent interfaces. Many features that filesystems implement
> > are common, and as time goes on the common feature set of the
> > different filesystems gets larger. So why shouldn't we be trying
> > to make common operations generic so that every filesystem can
> > benefit from the latest and greatest tool?
>
> So you prefer to handle only the data blocks part of the problem
> and let the filesystem sort out metadata?

The filesystem