Re: Ext2/3 block remapping tool
On Apr 30, 2007 08:09 -0400, Theodore Tso wrote: On Fri, Apr 27, 2007 at 12:09:42PM -0600, Andreas Dilger wrote: I'd prefer that such functionality be integrated with Takashi's online defrag tool, since it needs virtually the same functionality. For that matter, this is also very similar to the block-mapped - extents tool from Aneesh. It doesn't make sense to have so many separate tools for users, especially if they start interfering with each other (i.e. defrag undoes the remapping done by your tool). While we're at it, someone want to start thinking about on-line shrinking of ext4 filesystems? Again, the same block remapping interfaces for defrag and file access optimizations should also be useful for shrinking filesystems (even if some of the files that need to be relocated are being actively used). If not, that probably means we got the interface wrong. Except one other issue with online shrinking is that we need to move inodes on occasion and this poses a bunch of other problems over just remapping the data blocks. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e2fsprogs-1.39-tyt3
On Apr 30, 2007 11:22 -0400, Theodore Ts'o wrote: One concern I still have is the fact that we're exposing a lot of interfaces in libext2fs.so which are very specifically tied to the current 48-bit physical/32-bit logical on-disk extent data structure. If/when we add support for the 64/64-bit extent data structure, and some kind of compressed/bit-packed extent format, those interfaces will have to be extended. Some other comments: - it appears you have all of extents.c copied into both block.c and in extent.c - it appears you are missing a couple of extent sanity checks for headers and such that were added more recently, but those might also have been removed by you - do the patches as they stand actually fix the duplicate block test cases yet? It seems all of the BLOCK_CHANGED handling was removed from block_iterate_extents(). - it would be easier to read if the patches were generated with diff -up (can be set in .quiltrc via 'export QUILT_DIFF_OPTS=-up'. I also like export QUILT_NO_DIFF_TIMESTAMPS=1 to avoid tons of gratuitous changes to patches when they are refreshed. Another problem is that while external extent blocks are getting byte swapped, there is no byte swapping going on for the extents stored in the inode. I haven't tried it on a big endian system, such as Power machine, but I'm pretty sure it's going to blow up spectacularly. Ah, interesting. We have 64-bit machines for testing, but no big-endian ones. The patches can be found at: ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/e2fsprogs-1.39-tyt3 What else I think is important to get into this patch series is a change to mkfs.ext4 so that the default inode size is increased to 256. I'm not sure how/if this can be done via mke2fs.conf. I also wouldn't object to changing the default inode_ratio on large filesystems to be a lot fewer than 1/8192 bytes. At a minimum, allowing an inode_ratio 1/4MB should be allowed. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext4 devel interlock meeting minutes (April 23, 2007)
On Mon, 2007-04-30 at 16:36 +0530, Aneesh Kumar wrote: On 4/24/07, Avantika Mathur [EMAIL PROTECTED] wrote: Ext4 Developer Interlock Call: 04/23/2007 Meeting Minutes TESTING - extents testing - Discussed methods for testing extents on highly fragmented filesystems. - Jose will look into possible tests, including perhaps using the 'aged' option in FFSB - Ted suggested creating a mountoption that creates a bad block allocator which it jumps to a new block group every 8 blocks. This would force a very large number of extents, and may be a good test for extents. What i am doing for creating a large number of extents is dd if=/dev/zero of=myfile count=10 seek=20 while [ 1 ]; do dd if=/dev/zero of=myfile count=10 seek=$seek; seek=`expr $seek + 20`; done I had written a simple tool bitmap_manip with which you can actually manipulate the number of free chunks and their sizes in a filesystem. It uses libext2fs to set the bits in block bitmaps thereby leaving the desired free extents. I had written it to test the allocators performance. It can be used as: ./bitmap_manip /dev/sda9 1MA 4 16K 1 12K 3 8K 4 4K 6 This will leave only 1 16K chunk, 3 12K chunks, free in the filesystem. 1MA 4 will get us 4 1Mb free ALIGNED chunks. It isn't very beautiful code since it was only used for testing but maybe it can help. Thanks, Kalpak. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html /* Manipulate block bitmap directly for mballoc testing */ /* USAGE: * ./bitmap_manip /dev/volmballoc/test 16K 1 12K 3 8K 4 4K 6 * This will leave 1 16K chunk, 3 12K chunks, in the filesystem specified. * Ideally give the inputs in ascending order. * 1MA 4 will get us 4 1Mb ALIGNED chunks. */ #include stdio.h #include ext2fs/ext2fs.h #include ext2fs/ext2_types.h #include fcntl.h #include stdlib.h #define ONE_MB (1024 * 1024) #define ONE_KB 1024 #define SETTING 0 #define FREEING 1 #define NO_ALIGN 0 #define ALIGN 1 struct chunk_arg { int chunk_size; int num_chunks; int align; }; int main(int argc, char **argv) { ext2_filsys fs; ext2fs_block_bitmap *map = NULL; int bg_num = 0, retval, arg_num, multiply, chunk_num; int i, start_blk, set_bit, test_bit, j; struct chunk_arg chunk[50]; int free_blocks_req = 0, free_blocks_avail, num_of_chunks_req = 0, group; char str[10]; float orig_avail_req, avail_req; int set_till_now, free_till_now, num_blks_to_set, num_blks_to_free, phase; int current, align_flag = 0, align = 0, curr = 0; if (argc 2) { printf(Please give name of a filesystem. Exiting...\n); return -1; } /* Even from user's perspective */ if(argc 0x01) { printf(This utility cannot have even number of arguments.\n); return -1; } if ((retval = ext2fs_open(argv[1], EXT2_FLAG_RW, 0, 0, unix_io_manager, fs))) { com_err(ext2fs open:, retval, while opening %s\n, argv[1]); return retval; } srand(1234567); chunk_num = 0; for (arg_num = 2; arg_num argc; arg_num += 2, chunk_num++) { strcpy(str, argv[arg_num]); /* Check if we have to align */ if (toupper(str[strlen(str) - 1 ]) == 'A') { chunk[chunk_num].align = ALIGN; str[strlen(str) - 1] = '\0'; align = 1; } else chunk[chunk_num].align = NO_ALIGN; if (toupper(str[strlen(str) - 1]) == 'K') multiply = ONE_KB; else if(toupper(str[strlen(str) - 1]) == 'M') multiply = ONE_MB; str[strlen(str) - 1] = '\0'; chunk[chunk_num].chunk_size = ((strtod(str, NULL)) * multiply)/ (fs-blocksize); chunk[chunk_num].num_chunks = strtod(argv[arg_num + 1], NULL); free_blocks_req += chunk[chunk_num].chunk_size * chunk[chunk_num].num_chunks; num_of_chunks_req += chunk[chunk_num].num_chunks; } ext2fs_read_block_bitmap(fs); map = fs-block_map; start_blk = fs-super-s_first_data_block; free_blocks_avail = fs-super-s_free_blocks_count; orig_avail_req = free_blocks_avail / free_blocks_req; current = 0; i = start_blk; num_blks_to_set = (orig_avail_req / 4) * chunk[current].chunk_size; num_blks_to_free = chunk[current].chunk_size; phase = SETTING; do { test_bit = i; if (!ext2fs_fast_test_block_bitmap(*map, test_bit)) { if (phase == SETTING) { if (chunk[current].align == ALIGN chunk[current].num_chunks 0) { if (align_flag == 0) { num_blks_to_set = (i / chunk[current].chunk_size + 1) * chunk[current].chunk_size - i; align_flag = 1; } else if (i % chunk[current].chunk_size == 0) { num_blks_to_set = 0; phase = FREEING; } } set_bit = i; ext2fs_mark_block_bitmap(*map, set_bit); group = (set_bit - fs-super-s_first_data_block) / fs-super-s_blocks_per_group; fs-group_desc[group].bg_free_blocks_count--; fs-super-s_free_blocks_count--; num_blks_to_set--; if (num_blks_to_set == 0) { phase =
Re: Ext2/3 block remapping tool
On Tue, May 01, 2007 at 12:01:42AM -0600, Andreas Dilger wrote: Except one other issue with online shrinking is that we need to move inodes on occasion and this poses a bunch of other problems over just remapping the data blocks. Well, I did say necessary, and not sufficient. But yes, moving inodes, especially if the inode is currently open gets interesting. I don't think there are that many user space applications that would notice or care if the st_ino of an open file changed out from under them, but there are obviously userspace applications, such as tar, that would most definitely care. - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On 1 May 2007, at 15:20, David Chinner wrote: On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote: On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: This is actually for future use. Any flags that are added into this range must be understood by both sides or it should be considered an error. Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported. If it turns out that 8 bits is too small a range for INCOMPAT flags, then we can make 0x0100 an incompat flag that means e.g. 0x00ff are also incompat flags also. Ah, ok. So it's not really a set of compatibility flags, it's more a compulsory set. Under those terms, i don't really see why this is necessary - either the filesystem will understand the flags or it will return EINVAL or ignore them... I'm assuming that all flags that will be in the original FIEMAP proposal will be understood by the implementations. Most filesystems can safely ignore FLAG_HSM_READ, for example, since they don't support HSM, and for that matter FLAG_SYNC is probably moot for most filesystems also because they do block allocation at preprw time. Exactly my point - so why do we really need to encode a compulsory set of Because flags have meaning, independent of whether or not the filesystem understands them. And if the filesystem chooses to ignore critically important flags (instead of returning EINVAL), bad things may happen. So, either the filesystem will understand the flag or iff the unknown flag is in the incompat set, it will return EINVAL or else the unknown flag will be safely ignored. My point was that there is a difference between specification and implementation - if the specification says something is compulsory, then they must be implemented in the filesystem. This is easy enough to ensure by code review - we don't need additional interface complexity for this You are wrong about this because you are missing the point that you have no code to review. The users that will use those flags are going to be applications that run in user space. Chances are you will never see their code. Heck, they might not even be open source applications... And all applications will run against a multitude of kernels. So version X of the application will run on kernel 2.4.*, 2.6.*, a.b.*, etc... For future expandability of the interface I think it is important to have both compulsory and non-compulsory flags. For example there is no reason why FIEMAP_HSM_READ needs to be compulsory. Most filesystems do not support HSM so can safely ignore it. And applications that want to read/write the data locations that are obtained with the FIEMAP call will likely always supply FIEMAP_HSM_READ because they want to ensure the file is brought in if it is off line so they definitely want file systems that do not support this flag to ignore it. And vice versa, an application might specify some weird and funky yet to be developed feature that it expects the FS to perform and if the FS cannot do it (either because it does not support it or because it failed to perform the operation) the application expects the FS to return an error and not to ignore the flag. An example could be the asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS ignores it it will return the extent map for the file data instead of the XATTR_FORK! Not what the application wanted at all. Ouch! So this is definitely a compulsory flag if I ever saw one. So as you see you must support both voluntary and compulsory flags... Also consider what I said above about different kernels. A new feature is implemented in kernel 2.8.13 say that was not there before and an application is updated to use that feature. There will be lots of instances where that application will still be run on older kernels where this feature does not exist. Depending on the feature it may be quite sensible to simply ignore in the kernel that the application set an unknown flag whilst for a different feature it may be the opposite. Best regards, Anton -- Anton Altaparmakov aia21 at cam.ac.uk (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e2fsprogs-1.39-tyt3
On May 01, 2007 11:05 -0400, Theodore Tso wrote: Speaking of SCM's, how wedded is Clustrefs to mercurial? I've been considering migrating e2fsprogs development to git, but one of the reasons I haven't is because I was concerned that might be incovenient for you folks. Not at all, we just generate a 1.39 to 1.39-WIP patch, and then use quilt + CVS to manage the resulting patches. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext2/3 block remapping tool
On May 01, 2007 11:28 -0400, Theodore Tso wrote: On Tue, May 01, 2007 at 12:01:42AM -0600, Andreas Dilger wrote: Except one other issue with online shrinking is that we need to move inodes on occasion and this poses a bunch of other problems over just remapping the data blocks. Well, I did say necessary, and not sufficient. But yes, moving inodes, especially if the inode is currently open gets interesting. I don't think there are that many user space applications that would notice or care if the st_ino of an open file changed out from under them, but there are obviously userspace applications, such as tar, that would most definitely care. I think rm -r does a LOT of this kind of operation, like: stat(.); stat(foo); chdir(foo); stat(.); unlink(*); chdir(..); stat(.) I think find does the same to avoid security problems with malicious path manipulation. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext2/3 block remapping tool
On Tue, May 01, 2007 at 12:52:49PM -0600, Andreas Dilger wrote: I think rm -r does a LOT of this kind of operation, like: stat(.); stat(foo); chdir(foo); stat(.); unlink(*); chdir(..); stat(.) I think find does the same to avoid security problems with malicious path manipulation. Yep, so if you're doing an rm -rf (or any other recursive descent) while we're doing an on-line shrink, it's going to fail. I suppose we could have an in-core inode mapping table that would continue to remap inode numbers until the next reboot. I'm not sure we would want to keep the inode remapping indefinitely, although if we don't it could also end up screwing up NFS as well. Not sure I care, though. :-) - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Tue, May 01, 2007 at 03:30:40PM -0700, Andreas Dilger wrote: On May 01, 2007 14:22 +1000, David Chinner wrote: On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: Hmm, I'd thought offline would migrate to EXTENT_UNKNOWN, but I didn't I disagree - why would you want to indicate the state is unknown when we know very well that it is offline? If you don't like UNKNOWN, what about UNMAPPED? I just want a catch-all flag that indicates this extent contains data but there is nothing sensible to be returned for the extent mapping. Yes, I like that much more. Good suggestion. ;) Effectively, when your extent is offline in the HSM, it is inaccessable, and you have to bring it back from tape so it becomes accessible again. i.e. some action is necessary on behalf of the user to make it accessible. So I think that OFFLINE is a good name for this state because it really is inaccessible. What you are calling OFFLINE I would prefer to call UNMAPPED, since that can be used by applications as a catch-all for no mapping. There can be further flags that give refinements to UNMAPPED that some applications might care about them (e.g. HSM_RESIDENT), but many users/apps will not if they just want the number of fragments in a given file. Agreed - UNMAPPED does make a lot more sense in this case. Can you propose reasonable flag names for these (I can't think of anything very good) and a clear explanation of what they mean. I suspect it will only be XFS that uses them initially. In mke2fs and ext4+mballoc there is the concept of stripe unit and stripe width, but as yet they are not communicated between the two very well. I'd be much happier if this info could be queried in a standard way from the block layer instead of the user having to specify it and the filesystem having to track it. My preference is definitely for a separate ioctl to grab the filesystem geometry so this stuff can be calculated in userspace. i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't bother trying to define names until we decide which appraoch we take to implement this. Hmm, previously you wrote This information could be easily passed up in the flags fields if the filesystem has geometry information. So, I _think_ what you are saying is that you want 4 flags to convey this start/end alignment information, but the exact semantics of what a stripe unit and a stripe width is filesystem specific? Right. I definitely do NOT want to get into any issues of querying the block device geometry here. I was just making a passing comment that ext4+mballoc can already do RAID-specific allocation alignment, but it depends on the admin to specify this information and it would be nice if there was some easy way to get this from userspace/kernel interfaces. Having an API that can request tell me the number of blocks from this offset until the next physical disk boundary or similar would be useful to any allocator, and the block layer already needs to know this when submitting IO. The block layer knows this once you get inside the volume manager. I think the issue is that there is no common export interface for this information. In XFS, mkfs.xfs does the work of getting this information to see in the filesystem superblock. Here's the code for getting sunit/swidth from the underlying block device: http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ Not much in common there ;) It looks like this might be just what e2fsprogs needs also. More than likely. It does make sense to specify zero for the fm_extent_count array and a new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the extent data itself, for the non-verbose mode of filefrag, and for pre-allocating a buffer large enough to hold the file if that is important. Rather than rely on implicit behaviour of pass in extent count of zero and a don't try to return any extents to return the number of extents on the file, why not just explicitly define this as a valid input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my clever-clever for return no extents and return number of extents is wasted :-/. Too clever for an API, I think. ;) My point is mainly that if you are going to use an API for a specific function (e.g. query the number of extents) I think that the API should have an obvious method for executing that specific function. Using a command of get no extents to provide the query of how many extents in this file is kind of obscure. When you read the code it doesn't make a lot of sense, as opposed to seeing a clear statement of intent from the code itself. i.e. FIEMAP_FLAG_GET_NUMEXTENTS is self-documenting in both the API and the code that uses it... - does XFS return an extent for the metadata parts of the file (e.g. btree)?