Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Hi Andreas, Thanks for posting this. I believe that an interface such as FIEMAP would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail) My comments below are generally geared towards understanding the ioctl interface. On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote: 2 Functional specification The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP ioctl block device ioctl used for mapping an individual logical block address in a file to a physical block address in the block device. The FIEMAP ioctl will return the logical to physical mapping for the extent that contains the specified logical byte address. struct fiemap_extent { __u64 fe_offset;/* offset in bytes for the start of the extent */ I'm a little bit confused by fe_offset. Is it a physical offset, or a logical offset? The reason I ask is that your description above says FIEMAP ioctl will return the logical to physical mapping for the extent that contains the specified logical byte address. Which seems to imply physical, but your math to get to the next logical start in a very fragmented file, implies that fe_offset is a logical offset: fm_start = fm_extents[fm_extent_count - 1].fe_offset + fm_extents[fm_extent_count - 1].fe_length + 1; The logic for the filefrag would be similar to above. The size of the extent array will be extrapolated from the filesize and multiple ioctls of increasing extent count may be called for very large files. filefrag can easily call the FIEMAP ioctls repeatedly using the end of the last extent as the start offset for the next ioctl: fm_start = fm_extents[fm_extent_count - 1].fe_offset + fm_extents[fm_extent_count - 1].fe_length + 1; We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We will also need to re-initialise the fiemap flags, fm_extent_count, fm_end. I think you meant 'fm_length' instead of 'fm_end' there. The FIEMAP_FLAG_* values are specified below. 
If FIEMAP_FLAG_NO_EXTENTS is given then the fm_extents array is not filled, and only fm_extent_count is returned with the total number of extents in the file. Any new flags that introduce and/or require an incompatible behaviour in an application or in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT (e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS would fall into that range if they were not part of the original specification). This is currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT is not large enough then it is possible to use the last INCOMPAT flag 0x0100 to indicate that more of the flag range contains incompatible flags.

    #define FIEMAP_FLAG_SYNC        0x0001 /* sync file data before map */
    #define FIEMAP_FLAG_HSM_READ    0x0002 /* get data from HSM before map */
    #define FIEMAP_FLAG_NUM_EXTENTS 0x0004 /* return only number of extents */
    #define FIEMAP_FLAG_INCOMPAT    0xff00 /* error for unknown flags in here */

The returned data from the FIEMAP ioctl is an array of fiemap_extent elements, one per extent in the file. The first extent will contain the byte specified by fm_start and the last extent will contain the byte specified by fm_start + fm_len, unless there are more than the passed-in fm_extent_count extents in the file, or this is beyond the EOF in which case the last extent will be marked with FIEMAP_EXTENT_LAST. Each extent returned has a set of flags associated with it that provide additional information about the extent. Not all filesystems will support all flags. FIEMAP_FLAG_NUM_EXTENTS will return only the number of extents used by the file. It will be used by default for filefrag since the specific extent information is not required in many cases.

    #define FIEMAP_EXTENT_HOLE 0x0001 /* has no data or space allocation */

Btw, I really like that holes are explicitly marked.
#define FIEMAP_EXTENT_UNWRITTEN 0x0002 /* space allocated, but no data */ #define FIEMAP_EXTENT_UNMAPPED 0x0004 /* has data but no space allocated */ #define FIEMAP_EXTENT_ERROR 0x0008 /* map error, errno in fe_offset. */ #define FIEMAP_EXTENT_NO_DIRECT 0x0010 /* cannot access data directly */ #define FIEMAP_EXTENT_LAST 0x0020 /* last extent in the file */ #define FIEMAP_EXTENT_DELALLOC 0x0040 /* has data but not yet written */ #define FIEMAP_EXTENT_SECONDARY 0x0080 /* data in secondary storage */ #define FIEMAP_EXTENT_EOF 0x0100 /* fm_start + fm_len beyond EOF */ Is EOF here considering beyond i_size or beyond allocation? #define FIEMAP_EXTENT_UNKNOWN 0x0200 /* in use but location is unknown */ FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe encrypted, compressed, etc.) Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data? Btrfs, Ocfs2, and Gfs2 pack small amounts of
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
By request on #linuxfs, here is the FIEMAP spec that we used to implement the FIEMAP support for ext4. There was an ext4 patch posted on August 29 to linux-ext4 entitled [PATCH] FIEMAP ioctl. I've asked Kalpak to post an updated version of that patch along with the changes to the filefrag tool to use FIEMAP. FIEMAP_1.0.txt == File Mapping Interface 18 June 2007 Andreas Dilger, Kalpak Shah Introduction This document covers the user interface and internal implementation of an efficient fragmentation reporting tool. This will include addition of a FIEMAP ioctl to fetch extents and changes to filefrag to use this ioctl. The main objective of this tool is to efficiently and easily allow inspection of the disk layout of one or more files without requiring user access to the underlying storage device(s). 1 Requirements The tool should be efficient in its use of resources, even for large files. The FIBMAP ioctl is not suitable for use on large files, as this can result in millions or even billions of ioctls to get the mapping information for a single file. It should be possible to get the information about an arbitrary-sized extent in a single call, and the kernel component and user tool should efficiently use this information. The user interface should be simple, and the output should be easily understood - by default the filename(s), a count of extents (for each file), and the optimal number of extents for a file with the given striping parameters. The user interface will be filefrag [options] {filename ...} and will allow retrieving the fragmentation information for one or more files specified on the command-line. The output will be of the form: /path/to/file1: extents=2 optimal=1 /path/to/file2: extents=10 optimal=4 .. 2 Functional specification The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP ioctl block device ioctl used for mapping an individual logical block address in a file to a physical block address in the block device. 
The FIEMAP ioctl will return the logical to physical mapping for the extent that contains the specified logical byte address.

    struct fiemap_extent {
        __u64 fe_offset; /* offset in bytes for the start of the extent */
        __u64 fe_length; /* length in bytes for the extent */
        __u32 fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
        __u32 fe_lun;    /* logical device number for extent (starting at 0) */
    };

    struct fiemap {
        __u64 fm_start;        /* logical byte offset (in/out) */
        __u64 fm_length;       /* logical length of map (in/out) */
        __u32 fm_flags;        /* FIEMAP_FLAG_* flags (in/out) */
        __u32 fm_extent_count; /* extents in fm_extents (in/out) */
        __u64 fm_unused;
        struct fiemap_extent fm_extents[0];
    };

In the ioctl request, the fiemap struct is initialized with the desired mapping information.

    fiemap.fm_start = {desired start byte offset, 0 if whole file};
    fiemap.fm_length = {length of mapping in bytes, ~0ULL if whole file};
    fiemap.fm_extent_count = {number of fiemap_extents in fm_extents array};
    fiemap.fm_flags = {flags from FIEMAP_FLAG_* array, if needed};

    ioctl(fd, FIEMAP, fiemap);
    {verify fiemap flags are understood }

    for (i = 0; i < fiemap.fm_extent_count; i++) {
        { process extent fiemap.fm_extents[i]};
    }

The logic for the filefrag would be similar to above. The size of the extent array will be extrapolated from the filesize and multiple ioctls of increasing extent count may be called for very large files. filefrag can easily call the FIEMAP ioctls repeatedly using the end of the last extent as the start offset for the next ioctl:

    fm_start = fm_extents[fm_extent_count - 1].fe_offset +
               fm_extents[fm_extent_count - 1].fe_length + 1;

We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We will also need to re-initialise the fiemap flags, fm_extent_count, fm_end. The FIEMAP_FLAG_* values are specified below. If FIEMAP_FLAG_NO_EXTENTS is given then the fm_extents array is not filled, and only fm_extent_count is returned with the total number of extents in the file.
Any new flags that introduce and/or require an incompatible behaviour in an application or in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT (e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS would fall into that range if they were not part of the original specification). This is currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT is not large enough then it is possible to use the last INCOMPAT flag 0x0100 to indicate that more of the flag range contains incompatible flags.

    #define FIEMAP_FLAG_SYNC        0x0001 /* sync file data before map */
    #define FIEMAP_FLAG_HSM_READ    0x0002 /* get data from HSM before map */
    #define FIEMAP_FLAG_NUM_EXTENTS 0x0004 /* return only number of
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Oct 29, 2007 16:13 -0600, Andreas Dilger wrote: On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote: I'm a little bit confused by fe_offset. Is it a physical offset, or a logical offset? The reason I ask is that your description above says FIEMAP ioctl will return the logical to physical mapping for the extent that contains the specified logical byte address. Which seems to imply physical, but your math to get to the next logical start in a very fragmented file, implies that fe_offset is a logical offset: fm_start = fm_extents[fm_extent_count - 1].fe_offset + fm_extents[fm_extent_count - 1].fe_length + 1; Note the distinction between fe_offset (which is a physical offset for a single extent) and fm_offset (which is a logical offset for that file). Actually, that is completely bunk. What it should say is something like: filefrag can easily call the FIEMAP ioctls repeatedly using the returned fm_start and fm_length as the start offset for the next ioctl: fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1; Cheers, Andreas -- Andreas Dilger Sr. Software Engineer, Lustre Group Sun Microsystems of Canada, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
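The corrected continuation step in the message above can be sketched in C. This is only an illustration: the header struct copies the fields from the RFC text (the trailing extent array is left out), and fiemap_prepare_next() is a hypothetical helper name, not part of the proposal. The formula keeps the "+ 1" exactly as posted; whether that is right depends on how the kernel reports fm_length on return, which this thread does not settle.

```c
#include <stdint.h>

/* Request header as laid out in the RFC text (fm_extents[] omitted). */
struct fiemap_hdr {
    uint64_t fm_start;        /* logical byte offset (in/out) */
    uint64_t fm_length;       /* logical length of map (in/out) */
    uint32_t fm_flags;        /* FIEMAP_FLAG_* flags (in/out) */
    uint32_t fm_extent_count; /* extents in fm_extents (in/out) */
    uint64_t fm_unused;
};

/* Start offset for the next call, using the formula as posted above. */
uint64_t fiemap_next_start(const struct fiemap_hdr *fm)
{
    return fm->fm_start + fm->fm_length + 1;
}

/*
 * Hypothetical helper: advance fm_start and re-initialise the in/out
 * fields (flags, extent count, length) before the next ioctl, since the
 * previous call may have overwritten them.
 */
void fiemap_prepare_next(struct fiemap_hdr *fm, uint64_t want_length,
                         uint32_t flags, uint32_t extent_count)
{
    fm->fm_start = fiemap_next_start(fm);
    fm->fm_length = want_length;
    fm->fm_flags = flags;
    fm->fm_extent_count = extent_count;
}
```

A caller would invoke fiemap_prepare_next() after each ioctl and stop once an extent with FIEMAP_EXTENT_LAST has been seen.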
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Mon, Oct 29, 2007 at 04:29:07PM -0600, Andreas Dilger wrote: On Oct 29, 2007 16:13 -0600, Andreas Dilger wrote: On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote: I'm a little bit confused by fe_offset. Is it a physical offset, or a logical offset? The reason I ask is that your description above says FIEMAP ioctl will return the logical to physical mapping for the extent that contains the specified logical byte address. Which seems to imply physical, but your math to get to the next logical start in a very fragmented file, implies that fe_offset is a logical offset: fm_start = fm_extents[fm_extent_count - 1].fe_offset + fm_extents[fm_extent_count - 1].fe_length + 1; Note the distinction between fe_offset (which is a physical offset for a single extent) and fm_offset (which is a logical offset for that file). Actually, that is completely bunk. What it should say is something like: filefrag can easily call the FIEMAP ioctls repeatedly using the returned fm_start and fm_length as the start offset for the next ioctl: fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1; Yeah - that's where I was going with my question. This is much more clear now, thanks. --Mark -- Mark Fasheh Senior Software Developer, Oracle [EMAIL PROTECTED]
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote: By request on #linuxfs, here is the FIEMAP spec that we used to implement the FIEMAP support for ext4. There was an ext4 patch posted on August 29 to linux-ext4 entitled [PATCH] FIEMAP ioctl. Link: http://marc.info/?l=linux-ext4&m=118838241209683&w=2 That's a very ext4 specific ioctl interface. Can we get this made generic like the FIBMAP interface so we don't have to replicate all the copyin/copyout handling and interface definitions everywhere? i.e. a ->extent_map aops callout to the filesystem in generic code just like ->bmap? I've asked Kalpak to post an updated version of that patch along with the changes to the filefrag tool to use FIEMAP. Where can I find the test program that validates the implementation? Also, following the fallocate model, can we get the interface definition turned into a man page before anything is submitted upstream? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote: Thanks for posting this. I believe that an interface such as FIEMAP would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail) I tried to make it as Lustre-agnostic as possible... On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote: The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP ioctl block device ioctl used for mapping an individual logical block address in a file to a physical block address in the block device. The FIEMAP ioctl will return the logical to physical mapping for the extent that contains the specified logical byte address. struct fiemap_extent { __u64 fe_offset;/* offset in bytes for the start of the extent */ I'm a little bit confused by fe_offset. Is it a physical offset, or a logical offset? The reason I ask is that your description above says FIEMAP ioctl will return the logical to physical mapping for the extent that contains the specified logical byte address. Which seems to imply physical, but your math to get to the next logical start in a very fragmented file, implies that fe_offset is a logical offset: fm_start = fm_extents[fm_extent_count - 1].fe_offset + fm_extents[fm_extent_count - 1].fe_length + 1; Note the distinction between fe_offset (which is a physical offset for a single extent) and fm_offset (which is a logical offset for that file). We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We will also need to re-initialise the fiemap flags, fm_extent_count, fm_end. I think you meant 'fm_length' instead of 'fm_end' there. You're right, thanks. #define FIEMAP_EXTENT_LAST 0x0020 /* last extent in the file */ #define FIEMAP_EXTENT_EOF 0x0100 /* fm_start + fm_len beyond EOF*/ Is EOF here considering beyond i_size or beyond allocation? _EOF == beyond i_size. _LAST == last extent in the file. In most cases FIEMAP_EXTENT_EOF will be set at the same time as FIEMAP_EXTENT_LAST, but in case of e.g. 
prealloc beyond i_size the EOF flag may be set on one or more earlier extents. FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe encrypted, compressed, etc.) Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data? Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode blocks. Hmm, but part of the issue would be how to request the extra data, and what offset it would be given? One could, for example, use negative offsets to represent metadata or something, or add a FIEMAP_EXTENT_META or similar, I hadn't given that much thought. The other issue is that I'd like to get the basics of the API in place before it gets too complex. We can always add functionality with more FIEMAP_FLAG_* (whether in the INCOMPAT range or not, depending on what is being done). Cheers, Andreas -- Andreas Dilger Sr. Software Engineer, Lustre Group Sun Microsystems of Canada, Inc.
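One way to read the _EOF / _LAST distinction in the message above: if an extent carries FIEMAP_EXTENT_EOF but is not the FIEMAP_EXTENT_LAST one, more extents follow past i_size, i.e. there is preallocation beyond i_size. The C sketch below encodes only that reading. The flag values and struct fields are taken from the RFC text; has_allocation_past_isize() is a hypothetical helper name, and the interpretation is an assumption rather than something the spec states.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIEMAP_EXTENT_LAST 0x0020 /* last extent in the file */
#define FIEMAP_EXTENT_EOF  0x0100 /* fm_start + fm_len beyond EOF */

/* Extent record with the fields listed in the RFC text. */
struct fiemap_extent {
    uint64_t fe_offset; /* offset in bytes for the start of the extent */
    uint64_t fe_length; /* length in bytes for the extent */
    uint32_t fe_flags;  /* returned FIEMAP_EXTENT_* flags */
    uint32_t fe_lun;    /* logical device number */
};

/*
 * Returns true if any extent is flagged EOF without also being LAST.
 * Under the reading described above, that means further extents exist
 * beyond i_size (for example preallocated space).
 */
bool has_allocation_past_isize(const struct fiemap_extent *ext, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t flags = ext[i].fe_flags;

        if ((flags & FIEMAP_EXTENT_EOF) && !(flags & FIEMAP_EXTENT_LAST))
            return true;
    }
    return false;
}
```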
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Oct 29, 2007 17:11 -0700, Mark Fasheh wrote: On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote: Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode blocks. Hmm, but part of the issue would be how to request the extra data, and what offset it would be given? One could, for example, use negative offsets to represent metadata or something, or add a FIEMAP_EXTENT_META or similar, I hadn't given that much thought. Well, fe_offset and fe_length are already expressed in bytes, so we could just put the byte offset to where the inline data starts in there. fe_length is just used as the length allocated for inline-data. If fe_offset is required to be block aligned, then we could add a field to express an offset within the block where data would be found - say 'fe_data_start_offset'. In the non-inline case, we could guarantee that fe_data_start_offset is zero. That way software which doesn't want to care whether something is inline-data (for example, a backup program) or not could just blindly add it to fe_offset before looking at the data. Oh, I was confused as to what you are asking. Mapping in-inode data is just fine using the existing interface. The byte offset of the data is given, and the FIEMAP_EXTENT_NO_DIRECT flag is set to indicate that it isn't necessarily safe to do IO directly to that byte offset in the file (e.g. tail packed, compressed data, etc). I was thinking you were asking how to map metadata (e.g. indirect blocks). Cheers, Andreas -- Andreas Dilger Sr. Software Engineer, Lustre Group Sun Microsystems of Canada, Inc.
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote: On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote: Thanks for posting this. I believe that an interface such as FIEMAP would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail) I tried to make it as Lustre-agnostic as possible... IMHO, your description succeeded at that. I'm hoping that the final patch can have mostly generic code, like FIBMAP does today. #define FIEMAP_EXTENT_LAST 0x0020 /* last extent in the file */ #define FIEMAP_EXTENT_EOF 0x0100 /* fm_start + fm_len beyond EOF*/ Is EOF here considering beyond i_size or beyond allocation? _EOF == beyond i_size. _LAST == last extent in the file. In most cases FIEMAP_EXTENT_EOF will be set at the same time as FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the EOF flag may be set on one or more earlier extents. Oh, ok great - I was primarily looking for a way to say there's allocation past i_size and it looks like we have it. FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe encrypted, compressed, etc.) Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data? Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode blocks. Hmm, but part of the issue would be how to request the extra data, and what offset it would be given? One could, for example, use negative offsets to represent metadata or something, or add a FIEMAP_EXTENT_META or similar, I hadn't given that much thought. Well, fe_offset and fe_length are already expressed in bytes, so we could just put the byte offset to where the inline data starts in there. fe_length is just used as the length allocated for inline-data. If fe_offset is required to be block aligned, then we could add a field to express an offset within the block where data would be found - say 'fe_data_start_offset'. In the non-inline case, we could guarantee that fe_data_start_offset is zero. 
That way software which doesn't want to care whether something is inline-data (for example, a backup program) or not could just blindly add it to fe_offset before looking at the data. Regardless, I think we also want to explicitly flag this: #define FIEMAP_EXTENT_DATA_IN_INODE 0x0400 /* extent data is stored in inode block */ I'm going to pretend that I completely understand reiserfs tail-packing and say that my approaches above look like they could work for that case too. We'd want to add a separate flag for tail packed data though. The other issue is that I'd like to get the basics of the API in place before it gets too complex. We can always add functionality with more FIEMAP_FLAG_* (whether in the INCOMPAT range or not, depending on what is being done). Sure, but I think whatever goes upstream should be able to handle this case - there's file systems in use _today_ which put data in inode blocks and pack file tails. Thanks, --Mark -- Mark Fasheh Senior Software Developer, Oracle [EMAIL PROTECTED]
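The fe_data_start_offset idea above can be sketched in C as follows. This is purely illustrative: the extra field and the FIEMAP_EXTENT_DATA_IN_INODE value exist only as proposals in this message, and the struct name fiemap_extent_inline, the 32-bit width of the new field, and the helper extent_data_start() are assumptions made for the example.

```c
#include <stdint.h>

#define FIEMAP_EXTENT_DATA_IN_INODE 0x0400 /* proposed above, not in the RFC */

/* Hypothetical extent record: the RFC fields plus the proposed field. */
struct fiemap_extent_inline {
    uint64_t fe_offset;            /* block-aligned byte offset of the extent */
    uint64_t fe_length;            /* length in bytes */
    uint32_t fe_flags;             /* FIEMAP_EXTENT_* flags */
    uint32_t fe_lun;               /* logical device number */
    uint32_t fe_data_start_offset; /* proposed: offset of data within the
                                      block, guaranteed zero when not inline */
};

/*
 * Where the data starts. Because the proposed field is zero for ordinary
 * extents, a caller can add it unconditionally without testing the flag.
 */
uint64_t extent_data_start(const struct fiemap_extent_inline *fe)
{
    return fe->fe_offset + fe->fe_data_start_offset;
}
```

The point of the design is that a tool such as a backup program needs no special case: the same addition works for inline and non-inline extents.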
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On May 02, 2007 20:57 +1000, David Chinner wrote: On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: HSM_READ is definitely _NOT_ required because all it means is if the file is OFFLINE, bring it ONLINE and then return the extent map. You've got the definition of HSM_READ wrong. If the flag is *not* set, then we bring everything back online and return the full extent map. Specifying the flag indicates that we do *not* want the offline extents brought back online. i.e. it is a HSM or a datamover (e.g. backup program) that is querying the extents and we want to know *exactly* what the current state of the file is right now. So, if the HSM_READ flag is set, then the application is expecting the filesystem to be part of a HSM. Hence if it's not, it should return an error because somebody has done something wrong. In my original proposal I specifically pointed out that the FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the HSM_READ flag is set. That's why the flag is called HSM_READ instead of HSM_NO_READ. The reason is that it seems bad if the default behaviour for calling ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is only disabled by specifying a flag. It makes a lot more sense to just leave the data as it is and return the extent mapping by default (i.e. this is the principle of least surprise). It would probably be equally surprising and undesirable if the default behaviour was to force all data out to HSM. For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should even be a part of this interface? I have no problem with returning a flag that reports if the data is migrated to HSM and whether it is UNMAPPED.
Having FIEMAP force the retrieval of data from HSM strikes me as something that should be a part of a separate HSM interface, which also needs to be able to do things like push specific files or parts thereof out to HSM, set the aging policy, and return information like where does the HSM file live and how many copies are there. Do you know the reasoning behind including this into XFS_IOC_GETBMAPX? Looking at the bmap.c comments it appears it is simply because the API isn't able to return something like UNMAPPED|HSM_RESIDENT to indicate there is data in HSM but it has no blocks allocated in the filesystem. I don't think it makes the operation significantly more efficient than say ioctl(DMAPI_FORCE_READ); ioctl(FIEMAP) if an application actually needs the data to be present instead of just returning mapping info that includes UNMAPPED. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On 3 May 2007, at 08:49, Andreas Dilger wrote: On May 02, 2007 20:57 +1000, David Chinner wrote: On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: HSM_READ is definitely _NOT_ required because all it means is if the file is OFFLINE, bring it ONLINE and then return the extent map. You've got the definition of HSM_READ wrong. If the flag is *not* set, then we bring everything back online and return the full extent map. Specifying the flag indicates that we do *not* want the offline extents brought back online. i.e. it is a HSM or a datamover (e.g. backup program) that is querying the extents and we want to know *exactly* what the current state of the file is right now. So, if the HSM_READ flag is set, then the application is expecting the filesystem to be part of a HSM. Hence if it's not, it should return an error because somebody has done something wrong. In my original proposal I specifically pointed out that the FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the HSM_READ flag is set. That's why the flag is called HSM_READ instead of HSM_NO_READ. Cool. I did not misunderstand after all then. (-: The reason is that it seems bad if the default behaviour for calling ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is only disabled by specifying a flag. It makes a lot more sense to just leave the data as it is and return the extent mapping by default (i.e. this is the principle of least surprise). It would probably be equally surprising and undesirable if the default behaviour was to force all data out to HSM. For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should even be a part of this interface? I have no problem with returning a flag that reports if the data is migrated to HSM and whether it is UNMAPPED.
Having FIEMAP force the retrieval of data from HSM strikes me as something that should be a part of a separate HSM interface, which also needs to be able to do things like push specific files or parts thereof out to HSM, set the aging policy, and return information like where does the HSM file live and how many copies are there. That would seem sensible to me also. Just like David argued that causing the data to be in a fixed location should be a separate interface rather than part of FIEMAP so by analogy the same should apply to touching HSM. Best regards, Anton -- Anton Altaparmakov aia21 at cam.ac.uk (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On 2 May 2007, at 01:06, David Chinner wrote: On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote: On 1 May 2007, at 05:22, David Chinner wrote: On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: The FIBMAP ioctl is for privileged users only, and I wonder if FIEMAP should be the same, or at least disallow mapping files that the user can't access especially with FLAG_SYNC and/or FLAG_HSM_READ. I see little reason for restricting FI[BE]MAP to privileged users - anyone should be able to determine if files they have permission to access are fragmented. Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the machine. Perhaps for non-privileged users FIEMAP has to be read-only? As soon as any of the FLAG_* flags come into play you make it privileged. For example fancy any user being able to fill up your file system by calling FIEMAP with FLAG_HSM_READ on all files recursively? By that reasoning, users should not be allowed to recall any files without root privileges. HSMs don't work that way, though - any user is allowed to recall any files they have permission to access either by manual command or by trying to read the file data. If that runs the filesystem out of space, then the HSM either hasn't been configured properly or it's failed to manage the space correctly. Either way, that's not the fault of the user for recalling their own files. Hence allowing FIEMAP to be executed by the user does not open up any DOS conditions that don't already exist in normal HSM-managed filesystem. Sorry, it was not a great example. But the point still stands that there are/may be created flags that you do not want to allow everyone to use. I completely agree with Andreas that those can simply return -EPERM and the rest can be allowed through.
Best regards, Anton -- Anton Altaparmakov aia21 at cam.ac.uk (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/
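The "return -EPERM for restricted flags, allow the rest" approach from the message above might look like the C sketch below. Which flags should count as privileged is exactly what this thread is debating, so the FIEMAP_PRIVILEGED_FLAGS mask here is an arbitrary assumption for illustration, and fiemap_check_perm() is a hypothetical name. The two flag values come from the RFC text.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define FIEMAP_FLAG_SYNC     0x0001 /* sync file data before map */
#define FIEMAP_FLAG_HSM_READ 0x0002 /* get data from HSM before map */

/* Assumption: a policy choice for this example only, not from the spec. */
#define FIEMAP_PRIVILEGED_FLAGS (FIEMAP_FLAG_SYNC | FIEMAP_FLAG_HSM_READ)

/*
 * Returns 0 if the request may proceed, or -EPERM if an unprivileged
 * caller asked for a flag in the restricted set. Requests that use no
 * restricted flag are allowed for everyone.
 */
int fiemap_check_perm(uint32_t fm_flags, bool privileged)
{
    if (!privileged && (fm_flags & FIEMAP_PRIVILEGED_FLAGS))
        return -EPERM;
    return 0;
}
```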
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: On 1 May 2007, at 15:20, David Chinner wrote: So, either the filesystem will understand the flag or iff the unknown flag is in the incompat set, it will return EINVAL or else the unknown flag will be safely ignored. My point was that there is a difference between specification and implementation - if the specification says something is compulsory, then they must be implemented in the filesystem. This is easy enough to ensure by code review - we don't need additional interface complexity for this You are wrong about this because you are missing the point that you have no code to review. The users that will use those flags are going to be applications that run in user space. Chances are you will never see their code. Heck, they might not even be open source applications... Ummm - the specification defines what is compulsory for *filesystems* to implement, not what applications can use. We don't need to see what the applications do - what we care about is that all filesystems implement the compulsory part of the specification. That's the code we review, and that's what I was referring to. And all applications will run against a multitude of kernels. So version X of the application will run on kernel 2.4.*, 2.6.*, a.b.*, etc... For future expandability of the interface I think it is important to have both compulsory and non-compulsory flags. Ah, so that's what you want - a mutable interface. i.e. versioning. So how do compulsory flags help here? What happens if a voluntary flag now becomes compulsory? Or vice versa? How is the application supposed to deal with this dynamically? I suggested a version number for this right back at the start of this discussion and got told that we don't want versioned interfaces because we should make the effort to get it right the first time. I don't think this can be called getting it right. For example there is no reason why FIEMAP_HSM_READ needs to be compulsory.
Most filesystems do not support HSM so can safely ignore it. They might be able to safely ignore it, but in reality it should be saying I don't understand this. If the application *needs* to use a flag like this, then it should be told that the filesystem is not capable of doing what it was asked! OTOH if the application does not need to use the flag, then it shouldn't be using it and we shouldn't be silently ignoring incorrect usage of the provided API. What you are effectively saying about these voluntary flags is that their behaviour is _undefined_. That is, if you use these flags what you get on a successful call is undefined; it may or may not contain what you asked for but you can't tell if it really did what you want or returned the information you asked for. This is a really bad semantic to encode into an API. And vice versa, an application might specify some weird and funky yet to be developed feature that it expects the FS to perform and if the FS cannot do it (either because it does not support it or because it failed to perform the operation) the application expects the FS to return an error and not to ignore the flag. An example could be the asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS ignores it it will return the extent map for the file data instead of the XATTR_FORK! Not what the application wanted at all. Ouch! So this is definitely a compulsory flag if I ever saw one. Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But we don't need a flag defined in the user visible API to tell us that we need to return an error here. So as you see you must support both voluntary and compulsory flags... No, you've managed to convince me that they are not necessary and they are in fact a Bad Idea... ;) Also consider what I said above about different kernels. A new feature is implemented in kernel 2.8.13 say that was not there before and an application is updated to use that feature. 
There will be lots of instances where that application will still be run on older kernels where this feature does not exist.

This is *exactly* where silently ignoring flags really falls down. On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does something and it returns different structure contents for the same state. Now how does the application writer know which is correct or how to tell the difference? They have to guess or write detection code, which is exactly what we want to avoid.

I objected to the UNKNOWN flag because it wasn't explicit in its meaning - I'm doing the same thing here. An interface needs to be explicitly defined and should not have any undefined behaviour in it.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

- To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On 1 May 2007, at 15:20, David Chinner wrote: On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote: On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: This is actually for future use. Any flags that are added into this range must be understood by both sides or it should be considered an error. Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported. If it turns out that 8 bits is too small a range for INCOMPAT flags, then we can make 0x0100 an incompat flag that means e.g. 0x00ff are also incompat flags also. Ah, ok. So it's not really a set of compatibility flags, it's more a compulsory set. Under those terms, i don't really see why this is necessary - either the filesystem will understand the flags or it will return EINVAL or ignore them... I'm assuming that all flags that will be in the original FIEMAP proposal will be understood by the implementations. Most filesystems can safely ignore FLAG_HSM_READ, for example, since they don't support HSM, and for that matter FLAG_SYNC is probably moot for most filesystems also because they do block allocation at preprw time. Exactly my point - so why do we really need to encode a compulsory set of Because flags have meaning, independent of whether or not the filesystem understands them. And if the filesystem chooses to ignore critically important flags (instead of returning EINVAL), bad things may happen. So, either the filesystem will understand the flag or iff the unknown flag is in the incompat set, it will return EINVAL or else the unknown flag will be safely ignored. My point was that there is a difference between specification and implementation - if the specification says something is compulsory, then they must be implemented in the filesystem. 
This is easy enough to ensure by code review - we don't need additional interface complexity for this You are wrong about this because you are missing the point that you have no code to review. The users that will use those flags are going to be applications that run in user space. Chances are you will never see their code. Heck, they might not even be open source applications... And all applications will run against a multitude of kernels. So version X of the application will run on kernel 2.4.*, 2.6.*, a.b.*, etc... For future expandability of the interface I think it is important to have both compulsory and non-compulsory flags. For example there is no reason why FIEMAP_HSM_READ needs to be compulsory. Most filesystems do not support HSM so can safely ignore it. And applications that want to read/write the data locations that are obtained with the FIEMAP call will likely always supply FIEMAP_HSM_READ because they want to ensure the file is brought in if it is off line so they definitely want file systems that do not support this flag to ignore it. And vice versa, an application might specify some weird and funky yet to be developed feature that it expects the FS to perform and if the FS cannot do it (either because it does not support it or because it failed to perform the operation) the application expects the FS to return an error and not to ignore the flag. An example could be the asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS ignores it it will return the extent map for the file data instead of the XATTR_FORK! Not what the application wanted at all. Ouch! So this is definitely a compulsory flag if I ever saw one. So as you see you must support both voluntary and compulsory flags... Also consider what I said above about different kernels. A new feature is implemented in kernel 2.8.13 say that was not there before and an application is updated to use that feature. 
There will be lots of instances where that application will still be run on older kernels where this feature does not exist. Depending on the feature it may be quite sensible to simply ignore in the kernel that the application set an unknown flag whilst for a different feature it may be the opposite. Best regards, Anton -- Anton Altaparmakov aia21 at cam.ac.uk (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
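The two flag policies debated in this thread can be put side by side in a small sketch. It is only an illustration: the helper names and the `supported` parameter are hypothetical, and only the three flag values come from the aggregated proposal. The first helper follows the INCOMPAT rule described above (unknown flags inside the mask fail, unknown flags outside it are ignored); the second follows the stricter position that any flag the filesystem does not understand should be an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Request flags as listed in the aggregated proposal in this thread. */
#define FIEMAP_FLAG_SYNC     0x0001u /* flush delalloc data to disk */
#define FIEMAP_FLAG_HSM_READ 0x0002u /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT 0xff00u /* must understand these flags */

/* The optional flags must sit outside the compulsory range. */
static_assert(((FIEMAP_FLAG_SYNC | FIEMAP_FLAG_HSM_READ) &
	       FIEMAP_FLAG_INCOMPAT) == 0,
	      "optional flags overlap the INCOMPAT mask");

/*
 * Hypothetical helper: "supported" is the set of flags this filesystem
 * implements. Unknown flags inside the INCOMPAT mask fail the request;
 * unknown flags outside the mask are silently ignored.
 */
static int fiemap_check_flags_incompat(uint32_t flags, uint32_t supported)
{
	uint32_t unknown = flags & ~supported;

	if (unknown & FIEMAP_FLAG_INCOMPAT)
		return -EOPNOTSUPP;
	return 0;
}

/* Hypothetical stricter policy: every unknown flag is an error. */
static int fiemap_check_flags_strict(uint32_t flags, uint32_t supported)
{
	return (flags & ~supported) ? -EINVAL : 0;
}
```

With a filesystem that only implements FIEMAP_FLAG_SYNC, the first helper accepts a request carrying FIEMAP_FLAG_HSM_READ while the second rejects it, which is the behavioural difference the two sides are arguing about.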
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Tue, May 01, 2007 at 03:30:40PM -0700, Andreas Dilger wrote: On May 01, 2007 14:22 +1000, David Chinner wrote: On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: Hmm, I'd thought offline would migrate to EXTENT_UNKNOWN, but I didn't I disagree - why would you want to indicate the state is unknown when we know very well that it is offline? If you don't like UNKNOWN, what about UNMAPPED? I just want a catch-all flag that indicates this extent contains data but there is nothing sensible to be returned for the extent mapping. Yes, I like that much more. Good suggestion. ;) Effectively, when your extent is offline in the HSM, it is inaccessable, and you have to bring it back from tape so it becomes accessible again. i.e. some action is necessary on behalf of the user to make it accessible. So I think that OFFLINE is a good name for this state because it really is inaccessible. What you are calling OFFLINE I would prefer to call UNMAPPED, since that can be used by applications as a catch-all for no mapping. There can be further flags that give refinements to UNMAPPED that some applications might care about them (e.g. HSM_RESIDENT), but many users/apps will not if they just want the number of fragments in a given file. Agreed - UNMAPPED does make a lot more sense in this case. Can you propose reasonable flag names for these (I can't think of anything very good) and a clear explanation of what they mean. I suspect it will only be XFS that uses them initially. In mke2fs and ext4+mballoc there is the concept of stripe unit and stripe width, but as yet they are not communicated between the two very well. I'd be much happier if this info could be queried in a standard way from the block layer instead of the user having to specify it and the filesystem having to track it. My preference is definitely for a separate ioctl to grab the filesystem geometry so this stuff can be calculated in userspace. i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). 
I won't bother trying to define names until we decide which appraoch we take to implement this. Hmm, previously you wrote This information could be easily passed up in the flags fields if the filesystem has geometry information. So, I _think_ what you are saying is that you want 4 flags to convey this start/end alignment information, but the exact semantics of what a stripe unit and a stripe width is filesystem specific? Right. I definitely do NOT want to get into any issues of querying the block device geometry here. I was just making a passing comment that ext4+mballoc can already do RAID-specific allocation alignment, but it depends on the admin to specify this information and it would be nice if there was some easy way to get this from userspace/kernel interfaces. Having an API that can request tell me the number of blocks from this offset until the next physical disk boundary or similar would be useful to any allocator, and the block layer already needs to know this when submitting IO. The block layer knows this once you get inside the volume manager. I think the issue is that there is no common export interface for this information. In XFS, mkfs.xfs does the work of getting this information to see in the filesystem superblock. Here's the code for getting sunit/swidth from the underlying block device: http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ Not much in common there ;) It looks like this might be just what e2fsprogs needs also. More than likely. It does make sense to specify zero for the fm_extent_count array and a new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the extent data itself, for the non-verbose mode of filefrag, and for pre-allocating a buffer large enough to hold the file if that is important. Rather than rely on implicit behaviour of pass in extent count of zero and a don't try to return any extents to return the number of extents on the file, why not just explicitly define this as a valid input flag? 
i.e. FIEMAP_FLAG_GET_NUMEXTENTS That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my clever-clever for return no extents and return number of extents is wasted :-/. Too clever for an API, I think. ;) My point is mainly that if you are going to use an API for a specific function (e.g. query the number of extents) I think that the API should have an obvious method for executing that specific function. Using a command of get no extents to provide the query of how many extents in this file is kind of obscure. When you read the code it doesn't make a lot of sense, as opposed to seeing a clear statement of intent from the code itself. i.e. FIEMAP_FLAG_GET_NUMEXTENTS is self-documenting in both the API and the code that uses it... - does XFS return an extent for the metadata parts of the file (e.g. btree)?
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
--On 18 April 2007 6:21:39 PM -0600 Andreas Dilger [EMAIL PROTECTED] wrote:

Below is an aggregation of the comments in this thread:

struct fiemap_extent {
	__u64 fe_start; /* starting offset in bytes */
	__u64 fe_len; /* length in bytes */
	__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
	__u32 fe_lun; /* logical storage device number in array */
}

struct fiemap {
	__u64 fm_start; /* logical start offset of mapping (in/out) */
	__u64 fm_len; /* logical length of mapping (in/out) */
	__u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
	__u32 fm_extent_count; /* number of extents in fm_extents (in/out) */
	__u64 fm_unused;
	struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC 0x0001 /* flush delalloc data to disk */
#define FIEMAP_FLAG_HSM_READ 0x0002 /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT 0xff00 /* must understand these flags */

/* flags for the returned extents */
#define FIEMAP_EXTENT_HOLE 0x0001 /* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN 0x0002 /* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN 0x0004 /* in use, location unknown */
#define FIEMAP_EXTENT_ERROR 0x0008 /* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT 0x0010 /* no direct data access */

SUMMARY OF CHANGES
==================
- use fm_* fields directly in request instead of making it a fiemap_extent (though they are laid out identically)

I much prefer that - it makes it a lot clearer to me to have fiemap_extent just for fm_extents (no different meanings now). (Don't like the word offset in comment without physical or some such but whatever ;-) I also prefer the flags as separate fields too :)

--Tim
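The space claims made about this layout elsewhere in the thread (that the extra 8 bytes per extent may mean about 50% more ioctls, and that fm_unused costs nothing on power-of-two buffers) can be checked with a little arithmetic. The sketch below restates the aggregated structs with fixed-width types and adds a hypothetical helper, extents_per_buffer(), that is not part of the proposal; it assumes a common ABI where the extent record is 24 bytes and the header is 32 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-extent record from the aggregated proposal: 8 + 8 + 4 + 4 bytes. */
struct fiemap_extent {
	uint64_t fe_start;
	uint64_t fe_len;
	uint32_t fe_flags;
	uint32_t fe_lun;
};

/* Request header from the aggregated proposal, followed by the extents. */
struct fiemap {
	uint64_t fm_start;
	uint64_t fm_len;
	uint32_t fm_flags;
	uint32_t fm_extent_count;
	uint64_t fm_unused;
	struct fiemap_extent fm_extents[];
};

/* Hypothetical helper: how many rec_size records fit after the header. */
static size_t extents_per_buffer(size_t buf_size, size_t hdr_size,
				 size_t rec_size)
{
	assert(rec_size > 0 && buf_size >= hdr_size);
	return (buf_size - hdr_size) / rec_size;
}
```

For a 4096-byte buffer this gives 169 extents with the 32-byte header and still 169 if fm_unused were dropped (24-byte header), so the spare field is free at that size. The original 16-byte extents would fit 254 per buffer, roughly 1.5 times as many, which matches the "50% more ioctls" estimate.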
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Apr 16, 2007 18:01 +1000, Timothy Shimmin wrote: --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger [EMAIL PROTECTED] wrote: struct fiemap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ } struct fiemap { struct fiemap_extent fm_start; /* offset, length of desired mapping */ __u32 fm_extent_count; /* number of extents in array */ __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ __u64 unused; struct fiemap_extent fm_extents[0]; } # define FIEMAP_LEN_MASK 0xff # define FIEMAP_LEN_HOLE 0x01 # define FIEMAP_LEN_UNWRITTEN0x02 All offsets are in bytes to allow cases where filesystems are not going block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag). The -fm_extents[] array includes all of the holes in addition to allocated extents because this avoids the need to return both the logical and physical address for every extent and does not make processing any harder. Well, that's what stood out for me. I was wondering where the fe_block field had gone - the physical address. So is your fe_start; /* starting offset */ actually the disk location (not a logical file offset) _except_ in the header (fiemap) where it is the desired logical offset. Correct. The fm_extent in the request contains the logical start offset and length in bytes of the requested fiemap region. In the returned header it represents the logical start offset of the extent that contained the requested start offset, and the logical length of all the returned extents. I haven't decided whether the returned length should be until EOF, or have the virtual hole at the end of the file. I think EOF makes more sense. The fe_start + fe_len in the fm_extents represent the physical location on the block device for that extent. 
fm_extent[i].fe_start (per Anton) is undefined if FIEMAP_LEN_HOLE is set, and .fe_len is the length of the hole.

Okay, looking at your example use below that's what it looks like. And when you refer to fm_start below, you mean fm_start.fe_start? Sorry, I realise this is just an approximation but this part confused me.

Right, I'll write up a new RFC based on feedback here, and correcting the various errors in the original proposal.

So you get rid of all the logical file offsets in the extents because we report holes explicitly (and we know everything is contiguous if you include the holes).

Correct. It saves space in the common case.

Caller works something like:

	char buf[4096];
	struct fiemap *fm = (struct fiemap *)buf;
	int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);

	fm->fm_start.fe_start = 0; /* start of file */
	fm->fm_start.fe_len = -1; /* end of file */
	fm->fm_extent_count = count; /* max extents in fm_extents[] array */
	fm->fm_flags = 0; /* maybe no DMAPI, etc like XFS */

	fd = open(path, O_RDONLY);
	printf("logical\t\tphysical\t\tbytes\n");

	/* The last entry will have less extents than the maximum */
	while (fm->fm_extent_count == count) {
		rc = ioctl(fd, FIEMAP, fm);
		if (rc)
			break;

		/* kernel filled in fm_extents[] array, set fm_extent_count
		 * to be actual number of extents returned, leaves
		 * fm_start.fe_start alone (unlike XFS_IOC_GETBMAP). */

		for (i = 0; i < fm->fm_extent_count; i++) {
			__u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
			__u64 fm_next = fm->fm_start.fe_start + len;
			int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
			int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;

			printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
				fm->fm_start.fe_start, fm_next - 1,
				hole ? 0 : fm->fm_extents[i].fe_start,
				hole ? 0 : fm->fm_extents[i].fe_start +
					   fm->fm_extents[i].fe_len - 1,
				len, hole ? "(hole)" : "", unwr ? "(unwritten)" : "");

			/* get ready for printing next extent, or next ioctl */
			fm->fm_start.fe_start = fm_next;
		}
	}

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
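The point that logical offsets need not be stored, because holes are reported explicitly and the list is therefore contiguous, can be illustrated with a short sketch. Everything here is an assumption made for illustration: the struct and macro names are hypothetical, and the flag bits are placed in the top 8 bits of the length (in line with the "limit a single extent to 2^56 bytes" remark in the original RFC), which the thread does not spell out precisely. The length is taken as the bits outside the flag mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding: top 8 bits of fe_len hold flags, low 56 the length. */
#define LEN_FLAG_SHIFT     56
#define LEN_FLAG_MASK      (UINT64_C(0xff) << LEN_FLAG_SHIFT)
#define LEN_FLAG_HOLE      (UINT64_C(0x01) << LEN_FLAG_SHIFT)
#define LEN_FLAG_UNWRITTEN (UINT64_C(0x02) << LEN_FLAG_SHIFT)

/* Hypothetical packed record: physical start plus flagged length. */
struct packed_extent {
	uint64_t fe_start; /* physical start in bytes; undefined for holes */
	uint64_t fe_len;   /* length in bytes, flags in the top 8 bits */
};

/* Non-zero if the entry describes a hole rather than allocated space. */
static int is_hole(const struct packed_extent *e)
{
	return (e->fe_len & LEN_FLAG_HOLE) != 0;
}

/*
 * Logical offset at which entry idx begins. Because holes are listed
 * too, it is simply the request start plus the lengths of all earlier
 * entries; idx equal to the entry count yields the end of the mapping.
 */
static uint64_t logical_offset_of(const struct packed_extent *ext,
				  size_t idx, uint64_t first_logical)
{
	uint64_t logical = first_logical;

	assert(ext != NULL || idx == 0);
	for (size_t i = 0; i < idx; i++)
		logical += ext[i].fe_len & ~LEN_FLAG_MASK;
	return logical;
}
```

A caller that wants the logical range of entry i would take logical_offset_of(ext, i, start) as its beginning and logical_offset_of(ext, i + 1, start) as its end.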
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Apr 16, 2007 21:22 +1000, David Chinner wrote: On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: struct fiemap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ } struct fiemap { struct fiemap_extent fm_start; /* offset, length of desired mapping */ __u32 fm_extent_count; /* number of extents in array */ __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ __u64 unused; struct fiemap_extent fm_extents[0]; } #define FIEMAP_LEN_MASK 0xff #define FIEMAP_LEN_HOLE 0x01 #define FIEMAP_LEN_UNWRITTEN0x02 I'm not sure I like stealing bits from the length to use a flags - I'd prefer an explicit field per fiemap_extent for this. Christoph expressed the same concern. I'm not dead set against having an extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may mean the need for 50% more ioctls if the file is large. Below is an aggregation of the comments in this thread: struct fiemap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ __u32 fe_lun; /* logical storage device number in array */ } struct fiemap { __u64 fm_start; /* logical start offset of mapping (in/out) */ __u64 fm_len; /* logical length of mapping (in/out) */ __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ __u64 fm_unused; struct fiemap_extent fm_extents[0]; } /* flags for the fiemap request */ #define FIEMAP_FLAG_SYNC0x0001 /* flush delalloc data to disk*/ #define FIEMAP_FLAG_HSM_READ0x0002 /* retrieve data from HSM */ #define FIEMAP_FLAG_INCOMPAT0xff00 /* must understand these flags*/ /* flags for the returned extents */ #define FIEMAP_EXTENT_HOLE 0x0001 /* no space allocated */ #define FIEMAP_EXTENT_UNWRITTEN 0x0002 /* uninitialized space */ #define FIEMAP_EXTENT_UNKNOWN 0x0004 /* in use, location unknown */ #define FIEMAP_EXTENT_ERROR 0x0008 /* error 
mapping space */ #define FIEMAP_EXTENT_NO_DIRECT 0x0010 /* no direct data access */ SUMMARY OF CHANGES == - use fm_* fields directly in request instead of making it a fiemap_extent (though they are layed out identically) - separate flags word for fm_flags: - FIEMAP_FLAG_SYNC = range should be synced to disk before returning mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified (this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag) - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future if there is agreement on whether that is desirable to have or if it is better to call ioctl(FIEMAP) on an XATTR fd. - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it - __u64 fm_unused does not take up an extra space on all power-of-two buffer sizes (would otherwise be at end of buffer), and may be handy in the future. - add separate fe_flags word with flags from various suggestions: - FIEMAP_EXTENT_HOLE = extent has no space allocation - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown (e.g. HSM, delalloc awaiting sync, etc) - FIEMAP_EXTENT_ERROR = error mapping extent. Should fe_lun == errno? - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data encrypted, compressed, etc), may want separate flags for these? - add new fe_lun word per extent for filesystems that manage multiple devices (e.g. OCFS, GFS, ZFS, Lustre). This would otherwise have been unused. Given that xfs_bmap uses extra information from the filesystem (geometry) to display extra (and frequently used) information about the alignment of extents. 
ie: chook 681% xfs_bmap -vv fred fred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..151]:288444888..288445039 8 (1696536..1696687) 152 00010 FLAG Values: 01 Unwritten preallocated extent 001000 Doesn't begin on stripe unit 000100 Doesn't end on stripe unit 10 Doesn't begin on stripe width 01 Doesn't end on stripe width Can you clarify the terminology here? What is a stripe unit and what is a stripe width? Are there N * stripe_unit = stripe_width in e.g. a RAID
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Hi Andreas, --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger [EMAIL PROTECTED] wrote: I'm interested in getting input for implementing an ioctl to efficiently map file extents holes (FIEMAP) instead of looping over FIBMAP a billion times. ... I had come up with a plan independently and was also steered toward XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original plan, though I think the XFS structs used there are a bit bloated. They certainly seem to be (combining entries and header). struct fibmap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ } struct fibmap { struct fibmap_extent fm_start; /* offset, length of desired mapping */ __u32 fm_extent_count; /* number of extents in array */ __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ __u64 unused; struct fibmap_extent fm_extents[0]; } # define FIEMAP_LEN_MASK0xff # define FIEMAP_LEN_HOLE0x01 # define FIEMAP_LEN_UNWRITTEN 0x02 All offsets are in bytes to allow cases where filesystems are not going block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag). The -fm_extents[] array includes all of the holes in addition to allocated extents because this avoids the need to return both the logical and physical address for every extent and does not make processing any harder. Well, that's what stood out for me. I was wondering where the fe_block field had gone - the physical address. So is your fe_start; /* starting offset */ actually the disk location (not a logical file offset) _except_ in the header (fibmap) where it is the desired logical offset. Okay, looking at your example use below that's what it looks like. And when you refer to fm_start below, you mean fm_start.fe_start? Sorry, I realise this is just an approximation but this part confused me. 
So you get rid of all the logical file offsets in the extents because we report holes explicitly (and we know everything is contiguous if you include the holes).

--Tim

Caller works something like:

	char buf[4096];
	struct fibmap *fm = (struct fibmap *)buf;
	int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);

	fm->fm_extent.fe_start = 0; /* start of file */
	fm->fm_extent.fe_len = -1; /* end of file */
	fm->fm_extent_count = count; /* max extents in fm_extents[] array */
	fm->fm_flags = 0; /* maybe no DMAPI, etc like XFS */

	fd = open(path, O_RDONLY);
	printf("logical\t\tphysical\t\tbytes\n");

	/* The last entry will have less extents than the maximum */
	while (fm->fm_extent_count == count) {
		rc = ioctl(fd, FIEMAP, fm);
		if (rc)
			break;

		/* kernel filled in fm_extents[] array, set fm_extent_count
		 * to be actual number of extents returned, leaves fm_start
		 * alone (unlike XFS_IOC_GETBMAP). */

		for (i = 0; i < fm->fm_extent_count; i++) {
			__u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
			__u64 fm_next = fm->fm_start + len;
			int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
			int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;

			printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
				fm->fm_start, fm_next - 1,
				hole ? 0 : fm->fm_extents[i].fe_start,
				hole ? 0 : fm->fm_extents[i].fe_start +
					   fm->fm_extents[i].fe_len - 1,
				len, hole ? "(hole)" : "", unwr ? "(unwritten)" : "");

			/* get ready for printing next extent, or next ioctl */
			fm->fm_start = fm_next;
		}
	}
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:

struct fibmap_extent {
	__u64 fe_start; /* starting offset in bytes */
	__u64 fe_len; /* length in bytes */
}

struct fibmap {
	struct fibmap_extent fm_start; /* offset, length of desired mapping */
	__u32 fm_extent_count; /* number of extents in array */
	__u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
	__u64 unused;
	struct fibmap_extent fm_extents[0];
}

#define FIEMAP_LEN_MASK 0xff
#define FIEMAP_LEN_HOLE 0x01
#define FIEMAP_LEN_UNWRITTEN 0x02

All offsets are in bytes to allow cases where filesystems are not doing block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag).

One feature that XFS_IOC_GETBMAPX has that may be desirable is the ability to return unwritten extent information. In order to do this XFS required expanding the per-extent struct from 32 to 48 bytes per extent, but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship) and keep 8 bytes or so for input/output flags per extent (would need to be masked before use).

I'd be much happier to have the separate per-extent flags value. For one thing this allows much nicer representations of unwritten extents or holes without taking away bits from the len value. It also allows interesting uses in the future, e.g. reporting an offline extent for use in HSM applications.

Also for this kernel-user interface the wasted space shouldn't matter too much - if you want to pass the above condensed structure over the wire in lustre that shouldn't be a problem, you'd have to convert to an endian-neutral on-the-wire format anyway.

Not doing the masking also makes the interface quite a bit simpler to use.

One additional feature from the XFS getbmapx interface we should provide is the ability to query the layout of xattrs. While other filesystems might not have the exact xattr fork XFS has, it fits nicely into the interface, especially when we have Anton's suggested flag for inline data.
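To make the "no masking" argument concrete, here is a minimal sketch of a filefrag-style fragment count over the layout with a separate flags word. The struct and the FIEMAP_EXTENT_HOLE value are taken from the aggregated proposal elsewhere in this thread; count_fragments() and its merging rule are hypothetical, and flags other than HOLE are ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIEMAP_EXTENT_HOLE 0x0001u /* no space allocated */

/* Per-extent record with its own flags word, as aggregated in the thread. */
struct fiemap_extent {
	uint64_t fe_start; /* physical start in bytes */
	uint64_t fe_len;   /* length in bytes, no flag bits to strip */
	uint32_t fe_flags; /* FIEMAP_EXTENT_* flags */
	uint32_t fe_lun;   /* logical storage device number */
};

/*
 * Hypothetical helper: count fragments in a packed extent list. An
 * allocated extent joins the previous fragment only if the entry just
 * before it is also allocated and ends exactly where this one starts.
 */
static size_t count_fragments(const struct fiemap_extent *ext, size_t n)
{
	size_t frags = 0;
	uint64_t prev_end = 0;
	int have_prev = 0;

	assert(ext != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (ext[i].fe_flags & FIEMAP_EXTENT_HOLE) {
			have_prev = 0; /* a hole always splits fragments */
			continue;
		}
		if (!have_prev || ext[i].fe_start != prev_end)
			frags++;
		prev_end = ext[i].fe_start + ext[i].fe_len;
		have_prev = 1;
	}
	return frags;
}
```

The flag test reads fe_flags directly and fe_len is used as-is in the end-of-extent calculation, which is the simplification being argued for; with flags folded into the length, both lines would need a mask.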
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On 13 Apr 2007, at 11:15, Christoph Hellwig wrote: On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: struct fibmap_extent { __u64 fe_start; /* starting offset in bytes */ __u64 fe_len; /* length in bytes */ } struct fibmap { struct fibmap_extent fm_start; /* offset, length of desired mapping */ __u32 fm_extent_count; /* number of extents in array */ __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ __u64 unused; struct fibmap_extent fm_extents[0]; } #define FIEMAP_LEN_MASK 0xff #define FIEMAP_LEN_HOLE 0x01 #define FIEMAP_LEN_UNWRITTEN0x02 All offsets are in bytes to allow cases where filesystems are not going block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag). One feature that XFS_IOC_GETBMAPX has that may be desirable is the ability to return unwritten extent information. In order to do this XFS required expanding the per-extent struct from 32 to 48 bytes per extent, but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship) and keep 8 bytes or so for input/output flags per extent (would need to be masked before use). I'd be much happier to have the separate per-extent flags value. For one thing this allows much nicer representations of unwritten extents or holes without taking away bits from the len value. It also allows to make interesting use of this in the future, e.g. telling about an offline exttent for use in HSM applications. Also for this kernel-user interface the wasted space shouldn't matter too much - if you want to pass the above condensed structure over the wire in lustre that shouldn't a problem, you'd have to convert to an endian-neutral on the wire format anyway. Not doing the masking also make the interface quite a bit simpler to use. One addition freature from the XFS getbmapx interface we should provide is the ability to query layout of xattrs. 
While other filesystems might not have the exact xattr fork XFS has it fits nicely into the interface. Especially when we have Anton's suggested flag for inline data. Would it not be better to allow people to get a file descriptor on the xattr fork and then just run the normal FIEMAP ioctl on that file descriptor? I.e. openat(base file descriptor, O_STREAM, streamname) or O_XATTR or whatever... An alternative API would be to provide a getxattrfd ()/fgetxattrfd() call or similar that would instead of returning the value of an xattr return an fd to it. Then you do not need to modify openat() at all... Interface doesn't bother me, just some ideas... And for XFS you would define a magic streamname or xattrname (or whatever you want to call it) of say com.sgi.filesystem.xfs.xattrstream (or .xattrfork) or something and then XFS would intercept that and know what to do with it... Such an interface could then be used by NTFS named streams and other file systems providing such things... (Yes I know I will now totally get flamed about named streams not being wanted in Linux and crap like that but that is exactly what you are asking for except you want to special case a particular stream using a flag instead of calling it for what it really is and once you start doing that you might as well allow full named streams...) You can just see named streams as an alternative, non-atomic API to xattrs if you like, i.e. you can either use the atomic xattr API provided in Linux already or you can get a file descriptor to an xattr and then use the normal system calls to access it non- atomically thus you can use the FIEMAP ioctl also. (-: FWIW this two-API approach to xattrs/named streams is the direction OSX is heading towards also so it is not without precedent and Windows has had both APIs for many years. And Solaris has the openat (O_XATTR) interface so that is not without precedent either. Best regards, Anton PS. 
to all flamers: I am going to delete any non-technical flames without replying so please do us all a favour and don't bother... Thanks. -- Anton Altaparmakov aia21 at cam.ac.uk (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Fri, 2007-04-13 at 12:38 +0100, Anton Altaparmakov wrote: One addition freature from the XFS getbmapx interface we should provide is the ability to query layout of xattrs. While other filesystems might not have the exact xattr fork XFS has it fits nicely into the interface. Especially when we have Anton's suggested flag for inline data. Would it not be better to allow people to get a file descriptor on the xattr fork and then just run the normal FIEMAP ioctl on that file descriptor? I.e. openat(base file descriptor, O_STREAM, streamname) or O_XATTR or whatever... An alternative API would be to provide a getxattrfd ()/fgetxattrfd() call or similar that would instead of returning the value of an xattr return an fd to it. Then you do not need to modify openat() at all... Interface doesn't bother me, just some ideas... And for XFS you would define a magic streamname or xattrname (or whatever you want to call it) of say com.sgi.filesystem.xfs.xattrstream (or .xattrfork) or something and then XFS would intercept that and know what to do with it... Such an interface could then be used by NTFS named streams and other file systems providing such things... (Yes I know I will now totally get flamed about named streams not being wanted in Linux and crap like that but that is exactly what you are asking for except you want to special case a particular stream using a flag instead of calling it for what it really is and once you start doing that you might as well allow full named streams...) You can just see named streams as an alternative, non-atomic API to xattrs if you like, i.e. you can either use the atomic xattr API provided in Linux already or you can get a file descriptor to an xattr and then use the normal system calls to access it non- atomically thus you can use the FIEMAP ioctl also. (-: FWIW this two-API approach to xattrs/named streams is the direction OSX is heading towards also so it is not without precedent and Windows has had both APIs for many years. 
And Solaris has the openat (O_XATTR) interface so that is not without precedent either.

Except that xattrs in Linux aren't streams, and providing a stream-like interface to them would be a weird abuse of the xattr concept.

In essence, Linux xattrs are named extensions to struct stat, with getxattr() being in the same category as stat() and setxattr() being in the same category as chmod()/chown()/utime()/etc.

The system namespace exists to provide a better interface than ioctl() to weird FS-specific features (DOS attribute bits, HFS+ creator/type, ext2/3/reiserfs/etc. immutable/append-only/secure-delete/etc. attributes and so on). The uptake of this feature isn't as high as I'd like, but that's what it's there for.

The security namespace is there for all the neat LSM modules that need to attach metadata to files in order to function.

Finally, the user namespace exists to allow users to attach small bits of information to their own files, since the API was already there and hey!, metadata is useful.

Now, Solaris came along and totally confused the issue by using the same name for a completely different feature, but that isn't any real reason to mess up the existing Linux xattr concept just to graft named streams support into the kernel.

(Not that I'm opposed to named streams in Linux, you just have to realize that xattrs aren't named streams, can't live in the same namespace as named streams, and certainly don't serve the same purpose as named streams.)

--
Nicholas Miell
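As a small illustration of the namespaces described in the message above: Linux xattr names carry their namespace as a dotted prefix (for example user.comment or security.selinux), so a tool can sort names by namespace with a prefix test. The following is only a sketch; the enum and the function names xattr_namespace() and has_prefix() are hypothetical helpers invented for this example, not part of any kernel or libc API.

```c
#include <assert.h>
#include <string.h>

/* Namespaces named in the message above, plus a catch-all.
 * ASSUMPTION: this enum is illustrative only. */
enum xattr_ns {
	XATTR_NS_USER,
	XATTR_NS_SYSTEM,
	XATTR_NS_SECURITY,
	XATTR_NS_OTHER
};

/* Return nonzero if name starts with prefix. */
static int has_prefix(const char *name, const char *prefix)
{
	return strncmp(name, prefix, strlen(prefix)) == 0;
}

/* Classify an xattr name such as "user.comment" by its namespace prefix. */
enum xattr_ns xattr_namespace(const char *name)
{
	assert(name != NULL);
	if (has_prefix(name, "user."))
		return XATTR_NS_USER;
	if (has_prefix(name, "system."))
		return XATTR_NS_SYSTEM;
	if (has_prefix(name, "security."))
		return XATTR_NS_SECURITY;
	return XATTR_NS_OTHER;
}
```

The prefix test includes the trailing dot, so a name like "username" is not mistaken for a member of the user namespace.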
[RFC] add FIEMAP ioctl to efficiently map file allocation
I'm interested in getting input for implementing an ioctl to efficiently map file extents and holes (FIEMAP) instead of looping over FIBMAP a billion times. We already have customers with single files in the 10TB range and we additionally need to get the mapping over the network so it needs to be efficient in terms of how data is passed, and how easily it can be extracted from the filesystem.

I had come up with a plan independently and was also steered toward XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original plan, though I think the XFS structs used there are a bit bloated.

There was also recent discussion about SEEK_HOLE and SEEK_DATA as implemented by Sun, but even if we could skip the holes we still might need to do millions of FIBMAPs to see how large files are allocated on disk. Conversely, having filesystems implement an efficient FIEMAP ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE and SEEK_DATA instead of looping over ->bmap() inside the kernel as I saw in one patch.

    struct fibmap_extent {
        __u64 fe_start;                 /* starting offset in bytes */
        __u64 fe_len;                   /* length in bytes */
    };

    struct fibmap {
        struct fibmap_extent fm_start;  /* offset, length of desired mapping */
        __u32 fm_extent_count;          /* number of extents in array */
        __u32 fm_flags;                 /* flags (similar to XFS_IOC_GETBMAP) */
        __u64 unused;
        struct fibmap_extent fm_extents[0];
    };

    #define FIEMAP_LEN_MASK         0xff
    #define FIEMAP_LEN_HOLE         0x01
    #define FIEMAP_LEN_UNWRITTEN    0x02

All offsets are in bytes to allow cases where filesystems are not doing block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag).

The ->fm_extents[] array includes all of the holes in addition to allocated extents because this avoids the need to return both the logical and physical address for every extent and does not make processing any harder.
One feature that XFS_IOC_GETBMAPX has that may be desirable is the ability to return unwritten extent information. In order to do this XFS required expanding the per-extent struct from 32 to 48 bytes per extent, but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship) and keep 8 bits or so for input/output flags per extent (would need to be masked before use).

Caller works something like:

    char buf[4096];
    struct fibmap *fm = (struct fibmap *)buf;
    int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm->fm_extents[0]);

    fm->fm_start.fe_start = 0;      /* start of file */
    fm->fm_start.fe_len = -1;       /* end of file */
    fm->fm_extent_count = count;    /* max extents in fm_extents[] array */
    fm->fm_flags = 0;               /* maybe no DMAPI, etc like XFS */

    fd = open(path, O_RDONLY);
    printf("logical\t\tphysical\t\tbytes\n");

    /* The last entry will have less extents than the maximum */
    while (fm->fm_extent_count == count) {
        rc = ioctl(fd, FIEMAP, fm);
        if (rc)
            break;

        /* kernel filled in fm_extents[] array, set fm_extent_count
         * to be actual number of extents returned, leaves fm_start
         * alone (unlike XFS_IOC_GETBMAP). */

        for (i = 0; i < fm->fm_extent_count; i++) {
            /* strip the flag bits to get the extent length */
            __u64 len = fm->fm_extents[i].fe_len & ~FIEMAP_LEN_MASK;
            __u64 fm_next = fm->fm_start.fe_start + len;
            int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
            int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;

            printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
                   fm->fm_start.fe_start, fm_next - 1,
                   hole ? 0 : fm->fm_extents[i].fe_start,
                   hole ? 0 : fm->fm_extents[i].fe_start + len - 1,
                   len,
                   hole ? "(hole) " : "", unwr ? "(unwritten) " : "");

            /* get ready for printing next extent, or next ioctl */
            fm->fm_start.fe_start = fm_next;
        }
    }

I'm not wedded to an ioctl interface, but it seems consistent with FIBMAP. I'm quite open to suggestions at this point, both in terms of how usable the fibmap data structures are by the caller, and if we need to add anything to make them more flexible for the future.
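The packed-length idea in the proposal (a 56-bit byte count sharing one 64-bit field with a few flag bits, and logical offsets recovered by summing the lengths of earlier entries, holes included) can be sketched in isolation. The flag values in the archived text are truncated, so the constants below are assumptions made up for this example: they place the flag byte in bits 56 to 63. All ex_* and EX_* names are hypothetical and are not the proposed kernel interface.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: flag byte lives in bits 56..63, length in the low 56 bits. */
#define EX_LEN_FLAG_SHIFT	56
#define EX_LEN_FLAG_MASK	((uint64_t)0xff << EX_LEN_FLAG_SHIFT)
#define EX_LEN_HOLE		((uint64_t)0x01 << EX_LEN_FLAG_SHIFT)
#define EX_LEN_UNWRITTEN	((uint64_t)0x02 << EX_LEN_FLAG_SHIFT)

struct ex_extent {
	uint64_t fe_start;	/* physical start in bytes */
	uint64_t fe_len;	/* length in bytes, flags in the top byte */
};

/* Combine a byte count and flag bits into one packed fe_len value. */
uint64_t ex_pack_len(uint64_t bytes, uint64_t flags)
{
	assert((bytes & EX_LEN_FLAG_MASK) == 0);	/* must fit in 56 bits */
	assert((flags & ~EX_LEN_FLAG_MASK) == 0);	/* flags only in top byte */
	return bytes | flags;
}

/* Length of the extent with the flag bits masked off. */
uint64_t ex_extent_bytes(const struct ex_extent *e)
{
	return e->fe_len & ~EX_LEN_FLAG_MASK;
}

int ex_extent_is_hole(const struct ex_extent *e)
{
	return (e->fe_len & EX_LEN_HOLE) != 0;
}

int ex_extent_is_unwritten(const struct ex_extent *e)
{
	return (e->fe_len & EX_LEN_UNWRITTEN) != 0;
}

/* Logical offset of entry idx: the request start plus the lengths of all
 * earlier entries. This works only because holes are listed as entries. */
uint64_t ex_logical_offset(const struct ex_extent *ext, unsigned idx,
			   uint64_t start)
{
	uint64_t off = start;

	for (unsigned i = 0; i < idx; i++)
		off += ex_extent_bytes(&ext[i]);
	return off;
}
```

A caller would mask every fe_len through ex_extent_bytes() before doing arithmetic on it, which is the "would need to be masked before use" caveat from the proposal.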
In terms of implementing this in the kernel, there was originally code for this during the development of the ext3 extent patches and it was done via a callback in the extent tree iterator so it is very efficient. I believe it implements all that is needed
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Hi Andreas,

On 12 Apr 2007, at 12:05, Andreas Dilger wrote:
> I'm interested in getting input for implementing an ioctl to efficiently map file extents and holes (FIEMAP) instead of looping over FIBMAP a billion times. We already have customers with single files in the 10TB range and we additionally need to get the mapping over the network so it needs to be efficient in terms of how data is passed, and how easily it can be extracted from the filesystem.
>
> I had come up with a plan independently and was also steered toward XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original plan, though I think the XFS structs used there are a bit bloated.
>
> There was also recent discussion about SEEK_HOLE and SEEK_DATA as implemented by Sun, but even if we could skip the holes we still might need to do millions of FIBMAPs to see how large files are allocated on disk. Conversely, having filesystems implement an efficient FIEMAP ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE and SEEK_DATA instead of looping over ->bmap() inside the kernel as I saw in one patch.
>
>     struct fibmap_extent {
>         __u64 fe_start;                 /* starting offset in bytes */
>         __u64 fe_len;                   /* length in bytes */
>     };
>
>     struct fibmap {
>         struct fibmap_extent fm_start;  /* offset, length of desired mapping */
>         __u32 fm_extent_count;          /* number of extents in array */
>         __u32 fm_flags;                 /* flags (similar to XFS_IOC_GETBMAP) */
>         __u64 unused;
>         struct fibmap_extent fm_extents[0];
>     };
>
>     #define FIEMAP_LEN_MASK         0xff
>     #define FIEMAP_LEN_HOLE         0x01
>     #define FIEMAP_LEN_UNWRITTEN    0x02

Sounds good but I would add:

    #define FIEMAP_LEN_NO_DIRECT_ACCESS

This would say that the offset on disk can move at any time or that the data is compressed or encrypted on disk thus the data is not useful for direct disk access. On NTFS small files can be inside the inode and there direct access is not possible because the metadata on disk is protected with fixups which need to be removed when the inode is read into memory.
If you access the data directly on disk, you would see corrupt data on reads and cause corruption on writes... Similarly both for compressed and encrypted files doing direct access to the on-disk data is totally nonsensical as you would see random junk on read and cause fatal data corruption on writes.

Also why are you not using 0xff00, i.e. two more zeroes at the end? Seems unnecessary to drop an extra 8 bits of significance from the byte size... May not matter today but it almost certainly will do in the future (just remember what people said about the 640k limit in MSDOS when it first came out!)...

Finally please make sure that the file system can return in one way or another errors for example when it fails to determine the extents because the system ran out of memory, there was an i/o error, whatever... It may even be useful to be able to say here is an extent of size X bytes but we do not know where it is on disk because there was an error determining this particular extent's on-disk location for some reason or other...

> All offsets are in bytes to allow cases where filesystems are not doing

Excellent!

> block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag).

Why the fe_start == 0? Surely just the flag is sufficient... On NTFS it is perfectly valid to have fe_start == 0 and to have that not be sparse (normally the $Boot system file is stored in the first 8 sectors of the volume)...

Best regards,

    Anton

> The ->fm_extents[] array includes all of the holes in addition to allocated extents because this avoids the need to return both the logical and physical address for every extent and does not make processing any harder.
>
> One feature that XFS_IOC_GETBMAPX has that may be desirable is the ability to return unwritten extent information.
> In order to do this XFS required expanding the per-extent struct from 32 to 48 bytes per extent, but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship) and keep 8 bits or so for input/output flags per extent (would need to be masked before use).
>
> Caller works something like:
>
>     char buf[4096];
>     struct fibmap *fm = (struct fibmap *)buf;
>     int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm->fm_extents[0]);
>
>     fm->fm_start.fe_start = 0;      /* start of file */
>     fm->fm_start.fe_len = -1;       /* end of file */
>     fm->fm_extent_count = count;    /* max extents in fm_extents[] array */
>     fm->fm_flags = 0;               /* maybe no DMAPI, etc like XFS */
>
>     fd = open(path, O_RDONLY);
>     printf("logical\t\tphysical\t\tbytes\n");
>
>     /* The last
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote:
> On 12 Apr 2007, at 12:05, Andreas Dilger wrote:
> > I'm interested in getting input for implementing an ioctl to efficiently map file extents and holes (FIEMAP) instead of looping over FIBMAP a billion times. We already have customers with single files in the 10TB range and we additionally need to get the mapping over the network so it needs to be efficient in terms of how data is passed, and how easily it can be extracted from the filesystem.
> >
> >     struct fibmap_extent {
> >         __u64 fe_start;                 /* starting offset in bytes */
> >         __u64 fe_len;                   /* length in bytes */
> >     };
> >
> >     struct fibmap {
> >         struct fibmap_extent fm_start;  /* offset, length of desired mapping */
> >         __u32 fm_extent_count;          /* number of extents in array */
> >         __u32 fm_flags;                 /* flags for input request (cf. XFS_IOC_GETBMAP) */
> >         __u64 unused;
> >         struct fibmap_extent fm_extents[0];
> >     };
> >
> >     #define FIEMAP_LEN_MASK         0xff
> >     #define FIEMAP_LEN_HOLE         0x01
> >     #define FIEMAP_LEN_UNWRITTEN    0x02
>
> Sounds good but I would add:
>
>     #define FIEMAP_LEN_NO_DIRECT_ACCESS
>
> This would say that the offset on disk can move at any time or that the data is compressed or encrypted on disk thus the data is not useful for direct disk access.

This makes sense. Even for Reiserfs the same is true with packed tails, and I believe if FIBMAP is called on a tail it will migrate the tail into a block because this might be a sign that the file is a kernel that LILO wants to boot. I'd rather not have any such feature in FIEMAP, and just return the on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me. My main reason for FIEMAP is being able to investigate allocation patterns of files.

By no means is my flag list exhaustive, just the ones that I thought would be needed to implement this for ext4 and Lustre.

> Also why are you not using 0xff00, i.e. two more zeroes at the end? Seems unnecessary to drop an extra 8 bits of significance from the byte size...
It was actually just a typo (this was the first time I'd written the structs and flags down, it is just at the discussion stage). I'd meant for it to be 2^56 bytes for the file size as I wrote later in the email. That said, I think that 2^48 bytes is probably sufficient for most uses, so that we get 16 bits for flags. As it is this email already discusses 5 flags, and that would give little room for expansion in the future.

Remember, this is the mapping for a single file (which can't practically be beyond 2^64 bytes as yet) so it wouldn't be hard for the filesystem to return a few separate extents which are actually contiguous (assuming that there will actually be files in filesystems with 2^48 bytes of contiguous space).

Since the API is that it will return the extent that contains the requested start byte, the kernel will be able to detect this case also, since it won't be able to specify a length for the extent that contains the start byte. At most we'd have to call the ioctl() 65536 times for a completely contiguous 2^64 byte file if the buffer was only large enough for a single extent. In reality, I expect any file to have some discontinuities and the buffer to be large enough for a thousand or more entries so the corner case is not very bad.

> Finally please make sure that the file system can return in one way or another errors for example when it fails to determine the extents because the system ran out of memory, there was an i/o error, whatever... It may even be useful to be able to say here is an extent of size X bytes but we do not know where it is on disk because there was an error determining this particular extent's on-disk location for some reason or other...

Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and FIEMAP_LEN_ERROR. Consider FIEMAP on a file that was migrated to tape and currently has no blocks allocated in the filesystem.
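The worst-case figure quoted earlier in this message (65536 ioctl calls for a fully contiguous 2^64 byte file when each extent is capped at 2^48 bytes and the buffer holds one extent) is just 2^64 / 2^48 = 2^16. A rough sketch of that arithmetic follows; the function names are hypothetical, and because 2^64 itself does not fit in a 64-bit integer the example uses UINT64_MAX (2^64 - 1) as a stand-in, which rounds up to the same count.

```c
#include <assert.h>
#include <stdint.h>

/* Extents needed to describe len contiguous bytes when a single extent
 * can carry at most 2^len_bits bytes (rounded up). */
uint64_t ex_extents_needed(uint64_t len, unsigned len_bits)
{
	assert(len_bits > 0 && len_bits < 64);
	uint64_t max = (uint64_t)1 << len_bits;

	return (len >> len_bits) + ((len & (max - 1)) != 0);
}

/* ioctl calls needed when each call can return per_call extents. */
uint64_t ex_calls_needed(uint64_t extents, uint64_t per_call)
{
	assert(per_call > 0);
	return extents / per_call + (extents % per_call != 0);
}
```

With a buffer large enough for a thousand entries, as the message expects in practice, the same file would need 66 calls rather than 65536.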
We want to return some indication that there is actual file data and not just a hole, but at the same time we don't want this to actually return the file from tape just to generate block mappings for it.

This concept is also present in XFS_IOC_GETBMAPX - BMV_IF_NO_DMAPI_READ, but this needs to be specified on input to prevent the file being mapped and I'd rather the opposite (not getting file from tape) be the default, by principle of least surprise.

> > block-aligned/sized allocations (e.g. tail packing). The fm_extents array returned contains the packed list of allocation extents for the file, including entries for holes (which have fe_start == 0, and a flag).
>
> Why the fe_start == 0? Surely just the flag is sufficient... On NTFS it is perfectly valid to have fe_start == 0 and to have that not be sparse (normally the $Boot system file is stored in the first 8 sectors of the volume)...

I thought