Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Mark Fasheh
Hi Andreas,

Thanks for posting this. I believe that an interface such as FIEMAP
would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

My comments below are generally geared towards understanding the ioctl
interface.

On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:

 2 Functional specification
 
 The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
 ioctl block device ioctl used for mapping an individual logical block
 address in a file to a physical block address in the block device. The
 FIEMAP ioctl will return the logical to physical mapping for the extent
 that contains the specified logical byte address.
 
 struct fiemap_extent {
 __u64 fe_offset;/* offset in bytes for the start of the extent */

I'm a little bit confused by fe_offset. Is it a physical offset, or a
logical offset? The reason I ask is that your description above says FIEMAP
ioctl will return the logical to physical mapping for the extent that
contains the specified logical byte address. Which seems to imply physical,
but your math to get to the next logical start in a very fragmented file,
implies that fe_offset is a logical offset:

   fm_start = fm_extents[fm_extent_count - 1].fe_offset +
 fm_extents[fm_extent_count - 1].fe_length + 1; 


 The logic for the filefrag would be similar to above. The size of the
 extent array will be extrapolated from the filesize and multiple ioctls
 of increasing extent count may be called for very large files. filefrag
 can easily call the FIEMAP ioctls repeatedly using the end of the last
 extent as the start offset for the next ioctl:
 
   fm_start = fm_extents[fm_extent_count - 1].fe_offset +
 fm_extents[fm_extent_count - 1].fe_length + 1;
 
 We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
 will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.

I think you meant 'fm_length' instead of 'fm_end' there.


 The FIEMAP_FLAG_* values are specified below. If FIEMAP_FLAG_NO_EXTENTS is
 given then the fm_extents array is not filled, and only fm_extent_count is
 returned with the total number of extents in the file. Any new flags that
 introduce and/or require an incompatible behaviour in an application or
 in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT
 (e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS would fall into that
 range if they were not part of the original specification). This is
 currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT
 is not large enough then it is possible to use the last INCOMPAT flag
 0x0100 to indicate that more of the flag range contains incompatible
 flags.
 
 #define FIEMAP_FLAG_SYNC        0x0001 /* sync file data before map */
 #define FIEMAP_FLAG_HSM_READ    0x0002 /* get data from HSM before map */
 #define FIEMAP_FLAG_NUM_EXTENTS 0x0004 /* return only number of extents */
 #define FIEMAP_FLAG_INCOMPAT    0xff00 /* error for unknown flags in here */
 
 The returned data from the FIEMAP ioctl is an array of fiemap_extent
 elements, one per extent in the file. The first extent will contain the
 byte specified by fm_start and the last extent will contain the byte
 specified by fm_start + fm_len, unless there are more than the passed-in
 fm_extent_count extents in the file, or this is beyond the EOF in which
 case the last extent will be marked with FIEMAP_EXTENT_LAST. Each extent
 returned has a set of flags associated with it that provide additional
 information about the extent. Not all filesystems will support all flags.
 
 FIEMAP_FLAG_NUM_EXTENTS will return only the number of extents used by
 the file. It will be used by default for filefrag since the specific
 extent information is not required in many cases.
 
 #define FIEMAP_EXTENT_HOLE      0x0001 /* has no data or space allocation */

Btw, I really like that holes are explicitly marked.


 #define FIEMAP_EXTENT_UNWRITTEN 0x0002 /* space allocated, but no data */
 #define FIEMAP_EXTENT_UNMAPPED  0x0004 /* has data but no space allocated */
 #define FIEMAP_EXTENT_ERROR     0x0008 /* map error, errno in fe_offset. */
 #define FIEMAP_EXTENT_NO_DIRECT 0x0010 /* cannot access data directly */
 #define FIEMAP_EXTENT_LAST      0x0020 /* last extent in the file */
 #define FIEMAP_EXTENT_DELALLOC  0x0040 /* has data but not yet written */
 #define FIEMAP_EXTENT_SECONDARY 0x0080 /* data in secondary storage */
 #define FIEMAP_EXTENT_EOF       0x0100 /* fm_start + fm_len beyond EOF */

Is EOF here considering beyond i_size or beyond allocation?


 #define FIEMAP_EXTENT_UNKNOWN   0x0200 /* in use but location is unknown */
 
 
 FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
 encrypted, compressed, etc.)

Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
Btrfs, Ocfs2, and Gfs2 pack small amounts of 

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
By request on #linuxfs, here is the FIEMAP spec that we used to implement
the FIEMAP support for ext4.  There was an ext4 patch posted on August 29
to linux-ext4 entitled [PATCH] FIEMAP ioctl.   I've asked Kalpak to post
an updated version of that patch along with the changes to the filefrag
tool to use FIEMAP.

 FIEMAP_1.0.txt ==

File Mapping Interface

18 June 2007

Andreas Dilger, Kalpak Shah

Introduction

This document covers the user interface and internal implementation of
an efficient fragmentation reporting tool. This will include addition
of a FIEMAP ioctl to fetch extents and changes to filefrag to use this
ioctl. The main objective of this tool is to efficiently and easily allow
inspection of the disk layout of one or more files without requiring
user access to the underlying storage device(s).

1 Requirements

The tool should be efficient in its use of resources, even for large
files. The FIBMAP ioctl is not suitable for use on large files,
as this can result in millions or even billions of ioctls to get the
mapping information for a single file. It should be possible to get the
information about an arbitrary-sized extent in a single call, and the
kernel component and user tool should efficiently use this information.
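
To put a rough number on the FIBMAP cost described above, here is a minimal sketch that counts the calls needed to map a whole file, assuming one ioctl per filesystem block. The helper name `fibmap_calls_needed` and the example sizes are illustrative assumptions and do not come from the spec.

```c
#include <stdint.h>

/* Number of FIBMAP calls needed to map a whole file, assuming one
 * ioctl per filesystem block (hypothetical helper, for illustration). */
static uint64_t fibmap_calls_needed(uint64_t file_size, uint64_t block_size)
{
    if (block_size == 0)
        return 0;
    /* Round up so a partial final block still costs one call. */
    return file_size / block_size + (file_size % block_size != 0);
}
```

For a 1 TiB file with 4 KiB blocks this gives 268,435,456 calls, whereas an extent-based interface needs one record per extent, however long the extent is.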

The user interface should be simple, and the output should be easily
understood - by default the filename(s), a count of extents (for each
file), and the optimal number of extents for a file with the given
striping parameters. The user interface will be filefrag [options]
{filename ...} and will allow retrieving the fragmentation information
for one or more files specified on the command-line. The output will be
of the form:

/path/to/file1: extents=2 optimal=1

/path/to/file2: extents=10 optimal=4

..

2 Functional specification

The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
ioctl block device ioctl used for mapping an individual logical block
address in a file to a physical block address in the block device. The
FIEMAP ioctl will return the logical to physical mapping for the extent
that contains the specified logical byte address.

struct fiemap_extent {
        __u64 fe_offset;  /* offset in bytes for the start of the extent */
        __u64 fe_length;  /* length in bytes for the extent */
        __u32 fe_flags;   /* returned FIEMAP_EXTENT_* flags for the extent */
        __u32 fe_lun;     /* logical device number for extent (starting at 0) */
};



struct fiemap {
        __u64 fm_start;         /* logical byte offset (in/out) */
        __u64 fm_length;        /* logical length of map (in/out) */
        __u32 fm_flags;         /* FIEMAP_FLAG_* flags (in/out) */
        __u32 fm_extent_count;  /* extents in fm_extents (in/out) */
        __u64 fm_unused;

        struct fiemap_extent fm_extents[0];
};



In the ioctl request, the fiemap struct is initialized with the desired
mapping information.

fiemap.fm_start = {desired start byte offset, 0 if whole file};
fiemap.fm_length = {length of mapping in bytes, ~0ULL if whole file};
fiemap.fm_extent_count = {number of fiemap_extents in fm_extents array};
fiemap.fm_flags = {flags from FIEMAP_FLAG_* array, if needed};

ioctl(fd, FIEMAP, &fiemap);
{verify fiemap flags are understood}

for (i = 0; i < fiemap.fm_extent_count; i++) {
        {process extent fiemap.fm_extents[i]};
}
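
The request setup and the extent loop above can be sketched in plain C as follows. This is only an illustration under stated assumptions: the `rfc_` prefixed structures are local copies of the layout proposed in this document (not definitions from any released kernel header), the helper names are made up, and no ioctl is issued because the request number is not given here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Local copies of the structures as proposed in this RFC. They are NOT
 * the definitions from any released kernel header. */
struct rfc_fiemap_extent {
    uint64_t fe_offset;  /* offset in bytes for the start of the extent */
    uint64_t fe_length;  /* length in bytes for the extent */
    uint32_t fe_flags;   /* FIEMAP_EXTENT_* flags */
    uint32_t fe_lun;     /* logical device number, starting at 0 */
};

struct rfc_fiemap {
    uint64_t fm_start;
    uint64_t fm_length;
    uint32_t fm_flags;
    uint32_t fm_extent_count;
    uint64_t fm_unused;
    struct rfc_fiemap_extent fm_extents[];  /* C99 flexible array member */
};

#define RFC_FIEMAP_EXTENT_LAST 0x0020  /* value from the RFC's flag list */

/* Allocate a zeroed request with room for `count` extents, set up to
 * map the whole file (start 0, length ~0ULL). Caller frees the result. */
static struct rfc_fiemap *rfc_fiemap_alloc(uint32_t count)
{
    struct rfc_fiemap *fm =
        calloc(1, sizeof(*fm) + (size_t)count * sizeof(fm->fm_extents[0]));

    if (fm == NULL)
        return NULL;
    fm->fm_start = 0;
    fm->fm_length = ~0ULL;
    fm->fm_extent_count = count;
    return fm;
}

/* Walk the returned extents and report how many were seen up to and
 * including the one flagged LAST (or all of them if none is flagged). */
static uint32_t rfc_fiemap_count_until_last(const struct rfc_fiemap *fm)
{
    uint32_t i;

    for (i = 0; i < fm->fm_extent_count; i++) {
        if (fm->fm_extents[i].fe_flags & RFC_FIEMAP_EXTENT_LAST)
            return i + 1;
    }
    return fm->fm_extent_count;
}
```

A caller would fill the request with `rfc_fiemap_alloc()`, issue the ioctl, and then walk `fm_extents`; the header and the extent array sit in one allocation, so a single `free()` releases both.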


The logic for the filefrag would be similar to above. The size of the
extent array will be extrapolated from the filesize and multiple ioctls
of increasing extent count may be called for very large files. filefrag
can easily call the FIEMAP ioctls repeatedly using the end of the last
extent as the start offset for the next ioctl:

fm_start = fm_extents[fm_extent_count - 1].fe_offset +
fm_extents[fm_extent_count - 1].fe_length + 1;

We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.

The FIEMAP_FLAG_* values are specified below. If FIEMAP_FLAG_NO_EXTENTS is
given then the fm_extents array is not filled, and only fm_extent_count is
returned with the total number of extents in the file. Any new flags that
introduce and/or require an incompatible behaviour in an application or
in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT
(e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS would fall into that
range if they were not part of the original specification). This is
currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT
is not large enough then it is possible to use the last INCOMPAT flag
0x0100 to indicate that more of the flag range contains incompatible
flags.

#define FIEMAP_FLAG_SYNC        0x0001 /* sync file data before map */
#define FIEMAP_FLAG_HSM_READ    0x0002 /* get data from HSM before map */
#define FIEMAP_FLAG_NUM_EXTENTS 0x0004 /* return only number of 
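
The INCOMPAT rule described above could be checked on the filesystem side roughly as follows. This is a sketch, not the posted patch: the function name `fiemap_check_flags` and its `supported` parameter are assumptions, and the flag values are copied from the list in this document.

```c
#include <errno.h>
#include <stdint.h>

/* Flag values as listed in the RFC text. */
#define FIEMAP_FLAG_SYNC        0x0001
#define FIEMAP_FLAG_HSM_READ    0x0002
#define FIEMAP_FLAG_NUM_EXTENTS 0x0004
#define FIEMAP_FLAG_INCOMPAT    0xff00

/* Hypothetical check: `supported` is the set of flags this filesystem
 * implements. Returns 0 if the request may proceed, -EINVAL otherwise. */
static int fiemap_check_flags(uint32_t requested, uint32_t supported)
{
    uint32_t unknown = requested & ~supported;

    /* Unknown flags inside the INCOMPAT range must not be ignored. */
    if (unknown & FIEMAP_FLAG_INCOMPAT)
        return -EINVAL;

    /* Unknown flags outside that range are silently ignored. */
    return 0;
}
```

With this shape an older filesystem rejects a request only when it carries a flag it does not know from the 0xff00 range; unknown flags below that range pass through unchanged.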

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  16:13 -0600, Andreas Dilger wrote:
 On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
  I'm a little bit confused by fe_offset. Is it a physical offset, or a
  logical offset? The reason I ask is that your description above says FIEMAP
  ioctl will return the logical to physical mapping for the extent that
  contains the specified logical byte address. Which seems to imply physical,
  but your math to get to the next logical start in a very fragmented file,
  implies that fe_offset is a logical offset:
  
 fm_start = fm_extents[fm_extent_count - 1].fe_offset +
   fm_extents[fm_extent_count - 1].fe_length + 1; 
 
 Note the distinction between fe_offset (which is a physical offset for
 a single extent) and fm_offset (which is a logical offset for that file).

Actually, that is completely bunk.  What it should say is something like:
filefrag can easily call the FIEMAP ioctls repeatedly using the returned
fm_start and fm_length as the start offset for the next ioctl:

fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1;
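
A small worked sketch of that continuation step, under one explicit assumption: fm_start is a byte offset and the returned fm_length is a byte count, so the mapped range is [fm_start, fm_start + fm_length). Under that assumption the next unmapped byte is fm_start + fm_length, and the extra "+ 1" in the formula as written would skip one byte; if fm_length were instead defined as the distance to the last mapped byte, the "+ 1" would be needed. The helper name is hypothetical.

```c
#include <stdint.h>

/* First logical byte after the range [fm_start, fm_start + fm_length),
 * ASSUMING fm_length is a byte count (half-open range). */
static uint64_t fiemap_next_start(uint64_t fm_start, uint64_t fm_length)
{
    return fm_start + fm_length;
}
```

For example, a first call that maps bytes 0 to 4095 returns fm_start = 0 and fm_length = 4096, so the following call would start at byte 4096.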

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Mark Fasheh
On Mon, Oct 29, 2007 at 04:29:07PM -0600, Andreas Dilger wrote:
 On Oct 29, 2007  16:13 -0600, Andreas Dilger wrote:
  On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
   I'm a little bit confused by fe_offset. Is it a physical offset, or a
   logical offset? The reason I ask is that your description above says FIEMAP
   ioctl will return the logical to physical mapping for the extent that
   contains the specified logical byte address. Which seems to imply physical,
   but your math to get to the next logical start in a very fragmented file,
   implies that fe_offset is a logical offset:
   
  fm_start = fm_extents[fm_extent_count - 1].fe_offset +
fm_extents[fm_extent_count - 1].fe_length + 1; 
  
  Note the distinction between fe_offset (which is a physical offset for
  a single extent) and fm_offset (which is a logical offset for that file).
 
 Actually, that is completely bunk.  What it should say is something like:
 filefrag can easily call the FIEMAP ioctls repeatedly using the returned
 fm_start and fm_length as the start offset for the next ioctl:
 
 fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1;

Yeah - that's where I was going with my question. This is much more clear
now, thanks.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread David Chinner
On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
 By request on #linuxfs, here is the FIEMAP spec that we used to implement
 the FIEMAP support for ext4.  There was an ext4 patch posted on August 29
 to linux-ext4 entitled [PATCH] FIEMAP ioctl.

Link:

http://marc.info/?l=linux-ext4&m=118838241209683&w=2

That's a very ext4 specific ioctl interface. Can we get this made
generic like the FIBMAP interface so we don't have to replicate all
the copyin/copyout handling and interface definitions everywhere?
i.e. a ->extent_map aops callout to the filesystem in generic code
just like ->bmap?

 I've asked Kalpak to post
 an updated version of that patch along with the changes to the filefrag
 tool to use FIEMAP.

Where can I find the test program that validates the implementation?
Also, following the fallocate model, can we get the interface definition
turned into a man page before anything is submitted upstream?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
   Thanks for posting this. I believe that an interface such as FIEMAP
 would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

I tried to make it as Lustre-agnostic as possible...

 On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
  The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
  ioctl block device ioctl used for mapping an individual logical block
  address in a file to a physical block address in the block device. The
  FIEMAP ioctl will return the logical to physical mapping for the extent
  that contains the specified logical byte address.
  
  struct fiemap_extent {
  __u64 fe_offset;/* offset in bytes for the start of the extent */
 
 I'm a little bit confused by fe_offset. Is it a physical offset, or a
 logical offset? The reason I ask is that your description above says FIEMAP
 ioctl will return the logical to physical mapping for the extent that
 contains the specified logical byte address. Which seems to imply physical,
 but your math to get to the next logical start in a very fragmented file,
 implies that fe_offset is a logical offset:
 
fm_start = fm_extents[fm_extent_count - 1].fe_offset +
  fm_extents[fm_extent_count - 1].fe_length + 1; 

Note the distinction between fe_offset (which is a physical offset for
a single extent) and fm_offset (which is a logical offset for that file).

  We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
  will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.
 
 I think you meant 'fm_length' instead of 'fm_end' there.

You're right, thanks.

  #define FIEMAP_EXTENT_LAST  0x0020 /* last extent in the file */
  #define FIEMAP_EXTENT_EOF   0x0100 /* fm_start + fm_len beyond EOF*/
 
 Is EOF here considering beyond i_size or beyond allocation?

_EOF == beyond i_size.
_LAST == last extent in the file.

In most cases FIEMAP_EXTENT_EOF will be set at the same time as
FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the 
EOF flag may be set on one or more earlier extents.
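
A minimal sketch of how a tool might use that distinction, assuming the semantics just described (EOF means beyond i_size, LAST means final extent). The function name and the idea of passing a bare array of fe_flags values are illustrative assumptions; the flag values are the ones from the spec.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIEMAP_EXTENT_LAST 0x0020  /* last extent in the file */
#define FIEMAP_EXTENT_EOF  0x0100  /* fm_start + fm_len beyond EOF */

/* Returns true if some extent carries EOF without also being LAST.
 * Under the semantics above this hints at allocation past i_size
 * (e.g. preallocation); it is a heuristic, not a guarantee. */
static bool has_eof_before_last(const uint32_t *fe_flags, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        bool eof  = (fe_flags[i] & FIEMAP_EXTENT_EOF) != 0;
        bool last = (fe_flags[i] & FIEMAP_EXTENT_LAST) != 0;

        if (eof && !last)
            return true;
    }
    return false;
}
```

The check cannot see preallocation that lives inside the final extent itself, since that extent would carry both flags.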

  FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
  encrypted, compressed, etc.)
 
 Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
 Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
 blocks.

Hmm, but part of the issue would be how to request the extra data, and
what offset it would be given?  One could, for example, use negative
offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
or similar, I hadn't given that much thought.  The other issue is that
I'd like to get the basics of the API in place before it gets too complex.
We can always add functionality with more FIEMAP_FLAG_* (whether in the
INCOMPAT range or not, depending on what is being done).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  17:11 -0700, Mark Fasheh wrote:
 On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote:
   Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
   blocks.
  
  Hmm, but part of the issue would be how to request the extra data, and
  what offset it would be given?  One could, for example, use negative
  offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
  or similar, I hadn't given that much thought.
 
 Well, fe_offset and fe_length are already expressed in bytes, so we could
 just put the byte offset to where the inline data starts in there. fe_length
 is just used as the length allocated for inline-data.
 
 If fe_offset is required to be block aligned, then we could add a field to
 express an offset within the block where data would be found - say
 'fe_data_start_offset'. In the non-inline case, we could guarantee that
 fe_data_start_offset is zero. That way software which doesn't want to care
 whether something is inline-data (for example, a backup program) or not
 could just blindly add it to fe_offset before looking at the data.

Oh, I was confused as to what you are asking.  Mapping in-inode data is
just fine using the existing interface.  The byte offset of the data is
given, and the FIEMAP_EXTENT_NO_DIRECT flag is set to indicate that it
isn't necessarily safe to do IO directly to that byte offset in the file
(e.g. tail packed, compressed data, etc).

I was thinking you were asking how to map metadata (e.g. indirect blocks).
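
As a sketch of how a consumer such as a backup tool might act on FIEMAP_EXTENT_NO_DIRECT, the function below picks a read path per extent. The enum, the function name, and the policy itself are assumptions made for illustration; only the two flag values come from the spec, and the other extent flags are deliberately ignored here.

```c
#include <stdint.h>

#define FIEMAP_EXTENT_HOLE      0x0001  /* has no data or space allocation */
#define FIEMAP_EXTENT_NO_DIRECT 0x0010  /* cannot access data directly */

/* Hypothetical policy for a tool that can read either through the file
 * or straight from the block device. */
enum read_path { READ_SKIP, READ_VIA_FILE, READ_VIA_DEVICE };

static enum read_path choose_read_path(uint32_t fe_flags)
{
    if (fe_flags & FIEMAP_EXTENT_HOLE)
        return READ_SKIP;        /* nothing stored for this range */
    if (fe_flags & FIEMAP_EXTENT_NO_DIRECT)
        return READ_VIA_FILE;    /* in-inode, tail packed, compressed, ... */
    return READ_VIA_DEVICE;      /* byte offset can be used directly */
}
```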

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Mark Fasheh
On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote:
 On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
  Thanks for posting this. I believe that an interface such as FIEMAP
  would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)
 
 I tried to make it as Lustre-agnostic as possible...

IMHO, your description succeeded at that. I'm hoping that the final patch
can have mostly generic code, like FIBMAP does today.


   #define FIEMAP_EXTENT_LAST  0x0020 /* last extent in the file */
   #define FIEMAP_EXTENT_EOF   0x0100 /* fm_start + fm_len beyond EOF*/
  
  Is EOF here considering beyond i_size or beyond allocation?
 
 _EOF == beyond i_size.
 _LAST == last extent in the file.
 
 In most cases FIEMAP_EXTENT_EOF will be set at the same time as
 FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the 
 EOF flag may be set on one or more earlier extents.

Oh, ok great - I was primarily looking for a way to say there's allocation
past i_size and it looks like we have it.


   FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
   encrypted, compressed, etc.)
  
  Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
  Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
  blocks.
 
 Hmm, but part of the issue would be how to request the extra data, and
 what offset it would be given?  One could, for example, use negative
 offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
 or similar, I hadn't given that much thought.

Well, fe_offset and fe_length are already expressed in bytes, so we could
just put the byte offset to where the inline data starts in there. fe_length
is just used as the length allocated for inline-data.

If fe_offset is required to be block aligned, then we could add a field to
express an offset within the block where data would be found - say
'fe_data_start_offset'. In the non-inline case, we could guarantee that
fe_data_start_offset is zero. That way software which doesn't want to care
whether something is inline-data (for example, a backup program) or not
could just blindly add it to fe_offset before looking at the data.

Regardless, I think we also want to explicitly flag this:

#define FIEMAP_EXTENT_DATA_IN_INODE 0x0400 /* extent data is stored in inode block */
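
To make the block-aligned variant concrete, here is a rough sketch of an extent record carrying the extra field floated above. The struct name `fiemap_extent_inline`, the 32-bit width of `fe_data_start_offset`, and the helper are all assumptions for illustration; none of this is part of the posted spec.

```c
#include <stdint.h>

#define FIEMAP_EXTENT_DATA_IN_INODE 0x0400  /* flag value suggested above */

/* Hypothetical extent record with the additional in-block offset. */
struct fiemap_extent_inline {
    uint64_t fe_offset;             /* block-aligned byte offset */
    uint64_t fe_length;             /* length in bytes */
    uint32_t fe_flags;              /* FIEMAP_EXTENT_* flags */
    uint32_t fe_lun;                /* logical device number */
    uint32_t fe_data_start_offset;  /* 0 for ordinary extents */
};

/* Byte position where the data actually begins. Works the same for
 * inline and ordinary extents because the extra offset is zero in the
 * ordinary case. */
static uint64_t extent_data_start(const struct fiemap_extent_inline *fe)
{
    return fe->fe_offset + fe->fe_data_start_offset;
}
```

The point of the design is that a caller never has to branch on the flag just to locate the data; the flag stays available for callers that do care.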


I'm going to pretend that I completely understand reiserfs tail-packing and
say that my approaches above look like they could work for that case too.
We'd want to add a separate flag for tail packed data though.


 The other issue is that I'd like to get the basics of the API in place
 before it gets too complex. We can always add functionality with more
 FIEMAP_FLAG_* (whether in the INCOMPAT range or not, depending on what is
 being done).

Sure, but I think whatever goes upstream should be able to handle this case
- there's file systems in use _today_ which put data in inode blocks and
pack file tails.

Thanks,
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-03 Thread Andreas Dilger
On May 02, 2007  20:57 +1000, David Chinner wrote:
 On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote:
  HSM_READ is definitely _NOT_ required because all  
  it means is if the file is OFFLINE, bring it ONLINE and then return  
  the extent map.
 
 You've got the definition of HSM_READ wrong. If the flag is *not*
 set, then we bring everything back online and return the full extent
 map.
 
 Specifying the flag indicates that we do *not* want the offline
 extents brought back online.  i.e. it is a HSM or a datamover
 (e.g. backup program) that is querying the extents and we want to
 know *exactly* what the current state of the file is right now.
 
 So, if the HSM_READ flag is set, then the application is
 expecting the filesystem to be part of a HSM. Hence if it's not,
 it should return an error because somebody has done something wrong.

In my original proposal I specifically pointed out that the
FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX
BMV_IF_NO_DMAPI_READ flag.  Data is retrieved from HSM only if the
HSM_READ flag is set.  That's why the flag is called HSM_READ instead
of HSM_NO_READ.

The reason is that it seems bad if the default behaviour for calling
ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is
only disabled by specifying a flag.  It makes a lot more sense to just
leave the data as it is and return the extent mapping by default (i.e.
this is the principle of least surprise).  It would probably be equally
surprising and undesirable if the default behaviour was to force all
data out to HSM.



For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should
even be a part of this interface?  I have no problem with returning a
flag that reports if the data is migrated to HSM and whether it is UNMAPPED.

Having FIEMAP force the retrieval of data from HSM strikes me as something
that should be a part of a separate HSM interface, which also needs to be
able to do things like push specific files or parts thereof out to HSM,
set the aging policy, and return information like where does the HSM
file live and how many copies are there.

Do you know the reasoning behind including this into XFS_IOC_GETBMAPX?
Looking at the bmap.c comments it appears it is simply because the API
isn't able to return something like UNMAPPED|HSM_RESIDENT to indicate
there is data in HSM but it has no blocks allocated in the filesystem.

I don't think it makes the operation significantly more efficient than
say ioctl(DMAPI_FORCE_READ); ioctl(FIEMAP) if an application actually
needs the data to be present instead of just returning mapping info that
includes UNMAPPED.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-03 Thread Anton Altaparmakov


On 3 May 2007, at 08:49, Andreas Dilger wrote:


On May 02, 2007  20:57 +1000, David Chinner wrote:

On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote:

HSM_READ is definitely _NOT_ required because all
it means is if the file is OFFLINE, bring it ONLINE and then return
the extent map.


You've got the definition of HSM_READ wrong. If the flag is *not*
set, then we bring everything back online and return the full extent
map.

Specifying the flag indicates that we do *not* want the offline
extents brought back online.  i.e. it is a HSM or a datamover
(e.g. backup program) that is querying the extents and we want to
know *exactly* what the current state of the file is right now.

So, if the HSM_READ flag is set, then the application is
expecting the filesystem to be part of a HSM. Hence if it's not,
it should return an error because somebody has done something wrong.


In my original proposal I specifically pointed out that the
FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX
BMV_IF_NO_DMAPI_READ flag.  Data is retrieved from HSM only if the
HSM_READ flag is set.  That's why the flag is called HSM_READ instead
of HSM_NO_READ.


Cool.  I did not misunderstand after all then. (-:


The reason is that it seems bad if the default behaviour for calling
ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is
only disabled by specifying a flag.  It makes a lot more sense to just
leave the data as it is and return the extent mapping by default (i.e.
this is the principle of least surprise).  It would probably be equally
surprising and undesirable if the default behaviour was to force all
data out to HSM.

For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should
even be a part of this interface?  I have no problem with returning a
flag that reports if the data is migrated to HSM and whether it is UNMAPPED.

Having FIEMAP force the retrieval of data from HSM strikes me as something
that should be a part of a separate HSM interface, which also needs to be
able to do things like push specific files or parts thereof out to HSM,
set the aging policy, and return information like where does the HSM
file live and how many copies are there.


That would seem sensible to me also.  Just like David argued that  
causing the data to be in a fixed location should be a separate  
interface rather than part of FIEMAP so by analogy the same should  
apply to touching HSM.


Best regards,

Anton
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-02 Thread Anton Altaparmakov

On 2 May 2007, at 01:06, David Chinner wrote:

On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote:

On 1 May 2007, at 05:22, David Chinner wrote:

On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:

 The FIBMAP ioctl is for privileged users
 only, and I wonder if FIEMAP should be the same, or at least disallow
 mapping files that the user can't access especially with FLAG_SYNC and/or
 FLAG_HSM_READ.


I see little reason for restricting FI[BE]MAP to privileged users -
anyone should be able to determine if files they have permission to
access are fragmented.


Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the
machine.  Perhaps for non-privileged users FIEMAP has to be read-
only?  As soon as any of the FLAG_* flags come into play you make it
privileged.  For example fancy any user being able to fill up your
file system by calling FIEMAP with FLAG_HSM_READ on all files
recursively?


By that reasoning, users should not be allowed to recall any files
without root privileges. HSMs don't work that way, though - any user
is allowed to recall any files they have permission to access either
by manual command or by trying to read the file data.

If that runs the filesystem out of space, then the HSM either hasn't
been configured properly or it's failed to manage the space
correctly. Either way, that's not the fault of the user for
recalling their own files.

Hence allowing FIEMAP to be executed by the user does not open up
any DOS conditions that don't already exist in normal HSM-managed
filesystem.


Sorry, it was not a great example.  But the point still stands that  
there are/may be created flags that you do not want to allow everyone  
to use.


I completely agree with Andreas that those can simply return -EPERM  
and the rest can be allowed through.
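
A rough sketch of that -EPERM approach follows. Which flags, if any, end up privileged is an open policy question in this thread, so the mask is passed in as a parameter and the example flag `FIEMAP_FLAG_EXAMPLE_PRIV` is purely hypothetical; the function name is likewise an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define FIEMAP_FLAG_SYNC         0x0001  /* from the RFC */
#define FIEMAP_FLAG_EXAMPLE_PRIV 0x0008  /* HYPOTHETICAL, not in the RFC */

/* Refuse only the flags named in `privileged_mask` when the caller is
 * unprivileged; every other request is allowed through. */
static int fiemap_check_privilege(uint32_t fm_flags, uint32_t privileged_mask,
                                  bool caller_privileged)
{
    if (!caller_privileged && (fm_flags & privileged_mask))
        return -EPERM;
    return 0;
}
```

A plain mapping request with no flags always passes this check, which matches the view that any user should be able to see how their own files are laid out.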


Best regards,

Anton
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-02 Thread David Chinner
On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote:
 On 1 May 2007, at 15:20, David Chinner wrote:
 
 So, either the filesystem will understand the flag or iff the unknown flag
 is in the incompat set, it will return EINVAL or else the unknown flag will
 be safely ignored.
 
 My point was that there is a difference between specification and
 implementation - if the specification says something is compulsory,
 then they must be implemented in the filesystem. This is easy
 enough to ensure by code review - we don't need additional interface
 complexity for this
 
 You are wrong about this because you are missing the point that you  
 have no code to review.  The users that will use those flags are  
 going to be applications that run in user space.  Chances are you  
 will never see their code.  Heck, they might not even be open source  
 applications...

Ummm - the specification defines what is compulsory for *filesystems*
to implement, not what applications can use. We don't need to see
what the applications do - what we care about is that all filesystems
implement the compulsory part of the specification. That's the code
we review, and that's what I was referring to.

 And all applications will run against a multitude of  
 kernels.  So version X of the application will run on kernel 2.4.*,  
 2.6.*, a.b.*, etc...  For future expandability of the interface I  
 think it is important to have both compulsory and non-compulsory flags.

Ah, so that's what you want - a mutable interface. i.e. versioning.

So how do compulsory flags help here? What happens if a voluntary
flag now becomes compulsory? Or vice versa? How is the application
supposed to deal with this dynamically?

I suggested a version number for this right back at the start of
this discussion and got told that we don't want versioned interfaces
because we should make the effort to get it right the first time.
I don't think this can be called getting it right.

 For example there is no reason why FIEMAP_HSM_READ needs to be  
 compulsory.  Most filesystems do not support HSM so can safely ignore  
 it.

They might be able to safely ignore it, but in reality it should
be saying "I don't understand this". If the application *needs* to
use a flag like this, then it should be told that the filesystem is
not capable of doing what it was asked!

OTOH if the application does not need to use the flag, then it
shouldn't be using it and we shouldn't be silently ignoring
incorrect usage of the provided API.

What you are effectively saying about these voluntary flags
is that their behaviour is _undefined_. That is, if you use
these flags what you get on a successful call is undefined;
it may or may not contain what you asked for but you can't
tell if it really did what you want or returned the information
you asked for.

This is a really bad semantic to encode into an API.

 And vice versa, an application might specify some weird and funky yet  
 to be developed feature that it expects the FS to perform and if the  
 FS cannot do it (either because it does not support it or because it  
 failed to perform the operation) the application expects the FS to  
 return an error and not to ignore the flag.  An example could be the  
 asked for FIEMAP_XATTR_FORK flag.  If that is implemented, and the FS  
 ignores it it will return the extent map for the file data instead of  
 the XATTR_FORK!  Not what the application wanted at all.  Ouch!  So  
 this is definitely a compulsory flag if I ever saw one.

Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
we don't need a flag defined in the user visible API to tell us
that we need to return an error here.

 So as you see you must support both voluntary and compulsory flags...

No, you've managed to convince me that they are not necessary and
they are in fact a Bad Idea... ;)

 Also consider what I said above about different kernels.  A new  
 feature is implemented in kernel 2.8.13 say that was not there before  
 and an application is updated to use that feature.  There will be  
 lots of instances where that application will still be run on older  
 kernels where this feature does not exist. 

This is *exactly* where silently ignoring flags really falls down.
On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does
something and it returns different structure contents for the same
state. Now how does the application writer know which is correct or
how to tell the difference?  They have to guess or write detection
code which is exactly what we want to avoid.

I objected to the UNKNOWN flag because it wasn't explicit
in its meaning - I'm doing the same thing here. An interface
needs to be explicitly defined and should not have any undefined
behaviour in it.
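
A minimal sketch of the strict behaviour argued for above, assuming each
filesystem knows the set of request flags it implements. The helper name
fiemap_check_flags_strict() is hypothetical, and the flag values are just
the ones quoted in this thread:

```c
#include <errno.h>
#include <stdint.h>

/* Request flag values as they appear in this thread; illustrative only. */
#define FIEMAP_FLAG_SYNC      0x0001u
#define FIEMAP_FLAG_HSM_READ  0x0002u

/*
 * Hypothetical helper: `supported` is the set of request flags that a
 * given filesystem implements.  Any other bit fails the call, so a
 * successful return always means every requested flag was honoured.
 */
static int fiemap_check_flags_strict(uint32_t requested, uint32_t supported)
{
	if (requested & ~supported)
		return -EINVAL;
	return 0;
}
```

With this rule an application never has to guess whether a flag was
silently dropped: it either gets what it asked for or an error.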

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread Anton Altaparmakov

On 1 May 2007, at 15:20, David Chinner wrote:

On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote:

On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote:

On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
This is actually for future use.  Any flags that are added into this
range must be understood by both sides or it should be considered an
error.  Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need
to be supported.  If it turns out that 8 bits is too small a range for
INCOMPAT flags, then we can make 0x0100 an incompat flag that means
e.g. 0x00ff are also incompat flags also.

Ah, ok. So it's not really a set of compatibility flags, it's more a
compulsory set. Under those terms, i don't really see why this is
necessary - either the filesystem will understand the flags or it will
return EINVAL or ignore them...

I'm assuming that all flags that will be in the original FIEMAP proposal
will be understood by the implementations.  Most filesystems can safely
ignore FLAG_HSM_READ, for example, since they don't support HSM, and for
that matter FLAG_SYNC is probably moot for most filesystems also because
they do block allocation at preprw time.

Exactly my point - so why do we really need to encode a compulsory set of

Because flags have meaning, independent of whether or not the filesystem
understands them. And if the filesystem chooses to ignore critically
important flags (instead of returning EINVAL), bad things may happen.

So, either the filesystem will understand the flag or iff the unknown flag
is in the incompat set, it will return EINVAL or else the unknown flag will
be safely ignored.

My point was that there is a difference between specification and
implementation - if the specification says something is compulsory,
then they must be implemented in the filesystem. This is easy
enough to ensure by code review - we don't need additional interface
complexity for this


You are wrong about this because you are missing the point that you  
have no code to review.  The users that will use those flags are  
going to be applications that run in user space.  Chances are you  
will never see their code.  Heck, they might not even be open source  
applications...  And all applications will run against a multitude of  
kernels.  So version X of the application will run on kernel 2.4.*,  
2.6.*, a.b.*, etc...  For future expandability of the interface I  
think it is important to have both compulsory and non-compulsory flags.


For example there is no reason why FIEMAP_HSM_READ needs to be  
compulsory.  Most filesystems do not support HSM so can safely ignore  
it.  And applications that want to read/write the data locations that  
are obtained with the FIEMAP call will likely always supply  
FIEMAP_HSM_READ because they want to ensure the file is brought in if  
it is off line so they definitely want file systems that do not  
support this flag to ignore it.


And vice versa, an application might specify some weird and funky yet  
to be developed feature that it expects the FS to perform and if the  
FS cannot do it (either because it does not support it or because it  
failed to perform the operation) the application expects the FS to  
return an error and not to ignore the flag.  An example could be the  
asked for FIEMAP_XATTR_FORK flag.  If that is implemented, and the FS  
ignores it it will return the extent map for the file data instead of  
the XATTR_FORK!  Not what the application wanted at all.  Ouch!  So  
this is definitely a compulsory flag if I ever saw one.


So as you see you must support both voluntary and compulsory flags...

Also consider what I said above about different kernels.  A new  
feature is implemented in kernel 2.8.13 say that was not there before  
and an application is updated to use that feature.  There will be  
lots of instances where that application will still be run on older  
kernels where this feature does not exist.  Depending on the feature  
it may be quite sensible to simply ignore in the kernel that the  
application set an unknown flag whilst for a different feature it may  
be the opposite.
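
To make the compulsory/voluntary split concrete, here is a rough sketch of
how a kernel-side check could treat the two classes. It assumes the
FIEMAP_FLAG_INCOMPAT mask value quoted in this thread; the helper name and
the choice of -EINVAL (EOPNOTSUPP has also been suggested) are assumptions
for illustration only:

```c
#include <errno.h>
#include <stdint.h>

/* Values as they appear in this thread; illustrative only. */
#define FIEMAP_FLAG_SYNC      0x0001u
#define FIEMAP_FLAG_HSM_READ  0x0002u
#define FIEMAP_FLAG_INCOMPAT  0xff00u  /* must understand these flags */

/*
 * Hypothetical helper.  Unknown flags inside the INCOMPAT mask fail the
 * call; unknown flags outside it are dropped.  On success *effective
 * holds the flags the filesystem will actually act on.
 */
static int fiemap_check_flags_incompat(uint32_t requested, uint32_t supported,
				       uint32_t *effective)
{
	uint32_t unknown = requested & ~supported;

	if (unknown & FIEMAP_FLAG_INCOMPAT)
		return -EINVAL;

	*effective = requested & supported;
	return 0;
}
```

Under this scheme something like the XATTR fork flag would be allocated
inside the INCOMPAT range, while FIEMAP_HSM_READ would stay outside it.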


Best regards,

Anton
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread David Chinner
On Tue, May 01, 2007 at 03:30:40PM -0700, Andreas Dilger wrote:
 On May 01, 2007  14:22 +1000, David Chinner wrote:
  On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
   Hmm, I'd thought offline would migrate to EXTENT_UNKNOWN, but I didn't
  
  I disagree - why would you want to indicate the state is unknown when we
  know very well that it is offline?
 
 If you don't like UNKNOWN, what about UNMAPPED?  I just want a
 catch-all flag that indicates this extent contains data but there is
 nothing sensible to be returned for the extent mapping.

Yes, I like that much more. Good suggestion. ;)

  Effectively, when your extent is offline in the HSM, it is inaccessible, and
  you have to bring it back from tape so it becomes accessible again. i.e.
  some action is necessary on behalf of the user to make it accessible. So I
  think that OFFLINE is a good name for this state because it really is
  inaccessible.
 
 What you are calling OFFLINE I would prefer to call UNMAPPED, since that
 can be used by applications as a catch-all for no mapping.  There can
 be further flags that give refinements to UNMAPPED that some applications
 might care about them (e.g. HSM_RESIDENT), but many users/apps will not
 if they just want the number of fragments in a given file.

Agreed - UNMAPPED does make a lot more sense in this case.

   Can you propose reasonable flag names for these (I can't think of anything
   very good) and a clear explanation of what they mean.  I suspect it will
   only be XFS that uses them initially.  In mke2fs and ext4+mballoc there is
   the concept of stripe unit and stripe width, but as yet they are not
   communicated between the two very well.  I'd be much happier if this info
   could be queried in a standard way from the block layer instead of the
   user having to specify it and the filesystem having to track it.
  
  My preference is definitely for a separate ioctl to grab the
  filesystem geometry so this stuff can be calculated in userspace.
  i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't
  bother trying to define names until we decide which approach we take
  to implement this.
 
 Hmm, previously you wrote "This information could be easily passed up in the
 flags fields if the filesystem has geometry information".  So, I _think_
 what you are saying is that you want 4 flags to convey this start/end
 alignment information, but the exact semantics of what a stripe unit and
 a stripe width is filesystem specific?

Right.

 I definitely do NOT want to get into any issues of querying the block
 device geometry here.  I was just making a passing comment that ext4+mballoc
 can already do RAID-specific allocation alignment, but it depends on the
 admin to specify this information and it would be nice if there was some
 easy way to get this from userspace/kernel interfaces.
 
 Having an API that can request tell me the number of blocks from this
 offset until the next physical disk boundary or similar would be useful
 to any allocator, and the block layer already needs to know this when
 submitting IO.

The block layer knows this once you get inside the volume manager. I
think the issue is that there is no common export interface for this
information.

  In XFS, mkfs.xfs does the work of getting this information
  to see in the filesystem superblock. Here's the code for getting
  sunit/swidth from the underlying block device:
  
  http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/
  
  Not much in common there ;)
 
 It looks like this might be just what e2fsprogs needs also.

More than likely.

   It does make sense to specify zero for the fm_extent_count array and a
   new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the
   extent data itself, for the non-verbose mode of filefrag, and for
   pre-allocating a buffer large enough to hold the file if that is 
   important.
  
  Rather than rely on implicit behaviour of pass in extent count of
  zero and a don't try to return any extents to return the number of
  extents on the file, why not just explicitly define this as a valid
  input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS
 
 That's what I said, isn't it?  FIEMAP_FLAG_NO_EXTENTS.  I wonder if my
 clever-clever for return no extents and return number of extents
 is wasted :-/.

Too clever for an API, I think. ;)

My point is mainly that if you are going to use an API for a
specific function (e.g. query the number of extents) I think that
the API should have an obvious method for executing that specific
function. Using a command of "get no extents" to provide the query
of "how many extents in this file" is kind of obscure. When you read
the code it doesn't make a lot of sense, as opposed to seeing a
clear statement of intent from the code itself.

i.e. FIEMAP_FLAG_GET_NUMEXTENTS is self-documenting in both the API
and the code that uses it...
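
As an illustration of why an explicit count query is handy for callers:
once the number of extents is known, sizing the buffer is a one-liner. This
is only a sketch. The struct layout is the aggregated proposal from this
thread (with C99 fixed-width types and a flexible array member substituted),
fiemap_buf_size() is a hypothetical helper, and the ioctl call itself is
left out because the interface is still under discussion:

```c
#include <stddef.h>
#include <stdint.h>

/* Layout taken from the aggregated proposal in this thread. */
struct fiemap_extent {
	uint64_t fe_start;	/* starting offset in bytes */
	uint64_t fe_len;	/* length in bytes */
	uint32_t fe_flags;	/* FIEMAP_EXTENT_* flags for this extent */
	uint32_t fe_lun;	/* logical storage device number in array */
};

struct fiemap {
	uint64_t fm_start;	/* logical start offset of mapping */
	uint64_t fm_len;	/* logical length of mapping */
	uint32_t fm_flags;	/* FIEMAP_FLAG_* flags for request */
	uint32_t fm_extent_count; /* number of extents in fm_extents */
	uint64_t fm_unused;
	struct fiemap_extent fm_extents[];
};

/* Hypothetical helper: bytes needed to hold nr_extents returned extents. */
static size_t fiemap_buf_size(uint32_t nr_extents)
{
	return sizeof(struct fiemap) +
	       (size_t)nr_extents * sizeof(struct fiemap_extent);
}
```

A caller would first ask for the count, allocate fiemap_buf_size(count)
bytes, and then issue the real mapping request.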

   - does XFS return an extent for the metadata parts of the file (e.g. 
   btree)?
  
  

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-19 Thread Timothy Shimmin

--On 18 April 2007 6:21:39 PM -0600 Andreas Dilger [EMAIL PROTECTED] wrote:


Below is an aggregation of the comments in this thread:

struct fiemap_extent {
__u64 fe_start; /* starting offset in bytes */
__u64 fe_len;   /* length in bytes */
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_lun;   /* logical storage device number in array */
}

struct fiemap {
__u64 fm_start; /* logical start offset of mapping (in/out) */
__u64 fm_len;   /* logical length of mapping (in/out) */
__u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
__u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
__u64 fm_unused;
struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC        0x0001  /* flush delalloc data to disk*/
#define FIEMAP_FLAG_HSM_READ    0x0002  /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT    0xff00  /* must understand these flags*/

/* flags for the returned extents */
#define FIEMAP_EXTENT_HOLE      0x0001  /* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN 0x0002  /* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN   0x0004  /* in use, location unknown */
#define FIEMAP_EXTENT_ERROR     0x0008  /* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT 0x0010  /* no direct data access */



SUMMARY OF CHANGES
==================
- use fm_* fields directly in request instead of making it a fiemap_extent
  (though they are laid out identically)


I much prefer that - it makes it a lot clearer to me to have fiemap_extent
just for fm_extents (no different meanings now).
(Don't like the word "offset" in the comment without "physical" or some
such, but whatever ;-)
I also prefer the flags as separate fields too :)

--Tim


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Andreas Dilger
On Apr 16, 2007  18:01 +1000, Timothy Shimmin wrote:
 --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger [EMAIL PROTECTED] wrote:
 struct fiemap_extent {
  __u64 fe_start; /* starting offset in bytes */
  __u64 fe_len;   /* length in bytes */
 }
 
 struct fiemap {
  struct fiemap_extent fm_start;  /* offset, length of desired mapping */
  __u32 fm_extent_count;  /* number of extents in array */
  __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
  __u64 unused;
  struct fiemap_extent fm_extents[0];
 }
 
 # define FIEMAP_LEN_MASK 0xff
 # define FIEMAP_LEN_HOLE 0x01
 # define FIEMAP_LEN_UNWRITTEN 0x02
 
 All offsets are in bytes to allow cases where filesystems are not doing
 block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
 returned contains the packed list of allocation extents for the file,
 including entries for holes (which have fe_start == 0, and a flag).
 
 The ->fm_extents[] array includes all of the holes in addition to
 allocated extents because this avoids the need to return both the logical
 and physical address for every extent and does not make processing any
 harder.
 
 Well, that's what stood out for me. I was wondering where the fe_block 
 field had gone - the physical address.
 So is your fe_start; /* starting offset */ actually the disk location
 (not a logical file offset)
 _except_ in the header (fiemap) where it is the desired logical offset.

Correct.  The fm_extent in the request contains the logical start offset
and length in bytes of the requested fiemap region.  In the returned header
it represents the logical start offset of the extent that contained the
requested start offset, and the logical length of all the returned extents.
I haven't decided whether the returned length should be until EOF, or have
the virtual hole at the end of the file.  I think EOF makes more sense.

The fe_start + fe_len in the fm_extents represent the physical location on
the block device for that extent.  fm_extent[i].fe_start (per Anton) is
undefined if FIEMAP_LEN_HOLE is set, and .fe_len is the length of the hole.

 Okay, looking at your example use below that's what it looks like.
 And when you refer to fm_start below, you mean fm_start.fe_start?
 Sorry, I realise this is just an approximation but this part confused me.

Right, I'll write up a new RFC based on feedback here, and correcting the
various errors in the original proposal.

 So you get rid of all the logical file offsets in the extents because we
 report holes explicitly (and we know everything is contiguous if you
 include the holes).

Correct.  It saves space in the common case.
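
To illustrate the point that logical offsets need not be stored: because
holes are reported as entries and the returned list is contiguous, the
logical start of any entry is the logical start of the first returned
extent plus the lengths of the entries before it. A rough sketch, using the
separate fe_flags layout from the aggregated proposal rather than the flag
bits packed into fe_len; fiemap_logical_start() is a hypothetical helper:

```c
#include <stdint.h>

#define FIEMAP_EXTENT_HOLE 0x0001u	/* value as quoted in this thread */

struct fiemap_extent {
	uint64_t fe_start;	/* physical offset; undefined for holes */
	uint64_t fe_len;	/* length in bytes */
	uint32_t fe_flags;	/* FIEMAP_EXTENT_* flags for this extent */
	uint32_t fe_lun;	/* logical storage device number in array */
};

/*
 * Hypothetical helper: logical file offset of entry `idx`, given that the
 * first returned entry begins at logical offset `start` and that holes
 * are present in the array as ordinary entries.
 */
static uint64_t fiemap_logical_start(const struct fiemap_extent *ext,
				     uint32_t idx, uint64_t start)
{
	uint64_t off = start;
	uint32_t i;

	for (i = 0; i < idx; i++)
		off += ext[i].fe_len;
	return off;
}
```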

 Caller works something like:
 
  char buf[4096];
  struct fiemap *fm = (struct fiemap *)buf;
  int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
  
  fm->fm_start.fe_start = 0; /* start of file */
  fm->fm_start.fe_len = -1;   /* end of file */
  fm->fm_extent_count = count; /* max extents in fm_extents[] array */
  fm->fm_flags = 0;   /* maybe no DMAPI, etc like XFS */
 
  fd = open(path, O_RDONLY);
  printf("logical\t\tphysical\t\tbytes\n");
 
  /* The last entry will have less extents than the maximum */
  while (fm->fm_extent_count == count) {
          rc = ioctl(fd, FIEMAP, fm);
          if (rc)
                  break;
 
          /* kernel filled in fm_extents[] array, set fm_extent_count
           * to be actual number of extents returned, leaves
           * fm_start.fe_start alone (unlike XFS_IOC_GETBMAP). */
 
          for (i = 0; i < fm->fm_extent_count; i++) {
                  __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
                  __u64 fm_next = fm->fm_start.fe_start + len;
                  int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
                  int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
 
                  printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
                          fm->fm_start.fe_start, fm_next - 1,
                          hole ? 0 : fm->fm_extents[i].fe_start,
                          hole ? 0 : fm->fm_extents[i].fe_start +
                                     fm->fm_extents[i].fe_len - 1,
                          len, hole ? "(hole) " : "",
                          unwr ? "(unwritten) " : "");
 
                  /* get ready for printing next extent, or next ioctl */
                  fm->fm_start.fe_start = fm_next;
          }
  }
 

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Andreas Dilger
On Apr 16, 2007  21:22 +1000, David Chinner wrote:
 On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
  struct fiemap_extent {
  __u64 fe_start; /* starting offset in bytes */
  __u64 fe_len;   /* length in bytes */
  }
  
  struct fiemap {
  struct fiemap_extent fm_start;  /* offset, length of desired mapping */
  __u32 fm_extent_count;  /* number of extents in array */
  __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
  __u64 unused;
  struct fiemap_extent fm_extents[0];
  }
  
  #define FIEMAP_LEN_MASK 0xff
  #define FIEMAP_LEN_HOLE 0x01
  #define FIEMAP_LEN_UNWRITTEN0x02
 
 I'm not sure I like stealing bits from the length to use a flags -
 I'd prefer an explicit field per fiemap_extent for this.

Christoph expressed the same concern.  I'm not dead set against having an
extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may
mean the need for 50% more ioctls if the file is large.
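
As a rough check on that 50% figure, here is the arithmetic for the
4096-byte buffer used in the example caller. The 32-byte header and the
16-byte versus 24-byte extent sizes follow from the struct layouts in this
thread; extents_per_buffer() is just an illustrative helper:

```c
#include <stddef.h>

/*
 * How many extent records fit in one ioctl buffer after the fixed header.
 * Returns 0 for nonsensical inputs instead of dividing by zero.
 */
static size_t extents_per_buffer(size_t buf_size, size_t header_size,
				 size_t extent_size)
{
	if (buf_size < header_size || extent_size == 0)
		return 0;
	return (buf_size - header_size) / extent_size;
}
```

With a 32-byte header, a 4096-byte buffer holds 254 extents of 16 bytes
but only 169 extents of 24 bytes, so mapping the same large file takes
about 1.5 times as many calls.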


Below is an aggregation of the comments in this thread:

struct fiemap_extent {
__u64 fe_start; /* starting offset in bytes */
__u64 fe_len;   /* length in bytes */
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_lun;   /* logical storage device number in array */
}

struct fiemap {
__u64 fm_start; /* logical start offset of mapping (in/out) */
__u64 fm_len;   /* logical length of mapping (in/out) */
__u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
__u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
__u64 fm_unused;
struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC0x0001  /* flush delalloc data to disk*/
#define FIEMAP_FLAG_HSM_READ0x0002  /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT0xff00  /* must understand these flags*/

/* flags for the returned extents */
#define FIEMAP_EXTENT_HOLE  0x0001  /* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN 0x0002  /* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN   0x0004  /* in use, location unknown */
#define FIEMAP_EXTENT_ERROR 0x0008  /* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT 0x0010  /* no direct data access */



SUMMARY OF CHANGES
==================
- use fm_* fields directly in request instead of making it a fiemap_extent
  (though they are laid out identically)

- separate flags word for fm_flags:
  - FIEMAP_FLAG_SYNC = range should be synced to disk before returning
mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise
  - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified
(this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag)
  - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future
if there is agreement on whether that is desirable to have or if it is
better to call ioctl(FIEMAP) on an XATTR fd.
  - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel
must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we
don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it

- __u64 fm_unused does not take up an extra space on all power-of-two buffer
  sizes (would otherwise be at end of buffer), and may be handy in the future.

- add separate fe_flags word with flags from various suggestions:
  - FIEMAP_EXTENT_HOLE = extent has no space allocation
  - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data
  - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
(e.g. HSM, delalloc awaiting sync, etc)
  - FIEMAP_EXTENT_ERROR = error mapping extent.  Should fe_lun == errno?
  - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data
encrypted, compressed, etc), may want separate flags for these?

- add new fe_lun word per extent for filesystems that manage multiple devices
  (e.g. OCFS, GFS, ZFS, Lustre).  This would otherwise have been unused.
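
To show how a caller might consume the per-extent flags above, here is a
small sketch that decides whether fe_start can be treated as a usable
physical location. The flag values are the ones listed in this proposal;
the helper name and the exact policy (HOLE, UNKNOWN and ERROR mean "no
location") are my reading of the descriptions, not settled semantics:

```c
#include <stdint.h>

/* Extent flag values as listed in this proposal. */
#define FIEMAP_EXTENT_HOLE      0x0001u  /* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN 0x0002u  /* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN   0x0004u  /* in use, location unknown */
#define FIEMAP_EXTENT_ERROR     0x0008u  /* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT 0x0010u  /* no direct data access */

/*
 * Hypothetical helper: returns 1 if fe_start can be taken as a physical
 * location on the device, 0 if the flags say there is none to report.
 */
static int fiemap_extent_has_location(uint32_t fe_flags)
{
	const uint32_t no_location = FIEMAP_EXTENT_HOLE |
				     FIEMAP_EXTENT_UNKNOWN |
				     FIEMAP_EXTENT_ERROR;

	return (fe_flags & no_location) == 0;
}
```

A tool like filefrag that only counts fragments could ignore this, while
anything that wants to read the blocks directly would also need to check
FIEMAP_EXTENT_NO_DIRECT.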


 Given that xfs_bmap uses extra information from the filesystem
 (geometry) to display extra (and frequently used) information
 about the alignment of extents. ie:
 
 chook 681% xfs_bmap -vv fred
 fred:
  EXT: FILE-OFFSET  BLOCK-RANGE  AG AG-OFFSET  TOTAL FLAGS
0: [0..151]:288444888..288445039  8 (1696536..1696687)   152 00010
  FLAG Values:
     010000 Unwritten preallocated extent
     001000 Doesn't begin on stripe unit
     000100 Doesn't end   on stripe unit
     000010 Doesn't begin on stripe width
     000001 Doesn't end   on stripe width

Can you clarify the terminology here?  What is a stripe unit and what is
a stripe width?  Are there N * stripe_unit = stripe_width in e.g. a
RAID 

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-16 Thread Timothy Shimmin

Hi Andreas,

--On 12 April 2007 5:05:50 AM -0600 Andreas Dilger [EMAIL PROTECTED] wrote:


I'm interested in getting input for implementing an ioctl to efficiently
map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion
times.

...


I had come up with a plan independently and was also steered toward
XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
plan, though I think the XFS structs used there are a bit bloated.


They certainly seem to be (combining entries and header).



struct fibmap_extent {
__u64 fe_start; /* starting offset in bytes */
__u64 fe_len;   /* length in bytes */
}

struct fibmap {
struct fibmap_extent fm_start;  /* offset, length of desired mapping */
__u32 fm_extent_count;  /* number of extents in array */
__u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
__u64 unused;
struct fibmap_extent fm_extents[0];
}

# define FIEMAP_LEN_MASK        0xff
# define FIEMAP_LEN_HOLE        0x01
# define FIEMAP_LEN_UNWRITTEN   0x02

All offsets are in bytes to allow cases where filesystems are not doing
block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
returned contains the packed list of allocation extents for the file,
including entries for holes (which have fe_start == 0, and a flag).

The ->fm_extents[] array includes all of the holes in addition to
allocated extents because this avoids the need to return both the logical
and physical address for every extent and does not make processing any
harder.


Well, that's what stood out for me. I was wondering where the fe_block field
had gone - the physical address.
So is your fe_start; /* starting offset */ actually the disk location
(not a logical file offset)
_except_ in the header (fibmap) where it is the desired logical offset.
Okay, looking at your example use below that's what it looks like.
And when you refer to fm_start below, you mean fm_start.fe_start?
Sorry, I realise this is just an approximation but this part confused me.
So you get rid of all the logical file offsets in the extents because we
report holes explicitly (and we know everything is contiguous if you
include the holes).

--Tim



Caller works something like:

char buf[4096];
struct fibmap *fm = (struct fibmap *)buf;
int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);

fm->fm_extent.fe_start = 0; /* start of file */
fm->fm_extent.fe_len = -1;   /* end of file */
fm->fm_extent_count = count; /* max extents in fm_extents[] array */
fm->fm_flags = 0;            /* maybe no DMAPI, etc like XFS */

fd = open(path, O_RDONLY);
printf("logical\t\tphysical\t\tbytes\n");

/* The last entry will have less extents than the maximum */
while (fm->fm_extent_count == count) {
        rc = ioctl(fd, FIEMAP, fm);
        if (rc)
                break;

        /* kernel filled in fm_extents[] array, set fm_extent_count
         * to be actual number of extents returned, leaves fm_start
         * alone (unlike XFS_IOC_GETBMAP). */

        for (i = 0; i < fm->fm_extent_count; i++) {
                __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
                __u64 fm_next = fm->fm_start + len;
                int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
                int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;

                printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
                        fm->fm_start, fm_next - 1,
                        hole ? 0 : fm->fm_extents[i].fe_start,
                        hole ? 0 : fm->fm_extents[i].fe_start +
                                   fm->fm_extents[i].fe_len - 1,
                        len, hole ? "(hole) " : "",
                        unwr ? "(unwritten) " : "");

                /* get ready for printing next extent, or next ioctl */
                fm->fm_start = fm_next;
        }
}





Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-13 Thread Christoph Hellwig
On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
 struct fibmap_extent {
   __u64 fe_start; /* starting offset in bytes */
   __u64 fe_len;   /* length in bytes */
 }
 
 struct fibmap {
   struct fibmap_extent fm_start;  /* offset, length of desired mapping */
   __u32 fm_extent_count;  /* number of extents in array */
   __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
   __u64 unused;
   struct fibmap_extent fm_extents[0];
 }
 
 #define FIEMAP_LEN_MASK   0xff
 #define FIEMAP_LEN_HOLE   0x01
 #define FIEMAP_LEN_UNWRITTEN  0x02
 
 All offsets are in bytes to allow cases where filesystems are not doing
 block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
 returned contains the packed list of allocation extents for the file,
 including entries for holes (which have fe_start == 0, and a flag).

 One feature that XFS_IOC_GETBMAPX has that may be desirable is the
 ability to return unwritten extent information.  In order to do this XFS
 required expanding the per-extent struct from 32 to 48 bytes per extent,
 but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
 and keep 8 bytes or so for input/output flags per extent (would need to
 be masked before use).

I'd be much happier to have the separate per-extent flags value.
For one thing this allows much nicer representations of unwritten
extents or holes without taking away bits from the len value.  It also
allows us to make interesting use of this in the future, e.g. telling
about an offline extent for use in HSM applications.  Also for
this kernel-user interface the wasted space shouldn't matter too
much - if you want to pass the above condensed structure over the
wire in lustre that shouldn't be a problem, you'd have to convert
to an endian-neutral on-the-wire format anyway.  Not doing the
masking also makes the interface quite a bit simpler to use.

One additional feature from the XFS getbmapx interface we should
provide is the ability to query layout of xattrs.  While other
filesystems might not have the exact xattr fork XFS has, it fits
nicely into the interface.  Especially when we have Anton's suggested
flag for inline data.



Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-13 Thread Anton Altaparmakov

On 13 Apr 2007, at 11:15, Christoph Hellwig wrote:

On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:

struct fibmap_extent {
__u64 fe_start; /* starting offset in bytes */
__u64 fe_len;   /* length in bytes */
}

struct fibmap {
struct fibmap_extent fm_start;  /* offset, length of desired mapping */
__u32 fm_extent_count;  /* number of extents in array */
__u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
__u64 unused;
struct fibmap_extent fm_extents[0];
}

#define FIEMAP_LEN_MASK         0xff
#define FIEMAP_LEN_HOLE         0x01
#define FIEMAP_LEN_UNWRITTEN    0x02

All offsets are in bytes to allow cases where filesystems are not doing
block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
returned contains the packed list of allocation extents for the file,
including entries for holes (which have fe_start == 0, and a flag).

One feature that XFS_IOC_GETBMAPX has that may be desirable is the
ability to return unwritten extent information.  In order to do this XFS
required expanding the per-extent struct from 32 to 48 bytes per extent,
but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
and keep 8 bytes or so for input/output flags per extent (would need to
be masked before use).


I'd be much happier to have the separate per-extent flags value.
For one thing this allows much nicer representations of unwritten
extents or holes without taking away bits from the len value.  It also
allows to make interesting use of this in the future, e.g. telling
about an offline extent for use in HSM applications.  Also for
this kernel-user interface the wasted space shouldn't matter too
much - if you want to pass the above condensed structure over the
wire in lustre that shouldn't be a problem, you'd have to convert
to an endian-neutral on the wire format anyway.  Not doing the
masking also makes the interface quite a bit simpler to use.

One additional feature from the XFS getbmapx interface we should
provide is the ability to query layout of xattrs.  While other
filesystems might not have the exact xattr fork XFS has it fits
nicely into the interface.  Especially when we have Anton's suggested
flag for inline data.


Would it not be better to allow people to get a file descriptor on  
the xattr fork and then just run the normal FIEMAP ioctl on that file  
descriptor?


I.e. openat(base file descriptor, O_STREAM, streamname) or O_XATTR
or whatever...  An alternative API would be to provide a
getxattrfd()/fgetxattrfd() call or similar that would instead of returning
the value of an xattr return an fd to it.  Then you do not need to modify
openat() at all...  Interface doesn't bother me, just some ideas...


And for XFS you would define a magic streamname or xattrname (or  
whatever you want to call it) of say  
com.sgi.filesystem.xfs.xattrstream (or .xattrfork) or something and  
then XFS would intercept that and know what to do with it...


Such an interface could then be used by NTFS named streams and other  
file systems providing such things...


(Yes I know I will now totally get flamed about named streams not  
being wanted in Linux and crap like that but that is exactly what you  
are asking for except you want to special case a particular stream  
using a flag instead of calling it for what it really is and once you  
start doing that you might as well allow full named streams...)


You can just see named streams as an alternative, non-atomic API to  
xattrs if you like, i.e. you can either use the atomic xattr API  
provided in Linux already or you can get a file descriptor to an  
xattr and then use the normal system calls to access it
non-atomically thus you can use the FIEMAP ioctl also.  (-:


FWIW this two-API approach to xattrs/named streams is the direction  
OSX is heading towards also so it is not without precedent and  
Windows has had both APIs for many years.  And Solaris has the
openat(O_XATTR) interface so that is not without precedent either.


Best regards,

Anton

PS. to all flamers: I am going to delete any non-technical flames  
without replying so please do us all a favour and don't bother...   
Thanks.


--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-13 Thread Nicholas Miell
On Fri, 2007-04-13 at 12:38 +0100, Anton Altaparmakov wrote:
  One additional feature from the XFS getbmapx interface we should
  provide is the ability to query layout of xattrs.  While other
  filesystems might not have the exact xattr fork XFS has it fits
  nicely into the interface.  Especially when we have Anton's suggested
  flag for inline data.
 
 Would it not be better to allow people to get a file descriptor on  
 the xattr fork and then just run the normal FIEMAP ioctl on that file  
 descriptor?
 
 I.e. openat(base file descriptor, O_STREAM, streamname) or O_XATTR  
 or whatever...  An alternative API would be to provide a
 getxattrfd()/fgetxattrfd() call or similar that would instead of returning
 the value of an xattr return an fd to it.  Then you do not need to modify
 openat() at all...  Interface doesn't bother me, just some ideas...
 
 And for XFS you would define a magic streamname or xattrname (or  
 whatever you want to call it) of say  
 com.sgi.filesystem.xfs.xattrstream (or .xattrfork) or something and  
 then XFS would intercept that and know what to do with it...
 
 Such an interface could then be used by NTFS named streams and other  
 file systems providing such things...
 
 (Yes I know I will now totally get flamed about named streams not  
 being wanted in Linux and crap like that but that is exactly what you  
 are asking for except you want to special case a particular stream  
 using a flag instead of calling it for what it really is and once you  
 start doing that you might as well allow full named streams...)
 
 You can just see named streams as an alternative, non-atomic API to  
 xattrs if you like, i.e. you can either use the atomic xattr API  
 provided in Linux already or you can get a file descriptor to an  
 xattr and then use the normal system calls to access it
 non-atomically thus you can use the FIEMAP ioctl also.  (-:
 
 FWIW this two-API approach to xattrs/named streams is the direction  
 OSX is heading towards also so it is not without precedent and  
 Windows has had both APIs for many years.  And Solaris has the
 openat(O_XATTR) interface so that is not without precedent either.

Except that xattrs in Linux aren't streams, and providing a stream-like
interface to them would be a weird abuse of the xattr concept.

In essence, Linux xattrs are named extensions to struct stat, with
getxattr() being in the same category as stat() and setxattr() being in
the same category as chmod()/chown()/utime()/etc.

The system namespace exists to provide a better interface than ioctl()
to weird FS-specific features (DOS attribute bits, HFS+ creator/type,
ext2/3/reiserfs/etc. immutable/append-only/secure-delete/etc. attributes
and so on). The uptake of this feature isn't as high as I'd like, but
that's what it's there for.

The security namespace is there for all the neat LSM modules that need
to attach metadata to files in order to function.

Finally, the user namespace exists to allow users to attach small bits
of information to their own files, since the API was already there and
hey!, metadata is useful.

Now, Solaris came along and totally confused the issue by using the same
name for a completely different feature, but that isn't any real reason
to mess up the existing Linux xattr concept just to graft named streams
support into the kernel.

(Not that I'm opposed to named streams in Linux, you just have to
realize that xattrs aren't named streams, can't live in the same
namespace as named streams, and certainly don't serve the same purpose
as named streams.)

-- 
Nicholas Miell [EMAIL PROTECTED]



[RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-12 Thread Andreas Dilger
I'm interested in getting input for implementing an ioctl to efficiently
map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion
times.  We already have customers with single files in the 10TB range and
we additionally need to get the mapping over the network so it needs to
be efficient in terms of how data is passed, and how easily it can be
extracted from the filesystem.

I had come up with a plan independently and was also steered toward
XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
plan, though I think the XFS structs used there are a bit bloated.

There was also recent discussion about SEEK_HOLE and SEEK_DATA as
implemented by Sun, but even if we could skip the holes we still might
need to do millions of FIBMAPs to see how large files are allocated
on disk.  Conversely, having filesystems implement an efficient FIBMAP
ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE
and SEEK_DATA instead of looping over ->bmap() inside the kernel
as I saw one patch do.


struct fibmap_extent {
	__u64 fe_start; /* starting offset in bytes */
	__u64 fe_len;   /* length in bytes */
}

struct fibmap {
	struct fibmap_extent fm_start;  /* offset, length of desired mapping */
	__u32 fm_extent_count;  /* number of extents in array */
	__u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
	__u64 unused;
	struct fibmap_extent fm_extents[0];
}

#define FIEMAP_LEN_MASK 0xff
#define FIEMAP_LEN_HOLE 0x01
#define FIEMAP_LEN_UNWRITTEN 0x02

All offsets are in bytes to allow cases where filesystems are not doing
block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
returned contains the packed list of allocation extents for the file,
including entries for holes (which have fe_start == 0, and a flag).

The ->fm_extents[] array includes all of the holes in addition to
allocated extents because this avoids the need to return both the logical
and physical address for every extent and does not make processing any
harder.
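
A minimal sketch of why a packed array with holes makes the logical address redundant: the logical offset of any entry is simply the running sum of the lengths before it. The struct and function names below are hypothetical and assume the flags have already been separated from the length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, already-decoded view of one returned entry. */
struct sketch_decoded_extent {
    uint64_t phys_start;  /* physical byte offset; not meaningful for a hole */
    uint64_t len;         /* length in bytes, flag bits already removed */
    int      is_hole;     /* nonzero if the entry describes a hole */
};

/*
 * Return the logical byte offset at which entry idx begins, given that
 * the array is packed (holes included) and its first entry begins at
 * first_logical.
 */
static uint64_t sketch_logical_offset(const struct sketch_decoded_extent *ext,
                                      size_t idx, uint64_t first_logical)
{
    uint64_t logical = first_logical;

    assert(ext != NULL || idx == 0);
    for (size_t i = 0; i < idx; i++)
        logical += ext[i].len;  /* holes advance the offset too */
    return logical;
}
```

If holes were omitted from the array, each entry would need its own logical offset field, which is the extra cost the paragraph above is avoiding.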

One feature that XFS_IOC_GETBMAPX has that may be desirable is the
ability to return unwritten extent information.  In order to do this XFS
required expanding the per-extent struct from 32 to 48 bytes per extent,
but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
and keep 8 bits or so for input/output flags per extent (would need to
be masked before use).
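
A minimal sketch of the packing described above, assuming the top 8 bits of a 64-bit length field carry flags and the low 56 bits carry the byte length. The SKETCH_* names, the shift, and the flag values are assumptions for illustration only; they are not the constants from the proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bits 63..56 are flags, bits 55..0 are the length. */
#define SKETCH_FLAG_SHIFT     56
#define SKETCH_LEN_MASK       ((UINT64_C(1) << SKETCH_FLAG_SHIFT) - 1)
#define SKETCH_FLAG_HOLE      UINT64_C(0x01)
#define SKETCH_FLAG_UNWRITTEN UINT64_C(0x02)

/* Combine a byte length and flag bits into one 64-bit value. */
static uint64_t sketch_pack(uint64_t len, uint64_t flags)
{
    assert(len <= SKETCH_LEN_MASK);  /* length must fit in 56 bits */
    assert(flags <= 0xff);           /* only 8 flag bits are available */
    return (flags << SKETCH_FLAG_SHIFT) | len;
}

/* Recover the byte length by masking off the flag bits. */
static uint64_t sketch_len(uint64_t packed)
{
    return packed & SKETCH_LEN_MASK;
}

/* Recover the flag bits from the top byte. */
static uint64_t sketch_flags(uint64_t packed)
{
    return packed >> SKETCH_FLAG_SHIFT;
}
```

Every consumer has to call the equivalent of sketch_len() before using the length, which is the "masked before use" cost mentioned above.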


Caller works something like:

	char buf[4096];
	struct fibmap *fm = (struct fibmap *)buf;
	int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);

	fm->fm_extent.fe_start = 0; /* start of file */
	fm->fm_extent.fe_len = -1;  /* end of file */
	fm->fm_extent_count = count; /* max extents in fm_extents[] array */
	fm->fm_flags = 0;   /* maybe no DMAPI, etc like XFS */

	fd = open(path, O_RDONLY);
	printf("logical\t\tphysical\t\tbytes\n");

	/* The last entry will have less extents than the maximum */
	while (fm->fm_extent_count == count) {
		rc = ioctl(fd, FIEMAP, fm);
		if (rc)
			break;

		/* kernel filled in fm_extents[] array, set fm_extent_count
		 * to be actual number of extents returned, leaves fm_start
		 * alone (unlike XFS_IOC_GETBMAP). */

		for (i = 0; i < fm->fm_extent_count; i++) {
			__u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
			__u64 fm_next = fm->fm_start + len;
			int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
			int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;

			printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
				fm->fm_start, fm_next - 1,
				hole ? 0 : fm->fm_extents[i].fe_start,
				hole ? 0 : fm->fm_extents[i].fe_start +
					   fm->fm_extents[i].fe_len - 1,
				len, hole ? "(hole) " : "",
				unwr ? "(unwritten) " : "");

			/* get ready for printing next extent, or next ioctl */
			fm->fm_start = fm_next;
		}
	}

I'm not wedded to an ioctl interface, but it seems consistent with FIBMAP.
I'm quite open to suggestions at this point, both in terms of how usable
the fibmap data structures are by the caller, and if we need to add anything
to make them more flexible for the future.

In terms of implementing this in the kernel, there was originally code for
this during the development of the ext3 extent patches and it was done via
a callback in the extent tree iterator so it is very efficient.  I believe
it implements all that is needed 

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-12 Thread Anton Altaparmakov

Hi Andreas,

On 12 Apr 2007, at 12:05, Andreas Dilger wrote:

I'm interested in getting input for implementing an ioctl to efficiently
map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion
times.  We already have customers with single files in the 10TB range and
we additionally need to get the mapping over the network so it needs to
be efficient in terms of how data is passed, and how easily it can be
extracted from the filesystem.

I had come up with a plan independently and was also steered toward
XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
plan, though I think the XFS structs used there are a bit bloated.

There was also recent discussion about SEEK_HOLE and SEEK_DATA as
implemented by Sun, but even if we could skip the holes we still might
need to do millions of FIBMAPs to see how large files are allocated
on disk.  Conversely, having filesystems implement an efficient FIBMAP
ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE
and SEEK_DATA instead of looping over ->bmap() inside the kernel
as I saw one patch do.


struct fibmap_extent {
	__u64 fe_start; /* starting offset in bytes */
	__u64 fe_len;   /* length in bytes */
}

struct fibmap {
	struct fibmap_extent fm_start;  /* offset, length of desired mapping */
	__u32 fm_extent_count;  /* number of extents in array */
	__u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
	__u64 unused;
	struct fibmap_extent fm_extents[0];
}

#define FIEMAP_LEN_MASK 0xff
#define FIEMAP_LEN_HOLE 0x01
#define FIEMAP_LEN_UNWRITTEN 0x02


Sounds good but I would add:

#define FIEMAP_LEN_NO_DIRECT_ACCESS

This would say that the offset on disk can move at any time or that  
the data is compressed or encrypted on disk thus the data is not  
useful for direct disk access.


On NTFS small files can be inside the inode and there direct access  
is not possible because the metadata on disk is protected with fixups  
which need to be removed when the inode is read into memory.  If you  
access the data directly on disk, you would see corrupt data on reads  
and cause corruption on writes...


Similarly both for compressed and encrypted files doing direct access  
to the on-disk data is totally nonsensical as you would see random  
junk on read and cause fatal data corruption on writes.
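
As a sketch of how a tool might honour such a flag before touching the block device: the flag values below are hypothetical, since the thread assigns no value to NO_DIRECT_ACCESS, and the helper name is invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, chosen only for this example. */
#define SKETCH_HOLE             UINT32_C(0x01)
#define SKETCH_UNWRITTEN        UINT32_C(0x02)
#define SKETCH_NO_DIRECT_ACCESS UINT32_C(0x04)

static_assert((SKETCH_HOLE | SKETCH_UNWRITTEN | SKETCH_NO_DIRECT_ACCESS) == 0x07,
              "flag bits must not overlap");

/*
 * Decide whether a tool may read an extent's bytes straight from the
 * block device. A hole has no on-disk data, an unwritten extent has
 * blocks whose contents are not file data, and NO_DIRECT_ACCESS covers
 * inline, compressed, encrypted, or relocatable data.
 */
static bool sketch_direct_read_ok(uint32_t flags)
{
    return (flags & (SKETCH_HOLE | SKETCH_UNWRITTEN |
                     SKETCH_NO_DIRECT_ACCESS)) == 0;
}
```

A boot loader installer, for example, would skip or refuse any extent for which this check returns false rather than recording its disk address.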


Also why are you not using 0xff00, i.e. two more zeroes  
at the end?  Seems unnecessary to drop an extra 8 bits of  
significance from the byte size...  May not matter today but it  
almost certainly will do in the future (just remember what people  
said about the 640k limit in MSDOS when it first came out!)...


Finally please make sure that the file system can return in one way  
or another errors for example when it fails to determine the extents  
because the system ran out of memory, there was an i/o error,  
whatever...  It may even be useful to be able to say here is an  
extent of size X bytes but we do not know where it is on disk because  
there was an error determining this particular extent's on-disk  
location for some reason or other...


All offsets are in bytes to allow cases where filesystems are not doing


Excellent!

block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
returned contains the packed list of allocation extents for the file,
including entries for holes (which have fe_start == 0, and a flag).


Why the fe_start == 0?  Surely just the flag is sufficient...  On  
NTFS it is perfectly valid to have fe_start == 0 and to have that not  
be sparse (normally the $Boot system file is stored in the first 8  
sectors of the volume)...


Best regards,

Anton


The ->fm_extents[] array includes all of the holes in addition to
allocated extents because this avoids the need to return both the logical
and physical address for every extent and does not make processing any
harder.

One feature that XFS_IOC_GETBMAPX has that may be desirable is the
ability to return unwritten extent information.  In order to do this XFS
required expanding the per-extent struct from 32 to 48 bytes per extent,
but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
and keep 8 bits or so for input/output flags per extent (would need to
be masked before use).


Caller works something like:

char buf[4096];
struct fibmap *fm = (struct fibmap *)buf;
int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);

fm->fm_extent.fe_start = 0; /* start of file */
fm->fm_extent.fe_len = -1;   /* end of file */
fm->fm_extent_count = count; /* max extents in fm_extents[] array */
fm->fm_flags = 0;            /* maybe no DMAPI, etc like XFS */

fd = open(path, O_RDONLY);
printf("logical\t\tphysical\t\tbytes\n");

/* The last 

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-12 Thread Andreas Dilger
On Apr 12, 2007  12:22 +0100, Anton Altaparmakov wrote:
 On 12 Apr 2007, at 12:05, Andreas Dilger wrote:
 I'm interested in getting input for implementing an ioctl to  
 efficiently map file extents & holes (FIEMAP) instead of looping
 over FIBMAP a billion times.  We already have customers with single
 files in the 10TB range and we additionally need to get the mapping
 over the network so it needs to be efficient in terms of how data
 is passed, and how easily it can be extracted from the filesystem.
 
 struct fibmap_extent {
  __u64 fe_start; /* starting offset in bytes */
  __u64 fe_len;   /* length in bytes */
 }
 
 struct fibmap {
  struct fibmap_extent fm_start;  /* offset, length of desired mapping */
  __u32 fm_extent_count;  /* number of extents in array */
  __u32 fm_flags; /* flags for input request (similar to
                     XFS_IOC_GETBMAP) */
  __u64 unused;
  struct fibmap_extent fm_extents[0];
 }
 
 #define FIEMAP_LEN_MASK  0xff
 #define FIEMAP_LEN_HOLE  0x01
 #define FIEMAP_LEN_UNWRITTEN 0x02
 
 Sounds good but I would add:
 
 #define FIEMAP_LEN_NO_DIRECT_ACCESS
 
 This would say that the offset on disk can move at any time or that  
 the data is compressed or encrypted on disk thus the data is not  
 useful for direct disk access.

This makes sense.  Even for Reiserfs the same is true with packed tails,
and I believe if FIBMAP is called on a tail it will migrate the tail into
a block because this might be a sign that the file is a kernel that
LILO wants to boot.

I'd rather not have any such feature in FIEMAP, and just return the
on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me.
My main reason for FIEMAP is being able to investigate allocation patterns
of files.

By no means is my flag list exhaustive, just the ones that I thought would
be needed to implement this for ext4 and Lustre.

 Also why are you not using 0xff00, i.e. two more zeroes  
 at the end?  Seems unnecessary to drop an extra 8 bits of  
 significance from the byte size...

It was actually just a typo (this was the first time I'd written the
structs and flags down, it is just at the discussion stage).  I'd meant
for it to be 2^56 bytes for the file size as I wrote later in the email.
That said, I think that 2^48 bytes is probably sufficient for most uses,
so that we get 16 bits for flags.  As it is this email already discusses
5 flags, and that would give little room for expansion in the future.

Remember, this is the mapping for a single file (which can't practically
be beyond 2^64 bytes as yet) so it wouldn't be hard for the filesystem to
return a few separate extents which are actually contiguous (assuming that
there will actually be files in filesystems with > 2^48 bytes of contiguous
space).  Since the API is that it will return the extent that contains the
requested start byte, the kernel will be able to detect this case also,
since it won't be able to specify a length for the extent that contains the
start byte.

At most we'd have to call the ioctl() 65536 times for a completely
contiguous 2^64 byte file if the buffer was only large enough for a
single extent.  In reality, I expect any file to have some discontinuities
and the buffer to be large enough for a thousand or more entries so the
corner case is not very bad.
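
The 65536 figure follows from 2^64 / 2^48 = 2^16. A small sketch of that arithmetic, with a hypothetical helper name, assuming one extent is returned per ioctl() call and each extent can describe at most 2^extent_bits bytes:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of extents needed to cover a fully contiguous file of
 * 2^file_bits bytes when one extent describes at most 2^extent_bits
 * bytes. With a one-entry buffer this is also the number of calls.
 */
static uint64_t sketch_min_extents(unsigned file_bits, unsigned extent_bits)
{
    if (file_bits <= extent_bits)
        return 1;
    assert(file_bits - extent_bits < 64);  /* keep the shift well-defined */
    return UINT64_C(1) << (file_bits - extent_bits);
}
```

For the 2^56 byte limit discussed earlier in the thread the same calculation gives 256 extents for a 2^64 byte file.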

 Finally please make sure that the file system can return in one way  
 or another errors for example when it fails to determine the extents  
 because the system ran out of memory, there was an i/o error,  
 whatever...  It may even be useful to be able to say here is an  
 extent of size X bytes but we do not know where it is on disk because  
 there was an error determining this particular extent's on-disk  
 location for some reason or other...

Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and
FIEMAP_LEN_ERROR.  Consider FIEMAP on a file that was migrated
to tape and currently has no blocks allocated in the filesystem.  We
want to return some indication that there is actual file data and not
just a hole, but at the same time we don't want this to actually return
the file from tape just to generate block mappings for it.

This concept is also present in XFS_IOC_GETBMAPX - BMV_IF_NO_DMAPI_READ,
but this needs to be specified on input to prevent the file being mapped
and I'd rather the opposite (not getting file from tape) be the default,
by principle of least surprise.


 block-aligned/sized allocations (e.g. tail packing).  The  
 fm_extents array
 returned contains the packed list of allocation extents for the file,
 including entries for holes (which have fe_start == 0, and a flag).
 
 Why the fe_start == 0?  Surely just the flag is sufficient...  On  
 NTFS it is perfectly valid to have fe_start == 0 and to have that not  
 be sparse (normally the $Boot system file is stored in the first 8  
 sectors of the volume)...

I thought