Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-29 Thread Alan Cox
On Sun, 28 Oct 2007 13:43:21 -0400
J. Bruce Fields [EMAIL PROTECTED] wrote:

 From: J. Bruce Fields [EMAIL PROTECTED]
 
 We currently attempt to return -EDEADLK to blocking fcntl() file locking
 requests that would create a cycle in the graph of tasks waiting on
 locks.
 
 This is inefficient: in the general case it requires determining
 whether we're adding a cycle to an arbitrary directed acyclic graph.
 And this calculation has to be performed while holding a lock (currently
 the BKL) that prevents that graph from changing.
 
 It has historically been a source of bugs; most recently it was noticed
 that it could loop indefinitely while holding the BKL.
 
 It seems unlikely to be useful to applications:
   - The difficulty of implementation has kept standards from
 requiring it.  (E.g. SUSv3: "Since implementation of full
 deadlock detection is not always feasible, the [EDEADLK] error
 was made optional.")  So portable applications may not be able to
 depend on it.
   - It only detects deadlocks that involve nothing but local posix
 file locks; deadlocks involving network filesystems or other kinds
 of locks or resources are missed.
 
 It therefore seems best to remove deadlock detection.
 
 Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]


NAK. This is an ABI change, and one that was rejected before when this was
last discussed in detail. Moving it out of the BKL makes a ton of sense;
even adding a "don't check" flag makes a lot of sense. Removing the
checking does not.

I'd much rather see


	if (flags & FL_NODLCHECK)
		posix_deadlock_detect();


The failure case for removing this feature is obscure, hard-to-debug
application hangs for the afflicted programs - not nice for users at all.

Alan


Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-29 Thread Alan Cox
 And if posix file locks are to be useful to threaded applications, then
 we have to preserve the same no-false-positives requirement for them as
 well.

It isn't useful to threaded applications. The specification requires
this. Which is another reason for having an additional Linux (for now)
flag to say "don't bother".

Alan


Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-29 Thread Jiri Kosina
On Sun, 28 Oct 2007, Matthew Wilcox wrote:

  A potential for deadlock occurs if a process controlling a locked 
  region is put to sleep by attempting to lock another process' 
  locked region. If the system detects that sleeping until a locked 
  region is unlocked would cause a deadlock, fcntl() shall fail with 
  an [EDEADLK] error.
  This is what POSIX says [1], even after being modified with respect to 
  POSIX Threads Extension, right? So it doesn't deal with threads at 
  all, just processes are taken into account. Probably for a reason :)
 Did you have a concrete suggestion, or are you just quoting the spec?

I was quoting the spec and I thought that the suggestion was implicit -- 
the specification defines what happens only when processes are in place. 
If the application uses POSIX threads in combination with locks, it is 
going to receive undefined behavior.

 The problem is that it's nonsense -- processes don't sleep, threads do. 
 I think the key is "would deadlock", not "might deadlock".  Our current 
 behaviour is clearly in violation of SUSv3.

- either we can add a special fcntl() Linux-specific flag, that will ask 
  the kernel not to perform deadlock detection. Any application that is 
  combining threads and posix locks (and thus IMHO asking for undefined 
  behavior) could use this flag and not receive EDEADLK any more

- or we can add some heuristics here-and-there to track which 
  current->files are shared and which are not, and do not return EDEADLK 
  in the case of shared ->files (could be a little bit tricky; a rough 
  sketch follows below)
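
A minimal sketch of that shared current->files test (the helper name is
hypothetical, and it assumes the 2.6-era files_struct refcount):

	/*
	 * Hypothetical helper: treat a task whose files_struct is
	 * shared as threaded, and skip EDEADLK detection for it.
	 */
	static int fl_owner_files_shared(void)
	{
		return atomic_read(&current->files->count) > 1;
	}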

-- 
Jiri Kosina


Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-29 Thread Jiri Kosina
On Sun, 28 Oct 2007, J. Bruce Fields wrote:

 But, OK, if we can identify unshared current->files at the time we put a 
 task to sleep, then a slight modification of our current algorithm might 
 be sufficient to detect any deadlock that involves purely posix file 
 locks and processes.  And we can tell people that avoiding deadlock is 
 their problem as soon as any task with a shared current->files is 
 involved.  (Ditto, I assume, if nfsd/lockd acquires a lock.)

Don't forget that comparing file_lock->fl_owner (i.e. current->files) is 
not the only way lock ownership can be compared (there could be a 
specific file_lock->fl_lmops->fl_compare_owner(), and all of them would 
have to be taught these new semantics, right?).

-- 
Jiri Kosina


Re: [patch 4/6][RFC] Attempt to plug race with truncate

2007-10-29 Thread Chris Mason
On Fri, 26 Oct 2007 16:37:36 -0700
Mike Waychison [EMAIL PROTECTED] wrote:

 Attempt to deal with races with truncate paths.
 
 I'm not really sure on the locking here, but these seem to be taken
 by the truncate path.  BKL is left as some filesystem may(?) still
 require it.
 
 Signed-off-by: Mike Waychison [EMAIL PROTECTED]
  fs/ioctl.c |    8 ++++++++
  1 file changed, 8 insertions(+)
 
 Index: linux-2.6.23/fs/ioctl.c
 ===================================================================
 --- linux-2.6.23.orig/fs/ioctl.c	2007-10-26 15:27:29.000000000 -0700
 +++ linux-2.6.23/fs/ioctl.c	2007-10-26 16:16:28.000000000 -0700
 @@ -43,13 +43,21 @@ static long do_ioctl(struct file *filp,
  static int do_fibmap(struct address_space *mapping, sector_t block,
  		     sector_t *phys_block)
  {
 +	struct inode *inode = mapping->host;
 +
  	if (!capable(CAP_SYS_RAWIO))
  		return -EPERM;
  	if (!mapping->a_ops->bmap)
  		return -EINVAL;
 
  	lock_kernel();
 +	/* Avoid races with truncate */
 +	mutex_lock(&inode->i_mutex);
 +	/* FIXME: Do we really need i_alloc_sem? */
 +	down_read(&inode->i_alloc_sem);


i_alloc_sem will avoid races with filesystems filling holes inside
writepage (where i_mutex isn't held).  I'd expect everyone to currently
give some consistent result (either the old value or the new but not
garbage), but I wouldn't expect taking the semaphore to hurt anything.

-chris


Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-29 Thread Chris Mason
On Sat, 27 Oct 2007 18:57:06 +0100
Anton Altaparmakov [EMAIL PROTECTED] wrote:

 Hi,
 
 ->bmap is ugly and horrible!  If you have to do this at the very
 least please cause ->bmap64 to be able to return error values in case
 the file system failed to get the information or indeed such
 information does not exist, as is the case for compressed and
 encrypted files for example, and also for small files that are inside
 the on-disk inode (NTFS resident files and reiserfs packed tails are
 examples of this).
 
 And another of my pet peeves with ->bmap is that it uses 0 to mean
 sparse, which causes a conflict on NTFS at least as block zero is
 part of the $Boot system file so it is a real, valid block...  NTFS
 uses -1 to denote sparse blocks internally.

Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if there
was a way to indicate your-data-is-here-but-isn't-alone.  But that's
more of a feature for the FIEMAP stuff.

-chris


Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-29 Thread Mike Waychison

Zach Brown wrote:
And another of my pet peeves with ->bmap is that it uses 0 to mean
sparse, which causes a conflict on NTFS at least as block zero is
part of the $Boot system file so it is a real, valid block...  NTFS
uses -1 to denote sparse blocks internally.

Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if there
was a way to indicate your-data-is-here-but-isn't-alone.  But that's
more of a feature for the FIEMAP stuff.


And maybe we can step back and see what the callers of FIBMAP are doing
with the results they're getting.

One use is to discover the order in which to read file data that will
result in efficient IO.

If we had an interface specifically for this use case then perhaps a
sparse block would be better reported as the position of the inode
relative to other data blocks.  Maybe the inode block number in ext* land.



Can you clarify what you mean above with an example?  I don't really follow.

Thanks,

Mike Waychison




Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-29 Thread Mike Waychison

Chris Mason wrote:

On Sat, 27 Oct 2007 18:57:06 +0100
Anton Altaparmakov [EMAIL PROTECTED] wrote:


Hi,

->bmap is ugly and horrible!  If you have to do this at the very
least please cause ->bmap64 to be able to return error values in case
the file system failed to get the information or indeed such
information does not exist, as is the case for compressed and
encrypted files for example, and also for small files that are inside
the on-disk inode (NTFS resident files and reiserfs packed tails are
examples of this).

And another of my pet peeves with ->bmap is that it uses 0 to mean
sparse, which causes a conflict on NTFS at least as block zero is
part of the $Boot system file so it is a real, valid block...  NTFS
uses -1 to denote sparse blocks internally.


Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if there
was a way to indicate your-data-is-here-but-isn't-alone.  But that's
more of a feature for the FIEMAP stuff.



I hadn't heard of FIEMAP, so I went back and read the thread from 
April/May.  It seems that this is a much better approach than 
introducing a FIBMAP64.


Whatever happened to this proposal?

Mike Waychison


Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-29 Thread Zach Brown

 But, we shouldn't inflict all of this on fibmap/fiemap... we'll get
 lost trying to make the one true interface for all operations.
 
 For grouping operations on files, I think a read_tree syscall with
 hints for what userland will do (read, stat, delete, list
 filenames), and a better cookie than readdir should do it.

Agreed, on both points.

- z


Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-29 Thread Zach Brown

 Can you clarify what you mean above with an example?  I don't really
 follow.

Sure, take 'tar' as an example.  It'll read files in the order that
their names are returned from directory listing.  This can produce bad
IO patterns because the order in which the file names are returned
doesn't match the order of the file's blocks on disk.  (htree, I'm
looking at you!)

People have noticed that tar-like loads can be sped up greatly just by
sorting the files by their inode number as returned by stat(), never
mind the file blocks themselves.  One example of this is Chris Mason's
'acp'.

  http://oss.oracle.com/~mason/acp/

The logical extension of that is to use FIBMAP to find the order of file
blocks on disk and then do IO on the blocks in sorted order.  It'd take
work to write an app that does this reliably, sure.

In this use the application doesn't actually care what the absolute
numbers are.  It cares about their ordering.  File systems would be able
to choose whatever scheme they wanted for the actual values of the
results from a FIBMAP-alike as long as the sorting resulted in the right
IO patterns.
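
A rough userspace sketch of the simpler sort-by-inode trick (illustrative
only):

#include <dirent.h>
#include <stdlib.h>

/* order directory entries by inode number, which often approximates
 * on-disk order far better than raw readdir order does */
static int by_ino(const struct dirent **a, const struct dirent **b)
{
	return ((*a)->d_ino > (*b)->d_ino) - ((*a)->d_ino < (*b)->d_ino);
}

int visit_in_inode_order(const char *dir)
{
	struct dirent **names;
	int i, n = scandir(dir, &names, NULL, by_ino);

	if (n < 0)
		return -1;
	for (i = 0; i < n; i++) {
		/* open and read names[i]->d_name here */
		free(names[i]);
	}
	free(names);
	return 0;
}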

Arguing that this use is significant enough to justify an addition to
the file system API is a stretch.  I'm just sharing the observation.

- z


Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  12:16 -0700, Mike Waychison wrote:
 Chris Mason wrote:
 Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if there
 was a way to indicate your-data-is-here-but-isn't-alone.  But that's
 more of a feature for the FIEMAP stuff.

 I hadn't heard of FIEMAP, so I went back and read the thread from 
 April/May.  It seems that this is a much better approach than introducing a 
 FIBMAP64.

 Whatever happened to this proposal?

We made a patch for ext4 that we need to update and push upstream.  I've
just resent the spec we used in a separate email (attached to old thread)
for reference.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
By request on #linuxfs, here is the FIEMAP spec that we used to implement
the FIEMAP support for ext4.  There was an ext4 patch posted on August 29
to linux-ext4 entitled "[PATCH] FIEMAP ioctl".  I've asked Kalpak to post
an updated version of that patch along with the changes to the filefrag
tool to use FIEMAP.

========== FIEMAP_1.0.txt ==========

File Mapping Interface

18 June 2007

Andreas Dilger, Kalpak Shah

Introduction

This document covers the user interface and internal implementation of
an efficient fragmentation reporting tool. This will include addition
of a FIEMAP ioctl to fetch extents and changes to filefrag to use this
ioctl. The main objective of this tool is to efficiently and easily allow
inspection of the disk layout of one or more files without requiring
user access to the underlying storage device(s).

1 Requirements

The tool should be efficient in its use of resources, even for large
files. The FIBMAP ioctl is not suitable for use on large files,
as this can result in millions or even billions of ioctls to get the
mapping information for a single file. It should be possible to get the
information about an arbitrary-sized extent in a single call, and the
kernel component and user tool should efficiently use this information.

The user interface should be simple, and the output should be easily
understood - by default the filename(s), a count of extents (for each
file), and the optimal number of extents for a file with the given
striping parameters. The user interface will be filefrag [options]
{filename ...} and will allow retrieving the fragmentation information
for one or more files specified on the command-line. The output will be
of the form:

/path/to/file1: extents=2 optimal=1

/path/to/file2: extents=10 optimal=4

..

2 Functional specification

The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
block device ioctl used for mapping an individual logical block
address in a file to a physical block address in the block device. The
FIEMAP ioctl will return the logical to physical mapping for the extent
that contains the specified logical byte address.

struct fiemap_extent {
	__u64 fe_offset;	/* offset in bytes for the start of the extent */
	__u64 fe_length;	/* length in bytes for the extent */
	__u32 fe_flags;		/* returned FIEMAP_EXTENT_* flags for the extent */
	__u32 fe_lun;		/* logical device number for extent (starting at 0) */
};



struct fiemap {
	__u64 fm_start;		/* logical byte offset (in/out) */
	__u64 fm_length;	/* logical length of map (in/out) */
	__u32 fm_flags;		/* FIEMAP_FLAG_* flags (in/out) */
	__u32 fm_extent_count;	/* extents in fm_extents (in/out) */
	__u64 fm_unused;

	struct fiemap_extent fm_extents[0];
};



In the ioctl request, the fiemap struct is initialized with the desired
mapping information.

fiemap.fm_start = {desired start byte offset, 0 if whole file};
fiemap.fm_length = {length of mapping in bytes, ~0ULL if whole file};
fiemap.fm_extent_count = {number of fiemap_extents in fm_extents array};
fiemap.fm_flags = {flags from the FIEMAP_FLAG_* set, if needed};

ioctl(fd, FIEMAP, &fiemap);
{ verify fiemap flags are understood }

for (i = 0; i < fiemap.fm_extent_count; i++) {
	{ process extent fiemap.fm_extents[i] };
}


The logic for filefrag would be similar to the above. The size of the
extent array will be extrapolated from the file size, and multiple ioctls
of increasing extent count may be needed for very large files. filefrag
can easily call the FIEMAP ioctl repeatedly, using the end of the last
extent as the start offset for the next ioctl:

fm_start = fm_extents[fm_extent_count - 1].fe_offset +
fm_extents[fm_extent_count - 1].fe_length + 1;

We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.
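
Putting the pieces above together, a sketch of such a loop against this
draft interface (the ioctl request number, struct layout and flag names
are the ones defined in this document, not a final kernel API):

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

#define NUM_EXTENTS 32

/* sketch: walk all extents of an open file per the draft spec above */
int print_extents(int fd)
{
	struct fiemap *fm;
	unsigned int i;
	int last = 0;

	fm = calloc(1, sizeof(*fm) +
		       NUM_EXTENTS * sizeof(struct fiemap_extent));
	if (!fm)
		return -1;

	fm->fm_start = 0;
	fm->fm_length = ~0ULL;		/* map the whole file */

	while (!last) {
		fm->fm_flags = 0;
		fm->fm_extent_count = NUM_EXTENTS;	/* array size in */
		if (ioctl(fd, FIEMAP, fm) < 0)
			break;
		/* fm_extent_count now holds the number of extents returned */
		for (i = 0; i < fm->fm_extent_count; i++) {
			struct fiemap_extent *fe = &fm->fm_extents[i];

			printf("offset %llu len %llu flags 0x%x\n",
			       (unsigned long long)fe->fe_offset,
			       (unsigned long long)fe->fe_length,
			       fe->fe_flags);
			if (fe->fe_flags & FIEMAP_EXTENT_LAST)
				last = 1;
		}
		/* continue just past the range the kernel mapped */
		fm->fm_start += fm->fm_length + 1;
		fm->fm_length = ~0ULL;
	}
	free(fm);
	return last ? 0 : -1;
}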

The FIEMAP_FLAG_* values are specified below. If FIEMAP_FLAG_NUM_EXTENTS is
given then the fm_extents array is not filled, and only fm_extent_count is
returned with the total number of extents in the file. Any new flags that
introduce and/or require an incompatible behaviour in an application or
in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT
(e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NUM_EXTENTS would fall into that
range if they were not part of the original specification). This is
currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT
is not large enough then it is possible to use the last INCOMPAT flag
0x0100 to indicate that more of the flag range contains incompatible
flags.

#define FIEMAP_FLAG_SYNC        0x0001 /* sync file data before map */
#define FIEMAP_FLAG_HSM_READ    0x0002 /* get data from HSM before map */
#define FIEMAP_FLAG_NUM_EXTENTS 0x0004 /* return only number of extents */

Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-29 Thread Hugh Dickins
On Sun, 28 Oct 2007, Erez Zadok wrote:
 
 I took your advice regarding ~(__GFP_FS|__GFP_IO), AOP_WRITEPAGE_ACTIVATE,
 and such.  I revised my unionfs_writepage and unionfs_sync_page, and tested
 it under memory pressure: I have a couple of live CDs that use tmpfs and can
 deterministically reproduce the conditions resulting in A_W_A.  I also went
 back to using grab_cache_page but with the gfp_mask suggestions you made.
 
 I'm happy to report that it all works great now!

That's very encouraging...

 Below is the entirety of
 the new unionfs_mmap and unionfs_sync_page code.  I'd appreciate if you and
 others can look it over and see if you find any problems.

... but still a few problems, I'm afraid.

The greatest problem is a tmpfs one, that would be for me to solve.
But first...

 static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
 {
   int err = -EIO;
   struct inode *inode;
   struct inode *lower_inode;
   struct page *lower_page;
   char *kaddr, *lower_kaddr;
   struct address_space *mapping; /* lower inode mapping */
   gfp_t old_gfp_mask;
 
   inode = page->mapping->host;
   lower_inode = unionfs_lower_inode(inode);
   mapping = lower_inode->i_mapping;
 
   /*
* find lower page (returns a locked page)
*
* We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under

On reflection, I think I went too far in asking you to mask off __GFP_IO.
Loop has to do so because it's a block device, down towards the IO layer;
but unionfs is a filesystem, so masking off __GFP_FS is enough to prevent
recursion into the FS layer with danger of deadlock, and leaving __GFP_IO
on gives a better chance of success.

* memory pressure conditions.  This is similar to how the loop
* driver behaves (see loop_set_fd in drivers/block/loop.c).
* If we can't find the lower page, we redirty our page and return
* success so that the VM will call us again in the (hopefully
* near) future.
*/
   old_gfp_mask = mapping_gfp_mask(mapping);
   mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));
 
   lower_page = grab_cache_page(mapping, page->index);
   mapping_set_gfp_mask(mapping, old_gfp_mask);

Hmm, several points on that.

When suggesting something like this, I did remark "what locking needed?".
You've got none: which is problematic if two stacked mounts are playing
with the same underlying file concurrently (yes, userspace would have
a data coherency problem in such a case, but the kernel still needs to
worry about its own internal integrity) - you'd be in danger of masking
(__GFP_IO|__GFP_FS) permanently off the underlying file; and furthermore,
losing error flags (AS_EIO, AS_ENOSPC) which share the same unsigned long.
Neither likely but both wrong.

See the comment on mapping_set_gfp_mask() in include/pagemap.h:
 * This is non-atomic.  Only to be used before the mapping is activated.
Strictly speaking, I guess loop was a little bit guilty even when just
loop_set_fd() did it: the underlying mapping might already be active.
It appears to be just as guilty as you in its do_loop_switch() case
(done at BIO completion time), but that's for a LOOP_CHANGE_FD ioctl
which would only be expected to be called once, during installation;
whereas you're using mapping_set_gfp_mask here with great frequency.

Another point on this is: loop masks __GFP_IO|__GFP_FS off the file
for the whole duration while it is looped, whereas you're flipping it
just in this preliminary section of unionfs_writepage.  I think you're
probably okay to be doing it only here within ->writepage: I think
loop covered every operation because it's at the block device level,
perhaps both reads and writes needed to serve reclaim at the higher
FS level; and also easier to do it once for all.

Are you wrong to be doing it only around the grab_cache_page,
leaving the lower level ->writepage further down unprotected?
Certainly doing it around the grab_cache_page is likely to be way
more important than around the ->writepage (but rather depends on
filesystem).  And on reflection, I think that the lower filesystem's
writepage should already be using GFP_NOFS to avoid deadlocks in
any of its allocations when wbc->for_reclaim, so you should be
okay just masking off around the grab_cache_page.

(Actually, in the wbc->for_reclaim case, I think you don't really
need to call the lower level writepage at all.  Just set_page_dirty
on the lower page, unlock it and return.  In due course that memory
pressure which has called unionfs_writepage will come around to the
lower level page and do writepage upon it.  Whether that's a better
strategy or not, I do not know.)

There's an attractively simple answer to the mapping_set_gfp_mask
locking problem, if we're confident that it's only needed around
the grab_cache_page.  Look at the declaration of grab_cache_page
in linux/pagemap.h: it immediately extracts the gfp_mask from the
mapping and passes it to find_or_create_page(), so you could call
find_or_create_page() directly with the gfp_mask you want, leaving
the mapping's own mask untouched.
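
A minimal sketch of that approach (assuming the 2.6.23
find_or_create_page() signature):

	/* grab the lower page with an explicitly masked gfp, without
	 * ever touching the lower mapping's own gfp mask */
	lower_page = find_or_create_page(mapping, page->index,
				mapping_gfp_mask(mapping) & ~__GFP_FS);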

Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-29 Thread Chris Mason
On Mon, 29 Oct 2007 12:18:22 -0700
Mike Waychison [EMAIL PROTECTED] wrote:

 Zach Brown wrote:
  And another of my pet peeves with ->bmap is that it uses 0 to
  mean sparse which causes a conflict on NTFS at least as block
  zero is part of the $Boot system file so it is a real, valid
  block...  NTFS uses -1 to denote sparse blocks internally.
  Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if
  there was a way to indicate your-data-is-here-but-isn't-alone.
  But that's more of a feature for the FIEMAP stuff.
  
  And maybe we can step back and see what the callers of FIBMAP are
  doing with the results they're getting.
  
  One use is to discover the order in which to read file data that
  will result in efficient IO.
  
  If we had an interface specifically for this use case then perhaps a
  sparse block would be better reported as the position of the inode
  relative to other data blocks.  Maybe the inode block number in
  ext* land.
  
 
 Can you clarify what you mean above with an example?  I don't really
 follow.

This is a larger topic of helping userland optimize access to groups of
files.  For example, during a readdir, if we knew the next step was to
delete all the files found, we could do one type of readahead (or even
order the returned values).

If we knew the next step would be to read all the files found, a
different type of readahead would be useful.

But, we shouldn't inflict all of this on fibmap/fiemap... we'll get
lost trying to make the one true interface for all operations.

For grouping operations on files, I think a read_tree syscall with
hints for what userland will do (read, stat, delete, list
filenames), and a better cookie than readdir should do it.

-chris


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Mark Fasheh
Hi Andreas,

Thanks for posting this. I believe that an interface such as FIEMAP
would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

My comments below are generally geared towards understanding the ioctl
interface.

On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:

 2 Functional specification
 
 The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
 block device ioctl used for mapping an individual logical block
 address in a file to a physical block address in the block device. The
 FIEMAP ioctl will return the logical to physical mapping for the extent
 that contains the specified logical byte address.
 
 struct fiemap_extent {
 __u64 fe_offset;	/* offset in bytes for the start of the extent */

I'm a little bit confused by fe_offset. Is it a physical offset, or a
logical offset? The reason I ask is that your description above says FIEMAP
ioctl will return the logical to physical mapping for the extent that
contains the specified logical byte address. Which seems to imply physical,
but your math to get to the next logical start in a very fragmented file,
implies that fe_offset is a logical offset:

   fm_start = fm_extents[fm_extent_count - 1].fe_offset +
 fm_extents[fm_extent_count - 1].fe_length + 1; 


 The logic for the filefrag would be similar to above. The size of the
 extent array will be extrapolated from the filesize and multiple ioctls
 of increasing extent count may be called for very large files. filefrag
 can easily call the FIEMAP ioctls repeatedly using the end of the last
 extent as the start offset for the next ioctl:
 
   fm_start = fm_extents[fm_extent_count - 1].fe_offset +
 fm_extents[fm_extent_count - 1].fe_length + 1;
 
 We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
 will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.

I think you meant 'fm_length' instead of 'fm_end' there.


 The FIEMAP_FLAG_* values are specified below. If FIEMAP_FLAG_NUM_EXTENTS is
 given then the fm_extents array is not filled, and only fm_extent_count is
 returned with the total number of extents in the file. Any new flags that
 introduce and/or require an incompatible behaviour in an application or
 in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT
 (e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NUM_EXTENTS would fall into that
 range if they were not part of the original specification). This is
 currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT
 is not large enough then it is possible to use the last INCOMPAT flag
 0x0100 to indicate that more of the flag range contains incompatible
 flags.
 
 #define FIEMAP_FLAG_SYNC        0x0001 /* sync file data before map */
 #define FIEMAP_FLAG_HSM_READ    0x0002 /* get data from HSM before map */
 #define FIEMAP_FLAG_NUM_EXTENTS 0x0004 /* return only number of extents */
 #define FIEMAP_FLAG_INCOMPAT    0xff00 /* error for unknown flags in here */
 
 The returned data from the FIEMAP ioctl is an array of fiemap_extent
 elements, one per extent in the file. The first extent will contain the
 byte specified by fm_start and the last extent will contain the byte
 specified by fm_start + fm_len, unless there are more than the passed-in
 fm_extent_count extents in the file, or this is beyond the EOF in which
 case the last extent will be marked with FIEMAP_EXTENT_LAST. Each extent
 returned has a set of flags associated with it that provide additional
 information about the extent. Not all filesystems will support all flags.
 
 FIEMAP_FLAG_NUM_EXTENTS will return only the number of extents used by
 the file. It will be used by default for filefrag since the specific
 extent information is not required in many cases.
 
 #define FIEMAP_EXTENT_HOLE      0x0001 /* has no data or space allocation */

Btw, I really like that holes are explicitly marked.


 #define FIEMAP_EXTENT_UNWRITTEN 0x0002 /* space allocated, but no data */
 #define FIEMAP_EXTENT_UNMAPPED  0x0004 /* has data but no space allocated */
 #define FIEMAP_EXTENT_ERROR     0x0008 /* map error, errno in fe_offset */
 #define FIEMAP_EXTENT_NO_DIRECT 0x0010 /* cannot access data directly */
 #define FIEMAP_EXTENT_LAST      0x0020 /* last extent in the file */
 #define FIEMAP_EXTENT_DELALLOC  0x0040 /* has data but not yet written */
 #define FIEMAP_EXTENT_SECONDARY 0x0080 /* data in secondary storage */
 #define FIEMAP_EXTENT_EOF       0x0100 /* fm_start + fm_len beyond EOF */

Is EOF here considering beyond i_size or beyond allocation?


 #define FIEMAP_EXTENT_UNKNOWN   0x0200 /* in use but location is unknown */
 
 
 FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
 encrypted, compressed, etc.)

Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
blocks.

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  16:13 -0600, Andreas Dilger wrote:
 On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
  I'm a little bit confused by fe_offset. Is it a physical offset, or a
  logical offset? The reason I ask is that your description above says FIEMAP
  ioctl will return the logical to physical mapping for the extent that
  contains the specified logical byte address. Which seems to imply physical,
  but your math to get to the next logical start in a very fragmented file,
  implies that fe_offset is a logical offset:
  
 fm_start = fm_extents[fm_extent_count - 1].fe_offset +
   fm_extents[fm_extent_count - 1].fe_length + 1; 
 
 Note the distinction between fe_offset (which is a physical offset for
 a single extent) and fm_offset (which is a logical offset for that file).

Actually, that is completely bunk.  What it should say is something like:
filefrag can easily call the FIEMAP ioctls repeatedly using the returned
fm_start and fm_length as the start offset for the next ioctl:

fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1;

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Mark Fasheh
On Mon, Oct 29, 2007 at 04:29:07PM -0600, Andreas Dilger wrote:
 On Oct 29, 2007  16:13 -0600, Andreas Dilger wrote:
  On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
   I'm a little bit confused by fe_offset. Is it a physical offset, or a
   logical offset? The reason I ask is that your description above says 
   FIEMAP
   ioctl will return the logical to physical mapping for the extent that
   contains the specified logical byte address. Which seems to imply 
   physical,
   but your math to get to the next logical start in a very fragmented file,
   implies that fe_offset is a logical offset:
   
  fm_start = fm_extents[fm_extent_count - 1].fe_offset +
fm_extents[fm_extent_count - 1].fe_length + 1; 
  
  Note the distinction between fe_offset (which is a physical offset for
  a single extent) and fm_offset (which is a logical offset for that file).
 
 Actually, that is completely bunk.  What it should say is something like:
 filefrag can easily call the FIEMAP ioctls repeatedly using the returned
 fm_start and fm_length as the start offset for the next ioctl:
 
 fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1;

Yeah - that's where I was going with my question. This is much more clear
now, thanks.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread David Chinner
On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
 By request on #linuxfs, here is the FIEMAP spec that we used to implement
 the FIEMAP support for ext4.  There was an ext4 patch posted on August 29
 to linux-ext4 entitled [PATCH] FIEMAP ioctl.

Link:

http://marc.info/?l=linux-ext4&m=118838241209683&w=2

That's a very ext4-specific ioctl interface. Can we get this made
generic like the FIBMAP interface so we don't have to replicate all
the copyin/copyout handling and interface definitions everywhere?
i.e. an ->extent_map aops callout to the filesystem in generic code,
just like ->bmap?
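
A sketch of what such a callout might look like (hypothetical signature,
loosely following the draft structures above):

	/* hypothetical aops hook, sitting alongside ->bmap, so that the
	 * copyin/copyout and flag checking live once in generic code */
	int (*extent_map)(struct address_space *mapping, struct fiemap *fiemap);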

 I've asked Kalpak to post
 an updated version of that patch along with the changes to the filefrag
 tool to use FIEMAP.

Where can I find the test program that validates the implementation?
Also, following the fallocate model, can we get the interface definition
turned into a man page before anything is submitted upstream?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
   Thanks for posting this. I believe that an interface such as FIEMAP
 would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

I tried to make it as Lustre-agnostic as possible...

 On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
  The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
  block device ioctl used for mapping an individual logical block
  address in a file to a physical block address in the block device. The
  FIEMAP ioctl will return the logical to physical mapping for the extent
  that contains the specified logical byte address.
  
  struct fiemap_extent {
  __u64 fe_offset;	/* offset in bytes for the start of the extent */
 
 I'm a little bit confused by fe_offset. Is it a physical offset, or a
 logical offset? The reason I ask is that your description above says FIEMAP
 ioctl will return the logical to physical mapping for the extent that
 contains the specified logical byte address. Which seems to imply physical,
 but your math to get to the next logical start in a very fragmented file,
 implies that fe_offset is a logical offset:
 
fm_start = fm_extents[fm_extent_count - 1].fe_offset +
  fm_extents[fm_extent_count - 1].fe_length + 1; 

Note the distinction between fe_offset (which is a physical offset for
a single extent) and fm_offset (which is a logical offset for that file).

  We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
  will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.
 
 I think you meant 'fm_length' instead of 'fm_end' there.

You're right, thanks.

  #define FIEMAP_EXTENT_LAST  0x0020 /* last extent in the file */
  #define FIEMAP_EXTENT_EOF   0x0100 /* fm_start + fm_len beyond EOF */
 
 Is EOF here considering beyond i_size or beyond allocation?

_EOF == beyond i_size.
_LAST == last extent in the file.

In most cases FIEMAP_EXTENT_EOF will be set at the same time as
FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the 
EOF flag may be set on one or more earlier extents.

  FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
  encrypted, compressed, etc.)
 
 Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
 Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
 blocks.

Hmm, but part of the issue would be how to request the extra data, and
what offset it would be given?  One could, for example, use negative
offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
or similar, I hadn't given that much thought.  The other issue is that
I'd like to get the basics of the API in place before it gets too complex.
We can always add functionality with more FIEMAP_FLAG_* (whether in the
INCOMPAT range or not, depending on what is being done).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  17:11 -0700, Mark Fasheh wrote:
 On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote:
   Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
   blocks.
  
  Hmm, but part of the issue would be how to request the extra data, and
  what offset it would be given?  One could, for example, use negative
  offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
  or similar, I hadn't given that much thought.
 
 Well, fe_offset and fe_length are already expressed in bytes, so we could
 just put the byte offset to where the inline data starts in there. fe_length
 is just used as the length allocated for inline-data.
 
 If fe_offset is required to be block aligned, then we could add a field to
 express an offset within the block where data would be found - say
 'fe_data_start_offset'. In the non-inline case, we could guarantee that
 fe_data_start_offset is zero. That way software which doesn't want to care
 whether something is inline-data (for example, a backup program) or not
 could just blindly add it to fe_offset before looking at the data.

Oh, I was confused as to what you are asking.  Mapping in-inode data is
just fine using the existing interface.  The byte offset of the data is
given, and the FIEMAP_EXTENT_NO_DIRECT flag is set to indicate that it
isn't necessarily safe to do IO directly to that byte offset in the file
(e.g. tail packed, compressed data, etc).

I was thinking you were asking how to map metadata (e.g. indirect blocks).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Mark Fasheh
On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote:
 On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
  Thanks for posting this. I believe that an interface such as FIEMAP
  would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)
 
 I tried to make it as Lustre-agnostic as possible...

IMHO, your description succeeded at that. I'm hoping that the final patch
can have mostly generic code, like FIBMAP does today.


   #define FIEMAP_EXTENT_LAST  0x0020 /* last extent in the file */
   #define FIEMAP_EXTENT_EOF   0x0100 /* fm_start + fm_len beyond EOF */
  
  Is EOF here considering beyond i_size or beyond allocation?
 
 _EOF == beyond i_size.
 _LAST == last extent in the file.
 
 In most cases FIEMAP_EXTENT_EOF will be set at the same time as
 FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the 
 EOF flag may be set on one or more earlier extents.

Oh, ok great - I was primarily looking for a way to say there's allocation
past i_size and it looks like we have it.


   FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
   encrypted, compressed, etc.)
  
  Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
  Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
  blocks.
 
 Hmm, but part of the issue would be how to request the extra data, and
 what offset it would be given?  One could, for example, use negative
 offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
 or similar, I hadn't given that much thought.

Well, fe_offset and fe_length are already expressed in bytes, so we could
just put the byte offset to where the inline data starts in there. fe_length
is just used as the length allocated for inline-data.

If fe_offset is required to be block aligned, then we could add a field to
express an offset within the block where data would be found - say
'fe_data_start_offset'. In the non-inline case, we could guarantee that
fe_data_start_offset is zero. That way software which doesn't want to care
whether something is inline-data (for example, a backup program) or not
could just blindly add it to fe_offset before looking at the data.
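
As a sketch, the extent descriptor with that extra field
(fe_data_start_offset is hypothetical, not part of the draft spec):

	struct fiemap_extent {
		__u64 fe_offset;            /* block-aligned physical offset */
		__u64 fe_length;            /* length in bytes */
		__u32 fe_flags;             /* FIEMAP_EXTENT_* flags */
		__u32 fe_lun;               /* logical device number */
		__u32 fe_data_start_offset; /* byte offset of the data within
					     * the block; zero unless the
					     * extent is inline/tail-packed */
	};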

Regardless, I think we also want to explicitly flag this:

#define FIEMAP_EXTENT_DATA_IN_INODE 0x0400 /* extent data is stored in inode block */


I'm going to pretend that I completely understand reiserfs tail-packing and
say that my approaches above look like they could work for that case too.
We'd want to add a separate flag for tail-packed data though.


 The other issue is that I'd like to get the basics of the API in place
 before it gets too complex. We can always add functionality with more
 FIEMAP_FLAG_* (whether in the INCOMPAT range or not, depending on what is
 being done).

Sure, but I think whatever goes upstream should be able to handle this case
- there's file systems in use _today_ which put data in inode blocks and
pack file tails.

Thanks,
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]


Proposal to improve filesystem/block snapshot interaction

2007-10-29 Thread Greg Banks
G'day,

A number of people have already seen this; I'm posting for wider
comment and to move some interesting discussion to a public list.

I'll apologise in advance for the talk about SGI technologies (including
proprietary ones), but all the problems mentioned apply to in-tree
technologies too.



This proposal seeks to solve three problems in our NAS server product
due to the interaction of the filesystem (XFS) and the block-based
snapshot feature (XVM snapshot).  It's based on discussions held with
various people over the last few weeks, including Roger Strassburg,
Christoph Hellwig, David Chinner, and Donald Douwsma.

a)  The first problem is the server's behaviour when a filesystem
which is subject to snapshot is written to, and the snapshot
repository runs out of room.

The failure mode can be quite severe.  XFS issues a metadata write
to the block device, triggering a Copy-On-Write operation in the
XVM snapshot element, which because of the full repository fails
with EIO.  When XFS sees the failure it shuts down the filesystem.
All subsequent attempts to perform IO to the filesystem block
indefinitely.  In particular any NFS server thread will block
and never reply to the NFS client.  The NFS client will retry,
causing another NFS server thread to block, and repeat until every
NFS server thread is blocked.  At this point all NFS service for
all filesystems ceases.

See PV 958220 and PV 958140 for a description of this problem and
some of the approaches which have been discussed for resolving it.


b)  The second problem is that certain common combinations of
filesystem operations can cause large wastes of space in the XVM
snapshot repository.

Examples include writing the same file twice with dd, or writing
a new file and deleting it.  The cause is the inability of the
XVM snapshot code to be able to free regions in the snapshot
repository that are no longer in use by the filesystem; this
information is simply not available within the block layer.

Note that problem b) also contributes to problem a) by increasing
repository usage and thus making it easier to encounter an
out-of-space condition on the repository.

c)  The third problem is an unfortunate interaction between an XFS
internal log and block snapshots.

The log is a fixed region of the block device which is written as
a side effect of a great many different filesystem operations.
The information written there has no value and is not even
read until and unless log recovery needs to be performed after
the server has crashed.  This means the log does not need to be
preserved by the block feature snapshot (because at the point in
time when the snapshot is taken, log recovery must have already
happened).  In fact the correct procedure when mounting a read-only
snapshot is to use the norecovery option to prevent any attempt
to read the log (although the NAS server software actually doesn't
do this).

However, because the block device layer doesn't have enough
information to know any better, the first pass of writes to the log
are subjected to Copy-On-Write.  This has two undesirable effects.
Firstly, it increases the amount of snapshot repository space
used by each snapshot, thus contributing to problem a).  Secondly,
it puts a significant performance penalty on filesystem metadata
operations for some time after each snapshot is taken; given
that the NAS server can be configured to take regular frequent
snapshots this may mean all of the time.

An obvious solution is to use an external XFS log, but this is quite
inconvenient for the NAS server software to arrange.  For one
thing, we would need to construct a separate external log device
for the main filesystem and one for each mounted snapshot.

Note that these problems are not specific to XVM but will be
encountered by any Linux block-COWing snapshot implementation.
For example the DM snapshot implementation is documented to suffer from
problem a).  From the linux/Documentation/device-mapper/snapshot.txt:

 COW device will often be smaller than the origin and if it
 fills up the snapshot will become useless and be disabled,
 returning errors.  So it is important to monitor the amount of
 free space and expand the COW device before it fills up.

During discussions, it became clear that we could solve all three
of these problems by improving the block device interface to allow a
filesystem to provide the block device with dynamic block usage hints.

For example, when unlinking a file the filesystem could tell the
block device a hint of the form "I'm about to stop using these
blocks".  Most block devices would silently ignore these hints, but
a snapshot COW implementation (the copy-on-write XVM element or
the snapshot-origin dm target) could use them to help avoid these
problems.  For example, the response to the "I'm about to stop using
these blocks" hint could be to free the space used in the snapshot
repository for unnecessary copies of those blocks.

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-29 Thread Greg Banks
On Tue, Oct 30, 2007 at 12:51:47AM +0100, Arnd Bergmann wrote:
 On Monday 29 October 2007, Christoph Hellwig wrote:
  - Forwarded message from Greg Banks [EMAIL PROTECTED] -
  
  Date: Thu, 27 Sep 2007 16:31:13 +1000
  From: Greg Banks [EMAIL PROTECTED]
  Subject: Proposal to improve filesystem/block snapshot interaction
  To: David Chinner [EMAIL PROTECTED], Donald Douwsma [EMAIL PROTECTED],
  Christoph Hellwig [EMAIL PROTECTED], Roger Strassburg [EMAIL PROTECTED]
  Cc: Mark Goodwin [EMAIL PROTECTED],
  Brett Jon Grandbois [EMAIL PROTECTED]
  
  
  
  This proposal seeks to solve three problems in our NAS server product
  due to the interaction of the filesystem (XFS) and the block-based
  snapshot feature (XVM snapshot).  It's based on discussions held with
  various people over the last few weeks, including Roger Strassburg,
  Christoph Hellwig, David Chinner, and Donald Douwsma.
 
 Hi Greg,
 
 Christoph forwarded me your mail, because I mentioned to him that
 I'm trying to come up with a similar change, and it might make sense
 to combine our efforts.

Excellent, thanks Christoph ;-)


 
  For example, when unlinking a file the filesystem could tell the
  block device a hint of the form "I'm about to stop using these
  blocks".  Most block devices would silently ignore these hints, but
  a snapshot COW implementation (the copy-on-write XVM element or
  the snapshot-origin dm target) could use them to help avoid these
  problems.  For example, the response to the "I'm about to stop using
  these blocks" hint could be to free the space used in the snapshot
  repository for unnecessary copies of those blocks.
 
 The case I'm interested in is the more specific case of 'erase',
 which is more of a performance optimization than a space optimization.
 When you have a flash medium, it's useful to erase a block as soon
 as it's becoming unused, so that a subsequent write will be faster.
 Moreover, on an MTD medium, you may not even be able to write to
 a block unless it has been erased before.

Spending the device's time to erase early, when the CPU isn't waiting
for it, instead of later, when it adds to effective write latency.
Makes sense.

  Of course snapshot cow elements may be part of more generic element
  trees.  In general there may be more than one consumer of block usage
  hints in a given filesystem's element tree, and their locations in that
  tree are not predictable.  This means the block extents mentioned in
  the usage hints need to be subject to the block mapping algorithms
  provided by the element tree.  As those algorithms are currently
  implemented using bio mapping and splitting, the easiest and simplest
  way to reuse those algorithms is to add new bio flags.
  
  First we need a mechanism to indicate that a bio is a hint rather
  than a real IO.  Perhaps the easiest way is to add a new flag to
  the bi_rw field:
  
  #define BIO_RW_HINT 5   /* bio is a hint not a real io; no pages */
 
 My first thought was to do this on the request layer, not already
 on bio, but they can easily be combined, I guess.

My first thoughts were along similar lines, but I wasn't expecting
these hint bios to survive deep enough in the stack to need queuing
and thus visibility in struct request; I was expecting their lifetime
to be some passage and splitting through a volume manager and then
conversion to synchronous metadata operations.  Plus, hijacking bios
means not having to modify every single DM target to duplicate its
block mapping algorithm.

Basically, I was thinking of loopback-like block mapping and not
considering flash.  I suppose for flash where there's a real erase
operation, you'd want to be queuing and that means a new request type.

 
  We'll also need a field to tell us which kind of hint the bio
  represents.  Perhaps a new field could be added, or perhaps the top
  16 bits of bi_rw (currently used to encode the bio's priority, which
  has no meaning for hint bios) could be reused.  The latter approach
  may allow hints to be used without modifying the bio structure or
  any code that uses it other than the filesystem and the snapshot
  implementation.  Such a property would have obvious advantages for
  our NAS server software, where XFS and XVM modules are provided but
  the other users of struct bio are stock SLES code.
  
  
  Next we'll need three bio hints types with the following semantics.
  
  BIO_HINT_ALLOCATE
  The bio's block extent will soon be written by the filesystem
  and any COW that may be necessary to achieve that should begin
  now.  If the COW is going to fail, the bio should fail.  Note
  that this provides a way for the filesystem to manage when and
  how failures to COW are reported.
  
  BIO_HINT_RELEASE
  The bio's block extent is no longer in use by the filesystem
  and will not be read in the future.  Any storage used to back
  the extent may be released without any threat to filesystem
  or data integrity.

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-29 Thread Neil Brown
On Tuesday October 30, [EMAIL PROTECTED] wrote:
 
 Of course snapshot cow elements may be part of more generic element
 trees.  In general there may be more than one consumer of block usage
 hints in a given filesystem's element tree, and their locations in that
 tree are not predictable.  This means the block extents mentioned in
 the usage hints need to be subject to the block mapping algorithms
 provided by the element tree.  As those algorithms are currently
 implemented using bio mapping and splitting, the easiest and simplest
 way to reuse those algorithms is to add new bio flags.

So are you imagining that you might have distinct snapshotable
elements, and that some of these might be combined by e.g. RAID0 into
a larger device, then a filesystem is created on that?

I ask because my first thought was that the sort of communication you
want seems like it would be just between a filesystem and the block
device that it talks directly to, and as you are particularly
interested in XFS and XVM, you could come up with whatever protocol
you want for those two to talk to each other, prototype it, iron out
all the issues, then say "We've got this really cool thing to make
snapshots much faster - wanna share?" and thus be presenting from a
position of more strength (the old 'code talks' mantra).

 
 First we need a mechanism to indicate that a bio is a hint rather
 than a real IO.  Perhaps the easiest way is to add a new flag to
 the bi_rw field:
 
 #define BIO_RW_HINT   5   /* bio is a hint not a real io; no pages */

Reminds me of the new approach to issue_flush_fn which is just to have
a zero-length barrier bio (is that implemented yet? I lost track).
But different as a zero length barrier has zero length, and your hints
have a very meaningful length.

 
 Next we'll need three bio hints types with the following semantics.
 
 BIO_HINT_ALLOCATE
 The bio's block extent will soon be written by the filesystem
 and any COW that may be necessary to achieve that should begin
 now.  If the COW is going to fail, the bio should fail.  Note
 that this provides a way for the filesystem to manage when and
 how failures to COW are reported.

Would it make sense to allow the bi_sector to be changed by the device
and to have that change honoured?
i.e. "Please allocate 128 blocks, maybe 'here'"
 "OK, 128 blocks allocated, but they are actually over 'there'."

If the device is tracking what space is and isn't used, it might make
life easier for it to do the allocation.  Maybe even have a variant
"Allocate 128 blocks, I don't care where."

Is this bio supposed to block until the copy has happened?  Or only
until the space of the copy has been allocated and possibly committed?
Or must it return without doing any IO at all?

 
 BIO_HINT_RELEASE
 The bio's block extent is no longer in use by the filesystem
 and will not be read in the future.  Any storage used to back
 the extent may be released without any threat to filesystem
 or data integrity.

If the allocation unit of the storage device (e.g. a few MB) does not
match the allocation unit of the filesystem (e.g. a few KB) then for
this to be useful either the storage device must start recording tiny
allocations, or the filesystem should re-release areas as they grow.
i.e. when releasing a range of a device, look in the filesystem's usage
records for the largest surrounding free space, and release all of that.

Would this be a burden on the filesystems?
Is my imagined disparity between block sizes valid?
Would it be just as easy for the storage device to track small
allocation/deallocations?

 
 BIO_HINT_DONTCOW
 (the Bart Simpson BIO).  The bio's block extent is not needed
 in mounted snapshots and does not need to be subjected to COW.

This seems like a much more domain-specific function than the other
two, which themselves could be more generally useful (I'm imagining
using hints from them to e.g. accelerate RAID reconstruction).

Surely the correct thing to do with the log is to put it on a separate
device which itself isn't snapshotted.

If you have a storage manager that is smart enough to handle these
sorts of things, maybe the functionality you want is "Give me a
subordinate device which is not snapshotted, size X", then journal to
that virtual device.
I guess that is equally domain specific, but the difference is that if
you try to read from the DONTCOW part of the snapshot, you get bad
old data, whereas if you try to access the subordinate device of a
snapshot, you get an IO error - which is probably safer.

 
 Comments?

On the whole it seems reasonably sane - providing you are from the
school which believes that volume managers and filesystems should be
kept separate :-)

NeilBrown



Re: Proposal to improve filesystem/block snapshot interaction

2007-10-29 Thread Greg Banks
On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
 On Tuesday October 30, [EMAIL PROTECTED] wrote:
  
  Of course snapshot cow elements may be part of more generic element
  trees.  In general there may be more than one consumer of block usage
  hints in a given filesystem's element tree, and their locations in that
  tree are not predictable.  This means the block extents mentioned in
  the usage hints need to be subject to the block mapping algorithms
  provided by the element tree.  As those algorithms are currently
  implemented using bio mapping and splitting, the easiest and simplest
  way to reuse those algorithms is to add new bio flags.
 
 So are you imagining that you might have a distinct snapshotable
 elements, and that some of these might be combined by e.g. RAID0 into
 a larger device, then a filesystem is created on that?

I was thinking more a concatenation than a stripe, but yes you could
do such a thing, e.g. to parallelise the COW procedure.  We don't do
any such thing in our product; the COW element is always inserted at
the top of the logical element tree.

 I ask because my first thought was that the sort of communication you
 want seems like it would be just between a filesystem and the block
 device that it talks directly to, and as you are particularly
 interested in XFS and XVM, you could come up with whatever protocol
 you want for those two to talk to each other, prototype it, iron out
 all the issues, then say "We've got this really cool thing to make
 snapshots much faster - wanna share?" and thus be presenting from a
 position of more strength (the old 'code talks' mantra).

Indeed, code talks ;-)  I was hoping someone else would do that
talking for me, though.

  First we need a mechanism to indicate that a bio is a hint rather
  than a real IO.  Perhaps the easiest way is to add a new flag to
  the bi_rw field:
  
  #define BIO_RW_HINT 5   /* bio is a hint not a real io; no pages */
 
 Reminds me of the new approach to issue_flush_fn which is just to have
 a zero-length barrier bio (is that implemented yet? I lost track).
 But different as a zero length barrier has zero length, and your hints
 have a very meaningful length.

Yes.

  
  Next we'll need three bio hints types with the following semantics.
  
  BIO_HINT_ALLOCATE
  The bio's block extent will soon be written by the filesystem
  and any COW that may be necessary to achieve that should begin
  now.  If the COW is going to fail, the bio should fail.  Note
  that this provides a way for the filesystem to manage when and
  how failures to COW are reported.
 
 Would it make sense to allow the bi_sector to be changed by the device
 and to have that change honoured?
 i.e. "Please allocate 128 blocks, maybe 'here'"
  "OK, 128 blocks allocated, but they are actually over 'there'."

That wasn't the expectation at all.  Perhaps "allocate" is a poor
name.  "I have just allocated, deal with it" might be more appropriate.
Perhaps BIO_HINT_WILLUSE or something.

 If the device is tracking what space is and isn't used, it might make
 life easier for it to do the allocation.  Maybe even have a variant
 "Allocate 128 blocks, I don't care where."

That kind of thing might perhaps be useful for flash, but I think
current filesystems would have conniptions.

 Is this bio supposed to block until the copy has happened?  Or only
 until the space of the copy has been allocated and possibly committed?

The latter.  The writes following will block until the COW has
completed, or might be performed sufficiently later that the COW
has meanwhile completed (I think this implies an extra state in the
snapshot metadata to avoid double-COWing).  The point of the hint is
to allow the snapshot code to test for running out of repo space and
report that failure at a time when the filesystem is able to handle
it gracefully.

 Or must it return without doing any IO at all?

I would expect it would be a useful optimisation to start the IO but
not wait for its completion, but that the first implementation would
just do a space check.

  
  BIO_HINT_RELEASE
  The bio's block extent is no longer in use by the filesystem
  and will not be read in the future.  Any storage used to back
  the extent may be released without any threat to filesystem
  or data integrity.
 
 If the allocation unit of the storage device (e.g. a few MB) does not
 match the allocation unit of the filesystem (e.g. a few KB) then for
 this to be useful either the storage device must start recording tiny
 allocations, or the filesystem should re-release areas as they grow.
 i.e. when releasing a range of a device, look in the filesystem's usage
 records for the largest surrounding free space, and release all of that.

Good point.  I was planning on ignoring this problem :-/ Given that
current snapshot implementations waste *all* the blocks in deleted
files, it would be an improvement to scavenge the blocks in large
extents.