On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
> On Tuesday October 30, [EMAIL PROTECTED] wrote:
> > 
> > Of course snapshot cow elements may be part of more generic element
> > trees.  In general there may be more than one consumer of block usage
> > hints in a given filesystem's element tree, and their locations in that
> > tree are not predictable.  This means the block extents mentioned in
> > the usage hints need to be subject to the block mapping algorithms
> > provided by the element tree.  As those algorithms are currently
> > implemented using bio mapping and splitting, the easiest and simplest
> > way to reuse those algorithms is to add new bio flags.
> 
> So are you imagining that you might have distinct snapshotable
> elements, and that some of these might be combined by e.g. RAID0 into
> a larger device, then a filesystem is created on that?

I was thinking more a concatenation than a stripe, but yes you could
do such a thing, e.g. to parallelise the COW procedure.  We don't do
any such thing in our product; the COW element is always inserted at
the top of the logical element tree.

> I ask because my first thought was that the sort of communication you
> want seems like it would be just between a filesystem and the block
> device that it talks directly to, and as you are particularly
> interested in XFS and XVM, you could come up with whatever protocol
> you want for those two to talk to each other, prototype it, iron out
> all the issues, then say "We've got this really cool thing to make
> snapshots much faster - wanna share?"  and thus be presenting from a
> position of more strength (the old 'code talks' mantra).

Indeed, code talks ;-)  I was hoping someone else would do that
talking for me, though.

> > First we need a mechanism to indicate that a bio is a hint rather
> > than a real IO.  Perhaps the easiest way is to add a new flag to
> > the bi_rw field:
> > 
> > #define BIO_RW_HINT         5       /* bio is a hint, not a real io; no pages */
> 
> Reminds me of the new approach to issue_flush_fn which is just to have
> a zero-length barrier bio (is that implemented yet? I lost track).
> But different as a zero length barrier has zero length, and your hints
> have a very meaningful length.

Yes.
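
For what it's worth, here's a minimal sketch of how a filesystem
might build and submit such a hint bio under this proposal.
BIO_RW_HINT and the hint encoding in bi_rw are the proposed bits,
not an existing kernel API, and submit_usage_hint()/hint_end_io()
are names I've made up; this also assumes the 2.6.24-style
two-argument bi_end_io:

    /* kernel context: linux/bio.h, linux/completion.h */
    struct hint_done {
            struct completion done;
            int error;
    };

    static void hint_end_io(struct bio *bio, int error)
    {
            struct hint_done *hd = bio->bi_private;

            hd->error = error;
            complete(&hd->done);
            bio_put(bio);
    }

    /* Submit a usage hint for [sector, sector + nr_sectors) and wait. */
    static int submit_usage_hint(struct block_device *bdev, sector_t sector,
                                 unsigned int nr_sectors, unsigned long hint)
    {
            struct hint_done hd;
            struct bio *bio = bio_alloc(GFP_NOIO, 0); /* no pages attached */

            if (!bio)
                    return -ENOMEM;
            init_completion(&hd.done);
            bio->bi_bdev = bdev;
            bio->bi_sector = sector;            /* extent start */
            bio->bi_size = nr_sectors << 9;     /* meaningful length, no data */
            bio->bi_private = &hd;
            bio->bi_end_io = hint_end_io;
            /* hint type carried in bi_rw alongside the flag (one possible
             * encoding; the proposal doesn't pin this down) */
            submit_bio((1 << BIO_RW_HINT) | hint, bio);
            wait_for_completion(&hd.done);
            return hd.error;
    }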

> > 
> > Next we'll need three bio hints types with the following semantics.
> > 
> > BIO_HINT_ALLOCATE
> >     The bio's block extent will soon be written by the filesystem
> >     and any COW that may be necessary to achieve that should begin
> >     now.  If the COW is going to fail, the bio should fail.  Note
> >     that this provides a way for the filesystem to manage when and
> >     how failures to COW are reported.
> 
> Would it make sense to allow the bi_sector to be changed by the device
> and to have that change honoured?
> i.e. "Please allocate 128 blocks, maybe 'here'" 
>      "OK, 128 blocks allocated, but they are actually over 'there'".

That wasn't the expectation at all.  Perhaps "allocate" is a poor
name; "I have just allocated, deal with it" is closer to the intent.
Perhaps BIO_HINT_WILLUSE or something.
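
For illustration, with the submit_usage_hint() sketch above, the
filesystem would issue the hint right after allocating an extent and
before queueing any writes to it (ext_start/ext_nsectors are made-up
names):

    /* "I have just allocated these blocks; COW them now if you must." */
    error = submit_usage_hint(bdev, ext_start, ext_nsectors,
                              BIO_HINT_ALLOCATE);
    if (error)
            return error;   /* e.g. -ENOSPC: back out while we still can */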

> If the device is tracking what space is and isn't used, it might make
> life easier for it to do the allocation.  Maybe even have a variant
> "Allocate 128 blocks, I don't care where".

That kind of thing might perhaps be useful for flash, but I think
current filesystems would have conniptions.

> Is this bio supposed to block until the copy has happened?  Or only
> until the space of the copy has been allocated and possibly committed?

The latter.  The writes that follow will block until the COW has
completed, or might be issued sufficiently later that the COW has
meanwhile completed (I think this implies an extra state in the
snapshot metadata to avoid double-COWing).  The point of the hint is
to let the snapshot code test for running out of repo space and
report that failure at a time when the filesystem is able to handle
it gracefully.
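
To make that concrete, here's a rough sketch of what the snapshot
element's side of the allocate hint could look like.  struct cow_dev,
repo_free_sectors and the locking are all illustrative; the only real
point is that the space check happens here and the failure is
reported back through the hint bio:

    static void cow_handle_allocate_hint(struct cow_dev *dev,
                                         struct bio *bio)
    {
            unsigned int nr = bio->bi_size >> 9;
            int error = 0;

            spin_lock(&dev->lock);
            if (dev->repo_free_sectors < nr)
                    error = -ENOSPC;               /* repo full: fail the hint */
            else
                    dev->repo_free_sectors -= nr;  /* reserve for the COW */
            spin_unlock(&dev->lock);

            /* 2.6.24-era two-argument bio_endio() */
            bio_endio(bio, error);
    }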

> Or must it return without doing any IO at all?

I would expect starting the IO without waiting for its completion to
be a useful optimisation, but the first implementation would just do
a space check.

> > 
> > BIO_HINT_RELEASE
> >     The bio's block extent is no longer in use by the filesystem
> >     and will not be read in the future.  Any storage used to back
> >     the extent may be released without any threat to filesystem
> >     or data integrity.
> 
> If the allocation unit of the storage device (e.g. a few MB) does not
> match the allocation unit of the filesystem (e.g. a few KB) then for
> this to be useful either the storage device must start recording tiny
> allocations, or the filesystem should re-release areas as they grow.
> i.e. when releasing a range of a device, look in the filesystem's usage
> records for the largest surrounding free space, and release all of that.

Good point.  I was planning on ignoring this problem :-/ Given that
current snapshot implementations waste *all* the blocks in deleted
files, it would be an improvement to scavenge the blocks in large
extents.  This is especially true for XFS, which goes to some effort
to achieve large linear extents.

> Would this be a burden on the filesystems?

I think so.  I would hope the hints could be done in a way which
minimises the impact on filesystems, so that it would be easier to roll
out.  That implies pushing the responsibility for being smart about
combining partial deallocations down to the block device/snapshot code.
Any comments, Roger?
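
A rough sketch of what those smarts might look like on the device
side, assuming the device tracks released filesystem blocks in a
bitmap and frees a backing chunk only once every block in it has
been released (all names invented; chunk_sectors assumed a power of
two):

    static void cow_handle_release_hint(struct cow_dev *dev,
                                        struct bio *bio)
    {
            sector_t start = bio->bi_sector;
            sector_t end = start + (bio->bi_size >> 9);
            sector_t s;

            /* record the fine-grained release from the filesystem */
            for (s = start; s < end; s++)
                    set_bit(s - dev->start, dev->released_map);

            /* free a whole chunk only when it is entirely released */
            for (s = start & ~(sector_t)(dev->chunk_sectors - 1);
                 s < end; s += dev->chunk_sectors)
                    if (chunk_fully_released(dev, s)) /* scans the bitmap */
                            cow_free_chunk(dev, s);
    }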

> Is my imagined disparity between block sizes valid?

Yep, at least for XFS and XVM.  If the space was used in lots of
little files, this rounding would probably eat a lot of the savings.

> Would it be just as easy for the storage device to track small
> allocation/deallocations?
> 
> > 
> > BIO_HINT_DONTCOW
> >     (the Bart Simpson BIO).  The bio's block extent is not needed
> >     in mounted snapshots and does not need to be subjected to COW.
> 
> This seems like a much more domain-specific function than the other
> two, which themselves could be more generally useful

Agreed, I can't offhand think of a use other than internal logs.
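
For completeness, the snapshot element's write path would then
consult a no-COW map populated by the hint, something like this
(again, every name here is illustrative; 2.6.23-era make_request_fn
signature assumed):

    static int cow_make_request(struct request_queue *q, struct bio *bio)
    {
            struct cow_dev *dev = q->queuedata;

            if (bio_data_dir(bio) == WRITE &&
                !in_nocow_map(dev, bio->bi_sector, bio->bi_size >> 9))
                    cow_copy_if_needed(dev, bio);   /* normal COW path */

            /* DONTCOW extents (e.g. an internal log) pass straight through */
            bio->bi_bdev = dev->lower_bdev;
            generic_make_request(bio);
            return 0;
    }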

> (I'm imagining
> using hints from them to e.g. accelerate RAID reconstruction).

Ah, interesting idea: delete a file to speed up RAID recovery ;-)

> Surely the "correct" thing to do with the log is to put it on a separate
> device which itself isn't snapshotted.

Indeed.

> If you have a storage manager that is smart enough to handle these
> sorts of things, maybe the functionality you want is "Give me a
> subordinate device which is not snapshotted, size X", then journal to
> that virtual device.

This is usually better, but is not always convenient for a number of
reasons.  For example, you might not have enough disks to build all
of a base, a snapshot repo, and a log device.  Also, the log really
needs to be safe, so you want it mirrored or RAID5, and you want it
fast, and you want it on separate spindles, so it needs several disks;
but now you're using terabytes of disk space for 128 MiB of log.

> I guess that is equally domain specific, but the difference is that if
> you try to read from the DONTCOW part of the snapshot, you get bad
> old data, whereas if you try to access the subordinate device of a
> snapshot, you get an IO error - which is probably safer.

I believe (Dave or Roger will correct me here) that XFS needs a log
when you mount, and you get to either provide an external one or use
the internal one.  So when you mount a snapshot of an XFS filesystem
which was built with an external log, you need to provide a new
external log device.  So the storage manager needs to allocate an
external log device for each snapshot it allows.

>
> > 
> > Comments?
> 
> On the whole it seems reasonably sane .... providing you are from the
> school which believes that volume managers and filesystems should be
> kept separate :-)

Yeah, I'm so old-school :-)

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.