Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-10-06 Thread Sarthak Kukreti
On Fri, Jun 9, 2023 at 1:31 PM Mike Snitzer  wrote:
>
> On Wed, Jun 07 2023 at  7:27P -0400,
> Mike Snitzer  wrote:
>
> > On Mon, Jun 05 2023 at  5:14P -0400,
> > Sarthak Kukreti  wrote:
> >
> > > On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer  wrote:
> > > >
> > > > We all just need to focus on your proposal and Joe's dm-thin
> > > > reservation design...
> > > >
> > > > [Sarthak: FYI, this implies that it doesn't really make sense to add
> > > > dm-thinp support before Joe's design is implemented.  Otherwise we'll
> > > > have 2 different responses to REQ_OP_PROVISION.  The one that is
> > > > captured in your patchset isn't adequate to properly handle ensuring
> > > > upper layer (like XFS) can depend on the space being available across
> > > > snapshot boundaries.]
> > > >
> > > Ack. Would it be premature for the rest of the series to go through
> > > (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
> > > targets)? I'd like to start using this as a reference to suggest
> > > additions to the virtio-spec for virtio-blk support and start looking
> > > at what an ext4 implementation would look like.
> >
> > Please drop the dm-thin.c and dm-snap.c changes.  dm-snap.c would need
> > more work to provide the type of guarantee XFS requires across
> > snapshot boundaries. I'm inclined to _not_ add dm-snap.c support
> > because it is best to just use dm-thin.
> >
> > And FYI even your dm-thin patch will be the starting point for the
> > dm-thin support (we'll keep attribution to you for all the code in a
> > separate patch).
> >
> > > Fair points, I certainly don't want to derail this conversation; I'd
> > > be happy to see this work merged sooner rather than later.
> >
> > Once those dm target changes are dropped I think the rest of the
> > series is fine to go upstream now.  Feel free to post a v8.
>
> FYI, I've made my latest code available in this
> 'dm-6.5-provision-support' branch (based on 'dm-6.5'):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.5-provision-support
>
> It's what v8 should be plus the 2 dm-thin patches (that I don't think
> should go upstream yet, but are theoretically useful for Dave and
> Joe).
>
Cheers! Apologies for dropping the ball on this, I just sent out v8
with the dm-thin patches dropped.


- Sarthak

> The "dm thin: complete interface for REQ_OP_PROVISION support" commit
> establishes all the dm-thin interface I think is needed.  The FIXME in
> process_provision_bio() (and the patch header) cautions against upper
> layers like XFS using this dm-thinp support quite yet.
>
> Otherwise we'll have the issue where dm-thinp's REQ_OP_PROVISION
> support initially doesn't provide the guarantee that XFS needs across
> snapshots (which is: snapshots inherit all previous REQ_OP_PROVISION).
>
> Mike
>



Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-09 Thread Dave Chinner
On Fri, Jun 09, 2023 at 04:31:41PM -0400, Mike Snitzer wrote:
> On Wed, Jun 07 2023 at  7:27P -0400,
> Mike Snitzer  wrote:
> 
> > On Mon, Jun 05 2023 at  5:14P -0400,
> > Sarthak Kukreti  wrote:
> > 
> > > On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer  wrote:
> > > >
> > > > We all just need to focus on your proposal and Joe's dm-thin
> > > > reservation design...
> > > >
> > > > [Sarthak: FYI, this implies that it doesn't really make sense to add
> > > > dm-thinp support before Joe's design is implemented.  Otherwise we'll
> > > > have 2 different responses to REQ_OP_PROVISION.  The one that is
> > > > captured in your patchset isn't adequate to properly handle ensuring
> > > > upper layer (like XFS) can depend on the space being available across
> > > > snapshot boundaries.]
> > > >
> > > Ack. Would it be premature for the rest of the series to go through
> > > (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
> > > targets)? I'd like to start using this as a reference to suggest
> > > additions to the virtio-spec for virtio-blk support and start looking
> > > at what an ext4 implementation would look like.
> > 
> > Please drop the dm-thin.c and dm-snap.c changes.  dm-snap.c would need
> > more work to provide the type of guarantee XFS requires across
> > snapshot boundaries. I'm inclined to _not_ add dm-snap.c support
> > because it is best to just use dm-thin.
> > 
> > And FYI even your dm-thin patch will be the starting point for the
> > dm-thin support (we'll keep attribution to you for all the code in a
> > separate patch).
> > 
> > > Fair points, I certainly don't want to derail this conversation; I'd
> > > be happy to see this work merged sooner rather than later.
> > 
> > Once those dm target changes are dropped I think the rest of the
> > series is fine to go upstream now.  Feel free to post a v8.
> 
> FYI, I've made my latest code available in this
> 'dm-6.5-provision-support' branch (based on 'dm-6.5'):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.5-provision-support
> 
> It's what v8 should be plus the 2 dm-thin patches (that I don't think
> should go upstream yet, but are theoretically useful for Dave and
> Joe).
> 
> The "dm thin: complete interface for REQ_OP_PROVISION support" commit
> establishes all the dm-thin interface I think is needed.  The FIXME in
> process_provision_bio() (and the patch header) cautions against upper
> layers like XFS using this dm-thinp support quite yet.
> 
> Otherwise we'll have the issue where dm-thinp's REQ_OP_PROVISION
> support initially doesn't provide the guarantee that XFS needs across
> snapshots (which is: snapshots inherit all previous REQ_OP_PROVISION).

Just tag it with EXPERIMENTAL on reception of the first
REQ_OP_PROVISION for the device (i.e. dump a log warning), like I'll
end up doing with XFS when it detects provisioning support at mount
time.

We do this all the time to allow merging new features before they
are fully production ready - EXPERIMENTAL means you can expect it to
mostly work, except when it doesn't, and you know that when it
breaks you get to report the bug, help triage it and as a bonus you
get to keep the broken bits!

$ git grep EXPERIMENTAL fs/xfs
fs/xfs/Kconfig:   This feature is considered EXPERIMENTAL.  Use with caution!
fs/xfs/Kconfig:   This feature is considered EXPERIMENTAL.  Use with caution!
fs/xfs/scrub/scrub.c: "EXPERIMENTAL online scrub feature in use. Use at your own risk!");
fs/xfs/xfs_fsops.c: "EXPERIMENTAL online shrink feature in use. Use at your own risk!");
fs/xfs/xfs_super.c: xfs_warn(mp, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
fs/xfs/xfs_super.c: "EXPERIMENTAL Large extent counts feature in use. Use at your own risk!");
fs/xfs/xfs_xattr.c: "EXPERIMENTAL logged extended attributes feature in use. Use at your own risk!");
$

IOWs, I'll be adding a:

"EXPERIMENTAL block device provisioning in use. Use at your own risk!"

warning to XFS, and it won't get removed until both the XFS and
dm-thinp support is solid and production ready.
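
For concreteness, a minimal sketch of such a mount-time warning,
assuming the queue-limit helper named elsewhere in this thread and an
illustrative hook point (this is not the actual XFS patch):

    /*
     * Sketch: warn once at mount time if the data device advertises
     * REQ_OP_PROVISION support via its queue limits.
     */
    if (bdev_max_provision_sectors(mp->m_ddev_targp->bt_bdev))
        xfs_warn(mp,
            "EXPERIMENTAL block device provisioning in use. Use at your own risk!");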

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-09 Thread Mike Snitzer
On Wed, Jun 07 2023 at  7:27P -0400,
Mike Snitzer  wrote:

> On Mon, Jun 05 2023 at  5:14P -0400,
> Sarthak Kukreti  wrote:
> 
> > On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer  wrote:
> > >
> > > We all just need to focus on your proposal and Joe's dm-thin
> > > reservation design...
> > >
> > > [Sarthak: FYI, this implies that it doesn't really make sense to add
> > > dm-thinp support before Joe's design is implemented.  Otherwise we'll
> > > have 2 different responses to REQ_OP_PROVISION.  The one that is
> > > captured in your patchset isn't adequate to properly handle ensuring
> > > upper layer (like XFS) can depend on the space being available across
> > > snapshot boundaries.]
> > >
> > Ack. Would it be premature for the rest of the series to go through
> > (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
> > targets)? I'd like to start using this as a reference to suggest
> > additions to the virtio-spec for virtio-blk support and start looking
> > at what an ext4 implementation would look like.
> 
> Please drop the dm-thin.c and dm-snap.c changes.  dm-snap.c would need
> more work to provide the type of guarantee XFS requires across
> snapshot boundaries. I'm inclined to _not_ add dm-snap.c support
> because it is best to just use dm-thin.
> 
> And FYI even your dm-thin patch will be the starting point for the
> dm-thin support (we'll keep attribution to you for all the code in a
> separate patch).
> 
> > Fair points, I certainly don't want to derail this conversation; I'd
> > be happy to see this work merged sooner rather than later.
> 
> Once those dm target changes are dropped I think the rest of the
> series is fine to go upstream now.  Feel free to post a v8.

FYI, I've made my latest code available in this
'dm-6.5-provision-support' branch (based on 'dm-6.5'):
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.5-provision-support

It's what v8 should be plus the 2 dm-thin patches (that I don't think
should go upstream yet, but are theoretically useful for Dave and
Joe).

The "dm thin: complete interface for REQ_OP_PROVISION support" commit
establishes all the dm-thin interface I think is needed.  The FIXME in
process_provision_bio() (and the patch header) cautions against upper
layers like XFS using this dm-thinp support quite yet.

Otherwise we'll have the issue where dm-thinp's REQ_OP_PROVISION
support initially doesn't provide the guarantee that XFS needs across
snapshots (which is: snapshots inherit all previous REQ_OP_PROVISION).

Mike



Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-08 Thread Dave Chinner
On Wed, Jun 07, 2023 at 07:50:25PM -0400, Mike Snitzer wrote:
> Do you think you're OK to scope out, and/or implement, the XFS changes
> if you use v7 of this patchset as the starting point? (v8 should just
> be v7 minus the dm-thin.c and dm-snap.c changes).  The thinp
> support in v7 will work enough to allow XFS to issue REQ_OP_PROVISION
> and/or fallocate (via mkfs.xfs) to dm-thin devices.

Yup, XFS only needs blkdev_issue_provision() and
bdev_max_provision_sectors() to be present for the filesystem code. The
initial XFS provisioning detection and fallocate() support is just
under 50 lines of new code...
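
For illustration, the fallocate()-side wiring can be as small as the
sketch below; it assumes the two helpers named above, and their exact
signatures here are an assumption rather than the patchset's:

    /*
     * After allocating blocks for a preallocation request, ask the
     * device to provision the mapped range so that later writes to
     * it won't fail with ENOSPC.
     */
    if (bdev_max_provision_sectors(bdev)) {
        error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
                                       len >> SECTOR_SHIFT, GFP_KERNEL);
        if (error)
            return error;
    }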

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-08 Thread Dave Chinner
On Wed, Jun 07, 2023 at 10:03:40PM -0400, Martin K. Petersen wrote:
> 
> Dave,
> 
> > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but that's
> > what I intended - the operation does not contain data at all. It's an
> > operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it contains a
> > range of sectors that need to be provisioned (or discarded), and
> > nothing else.
> 
> Yep. That's also how SCSI defines it. The act of provisioning a block
> range is done through an UNMAP command using a special flag. All it does
> is pin down those LBAs so future writes to them won't result in ENOSPC.

*nod*

That I knew, and it's one of the reasons I'd like the filesystem <->
block layer provisioning model to head in this direction. i.e. we
don't have to do anything special to enable routing of provisioning
requests to hardware and/or remote block storage devices (e.g.
ceph-rbd, nbd, etc). Hence "external" devices can provide the same
guarantees as native software-only block device implementations
like dm-thinp can provide and everything gets just that little bit
better behaved...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com




Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-07 Thread Martin K. Petersen


Dave,

> Possibly unintentionally, I didn't call it REQ_OP_PROVISION but that's
> what I intended - the operation does not contain data at all. It's an
> operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it contains a
> range of sectors that need to be provisioned (or discarded), and
> nothing else.

Yep. That's also how SCSI defines it. The act of provisioning a block
range is done through an UNMAP command using a special flag. All it does
is pin down those LBAs so future writes to them won't result in ENOSPC.

-- 
Martin K. Petersen  Oracle Linux Engineering




Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-07 Thread Mike Snitzer
On Tue, Jun 06 2023 at 10:01P -0400,
Dave Chinner  wrote:

> On Sat, Jun 03, 2023 at 11:57:48AM -0400, Mike Snitzer wrote:
> > On Fri, Jun 02 2023 at  8:52P -0400,
> > Dave Chinner  wrote:
> > 
> > > Mike, I think you might have misunderstood what I have been proposing.
> > > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
> > > that's what I intended - the operation does not contain data at all.
> > > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it
> > > contains a range of sectors that need to be provisioned (or
> > > discarded), and nothing else.
> > 
> > No, I understood that.
> > 
> > > The write IOs themselves are not tagged with anything special at all.
> > 
> > I know, but I've been looking at how to also handle the delalloc
> > usecase (and yes I know you feel it doesn't need handling, the issue
> > is XFS does deal nicely with ensuring it has space when it tracks its
> > allocations on "thick" storage
> 
> Oh, no it doesn't. It -works for most cases-, but that does not mean
> it provides any guarantees at all. We can still get ENOSPC for user
> data when delayed allocation reservations "run out".
> 
> This may be news to you, but the ephemeral XFS delayed allocation
> space reservation is not accurate. It contains a "fudge factor"
> called "indirect length". This is a "wet finger in the wind"
> estimation of how much new metadata will need to be allocated to
> index the physical allocations when they are made. It assumes large
> data extents are allocated, which is good enough for most cases, but
> it is no guarantee when there are no large data extents available to
> allocate (e.g. near ENOSPC!).
> 
> And therein lies the fundamental problem with ephemeral range
> reservations: at the time of reservation, we don't know how many
> individual physical LBA ranges the reserved data range is actually
> going to span.
> 
> As a result, XFS delalloc reservations are a "close-but-not-quite"
> reservation backed by a global reserve pool that can be dipped into
> if we run out of delalloc reservation. If the reserve pool is then
> fully depleted before all delalloc conversion completes, we'll still
> give ENOSPC. The pool is sized such that the vast majority of
> workloads will complete delalloc conversion successfully before the
> pool is depleted.
> 
> Hence XFS gives everyone the -appearance- that it deals nicely with
> ENOSPC conditions, but it never provides a -guarantee- that any
> accepted write will always succeed without ENOSPC.
> 
> IMO, using this "close-but-not-quite" reservation as the basis of
> space requirements for other layers to provide "won't ENOSPC"
> guarantees is fraught with problems. We already know that it is
> insufficient in important corner cases at the filesystem level, and
> we also know that lower layers trying to do ephemeral space
> reservations will have exactly the same problems providing a
> guarantee. And these are problems we've been unable to engineer
> around in the past, so the likelihood we can engineer around them
> now or in the future is also very unlikely.

Thanks for clarifying. Wasn't aware of XFS delalloc's "wet finger in
the air" ;)

So do you think it reasonable to require applications to fallocate()
their data files? I'm not sure users are aware they need to take that
extra step.
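
(For reference, the extra step in question is just a plain mode-0
fallocate() on the data file before writing; a self-contained example
with an illustrative file name:)

    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>

    int main(void)
    {
        /* Preallocate 1 GiB; later writes into this range should not
         * see ENOSPC on conventionally provisioned storage. */
        int fd = open("data.bin", O_CREAT | O_RDWR, 0644);

        if (fd < 0 || fallocate(fd, 0, 0, 1024LL * 1024 * 1024) != 0)
            err(1, "fallocate");
        return 0;
    }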

> > -- so adding coordination between XFS
> > and dm-thin layers provides comparable safety.. that safety is an
> > expected norm).
> >
> > But rather than discuss in terms of data vs metadata, the distinction
> > is:
> > 1) LBA range reservation (normal case, your proposal)
> > 2) non-LBA reservation (absolute value, LBA range is known at later stage)
> > 
> > But I'm clearly going off script for dwelling on wanting to handle
> > both.
> 
> Right, because if we do 1) then we don't need 2). :)

Sure.

> > My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was
> > a crude simplification for branching between the 2 approaches.
> > 
> > And I understand I made you nervous by expanding the scope to a much
> > more muddled/shitty interface. ;)
> 
> Nervous? No, I'm simply trying to make sure that everyone is on the
> same page. i.e. that if we water down the guarantee that 1) relies
> on, then it's not actually useful to filesystems at all.

Yeah, makes sense.
 
> > > Put simply: if we restrict REQ_OP_PROVISION guarantees to just
> > > REQ_META writes (or any other specific type of write operation) then
> > > it's simply not worth persuing at the filesystem level because the
> > > guarantees we actually need just aren't there and the complexity of
> > > discovering and handling those corner cases just isn't worth the
> > > effort.
> > 
> > Here is where I get to say: I think you misunderstood me (but it was
> > my fault for not being absolutely clear: I'm very much on the same
> > page as you and Joe; and your visions need to just be implemented
> > ASAP).
> 
> OK, good that we've clarified the misunderstandings on both sides
> quickly :)

Do you 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-07 Thread Mike Snitzer
On Mon, Jun 05 2023 at  5:14P -0400,
Sarthak Kukreti  wrote:

> On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer  wrote:
> >
> > We all just need to focus on your proposal and Joe's dm-thin
> > reservation design...
> >
> > [Sarthak: FYI, this implies that it doesn't really make sense to add
> > dm-thinp support before Joe's design is implemented.  Otherwise we'll
> > have 2 different responses to REQ_OP_PROVISION.  The one that is
> > captured in your patchset isn't adequate to properly handle ensuring
> > upper layer (like XFS) can depend on the space being available across
> > snapshot boundaries.]
> >
> Ack. Would it be premature for the rest of the series to go through
> (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
> targets)? I'd like to start using this as a reference to suggest
> additions to the virtio-spec for virtio-blk support and start looking
> at what an ext4 implementation would look like.

Please drop the dm-thin.c and dm-snap.c changes.  dm-snap.c would need
more work to provide the type of guarantee XFS requires across
snapshot boundaries. I'm inclined to _not_ add dm-snap.c support
because it is best to just use dm-thin.

And FYI even your dm-thin patch will be the starting point for the
dm-thin support (we'll keep attribution to you for all the code in a
separate patch).

> Fair points, I certainly don't want to derail this conversation; I'd
> be happy to see this work merged sooner rather than later.

Once those dm target changes are dropped I think the rest of the
series is fine to go upstream now.  Feel free to post a v8.

> For posterity, I'll distill what I said above into the following: I'd like
> a capability for userspace to create thin snapshots that ignore the
> thin volume's provisioned areas. IOW, an opt-in flag which makes
> snapshots fallback to what they do today to provide flexibility to
> userspace to decide the space requirements for the above mentioned
> scenarios, and at the same time, not adding separate corner case
> handling for filesystems. But to reiterate, my intent isn't to pile
> this onto the work you, Mike and Joe have planned; just some insight
> into why I'm in favor of ideas that reduce the snapshot size.

I think it'd be useful to ignore a thin device's reservation for
read-only snapshots.  Adding the ability to create read-only thin
snapshots could make sense (later activations don't necessarily need
to impose read-only, doing so would require some additional work).

Mike



Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-06 Thread Dave Chinner
On Mon, Jun 05, 2023 at 02:14:44PM -0700, Sarthak Kukreti wrote:
> On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer  wrote:
> > On Fri, Jun 02 2023 at  8:52P -0400,
> > Dave Chinner  wrote:
> > > On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> > > > > The only way to distinquish the caller (between on-behalf of user data
> > > > > vs XFS metadata) would be REQ_META?
> > > > >
> > > > > So should dm-thinp have a REQ_META-based distinction? Or just treat
> > > > > all REQ_OP_PROVISION the same?
> > > > >
> > > > I'm in favor of a REQ_META-based distinction.
> > >
> > > Why? What *requirement* is driving the need for this distinction?
> >
> > Think I answered that above, XFS delalloc accounting parity on thinp.
> >
> I actually had a few different use-cases in mind (apart from the user
> data provisioning 'fear' that you pointed out): in essence, there are
> cases where userspace would benefit from having more control over how
> much space a snapshot takes:
> 
> 1) In the original RFC patchset [1], I alluded to this being a
> mechanism for pre-allocating space for preserving space for thin
> logical volumes. The use-case I'd like to explore is delta updatable
> read-only filesystems similar to systemd system extensions [2]: In
> essence:
> a) Preserve space for a 'base' thin logical volume that will contain a
> read-only filesystem on over-the-air installation: for filesystems
> like squashfs and erofs, pretty much the entire image is a compressed
> file that I'd like to reserve space for before installation.
> b) Before update, create a thin snapshot and preserve enough space to
> ensure that a delta update will succeed (eg. block level diff of the
> base image). Then, the update is guaranteed to have disk space to
> succeed (similar to the A-B update guarantees on ChromeOS). On
> success, we merge the snapshot and reserve an update snapshot for the
> next possible update. On failure, we drop the snapshot.

Sounds very similar to the functionality blksnap is supposed to
provide

https://lore.kernel.org/linux-fsdevel/20230404140835.25166-1-sergei.sht...@veeam.com/


> 2) The other idea I wanted to explore was rollback protection for
> stateful filesystem features: in essence, if an update from kernel 4.x
> to 5.y failed very quickly (due to unrelated reasons) and we enabled
> some stateful filesystem features that are only supported on 5.y, we'd
> be able to rollback to 4.x if we used short-lived snapshots (in the
> ChromiumOS world, the lifetime of these snapshots would be < 10s per
> boot).

Not sure that blksnap has a "roll origin back to read-only snapshot"
feature yet, but that's what you'd need for this. i.e. on success,
drop the snapshot. On failure, "roll origin back to snapshot and
reboot".

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-06 Thread Dave Chinner
On Sat, Jun 03, 2023 at 11:57:48AM -0400, Mike Snitzer wrote:
> On Fri, Jun 02 2023 at  8:52P -0400,
> Dave Chinner  wrote:
> 
> > On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> > > On Tue, May 30, 2023 at 8:28 AM Mike Snitzer  wrote:
> > > >
> > > > On Tue, May 30 2023 at 10:55P -0400,
> > > > Joe Thornber  wrote:
> > > >
> > > > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer  
> > > > > wrote:
> > > > >
> > > > > >
> > > > > > Also Joe, for you proposed dm-thinp design where you distinquish
> > > > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > > > LBA-specific hard request?  Whereas REQ_PROVISION on its own 
> > > > > > provides
> > > > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > > > > reserve space to accomodate it).
> > > > > >
> > > > >
> > > > > My proposal only involves 'reserve'.  Provisioning will be done as 
> > > > > part of
> > > > > the usual io path.
> > > >
> > > > OK, I think we'd do well to pin down the top-level block interfaces in
> > > > question. Because this patchset's block interface patch (2/5) header
> > > > says:
> > > >
> > > > "This patch also adds the capability to call fallocate() in mode 0
> > > > on block devices, which will send REQ_OP_PROVISION to the block
> > > > device for the specified range,"
> > > >
> > > > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > > > user of XFS could then use fallocate() for user data -- which would
> > > > cause thinp's reserve to _not_ be used for critical metadata.
> > 
> > Mike, I think you might have misunderstood what I have been proposing.
> > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
> > that's what I intended - the operation does not contain data at all.
> > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it
> > contains a range of sectors that need to be provisioned (or
> > discarded), and nothing else.
> 
> No, I understood that.
> 
> > The write IOs themselves are not tagged with anything special at all.
> 
> I know, but I've been looking at how to also handle the delalloc
> usecase (and yes I know you feel it doesn't need handling, the issue
> is XFS does deal nicely with ensuring it has space when it tracks its
> allocations on "thick" storage

Oh, no it doesn't. It -works for most cases-, but that does not mean
it provides any guarantees at all. We can still get ENOSPC for user
data when delayed allocation reservations "run out".

This may be news to you, but the ephemeral XFS delayed allocation
space reservation is not accurate. It contains a "fudge factor"
called "indirect length". This is a "wet finger in the wind"
estimation of how much new metadata will need to be allocated to
index the physical allocations when they are made. It assumes large
data extents are allocated, which is good enough for most cases, but
it is no guarantee when there are no large data extents available to
allocate (e.g. near ENOSPC!).

And therein lies the fundamental problem with ephemeral range
reservations: at the time of reservation, we don't know how many
individual physical LBA ranges the reserved data range is actually
going to span.

As a result, XFS delalloc reservations are a "close-but-not-quite"
reservation backed by a global reserve pool that can be dipped into
if we run out of delalloc reservation. If the reserve pool is then
fully depleted before all delalloc conversion completes, we'll still
give ENOSPC. The pool is sized such that the vast majority of
workloads will complete delalloc conversion successfully before the
pool is depleted.

Hence XFS gives everyone the -appearance- that it deals nicely with
ENOSPC conditions, but it never provides a -guarantee- that any
accepted write will always succeed without ENOSPC.

IMO, using this "close-but-not-quite" reservation as the basis of
space requirements for other layers to provide "won't ENOSPC"
guarantees is fraught with problems. We already know that it is
insufficient in important corner cases at the filesystem level, and
we also know that lower layers trying to do ephemeral space
reservations will have exactly the same problems providing a
guarantee. And these are problems we've been unable to engineer
around in the past, so the likelihood we can engineer around them
now or in the future is also very unlikely.

> -- so adding coordination between XFS
> and dm-thin layers provides comparable safety.. that safety is an
> expected norm).
>
> But rather than discuss in terms of data vs metadata, the distinction
> is:
> 1) LBA range reservation (normal case, your proposal)
> 2) non-LBA reservation (absolute value, LBA range is known at later stage)
> 
> But I'm clearly going off script for dwelling on wanting to handle
> both.

Right, because if we do 1) then we don't need 2). :)

> My 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-05 Thread Sarthak Kukreti
On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer  wrote:
>
> On Fri, Jun 02 2023 at  8:52P -0400,
> Dave Chinner  wrote:
>
> > On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> > > On Tue, May 30, 2023 at 8:28 AM Mike Snitzer  wrote:
> > > >
> > > > On Tue, May 30 2023 at 10:55P -0400,
> > > > Joe Thornber  wrote:
> > > >
> > > > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer  
> > > > > wrote:
> > > > >
> > > > > >
> > > > > > Also Joe, for you proposed dm-thinp design where you distinquish
> > > > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > > > LBA-specific hard request?  Whereas REQ_PROVISION on its own 
> > > > > > provides
> > > > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > > > > reserve space to accomodate it).
> > > > > >
> > > > >
> > > > > My proposal only involves 'reserve'.  Provisioning will be done as 
> > > > > part of
> > > > > the usual io path.
> > > >
> > > > OK, I think we'd do well to pin down the top-level block interfaces in
> > > > question. Because this patchset's block interface patch (2/5) header
> > > > says:
> > > >
> > > > "This patch also adds the capability to call fallocate() in mode 0
> > > > on block devices, which will send REQ_OP_PROVISION to the block
> > > > device for the specified range,"
> > > >
> > > > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > > > user of XFS could then use fallocate() for user data -- which would
> > > > cause thinp's reserve to _not_ be used for critical metadata.
> >
> > Mike, I think you might have misunderstood what I have been proposing.
> > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
> > that's what I intended - the operation does not contain data at all.
> > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it
> > contains a range of sectors that need to be provisioned (or
> > discarded), and nothing else.
>
> No, I understood that.
>
> > The write IOs themselves are not tagged with anything special at all.
>
> I know, but I've been looking at how to also handle the delalloc
> usecase (and yes I know you feel it doesn't need handling, the issue
> is XFS does deal nicely with ensuring it has space when it tracks its
> allocations on "thick" storage -- so adding coordination between XFS
> and dm-thin layers provides comparable safety.. that safety is an
> expected norm).
>
> But rather than discuss in terms of data vs metadata, the distinction
> is:
> 1) LBA range reservation (normal case, your proposal)
> 2) non-LBA reservation (absolute value, LBA range is known at later stage)
>
> But I'm clearly going off script for dwelling on wanting to handle
> both.
>
> My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was
> a crude simplification for branching between the 2 approaches.
>
> And I understand I made you nervous by expanding the scope to a much
> more muddled/shitty interface. ;)
>
> We all just need to focus on your proposal and Joe's dm-thin
> reservation design...
>
> [Sarthak: FYI, this implies that it doesn't really make sense to add
> dm-thinp support before Joe's design is implemented.  Otherwise we'll
> have 2 different responses to REQ_OP_PROVISION.  The one that is
> captured in your patchset isn't adequate to properly handle ensuring
> upper layer (like XFS) can depend on the space being available across
> snapshot boundaries.]
>
Ack. Would it be premature for the rest of the series to go through
(REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
targets)? I'd like to start using this as a reference to suggest
additions to the virtio-spec for virtio-blk support and start looking
at what an ext4 implementation would look like.

> > i.e. The proposal I made does not use REQ_PROVISION anywhere in the
> > metadata/data IO path; provisioned regions are created by separate
> > operations and must be tracked by the underlying block device, then
> > treat any write IO to those regions as "must not fail w/ ENOSPC"
> > IOs.
> >
> > There seems to be a lot of fear about user data requiring
> > provisioning. This is unfounded - provisioning is only needed for
> > explicitly provisioned space via fallocate(), not every byte of
> > user data written to the filesystem (the model Brian is proposing).
>
> As I mentioned above, I was just trying to get XFS-on-thinp to
> maintain parity with how XFS's delalloc accounting works on "thick"
> storage.
>
> But happy to put that to one side.  Maintain focus like I mentioned
> above.  I'm happy we have momentum and agreement on this design now.
> Rather than be content with that, I was mistakenly looking at other
> aspects and in doing so introduced "noise" before we've implemented
> what we all completely agree on: your and joe's designs.
>
> > Excessive use of fallocate() is 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-03 Thread Mike Snitzer
On Fri, Jun 02 2023 at  8:52P -0400,
Dave Chinner  wrote:

> On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> > On Tue, May 30, 2023 at 8:28 AM Mike Snitzer  wrote:
> > >
> > > On Tue, May 30 2023 at 10:55P -0400,
> > > Joe Thornber  wrote:
> > >
> > > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer  wrote:
> > > >
> > > > >
> > > > > Also Joe, for you proposed dm-thinp design where you distinquish
> > > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > > LBA-specific hard request?  Whereas REQ_PROVISION on its own provides
> > > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > > > reserve space to accomodate it).
> > > > >
> > > >
> > > > My proposal only involves 'reserve'.  Provisioning will be done as part 
> > > > of
> > > > the usual io path.
> > >
> > > OK, I think we'd do well to pin down the top-level block interfaces in
> > > question. Because this patchset's block interface patch (2/5) header
> > > says:
> > >
> > > "This patch also adds the capability to call fallocate() in mode 0
> > > on block devices, which will send REQ_OP_PROVISION to the block
> > > device for the specified range,"
> > >
> > > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > > user of XFS could then use fallocate() for user data -- which would
> > > cause thinp's reserve to _not_ be used for critical metadata.
> 
> Mike, I think you might have misunderstood what I have been proposing.
> Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
> that's what I intended - the operation does not contain data at all.
> It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it
> contains a range of sectors that need to be provisioned (or
> discarded), and nothing else.

No, I understood that.

> The write IOs themselves are not tagged with anything special at all.

I know, but I've been looking at how to also handle the delalloc
usecase (and yes I know you feel it doesn't need handling, the issue
is XFS does deal nicely with ensuring it has space when it tracks its
allocations on "thick" storage -- so adding coordination between XFS
and dm-thin layers provides comparable safety.. that safety is an
expected norm).

But rather than discuss in terms of data vs metadata, the distinction
is:
1) LBA range reservation (normal case, your proposal)
2) non-LBA reservation (absolute value, LBA range is known at later stage)

But I'm clearly going off script for dwelling on wanting to handle
both.

My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was
a crude simplification for branching between the 2 approaches.

And I understand I made you nervous by expanding the scope to a much
more muddled/shitty interface. ;)

We all just need to focus on your proposal and Joe's dm-thin
reservation design...

[Sarthak: FYI, this implies that it doesn't really make sense to add
dm-thinp support before Joe's design is implemented.  Otherwise we'll
have 2 different responses to REQ_OP_PROVISION.  The one that is
captured in your patchset isn't adequate to properly handle ensuring
upper layer (like XFS) can depend on the space being available across
snapshot boundaries.]

> i.e. The proposal I made does not use REQ_PROVISION anywhere in the
> metadata/data IO path; provisioned regions are created by separate
> operations and must be tracked by the underlying block device, then
> treat any write IO to those regions as "must not fail w/ ENOSPC"
> IOs.
> 
> There seems to be a lot of fear about user data requiring
> provisioning. This is unfounded - provisioning is only needed for
> explicitly provisioned space via fallocate(), not every byte of
> user data written to the filesystem (the model Brian is proposing).

As I mentioned above, I was just trying to get XFS-on-thinp to
maintain parity with how XFS's delalloc accounting works on "thick"
storage.

But happy to put that to one side.  Maintain focus like I mentioned
above.  I'm happy we have momentum and agreement on this design now.
Rather than be content with that, I was mistakenly looking at other
aspects and in doing so introduced "noise" before we've implemented
what we all completely agree on: your and joe's designs.

> Excessive use of fallocate() is self correcting - if users and/or
> their applications provision too much, they are going to get ENOSPC
> or have to pay more to expand the backing pool reserves they need.
> But that's not a problem the block device should be trying to solve;
> that's a problem for the sysadmin and/or bean counters to address.
> 
> > >
> > > The only way to distinquish the caller (between on-behalf of user data
> > > vs XFS metadata) would be REQ_META?
> > >
> > > So should dm-thinp have a REQ_META-based distinction? Or just treat
> > > all REQ_OP_PROVISION the same?
> > >
> > I'm in favor of 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-02 Thread Dave Chinner
On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> On Tue, May 30, 2023 at 8:28 AM Mike Snitzer  wrote:
> >
> > On Tue, May 30 2023 at 10:55P -0400,
> > Joe Thornber  wrote:
> >
> > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer  wrote:
> > >
> > > >
> > > > Also Joe, for you proposed dm-thinp design where you distinquish
> > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > LBA-specific hard request?  Whereas REQ_PROVISION on its own provides
> > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > > reserve space to accomodate it).
> > > >
> > >
> > > My proposal only involves 'reserve'.  Provisioning will be done as part of
> > > the usual io path.
> >
> > OK, I think we'd do well to pin down the top-level block interfaces in
> > question. Because this patchset's block interface patch (2/5) header
> > says:
> >
> > "This patch also adds the capability to call fallocate() in mode 0
> > on block devices, which will send REQ_OP_PROVISION to the block
> > device for the specified range,"
> >
> > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > user of XFS could then use fallocate() for user data -- which would
> > cause thinp's reserve to _not_ be used for critical metadata.

Mike, I think you might have misunderstood what I have been proposing.
Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
that's what I intended - the operation does not contain data at all.
It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it
contains a range of sectors that need to be provisioned (or
discarded), and nothing else. The write IOs themselves are not
tagged with anything special at all.

i.e. The proposal I made does not use REQ_PROVISION anywhere in the
metadata/data IO path; provisioned regions are created by separate
operations and must be tracked by the underlying block device, then
treat any write IO to those regions as "must not fail w/ ENOSPC"
IOs.

There seems to be a lot of fear about user data requiring
provisioning. This is unfounded - provisioning is only needed for
explicitly provisioned space via fallocate(), not every byte of
user data written to the filesystem (the model Brian is proposing).

Excessive use of fallocate() is self correcting - if users and/or
their applications provision too much, they are going to get ENOSPC
or have to pay more to expand the backing pool reserves they need.
But that's not a problem the block device should be trying to solve;
that's a problem for the sysadmin and/or bean counters to address.

> >
> > The only way to distinquish the caller (between on-behalf of user data
> > vs XFS metadata) would be REQ_META?
> >
> > So should dm-thinp have a REQ_META-based distinction? Or just treat
> > all REQ_OP_PROVISION the same?
> >
> I'm in favor of a REQ_META-based distinction.

Why? What *requirement* is driving the need for this distinction?

As the person who proposed this new REQ_OP_PROVISION architecture,
I'm dead set against it.  Allowing the block device to provide a set of
poorly defined "conditional guarantees" policies instead of a
mechanism with a single ironclad guarantee defeats the entire
purpose of the proposal. 

We have a requirement from the *kernel ABI* that *user data writes*
must not fail with ENOSPC after an fallocate() operation.  That's
one of the high level policies we need to implement. The filesystem
is already capable of guaranteeing it won't give the user ENOSPC
after fallocate, we now need a guarantee from the filesystem's
backing store that it won't give ENOSPC, too.

The _other thing_ we need to implement is a method of guaranteeing
the filesystem won't shut down when the backing device goes ENOSPC
unexpectedly during metadata writeback.  So we also need the backing
device to guarantee the regions we write metadata to won't give
ENOSPC.

That's the whole point of REQ_OP_PROVISION: from the layers above
the block device, there is -zero- difference between the guarantee
we need for user data writes to avoid ENOSPC and for metadata writes
to avoid ENOSPC. They are one and the same.

Hence if the block device is going to say "I support provisioning"
but then give different conditional guarantees according to the
*type of data* in the IO request, then it does not provide the
functionality the higher layers actually require from it.

Indeed, what type of data the IO contains is *context dependent*.
For example, sometimes we write metadata with user data IO, but
we still need provisioning guarantees as if it was issued as
metadata IO. This is the case for mkfs initialising the file system
by writing directly to the block device.

IOWs, filesystem metadata IO issued from kernel context would be
considered metadata IO, but from userspace it would be considered
normal user data IO and hence treated 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-02 Thread Mike Snitzer
On Fri, Jun 02 2023 at  2:44P -0400,
Sarthak Kukreti  wrote:

> On Tue, May 30, 2023 at 8:28 AM Mike Snitzer  wrote:
> >
> > On Tue, May 30 2023 at 10:55P -0400,
> > Joe Thornber  wrote:
> >
> > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer  wrote:
> > >
> > > >
> > > > Also Joe, for you proposed dm-thinp design where you distinquish
> > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > LBA-specific hard request?  Whereas REQ_PROVISION on its own provides
> > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > > reserve space to accomodate it).
> > > >
> > >
> > > My proposal only involves 'reserve'.  Provisioning will be done as part of
> > > the usual io path.
> >
> > OK, I think we'd do well to pin down the top-level block interfaces in
> > question. Because this patchset's block interface patch (2/5) header
> > says:
> >
> > "This patch also adds the capability to call fallocate() in mode 0
> > on block devices, which will send REQ_OP_PROVISION to the block
> > device for the specified range,"
> >
> > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > user of XFS could then use fallocate() for user data -- which would
> > cause thinp's reserve to _not_ be used for critical metadata.
> >
> > The only way to distinquish the caller (between on-behalf of user data
> > vs XFS metadata) would be REQ_META?
> >
> > So should dm-thinp have a REQ_META-based distinction? Or just treat
> > all REQ_OP_PROVISION the same?
> >
> I'm in favor of a REQ_META-based distinction. Does that imply that
> REQ_META also needs to be passed through the block/filesystem stack
> (eg. REQ_OP_PROVISION + REQ_META on a loop device translates to a
> fallocate() to the underlying file)?

Unclear, I was thinking your REQ_UNSHARE (tied to fallocate) might be
a means to translate REQ_OP_PROVISION + REQ_META to fallocate and have
it perform the LBA-specific provisioning of Joe's design (referenced
below).

> 
> I think that might have applications beyond just provisioning:
> currently, for stacked filesystems (eg filesystems residing in a file
> on top of another filesystem), even if the upper filesystem issues
> read/write requests with REQ_META | REQ_PRIO, these flags are lost in
> translation at the loop device layer.  A flag like the above would
> allow the prioritization of stacked filesystem metadata requests.
> 

Yes, it could prove useful.

> Bringing the discussion back to this series for a bit, I'm still
> waiting on feedback from the Block maintainers before sending out v8
> (which at the moment, only have a
> s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/g). I believe from the conversation
> most of the above is follow up work, but please let me know if you'd
> prefer I add some of this to the current series!

I need a bit more time to work through various aspects of the broader
requirements and the resulting interfaces that fall out.

Joe's design is pretty compelling because it will properly handle
snapshot thin devices:
https://listman.redhat.com/archives/dm-devel/2023-May/054351.html

Here is my latest status:
- Focused on prototype for thinp block reservation (XFS metadata, XFS
  delalloc, fallocate)
- Decided the "dynamic" (non-LBA specific) reservation stuff (old
  prototype code) is best left independent of Joe's design.  So 2
  classes of thinp reservation.
  - Forward-ported the old prototype code that Brian Foster, Joe
Thornber and I worked on years ago.  It needs more careful review
(and very likely will need fixes from Brian and myself).  The XFS
changes are pretty intrusive and likely up for serious debate (as
to whether we even care to handle reservations for user data).
- REQ_OP_PROVISION bio’s with REQ_META will use Joe’s design,
  otherwise data (XFS data and fallocate) will use “dynamic”
  reservation.
  - "dynamic" name is due to the reservation being generic (non-LBA:
not in terms of an LBA range). Also, in-core only; so the associated
“dynamic_reserve_count” accounting is reset to 0 every activation. 
  - Fallocate may require stronger guarantees in the end (in which
case we’ll add a REQ_UNSHARE flag that is selectable from the
fallocate interface) 
- Will try to share common code, but just sorting out highlevel
  interface(s) still...

I'll try to get a git tree together early next week.  It will be the
forward ported "dynamic" prototype code and your latest v7 code with
some additional work to branch accordingly for each class of thinp
reservation.  And I'll use your v7 code as a crude stub for Joe's
approach (branch taken if REQ_META set).
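
The crude branch amounts to something like the sketch below; the
function body is illustrative and the two callees are placeholders for
Joe's LBA-specific design and the "dynamic" reservation path, not real
dm-thin functions:

    static void process_provision_bio(struct thin_c *tc, struct bio *bio)
    {
        /*
         * Illustrative body only: REQ_META-tagged provision bios take
         * the LBA-specific path, everything else takes the in-core
         * "dynamic" reservation path.  Both callees are placeholders.
         */
        if (bio->bi_opf & REQ_META)
            provision_lba_range(tc, bio);            /* Joe's design */
        else
            reserve_dynamic_space(tc, bio_sectors(bio));
    }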

Lastly, here are some additional TODOs I've noted in code earlier in
my review process:

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 0d9301802609..43a6702f9efe 100644
--- a/drivers/md/dm-thin.c
+++ 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-06-02 Thread Sarthak Kukreti
On Tue, May 30, 2023 at 8:28 AM Mike Snitzer  wrote:
>
> On Tue, May 30 2023 at 10:55P -0400,
> Joe Thornber  wrote:
>
> > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer  wrote:
> >
> > >
> > > Also Joe, for you proposed dm-thinp design where you distinquish
> > > between "provision" and "reserve": Would it make sense for REQ_META
> > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > LBA-specific hard request?  Whereas REQ_PROVISION on its own provides
> > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > reserve space to accomodate it).
> > >
> >
> > My proposal only involves 'reserve'.  Provisioning will be done as part of
> > the usual io path.
>
> OK, I think we'd do well to pin down the top-level block interfaces in
> question. Because this patchset's block interface patch (2/5) header
> says:
>
> "This patch also adds the capability to call fallocate() in mode 0
> on block devices, which will send REQ_OP_PROVISION to the block
> device for the specified range,"
>
> So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> user of XFS could then use fallocate() for user data -- which would
> cause thinp's reserve to _not_ be used for critical metadata.
>
> The only way to distinquish the caller (between on-behalf of user data
> vs XFS metadata) would be REQ_META?
>
> So should dm-thinp have a REQ_META-based distinction? Or just treat
> all REQ_OP_PROVISION the same?
>
I'm in favor of a REQ_META-based distinction. Does that imply that
REQ_META also needs to be passed through the block/filesystem stack
(eg. REQ_OP_PROVISION + REQ_META on a loop device translates to a
fallocate() to the underlying file)?


I think that might have applications beyond just provisioning:
currently, for stacked filesystems (eg filesystems residing in a file
on top of another filesystem), even if the upper filesystem issues
read/write requests with REQ_META | REQ_PRIO, these flags are lost in
translation at the loop device layer.  A flag like the above would
allow the prioritization of stacked filesystem metadata requests.
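
(A rough sketch of that translation on the loop-device side; this is
illustrative rather than the code in this series, though the
loop_device backing-file field and request helpers are the usual ones:)

    /*
     * Sketch: translate a REQ_OP_PROVISION request on a loop device
     * into a mode-0 fallocate() on the backing file.
     */
    static int lo_provision(struct loop_device *lo, struct request *rq,
                            loff_t pos)
    {
        struct file *file = lo->lo_backing_file;

        if (!file->f_op->fallocate)
            return -EOPNOTSUPP;

        return file->f_op->fallocate(file, 0, pos, blk_rq_bytes(rq));
    }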


Bringing the discussion back to this series for a bit, I'm still
waiting on feedback from the Block maintainers before sending out v8
(which, at the moment, only has a
s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/g). I believe from the conversation
most of the above is follow up work, but please let me know if you'd
prefer I add some of this to the current series!

Best
Sarthak

> Mike



Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-30 Thread Mike Snitzer
On Tue, May 30 2023 at 10:55P -0400,
Joe Thornber  wrote:

> On Tue, May 30, 2023 at 3:02 PM Mike Snitzer  wrote:
> 
> >
> > Also Joe, for you proposed dm-thinp design where you distinquish
> > between "provision" and "reserve": Would it make sense for REQ_META
> > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > LBA-specific hard request?  Whereas REQ_PROVISION on its own provides
> > more freedom to just reserve the length of blocks? (e.g. for XFS
> > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > reserve space to accomodate it).
> >
> 
> My proposal only involves 'reserve'.  Provisioning will be done as part of
> the usual io path.

OK, I think we'd do well to pin down the top-level block interfaces in
question. Because this patchset's block interface patch (2/5) header
says:

"This patch also adds the capability to call fallocate() in mode 0
on block devices, which will send REQ_OP_PROVISION to the block
device for the specified range,"

So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
user of XFS could then use fallocate() for user data -- which would
cause thinp's reserve to _not_ be used for critical metadata.
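
(Concretely, that user-data path is just a mode-0 fallocate() against
the block device node; a self-contained sketch with an illustrative
device path:)

    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>

    int main(void)
    {
        /* Provision the first 100 MiB of an (illustrative) thin volume;
         * with this series the range goes down as REQ_OP_PROVISION. */
        int fd = open("/dev/mapper/vg-thinvol", O_RDWR);

        if (fd < 0 || fallocate(fd, 0, 0, 100ULL << 20) != 0)
            err(1, "provision");
        return 0;
    }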

The only way to distinguish the caller (between on-behalf of user data
vs XFS metadata) would be REQ_META?

So should dm-thinp have a REQ_META-based distinction? Or just treat
all REQ_OP_PROVISION the same?

Mike



Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-30 Thread Joe Thornber
On Tue, May 30, 2023 at 3:02 PM Mike Snitzer  wrote:

>
> Also Joe, for you proposed dm-thinp design where you distinquish
> between "provision" and "reserve": Would it make sense for REQ_META
> (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> LBA-specific hard request?  Whereas REQ_PROVISION on its own provides
> more freedom to just reserve the length of blocks? (e.g. for XFS
> delalloc where LBA range is unknown, but dm-thinp can be asked to
> reserve space to accomodate it).
>

My proposal only involves 'reserve'.  Provisioning will be done as part of
the usual io path.

- Joe


Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-30 Thread Mike Snitzer
On Tue, May 30 2023 at  3:27P -0400,
Joe Thornber  wrote:

> On Sat, May 27, 2023 at 12:45 AM Dave Chinner  wrote:
> 
> > On Fri, May 26, 2023 at 12:04:02PM +0100, Joe Thornber wrote:
> >
> > > 1) We have an api (ioctl, bio flag, whatever) that lets you
> > > reserve/guarantee a region:
> > >
> > >   int reserve_region(dev, sector_t begin, sector_t end);
> >
> > A C-based interface is not sufficient because the layer that must do
> provisioning is not guaranteed to be directly under the filesystem.
> > We must be able to propagate the request down to the layers that
> > need to provision storage, and that includes hardware devices.
> >
> > e.g. dm-thin would have to issue REQ_PROVISION on the LBA ranges it
> > allocates in it's backing device to guarantee that the provisioned
> > LBA range it allocates is also fully provisioned by the storage
> > below it
> >
> 
> Fine, bio flag it is.
> 
> 
> >
> > >   This api should be used minimally, eg, critical FS metadata only.
> >
> >
> >
> > Plan for having to support tens of GBs of provisioned space in
> > filesystems, not tens of MBs
> >
> 
> Also fine.
> 
> 
> > I think there's a 2-3 solid days of coding to fully implement
> > REQ_PROVISION support in XFS, including userspace tool support.
> > Maybe a couple of weeks more to flush the bugs out before it's
> > largely ready to go.
> >
> > So if there's buy in from the block layer and DM people for
> > REQ_PROVISION as described, then I'll definitely have XFS support
> > ready for you to test whenever dm-thinp is ready to go.
> >
> 
> Great, this is what I wanted to hear.  I guess we need an ack from the
> block guys and then I'll get started.

The block portion is where this discussion started (in the context of
this thread's patchset, now at v7).

During our discussion I think there were 2 gaps identified with this
patchset:

1) provisioning should be opt-in, and we need a clear flag that upper
   layers look for to know if REQ_PROVISION is available
   - we do get this with the max_provision_sectors = 0 default; is
     checking queue_limits (via queue_max_provision_sectors)
     sufficient for upper layers like XFS? (see the sketch below)

2) DM thinp needs REQ_PROVISION passdown support
   - also, dm_table_supports_provision() needs to be stricter by
     requiring _all_ underlying devices support provisioning?

Bonus dm-thinp work: add ranged REQ_PROVISION support to reduce number
of calls (and bios) block core needs to pass to dm thinp.
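
Re (1), a minimal sketch of that opt-in check as an upper layer might
do it at mount time: it assumes queue_max_provision_sectors() is the
queue_limits accessor this patchset adds (returning 0 when the device
has not opted in), and the wrapper name is invented.

  /* Illustrative only: helper placement and name are assumptions. */
  static bool bdev_can_provision(struct block_device *bdev)
  {
          return queue_max_provision_sectors(bdev_get_queue(bdev)) > 0;
  }

A filesystem would call something like this once at mount and simply
keep today's behaviour (no REQ_OP_PROVISION) when it returns false.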

Also Joe, for your proposed dm-thinp design where you distinguish
between "provision" and "reserve": Would it make sense for REQ_META
(e.g. all XFS metadata) with REQ_PROVISION to be treated as an
LBA-specific hard request?  Whereas REQ_PROVISION on its own provides
more freedom to just reserve the length of blocks? (e.g. for XFS
delalloc where LBA range is unknown, but dm-thinp can be asked to
reserve space to accommodate it).

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-30 Thread Joe Thornber
On Sat, May 27, 2023 at 12:45 AM Dave Chinner  wrote:

> On Fri, May 26, 2023 at 12:04:02PM +0100, Joe Thornber wrote:
>
> > 1) We have an api (ioctl, bio flag, whatever) that lets you
> > reserve/guarantee a region:
> >
> >   int reserve_region(dev, sector_t begin, sector_t end);
>
> A C-based interface is not sufficient because the layer that must do
> provisioning is not guaranteed to be directly under the filesystem.
> We must be able to propagate the request down to the layers that
> need to provision storage, and that includes hardware devices.
>
> e.g. dm-thin would have to issue REQ_PROVISION on the LBA ranges it
> allocates in its backing device to guarantee that the provisioned
> LBA range it allocates is also fully provisioned by the storage
> below it
>

Fine, bio flag it is.


>
> >   This api should be used minimally, eg, critical FS metadata only.
>
>
>
> Plan for having to support tens of GBs of provisioned space in
> filesystems, not tens of MBs
>

Also fine.


> I think there's 2-3 solid days of coding to fully implement
> REQ_PROVISION support in XFS, including userspace tool support.
> Maybe a couple of weeks more to flush the bugs out before it's
> largely ready to go.
>
> So if there's buy in from the block layer and DM people for
> REQ_PROVISION as described, then I'll definitely have XFS support
> ready for you to test whenever dm-thinp is ready to go.
>

Great, this is what I wanted to hear.  I guess we need an ack from the
block guys and then I'll get started.

- Joe
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-26 Thread Dave Chinner
On Fri, May 26, 2023 at 12:04:02PM +0100, Joe Thornber wrote:
> Here's my take:
> 
> I don't see why the filesystem cares if thinp is doing a reservation or
> provisioning under the hood.  All that matters is that a future write
> to that region will be honoured (barring device failure etc.).
> 
> I agree that the reservation/force mapped status needs to be inherited
> by snapshots.
> 
> 
> One of the few strengths of thinp is the performance of taking a snapshot.
> Most snapshots created are never activated.  Many other snapshots are
> only alive for a brief period, and used read-only.  eg, blk-archive
> (https://github.com/jthornber/blk-archive) uses snapshots to do very
> fast incremental backups.  As such I'm strongly against any scheme that
> requires provisioning as part of the snapshot operation.
> 
> Hank and I are in the middle of the range tree work which requires a
> metadata change.  So now is a convenient time to piggyback other
> metadata changes to support reservations.
> 
> 
> Given the above this is what I suggest:
> 
> 1) We have an api (ioctl, bio flag, whatever) that lets you
> reserve/guarantee a region:
> 
>   int reserve_region(dev, sector_t begin, sector_t end);

A C-based interface is not sufficient because the layer that must do
provisioning is not guaranteed to be directly under the filesystem.
We must be able to propagate the request down to the layers that
need to provision storage, and that includes hardware devices.

e.g. dm-thin would have to issue REQ_PROVISION on the LBA ranges it
allocates in its backing device to guarantee that the provisioned
LBA range it allocates is also fully provisioned by the storage
below it
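
(As a rough sketch of that passdown, nothing more: blkdev_issue_provision()
is assumed here to be the submission helper added by this patchset, with a
discard-like signature, and the wrapper below is invented.)

  /* Sketch: ask the data device to provision a just-allocated range. */
  static int provision_backing_range(struct block_device *data_dev,
                                     sector_t start, sector_t nr_sects)
  {
          return blkdev_issue_provision(data_dev, start, nr_sects,
                                        GFP_NOIO);
  }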

>   This api should be used minimally, eg, critical FS metadata only.

Keep in mind that "critical FS metadata" in this context is any
metadata which could cause the filesystem to hang or enter a global
error state if an unexpected ENOSPC error occurs during a metadata
write IO.

Which, in pretty much every journalling filesystem, equates to all
metadata in the filesystem. For a typical root filesystem, that
might be in the range of 1-200MB (depending on journal size).
For larger filesystems with lots of files in them, it will be in the
range of GBs of space.

Plan for having to support tens of GBs of provisioned space in
filesystems, not tens of MBs

[snip]

> Now this is a lot of work.  As well as the kernel changes we'll need to
> update the userland tools: thin_check, thin_ls, thin_metadata_unpack,
> thin_rmap, thin_delta, thin_metadata_pack, thin_repair, thin_trim,
> thin_dump, thin_metadata_size, thin_restore.  Are we confident that we
> have buy in from the FS teams that this will be widely adopted?  Are users
> asking for this?  I really don't want to do 6 months of work for nothing.

I think there's 2-3 solid days of coding to fully implement
REQ_PROVISION support in XFS, including userspace tool support.
Maybe a couple of weeks more to flush the bugs out before it's
largely ready to go.

So if there's buy in from the block layer and DM people for
REQ_PROVISION as described, then I'll definitely have XFS support
ready for you to test whenever dm-thinp is ready to go.

I can't speak for other filesystems, I suspect the only one we care
about is ext4.  btrfs and f2fs don't need dm-thinp and there aren't
any other filesystems that are used in production on top of
dm-thinp, so I think only XFS and ext4 matter at this point in time.

I suspect that ext4 would be fairly easy to add support for as well.
ext4 has a lot more fixed-place metadata than XFS has, so much more
of its metadata is covered by mkfs-time provisioning. Limiting
dynamic metadata to specific fully provisioned block groups and
provisioning new block groups for metadata when they are near full
would be equivalent to how I plan to provision metadata space in
XFS. Hence the implementation for ext4 looks to be broadly similar
in scope and complexity to XFS
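
(Purely to make that "provision another chunk when the current one is
near full" idea concrete: every example_* name below is invented, and
blkdev_issue_provision() is assumed to be the submission helper added
by this patchset.)

  #include <linux/blkdev.h>

  #define META_CHUNK_SECTORS (100 * 2048)   /* 100MB in 512-byte sectors */

  /* Sketch: top up the fs-internal pre-provisioned metadata region. */
  static int example_top_up_meta_space(struct example_fs *fs)
  {
          sector_t start;

          if (example_meta_free(fs) > example_low_watermark(fs))
                  return 0;               /* plenty of provisioned space */

          if (example_find_free_range(fs, META_CHUNK_SECTORS, &start))
                  return -ENOSPC;         /* filesystem itself is full */

          /* ask the thin device to guarantee the range is writable */
          if (blkdev_issue_provision(fs->bdev, start, META_CHUNK_SECTORS,
                                     GFP_NOFS))
                  return -ENOSPC;         /* backing pool is out of space */

          example_add_meta_region(fs, start, META_CHUNK_SECTORS);
          return 0;
  }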

-Dave.
-- 
Dave Chinner
da...@fromorbit.com

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-26 Thread Brian Foster
On Thu, May 25, 2023 at 07:35:14PM -0700, Sarthak Kukreti wrote:
> On Thu, May 25, 2023 at 6:36 PM Dave Chinner  wrote:
> >
> > On Thu, May 25, 2023 at 03:47:21PM -0700, Sarthak Kukreti wrote:
> > > On Thu, May 25, 2023 at 9:00 AM Mike Snitzer  wrote:
> > > > On Thu, May 25 2023 at  7:39P -0400,
> > > > Dave Chinner  wrote:
> > > > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > > > > > On Tue, May 23 2023 at  8:40P -0400,
> > > > > > Dave Chinner  wrote:
> > > > > > > It's worth noting that XFS already has a coarse-grained
> > > > > > > implementation of preferred regions for metadata storage. It will
> > > > > > > currently not use those metadata-preferred regions for user data
> > > > > > > unless all the remaining user data space is full.  Hence I'm 
> > > > > > > pretty
> > > > > > > sure that a pre-provisioning enhancment like this can be done
> > > > > > > entirely in-memory without requiring any new on-disk state to be
> > > > > > > added.
> > > > > > >
> > > > > > > Sure, if we crash and remount, then we might chose a different LBA
> > > > > > > region for pre-provisioning. But that's not really a huge deal as 
> > > > > > > we
> > > > > > > could also run an internal background post-mount fstrim operation 
> > > > > > > to
> > > > > > > remove any unused pre-provisioning that was left over from when 
> > > > > > > the
> > > > > > > system went down.
> > > > > >
> > > > > > This would be the FITRIM with extension you mention below? Which is 
> > > > > > a
> > > > > > filesystem interface detail?
> > > > >
> > > > > No. We might reuse some of the internal infrastructure we use to
> > > > > implement FITRIM, but that's about it. It's just something kinda
> > > > > like FITRIM but with different constraints determined by the
> > > > > filesystem rather than the user...
> > > > >
> > > > > As it is, I'm not sure we'd even need it - a preiodic userspace
> > > > > FITRIM would acheive the same result, so leaked provisioned spaces
> > > > > would get cleaned up eventually without the filesystem having to do
> > > > > anything specific...
> > > > >
> > > > > > So dm-thinp would _not_ need to have new
> > > > > > state that tracks "provisioned but unused" block?
> > > > >
> > > > > No idea - that's your domain. :)
> > > > >
> > > > > dm-snapshot, for certain, will need to track provisioned regions
> > > > > because it has to guarantee that overwrites to provisioned space in
> > > > > the origin device will always succeed. Hence it needs to know how
> > > > > much space breaking sharing in provisioned regions after a snapshot
> > > > > has been taken with be required...
> > > >
> > > > dm-thinp offers its own much more scalable snapshot support (doesn't
> > > > use old dm-snapshot N-way copyout target).
> > > >
> > > > dm-snapshot isn't going to be modified to support this level of
> > > > hardening (dm-snapshot is basically in "maintenance only" now).
> >
> > Ah, of course. Sorry for the confusion, I was kinda using
> > dm-snapshot as shorthand for "dm-thinp + snapshots".
> >
> > > > But I understand your meaning: what you said is 100% applicable to
> > > > dm-thinp's snapshot implementation and needs to be accounted for in
> > > > thinp's metadata (inherent 'provisioned' flag).
> >
> > *nod*
> >
> > > A bit orthogonal: would dm-thinp need to differentiate between
> > > user-triggered provision requests (eg. from fallocate()) vs
> > > fs-triggered requests?
> >
> > Why?  How is the guarantee the block device has to provide to
> > provisioned areas different for user vs filesystem internal
> > provisioned space?
> >
> After thinking this through, I stand corrected. I was primarily
> concerned with how this would balloon thin snapshot sizes if users
> potentially provision a large chunk of the filesystem but that's
> putting the cart way before the horse.
> 

I think that's a legitimate concern. At some point, to provide full
-ENOSPC protection, the filesystem needs to provision space before it
writes to it, whether it be data or metadata, right? At what point does
that turn into a case where pretty much everything the fs wrote is
provisioned, and therefore a snapshot is just a full copy operation?

That might be OK I guess, but if that's an eventuality then what's the
need to track provision state at the dm-thin block level? Using some
kind of flag as you mention below could be a good way to qualify which
blocks you'd want to copy vs. which to share on snapshot and perhaps
mitigate that problem.

> Best
> Sarthak
> 
> > > I would lean towards user provisioned areas not
> > > getting dedup'd on snapshot creation,
> >
> > 
> >
> > Snapshotting is a clone operation, not a dedupe operation.
> >
> > Yes, the end result of both is that you have a block shared between
> > multiple indexes that needs COW on the next overwrite, but the two
> > operations that get to that point are very different...
> >
> > 
> >
> > > but that would entail tracking
> > > the state of the original request and possibly a 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-26 Thread Brian Foster
On Fri, May 26, 2023 at 07:37:43PM +1000, Dave Chinner wrote:
> On Thu, May 25, 2023 at 12:19:47PM -0400, Brian Foster wrote:
> > On Wed, May 24, 2023 at 10:40:34AM +1000, Dave Chinner wrote:
> > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster  
> > > > wrote:
> > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > > ... since I also happen to think there is a potentially interesting
> > > > > development path to make this sort of reserve pool configurable in 
> > > > > terms
> > > > > of size and active/inactive state, which would allow the fs to use an
> > > > > emergency pool scheme for managing metadata provisioning and not have 
> > > > > to
> > > > > track and provision individual metadata buffers at all (dealing with
> > > > > user data is much easier to provision explicitly). So the space
> > > > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > > > filesystems that want more granularity for better behavior could 
> > > > > achieve
> > > > > that with more work. Filesystems that don't would be free to rely on 
> > > > > the
> > > > > simple/basic mechanism provided by dm-thin and still have basic 
> > > > > -ENOSPC
> > > > > protection with very minimal changes.
> > > > > 
> > > > > That's getting too far into the weeds on the future bits, though. This
> > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if 
> > > > > there's
> > > > > sufficient interest in this sort of "reserve mode" approach to try and
> > > > > clean it up further and have dm guys look at it, or if you guys see 
> > > > > any
> > > > > obvious issues in what it does that makes it potentially problematic, 
> > > > > or
> > > > > if you would just prefer to go down the path described above...
> > > > 
> > > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > > sticky (by provisioning same blocks for snapshot) seems more useful to
> > > > me because it is quite precise.  That said, it doesn't account for
> > > > hard requirements that _all_ blocks will always succeed.
> > > 
> > > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> > > but I don't think we'd ever need a hard guarantee from the block
> > > device that every write bio issued from the filesystem will succeed
> > > without ENOSPC.
> > > 
> > 
> > The bigger picture goal that I didn't get into in my previous mail is
> > the "full device" reservation model is intended to be a simple, crude
> > reference implementation that can be enabled for any arbitrary thin
> > volume consumer (filesystem or application). The idea is to build that
> > on a simple enough reservation mechanism that any such consumer could
> > override it based on its own operational model. The goal is to guarantee
> > that a particular filesystem never receives -ENOSPC from dm-thin on
> > writes, but the first phase of implementing that is to simply guarantee
> > every block is writeable.
> > 
> > As a specific filesystem is able to more explicitly provision its own
> > allocations in a way that it can guarantee to return -ENOSPC from
> > dm-thin up front (rather than at write bio time), it can reduce the need
> > for the amount of reservation required, ultimately to zero if that
> > filesystem provides the ability to pre-provision all of its writes to
> > storage in some way or another.
> > 
> > I think for filesystems with complex metadata management like XFS, it's
> > not very realistic to expect explicit 1-1 provisioning for all metadata
> > changes on a per-transaction basis in the same way that can (fairly
> > easily) be done for data, which means a pool mechanism is probably still
> > needed for the metadata class of writes.
> 
> I'm trying to avoid the need for 1-1 provisioning and the need for an
> accounting-based reservation pool approach. I've tried the
> reservation pool thing several times, and they've all collapsed
> under the complexity of behaviour guarantees under worst case write
> amplification situations.
> 
> The whole point of the LBA provisioning approach is that it
> completely avoids the need to care about write amplification because
> the underlying device guarantees any write to an LBA that is
> provisioned will succeed. It takes care of the write amplification
> problem for us, and we can make it even easier for the backing
> device by aligning LBA range provision requests to device region
> sizes.
> 
> > > If the block device can provide a guarantee that a provisioned LBA
> > > range is always writable, then everything else is a filesystem level
> > > optimisation problem and we don't have to involve the block device
> > > in any way. All we need is a flag we can ready out of the bdev at
> > > mount time to determine if the filesystem should be operating with
> > > LBA provisioning enabled...
> > > 
> > > e.g. If we need to "pre-provision" a chunk of the LBA space for
> > > filesystem metadata, we 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-26 Thread Joe Thornber
Here's my take:

I don't see why the filesystem cares if thinp is doing a reservation or
provisioning under the hood.  All that matters is that a future write
to that region will be honoured (barring device failure etc.).

I agree that the reservation/force mapped status needs to be inherited
by snapshots.


One of the few strengths of thinp is the performance of taking a snapshot.
Most snapshots created are never activated.  Many other snapshots are
only alive for a brief period, and used read-only.  eg, blk-archive
(https://github.com/jthornber/blk-archive) uses snapshots to do very
fast incremental backups.  As such I'm strongly against any scheme that
requires provisioning as part of the snapshot operation.

Hank and I are in the middle of the range tree work which requires a
metadata change.  So now is a convenient time to piggyback other
metadata changes to support reservations.


Given the above this is what I suggest:

1) We have an api (ioctl, bio flag, whatever) that lets you
reserve/guarantee a region:

  int reserve_region(dev, sector_t begin, sector_t end);

  This api should be used minimally, eg, critical FS metadata only.

2) Each thinp device will gain two extra fields in the metadata:

  - Total reserved (TR)
  - reserved actually provisioned (RAP)

  The difference between these two is the amount of space we need to
  guarantee for this device.

3) Each individual mapping for a device will gain a flag indicating if
   it's 'reserved'.
   We will need to support 'reserved but unmapped' mappings.  There are
   two provisioning events:

  - unmapped but reserved -> mapped reserved.  Initial provision, RAP
    incremented.
  - mapped and reserved -> i.e. break sharing after snapshot.  Again RAP
    incremented.

4) When a snapshot is taken:
  - Reset RAP to zero for origin
  - New snap has Total reserved set to that of origin, RAP <- 0
  - All mappings are shared, so the reserved flags for each mapping are
    preserved
  - Only allow snapshot if there is enough free space for the new
    reserve pool.


5) Reserve for the whole pool is the sum of (TR - RAP) for all devices.
   This pool can only be touched if we're provisioning for a reserved
   mapping.

One drawback with this scheme is the double accounting due to RAP being
reset to zero for the origin device.  This comes about because after the
snapshot operation the two devices are equal: no sense of which device
is the origin is preserved or needed.  So a write to a given block
will trigger provisioning on whichever device it is written to first.
A later write to the other device will not trigger a provision.  I think
this problem can be solved without too much complexity.
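
(To make the bookkeeping in (2)-(5) concrete, a rough sketch of the
snapshot-time accounting described above.  The struct, field and
function names are invented for this example, and the free-space check
is simplified; this is not the dm-thin metadata format.)

  #include <stdint.h>
  #include <errno.h>

  struct thin_dev_meta {
          uint64_t tr;    /* Total reserved (TR) */
          uint64_t rap;   /* reserved actually provisioned (RAP) */
  };

  /* (5) space the pool must hold back: sum of (TR - RAP) over devices */
  static uint64_t pool_reserve(const struct thin_dev_meta *devs, int n)
  {
          uint64_t total = 0;

          for (int i = 0; i < n; i++)
                  total += devs[i].tr - devs[i].rap;
          return total;
  }

  /* (4) on snapshot: snap inherits TR, both RAPs drop to zero */
  static int take_snapshot(struct thin_dev_meta *origin,
                           struct thin_dev_meta *snap,
                           uint64_t pool_free_blocks)
  {
          /* simplified; see the double-accounting caveat above */
          if (pool_free_blocks < origin->tr)
                  return -ENOSPC;

          snap->tr = origin->tr;
          snap->rap = 0;
          origin->rap = 0;
          /* mappings stay shared; per-mapping 'reserved' flags persist */
          return 0;
  }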


Now this is a lot of work.  As well as the kernel changes we'll need to
update the userland tools: thin_check, thin_ls, thin_metadata_unpack,
thin_rmap, thin_delta, thin_metadata_pack, thin_repair, thin_trim,
thin_dump, thin_metadata_size, thin_restore.  Are we confident that we
have buy in from the FS teams that this will be widely adopted?  Are users
asking for this?  I really don't want to do 6 months of work for nothing.

- Joe

On Fri, May 26, 2023 at 10:37 AM Dave Chinner  wrote:

> On Thu, May 25, 2023 at 12:19:47PM -0400, Brian Foster wrote:
> > On Wed, May 24, 2023 at 10:40:34AM +1000, Dave Chinner wrote:
> > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <
> bfos...@redhat.com> wrote:
> > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > > ... since I also happen to think there is a potentially interesting
> > > > > development path to make this sort of reserve pool configurable in
> terms
> > > > > of size and active/inactive state, which would allow the fs to use
> an
> > > > > emergency pool scheme for managing metadata provisioning and not
> have to
> > > > > track and provision individual metadata buffers at all (dealing
> with
> > > > > user data is much easier to provision explicitly). So the space
> > > > > inefficiency thing is potentially just a tradeoff for simplicity,
> and
> > > > > filesystems that want more granularity for better behavior could
> achieve
> > > > > that with more work. Filesystems that don't would be free to rely
> on the
> > > > > simple/basic mechanism provided by dm-thin and still have basic
> -ENOSPC
> > > > > protection with very minimal changes.
> > > > >
> > > > > That's getting too far into the weeds on the future bits, though.
> This
> > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if
> there's
> > > > > sufficient interest in this sort of "reserve mode" approach to try
> and
> > > > > clean it up further and have dm guys look at it, or if you guys
> see any
> > > > > obvious issues in what it does that makes it potentially
> problematic, or
> > > > > if you would just prefer to go down the path described above...
> > > >
> > > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > > sticky (by 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-26 Thread Dave Chinner
On Thu, May 25, 2023 at 12:19:47PM -0400, Brian Foster wrote:
> On Wed, May 24, 2023 at 10:40:34AM +1000, Dave Chinner wrote:
> > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster  
> > > wrote:
> > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > ... since I also happen to think there is a potentially interesting
> > > > development path to make this sort of reserve pool configurable in terms
> > > > of size and active/inactive state, which would allow the fs to use an
> > > > emergency pool scheme for managing metadata provisioning and not have to
> > > > track and provision individual metadata buffers at all (dealing with
> > > > user data is much easier to provision explicitly). So the space
> > > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > > filesystems that want more granularity for better behavior could achieve
> > > > that with more work. Filesystems that don't would be free to rely on the
> > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > > > protection with very minimal changes.
> > > > 
> > > > That's getting too far into the weeds on the future bits, though. This
> > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > > > sufficient interest in this sort of "reserve mode" approach to try and
> > > > clean it up further and have dm guys look at it, or if you guys see any
> > > > obvious issues in what it does that makes it potentially problematic, or
> > > > if you would just prefer to go down the path described above...
> > > 
> > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > sticky (by provisioning same blocks for snapshot) seems more useful to
> > > me because it is quite precise.  That said, it doesn't account for
> > > hard requirements that _all_ blocks will always succeed.
> > 
> > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> > but I don't think we'd ever need a hard guarantee from the block
> > device that every write bio issued from the filesystem will succeed
> > without ENOSPC.
> > 
> 
> The bigger picture goal that I didn't get into in my previous mail is
> the "full device" reservation model is intended to be a simple, crude
> reference implementation that can be enabled for any arbitrary thin
> volume consumer (filesystem or application). The idea is to build that
> on a simple enough reservation mechanism that any such consumer could
> override it based on its own operational model. The goal is to guarantee
> that a particular filesystem never receives -ENOSPC from dm-thin on
> writes, but the first phase of implementing that is to simply guarantee
> every block is writeable.
> 
> As a specific filesystem is able to more explicitly provision its own
> allocations in a way that it can guarantee to return -ENOSPC from
> dm-thin up front (rather than at write bio time), it can reduce the need
> for the amount of reservation required, ultimately to zero if that
> filesystem provides the ability to pre-provision all of its writes to
> storage in some way or another.
> 
> I think for filesystems with complex metadata management like XFS, it's
> not very realistic to expect explicit 1-1 provisioning for all metadata
> changes on a per-transaction basis in the same way that can (fairly
> easily) be done for data, which means a pool mechanism is probably still
> needed for the metadata class of writes.

I'm trying to avoid the need for 1-1 provisioning and the need for an
accounting-based reservation pool approach. I've tried the
reservation pool thing several times, and they've all collapsed
under the complexity of behaviour guarantees under worst case write
amplification situations.

The whole point of the LBA provisioning approach is that it
completely avoids the need to care about write amplification because
the underlying device guarantees any write to an LBA that is
provisioned will succeed. It takes care of the write amplification
problem for us, and we can make it even easier for the backing
device by aligning LBA range provision requests to device region
sizes.

> > If the block device can provide a guarantee that a provisioned LBA
> > range is always writable, then everything else is a filesystem level
> > optimisation problem and we don't have to involve the block device
> > in any way. All we need is a flag we can ready out of the bdev at
> > mount time to determine if the filesystem should be operating with
> > LBA provisioning enabled...
> > 
> > e.g. If we need to "pre-provision" a chunk of the LBA space for
> > filesystem metadata, we can do that ahead of time and track the
> > pre-provisioned range(s) in the filesystem itself.
> > 
> > In XFS, That could be as simple as having small chunks of each AG
> > reserved to metadata (e.g. start with the first 100MB) and limiting
> > all metadata allocation free space searches to that specific block
> 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-25 Thread Sarthak Kukreti
On Thu, May 25, 2023 at 6:36 PM Dave Chinner  wrote:
>
> On Thu, May 25, 2023 at 03:47:21PM -0700, Sarthak Kukreti wrote:
> > On Thu, May 25, 2023 at 9:00 AM Mike Snitzer  wrote:
> > > On Thu, May 25 2023 at  7:39P -0400,
> > > Dave Chinner  wrote:
> > > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > > > > On Tue, May 23 2023 at  8:40P -0400,
> > > > > Dave Chinner  wrote:
> > > > > > It's worth noting that XFS already has a coarse-grained
> > > > > > implementation of preferred regions for metadata storage. It will
> > > > > > currently not use those metadata-preferred regions for user data
> > > > > > unless all the remaining user data space is full.  Hence I'm pretty
> > > > > > sure that a pre-provisioning enhancment like this can be done
> > > > > > entirely in-memory without requiring any new on-disk state to be
> > > > > > added.
> > > > > >
> > > > > > Sure, if we crash and remount, then we might chose a different LBA
> > > > > > region for pre-provisioning. But that's not really a huge deal as we
> > > > > > could also run an internal background post-mount fstrim operation to
> > > > > > remove any unused pre-provisioning that was left over from when the
> > > > > > system went down.
> > > > >
> > > > > This would be the FITRIM with extension you mention below? Which is a
> > > > > filesystem interface detail?
> > > >
> > > > No. We might reuse some of the internal infrastructure we use to
> > > > implement FITRIM, but that's about it. It's just something kinda
> > > > like FITRIM but with different constraints determined by the
> > > > filesystem rather than the user...
> > > >
> > > > As it is, I'm not sure we'd even need it - a preiodic userspace
> > > > FITRIM would acheive the same result, so leaked provisioned spaces
> > > > would get cleaned up eventually without the filesystem having to do
> > > > anything specific...
> > > >
> > > > > So dm-thinp would _not_ need to have new
> > > > > state that tracks "provisioned but unused" block?
> > > >
> > > > No idea - that's your domain. :)
> > > >
> > > > dm-snapshot, for certain, will need to track provisioned regions
> > > > because it has to guarantee that overwrites to provisioned space in
> > > > the origin device will always succeed. Hence it needs to know how
> > > > much space breaking sharing in provisioned regions after a snapshot
> > > > has been taken with be required...
> > >
> > > dm-thinp offers its own much more scalable snapshot support (doesn't
> > > use old dm-snapshot N-way copyout target).
> > >
> > > dm-snapshot isn't going to be modified to support this level of
> > > hardening (dm-snapshot is basically in "maintenance only" now).
>
> Ah, of course. Sorry for the confusion, I was kinda using
> dm-snapshot as shorthand for "dm-thinp + snapshots".
>
> > > But I understand your meaning: what you said is 100% applicable to
> > > dm-thinp's snapshot implementation and needs to be accounted for in
> > > thinp's metadata (inherent 'provisioned' flag).
>
> *nod*
>
> > A bit orthogonal: would dm-thinp need to differentiate between
> > user-triggered provision requests (eg. from fallocate()) vs
> > fs-triggered requests?
>
> Why?  How is the guarantee the block device has to provide to
> provisioned areas different for user vs filesystem internal
> provisioned space?
>
After thinking this through, I stand corrected. I was primarily
concerned with how this would balloon thin snapshot sizes if users
potentially provision a large chunk of the filesystem but that's
putting the cart way before the horse.

Best
Sarthak

> > I would lean towards user provisioned areas not
> > getting dedup'd on snapshot creation,
>
> 
>
> Snapshotting is a clone operation, not a dedupe operation.
>
> Yes, the end result of both is that you have a block shared between
> multiple indexes that needs COW on the next overwrite, but the two
> operations that get to that point are very different...
>
> 
>
> > but that would entail tracking
> > the state of the original request and possibly a provision request
> > flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag
> > (REQ_PROVISION_NODEDUP). Possibly too convoluted...
>
> Let's not try to add everyone's favourite pony to this interface
> before we've even got it off the ground.
>
> It's the simple precision of the API, the lack of cross-layer
> communication requirements and the ability to implement and optimise
> the independent layers independently that makes this a very
> appealing solution.
>
> We need to start with getting the simple stuff working and prove the
> concept. Then once we can observe the behaviour of a working system
> we can start working on optimising individual layers for efficiency
> and performance
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-25 Thread Dave Chinner
On Thu, May 25, 2023 at 03:47:21PM -0700, Sarthak Kukreti wrote:
> On Thu, May 25, 2023 at 9:00 AM Mike Snitzer  wrote:
> > On Thu, May 25 2023 at  7:39P -0400,
> > Dave Chinner  wrote:
> > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > > > On Tue, May 23 2023 at  8:40P -0400,
> > > > Dave Chinner  wrote:
> > > > > It's worth noting that XFS already has a coarse-grained
> > > > > implementation of preferred regions for metadata storage. It will
> > > > > currently not use those metadata-preferred regions for user data
> > > > > unless all the remaining user data space is full.  Hence I'm pretty
> > > > > sure that a pre-provisioning enhancment like this can be done
> > > > > entirely in-memory without requiring any new on-disk state to be
> > > > > added.
> > > > >
> > > > > Sure, if we crash and remount, then we might chose a different LBA
> > > > > region for pre-provisioning. But that's not really a huge deal as we
> > > > > could also run an internal background post-mount fstrim operation to
> > > > > remove any unused pre-provisioning that was left over from when the
> > > > > system went down.
> > > >
> > > > This would be the FITRIM with extension you mention below? Which is a
> > > > filesystem interface detail?
> > >
> > > No. We might reuse some of the internal infrastructure we use to
> > > implement FITRIM, but that's about it. It's just something kinda
> > > like FITRIM but with different constraints determined by the
> > > filesystem rather than the user...
> > >
> > > As it is, I'm not sure we'd even need it - a preiodic userspace
> > > FITRIM would acheive the same result, so leaked provisioned spaces
> > > would get cleaned up eventually without the filesystem having to do
> > > anything specific...
> > >
> > > > So dm-thinp would _not_ need to have new
> > > > state that tracks "provisioned but unused" block?
> > >
> > > No idea - that's your domain. :)
> > >
> > > dm-snapshot, for certain, will need to track provisioned regions
> > > because it has to guarantee that overwrites to provisioned space in
> > > the origin device will always succeed. Hence it needs to know how
> > > much space breaking sharing in provisioned regions after a snapshot
> > > has been taken with be required...
> >
> > dm-thinp offers its own much more scalable snapshot support (doesn't
> > use old dm-snapshot N-way copyout target).
> >
> > dm-snapshot isn't going to be modified to support this level of
> > hardening (dm-snapshot is basically in "maintenance only" now).

Ah, of course. Sorry for the confusion, I was kinda using
dm-snapshot as shorthand for "dm-thinp + snapshots".

> > But I understand your meaning: what you said is 100% applicable to
> > dm-thinp's snapshot implementation and needs to be accounted for in
> > thinp's metadata (inherent 'provisioned' flag).

*nod*

> A bit orthogonal: would dm-thinp need to differentiate between
> user-triggered provision requests (eg. from fallocate()) vs
> fs-triggered requests?

Why?  How is the guarantee the block device has to provide to
provisioned areas different for user vs filesystem internal
provisioned space?

> I would lean towards user provisioned areas not
> getting dedup'd on snapshot creation,



Snapshotting is a clone operation, not a dedupe operation.

Yes, the end result of both is that you have a block shared between
multiple indexes that needs COW on the next overwrite, but the two
operations that get to that point are very different...



> but that would entail tracking
> the state of the original request and possibly a provision request
> flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag
> (REQ_PROVISION_NODEDUP). Possibly too convoluted...

Let's not try to add everyone's favourite pony to this interface
before we've even got it off the ground.

It's the simple precision of the API, the lack of cross-layer
communication requirements and the ability to implement and optimise
the independent layers independently that makes this a very
appealing solution.

We need to start with getting the simple stuff working and prove the
concept. Then once we can observe the behaviour of a working system
we can start working on optimising individual layers for efficiency
and performance

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-25 Thread Sarthak Kukreti
On Thu, May 25, 2023 at 9:00 AM Mike Snitzer  wrote:
>
> On Thu, May 25 2023 at  7:39P -0400,
> Dave Chinner  wrote:
>
> > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > > On Tue, May 23 2023 at  8:40P -0400,
> > > Dave Chinner  wrote:
> > >
> > > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster 
> > > > >  wrote:
> > > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > > > ... since I also happen to think there is a potentially interesting
> > > > > > development path to make this sort of reserve pool configurable in 
> > > > > > terms
> > > > > > of size and active/inactive state, which would allow the fs to use 
> > > > > > an
> > > > > > emergency pool scheme for managing metadata provisioning and not 
> > > > > > have to
> > > > > > track and provision individual metadata buffers at all (dealing with
> > > > > > user data is much easier to provision explicitly). So the space
> > > > > > inefficiency thing is potentially just a tradeoff for simplicity, 
> > > > > > and
> > > > > > filesystems that want more granularity for better behavior could 
> > > > > > achieve
> > > > > > that with more work. Filesystems that don't would be free to rely 
> > > > > > on the
> > > > > > simple/basic mechanism provided by dm-thin and still have basic 
> > > > > > -ENOSPC
> > > > > > protection with very minimal changes.
> > > > > >
> > > > > > That's getting too far into the weeds on the future bits, though. 
> > > > > > This
> > > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if 
> > > > > > there's
> > > > > > sufficient interest in this sort of "reserve mode" approach to try 
> > > > > > and
> > > > > > clean it up further and have dm guys look at it, or if you guys see 
> > > > > > any
> > > > > > obvious issues in what it does that makes it potentially 
> > > > > > problematic, or
> > > > > > if you would just prefer to go down the path described above...
> > > > >
> > > > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > > > sticky (by provisioning same blocks for snapshot) seems more useful to
> > > > > me because it is quite precise.  That said, it doesn't account for
> > > > > hard requirements that _all_ blocks will always succeed.
> > > >
> > > > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> > > > but I don't think we'd ever need a hard guarantee from the block
> > > > device that every write bio issued from the filesystem will succeed
> > > > without ENOSPC.
> > > >
> > > > If the block device can provide a guarantee that a provisioned LBA
> > > > range is always writable, then everything else is a filesystem level
> > > > optimisation problem and we don't have to involve the block device
> > > > in any way. All we need is a flag we can ready out of the bdev at
> > > > mount time to determine if the filesystem should be operating with
> > > > LBA provisioning enabled...
> > > >
> > > > e.g. If we need to "pre-provision" a chunk of the LBA space for
> > > > filesystem metadata, we can do that ahead of time and track the
> > > > pre-provisioned range(s) in the filesystem itself.
> > > >
> > > > In XFS, That could be as simple as having small chunks of each AG
> > > > reserved to metadata (e.g. start with the first 100MB) and limiting
> > > > all metadata allocation free space searches to that specific block
> > > > range. When we run low on that space, we pre-provision another 100MB
> > > > chunk and then allocate all metadata out of that new range. If we
> > > > start getting ENOSPC to pre-provisioning, then we reduce the size of
> > > > the regions and log low space warnings to userspace. If we can't
> > > > pre-provision any space at all and we've completely run out, we
> > > > simply declare ENOSPC for all incoming operations that require
> > > > metadata allocation until pre-provisioning succeeds again.
> > >
> > > This is basically saying the same thing but:
> > >
> > > It could be that the LBA space is fragmented and so falling back to
> > > the smallest region size (that matches the thinp block size) would be
> > > the last resort?  Then if/when thinp cannot even service allocating a
> > > new free thin block, dm-thinp will transition to out-of-data-space
> > > mode.
> >
> > Yes, something of that sort, though we'd probably give up if we
> > can't get at least megabyte scale reservations - a single
> > modification in XFS can modify many structures and require
> > allocation of a lot of new metadata, so the filesystem cut-off
> > for metadata provisioning failure would be much larger than the
> > dm-thinp region size
> >
> > > > This is built entirely on the premise that once proactive backing
> > > > device provisioning fails, the backing device is at ENOSPC and we
> > > > have to wait for that situation to go away before allowing new data
> > > > to be ingested. Hence the block device really doesn't need 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-25 Thread Brian Foster
On Wed, May 24, 2023 at 10:40:34AM +1000, Dave Chinner wrote:
> On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > On Tue, May 23 2023 at 10:05P -0400, Brian Foster  
> > wrote:
> > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > ... since I also happen to think there is a potentially interesting
> > > development path to make this sort of reserve pool configurable in terms
> > > of size and active/inactive state, which would allow the fs to use an
> > > emergency pool scheme for managing metadata provisioning and not have to
> > > track and provision individual metadata buffers at all (dealing with
> > > user data is much easier to provision explicitly). So the space
> > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > filesystems that want more granularity for better behavior could achieve
> > > that with more work. Filesystems that don't would be free to rely on the
> > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > > protection with very minimal changes.
> > > 
> > > That's getting too far into the weeds on the future bits, though. This
> > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > > sufficient interest in this sort of "reserve mode" approach to try and
> > > clean it up further and have dm guys look at it, or if you guys see any
> > > obvious issues in what it does that makes it potentially problematic, or
> > > if you would just prefer to go down the path described above...
> > 
> > The model that Dave detailed, which builds on REQ_PROVISION and is
> > sticky (by provisioning same blocks for snapshot) seems more useful to
> > me because it is quite precise.  That said, it doesn't account for
> > hard requirements that _all_ blocks will always succeed.
> 
> Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> but I don't think we'd ever need a hard guarantee from the block
> device that every write bio issued from the filesystem will succeed
> without ENOSPC.
> 

The bigger picture goal that I didn't get into in my previous mail is
the "full device" reservation model is intended to be a simple, crude
reference implementation that can be enabled for any arbitrary thin
volume consumer (filesystem or application). The idea is to build that
on a simple enough reservation mechanism that any such consumer could
override it based on its own operational model. The goal is to guarantee
that a particular filesystem never receives -ENOSPC from dm-thin on
writes, but the first phase of implementing that is to simply guarantee
every block is writeable.

As a specific filesystem is able to more explicitly provision its own
allocations in a way that it can guarantee to return -ENOSPC from
dm-thin up front (rather than at write bio time), it can reduce the need
for the amount of reservation required, ultimately to zero if that
filesystem provides the ability to pre-provision all of its writes to
storage in some way or another.

I think for filesystems with complex metadata management like XFS, it's
not very realistic to expect explicit 1-1 provisioning for all metadata
changes on a per-transaction basis in the same way that can (fairly
easily) be done for data, which means a pool mechanism is probably still
needed for the metadata class of writes. Therefore, my expectation for
something like XFS is that it grows the ability to explicitly provision
data writes up front (we solved this part years ago), and then uses a
much smaller pool of reservation for the purpose of dealing with
metadata.

I think what you describe below around preprovisioned perag metadata
ranges is interesting because it _very closely_ maps conceptually to
what I envisioned the evolution of the reserve pool scheme to end up
looking like, but just implemented rather differently to use
reservations instead of specific LBA ranges.

Let me try to connect the dots and identify the differences/tradeoffs...

> If the block device can provide a guarantee that a provisioned LBA
> range is always writable, then everything else is a filesystem level
> optimisation problem and we don't have to involve the block device
> in any way. All we need is a flag we can ready out of the bdev at
> mount time to determine if the filesystem should be operating with
> LBA provisioning enabled...
> 
> e.g. If we need to "pre-provision" a chunk of the LBA space for
> filesystem metadata, we can do that ahead of time and track the
> pre-provisioned range(s) in the filesystem itself.
> 
> In XFS, That could be as simple as having small chunks of each AG
> reserved to metadata (e.g. start with the first 100MB) and limiting
> all metadata allocation free space searches to that specific block
> range. When we run low on that space, we pre-provision another 100MB
> chunk and then allocate all metadata out of that new range. If we
> start getting ENOSPC to pre-provisioning, then we reduce the size of
> the regions and log low space 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-25 Thread Mike Snitzer
On Thu, May 25 2023 at  7:39P -0400,
Dave Chinner  wrote:

> On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > On Tue, May 23 2023 at  8:40P -0400,
> > Dave Chinner  wrote:
> > 
> > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster  
> > > > wrote:
> > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > > ... since I also happen to think there is a potentially interesting
> > > > > development path to make this sort of reserve pool configurable in 
> > > > > terms
> > > > > of size and active/inactive state, which would allow the fs to use an
> > > > > emergency pool scheme for managing metadata provisioning and not have 
> > > > > to
> > > > > track and provision individual metadata buffers at all (dealing with
> > > > > user data is much easier to provision explicitly). So the space
> > > > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > > > filesystems that want more granularity for better behavior could 
> > > > > achieve
> > > > > that with more work. Filesystems that don't would be free to rely on 
> > > > > the
> > > > > simple/basic mechanism provided by dm-thin and still have basic 
> > > > > -ENOSPC
> > > > > protection with very minimal changes.
> > > > > 
> > > > > That's getting too far into the weeds on the future bits, though. This
> > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if 
> > > > > there's
> > > > > sufficient interest in this sort of "reserve mode" approach to try and
> > > > > clean it up further and have dm guys look at it, or if you guys see 
> > > > > any
> > > > > obvious issues in what it does that makes it potentially problematic, 
> > > > > or
> > > > > if you would just prefer to go down the path described above...
> > > > 
> > > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > > sticky (by provisioning same blocks for snapshot) seems more useful to
> > > > me because it is quite precise.  That said, it doesn't account for
> > > > hard requirements that _all_ blocks will always succeed.
> > > 
> > > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> > > but I don't think we'd ever need a hard guarantee from the block
> > > device that every write bio issued from the filesystem will succeed
> > > without ENOSPC.
> > > 
> > > If the block device can provide a guarantee that a provisioned LBA
> > > range is always writable, then everything else is a filesystem level
> > > optimisation problem and we don't have to involve the block device
> > > in any way. All we need is a flag we can ready out of the bdev at
> > > mount time to determine if the filesystem should be operating with
> > > LBA provisioning enabled...
> > > 
> > > e.g. If we need to "pre-provision" a chunk of the LBA space for
> > > filesystem metadata, we can do that ahead of time and track the
> > > pre-provisioned range(s) in the filesystem itself.
> > > 
> > > In XFS, That could be as simple as having small chunks of each AG
> > > reserved to metadata (e.g. start with the first 100MB) and limiting
> > > all metadata allocation free space searches to that specific block
> > > range. When we run low on that space, we pre-provision another 100MB
> > > chunk and then allocate all metadata out of that new range. If we
> > > start getting ENOSPC to pre-provisioning, then we reduce the size of
> > > the regions and log low space warnings to userspace. If we can't
> > > pre-provision any space at all and we've completely run out, we
> > > simply declare ENOSPC for all incoming operations that require
> > > metadata allocation until pre-provisioning succeeds again.
> > 
> > This is basically saying the same thing but:
> > 
> > It could be that the LBA space is fragmented and so falling back to
> > the smallest region size (that matches the thinp block size) would be
> > the last resort?  Then if/when thinp cannot even service allocating a
> > new free thin block, dm-thinp will transition to out-of-data-space
> > mode.
> 
> Yes, something of that sort, though we'd probably give up if we
> can't get at least megabyte scale reservations - a single
> modification in XFS can modify many structures and require
> allocation of a lot of new metadata, so the filesystem cut-off
> for metadata provisioning failure would be much larger than the
> dm-thinp region size
> 
> > > This is built entirely on the premise that once proactive backing
> > > device provisioning fails, the backing device is at ENOSPC and we
> > > have to wait for that situation to go away before allowing new data
> > > to be ingested. Hence the block device really doesn't need to know
> > > anything about what the filesystem is doing and vice versa - The
> > > block dev just says "yes" or "no" and the filesystem handles
> > > everything else.
> > 
> > Yes.
> > 
> > > It's worth noting that XFS already has a coarse-grained
> > > implementation 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-25 Thread Dave Chinner
On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> On Tue, May 23 2023 at  8:40P -0400,
> Dave Chinner  wrote:
> 
> > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster  
> > > wrote:
> > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > ... since I also happen to think there is a potentially interesting
> > > > development path to make this sort of reserve pool configurable in terms
> > > > of size and active/inactive state, which would allow the fs to use an
> > > > emergency pool scheme for managing metadata provisioning and not have to
> > > > track and provision individual metadata buffers at all (dealing with
> > > > user data is much easier to provision explicitly). So the space
> > > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > > filesystems that want more granularity for better behavior could achieve
> > > > that with more work. Filesystems that don't would be free to rely on the
> > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > > > protection with very minimal changes.
> > > > 
> > > > That's getting too far into the weeds on the future bits, though. This
> > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > > > sufficient interest in this sort of "reserve mode" approach to try and
> > > > clean it up further and have dm guys look at it, or if you guys see any
> > > > obvious issues in what it does that makes it potentially problematic, or
> > > > if you would just prefer to go down the path described above...
> > > 
> > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > sticky (by provisioning same blocks for snapshot) seems more useful to
> > > me because it is quite precise.  That said, it doesn't account for
> > > hard requirements that _all_ blocks will always succeed.
> > 
> > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> > but I don't think we'd ever need a hard guarantee from the block
> > device that every write bio issued from the filesystem will succeed
> > without ENOSPC.
> > 
> > If the block device can provide a guarantee that a provisioned LBA
> > range is always writable, then everything else is a filesystem level
> > optimisation problem and we don't have to involve the block device
> > in any way. All we need is a flag we can ready out of the bdev at
> > mount time to determine if the filesystem should be operating with
> > LBA provisioning enabled...
> > 
> > e.g. If we need to "pre-provision" a chunk of the LBA space for
> > filesystem metadata, we can do that ahead of time and track the
> > pre-provisioned range(s) in the filesystem itself.
> > 
> > In XFS, That could be as simple as having small chunks of each AG
> > reserved to metadata (e.g. start with the first 100MB) and limiting
> > all metadata allocation free space searches to that specific block
> > range. When we run low on that space, we pre-provision another 100MB
> > chunk and then allocate all metadata out of that new range. If we
> > start getting ENOSPC to pre-provisioning, then we reduce the size of
> > the regions and log low space warnings to userspace. If we can't
> > pre-provision any space at all and we've completely run out, we
> > simply declare ENOSPC for all incoming operations that require
> > metadata allocation until pre-provisioning succeeds again.
> 
> This is basically saying the same thing but:
> 
> It could be that the LBA space is fragmented and so falling back to
> the smallest region size (that matches the thinp block size) would be
> the last resort?  Then if/when thinp cannot even service allocating a
> new free thin block, dm-thinp will transition to out-of-data-space
> mode.

Yes, something of that sort, though we'd probably give up if we
can't get at least megabyte scale reservations - a single
modification in XFS can modify many structures and require
allocation of a lot of new metadata, so the filesystem cut-off
for metadata provisioning failure would be much larger than the
dm-thinp region size

> > This is built entirely on the premise that once proactive backing
> > device provisioning fails, the backing device is at ENOSPC and we
> > have to wait for that situation to go away before allowing new data
> > to be ingested. Hence the block device really doesn't need to know
> > anything about what the filesystem is doing and vice versa - The
> > block dev just says "yes" or "no" and the filesystem handles
> > everything else.
> 
> Yes.
> 
> > It's worth noting that XFS already has a coarse-grained
> > implementation of preferred regions for metadata storage. It will
> > currently not use those metadata-preferred regions for user data
> > unless all the remaining user data space is full.  Hence I'm pretty
> > sure that a pre-provisioning enhancment like this can be done
> > entirely in-memory without requiring any new on-disk state 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-24 Thread Mike Snitzer
On Tue, May 23 2023 at  8:40P -0400,
Dave Chinner  wrote:

> On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > On Tue, May 23 2023 at 10:05P -0400, Brian Foster  
> > wrote:
> > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > ... since I also happen to think there is a potentially interesting
> > > development path to make this sort of reserve pool configurable in terms
> > > of size and active/inactive state, which would allow the fs to use an
> > > emergency pool scheme for managing metadata provisioning and not have to
> > > track and provision individual metadata buffers at all (dealing with
> > > user data is much easier to provision explicitly). So the space
> > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > filesystems that want more granularity for better behavior could achieve
> > > that with more work. Filesystems that don't would be free to rely on the
> > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > > protection with very minimal changes.
> > > 
> > > That's getting too far into the weeds on the future bits, though. This
> > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > > sufficient interest in this sort of "reserve mode" approach to try and
> > > clean it up further and have dm guys look at it, or if you guys see any
> > > obvious issues in what it does that makes it potentially problematic, or
> > > if you would just prefer to go down the path described above...
> > 
> > The model that Dave detailed, which builds on REQ_PROVISION and is
> > sticky (by provisioning same blocks for snapshot) seems more useful to
> > me because it is quite precise.  That said, it doesn't account for
> > hard requirements that _all_ blocks will always succeed.
> 
> Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> but I don't think we'd ever need a hard guarantee from the block
> device that every write bio issued from the filesystem will succeed
> without ENOSPC.
> 
> If the block device can provide a guarantee that a provisioned LBA
> range is always writable, then everything else is a filesystem level
> optimisation problem and we don't have to involve the block device
> in any way. All we need is a flag we can read out of the bdev at
> mount time to determine if the filesystem should be operating with
> LBA provisioning enabled...
> 
> e.g. If we need to "pre-provision" a chunk of the LBA space for
> filesystem metadata, we can do that ahead of time and track the
> pre-provisioned range(s) in the filesystem itself.
> 
> In XFS, that could be as simple as having small chunks of each AG
> reserved for metadata (e.g. start with the first 100MB) and limiting
> all metadata allocation free space searches to that specific block
> range. When we run low on that space, we pre-provision another 100MB
> chunk and then allocate all metadata out of that new range. If we
> start getting ENOSPC from pre-provisioning, then we reduce the size of
> the regions and log low space warnings to userspace. If we can't
> pre-provision any space at all and we've completely run out, we
> simply declare ENOSPC for all incoming operations that require
> metadata allocation until pre-provisioning succeeds again.

This is basically saying the same thing but:

It could be that the LBA space is fragmented and so falling back to
the smallest region size (that matches the thinp block size) would be
the last resort?  Then if/when thinp cannot even service allocating a
new free thin block, dm-thinp will transition to out-of-data-space
mode.

> This is built entirely on the premise that once proactive backing
> device provisioning fails, the backing device is at ENOSPC and we
> have to wait for that situation to go away before allowing new data
> to be ingested. Hence the block device really doesn't need to know
> anything about what the filesystem is doing and vice versa - The
> block dev just says "yes" or "no" and the filesystem handles
> everything else.

Yes.

> It's worth noting that XFS already has a coarse-grained
> implementation of preferred regions for metadata storage. It will
> currently not use those metadata-preferred regions for user data
> unless all the remaining user data space is full.  Hence I'm pretty
> sure that a pre-provisioning enhancement like this can be done
> entirely in-memory without requiring any new on-disk state to be
> added.
> 
> Sure, if we crash and remount, then we might choose a different LBA
> region for pre-provisioning. But that's not really a huge deal as we
> could also run an internal background post-mount fstrim operation to
> remove any unused pre-provisioning that was left over from when the
> system went down.

This would be the FITRIM extension you mention below?  Which is a
filesystem interface detail? So dm-thinp would _not_ need to have new
state that tracks "provisioned but unused" blocks?  Nor would the block
layer need an extra discard 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-23 Thread Dave Chinner
On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> On Tue, May 23 2023 at 10:05P -0400, Brian Foster  wrote:
> > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > ... since I also happen to think there is a potentially interesting
> > development path to make this sort of reserve pool configurable in terms
> > of size and active/inactive state, which would allow the fs to use an
> > emergency pool scheme for managing metadata provisioning and not have to
> > track and provision individual metadata buffers at all (dealing with
> > user data is much easier to provision explicitly). So the space
> > inefficiency thing is potentially just a tradeoff for simplicity, and
> > filesystems that want more granularity for better behavior could achieve
> > that with more work. Filesystems that don't would be free to rely on the
> > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > protection with very minimal changes.
> > 
> > That's getting too far into the weeds on the future bits, though. This
> > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > sufficient interest in this sort of "reserve mode" approach to try and
> > clean it up further and have dm guys look at it, or if you guys see any
> > obvious issues in what it does that makes it potentially problematic, or
> > if you would just prefer to go down the path described above...
> 
> The model that Dave detailed, which builds on REQ_PROVISION and is
> sticky (by provisioning same blocks for snapshot) seems more useful to
> me because it is quite precise.  That said, it doesn't account for
> hard requirements that _all_ blocks will always succeed.

Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
but I don't think we'd ever need a hard guarantee from the block
device that every write bio issued from the filesystem will succeed
without ENOSPC.

If the block device can provide a guarantee that a provisioned LBA
range is always writable, then everything else is a filesystem level
optimisation problem and we don't have to involve the block device
in any way. All we need is a flag we can read out of the bdev at
mount time to determine if the filesystem should be operating with
LBA provisioning enabled...

e.g. If we need to "pre-provision" a chunk of the LBA space for
filesystem metadata, we can do that ahead of time and track the
pre-provisioned range(s) in the filesystem itself.

In XFS, that could be as simple as having small chunks of each AG
reserved for metadata (e.g. start with the first 100MB) and limiting
all metadata allocation free space searches to that specific block
range. When we run low on that space, we pre-provision another 100MB
chunk and then allocate all metadata out of that new range. If we
start getting ENOSPC from pre-provisioning, then we reduce the size of
the regions and log low space warnings to userspace. If we can't
pre-provision any space at all and we've completely run out, we
simply declare ENOSPC for all incoming operations that require
metadata allocation until pre-provisioning succeeds again.
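
To make the shape of that loop concrete, here is a minimal sketch of the
fs-side fallback, assuming the series' blkdev_issue_provision() helper
with a blkdev_issue_discard()-style signature; the chunk sizes and the
xfs_* name are invented for illustration and none of this is from a
posted patch:

#include <linux/blkdev.h>

/*
 * Sketch only: pre-provision a metadata-preferred region, halving the
 * request on ENOSPC and giving up below a megabyte-scale floor.  The
 * caller tracks the provisioned range in memory and restricts metadata
 * free-space searches to it.
 */
static int xfs_preprovision_meta_region(struct block_device *bdev,
                                        sector_t start, sector_t *len)
{
        sector_t want = 100 * 2048;     /* 100MB in 512-byte sectors */
        const sector_t floor = 2048;    /* ~1MB give-up threshold */
        int error;

        while (want >= floor) {
                /* assumed helper from the series, discard-style signature */
                error = blkdev_issue_provision(bdev, start, want, GFP_NOFS);
                if (!error) {
                        *len = want;    /* remember what we got */
                        return 0;
                }
                if (error != -ENOSPC)
                        return error;   /* hard device error */
                want >>= 1;             /* pool is tight: ask for less */
        }
        /* nothing provisioned: metadata allocations now fail with ENOSPC */
        return -ENOSPC;
}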

This is built entirely on the premise that once proactive backing
device provisioning fails, the backing device is at ENOSPC and we
have to wait for that situation to go away before allowing new data
to be ingested. Hence the block device really doesn't need to know
anything about what the filesystem is doing and vice versa - The
block dev just says "yes" or "no" and the filesystem handles
everything else.

It's worth noting that XFS already has a coarse-grained
implementation of preferred regions for metadata storage. It will
currently not use those metadata-preferred regions for user data
unless all the remaining user data space is full.  Hence I'm pretty
sure that a pre-provisioning enhancement like this can be done
entirely in-memory without requiring any new on-disk state to be
added.

Sure, if we crash and remount, then we might choose a different LBA
region for pre-provisioning. But that's not really a huge deal as we
could also run an internal background post-mount fstrim operation to
remove any unused pre-provisioning that was left over from when the
system went down.

Further, managing shared pool exhaustion doesn't require a
reservation pool in the backing device and for the filesystems to
request space from it. Filesystems already have their own reserve
pools via pre-provisioning. If we want the filesystems to be able to
release that space back to the shared pool (e.g. because the shared
backing pool is critically short on space) then all we need is an
extension to FITRIM to tell the filesystem to also release internal
pre-provisioned reserves.
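
For reference, the userspace side of that trim already exists; the pool
admin or daemon would simply loop the standard FITRIM ioctl over every
filesystem in the pool, and the "also drop internal pre-provisioned
reserves" behaviour would be the new, proposed part (no such flag exists
today, so this only shows the current interface):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>           /* FITRIM, struct fstrim_range */

/* Trim one mounted filesystem; under the proposed extension this would
 * also have the filesystem release its pre-provisioned reserves. */
static int trim_fs(const char *mountpoint)
{
        struct fstrim_range range = { .start = 0, .len = ULLONG_MAX, .minlen = 0 };
        int fd = open(mountpoint, O_RDONLY);

        if (fd < 0)
                return -1;
        if (ioctl(fd, FITRIM, &range) < 0) {
                close(fd);
                return -1;
        }
        printf("%s: %llu bytes trimmed\n", mountpoint,
               (unsigned long long)range.len);
        close(fd);
        return 0;
}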

Then the backing pool admin (person or automated daemon!) can simply
issue a trim on all the filesystems in the pool and space will be
returned. Then filesystems will ask for new pre-provisioned space
when they next need to ingest modifications, and the backing pool
can manage the new 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-23 Thread Mike Snitzer
On Tue, May 23 2023 at 10:05P -0400,
Brian Foster  wrote:

> On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > On Fri, May 19 2023 at  7:07P -0400,
> > Dave Chinner  wrote:
> > 
...
> > > e.g. If the device takes a snapshot, it needs to reprovision the
> > > potential COW ranges that overlap with the provisioned LBA range at
> > > snapshot time. e.g. by re-reserving the space from the backing pool
> > > for the provisioned space so if a COW occurs there is space
> > > guaranteed for it to succeed.  If there isn't space in the backing
> > > pool for the reprovisioning, then whatever operation that triggers
> > > the COW behaviour should fail with ENOSPC before doing anything
> > > else
> > 
> > Happy to implement this in dm-thinp.  Each thin block will need a bit
> > to say if the block must be REQ_PROVISION'd at time of snapshot (and
> > the resulting block will need the same bit set).
> > 
> > Walking all blocks of a thin device and triggering REQ_PROVISION for
> > each will obviously make thin snapshot creation take more time.
> > 
> > I think this approach is better than having a dedicated bitmap hooked
> > off each thin device's metadata (with bitmap being copied and walked
> > at the time of snapshot). But we'll see... I'll get with Joe to
> > discuss further.
> > 
> 
> Hi Mike,
> 
> If you recall our most recent discussions on this topic, I was thinking
> about the prospect of reserving the entire volume at mount time as an
> initial solution to this problem. When looking through some of the old
> reservation bits we prototyped years ago, it occurred to me that we have
> enough mechanism to actually prototype this.
> 
> So FYI, I have some hacky prototype code that essentially has the
> filesystem at mount time tell dm it's using the volume and expects all
> further writes to succeed. dm-thin acquires reservation for the entire
> range of the volume for which writes would require block allocation
> (i.e., holes and shared dm blocks) or otherwise warns that the fs cannot
> be "safely" mounted.
> 
> The reservation pool associates with the thin volume (not the
> filesystem), so if a snapshot is requested from dm, the snapshot request
> locates the snapshot origin and if it's currently active, increases the
> reservation pool to account for outstanding blocks that are about to
> become shared, or otherwise fails the snapshot with -ENOSPC. (I suspect
> discard needs similar treatment, but I hadn't got to that yet.). If the
> fs is not active, there is nothing to protect and so the snapshot
> proceeds as normal.
> 
> This seems to work on my simple, initial tests for protecting actively
> mounted filesystems from dm-thin -ENOSPC. This definitely needs a sanity
> check from dm-thin folks, however, because I don't know enough about the
> broader subsystem to reason about whether it's sufficiently correct. I
> just managed to beat the older prototype code into submission to get it
> to do what I wanted on simple experiments.

Feel free to share what you have.

But my initial gut on the approach is: why even use thin provisioning
at all if you're just going to reserve the entire logical address
space of each thin device?

> Thoughts on something like this? I think the main advantage is that it
> significantly reduces the requirements on the fs to track individual
> allocations. It's basically an on/off switch from the fs perspective,
> doesn't require any explicit provisioning whatsoever (though it can be
> done to improve things in the future) and in fact could probably be tied
> to thin volume activation to be made completely filesystem agnostic.
> Another advantage is that it requires no on-disk changes, no breaking
> COWs up front during snapshots, etc.

I'm just really unclear on the details without seeing it.

You shared a roll-up of the code we did from years ago so I can kind
of imagine the nature of the changes.  I'm concerned about snapshots,
and the implicit need to compound the reservation for each snapshot.
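For scale, in the worst case a 1 TiB thin volume that is fully reserved
at activation, with three snapshots taken while it stays mounted and is
substantially rewritten between them, could end up pinning on the order
of 4 TiB of pool space, independent of how much data the filesystem
actually holds.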

> The disadvantages are that it's space inefficient wrt to thin pool free
> space, but IIUC this is essentially what userspace management layers
> (such as Stratis) are doing today, they just put restrictions up front
> at volume configuration/creation time instead of at runtime. There also
> needs to be some kind of interface between the fs and dm. I suppose we
> could co-opt provision and discard primitives with a "reservation"
> modifier flag to get around that in a simple way, but that sounds
> potentially ugly. TBH, the more I think about this the more I think it
> makes sense to reserve on volume activation (with some caveats to allow
> a read-only mode, explicit bypass, etc.) and then let the
> cross-subsystem interface be dictated by granularity improvements...

It just feels imprecise to the point of being both excessive and
nebulous.

Thin devices, and snapshots of them, can be active without associated
filesystem mounts being active.  It just takes a single 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-23 Thread Brian Foster
On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> On Fri, May 19 2023 at  7:07P -0400,
> Dave Chinner  wrote:
> 
> > On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote:
> > > On Fri, May 19 2023 at 12:09P -0400,
> > > Christoph Hellwig  wrote:
> > > 
> > > > FYI, I really don't think this primitive is a good idea.  In the
> > > > concept of non-overwritable storage (NAND, SMR drives) the entire
> > > > concept of a one-shot 'provisioning' that will guarantee later writes
> > > > are always possible is simply bogus.
> > > 
> > > Valid point for sure, such storage shouldn't advertise support (and
> > > will return -EOPNOTSUPP).
> > > 
> > > But the primitive still has utility for other classes of storage.
> > 
> > Yet the thing people are wanting us filesystem developers to use
> > this with is thinly provisioned storage that has snapshot
> > capability. That, by definition, is non-overwritable storage. These
> > are the use cases people are asking filesystems to gracefully handle
> > and report errors when the sparse backing store runs out of space.
> 
> DM thinp falls into this category but as you detailed it can be made
> to work reliably. To carry that forward we need to first establish
> the REQ_PROVISION primitive (with this series).
> 
> Follow-on associated dm-thinp enhancements can then serve as reference
> for how to take advantage of XFS's ability to operate reliably on
> thinly provisioned storage.
>  
> > e.g. journal writes after a snapshot is taken on a busy filesystem
> > are always an overwrite and this requires more space in the storage
> > device for the write to succeed. ENOSPC from the backing device for
> > journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't
> > guarantee space for overwrites after snapshots, then it's not
> > actually useful for solving the real world use cases we actually
> > need device-level provisioning to solve.
> > 
> > It is not viable for filesystems to have to reprovision space for
> > in-place metadata overwrites after every snapshot - the filesystem
> > may not even know a snapshot has been taken! And it's not feasible
> > for filesystems to provision on demand before they modify metadata
> > because we don't know what metadata is going to need to be modified
> > before we start modifying metadata in transactions. If we get ENOSPC
> > from provisioning in the middle of a dirty transaction, it's all
> > over just the same as if we get ENOSPC during metadata writeback...
> > 
> > Hence what filesystems actually need is device provisioned space to
> > be -always over-writable- without ENOSPC occurring.  Ideally, if we
> > provision a range of the block device, the block device *must*
> > guarantee all future writes to that LBA range succeeds. That
> > guarantee needs to stand until we discard or unmap the LBA range,
> > and for however many writes we do to that LBA range.
> > 
> > e.g. If the device takes a snapshot, it needs to reprovision the
> > potential COW ranges that overlap with the provisioned LBA range at
> > snapshot time. e.g. by re-reserving the space from the backing pool
> > for the provisioned space so if a COW occurs there is space
> > guaranteed for it to succeed.  If there isn't space in the backing
> > pool for the reprovisioning, then whatever operation that triggers
> > the COW behaviour should fail with ENOSPC before doing anything
> > else
> 
> Happy to implement this in dm-thinp.  Each thin block will need a bit
> to say if the block must be REQ_PROVISION'd at time of snapshot (and
> the resulting block will need the same bit set).
> 
> Walking all blocks of a thin device and triggering REQ_PROVISION for
> each will obviously make thin snapshot creation take more time.
> 
> I think this approach is better than having a dedicated bitmap hooked
> off each thin device's metadata (with bitmap being copied and walked
> at the time of snapshot). But we'll see... I'll get with Joe to
> discuss further.
> 

Hi Mike,

If you recall our most recent discussions on this topic, I was thinking
about the prospect of reserving the entire volume at mount time as an
initial solution to this problem. When looking through some of the old
reservation bits we prototyped years ago, it occurred to me that we have
enough mechanism to actually prototype this.

So FYI, I have some hacky prototype code that essentially has the
filesystem at mount time tell dm it's using the volume and expects all
further writes to succeed. dm-thin acquires reservation for the entire
range of the volume for which writes would require block allocation
(i.e., holes and shared dm blocks) or otherwise warns that the fs cannot
be "safely" mounted.

The reservation pool associates with the thin volume (not the
filesystem), so if a snapshot is requested from dm, the snapshot request
locates the snapshot origin and if it's currently active, increases the
reservation pool to account for outstanding blocks that are about to
become shared, or 

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-22 Thread Mike Snitzer
On Fri, May 19 2023 at  7:07P -0400,
Dave Chinner  wrote:

> On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote:
> > On Fri, May 19 2023 at 12:09P -0400,
> > Christoph Hellwig  wrote:
> > 
> > > FYI, I really don't think this primitive is a good idea.  In the
> > > concept of non-overwritable storage (NAND, SMR drives) the entire
> > > concept of a one-shot 'provisioning' that will guarantee later writes
> > > are always possible is simply bogus.
> > 
> > Valid point for sure, such storage shouldn't advertise support (and
> > will return -EOPNOTSUPP).
> > 
> > But the primitive still has utility for other classes of storage.
> 
> Yet the thing people are wanting us filesystem developers to use
> this with is thinly provisioned storage that has snapshot
> capability. That, by definition, is non-overwritable storage. These
> are the use cases people are asking filesystems to gracefully handle
> and report errors when the sparse backing store runs out of space.

DM thinp falls into this category but as you detailed it can be made
to work reliably. To carry that forward we need to first establish
the REQ_PROVISION primitive (with this series).

Follow-on associated dm-thinp enhancements can then serve as reference
for how to take advantage of XFS's ability to operate reliably on
thinly provisioned storage.
 
> e.g. journal writes after a snapshot is taken on a busy filesystem
> are always an overwrite and this requires more space in the storage
> device for the write to succeed. ENOSPC from the backing device for
> journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't
> guarantee space for overwrites after snapshots, then it's not
> actually useful for solving the real world use cases we actually
> need device-level provisioning to solve.
> 
> It is not viable for filesystems to have to reprovision space for
> in-place metadata overwrites after every snapshot - the filesystem
> may not even know a snapshot has been taken! And it's not feasible
> for filesystems to provision on demand before they modify metadata
> because we don't know what metadata is going to need to be modified
> before we start modifying metadata in transactions. If we get ENOSPC
> from provisioning in the middle of a dirty transaction, it's all
> over just the same as if we get ENOSPC during metadata writeback...
> 
> Hence what filesystems actually need is device provisioned space to
> be -always over-writable- without ENOSPC occurring.  Ideally, if we
> provision a range of the block device, the block device *must*
> guarantee all future writes to that LBA range succeeds. That
> guarantee needs to stand until we discard or unmap the LBA range,
> and for however many writes we do to that LBA range.
> 
> e.g. If the device takes a snapshot, it needs to reprovision the
> potential COW ranges that overlap with the provisioned LBA range at
> snapshot time. e.g. by re-reserving the space from the backing pool
> for the provisioned space so if a COW occurs there is space
> guaranteed for it to succeed.  If there isn't space in the backing
> pool for the reprovisioning, then whatever operation that triggers
> the COW behaviour should fail with ENOSPC before doing anything
> else

Happy to implement this in dm-thinp.  Each thin block will need a bit
to say if the block must be REQ_PROVISION'd at time of snapshot (and
the resulting block will need the same bit set).

Walking all blocks of a thin device and triggering REQ_PROVISION for
each will obviously make thin snapshot creation take more time.

I think this approach is better than having a dedicated bitmap hooked
off each thin device's metadata (with bitmap being copied and walked
at the time of snapshot). But we'll see... I'll get with Joe to
discuss further.
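
As a sketch of that walk (invented helper names, not dm-thin code): at
snapshot time, every origin block carrying the "must provision" bit
would get fresh backing space reserved for the copy that may be COWed
later, the bit would be propagated to the snapshot, and the snapshot
would fail up front if the pool cannot cover it:

/*
 * Pseudo-code against dm-thin internals; every helper here is made up.
 */
static int thin_snapshot_reprovision(struct thin_c *origin, struct thin_c *snap)
{
        dm_block_t b;
        int r;

        for (b = 0; b < thin_dev_nr_blocks(origin); b++) {
                if (!thin_block_needs_provision(origin, b))
                        continue;
                /* reserve space so a future COW of this block cannot ENOSPC */
                r = pool_reserve_blocks(origin->pool, 1);
                if (r)
                        return r;       /* -ENOSPC: refuse to take the snapshot */
                thin_block_set_needs_provision(snap, b);
        }
        return 0;
}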

> Software devices like dm-thin/snapshot should really only need to
> keep a persistent map of the provisioned space and refresh space
> reservations for used space within that map whenever something that
> triggers COW behaviour occurs. i.e. a snapshot needs to reset the
> provisioned ranges back to "all ranges are freshly provisioned"
> before the snapshot is started. If that space is not available in
> the backing pool, then the snapshot attempt gets ENOSPC
> 
> That means filesystems only need to provision space for journals and
> fixed metadata at mkfs time, and they only need issue a
> REQ_PROVISION bio when they first allocate overwrite-in-place
> metadata. We already have online discard and/or fstrim for releasing
> provisioned space via discards.
> 
> This will require some mods to filesystems like ext4 and XFS to
> issue REQ_PROVISION and fail gracefully during metadata allocation.
> However, doing so means that we can actually harden filesystems
> against sparse block device ENOSPC errors by ensuring they will
> never occur in critical filesystem structures

Yes, let's finally _do_ this! ;)

Mike

--
dm-devel mailing list
dm-devel@redhat.com

Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-19 Thread Dave Chinner
On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote:
> On Fri, May 19 2023 at 12:09P -0400,
> Christoph Hellwig  wrote:
> 
> > FYI, I really don't think this primitive is a good idea.  In the
> > concept of non-overwritable storage (NAND, SMR drives) the entire
> > concept of a one-shot 'provisioning' that will guarantee later writes
> > are always possible is simply bogus.
> 
> Valid point for sure, such storage shouldn't advertise support (and
> will return -EOPNOTSUPP).
> 
> But the primitive still has utility for other classes of storage.

Yet the thing people are wanting us filesystem developers to use
this with is thinly provisioned storage that has snapshot
capability. That, by definition, is non-overwritable storage. These
are the use cases people are asking filesystems to gracefully handle
and report errors when the sparse backing store runs out of space.

e.g. journal writes after a snapshot is taken on a busy filesystem
are always an overwrite and this requires more space in the storage
device for the write to succeed. ENOSPC from the backing device for
journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't
guarantee space for overwrites after snapshots, then it's not
actually useful for solving the real world use cases we actually
need device-level provisioning to solve.

It is not viable for filesystems to have to reprovision space for
in-place metadata overwrites after every snapshot - the filesystem
may not even know a snapshot has been taken! And it's not feasible
for filesystems to provision on demand before they modify metadata
because we don't know what metadata is going to need to be modified
before we start modifying metadata in transactions. If we get ENOSPC
from provisioning in the middle of a dirty transaction, it's all
over just the same as if we get ENOSPC during metadata writeback...

Hence what filesystems actually need is device provisioned space to
be -always over-writable- without ENOSPC occurring.  Ideally, if we
provision a range of the block device, the block device *must*
guarantee all future writes to that LBA range succeeds. That
guarantee needs to stand until we discard or unmap the LBA range,
and for however many writes we do to that LBA range.

e.g. If the device takes a snapshot, it needs to reprovision the
potential COW ranges that overlap with the provisioned LBA range at
snapshot time. e.g. by re-reserving the space from the backing pool
for the provisioned space so if a COW occurs there is space
guaranteed for it to succeed.  If there isn't space in the backing
pool for the reprovisioning, then whatever operation that triggers
the COW behaviour should fail with ENOSPC before doing anything
else

Software devices like dm-thin/snapshot should really only need to
keep a persistent map of the provisioned space and refresh space
reservations for used space within that map whenever something that
triggers COW behaviour occurs. i.e. a snapshot needs to reset the
provisioned ranges back to "all ranges are freshly provisioned"
before the snapshot is started. If that space is not available in
the backing pool, then the snapshot attempt gets ENOSPC

That means filesystems only need to provision space for journals and
fixed metadata at mkfs time, and they only need issue a
REQ_PROVISION bio when they first allocate overwrite-in-place
metadata. We already have online discard and/or fstrim for releasing
provisioned space via discards.
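
On the filesystem side, the per-range call would then be a single helper
in the style of blkdev_issue_discard().  A minimal sketch of hardening
one metadata allocation, assuming the blkdev_issue_provision() and
bdev_max_provision_sectors() helpers from the posted series (names and
signatures assumed to mirror their discard counterparts):

#include <linux/blkdev.h>

/*
 * Sketch: make sure a freshly allocated metadata extent is backed with
 * real space before the filesystem commits to overwriting it in place.
 */
static int fs_provision_meta_extent(struct block_device *bdev,
                                    sector_t start, sector_t nr_sects)
{
        int error;

        if (!bdev_max_provision_sectors(bdev))
                return 0;       /* device doesn't support provisioning */

        error = blkdev_issue_provision(bdev, start, nr_sects, GFP_NOFS);
        if (error == -EOPNOTSUPP)
                return 0;       /* treat like a non-provisioning device */
        return error;           /* -ENOSPC: fail the allocation gracefully */
}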

This will require some mods to filesystems like ext4 and XFS to
issue REQ_PROVISION and fail gracefully during metadata allocation.
However, doing so means that we can actually harden filesystems
against sparse block device ENOSPC errors by ensuring they will
never occur in critical filesystem structures

-Dave.
-- 
Dave Chinner
da...@fromorbit.com

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-19 Thread Mike Snitzer
On Fri, May 19 2023 at 12:09P -0400,
Christoph Hellwig  wrote:

> FYI, I really don't think this primitive is a good idea.  In the
> concept of non-overwritable storage (NAND, SMR drives) the entire
> concept of a one-shot 'provisioning' that will guarantee later writes
> are always possible is simply bogus.

Valid point for sure, such storage shouldn't advertise support (and
will return -EOPNOTSUPP).

But the primitive still has utility for other classes of storage.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-18 Thread Christoph Hellwig
FYI, I really don't think this primitive is a good idea.  In the
concept of non-overwritable storage (NAND, SMR drives) the entire
concept of a one-shot 'provisioning' that will guarantee later writes
are always possible is simply bogus.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel