How would cross-dataset files be accounted for in dataset quotas? With
snapshots/clones, IIRC, there is an origin dataset that gets ‘charged’
the space for all the unchanged blocks. Doing the same here would seem
to make the most sense (the original source of the file gets ‘charged’
for the space of the file) — but it would be good to confirm that this
is the expectation.


From: Allan Jude <allanj...@freebsd.org>
Reply: openzfs-developer <developer@lists.open-zfs.org>
Date: June 15, 2020 at 1:38:49 PM
To: developer@lists.open-zfs.org
Subject: Re: [developer] Manual dedup, aka --reflink support.

If we used a bit in the block pointer, similar to the one we have for
dedup, we would only need to examine the BRT in zio_free() if the BP had
the bit set. Of course, the problem with that idea is that the 'original'
file won't have that bit set in its BP, so you'd need to search the BRT
to ensure there is no other BP referencing your DVAs before you free, so
I guess that won't work.

For the FCFR idea, would that work across datasets? Or could it be made to?

I really like the idea of being able to restore an individual file from
a snapshot without having to copy it, and also of being able to copy
files between datasets without taking additional space.


On 2020-06-15 12:18, Matthew Ahrens via openzfs-developer wrote:
> Cool!  Couple of questions/observations:
>
> Do I understand correctly that the new data structure you're proposing
> (the BRT) maps from DVA to refcount?
>
> If so, and we can keep this data structure sorted on disk (by DVA), we
> would be more likely to get multiple useful entries when reading one
> block of the BRT.  That would reduce the pathologies of the DDT (where
> each block of the DDT contains random entries).
>
> However, even so, looking up in the BRT for every single zio_free()
> would be a substantial cost.  I imagine that in practice, the BRT would
> need to be fully cached to get good performance.  The substantial
> difference from using the DDT may then only be that we don't have to
> use a strong checksum (because it's indexed by DVA instead of
> checksum).  Aside from that, you could use the DDT, and assume that if
> we don't find an entry, it has an effective refcount of 1.
>
> I think there could be a more efficient solution to a subset of the
> problems that this tackles.  For example, here is an incomplete idea for
> file cloning: We could create a new "file clone family refcount" (FCFR)
> data structure when each file is cloned.  The FCFR would map from
> blockID -> refcount.  Each object would have a (normally empty) pointer
> to its FCFR data structure.  This way, only files that are cloned would
> pay any performance penalty.  And each cloned file's data structure is
> independent, so manipulating one cloned file doesn't have to deal with a
> huge (pool-wide) data structure.
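The FCFR idea above could be sketched roughly as follows. This is an illustrative Python model, not actual OpenZFS code; the names FCFR, File, clone_file, and free_block are all hypothetical. The key point it demonstrates is that only cloned files carry a pointer to a small per-family refcount map, so uncloned files pay no lookup cost.

```python
class FCFR:
    """File-clone-family refcount: blockID -> refcount for one clone family."""
    def __init__(self):
        self.refcounts = {}  # blockID -> number of files referencing this block

class File:
    def __init__(self, blocks):
        self.blocks = list(blocks)  # blockIDs referenced by this file
        self.fcfr = None            # normally empty: uncloned files pay no cost

def clone_file(src):
    """Clone src without copying data; both files then share one FCFR."""
    if src.fcfr is None:
        src.fcfr = FCFR()
    for b in src.blocks:
        # every shared block gains one reference (implicit refcount was 1)
        src.fcfr.refcounts[b] = src.fcfr.refcounts.get(b, 1) + 1
    dst = File(src.blocks)
    dst.fcfr = src.fcfr
    return dst

def free_block(f, block_id):
    """Return True if the block's data may actually be freed."""
    if f.fcfr is None:
        return True                  # never cloned: free immediately
    n = f.fcfr.refcounts.get(block_id, 1)
    if n > 1:
        f.fcfr.refcounts[block_id] = n - 1
        return False                 # still referenced by another file
    f.fcfr.refcounts.pop(block_id, None)
    return True
```

Because each family's map is independent, freeing a block in one cloned file only touches that family's small structure, never a pool-wide table.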
>
> --matt
>
> On Sat, Jun 13, 2020 at 4:07 PM Pawel Jakub Dawidek <pa...@dawidek.net> wrote:
>
> For reference, see https://github.com/openzfs/zfs/issues/405
>
> The functionality will allow cloning a file very quickly by avoiding
> copying of the actual data, taking (almost) no additional space from
> the pool.
>
> Once the file is cloned this way, either source or destination can be
> modified without affecting the other copy.
>
> The closest analogy currently is dedup: when dedup is enabled on a
> given dataset and we store a block, we set the dedup bit in its block
> pointer (BP) and add an entry to the dedup table (DDT) referencing this
> block. This reference includes the block's checksum/hash (ideally a
> cryptographically strong one). When we write a block with the same
> content and the dedup property is still enabled on the dataset, we
> check whether there is already an entry with the same hash in the DDT.
> If there is, we increase the reference count on the DDT entry, modify
> the BP to point at the existing block, and skip writing the data to
> the VDEVs.
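The dedup write path described above can be sketched like this. It is a toy Python model under loose assumptions, not the real DDT code (which lives in the OpenZFS C sources and is far more involved); DDT and write_block here are illustrative names, and a dict stands in for the on-disk block pointer.

```python
import hashlib

class DDT:
    """Toy dedup table: strong hash -> {bp, refcount}."""
    def __init__(self):
        self.entries = {}

def write_block(ddt, data):
    """Write with dedup enabled: returns (bp, was_deduped)."""
    h = hashlib.sha256(data).digest()   # dedup requires a strong hash
    entry = ddt.entries.get(h)
    if entry is not None:
        # Same content already stored: bump refcount, reuse the existing
        # BP, and skip writing the data to the vdevs.
        entry["refcount"] += 1
        return entry["bp"], True
    # First time this content is seen: allocate and record a reference.
    bp = {"dva": ("vdev0", len(ddt.entries)), "dedup_bit": True}
    ddt.entries[h] = {"bp": bp, "refcount": 1}
    return bp, False
```

Note how the first write already creates the DDT entry; that is exactly the property Pawel points out below, and why the dedup machinery cannot be reused directly for manual cloning.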
>
> Can we reuse existing dedup machinery to implement this manual dedup?
>
> Short answer: no.
>
> Longer answer:
>
> Dedup takes advantage of the fact that a reference to the block is
> stored in the DDT when the block appears for the first time. So at
> write time the block is already in the DDT and we just need to
> increase its reference counter.
>
> Here, we don't want to keep a table with references to all the blocks
> (let's learn a lesson from dedup). We want to keep track of only the
> blocks that are referenced more than once.
>
> We cannot keep a reference counter in the BP for a given block, as
> once we create another reference, we cannot modify the existing BP
> pointing at this block.
>
> The conclusion, at least for me, is pretty clear: we need an
> additional table on the side that will keep track of blocks referenced
> more than once. Let's call it the Block Reference Table (BRT).
>
> When a block is written in a normal way, this table is not touched.
>
> When we create a reference to an already existing block, we look for
> an entry in the BRT (in case the block already has more than one
> reference). If we find it, we increase its reference count; if we
> don't, we create an entry in the BRT with a refcount of 2. Note that
> all entries in the BRT have refcount >= 2. Of course we could agree
> that refcount means "extra references" and start it from 1; it doesn't
> really matter.
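The BRT bookkeeping just described can be sketched as follows. This is an illustrative Python model of the proposal, not the implementation; the BRT class and its methods are hypothetical names. The table is keyed by the DVA's (vdev, offset), holds entries only for blocks referenced more than once, and is consulted both when cloning and when freeing.

```python
class BRT:
    """Block Reference Table: (vdev, offset) -> refcount, only when >= 2."""
    def __init__(self):
        self.entries = {}   # kept sorted by DVA on disk in the proposal

    def clone(self, vdev, offset):
        """Record one extra reference to an already-written block."""
        key = (vdev, offset)
        if key in self.entries:
            self.entries[key] += 1
        else:
            # First extra reference: original write + this clone = 2.
            self.entries[key] = 2

    def free(self, vdev, offset):
        """Called on the free path; True means the block may really be freed."""
        key = (vdev, offset)
        n = self.entries.get(key)
        if n is None:
            return True            # not in BRT: only one reference existed
        if n > 2:
            self.entries[key] = n - 1
        else:
            del self.entries[key]  # back down to a single reference
        return False               # other references remain; don't free
```

A normal write never touches the table, and a miss on free means the common single-reference case, which is why an empty BRT should cost nothing.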
>
> Random observations about the feature:
>
> - It can be a pool-wide thing. I think it is acceptable to clone a block
> from another dataset with a different compression, checksum, copies or
> recordsize. It is not ok to clone a block between encrypted and
> unencrypted datasets. It is ok to clone a block between two encrypted
> datasets, but only if they share the same encryption key.
>
> - It can be always enabled - if unused, the BRT is empty, so there
> should be no performance impact. An optional toggle would only be
> useful for controlling users' access to the feature.
>
> - It should affect neither write nor read performance. It may affect
> block-freeing performance (and therefore also overwriting), as we need
> to consult the BRT on every level-0 plain-file-contents block free.
>
> - If we had a toggle to turn it on or off for selected datasets, we
> could have a bit in the BP (similar to dedup) to hint that we should
> consult the BRT on free.
>
> - The performance impact above should be much lower than the DDT's,
> even for a large BRT, as a BRT entry is very small and entries can be
> sorted usefully (e.g. by vdev/offset, instead of being random like
> dedup hashes).
>
> - A BRT entry may contain only the VDEV id and offset, because when we
> clone a block we have to provide the source BP anyway, which contains
> all the other properties.
>
> - Currently I don't see a way to make those block references survive
> send/recv. It is similar to dedup, but dedup can reconstruct references
> on receive based on hashes. Here we would need some kind of magic
> mapping between source BPs and destination BPs.
>
> - Maybe it is obvious, but let me mention that this will only work for
> level-0 plain-file-contents blocks, just like dedup, so all indirect
> blocks still have to be allocated.
>
> Some potential use cases:
>
> - Quick cloning of big files (eg. VM images) - obvious one.
>
> - Recovering accidentally deleted files from a snapshot (without the
> need for rollback/clone or copying, which would take additional space).
>
> - Moving files between datasets.
>
> - This should work even within a single file, so imagine creating
> space at the beginning of a file (in recordsize increments, though).
>
> I have already started the implementation, but this is a hobby project
> for me, so don't expect a very fast pace.
>
> Your comments and thoughts are greatly appreciated.
>
> --
> Pawel Jakub Dawidek
>
> ------------------------------------------
> openzfs: openzfs-developer
> Permalink:
>
https://openzfs.topicbox.com/groups/developer/Te62797341aee0806-M77d44ed51187031460b91160
> Delivery options:
> https://openzfs.topicbox.com/groups/developer/subscription
>



-- 
Allan Jude

