Cool!  A couple of questions/observations:

Do I understand correctly that the new data structure you're proposing (the
BRT) maps from DVA to refcount?

If so, and we can keep this data structure sorted on disk (by DVA), we
would be more likely to get multiple useful entries when reading one block
of the BRT.  That would reduce the pathologies of the DDT (where each block
of the DDT contains random entries).
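To illustrate, here is a toy sketch of that locality argument (Python, with made-up names — none of this is OpenZFS code):

```python
# Toy illustration: BRT entries keyed by DVA sort sequential blocks next
# to each other, while DDT entries land in hash order. Purely
# illustrative; BLOCKSIZE and the tuple DVAs are stand-ins.
import hashlib

BLOCKSIZE = 0x20000  # assume 128K records
dvas = [(0, i * BLOCKSIZE) for i in range(8)]  # 8 consecutive blocks, vdev 0

brt_order = sorted(dvas)  # BRT on-disk order: by (vdev, offset)
ddt_order = sorted(dvas, key=lambda d: hashlib.sha256(repr(d).encode()).digest())

# Freeing the 8 blocks in file order walks adjacent BRT entries, so one
# BRT block read serves many frees; the same walk through the DDT touches
# entries scattered across the whole table.
assert brt_order == dvas
```

Sequential frees would then hit a contiguous run of BRT entries, so a single cached BRT block serves many lookups.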

However, even so, looking up in the BRT for every single zio_free() would
be a substantial cost.  I imagine that in practice, the BRT would need to
be fully cached to get good performance.  If so, the substantial
difference from using the DDT may only be that we don't have to use a
strong checksum (because it's indexed by DVA instead of checksum).  Aside
from that, you could use the DDT, and assume that if we don't find an
entry, it has an effective refcount of 1.
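A rough sketch of that "miss means refcount 1" free path (illustrative Python; the names are hypothetical, not real DDT/BRT code):

```python
# Table that only holds multiply-referenced blocks; a lookup miss on free
# means the block has a single reference and can be freed right away.
table = {}  # key -> refcount; only blocks referenced more than once appear

def free_block(key):
    refs = table.get(key, 1)   # no entry => effective refcount of 1
    if refs > 2:
        table[key] = refs - 1  # still multiply referenced
        return False           # data must not be freed yet
    table.pop(key, None)       # dropping to one reference removes the entry
    return refs == 1           # free the data only on the last reference

table["blk"] = 2                    # block was cloned once
assert free_block("blk") is False   # one reference remains
assert free_block("blk") is True    # now actually freed
assert free_block("other") is True  # never cloned: freed immediately
```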

I think there could be a more efficient solution to a subset of the
problems that this tackles.  For example, here is an incomplete idea for
file cloning: We could create a new "file clone family refcount" (FCFR)
data structure when each file is cloned.  The FCFR would map from blockID
-> refcount.  Each object would have a (normally empty) pointer to its FCFR
data structure.  This way, only files that are cloned would pay any
performance penalty.  And each cloned file's data structure is independent,
so manipulating one cloned file doesn't have to deal with a huge
(pool-wide) data structure.
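To make the (again, incomplete) idea concrete, a sketch in Python — every name here is hypothetical:

```python
# FCFR sketch: a per-clone-family refcount table, created only when a
# file is first cloned, mapping blockID -> refcount.
class FileObj:
    def __init__(self, nblocks):
        self.blocks = list(range(nblocks))  # stand-ins for block pointers
        self.fcfr = None                    # normally empty pointer

def clone_file(src):
    if src.fcfr is None:
        src.fcfr = {}              # first clone creates the family's FCFR
    dst = FileObj(0)
    dst.blocks = list(src.blocks)  # share block pointers, copy no data
    dst.fcfr = src.fcfr            # clones share one small, independent FCFR
    for blk in src.blocks:
        src.fcfr[blk] = src.fcfr.get(blk, 1) + 1
    return dst

orig = FileObj(4)
copy = clone_file(orig)
assert copy.fcfr is orig.fcfr and all(orig.fcfr[b] == 2 for b in orig.blocks)
assert FileObj(4).fcfr is None     # un-cloned files pay no penalty
```

Manipulating one clone family's FCFR never touches any other family's table, which is the point: no pool-wide structure in the hot path.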

--matt

On Sat, Jun 13, 2020 at 4:07 PM Pawel Jakub Dawidek <pa...@dawidek.net>
wrote:

> For reference, see https://github.com/openzfs/zfs/issues/405
>
> The functionality will allow cloning a file very quickly by avoiding
> copying the actual data, so the clone takes (almost) no additional
> space from the pool.
>
> Once the file is cloned this way, either source or destination can be
> modified without affecting the other copy.
>
> The closest analogy currently is dedup: when dedup is enabled on a
> given dataset and we store a block, we set the dedup bit in its block
> pointer (BP) and add an entry to the dedup table (DDT) referencing this
> block. This reference includes the block's checksum/hash (ideally a
> cryptographically strong one). When we write a block with the same
> content and the dedup property is still enabled on the dataset, we
> check whether the DDT already has an entry with the same hash. If it
> does, we increase the reference count on the DDT entry, modify the BP
> to point at the existing block, and skip writing data to the VDEVs.
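The write path described above could be sketched like this (illustrative Python with made-up names, heavily simplified from the real DDT):

```python
# Dedup write path sketch: hash the block; on a DDT hit, bump the
# refcount and point the new BP at the existing block instead of writing.
import hashlib

ddt = {}          # hash -> [refcount, bp]
vdev_writes = []  # stand-in for actual writes to the VDEVs

def dedup_write(data):
    h = hashlib.sha256(data).digest()   # strong checksum, as dedup requires
    entry = ddt.get(h)
    if entry is not None:
        entry[0] += 1                   # duplicate: just add a reference
        return entry[1]                 # reuse the existing block's BP
    vdev_writes.append(data)            # first copy: really write it
    bp = ("vdev0", len(vdev_writes))    # fake DVA for the sketch
    ddt[h] = [1, bp]
    return bp

bp1 = dedup_write(b"same payload")
bp2 = dedup_write(b"same payload")
assert bp1 == bp2 and len(vdev_writes) == 1  # second write was skipped
```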
>
> Can we reuse existing dedup machinery to implement this manual dedup?
>
> Short answer: no.
>
> Longer answer:
>
> Dedup takes advantage of the fact that a reference to the block is
> stored in the DDT when the block first appears. So on a duplicate
> write the block is already in the DDT and we just need to increase the
> reference counter.
>
> Here, we don't want to keep a table with references to all the blocks
> (let's learn a lesson from dedup). We want to keep track of only the
> blocks that are referenced more than once.
>
> We cannot keep a reference counter in the BP for a given block,
> because once we create another reference, we cannot modify the
> existing BP pointing at this block.
>
> The conclusion, at least for me, is pretty clear: we need an
> additional, separate table that keeps track of blocks referenced more
> than once.
> Let's call it Block Reference Table (BRT).
>
> When a block is written in a normal way, this table is not touched.
>
> When we create a reference to an already existing block, we look for
> an entry in the BRT (in case the block already has more than one
> reference). If we find it, we increase the reference count; if we
> don't, we create an entry in the BRT with a refcount of 2. Note that
> all entries in the BRT have refcount >= 2. Of course, we could agree
> that refcount means "extra references" and start it from 1; it doesn't
> really matter.
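A sketch of that reference-creation rule (illustrative Python, hypothetical names):

```python
# BRT sketch: the first extra reference inserts an entry with refcount 2;
# later references just increment it.
brt = {}  # (vdev, offset) -> refcount; all stored entries are >= 2

def brt_add_ref(dva):
    brt[dva] = brt.get(dva, 1) + 1   # absent entry == implicit refcount 1

brt_add_ref((0, 0x1000))             # first clone: entry appears with 2
brt_add_ref((0, 0x1000))             # second clone: increments to 3
assert brt[(0, 0x1000)] == 3
assert (0, 0x2000) not in brt        # never-cloned blocks stay out of the BRT
assert all(v >= 2 for v in brt.values())
```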
>
> Random observations about the feature:
>
> - It can be a pool-wide thing. I think it is acceptable to clone a block
> from another dataset with a different compression, checksum, copies or
> recordsize. It is not ok to clone a block between encrypted and
> unencrypted datasets. It is ok to clone a block between two encrypted
> datasets, but only if they share the same encryption key.
>
> - It can always be enabled - if unused, the BRT is empty, so there
> should be no performance impact. An optional toggle would only be
> useful to control the use of this feature by users.
>
> - It should affect neither write nor read performance. It may affect
> block-freeing performance (and thus also overwriting), as we need to
> consult the BRT on every free of a level-0 plain-file-contents block.
>
> - If we had a toggle to turn it on or off for selected datasets, we
> could have a bit in the BP (similar to the dedup bit) hinting that we
> should consult the BRT on free.
>
> - The performance impact above should be much lower than the DDT's,
> even for a large BRT, as a BRT entry is very small and entries can be
> sorted usefully (e.g. by vdev/offset instead of being effectively
> random like dedup hashes).
>
> - A BRT entry may contain only the VDEV id and offset, because when we
> clone a block we have to provide the source BP anyway, which contains
> all the other properties.
>
> - Currently I don't see a way to make those block references survive
> send/recv. It is similar to dedup, but dedup can reconstruct
> references on receive based on hashes. Here we would need some kind of
> magic mapping between source BPs and destination BPs.
>
> - Maybe it is obvious, but let me mention that this will only work for
> level-0 plain-file-contents blocks, just like dedup, so all indirect
> blocks still have to be allocated.
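Combining the free-path and hint-bit observations above, a sketch (illustrative Python; the BP field and function names are made up):

```python
# Blocks written while the dataset toggle is on get a BP hint bit (like
# the dedup bit), and the free path only consults the BRT when it is set.
from dataclasses import dataclass

@dataclass
class BP:
    dva: tuple
    brt_hint: bool = False  # set at write time if the dataset toggle is on

brt = {}  # (vdev, offset) -> refcount

def write_block(dva, dataset_cloning_enabled):
    return BP(dva, brt_hint=dataset_cloning_enabled)

def zio_free(bp):
    if not bp.brt_hint:
        return True                 # fast path: no BRT lookup at all
    refs = brt.get(bp.dva, 1)
    if refs > 2:
        brt[bp.dva] = refs - 1
        return False                # other references remain
    brt.pop(bp.dva, None)
    return refs == 1                # free data only on the last reference

plain = write_block((0, 0x1000), dataset_cloning_enabled=False)
assert zio_free(plain) is True      # never consults the BRT
```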
>
> Some potential use cases:
>
> - Quick cloning of big files (e.g. VM images) - the obvious one.
>
> - Recovering accidentally deleted files from a snapshot (without the
> need for a rollback/clone or a copy, which would take additional
> space).
>
> - Moving files between datasets.
>
> - This should work even within a single file, so imagine creating
> space at the beginning of a file (in recordsize increments, though).
>
> I started the implementation already, but this is a hobby project for
> me, so don't expect a very fast pace.
>
> Your comments and thoughts are greatly appreciated.
>
> --
> Pawel Jakub Dawidek
>
> ------------------------------------------
> openzfs: openzfs-developer
> Permalink:
> https://openzfs.topicbox.com/groups/developer/Te62797341aee0806-M77d44ed51187031460b91160
> Delivery options:
> https://openzfs.topicbox.com/groups/developer/subscription
>

------------------------------------------
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Te62797341aee0806-Ma15bb4984382ab368f7e13f0
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription
