Pawel spent a fair amount of time discussing this with me, which is good
'cause I apparently had been confused.

The idea and implementation he suggests sounds reasonable to me, and will
(finally!) allow offline dedup :).

Sean.

> On Jun 13, 2020, at 12:52 PM, Pawel Jakub Dawidek <pa...@dawidek.net> wrote:
> 
> For reference, see https://github.com/openzfs/zfs/issues/405
> 
> The functionality will allow cloning a file very quickly by avoiding
> copying of the actual data and taking (almost) no additional space from
> the pool.
> 
> Once the file is cloned this way, either source or destination can be
> modified without affecting the other copy.
> 
> The closest analogy currently is dedup: when dedup is enabled on a
> given dataset and we store a block, we set the dedup bit in its block
> pointer (BP) and add an entry to the dedup table (DDT) referencing this
> block. This reference includes the block's checksum/hash (ideally a
> cryptographically strong one). When we write a block with the same
> content and the dedup property is still enabled on the dataset, we
> check whether there is already an entry with the same hash in the DDT.
> If there is, we increase the reference count on the DDT entry, modify
> the BP to point at the existing block, and skip writing the data to
> the VDEVs.
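As a rough illustration of that write-path decision, here is a minimal, self-contained C sketch. Everything in it (`ddt_entry_t`, `ddt_lookup`, `dedup_write`, the fixed-size linear-scan table) is invented for clarity and bears no resemblance to the actual ZFS dedup code; it only shows the "hit → bump refcount and skip the write, miss → insert and write" logic described above:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified DDT -- illustration only, not the real ZFS code. */
typedef struct {
	uint8_t  hash[32];	/* block checksum/hash */
	uint64_t vdev, offset;	/* where the data lives */
	uint64_t refcount;
} ddt_entry_t;

#define DDT_SIZE 1024
static ddt_entry_t ddt[DDT_SIZE];
static int ddt_used;

/* Find an existing DDT entry by block hash, or return NULL. */
static ddt_entry_t *
ddt_lookup(const uint8_t hash[32])
{
	for (int i = 0; i < ddt_used; i++)
		if (memcmp(ddt[i].hash, hash, 32) == 0)
			return (&ddt[i]);
	return (NULL);
}

/*
 * Write path with dedup enabled.  Returns true if the data must actually
 * be written to the VDEVs; on a hit, rewrites vdev/offset to point the
 * new BP at the existing block and skips the physical write.
 */
static bool
dedup_write(const uint8_t hash[32], uint64_t *vdev, uint64_t *offset)
{
	ddt_entry_t *e = ddt_lookup(hash);

	if (e != NULL) {
		e->refcount++;		/* one more BP references this block */
		*vdev = e->vdev;
		*offset = e->offset;
		return (false);		/* no physical write needed */
	}
	/* First appearance: record it in the DDT and write the data. */
	e = &ddt[ddt_used++];
	memcpy(e->hash, hash, 32);
	e->vdev = *vdev;
	e->offset = *offset;
	e->refcount = 1;
	return (true);
}
```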
> 
> Can we reuse existing dedup machinery to implement this manual dedup?
> 
> Short answer: no.
> 
> Longer answer:
> 
> Dedup takes advantage of the fact that a reference to the block is
> stored in the DDT when the block appears for the first time. So on a
> later write the block is already in the DDT and we just need to
> increase the reference counter.
> 
> Here, we don't want to keep a table with references to all the blocks
> (let's learn a lesson from dedup). We want to keep track of only the
> blocks that are referenced more than once.
> 
> We cannot keep a reference counter in the BP for a given block, as
> once we create another reference, we cannot modify the existing BP
> pointing at this block.
> 
> The conclusion, at least for me, is pretty clear: we need a separate,
> additional table that keeps track of blocks referenced more than once.
> Let's call it the Block Reference Table (BRT).
> 
> When a block is written in a normal way, this table is not touched.
> 
> When we create a reference to an already existing block, we look for
> an entry in the BRT (in case the block already has more than one
> reference). If we find it, we increase its reference count; if we
> don't, we create an entry in the BRT with refcount equal to 2. Note
> that all entries in the BRT have refcount >= 2. Of course we could
> agree that refcount means "extra references" and make it start from 1;
> it doesn't really matter.
> 
> Random observations about the feature:
> 
> - It can be a pool-wide thing. I think it is acceptable to clone a block
> from another dataset with a different compression, checksum, copies or
> recordsize. It is not ok to clone a block between encrypted and
> unencrypted datasets. It is ok to clone a block between two encrypted
> datasets, but only if they share the same encryption key.
> 
> - It can always be enabled - if unused, the BRT is empty, so there
> should be no performance impact. An optional toggle would only be
> useful to control use of this feature by users.
> 
> - It should affect neither write nor read performance. It may affect
> block-freeing performance (and thus also overwriting) as we need to
> consult the BRT on every level-0 plain-file-contents block free.
> 
> - If we had a toggle to turn it on or off for selected datasets, we
> could have a bit in the BP (similar to dedup) to hint that we should
> consult the BRT on free.
> 
> - The performance impact above should be much lower than DDT's, even
> for a large BRT, as a BRT entry is very small and entries can be
> sorted usefully (e.g. by vdev/offset instead of being random like
> dedup hashes).
> 
> - A BRT entry may contain only the VDEV id and offset, because when we
> clone the block we have to provide the source BP anyway, which
> contains all the other properties.
> 
> - Currently I don't see a way to make these block references survive
> send/recv. It is similar to dedup, but dedup can reconstruct
> references on receive based on hashes. Here we would need some kind of
> magic mapping between source BPs and destination BPs.
> 
> - Maybe it is obvious, but let me mention that this will only work for
> level-0 plain-file-contents blocks, just like dedup, so all indirect
> blocks still have to be allocated.
> 
> Some potential use cases:
> 
> - Quick cloning of big files (eg. VM images) - obvious one.
> 
> - Recovering accidentally deleted files from a snapshot (without the
> need for rollback/clone or copying, which takes additional space).
> 
> - Moving files between datasets.
> 
> - This should work even within a single file, so imagine creating
> space at the beginning of a file (in recordsize increments, though).
> 
> I have already started the implementation, but this is a hobby project
> for me, so don't expect a very fast pace.
> 
> Your comments and thoughts are greatly appreciated.
> 
> --
> Pawel Jakub Dawidek

------------------------------------------
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Te62797341aee0806-Me95374fa175298d7cd20a07e