For reference, see https://github.com/openzfs/zfs/issues/405

The functionality will allow cloning a file very quickly by avoiding
copying of the actual data, taking (almost) no additional space from
the pool.

Once the file is cloned this way, either the source or the destination
can be modified without affecting the other copy.
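
For illustration only: if this were eventually hooked up to a
copy_file_range(2)-style interface (whether it will be, or whether an
ioctl would be used instead, is an open question), cloning a file from
userland could look like the sketch below. The file names are made up.

        /*
         * Sketch of a possible user-facing clone, assuming a
         * copy_file_range(2)-style entry point. Nothing here is
         * decided; the interface is an assumption for illustration.
         */
        #define _GNU_SOURCE     /* for copy_file_range() on Linux */
        #include <sys/stat.h>
        #include <err.h>
        #include <fcntl.h>
        #include <unistd.h>

        int
        main(void)
        {
                struct stat sb;
                off_t soff = 0, doff = 0;
                ssize_t n;
                int srcfd, dstfd;

                srcfd = open("vm-image.raw", O_RDONLY);
                if (srcfd < 0)
                        err(1, "open(source)");
                if (fstat(srcfd, &sb) < 0)
                        err(1, "fstat");
                dstfd = open("vm-image-clone.raw",
                    O_WRONLY | O_CREAT | O_TRUNC, 0644);
                if (dstfd < 0)
                        err(1, "open(destination)");

                /* Clone the whole file; no data blocks are copied. */
                while (soff < sb.st_size) {
                        n = copy_file_range(srcfd, &soff, dstfd, &doff,
                            (size_t)(sb.st_size - soff), 0);
                        if (n < 0)
                                err(1, "copy_file_range");
                        if (n == 0)
                                break;
                }

                close(dstfd);
                close(srcfd);
                return (0);
        }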

The closest analogy currently is dedup: when dedup is enabled on a
given dataset and we store a block, we set the dedup bit in its block
pointer (BP) and add an entry to the dedup table (DDT) referencing this
block. The entry is keyed by the block's checksum/hash (ideally a
cryptographically strong one). When we later write a block with the
same content and the dedup property is still enabled on the dataset, we
check whether there is already an entry with the same hash in the DDT.
If there is, we increase the reference count on the DDT entry, modify
the BP to point at the existing block, and skip writing the data to the
VDEVs.
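
To make the comparison below concrete, here is a toy model of that
write path (illustrative names and a flat in-memory table, not the real
ddt.c code):

        #include <stdint.h>
        #include <string.h>

        #define DDT_HASHLEN     32      /* e.g. SHA-256 */
        #define DDT_TOYSIZE     1024

        typedef struct dde {
                uint8_t  dde_hash[DDT_HASHLEN]; /* strong checksum */
                uint64_t dde_refcount;  /* BPs referencing the block */
                uint64_t dde_vdev;      /* where the block lives */
                uint64_t dde_offset;
        } dde_t;

        static dde_t ddt[DDT_TOYSIZE];  /* toy table, no bounds checks */
        static int ddt_used;

        /*
         * Called with the hash of the block being written and the
         * location the allocator would use for it. Returns 1 if the
         * data must really be written to the VDEVs, 0 if an existing
         * copy was reused (in which case *vdevp/*offsetp are rewritten
         * to point at the existing block).
         */
        int
        ddt_write(const uint8_t *hash, uint64_t *vdevp, uint64_t *offsetp)
        {
                int i;

                for (i = 0; i < ddt_used; i++) {
                        if (memcmp(ddt[i].dde_hash, hash,
                            DDT_HASHLEN) != 0)
                                continue;
                        /* Duplicate content: bump refcount, reuse DVA. */
                        ddt[i].dde_refcount++;
                        *vdevp = ddt[i].dde_vdev;
                        *offsetp = ddt[i].dde_offset;
                        return (0);
                }
                /* First occurrence: remember it with a refcount of 1. */
                memcpy(ddt[ddt_used].dde_hash, hash, DDT_HASHLEN);
                ddt[ddt_used].dde_refcount = 1;
                ddt[ddt_used].dde_vdev = *vdevp;
                ddt[ddt_used].dde_offset = *offsetp;
                ddt_used++;
                return (1);
        }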

Can we reuse existing dedup machinery to implement this manual dedup?

Short answer: no.

Longer answer:

Dedup takes advantage of the fact that a reference to the block is
stored in the DDT when the block appears for the first time. So by the
time a duplicate is written, the block is already in the DDT and we
just need to increase its reference counter.

Here, we don't want to keep a table with references to all the blocks
(let's learn a lesson from dedup). We want to keep track of only the
blocks that are referenced more than once.

We cannot keep a reference counter in the BP of a given block, because
once we create another reference, we cannot modify the existing BP
pointing at this block.

The conclusion, at least for me, is pretty clear: we need a separate
table that keeps track of blocks referenced more than once. Let's call
it the Block Reference Table (BRT).

When a block is written in a normal way, this table is not touched.

When we create a reference to an already existing block, we look for an
entry in the BRT (in case the block has more than one reference
already). If we find it, we increase its reference count; if we don't,
we create an entry in the BRT with a refcount of 2. Note that all
entries in the BRT have refcount >= 2. Of course, we could agree that
refcount means "extra references" and start it at 1; it doesn't really
matter.

Random observations about the feature:

- It can be a pool-wide thing. I think it is acceptable to clone a block
from another dataset with a different compression, checksum, copies or
recordsize. It is not ok to clone a block between encrypted and
unencrypted datasets. It is ok to clone a block between two encrypted
datasets, but only if they share the same encryption key.

- It can be always enabled: if unused, the BRT is empty, so there
should be no performance impact. An optional toggle would only be
useful to control the use of this feature by users.

- It should affect neither write nor read performance. It may affect
block-freeing performance (and thus also overwriting), as we need to
consult the BRT on every level-0 plain-file-contents block free (see
the sketch after this list).

- If we had a toggle to turn it on or off for selected datasets, we
could have a bit in the BP (similar to dedup) to hint that we should
consult the BRT on free.

- The performance impact above should be much lower than with the DDT,
even for a large BRT, as a BRT entry is very small and entries can be
sorted sensibly (e.g. by vdev/offset, instead of being random like
dedup hashes).

- A BRT entry may contain only the VDEV id and offset, because when we
clone a block we have to provide the source BP anyway, which contains
all the other properties.

- Currently I don't see a way to make those block references survive
send/recv. It is similar to dedup, but dedup can reconstruct references
on receive based on hashes. Here we would need some kind of magic
mapping between source BPs and destination BPs.

- Maybe it is obvious, but let me mention that this will only work for
level-0 plain-file-contents blocks, just like dedup, so all indirect
blocks still have to be allocated.
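
Continuing the BRT sketch from above, the free path mentioned in the
observations could look roughly like this (toy code; it reuses
brt_entry_t, brt_head and brt_find() from the earlier sketch):

        static void
        brt_remove(uint64_t vdev, uint64_t offset)
        {
                brt_entry_t **brep, *bre;

                for (brep = &brt_head; (bre = *brep) != NULL;
                    brep = &bre->bre_next) {
                        if (bre->bre_vdev == vdev &&
                            bre->bre_offset == offset) {
                                *brep = bre->bre_next;
                                free(bre);
                                return;
                        }
                }
        }

        /* Returns 1 if the block should really be freed on disk. */
        int
        brt_block_free(uint64_t vdev, uint64_t offset)
        {
                brt_entry_t *bre;

                bre = brt_find(vdev, offset);
                if (bre == NULL)
                        return (1);     /* sole reference: free it */
                bre->bre_refcount--;
                if (bre->bre_refcount == 1) {
                        /*
                         * Down to a single reference; the entry is no
                         * longer needed (BRT entries are always >= 2).
                         */
                        brt_remove(vdev, offset);
                }
                return (0);     /* other references remain: keep it */
        }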

Some potential use cases:

- Quick cloning of big files (e.g. VM images) - the obvious one.

- Recovering accidentally deleted files from a snapshot (without the
need for a rollback/clone or copying, which would take additional
space).

- Moving files between datasets.

- This should work even within a single file, so imagine creating
space at the beginning of a file (in recordsize increments, though).

I have already started the implementation, but this is a hobby project
for me, so don't expect a very fast pace.

Your comments and thoughts are greatly appreciated.

-- 
Pawel Jakub Dawidek
