Pawel spent a fair amount of time discussing this with me, which is good 'cause I apparently had been confused.
The idea and implementation he suggests sound reasonable to me, and will (finally!) allow offline dedup :).

Sean.

> On Jun 13, 2020, at 12:52 PM, Pawel Jakub Dawidek <pa...@dawidek.net> wrote:
>
> For reference, see https://github.com/openzfs/zfs/issues/405
>
> The functionality will allow cloning a file very quickly by avoiding
> copying of the actual data and taking (almost) no additional space from
> the pool.
>
> Once the file is cloned this way, either the source or the destination
> can be modified without affecting the other copy.
>
> The closest analogy currently is dedup: when dedup is enabled on a
> given dataset and we store a block, we set the dedup bit in its block
> pointer (BP) and add an entry to the dedup table (DDT) referencing this
> block. This reference includes the block's checksum/hash (ideally a
> cryptographically strong one). When we write a block with the same
> content and the dedup property is still enabled on the dataset, we
> check whether there is already an entry with the same hash in the DDT.
> If there is, we increase the reference count on the DDT entry, modify
> the BP to point at the existing block, and skip writing the data to the
> VDEVs.
>
> Can we reuse the existing dedup machinery to implement this manual dedup?
>
> Short answer: no.
>
> Longer answer:
>
> Dedup takes advantage of the fact that a reference to the block is
> stored in the DDT when the block appears for the first time. So on
> write, the block is already in the DDT and we just need to increase the
> reference counter.
>
> Here, we don't want to keep a table with references to all the blocks
> (let's learn a lesson). We want to keep track of only the blocks that
> are referenced more than once.
>
> We cannot keep a reference counter in the BP for a given block, because
> once we create another reference, we cannot modify the existing BP
> pointing at this block.
>
> The conclusion, at least for me, is pretty clear: we need a separate
> table that keeps track of blocks referenced more than once.
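[Editor's note: the dedup write path described above can be sketched roughly as below. All names (`DDT`, `write_block`, the `allocate` callback) are hypothetical illustrations, not the actual OpenZFS C code; Python is used only for brevity.]

```python
# Toy sketch of the dedup-style write path: the DDT is populated on the
# *first* write of a block, so later identical writes only bump a refcount.
import hashlib

class DDT:
    """Toy dedup table: checksum -> {block pointer, refcount}."""
    def __init__(self):
        self.entries = {}

    def write_block(self, data, allocate):
        csum = hashlib.sha256(data).hexdigest()
        entry = self.entries.get(csum)
        if entry is not None:
            # Same content seen before: bump the refcount, reuse the
            # existing BP, and skip writing the data to the vdevs.
            entry["refcnt"] += 1
            return entry["bp"]
        # First occurrence: allocate, and record it in the DDT right away.
        bp = allocate(data)
        self.entries[csum] = {"bp": bp, "refcnt": 1}
        return bp

ddt = DDT()
bp1 = ddt.write_block(b"hello", allocate=lambda d: ("vdev0", 0x1000))
bp2 = ddt.write_block(b"hello", allocate=lambda d: ("vdev0", 0x2000))
assert bp1 == bp2  # the second write was deduplicated against the first
```

This also makes Pawel's point concrete: every block ever written lands in the table, which is exactly the cost the BRT design below avoids.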
> Let's call it the Block Reference Table (BRT).
>
> When a block is written in the normal way, this table is not touched.
>
> When we create a reference to an already existing block, we look for an
> entry in the BRT (in case the block already has more than one
> reference). If we find it, we increase the reference count; if we
> don't, we create an entry in the BRT with a refcount of 2. Note that
> all entries in the BRT have refcount >= 2. Of course, we can agree that
> refcount means "extra references" and make it start from 1. It doesn't
> really matter.
>
> Random observations about the feature:
>
> - It can be a pool-wide thing. I think it is acceptable to clone a
> block from another dataset with different compression, checksum,
> copies, or recordsize properties. It is not OK to clone a block between
> encrypted and unencrypted datasets. It is OK to clone a block between
> two encrypted datasets, but only if they share the same encryption key.
>
> - It can be always enabled: if unused, the BRT is empty, so there
> should be no performance impact. An optional toggle would only be
> useful to let users control the use of this feature.
>
> - It should affect neither write nor read performance. It may affect
> block-freeing performance (and therefore also overwriting), as we need
> to consult the BRT on every level-0 plain-file-contents block free.
>
> - If we had a toggle to turn it on or off for selected datasets, we
> could have a bit in the BP (similar to dedup) hinting that we should
> consult the BRT on free.
>
> - The performance impact above should be much lower than the DDT's,
> even for a large BRT, as a BRT entry is very small and entries can be
> sorted usefully (e.g. by vdev/offset, instead of being random like
> dedup hashes).
>
> - A BRT entry may contain only the VDEV id and offset, because when we
> clone a block we have to provide the source BP anyway, which contains
> all the other properties.
>
> - Currently I don't see a way to make those block references survive
> send/recv.
It is similar to dedup, but dedup can reconstruct references
> on receive based on hashes. Here we would need some kind of magic
> mapping between the source BP and the destination BP.
>
> - Maybe it is obvious, but let me mention that this will only work for
> level-0 plain-file-contents blocks, just like dedup, so all indirect
> blocks still have to be allocated.
>
> Some potential use cases:
>
> - Quick cloning of big files (e.g. VM images) - the obvious one.
>
> - Recovering accidentally deleted files from a snapshot (without the
> need for rollback/clone or copying, which takes additional space).
>
> - Moving files between datasets.
>
> - This should work even within a single file, so imagine creating
> space at the beginning of a file (in recordsize increments, though).
>
> I have already started the implementation, but this is a hobby project
> for me, so don't expect a very fast pace.
>
> Your comments and thoughts are greatly appreciated.
>
> --
> Pawel Jakub Dawidek

------------------------------------------
openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/Te62797341aee0806-Me95374fa175298d7cd20a07e
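[Editor's note: the BRT scheme proposed in the message above can be sketched roughly as below. All names (`BRT`, `clone_block`, `free_block`) are hypothetical illustrations, not the eventual OpenZFS implementation; Python is used only for brevity. The key contrast with the DDT is that the table holds entries only for blocks with more than one reference, keyed by (vdev, offset) rather than by content hash.]

```python
# Toy sketch of the Block Reference Table: only multiply-referenced
# blocks appear in it, and freeing consults it before releasing data.

class BRT:
    """Toy BRT: (vdev, offset) -> refcount, with all entries >= 2."""
    def __init__(self):
        self.entries = {}

    def clone_block(self, bp):
        # Cloning copies no data; it only records the extra reference.
        # A block absent from the BRT implicitly has one reference.
        key = (bp["vdev"], bp["offset"])
        self.entries[key] = self.entries.get(key, 1) + 1

    def free_block(self, bp):
        # Consult the BRT on free. Returns True when the on-disk data
        # may really be freed, False when other references remain.
        key = (bp["vdev"], bp["offset"])
        refcnt = self.entries.get(key)
        if refcnt is None:
            return True          # single reference: really free it
        if refcnt > 2:
            self.entries[key] = refcnt - 1
        else:
            del self.entries[key]  # back to a single (implicit) reference
        return False             # data stays allocated

brt = BRT()
bp = {"vdev": 0, "offset": 0x1000}
brt.clone_block(bp)                   # file cloned: refcount is now 2
assert brt.free_block(bp) is False    # one copy freed, data kept
assert brt.free_block(bp) is True     # last reference gone, data freed
```

Note how the free path matches the observation in the message: the only cost of the feature is this lookup on level-0 block frees, and an empty table makes it effectively free.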