Quoting o lu (2020-07-31 01:37:47)

> How does Perkeep know where to cut media files when deduping?

The key idea is: after each byte, compute a (non-cryptographic) hash of
the most recent N bytes seen so far. If the bottom M bits of the hash
are zero, split there, otherwise keep going. IIRC perkeep chooses N = 64
and M = 13.

For an intuition of why this works well, imagine the case where some
characters are inserted in the middle of a large file, which was
previously stored in several chunks. All of the chunks before the
location of the insertion will be the same. The insertion itself will
result in not re-using the chunk that contained that section of the
file, but because the split point is chosen based on the content at that
point, the new chunk will probably end at the same place the old one
did, and then for subsequent chunks we'll have "synchronized" with the
old copy, allowing us to re-use the old chunks after the modification as
well.

The particular hash function used is designed to make incrementally
hashing a sliding window like this efficient, but for the purposes of
choosing a "good" split point you could use any hash function, it would
just be slower. The code for the hash function is in the package
go4.org/rollsum

The full implementation takes into account some other details -- things
like enforcing a maximum & minimum chunk size -- but the above is where
the magic comes from. The actual code that does the splitting is in
writeFileChunks in pkg/schema/filewriter.go, if you're interested in all
the gory details.

I have kicking around on my hard drive a WIP library that just provides
the splitting algorithm, without being entangled with the rest of the
file upload logic. I should finish that up and publish it...

Hope this is useful,

-Ian

-- 
You received this message because you are subscribed to the Google Groups 
"Perkeep" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/perkeep/159618442004.25997.10874365701907187897%40localhost.localdomain.

Reply via email to