Quoting o lu (2020-07-31 01:37:47) > How does Perkeep know where to cut media files when deduping?
The key idea is: after each byte, compute a (non-cryptographic) hash of the most recent N bytes seen so far. If the bottom M bits of the hash are zero, split there, otherwise keep going. IIRC perkeep chooses N = 64 and M = 13. For an intuition of why this works well, imagine the case where some characters are inserted in the middle of a large file, which was previously stored in several chunks. All of the chunks before the location of the insertion will be the same. The insertion itself will result in not re-using the chunk that contained that section of the file, but because the split point is chosen based on the content at that point, the new chunk will probably end at the same place the old one did, and then for subsequent chunks we'll have "synchronized" with the old copy, allowing us to re-use the old chunks after the modification as well. The particular hash function used is designed to make incrementally hashing a sliding window like this efficient, but for the purposes of choosing a "good" split point you could use any hash function, it would just be slower. The code for the hash function is in the package go4.org/rollsum The full implementation takes into account some other details -- things like enforcing a maximum & minimum chunk size -- but the above is where the magic comes from. The actual code that does the splitting is in writeFileChunks in pkg/schema/filewriter.go, if you're interested in all the gory details. I have kicking around on my hard drive a WIP library that just provides the splitting algorithm, without being entangled with the rest of the file upload logic. I should finish that up and publish it... Hope this is useful, -Ian -- You received this message because you are subscribed to the Google Groups "Perkeep" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/perkeep/159618442004.25997.10874365701907187897%40localhost.localdomain.
