Quoting Joe Moore (2020-08-14 12:21:34)
> I know why splitting blobs is useful (performance, scale, dedup) but is
> there a reason why all backends need to use the same
> algorithm/settings?
> If I'm pulling a file from Amazon glacier, and being charged
> per-file-requested, I'd rather have the file in large chunks.
> If it's a video file, there may be better ways to split it up (such as
> on key frames) than just looking at the rolling hash.
> If it's just tweets, why ever bother calculating the rollsum, it's
> never going to be that big.
I think making the constants (things like target block size and min/max chunk sizes) tunable is fine, especially given that there are already systems out in the wild using different parameters here; libraries building this stuff ought to allow the caller to provide those parameters. It's easy enough to do, since they're just a choice of numbers (a sketch of what that could look like follows below, after the sign-off). I think there's a stronger reason to avoid a proliferation of hash functions, though, as each additional hash adds meaningful implementation burden. We may still want to specify a number of options to capture existing systems, but we should avoid adding hash functions without clear use cases.

> Proposal:
> Why not let the backend blobstore service take care of chunking a blob
> into pieces however it makes sense?

I can think of a couple of reasons why hiding the splits from the client might not be desirable:

- If the client doesn't know about the splitting algorithm, it has to transfer whole files, every time, potentially wasting lots of bandwidth. If the splitting is done client side, only modified blocks need to be transferred (the second sketch below illustrates this).

- If the splits are opaque, transferring from one store to another using only public interfaces basically has to re-duplicate the blobs, as all of the files are stitched back together, transferred, and then split again. If you do this really naively it could result in potentially exponential space blowup, but even being smart about only copying whole files once, this could be very expensive.

As a point of reference, I migrated to perkeep from a backup tool I'd written myself that did hash-based dedup at the file level only, and the perkeep version used about half the storage of the old system.

Your proposal may be fine for some use cases, but I still think there's value in being able to replicate the same split without using the same piece of software.

-Ian
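To make the "just a choice of numbers" point concrete, here is a minimal sketch (in Go, since that's Perkeep's language) of a content-defined chunker whose parameters come from the caller. The Params names and the hash are invented for illustration; this is not Perkeep's actual rollsum. The property that matters is that any two implementations agreeing on the parameters and the hash cut identical offsets.

    package chunker

    // Params holds the caller-supplied knobs. Illustrative names,
    // not an existing Perkeep API.
    type Params struct {
            MinSize    int  // never cut a chunk smaller than this
            MaxSize    int  // always cut by this point, boundary or not
            TargetBits uint // boundary when the low TargetBits of the
                            // hash are all ones (~2^TargetBits avg size)
    }

    // Split returns the end offset of each chunk in data. Two
    // implementations that agree on Params and on the hash below will
    // agree on these offsets, which is what lets a different piece of
    // software reproduce the same split.
    func Split(data []byte, p Params) []int {
            var cuts []int
            var h uint32
            start := 0
            mask := uint32(1)<<p.TargetBits - 1
            for i, b := range data {
                    // Toy hash for illustration; real systems (bup,
                    // Perkeep) use a windowed rolling checksum here.
                    h = h<<1 + uint32(b)
                    n := i - start + 1
                    if n < p.MinSize {
                            continue
                    }
                    if h&mask == mask || n >= p.MaxSize {
                            cuts = append(cuts, i+1)
                            start = i + 1
                            h = 0
                    }
            }
            if start < len(data) {
                    cuts = append(cuts, len(data))
            }
            return cuts
    }

With, say, MinSize = 64<<10, MaxSize = 1<<20, and TargetBits = 17, chunks average around 128 KiB, and anyone else running the same numbers over the same bytes gets the same cuts.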
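And to make the bandwidth bullet concrete, here's a hypothetical client-side upload loop on top of the Split sketch above. The Store interface and its Has/Put methods are made up for this example (Perkeep's real blobserver protocol differs); the point is just that after a local edit, only chunks whose hashes the server hasn't seen get sent.

    import "crypto/sha256"

    // Store is a stand-in for a blob server; these method names are
    // invented for this sketch.
    type Store interface {
            Has(sum [32]byte) bool
            Put(sum [32]byte, chunk []byte) error
    }

    // upload splits data client-side (reusing Split and Params from
    // the sketch above) and transfers only the chunks the store is
    // missing, so editing the middle of a large file re-sends a
    // handful of chunks rather than the whole thing.
    func upload(data []byte, p Params, store Store) error {
            prev := 0
            for _, end := range Split(data, p) {
                    chunk := data[prev:end]
                    sum := sha256.Sum256(chunk)
                    if !store.Has(sum) {
                            if err := store.Put(sum, chunk); err != nil {
                                    return err
                            }
                    }
                    prev = end
            }
            return nil
    }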
