Quoting Joe Moore (2020-08-14 12:21:34)

>    I know why splitting blobs is useful (performance, scale, dedup) but is
>    there a reason why all backends need to use the same
>    algorithm/settings?
>    If I'm pulling a file from Amazon glacier, and being charged
>    per-file-requested, I'd rather have the file in large chunks.
>    If it's a video file, there may be better ways to split it up (such as
>    on key frames) than just looking at the rolling hash.
>    If it's just tweets, why ever bother calculating the rollsum, it's
>    never going to be that big.

I think making constants (things like target block size and min/max
sizes) tunable is fine, and especially given that there are already
systems out in the wild using different parameters here, libraries
building this stuff ought to let the caller provide these parameters.
It's easy enough to do, since they're just numeric choices.
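As a sketch of what caller-supplied parameters could look like, here's a
toy content-defined chunker (the rolling hash, window size, and default
numbers are all made up for illustration; this is not Perkeep's actual
algorithm):

```python
import random

WINDOW = 48                # sliding-window width in bytes (illustrative)
PRIME = 0x01000193         # arbitrary odd multiplier (the FNV-1 prime)
MOD = 1 << 32
POW = pow(PRIME, WINDOW, MOD)  # weight of the byte leaving the window

def split(data, min_size=2048, max_size=65_536, target_bits=13):
    """Content-defined chunking with caller-supplied parameters.

    Cuts where the low `target_bits` bits of a rolling polynomial hash
    are all ones, giving an average chunk of roughly 2**target_bits
    bytes, clamped to [min_size, max_size].
    """
    mask = (1 << target_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * PRIME + b) % MOD
        if i - start >= WINDOW:
            # Drop the byte that just slid out of the window, so the
            # hash depends only on the most recent WINDOW bytes.
            h = (h - data[i - WINDOW] * POW) % MOD
        size = i - start + 1
        if size >= max_size or (size >= min_size and h & mask == mask):
            chunks.append(bytes(data[start:i + 1]))
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(bytes(data[start:]))
    return chunks

random.seed(1)
data = bytes(random.getrandbits(8) for _ in range(200_000))
pieces = split(data)  # default ~8 KiB average chunks
# Fewer, larger chunks for a per-request-billed store like Glacier:
big = split(data, min_size=16_384, max_size=262_144, target_bits=17)
```

Two implementations that agree on the hash function and those three
numbers will produce identical splits, which is what makes the numbers,
rather than the whole algorithm, the natural thing to tune per backend.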

I think there's a stronger reason to avoid a proliferation of hash
functions though, as each additional hash adds meaningful implementation
burden. We may still want to specify a number of options to capture
existing systems, but we should avoid adding hash functions without
clear use cases.

>    Proposal:
>    Why not let the backend blobstore service take care of chunking a blob
>    into pieces however it makes sense?

I can think of a couple reasons why hiding the splits from the client
might not be desirable:

- If the client doesn't know about the splitting algorithm, it has to
  transfer whole files, every time, potentially wasting lots of
  bandwidth. If the splitting is done client side, only modified blocks
  need to be transferred.
- If the splits are opaque, transferring from one store to another using
  only public interfaces basically has to re-duplicate the blobs, as all
  of the files are stitched back together, transferred, and then split
  again. If you do this really naively it could result in potentially
  exponential space blowup, but even being smart about only copying each
  whole file once, this could be very expensive. As a point of
  reference, I migrated to perkeep from a backup tool I'd written myself
  that did hash-based dedup at the file level only, and the perkeep
  version used about half the storage of the old system.
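To put a rough number on the bandwidth point, here's a toy demonstration
(same caveat as before: a generic rolling-hash chunker with illustrative
parameters, not Perkeep's real scheme). A one-byte edit near the end of
a file leaves every chunk that ends before the edit bit-identical, so a
client that knows the splitting algorithm only re-sends the few chunks
at or after the change:

```python
import random

WINDOW, PRIME, MOD = 48, 0x01000193, 1 << 32
POW = pow(PRIME, WINDOW, MOD)  # weight of the byte leaving the window

def split(data, min_size=256, max_size=4096, target_bits=10):
    # Toy content-defined chunker: cut where the low bits of a rolling
    # polynomial hash are all ones (parameters are illustrative).
    mask = (1 << target_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * PRIME + b) % MOD
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD  # byte left window
        size = i - start + 1
        if size >= max_size or (size >= min_size and h & mask == mask):
            chunks.append(bytes(data[start:i + 1]))
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(bytes(data[start:]))
    return chunks

random.seed(0)
original = bytes(random.getrandbits(8) for _ in range(100_000))
edited = bytearray(original)
edited[90_000] ^= 0xFF          # one-byte edit near the end of the file
edited = bytes(edited)

old, new = split(original), split(edited)
shared = sum(1 for a, b in zip(old, new) if a == b)
# Every chunk that ends before byte 90,000 is bit-identical, so a
# syncing client would re-upload only the chunks at/after the edit.
print(f"{shared} of {len(new)} chunks unchanged")
```

With a server-side, opaque split, the client can't know any of this and
has to push the whole file again.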

Your proposal may be fine for some use cases, but I still think there's
value in being able to replicate the same split without using the same
piece of software.

-Ian

-- 
You received this message because you are subscribed to the Google Groups 
"Perkeep" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/perkeep/159743125197.2217.14685993393169303496%40localhost.localdomain.