The client chunks so that it can hash each chunk and check whether that chunk 
is already at the backend, saving transfer costs.
Different clients should chunk the same way; otherwise this "already there" 
case would be rare.
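To make the point concrete, here is a toy in-memory sketch of that dedup check (names like `upload_chunk` and the dict-backed store are hypothetical illustrations, not Perkeep's actual blobserver protocol):

```python
import hashlib

# Hypothetical in-memory backend: maps chunk hash -> chunk bytes.
backend = {}

def upload_chunk(chunk: bytes) -> bool:
    """Upload a chunk unless the backend already has it.

    Returns True only when bytes were actually transferred."""
    digest = hashlib.sha256(chunk).hexdigest()
    if digest in backend:
        return False  # already there: transfer saved
    backend[digest] = chunk
    return True

print(upload_chunk(b"chunk A"))  # True: new chunk, uploaded
print(upload_chunk(b"chunk A"))  # False: deduplicated, nothing sent
```

Note that the savings only materialize if two clients storing the same data produce the same chunk boundaries, which is why the chunking algorithm matters.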
________________________________
From: [email protected] <[email protected]> on behalf of Joe 
Moore <[email protected]>
Sent: Friday, August 14, 2020 6:21:34 PM
To: [email protected] <[email protected]>
Subject: Re: Cutting files

I'm a bit late to this part of the discussion, but I think this is relevant:

I know why splitting blobs is useful (performance, scale, dedup) but is there a 
reason why all backends need to use the same algorithm/settings?

If I'm pulling a file from Amazon Glacier, and being charged per file 
requested, I'd rather have the file in large chunks.

If it's a video file, there may be better ways to split it up (such as on key 
frames) than just looking at the rolling hash.

If it's just tweets, why bother calculating the rollsum at all? The data is 
never going to be that big.

Proposal:
Why not let the backend blobstore service take care of chunking a blob into 
pieces however it makes sense?

The use cases above (huge single-chunk file, content-aware splitting, optimize 
for small blobs) naturally resolve with different parameters.

For the developers, you could replicate from blobstore-with-rollsum (dechunk, 
transfer data, chunk differently) over to blobstore-with-different-rollsum to 
compare performance or sizes or whatever.

It would make for a very simple existing-filesystem layer: set the split size 
to infinity and point the backend at existing (whole) files.

The downside, as I see it, is the need to pick a reasonable default for this 
config option on blobservers that choose to implement this parameter.
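Joe's proposal amounts to making the chunking policy a property of the backend. A toy sketch of two such policies (the `fixed_chunks` function and its parameters are hypothetical illustrations, not a Perkeep API):

```python
import math

def fixed_chunks(data: bytes, size: float):
    """Hypothetical per-backend chunking policy: fixed-size splitting,
    where a split size of infinity yields the whole blob as a single
    chunk (the existing-filesystem layer described above)."""
    if math.isinf(size):
        yield data
        return
    step = int(size)
    for i in range(0, len(data), step):
        yield data[i:i + step]

blob = b"x" * 100
print(len(list(fixed_chunks(blob, 32))))        # 4 chunks: small-chunk backend
print(len(list(fixed_chunks(blob, math.inf))))  # 1 chunk: Glacier-style backend
```

A content-aware splitter (key frames for video, rolling hash for general data) would be another implementation of the same per-backend policy.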

--Joe

On Thu, Aug 13, 2020, 2:30 PM Ian Denhardt 
<[email protected]> wrote:
Okay, that's starting to seem a bit more relevant.

It would be interesting to measure empirically what the mean & median
block sizes for a real-world perkeep store are; if there's a high
correlation between bits one would expect these metrics to be lower
than a uniformly random hash function would provide. It would be
valuable on its own to have documentation describing how to get a
desired block size with this statistically imperfect function.
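As a baseline for that comparison: with an ideally uniform rolling hash that splits whenever the low n bits of the hash are zero, block sizes are geometrically distributed with mean about 2**n. A simulation sketch of that uniform baseline (the function name and parameters are illustrative, not from the perkeep codebase):

```python
import random

def mean_block_size(nbits: int, n: int = 200_000, seed: int = 1) -> float:
    """Simulate an ideally uniform rolling hash: emit one "hash" per
    byte position and split whenever its low nbits are all zero.
    The mean block size should approach 2**nbits."""
    rng = random.Random(seed)
    mask = (1 << nbits) - 1
    sizes, run = [], 0
    for _ in range(n):
        run += 1
        if rng.getrandbits(32) & mask == 0:
            sizes.append(run)
            run = 0
    return sum(sizes) / len(sizes)

print(round(mean_block_size(8)))  # close to 256 for a uniform hash
```

Measuring a real store against this baseline would show how far rollsum's bit correlations pull the observed block sizes away from 2**n.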

I'm okay with having a statistically more compelling hash be our
"recommended" choice, including this one only for compatibility.

I started writing a spec for this hash (based on the rsync document, but
including the deviations made by perkeep & bup). I'll publish it
somewhere soon so we can collaborate.

-Ian

Quoting Bob Glickstein (2020-08-13 10:58:28)
>    On Tue, Aug 11, 2020 at 3:38 PM Ian Denhardt
> <[1][email protected]>
>    wrote:
>
>      I think
>      it's valuable for this function to be one of the options for
>      compatibility
>
>    I agree there, but I'm becoming increasingly convinced that it should
>    not be the default, or even recommended.
>    I've added a new level of analysis to my benchmark function as of
>    [2]the latest commit: in addition to counting how often each bit is
>    zero (which should approach 50%), it counts how often each pair of bits
>    agrees (which should also approach 50%). The results for rollsum
>    are not great. On the other hand, the results for the other algorithms
>    in� 
> [3]github.com/chmduquesne/rollinghash<http://github.com/chmduquesne/rollinghash>,
>  which are now added to the
>    benchmark, are great (except for adler32). Try running with and without
>    the env var� BENCHMARK_ROLLSUM_ANALYZE=1 to see the� results.
>    By the way, there's probably more sophisticated analysis that could be
>    done on the distribution produced by these hashes but I suspect we're
>    into diminishing returns after the pairwise bit correlations I'm now
>    doing. I could be wrong though. Whether I am, and how else the results
>    should be analyzed, are left as an exercise for other readers of this
>    thread.
>    Cheers,
>    - Bob
>
>      and (2) I don't want to give too many options; a
>      different hash function is much more extra implementation work than
>      a
>      numeric parameter, and if we add too many we've sort of missed the
>      point of standardization. So I'd only want to do this if there are
>      clear
>      compelling use cases for each of the functions we include.
>      Whatever parameters we decide to add, we should pick a
>      default/recommended set of values for them.
>
> References
>
>    1. mailto:[email protected]
>    2. 
> https://github.com/bobg/hashsplit/commit/17195adda444fcc11e96cfd6058613edd88af5be
>    3. http://github.com/chmduquesne/rollinghash
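A minimal sketch of the two checks Bob describes: how often each bit of a hash's output is zero, and how often each pair of bits agrees; for a well-mixed hash both fractions approach 50%. SHA-256 stands in here as the "good" reference (the helper name and 16-bit truncation are illustrative choices, not Bob's actual benchmark code):

```python
import hashlib
from itertools import combinations

def bit_stats(values, nbits=16):
    """Fraction of samples where each bit is zero, and fraction where
    each pair of bits takes the same value. Both should be near 0.5
    for a statistically good hash."""
    n = len(values)
    zero = [sum(1 for v in values if not (v >> i) & 1) / n
            for i in range(nbits)]
    agree = {(i, j): sum(1 for v in values
                         if ((v >> i) & 1) == ((v >> j) & 1)) / n
             for i, j in combinations(range(nbits), 2)}
    return zero, agree

# 256 sample hashes, truncated to 16 bits each.
vals = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:2], "big")
        for b in range(256)]
zero, agree = bit_stats(vals)
print(min(zero), max(zero))  # both hover near 0.5 for SHA-256
```

Swapping the SHA-256 samples for rollsum outputs over real input windows is where the imbalance Bob reports would show up.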

--
You received this message because you are subscribed to the Google Groups 
"Perkeep" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/perkeep/159734342442.787.1979326535815718855%40localhost.localdomain.
