Okay, that's starting to seem a bit more relevant. It would be interesting to measure empirically what the mean & median block sizes for a real-world perkeep store are; if there's a high correlation between bits, one would expect these metrics to be lower than what a uniformly random hash function would provide. It would be valuable on its own to have documentation describing how to get a desired block size out of this statistically imperfect function.
I'm okay with having a statistically more compelling hash be our
"recommended" choice, including this one only for compatibility.

I started writing a spec for this hash (based on the rsync document, but
including the deviations made by perkeep & bup). I'll publish it
somewhere soon so we can collaborate.

-Ian

Quoting Bob Glickstein (2020-08-13 10:58:28)
> On Tue, Aug 11, 2020 at 3:38 PM Ian Denhardt <[1][email protected]>
> wrote:
>
>     I think it's valuable for this function to be one of the options
>     for compatibility
>
> I agree there, but I'm becoming increasingly convinced that it should
> not be the default, or even recommended.
>
> I've added a new level of analysis to my benchmark function as of
> [2]the latest commit: in addition to counting how often each bit is
> zero (which should approach 50%), it counts how often each pair of
> bits is correlated (which should also approach 50%). The results for
> rollsum are not great. On the other hand, the results for the other
> algorithms in [3]github.com/chmduquesne/rollinghash, which are now
> added to the benchmark, are great (except for adler32). Try running
> with and without the env var BENCHMARK_ROLLSUM_ANALYZE=1 to see the
> results.
>
> By the way, there's probably more sophisticated analysis that could be
> done on the distribution produced by these hashes but I suspect we're
> into diminishing returns after the pairwise bit correlations I'm now
> doing. I could be wrong though. Whether I am, and how else the results
> should be analyzed, are left as an exercise for other readers of this
> thread.
>
> Cheers,
> - Bob
>
>     and (2) I don't want to give too many options; a different hash
>     function is much more extra implementation work than a numeric
>     parameter, and if we add too many we've sort of missed the point
>     of standardization. So I'd only want to do this if there are clear
>     compelling use cases for each of the functions we include.
>     Whatever parameters we decide to add, we should pick a
>     default/recommended set of values for them.

References

1. mailto:[email protected]
2. https://github.com/bobg/hashsplit/commit/17195adda444fcc11e96cfd6058613edd88af5be
3. http://github.com/chmduquesne/rollinghash

--
You received this message because you are subscribed to the Google
Groups "Perkeep" group. To unsubscribe from this group and stop
receiving emails from it, send an email to
[email protected]. To view this discussion on the web
visit
https://groups.google.com/d/msgid/perkeep/159734342442.787.1979326535815718855%40localhost.localdomain.
