Okay, that's starting to seem a bit more relevant. It would be interesting to measure empirically what the mean & median block sizes for a real-world perkeep store are; if there's a high correlation between bits, one would expect these metrics to be lower than what a uniformly random hash function would provide. It would be valuable on its own to have documentation describing how to get a desired block size out of this statistically imperfect function.
I'm okay with having a statistically more compelling hash be our
"recommended" choice, including this one only for compatibility.

I started writing a spec for this hash (based on the rsync document, but
including the deviations made by perkeep & bup). I'll publish it
somewhere soon so we can collaborate.

-Ian

Quoting Bob Glickstein (2020-08-13 10:58:28)
> On Tue, Aug 11, 2020 at 3:38 PM Ian Denhardt <[1][email protected]>
> wrote:
>
>     I think it's valuable for this function to be one of the options
>     for compatibility
>
> I agree there, but I'm becoming increasingly convinced that it should
> not be the default, or even recommended.
>
> I've added a new level of analysis to my benchmark function as of
> [2]the latest commit: in addition to counting how often each bit is
> zero (which should approach 50%), it counts how often each pair of
> bits is correlated (which should also approach 50%). The results for
> rollsum are not great. On the other hand, the results for the other
> algorithms in [3]github.com/chmduquesne/rollinghash, which are now
> added to the benchmark, are great (except for adler32). Try running
> with and without the env var BENCHMARK_ROLLSUM_ANALYZE=1 to see the
> results.
>
> By the way, there's probably more sophisticated analysis that could be
> done on the distribution produced by these hashes but I suspect we're
> into diminishing returns after the pairwise bit correlations I'm now
> doing. I could be wrong though. Whether I am, and how else the results
> should be analyzed, are left as an exercise for other readers of this
> thread.
>
> Cheers,
> - Bob
>
>     and (2) I don't want to give too many options; a different hash
>     function is much more extra implementation work than a numeric
>     parameter, and if we add too many we've sort of missed the point
>     of standardization. So I'd only want to do this if there are clear
>     compelling use cases for each of the functions we include.
>     Whatever parameters we decide to add, we should pick a
>     default/recommended set of values for them.

References

1. mailto:[email protected]
2. https://github.com/bobg/hashsplit/commit/17195adda444fcc11e96cfd6058613edd88af5be
3. http://github.com/chmduquesne/rollinghash

--
You received this message because you are subscribed to the Google
Groups "Perkeep" group. To unsubscribe from this group and stop
receiving emails from it, send an email to
[email protected]. To view this discussion on the web
visit
https://groups.google.com/d/msgid/perkeep/159734342442.787.1979326535815718855%40localhost.localdomain.
