On Mon, Aug 10, 2020 at 7:46 PM Ian Denhardt <[email protected]> wrote:
> Frankly, I think we should just stick to this hash for any > formalization I have been on the fence about this. On the plus side, there's a lot of existing code, so no need to invent something new. On the minus side, as you pointed out the existing implementations are all slightly mutually incompatible, so what is there to preserve, exactly? Then just now I did an experiment that convinced me: https://github.com/bobg/hashsplit/commit/5a64185e331e11fb69800411edd763ad0a6f6bf4 If you run my new benchmark with environment variable BENCHMARK_ROLLSUM_COUNT_ZEROES=1, it counts how often each bit of the digest is 0. Should be about 50% for every bit, all the time, right? But look: BenchmarkRollsum: rollsum_test.go:25: using seed 1597156860 BenchmarkRollsum: rollsum_test.go:52: with b.N == 8493040: BenchmarkRollsum: rollsum_test.go:54: bit 0 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 1 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 2 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 3 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 4 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 5 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 6 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 7 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 8 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 9 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 10 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 11 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 12 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 13 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 14 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 15 is zero 48.3% of the time BenchmarkRollsum: rollsum_test.go:54: bit 16 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 17 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 18 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 19 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 20 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 21 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 22 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 23 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 24 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 25 is zero 50.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 26 is zero 46.7% of the time BenchmarkRollsum: rollsum_test.go:54: bit 27 is zero 56.1% of the time BenchmarkRollsum: rollsum_test.go:54: bit 28 is zero 99.9% of the time BenchmarkRollsum: rollsum_test.go:54: bit 29 is zero 0.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 30 is zero 100.0% of the time BenchmarkRollsum: rollsum_test.go:54: bit 31 is zero 100.0% of the time The highest few bits do not exhibit good randomness. Now the hunt is on for a better rolling checksum: one that's fast (this one's only about 6-7 ns on my computer for each Roll/Digest call pair, when running with BENCHMARK_ROLLSUM_COUNT_ZEROES turned off), where every bit is zero 50% of the time, and where there is little or no measurable correlation between bits (a test I haven't written yet). You might argue we shouldn't care about the highest few bits for hashsplitting purposes - we seldom care about more than 20 or so of the lowest bits - but I don't feel ready to settle for that until we've evaluated some other candidate algorithms. Cheers, - Bob it works fine and there are apparently already a couple > implementations -- for the hash itself I think we should just nab the > rsync doc, stick the char offset constant in those formulae and be done > with it. > > Note that bup's splitting function appears to look for a number of > bits that are *set*, rather than bits that are clear. > > Documenting the splitting algorithm on top of that is I think > the more significant amount of work, as well as figuring out how much we > want to be configurable, do we want to try to capture the parameters of > existing implementations, etc. > > -Ian > -- You received this message because you are subscribed to the Google Groups "Perkeep" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/perkeep/CAEf8c4_BzsTso7462Hv8O0oR6zLM05mXXi%3Dpw2y5HmTMW5yp1A%40mail.gmail.com.
