Re: Cutting files

Bob Glickstein Tue, 11 Aug 2020 08:01:22 -0700

On Mon, Aug 10, 2020 at 7:46 PM Ian Denhardt <[email protected]> wrote:


> Frankly, I think we should just stick to this hash for any
> formalization


I have been on the fence about this. On the plus side, there's a lot of
existing code, so no need to invent something new. On the minus side, as
you pointed out the existing implementations are all slightly mutually
incompatible, so what is there to preserve, exactly?

Then just now I did an experiment that convinced me:
https://github.com/bobg/hashsplit/commit/5a64185e331e11fb69800411edd763ad0a6f6bf4

If you run my new benchmark with environment
variable BENCHMARK_ROLLSUM_COUNT_ZEROES=1, it counts how often each bit of
the digest is 0. Should be about 50% for every bit, all the time, right?
But look:

    BenchmarkRollsum: rollsum_test.go:25: using seed 1597156860
    BenchmarkRollsum: rollsum_test.go:52: with b.N == 8493040:
    BenchmarkRollsum: rollsum_test.go:54:   bit 0 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 1 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 2 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 3 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 4 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 5 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 6 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 7 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 8 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 9 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 10 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 11 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 12 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 13 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 14 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 15 is zero 48.3% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 16 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 17 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 18 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 19 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 20 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 21 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 22 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 23 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 24 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 25 is zero 50.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 26 is zero 46.7% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 27 is zero 56.1% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 28 is zero 99.9% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 29 is zero 0.0% of the time
    BenchmarkRollsum: rollsum_test.go:54:   bit 30 is zero 100.0% of the
time
    BenchmarkRollsum: rollsum_test.go:54:   bit 31 is zero 100.0% of the
time

The highest few bits do not exhibit good randomness.

Now the hunt is on for a better rolling checksum: one that's fast (this
one's only about 6-7 ns on my computer for each Roll/Digest call pair, when
running with BENCHMARK_ROLLSUM_COUNT_ZEROES turned off), where every bit is
zero 50% of the time, and where there is little or no measurable
correlation between bits (a test I haven't written yet).

You might argue we shouldn't care about the highest few bits for
hashsplitting purposes - we seldom care about more than 20 or so of the
lowest bits - but I don't feel ready to settle for that until we've
evaluated some other candidate algorithms.

Cheers,
- Bob

it works fine and there are apparently already a couple
> implementations -- for the hash itself I think we should just nab the
> rsync doc, stick the char offset constant in those formulae and be done
> with it.
>
> Note that bup's splitting function appears to look for a number of
> bits that are *set*, rather than bits that are clear.
>
> Documenting the splitting algorithm on top of that is I think
> the more significant amount of work, as well as figuring out how much we
> want to be configurable, do we want to try to capture the parameters of
> existing implementations, etc.
>
> -Ian
>

-- 
You received this message because you are subscribed to the Google Groups 
"Perkeep" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/perkeep/CAEf8c4_BzsTso7462Hv8O0oR6zLM05mXXi%3Dpw2y5HmTMW5yp1A%40mail.gmail.com.

Re: Cutting files

Reply via email to