Quoting Bob Glickstein (2020-08-09 13:49:15)

>    I heartily agree! And thanks for the suggestion. In order to contribute
>    to such a specification I would first need to educate myself on the
>    pluses and minuses of different rolling hashes (e.g. beginning
>    [2]here). What I have right now in my hashsplit package is a little
>    cargo-culty.

I spent a bit of time researching, and it looks like there's a long
history of cargo culting; go4.org/rollsum looks like it's an exact
transliteration of bup's implementation:

https://github.com/bup/bup/blob/bb092f8a5c148534655b6a709896d853e3587dc8/lib/bup/bupsplit.c

Which in turn is almost exactly the rsync algorithm:

https://rsync.samba.org/tech_report/node3.html

...the only difference being that the bytes are shifted by 31, per the
comment in bup's source:

// According to librsync/rollsum.h:
// "We should make this something other than zero to improve the
// checksum algorithm: tridge suggests a prime number."
// apenwarr: I unscientifically tried 0 and 7919, and they both ended up
// slightly worse than the librsync value of 31 for my arbitrary test data.
#define ROLLSUM_CHAR_OFFSET 31

Frankly, I think we should just stick to this hash for any
formalization: it works fine and there are apparently already a couple
implementations -- for the hash itself I think we should just nab the
rsync doc, stick the char offset constant in those formulae and be done
with it.

Note that bup's splitting function appears to look for a number of
bits that are *set*, rather than bits that are clear.

Documenting the splitting algorithm on top of that is I think
the more significant amount of work, as well as figuring out how much we
want to be configurable, do we want to try to capture the parameters of
existing implementations, etc.

-Ian

-- 
You received this message because you are subscribed to the Google Groups 
"Perkeep" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/perkeep/159711398000.29451.16582719780928063142%40localhost.localdomain.

Reply via email to