Re: [PATCH 2/2] diff.c: get rid of duplicate implementation

2017-10-30 Thread Jeff King
On Mon, Oct 30, 2017 at 10:59:41AM -0700, Jeff King wrote:

> > There's also https://www.strchr.com/hash_functions, which lists DJB2
> > as Bernstein.  Both functions rank somewhere in the middle of that list.
> 
> FWIW, I did some experiments with Murmur3 and SipHash a while back, but
> I don't think I came up with anything faster than the existing code.
> OTOH, moving to SipHash gives us the ability to randomize the hashes,
> which can resist some DoS attacks (although as I've said before,
> computing arbitrary diffs for untrusted strangers is pretty much a
> DoS-in-a-box).

By the way, one of the things that complicates plugging new functions
into xdiff's hashing is that xdl_hash_record() simultaneously computes
the hash _and_ finds the end-of-line marker.

So the "siphash is only 10% slower" number I showed came with quite a
few contortions to do both. Here it is compared to a more naive
application of the siphash code (i.e., memchr to find end-of-line, and
then feed the resulting bytes to siphash):

Test  originHEAD^   
 jk/xdl-siphash-wip
---
4000.1: log -3000 (baseline)  0.05(0.05+0.00)   0.05(0.05+0.00) +0.0%   
 0.05(0.05+0.00) +0.0%
4000.2: log --raw -3000 (tree-only)   0.31(0.27+0.03)   0.31(0.27+0.03) +0.0%   
 0.31(0.27+0.03) +0.0%
4000.3: log -p -3000 (Myers)  2.06(2.01+0.05)   2.30(2.21+0.09) +11.7%  
 2.96(2.91+0.04) +43.7%
4000.4: log -p -3000 --histogram  2.44(2.38+0.06)   2.67(2.60+0.07) +9.4%   
 3.32(3.26+0.06) +36.1%
4000.5: log -p -3000 --patience   2.57(2.47+0.09)   2.90(2.82+0.08) +12.8%  
 3.48(3.43+0.05) +35.4%

There "origin" is the existing djb hash, "HEAD^" is the complicated
"fast" siphash (which I very well may have screwed up), and the final is
the more naive version, which is quite a bit slower.

-Peff


Re: [PATCH 2/2] diff.c: get rid of duplicate implementation

2017-10-30 Thread Jeff King
On Thu, Oct 26, 2017 at 07:43:19PM +0200, René Scharfe wrote:

> Am 25.10.2017 um 20:49 schrieb Stefan Beller:
> > The implementations in diff.c to detect moved lines needs to compare
> > strings and hash strings, which is implemented in that file, as well
> > as in the xdiff library.
> > 
> > Remove the rather recent implementation in diff.c and rely on the well
> > exercised code in the xdiff lib.
> > 
> > With this change the hash used for bucketing the strings for the moved
> > line detection changes from FNV32 (that is provided via the hashmaps
> > memhash) to DJB2 (which is used internally in xdiff).  Benchmarks found
> > on the web[1] do not indicate that these hashes are different in
> > performance for readable strings.
> > 
> > [1] 
> > https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
> 
> Awesome comparison!  It calls the variant used in libxdiff DJB2a (which
> uses xor to mix in the new char) instead of DJB2 (which uses addition).
> 
> There's also https://www.strchr.com/hash_functions, which lists DJB2
> as Bernstein.  Both functions rank somewhere in the middle of that list.

FWIW, I did some experiments with Murmur3 and SipHash a while back, but
I don't think I came up with anything faster than the existing code.
OTOH, moving to SipHash gives us the ability to randomize the hashes,
which can resist some DoS attacks (although as I've said before,
computing arbitrary diffs for untrusted strangers is pretty much a
DoS-in-a-box).

Anyway, I rebased them and ran p4000[1], with does show some results:

Test  originjk/xdl-murmur3-wip  
 jk/xdl-siphash-wip
---
4000.1: log -3000 (baseline)  0.05(0.05+0.00)   0.05(0.05+0.00) +0.0%   
 0.05(0.05+0.00) +0.0% 
4000.2: log --raw -3000 (tree-only)   0.30(0.28+0.02)   0.30(0.28+0.01) +0.0%   
 0.30(0.28+0.02) +0.0% 
4000.3: log -p -3000 (Myers)  2.06(1.98+0.08)   1.90(1.85+0.05) -7.8%   
 2.32(2.25+0.07) +12.6%
4000.4: log -p -3000 --histogram  2.50(2.42+0.07)   2.25(2.21+0.04) -10.0%  
 2.70(2.62+0.08) +8.0% 
4000.5: log -p -3000 --patience   2.58(2.52+0.06)   2.47(2.42+0.04) -4.3%   
 2.86(2.77+0.08) +10.9%

So it looks like murmur3 is indeed a little faster. SipHash is slower,
which is too bad (because the randomization would be nice). I _think_
back then I compared to XDL_FAST_HASH, which was a little faster even
than murmur3 (but generated too many collisions, which led to bad
behavior for certain cases).

Anyway, those branches are at https://github.com/peff/git if anybody
wants to look further. They probably need some cleanup. At the very
least, we'd probably need to teach the whitespace-ignoring hash function
to use the same algorithm.

-Peff

[1] Actually, I added "--no-merges" to each command in p4000. It seems
silly as it is, since the point is to compute a bunch of diffs.


Re: [PATCH 2/2] diff.c: get rid of duplicate implementation

2017-10-26 Thread René Scharfe
Am 25.10.2017 um 20:49 schrieb Stefan Beller:
> The implementations in diff.c to detect moved lines needs to compare
> strings and hash strings, which is implemented in that file, as well
> as in the xdiff library.
> 
> Remove the rather recent implementation in diff.c and rely on the well
> exercised code in the xdiff lib.
> 
> With this change the hash used for bucketing the strings for the moved
> line detection changes from FNV32 (that is provided via the hashmaps
> memhash) to DJB2 (which is used internally in xdiff).  Benchmarks found
> on the web[1] do not indicate that these hashes are different in
> performance for readable strings.
> 
> [1] 
> https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Awesome comparison!  It calls the variant used in libxdiff DJB2a (which
uses xor to mix in the new char) instead of DJB2 (which uses addition).

There's also https://www.strchr.com/hash_functions, which lists DJB2
as Bernstein.  Both functions rank somewhere in the middle of that list.

> Signed-off-by: Stefan Beller 
> ---
>   diff.c | 82 
> --
>   1 file changed, 4 insertions(+), 78 deletions(-)

Very nice.

René


Re: [PATCH 2/2] diff.c: get rid of duplicate implementation

2017-10-25 Thread Junio C Hamano
Stefan Beller  writes:

> The implementations in diff.c to detect moved lines needs to compare
> strings and hash strings, which is implemented in that file, as well
> as in the xdiff library.
>
> Remove the rather recent implementation in diff.c and rely on the well
> exercised code in the xdiff lib.
> ...
>  static int moved_entry_cmp(const struct diff_options *diffopt,
>  const struct moved_entry *a,
>  const struct moved_entry *b,
>  const void *keydata)
>  {
> - const char *ap = a->es->line, *ae = a->es->line + a->es->len;
> - const char *bp = b->es->line, *be = b->es->line + b->es->len;
> - ...
> + return !xdiff_compare_lines(a->es->line, a->es->len,
> + b->es->line, b->es->len,
> + diffopt->xdl_opts);
>  }

OK, xdiff's xdl_recmatch() takes counted strings, and line[len] is
one byte beyond the end of the line (where LF typically sits), which
is the same convention as how "emitted_symbol" represents a line, so
not just the implementation replaced with a known-working one, but
also the code calls the helper with correct calling convention ;-)

> - ret->ent.hash = get_string_hash(l, o);
> + ret->ent.hash = xdiff_hash_string(l->line, l->len, o->xdl_opts);

Likewise.  Looks good.

Will queue.