Re: [PATCH] xdiff: improve trimming preprocessing

2018-03-07 Thread René Scharfe
Am 07.03.2018 um 10:36 schrieb Eric Sunshine:
> On Tue, Mar 6, 2018 at 6:05 PM, Jun Wu  wrote:
>> Excerpts from Eric Sunshine's message of 2018-03-06 14:23:46 -0500:
>>> On Tue, Mar 6, 2018 at 6:53 AM, Jun Wu  wrote:
 +  printf "x%.0s" {1..934} >>d # pad common suffix to 1024 bytes
>>>
>>> The expression {x..y} is not portable to non-POSIX shells.
>>
>> Is there a recommended way to generate a repetitive string?
>> Maybe `seq 1000 | sed 's/.*/x/'` ?
> 
> That seems reasonable, although you'd want to use the test suite's
> more portable test_seq rather than seq.

You could also use Perl's repetition operator (x):

perl -e "print \"x\" x 934" >>d

It is a single command (only one fork()/exec()), simple, produces no
extra newline characters and it's already used e.g. in
t5300-pack-object.sh.

René


Re: [PATCH] xdiff: improve trimming preprocessing

2018-03-07 Thread Eric Sunshine
On Tue, Mar 6, 2018 at 6:05 PM, Jun Wu  wrote:
> Excerpts from Eric Sunshine's message of 2018-03-06 14:23:46 -0500:
>> On Tue, Mar 6, 2018 at 6:53 AM, Jun Wu  wrote:
>> > +  printf "x%.0s" {1..934} >>d # pad common suffix to 1024 bytes
>>
>> The expression {x..y} is not portable to non-POSIX shells.
>
> Is there a recommended way to generate a repetitive string?
> Maybe `seq 1000 | sed 's/.*/x/'` ?

That seems reasonable, although you'd want to use the test suite's
more portable test_seq rather than seq.

>> > +   /* prefix - need line count for xecfg->ptrimmed */
>> > +   for (i = 0; ++i < smaller && *ap == *bp;) {
>> > +   lines += (*ap == '\n');
>> > +   ap++, bp++;
>>
>> Is there a good reason for not placing 'ap++, bp++' directly in the 'for'?
>
> "lines += (*ap == '\n');" needs the "ap" before adding. Alternatives are
>
> for (i = 0; ++i < smaller && *ap == *bp; ) /* 1 */
> lines += (*ap++, *bp++) == '\n';
>
> for (i = 0; ++i < smaller && *ap == *bp; ap++, bp++) /* 2 */
> lines += (*(ap - 1) == '\n');
>
> Maybe will pick /* 1 */ to make the code shorter.

I was thinking specifically about #2 when asking the question. The
reason I asked is that, as one reading the code, the
not-quite-idiomatic loop made me stop and wonder if something special
was going on which wasn't obvious at a glance.

And, apparently, I'm still wondering since I'm not groking what you mean by:

"lines += (*ap == '\n');" needs the "ap" before adding

It's quite possible that I'm being dense and overlooking something
blindingly obvious.


Re: [PATCH] xdiff: improve trimming preprocessing

2018-03-06 Thread Jun Wu
Excerpts from Junio C Hamano's message of 2018-03-06 11:29:44 -0800:
> Eric Sunshine  writes:
> 
> > On Tue, Mar 6, 2018 at 6:53 AM, Jun Wu  wrote:
> >> xdiff-interface trims common suffix if ctxlen is 0. Teach it to also
> >> trim common prefix, and trim less lines if ctxlen > 0. So it can benefit
> >> the default diff command, as seen by profiling: [...]
> 
> I vaguely recall that we have tried this in the distant past, found
> that it produced incorrect result, and that is why we limit the
> optimization for no-context case.
> 
> Does anybody have an archive reference?

I think it's d2f8295 ("Re(-re)*fix trim_common_tail()", 2007-12-20).

Yeah, this is more complex than I thought. In Mercurial's use-case, ctxlen
is 0 and context are added in a higher layer so it's fine.

It's clearly hunk shifting causing problems. I'll send a V2 to forbid hunk
shifting if it can possibly lose context lines. That implies counting suffix
lines, which adds some overhead but the overall perf win still seems
worthwhile.


Re: [PATCH] xdiff: improve trimming preprocessing

2018-03-06 Thread Jun Wu
Excerpts from Eric Sunshine's message of 2018-03-06 14:23:46 -0500:
> On Tue, Mar 6, 2018 at 6:53 AM, Jun Wu  wrote:
> > +  printf "x%.0s" {1..934} >>d # pad common suffix to 1024 bytes
> 
> The expression {x..y} is not portable to non-POSIX shells.

Is there a recommended way to generate a repetitive string?
Maybe `seq 1000 | sed 's/.*/x/'` ?

> > +fgrep "@@ -1001 +1000,0 @@"
> > +'
> 
> Style: Mix of tabs and spaces for indentation. Please indent only with
> tabs throughout the patch.

Sorry. Will fix in V2.
 
> > +   /* prefix - need line count for xecfg->ptrimmed */
> > +   for (i = 0; ++i < smaller && *ap == *bp;) {
> > +   lines += (*ap == '\n');
> > +   ap++, bp++;
> 
> Is there a good reason for not placing 'ap++, bp++' directly in the 'for'?

"lines += (*ap == '\n');" needs the "ap" before adding. Alternatives are

for (i = 0; ++i < smaller && *ap == *bp; ) /* 1 */
lines += (*ap++, *bp++) == '\n';

for (i = 0; ++i < smaller && *ap == *bp; ap++, bp++) /* 2 */
lines += (*(ap - 1) == '\n');

Maybe will pick /* 1 */ to make the code shorter.


Re: [PATCH] xdiff: improve trimming preprocessing

2018-03-06 Thread Junio C Hamano
Eric Sunshine  writes:

> On Tue, Mar 6, 2018 at 6:53 AM, Jun Wu  wrote:
>> xdiff-interface trims common suffix if ctxlen is 0. Teach it to also
>> trim common prefix, and trim less lines if ctxlen > 0. So it can benefit
>> the default diff command, as seen by profiling: [...]

I vaguely recall that we have tried this in the distant past, found
that it produced incorrect result, and that is why we limit the
optimization for no-context case.

Does anybody have an archive reference?


Re: [PATCH] xdiff: improve trimming preprocessing

2018-03-06 Thread Eric Sunshine
On Tue, Mar 6, 2018 at 6:53 AM, Jun Wu  wrote:
> xdiff-interface trims common suffix if ctxlen is 0. Teach it to also
> trim common prefix, and trim less lines if ctxlen > 0. So it can benefit
> the default diff command, as seen by profiling: [...]

A few comments (below) based upon a quick scan of the patch...

> Signed-off-by: Jun Wu 
> ---
> diff --git a/t/t4066-diff-trimming.sh b/t/t4066-diff-trimming.sh
> @@ -0,0 +1,49 @@
> +test_expect_success 'setup' '
> +  printf "x\n%.0s" {1..1000} >a &&
> +  printf "x\n%.0s" {1..1001} >b &&
> +  cat >c >c &&\
> +  printf "x%.0s" {1..934} >>d # pad common suffix to 1024 bytes

The expression {x..y} is not portable to non-POSIX shells.

> +test_expect_success 'git diff -U0 shifts hunk towards the end' '
> +   test_expect_code 1 git diff -U0 --no-index a b |\
> +fgrep "@@ -1000,0 +1001 @@" &&
> +   test_expect_code 1 git diff -U0 --no-index b a |\
> +fgrep "@@ -1001 +1000,0 @@"
> +'

Style: Mix of tabs and spaces for indentation. Please indent only with
tabs throughout the patch.

> diff --git a/xdiff-interface.c b/xdiff-interface.c
> @@ -101,42 +101,64 @@ static int xdiff_outf(void *priv_, mmbuffer_t *mb, int 
> nbuf)
> +static void trim_common(mmfile_t *a, mmfile_t *b, long reserved,
> +   xdemitconf_t *xecfg)
>  {
> +   /* prefix - need line count for xecfg->ptrimmed */
> +   for (i = 0; ++i < smaller && *ap == *bp;) {
> +   lines += (*ap == '\n');
> +   ap++, bp++;

Is there a good reason for not placing 'ap++, bp++' directly in the 'for'?

for (i = 0; ++i < smaller && *ap == *bp; ap++, bp++)


[PATCH] xdiff: improve trimming preprocessing

2018-03-06 Thread Jun Wu
xdiff-interface trims common suffix if ctxlen is 0. Teach it to also
trim common prefix, and trim less lines if ctxlen > 0. So it can benefit
the default diff command, as seen by profiling:

  $ GIT_PERF_REPEAT_COUNT=10 ./run origin/master . -- p4000-diff-algorithms.sh
  [...]
  Test  origin/master this tree
  --
  4000.1: log -3000 (baseline)  0.07(0.06+0.01)   0.06(0.05+0.01) -14.3%
  4000.2: log --raw -3000 (tree-only)   0.31(0.28+0.02)   0.31(0.27+0.02) +0.0%
  4000.3: log -p -3000 (Myers)  2.25(2.05+0.17)   1.66(1.52+0.11) -26.2%
  4000.4: log -p -3000 --histogram  2.51(2.36+0.10)   1.90(1.79+0.09) -24.3%
  4000.5: log -p -3000 --patience   2.63(2.44+0.16)   1.70(1.56+0.12) -35.4%

Diff shifting behaviors (hunk merging, indent heuristic, shift towards
the end for repetitive content) are preserved as a best effort by:

  1. Reserve extra 100 lines to not be trimmed. Leave room for shifting.
  2. Calculate common prefix first. Disallow suffix overlap with prefix
 in the smaller file. So diff hunk tends to be shifted down.

Add tests for those. Make sure they fail on the old code.

Backported from a Mercurial patch [1]. Adjusted to git's existing
xdiff-interface.c (which isn't ported to Mercurial).

Note: xdl_trim_ends does a similar job. But too late - it's after line
splitting, hashing, correcting hash values. Those steps have visible
overhead.

[1]: https://phab.mercurial-scm.org/D2686

Signed-off-by: Jun Wu 
---
 t/t4066-diff-trimming.sh | 49 +++
 xdiff-interface.c| 60 +---
 xdiff/xdiff.h|  3 +++
 xdiff/xdiffi.c   |  5 ++--
 xdiff/xemit.c|  3 ++-
 5 files changed, 98 insertions(+), 22 deletions(-)
 create mode 100755 t/t4066-diff-trimming.sh

diff --git a/t/t4066-diff-trimming.sh b/t/t4066-diff-trimming.sh
new file mode 100755
index 0..3d90175ac
--- /dev/null
+++ b/t/t4066-diff-trimming.sh
@@ -0,0 +1,49 @@
+#!/bin/sh
+
+test_description='diff trimming optimization'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+  printf "x\n%.0s" {1..1000} >a &&
+  printf "x\n%.0s" {1..1001} >b &&
+  cat >c c &&\
+  printf "x%.0s" {1..934} >>d # pad common suffix to 1024 bytes
+  try:
+  import foo
+  except ImportError:
+  pass
+  try:
+  import bar
+  except ImportError:
+  pass
+EOF1
+  try:
+  import foo
+  except ImportError:
+  pass
+  try:
+  import baz
+  except ImportError:
+  pass
+  try:
+  import bar
+  except ImportError:
+  pass
+EOF2
+'
+
+test_expect_success 'git diff -U0 shifts hunk towards the end' '
+   test_expect_code 1 git diff -U0 --no-index a b |\
+fgrep "@@ -1000,0 +1001 @@" &&
+   test_expect_code 1 git diff -U0 --no-index b a |\
+fgrep "@@ -1001 +1000,0 @@"
+'
+
+test_expect_success 'git diff -U0 --indent-heuristic' '
+   test_expect_code 1 git diff -U0 --no-index --indent-heuristic c d |\
+fgrep "@@ -4,0 +5,4 @@"
+'
+
+test_done
diff --git a/xdiff-interface.c b/xdiff-interface.c
index 770e1f7f8..a4141e2db 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -101,42 +101,64 @@ static int xdiff_outf(void *priv_, mmbuffer_t *mb, int 
nbuf)
 }
 
 /*
- * Trim down common substring at the end of the buffers,
- * but end on a complete line.
+ * Trim down common prefix and suffix, except for reserved lines.
+ * Align to line boundary.
  */
-static void trim_common_tail(mmfile_t *a, mmfile_t *b)
+static void trim_common(mmfile_t *a, mmfile_t *b, long reserved,
+   xdemitconf_t *xecfg)
 {
const int blk = 1024;
-   long trimmed = 0, recovered = 0;
-   char *ap = a->ptr + a->size;
-   char *bp = b->ptr + b->size;
+   long i = 0, lines = 0;
+   char *ap = a->ptr, *ae = a->ptr + a->size - 1;
+   char *bp = b->ptr, *be = b->ptr + b->size - 1;
long smaller = (a->size < b->size) ? a->size : b->size;
+   size_t prefix = 0, suffix = 0;
 
-   while (blk + trimmed <= smaller && !memcmp(ap - blk, bp - blk, blk)) {
-   trimmed += blk;
-   ap -= blk;
-   bp -= blk;
+   if (smaller == 0)
+   return;
+
+   /* prefix - need line count for xecfg->ptrimmed */
+   for (i = 0; ++i < smaller && *ap == *bp;) {
+   lines += (*ap == '\n');
+   ap++, bp++;
+   }
+   for (i = 0; i <= reserved && --ap >= a->ptr;)
+   i += (*ap == '\n');
+   if (ap > a->ptr) {
+   prefix = ap + 1 - a->ptr;
+   xecfg->ptrimmed = lines + 1 - i;
}
 
-   while (recovered < trimmed)
-   if (ap[recovered++] == '\n')
-   break;
-   a->size -= trimmed - recovered;
-   b->size -= trimmed - recovered;
-}
+   /* suffix - no line count, but do not overlap with pre