Re: [PATCH] git tag --contains : avoid stack overflow

2014-04-17 Thread Johannes Schindelin
Hi Peff,

On Wed, 16 Apr 2014, Jeff King wrote:

 On Wed, Apr 16, 2014 at 04:15:19PM +0200, Stepan Kasal wrote:
 
  From: Jean-Jacques Lafay at Sat, 10 Nov 2012 18:36:10 +0100
  
  In large repos, the recursion implementation of contains(commit,
  commit_list) may result in a stack overflow. Replace the recursion
  with a loop to fix it.
  
  This problem is more apparent on Windows than on Linux, where the
  stack is more limited by default.
 
 I think this is a good thing to be doing, and it looks mostly good to
 me. A few comments:
 
  -static int contains_recurse(struct commit *candidate,
  +/*
  + * Test whether the candidate or one of its parents is contained in the 
  list.
  + * Do not recurse to find out, though, but return -1 if inconclusive.
  + */
  +static int contains_test(struct commit *candidate,
  const struct commit_list *want)
 
 Can we turn this return value into
 
   enum {
   CONTAINS_UNKNOWN = -1,
   CONTAINS_NO = 0,
   CONTAINS_YES = 1,
   } contains_result;
 
 to make the code a little more self-documenting?

Good idea!

 [... detailed instructions what changes are implied by the enum ...]
 
  +expect
  +# ulimit is a bash builtin; we can rely on that in MinGW, but nowhere else
  +test_expect_success MINGW '--contains works in a deep repo' '
  +   ulimit -s 64
 
 It would be nice to test this on Linux.
 
 Can we do something like:
 
   test_lazy_prereq BASH 'bash --version'
 
   test_expect_success BASH '--contains works in a deep repo' '
   ... setup repo ...
   bash -c ulimit -s 64  git tag --contains HEAD actual 
   test_cmp expect actual
   '
 
 As a bonus, then our ulimit call does not pollute the environment of
 subsequent tests.

That's a very good idea! We mulled it over a bit and did not come up with
this excellent solution.

Please see https://github.com/msysgit/git/c63d196 for the fixup, and
https://github.com/msysgit/git/compare/tag-contains%5E...tag-contains for
the updated patch.

Thanks,
Dscho
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] git tag --contains : avoid stack overflow

2014-04-17 Thread Jeff King
On Thu, Apr 17, 2014 at 07:31:54PM +0200, Johannes Schindelin wrote:

  bash -c ulimit -s 64  git tag --contains HEAD actual 
 [...]
 Please see https://github.com/msysgit/git/c63d196 for the fixup, and
 https://github.com/msysgit/git/compare/tag-contains%5E...tag-contains for
 the updated patch.

I tried running the test on my Linux box, but it doesn't fail with the
existing recursive code. So I tried a few different stack sizes, like:

  for i in `seq 1 64`; do
bash -c 
  ulimit -s $i 
  ../../git tag --contains HEAD ||
  echo fail $i
  done

The results are strangely non-deterministic, but with -O0, we generally
die reliably below about 60. With -O2, though, it's more like 43. We
can't go _too_ low here, though, as lots of things start breaking around
32.

If we instead bump the size of the history to 2000 commits, then I
reliably fail with a 64k stack (even with -O2, it needs around 80k).

Of course those numbers are all black magic, and are going to vary based
on the system, the compiler, settings, etc. My system is 64-bit, and the
current code needs at least 3 pointers per recursive invocation. So
we're spending ~46K on those variables, not counting any calling
convention overhead (and presumably we at least need a function return
pointer there). So a 32-bit system might actually get by, as it would
need half as much.

So we can bump the depth further; probably 4000 is enough for any system
to fail with a 64k stack. The deeper we make it, the longer it takes to
run the test, though. At 4000, my machine seems to take about 300ms to
run it. That's may be OK.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] git tag --contains : avoid stack overflow

2014-04-17 Thread Johannes Schindelin
Hi Peff,

On Thu, 17 Apr 2014, Jeff King wrote:

 On Thu, Apr 17, 2014 at 07:31:54PM +0200, Johannes Schindelin wrote:
 
 bash -c ulimit -s 64  git tag --contains HEAD actual 
  [...]
  Please see https://github.com/msysgit/git/c63d196 for the fixup, and
  https://github.com/msysgit/git/compare/tag-contains%5E...tag-contains for
  the updated patch.
 
 I tried running the test on my Linux box, but it doesn't fail with the
 existing recursive code.

I cannot recall how I came to choose 64, but I *think* I only tested on
Windows, and I *think* I reduced the number of tags in order to make
things faster (Windows is *unbearably* slow with spawn-happy programs such
as Git's tests -- literally every single line in a shell script tests the
patience of this developer, running the complete test suite with 15
parallel threads takes several hours, no kidding).

 The results are strangely non-deterministic, but with -O0, we generally
 die reliably below about 60. With -O2, though, it's more like 43. We
 can't go _too_ low here, though, as lots of things start breaking around
 32.

How about using 40, then? I am more interested in reducing the runtime
than reducing the number of false negatives. The problem will be exercised
enough on Windows, but not if the test suite becomes even slower than it
already is.

Ciao,
Johannes
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] git tag --contains : avoid stack overflow

2014-04-17 Thread Jeff King
On Thu, Apr 17, 2014 at 11:52:56PM +0200, Johannes Schindelin wrote:

  I tried running the test on my Linux box, but it doesn't fail with the
  existing recursive code.
 
 I cannot recall how I came to choose 64, but I *think* I only tested on
 Windows, and I *think* I reduced the number of tags in order to make
 things faster (Windows is *unbearably* slow with spawn-happy programs such
 as Git's tests -- literally every single line in a shell script tests the
 patience of this developer, running the complete test suite with 15
 parallel threads takes several hours, no kidding).

Yeah, I figured speed had something to do with it. However, since you
are using a bash loop to generate the input (and it should all be done
as builtins in bash, I think), and fast-import to create the objects, I
don't think bumping it will actually increase your process count.

  The results are strangely non-deterministic, but with -O0, we generally
  die reliably below about 60. With -O2, though, it's more like 43. We
  can't go _too_ low here, though, as lots of things start breaking around
  32.
 
 How about using 40, then? I am more interested in reducing the runtime
 than reducing the number of false negatives. The problem will be exercised
 enough on Windows, but not if the test suite becomes even slower than it
 already is.

I'm OK with doing that. My biggest concern is that it will cause false
positives on systems that are hungrier for stack space, but we can
address that if it happens.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] git tag --contains : avoid stack overflow

2014-04-16 Thread Stepan Kasal
From: Jean-Jacques Lafay jeanjacques.la...@gmail.com
Date: Sat, 10 Nov 2012 18:36:10 +0100

In large repos, the recursion implementation of contains(commit,
commit_list) may result in a stack overflow. Replace the recursion with
a loop to fix it.

This problem is more apparent on Windows than on Linux, where the stack
is more limited by default.

See also this thread on the msysGit list:

https://groups.google.com/d/topic/msysgit/FqT6boJrb2g/discussion

[jes: re-written to imitate the original recursion more closely]

Thomas Braun pointed out several documentation shortcomings.

Signed-off-by: Jean-Jacques Lafay jeanjacques.la...@gmail.com
Signed-off-by: Johannes Schindelin johannes.schinde...@gmx.de
Tested-by: Stepan Kasal ka...@ucw.cz
Thanks-to: Thomas Braun thomas.br...@byte-physics.de
---
 builtin/tag.c  | 81 --
 t/t7004-tag.sh | 21 +++
 2 files changed, 88 insertions(+), 14 deletions(-)

diff --git a/builtin/tag.c b/builtin/tag.c
index 74d3780..79c8c28 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -73,11 +73,13 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
return 0;
 }
 
-static int contains_recurse(struct commit *candidate,
+/*
+ * Test whether the candidate or one of its parents is contained in the list.
+ * Do not recurse to find out, though, but return -1 if inconclusive.
+ */
+static int contains_test(struct commit *candidate,
const struct commit_list *want)
 {
-   struct commit_list *p;
-
/* was it previously marked as containing a want commit? */
if (candidate-object.flags  TMP_MARK)
return 1;
@@ -85,26 +87,77 @@ static int contains_recurse(struct commit *candidate,
if (candidate-object.flags  UNINTERESTING)
return 0;
/* or are we it? */
-   if (in_commit_list(want, candidate))
+   if (in_commit_list(want, candidate)) {
+   candidate-object.flags |= TMP_MARK;
return 1;
+   }
 
if (parse_commit(candidate)  0)
return 0;
 
-   /* Otherwise recurse and mark ourselves for future traversals. */
-   for (p = candidate-parents; p; p = p-next) {
-   if (contains_recurse(p-item, want)) {
-   candidate-object.flags |= TMP_MARK;
-   return 1;
-   }
-   }
-   candidate-object.flags |= UNINTERESTING;
-   return 0;
+   return -1;
+}
+
+/*
+ * Mimicking the real stack, this stack lives on the heap, avoiding stack
+ * overflows.
+ *
+ * At each recursion step, the stack items points to the commits whose
+ * ancestors are to be inspected.
+ */
+struct stack {
+   int nr, alloc;
+   struct stack_entry {
+   struct commit *commit;
+   struct commit_list *parents;
+   } *stack;
+};
+
+static void push_to_stack(struct commit *candidate, struct stack *stack)
+{
+   int index = stack-nr++;
+   ALLOC_GROW(stack-stack, stack-nr, stack-alloc);
+   stack-stack[index].commit = candidate;
+   stack-stack[index].parents = candidate-parents;
 }
 
 static int contains(struct commit *candidate, const struct commit_list *want)
 {
-   return contains_recurse(candidate, want);
+   struct stack stack = { 0, 0, NULL };
+   int result = contains_test(candidate, want);
+
+   if (result = 0)
+   return result;
+
+   push_to_stack(candidate, stack);
+   while (stack.nr) {
+   struct stack_entry *entry = stack.stack[stack.nr - 1];
+   struct commit *commit = entry-commit;
+   struct commit_list *parents = entry-parents;
+
+   if (!parents) {
+   commit-object.flags = UNINTERESTING;
+   stack.nr--;
+   }
+   /*
+* If we just popped the stack, parents-item has been marked,
+* therefore contains_test will return a meaningful 0 or 1.
+*/
+   else switch (contains_test(parents-item, want)) {
+   case 1:
+   commit-object.flags |= TMP_MARK;
+   stack.nr--;
+   break;
+   case 0:
+   entry-parents = parents-next;
+   break;
+   default:
+   push_to_stack(parents-item, stack);
+   break;
+   }
+   }
+   free(stack.stack);
+   return contains_test(candidate, want);
 }
 
 static void show_tag_lines(const unsigned char *sha1, int lines)
diff --git a/t/t7004-tag.sh b/t/t7004-tag.sh
index c8d6e9f..edaff13 100755
--- a/t/t7004-tag.sh
+++ b/t/t7004-tag.sh
@@ -1380,4 +1380,25 @@ test_expect_success 'multiple --points-at are OR-ed 
together' '
test_cmp expect actual
 '
 
+expect
+# ulimit is a bash builtin; we can rely on that in 

Re: [PATCH] git tag --contains : avoid stack overflow

2014-04-16 Thread Jeff King
On Wed, Apr 16, 2014 at 04:15:19PM +0200, Stepan Kasal wrote:

 From: Jean-Jacques Lafay jeanjacques.la...@gmail.com
 Date: Sat, 10 Nov 2012 18:36:10 +0100
 
 In large repos, the recursion implementation of contains(commit,
 commit_list) may result in a stack overflow. Replace the recursion with
 a loop to fix it.
 
 This problem is more apparent on Windows than on Linux, where the stack
 is more limited by default.

I think this is a good thing to be doing, and it looks mostly good to
me. A few comments:

 -static int contains_recurse(struct commit *candidate,
 +/*
 + * Test whether the candidate or one of its parents is contained in the list.
 + * Do not recurse to find out, though, but return -1 if inconclusive.
 + */
 +static int contains_test(struct commit *candidate,
   const struct commit_list *want)

Can we turn this return value into

  enum {
CONTAINS_UNKNOWN = -1,
CONTAINS_NO = 0,
CONTAINS_YES = 1,
  } contains_result;

to make the code a little more self-documenting?

  static int contains(struct commit *candidate, const struct commit_list *want)
  {
 - return contains_recurse(candidate, want);
 + struct stack stack = { 0, 0, NULL };
 + int result = contains_test(candidate, want);
 +
 + if (result = 0)
 + return result;

Then this can become:

  if (result != CONTAINS_UNKNOWN)
return result;

 + if (!parents) {
 + commit-object.flags = UNINTERESTING;
 + stack.nr--;
 + }

Shouldn't this be |= when setting the flag?

 + /*
 +  * If we just popped the stack, parents-item has been marked,
 +  * therefore contains_test will return a meaningful 0 or 1.
 +  */
 + else switch (contains_test(parents-item, want)) {
 + case 1:
 + commit-object.flags |= TMP_MARK;
 + stack.nr--;
 + break;
 + case 0:
 + entry-parents = parents-next;
 + break;
 + default:
 + push_to_stack(parents-item, stack);
 + break;
 + }

And if we have an enum, this switch() becomes more readable (the
default here threw me off initially, because it is actually just
looking for -1).

 +expect
 +# ulimit is a bash builtin; we can rely on that in MinGW, but nowhere else
 +test_expect_success MINGW '--contains works in a deep repo' '
 + ulimit -s 64

It would be nice to test this on Linux.

Can we do something like:

  test_lazy_prereq BASH 'bash --version'

  test_expect_success BASH '--contains works in a deep repo' '
... setup repo ...
bash -c ulimit -s 64  git tag --contains HEAD actual 
test_cmp expect actual
  '

As a bonus, then our ulimit call does not pollute the environment of
subsequent tests.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] git tag --contains : avoid stack overflow

2012-11-12 Thread Jeff King
On Mon, Nov 12, 2012 at 11:27:14PM +0100, Jean-Jacques Lafay wrote:

 2012/11/11 Jeff King p...@peff.net:
  On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote:
 
  Ultimately, I have some ideas for doing this in a breadth-first way,
  which would make it more naturally iterative. It would involve having N
  bits of storage per commit to check N tags, but it would mean that we
  could get accurate answers in the face of clock skew (like the
  merge-base calculation, it would merely get slower in the face of skew).
 
 I guess the optimal algorithm may also depend on the commit graph
 general shape, but intuitively, I'd say that the critical factor is
 the number and distribution of tags. As soon as you have a significant
 number of tags (let's say 1% of the commits are tagged, evenly
 distributed), you'll quickly end up with every commit marked as
 containing or not the target commit, so that each additional tag check
 is cheap.
 
 This suggests a complexity of O(number of commits) more often then
 not, however you choose to traverse the graph.

We can do much better than O(number of commits), though, if we stop
traversing down a path when its timestamp shows that it is too old to
contain the commits we are searching for. The problem is that the
timestamps cannot always be trusted, because they are generated on
machines with wrong clocks, or by buggy software. This could be solved
by calculating and caching a generation number, but last time it was
discussed there was a lot of arguing and nothing got done.

Another approach, used by the merge-base calculation, is to treat
parents in a breadth-first way, but sort them by timestamp, and walk
backwards to find the merge-base (and if you get to a merge-base, you
can stop). In that case, bad timestamps may cause you to look at extra
commits (because you process a commit prematurely and end up going
deeper than the merge base), but it can never give you a wrong answer.

Thinking on it more, though, the merge-base style of computation would
mean you always have to walk backwards to the oldest tag. Which is in
the same order of magnitude as the number of commits, assuming you have
tags near the start of history. So I think we will always want to keep a
cutoff, anyway, and there is not much point in switching off of a
depth-first approach (but of course it does make sense to use iteration
instead of recursion to do so).

 At least on my almost-real-life repo*, with ~140,000 commits and
 ~2,000 tags, tag --contains takes a few seconds, of course more than
 the worst-case test on a 40,000 commit / 1 tag repo, but still in the
 same order of magnitude.

Try git rev-list --count --all to get a sense of how long O(# of
commits) takes. Before the depth-first implementation, tag --contains
was O(commits * tags) in the worst case. With depth first, it's
O(commits), and then with the timestamp cutoff, it's really O(commits
since needle), where needle is the oldest thing you're looking for.
Here are those numbers on linux-2.6 (with linux-stable tags):

  [to get a sense of the repo size]
  $ git for-each-ref refs/tags | wc -l
  909
  $ git rev-list --count --all
  363413

  [this is O(commits)]
  $ time git rev-list --count --all
  real0m4.034s
  user0m3.960s
  sys 0m0.056s

  [before depth-first, ffc4b80^; you can see that it is much worse than
   O(commits), though not as bad as the worst case (because finding the
   merge bases for recent tags is not O(commits)]
  $ time git tag --contains HEAD~200
  real0m42.838s
  user0m42.527s
  sys 0m0.156s

  [after depth-first, ffc4b80; you can see that this is O(commits),
   because we will go depth-first down to the roots, but do only a
   single traversal]
  $ time git tag --contains HEAD~200
  real0m3.939s
  user0m3.784s
  sys 0m0.140s

  [with my generation patches; much faster]
  $ time git tag --contains HEAD~200
  real0m0.037s
  user0m0.020s
  sys 0m0.012s

I was thinking we had merged the timestamp cutoff (which has the same
performance characteristics as generations, just with the skew issue) to
master, but it looks like we didn't.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow

2012-11-12 Thread Johannes Schindelin
Hi Peff,

On Mon, 12 Nov 2012, Jeff King wrote:

 On Mon, Nov 12, 2012 at 11:27:14PM +0100, Jean-Jacques Lafay wrote:
 
  2012/11/11 Jeff King p...@peff.net:
   On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote:
  
   Ultimately, I have some ideas for doing this in a breadth-first way,
   which would make it more naturally iterative. It would involve
   having N bits of storage per commit to check N tags, but it would
   mean that we could get accurate answers in the face of clock skew
   (like the merge-base calculation, it would merely get slower in the
   face of skew).
  
  I guess the optimal algorithm may also depend on the commit graph
  general shape, but intuitively, I'd say that the critical factor is
  the number and distribution of tags. As soon as you have a significant
  number of tags (let's say 1% of the commits are tagged, evenly
  distributed), you'll quickly end up with every commit marked as
  containing or not the target commit, so that each additional tag check
  is cheap.
  
  This suggests a complexity of O(number of commits) more often then
  not, however you choose to traverse the graph.
 
 We can do much better than O(number of commits), though, if we stop
 traversing down a path when its timestamp shows that it is too old to
 contain the commits we are searching for. The problem is that the
 timestamps cannot always be trusted, because they are generated on
 machines with wrong clocks, or by buggy software. This could be solved
 by calculating and caching a generation number, but last time it was
 discussed there was a lot of arguing and nothing got done.

Sadly, not only machines with skewed clocks, but in particular buggy
3rd-party SCMs make this more than just problematic. In a git-svn clone
that was used as base for heavy Git development, I encountered quite a lot
of Jan 1, 1970 commits.

It just cannot be helped, we must distrust timestamps completely.

Ciao,
Dscho

Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow

2012-11-12 Thread Jeff King
On Tue, Nov 13, 2012 at 01:16:01AM +, Johannes Schindelin wrote:

  We can do much better than O(number of commits), though, if we stop
  traversing down a path when its timestamp shows that it is too old to
  contain the commits we are searching for. The problem is that the
  timestamps cannot always be trusted, because they are generated on
  machines with wrong clocks, or by buggy software. This could be solved
  by calculating and caching a generation number, but last time it was
  discussed there was a lot of arguing and nothing got done.
 
 Sadly, not only machines with skewed clocks, but in particular buggy
 3rd-party SCMs make this more than just problematic. In a git-svn clone
 that was used as base for heavy Git development, I encountered quite a lot
 of Jan 1, 1970 commits.

Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and
5 broken commits in a row for --since). But the big ones are usually
software bugs (the big kernel ones were from broken guilt, I think) or
broken imports (when I published a bunch of skew statistics last year,
the interesting ones were all imports; I don't know if they were
software bugs, or just garbage in, garbage out).

 It just cannot be helped, we must distrust timestamps completely.

Note that name-rev will produce wrong answers in the face of clock skew.
And I think that you even wrote that code. :)

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow

2012-11-12 Thread Johannes Schindelin
Hi Peff,

On Mon, 12 Nov 2012, Jeff King wrote:

 On Tue, Nov 13, 2012 at 01:16:01AM +, Johannes Schindelin wrote:
 
   We can do much better than O(number of commits), though, if we stop
   traversing down a path when its timestamp shows that it is too old to
   contain the commits we are searching for. The problem is that the
   timestamps cannot always be trusted, because they are generated on
   machines with wrong clocks, or by buggy software. This could be solved
   by calculating and caching a generation number, but last time it was
   discussed there was a lot of arguing and nothing got done.
  
  Sadly, not only machines with skewed clocks, but in particular buggy
  3rd-party SCMs make this more than just problematic. In a git-svn clone
  that was used as base for heavy Git development, I encountered quite a lot
  of Jan 1, 1970 commits.
 
 Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and
 5 broken commits in a row for --since). But the big ones are usually
 software bugs (the big kernel ones were from broken guilt, I think) or
 broken imports (when I published a bunch of skew statistics last year,
 the interesting ones were all imports; I don't know if they were
 software bugs, or just garbage in, garbage out).
 
  It just cannot be helped, we must distrust timestamps completely.
 
 Note that name-rev will produce wrong answers in the face of clock skew.
 And I think that you even wrote that code. :)

IIRC the cute code to short-circuit using the date is not from me. If it
is, I am very ashamed.

Ciao,
Johannes
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow

2012-11-12 Thread Jeff King
On Tue, Nov 13, 2012 at 04:01:11AM +, Johannes Schindelin wrote:

  Note that name-rev will produce wrong answers in the face of clock skew.
  And I think that you even wrote that code. :)
 
 IIRC the cute code to short-circuit using the date is not from me. If it
 is, I am very ashamed.

Sorry, but it was:

  $ git blame -L'/commit-date  cutoff/',+1  builtin/name-rev.c
  bd321bcc name-rev.c (Johannes Schindelin 2005-10-26 15:10:20 +0200 32)
  if (commit-date  cutoff)

But it is never too late to fix it. :)

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow

2012-11-12 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and
 5 broken commits in a row for --since). But the big ones are usually
 software bugs (the big kernel ones were from broken guilt, I think) or
 broken imports (when I published a bunch of skew statistics last year,
 the interesting ones were all imports; I don't know if they were
 software bugs, or just garbage in, garbage out).

I was hoping that 2e6bdd3 (test-generation: compute generation
numbers and clock skews, 2012-09-04) may be a good first step to
come up with a practical and cheap solution on top of it.

The traversal can be fooled by clock skews when it sees a commit
that has a timestamp that is older than it should, causing it to
give up, incorrectly thinking that there won't be newer commits that
it is interested in behind the problematic commit.

The logic implemented by the change is to identify these problematic
commits, and we could record these commits with the value of the
timestamps they should have had (e.g. the timestamp of the newest
ancestor for each of these commits) in a notes tree.  Then the
traversal logic (commit-list-insert-by-date) could be updated use
that corrected timestamp instead not to be fooled by the clock
skew.

Such a notes tree can be built once and updated by only appending,
as a commit will never acquire more ancestors in its parents chain
once it is made.

Is it too simplistic, or too costly?  In git.git we have three such
commits whose timestamp need to be corrected, while in the Linux
kernel there were 2.2k skewed commits when I counted them a few
months ago.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow

2012-11-12 Thread Johannes Schindelin
Hi Peff,

On Mon, 12 Nov 2012, Jeff King wrote:

 On Tue, Nov 13, 2012 at 04:01:11AM +, Johannes Schindelin wrote:
 
   Note that name-rev will produce wrong answers in the face of clock skew.
   And I think that you even wrote that code. :)
  
  IIRC the cute code to short-circuit using the date is not from me. If it
  is, I am very ashamed.
 
 Sorry, but it was:
 
   $ git blame -L'/commit-date  cutoff/',+1  builtin/name-rev.c
   bd321bcc name-rev.c (Johannes Schindelin 2005-10-26 15:10:20 +0200 32)
   if (commit-date  cutoff)
 
 But it is never too late to fix it. :)

I will now go and find a hole to hide in. Or alternatively finally go to
sleep.

Ciao,
Johannes
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow

2012-11-12 Thread Jeff King
On Mon, Nov 12, 2012 at 08:51:37PM -0800, Junio C Hamano wrote:

 Jeff King p...@peff.net writes:
 
  Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and
  5 broken commits in a row for --since). But the big ones are usually
  software bugs (the big kernel ones were from broken guilt, I think) or
  broken imports (when I published a bunch of skew statistics last year,
  the interesting ones were all imports; I don't know if they were
  software bugs, or just garbage in, garbage out).
 
 I was hoping that 2e6bdd3 (test-generation: compute generation
 numbers and clock skews, 2012-09-04) may be a good first step to
 come up with a practical and cheap solution on top of it.

 The traversal can be fooled by clock skews when it sees a commit
 that has a timestamp that is older than it should, causing it to
 give up, incorrectly thinking that there won't be newer commits that
 it is interested in behind the problematic commit.

I wrote a similar skew-finding tool last year, though some of the
numbers it came up with were different (I remember having many fewer
skewed commits in the kernel repo).

One problem is that it identifies commits which behave badly with
certain algorithms, but it does not identify commits which are wrong.
If I skew backwards, it will find my commit. But if I skew forwards, it
will label my children as wrong.

 The logic implemented by the change is to identify these problematic
 commits, and we could record these commits with the value of the
 timestamps they should have had (e.g. the timestamp of the newest
 ancestor for each of these commits) in a notes tree.  Then the
 traversal logic (commit-list-insert-by-date) could be updated use
 that corrected timestamp instead not to be fooled by the clock
 skew.
 
 Such a notes tree can be built once and updated by only appending,
 as a commit will never acquire more ancestors in its parents chain
 once it is made.
 
 Is it too simplistic, or too costly?  In git.git we have three such
 commits whose timestamp need to be corrected, while in the Linux
 kernel there were 2.2k skewed commits when I counted them a few
 months ago.

This came up in the big generations discussion last summer, and I think
I even implemented a proof of concept. I couldn't find the actual code,
though but only that I got pleasing performance results using a notes
tree to store a list of commits with bogus timestamps:

  http://article.gmane.org/gmane.comp.version-control.git/161101

It is a little wasteful in space if you have a lot of skewed commits
(the notes tree stores a 160-bit hash pointing to a blob object storing
a 32-bit integer).

My personal preference at this point would be:

  1. introduce an auxiliary metadata file that would live alongside the
 pack index and contain generation numbers

  2. generate the metadata file during pack indexing.

  3. If we have a generation metadata file, but a particular object is
 not in it, compute the generation; this should be quick because we
 will hit a file with a stored generation eventually

  4. If we do not have any generation metadata files, or if grafts or
 replace objects are in use, do not use cutoffs in algorithms. Be
 safe but slow.

On the other hand, just switching to doing a single traversal instead of
one merge-base computation per tag already got rid of the really awful
performance cases. Nobody has complained since that went in, so maybe
nobody cares about shaving a few seconds per operation down to a few
tens of milliseconds. The real win was shaving tens of seconds down to a
few seconds.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] git tag --contains : avoid stack overflow

2012-11-11 Thread Jeff King
On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote:

 However, I couldn't reproduce it on Linux : where the windows
 implementations crashes at a ~32000 depth (*not* exactly 32768, mind
 you), on linux it happily went through 10 commits. I didn't take
 time to look much further, but maybe on my 64 bit Linux VM, the process
 can afford to reserve a much bigger address range for the stack of each
 thread than the 1Mb given to 32 bit processes on windows.
 Jean-Jacques.
 
 I can reproduce it on Linux (Debian testing amd64) with ulimit -s
 1000 to reduce the stack size from its default value of 8MB.
 
 After reverting ffc4b8012d9a4f92ef238ff72c0d15e9e1b402ed (tag: speed
 up --contains calculation) the test passes even with the smaller
 stack, but it makes git tag --contains take thrice the time as
 before.

Right, I am not too surprised.  That function replaced the original
algorithm with a much faster depth-first recursive one. I haven't looked
closely yet at Jean-Jacques' iterative adaptation, but that direction
seems like a good fix for now.

Ultimately, I have some ideas for doing this in a breadth-first way,
which would make it more naturally iterative. It would involve having N
bits of storage per commit to check N tags, but it would mean that we
could get accurate answers in the face of clock skew (like the
merge-base calculation, it would merely get slower in the face of skew).

But since I haven't worked on that at all, fixing the depth-first
algorithm to be iterative makes sense to me.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow

2012-11-11 Thread Johannes Schindelin
Hi,

On Sun, 11 Nov 2012, Jeff King wrote:

 On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote:
 
  However, I couldn't reproduce it on Linux : where the windows
  implementations crashes at a ~32000 depth (*not* exactly 32768, mind
  you), on linux it happily went through 10 commits. I didn't take
  time to look much further, but maybe on my 64 bit Linux VM, the
  process can afford to reserve a much bigger address range for the
  stack of each thread than the 1Mb given to 32 bit processes on
  windows.  Jean-Jacques.
  
  I can reproduce it on Linux (Debian testing amd64) with ulimit -s 1000
  to reduce the stack size from its default value of 8MB.
  
  After reverting ffc4b8012d9a4f92ef238ff72c0d15e9e1b402ed (tag: speed
  up --contains calculation) the test passes even with the smaller
  stack, but it makes git tag --contains take thrice the time as
  before.
 
 Right, I am not too surprised.  That function replaced the original
 algorithm with a much faster depth-first recursive one. I haven't looked
 closely yet at Jean-Jacques' iterative adaptation, but that direction
 seems like a good fix for now.
 
 Ultimately, I have some ideas for doing this in a breadth-first way,
 which would make it more naturally iterative. It would involve having N
 bits of storage per commit to check N tags, but it would mean that we
 could get accurate answers in the face of clock skew (like the
 merge-base calculation, it would merely get slower in the face of skew).
 
 But since I haven't worked on that at all, fixing the depth-first
 algorithm to be iterative makes sense to me.

Have you tried the latest tag-contains branch of
git://github.com/msysgit/git/? It contains a couple of brush-ups and a
re-write of the recursion (which I hope is right, I had only time to work
on it during an unwanted layover at O'Hare). The SHA-1 is
fc4f42787a0dd0022d202627681362081a66ef70.

Ciao,
Johannes

Re: [msysGit] [PATCH] git tag --contains : avoid stack overflow

2012-11-10 Thread Philip Oakley
From: Jean-Jacques Lafay jeanjacques.la...@gmail.com Sent: Saturday, 
November 10, 2012 5:36 PM
In large repos, the recursion implementation of contains(commit, 
commit_list)
may result in a stack overflow. Replace the recursion with a loop to 
fix it.


Signed-off-by: Jean-Jacques Lafay jeanjacques.la...@gmail.com


This is a change to the main git code so it is better to submit it to 
the main git list at git@vger.kernel.org (plain text, no HTML, first 
post usually has a delay ;-)


It sounds reasonable but others may have opinions and comments.

Have you actually suffered from the stack overflow issue? You only 
suggest it as a possibility, rather than a real problem.


Philip


---
builtin/tag.c  | 83 
--

t/t7004-tag.sh | 21 +++
2 files changed, 78 insertions(+), 26 deletions(-)

diff --git a/builtin/tag.c b/builtin/tag.c
index 9c3e067..4be67dd 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -73,40 +73,71 @@ static int in_commit_list(const struct commit_list 
*want, struct commit *c)

 return 0;
}

-static int contains_recurse(struct commit *candidate,
- const struct commit_list *want)
-{
- struct commit_list *p;
-
- /* was it previously marked as containing a want commit? */
- if (candidate-object.flags  TMP_MARK)
- return 1;
- /* or marked as not possibly containing a want commit? */
- if (candidate-object.flags  UNINTERESTING)
- return 0;
- /* or are we it? */
- if (in_commit_list(want, candidate))
- return 1;
+struct commit_list_list {
+ struct commit *item;
+ struct commit_list *next_item;
+ struct commit_list_list *next;
+};

- if (parse_commit(candidate)  0)
- return 0;
+static void mark_path(struct commit_list_list *path, int status)
+{
+ struct commit_list_list *temp;
+ while (path) {
+ path-item-object.flags |= status;
+ temp = path;
+ path = temp-next;
+ free(temp);
+ }
+}

- /* Otherwise recurse and mark ourselves for future traversals. */
- for (p = candidate-parents; p; p = p-next) {
- if (contains_recurse(p-item, want)) {
- candidate-object.flags |= TMP_MARK;
+static int contains(struct commit *candidate,
+ const struct commit_list *want)
+{
+ /* previously implemented with a recursion, but could stack overflow 
in large repos */

+ struct commit_list_list *p, *tmp;
+ p = xmalloc(sizeof(struct commit_list_list));
+ p-item = candidate;
+ p-next_item = NULL;
+ p-next = NULL;
+
+ while (p) {
+ candidate = p-item;
+
+ /* is it ok : */
+ /* was it previously marked as containing a want commit? */
+ /* or are we it? */
+ if (candidate-object.flags  TMP_MARK || in_commit_list(want, 
candidate)) {

+ mark_path(p, TMP_MARK);
 return 1;
 }
+ /* is it not ok : */
+ /* was it previously marked as not possibly containing a want 
commit? */

+ /* do we have parents? */
+ if (candidate-object.flags  UNINTERESTING || 
parse_commit(candidate)  0 || !candidate-parents) {

+ candidate-object.flags |= UNINTERESTING;
+ /* then backtrack, marking as UNINTERESTING along the way */
+ while (p  !p-next_item) {
+ p-item-object.flags |= UNINTERESTING;
+ tmp = p;
+ p = tmp-next;
+ free(tmp);
+ }
+ if (p) {
+ p-item = p-next_item-item;
+ p-next_item = p-next_item-next;
+ }
+ } else {
+ /* let's check the parents */
+ tmp = xmalloc(sizeof(struct commit_list_list));
+ tmp-item = candidate-parents-item;
+ tmp-next_item = candidate-parents-next;
+ tmp-next = p;
+ p = tmp;
+ }
 }
- candidate-object.flags |= UNINTERESTING;
 return 0;
}

-static int contains(struct commit *candidate, const struct 
commit_list *want)

-{
- return contains_recurse(candidate, want);
-}
-
static void show_tag_lines(const unsigned char *sha1, int lines)
{
 int i;
diff --git a/t/t7004-tag.sh b/t/t7004-tag.sh
index 5189446..196ac54 100755
--- a/t/t7004-tag.sh
+++ b/t/t7004-tag.sh
@@ -1365,4 +1365,25 @@ test_expect_success 'multiple --points-at are 
OR-ed together' '

 test_cmp expect actual
'

+# what about a deep repo ?
+
+expect
+test_expect_success '--contains works in a deep repo' '
+ i=1
+ while test $i -lt 4
+ do
+ echo commit refs/heads/master
+committer A U Thor aut...@example.com $((10 + $i * 100)) 
+0200

+data EOF
+commit #$i
+EOF
+ test $i == 1  echo from refs/heads/master^0
+ i=$(($i + 1))
+ done | git fast-import
+ git checkout master
+ git tag far-far-away HEAD^
+ git tag --contains HEAD actual 
+ test_cmp expect actual
+'
+
test_done
--
1.8.0.msysgit.0.1.geed93bd.dirty

--


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [msysGit] [PATCH] git tag --contains : avoid stack overflow

2012-11-10 Thread Pat Thoyts
On 10 November 2012 21:13, Jean-Jacques Lafay
jeanjacques.la...@gmail.com wrote:
 Le samedi 10 novembre 2012 21:00:10 UTC+1, Philip Oakley a écrit :

 From: Jean-Jacques Lafay jeanjacq...@gmail.com Sent: Saturday,
 November 10, 2012 5:36 PM
  In large repos, the recursion implementation of contains(commit,
  commit_list)
  may result in a stack overflow. Replace the recursion with a loop to
  fix it.
 
  Signed-off-by: Jean-Jacques Lafay jeanjacq...@gmail.com

 This is a change to the main git code so it is better to submit it to
 the main git list at g...@vger.kernel.org (plain text, no HTML, first
 post usually has a delay ;-)

 It sounds reasonable but others may have opinions and comments.

 Have you actually suffered from the stack overflow issue? You only
 suggest it as a possibility, rather than a real problem.

 Philip


 Yes, it actually crashed on me, and since the call is made by GitExtension
 while browsing the commit history, it was quickly annoying to have windows
 popping a git.exe stopped working message box at my face every time I
 clicked on a line of the history ;-)  (well, you can sort of work around it
 by having the file tree or diff tab active)

 Coincidentally, as I was having a glance at the recent topics, I saw that
 someone else experienced it.

 However, I couldn't reproduce it on Linux : where the windows
 implementations crashes at a ~32000 depth (*not* exactly 32768, mind you),
 on linux it happily went through 10 commits. I didn't take time to look
 much further, but maybe on my 64 bit Linux VM, the process can afford to
 reserve a much bigger address range for the stack of each thread than the
 1Mb given to 32 bit processes on windows.

 Jean-Jacques.

I checked this on msysGit - the test causes a crash on Windows when
using the 1.8.0 release and recompiling with this patch applied fixes
this. As mentioned, its best to have this reviewed upstream too as its
likely that windows just has a smaller stack so causes the crash
earlier.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html