On Mon, Nov 12, 2012 at 08:51:37PM -0800, Junio C Hamano wrote:
> Jeff King <p...@peff.net> writes:
> > Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and
> > 5 broken commits in a row for --since). But the big ones are usually
> > software bugs (the big kernel ones were from broken "guilt", I think) or
> > broken imports (when I published a bunch of skew statistics last year,
> > the interesting ones were all imports; I don't know if they were
> > software bugs, or just garbage in, garbage out).
> I was hoping that 2e6bdd3 (test-generation: compute generation
> numbers and clock skews, 2012-09-04) could be a good first step
> toward building a practical and cheap solution on top of it.
> The traversal can be fooled by clock skew when it sees a commit
> whose timestamp is older than it should be, causing it to give up,
> incorrectly assuming that there are no newer commits of interest
> behind the problematic commit.
I wrote a similar skew-finding tool last year, though some of the
numbers it came up with were different (I remember having many fewer
skewed commits in the kernel repo).
One problem is that it identifies commits which behave badly with
certain algorithms, but it does not identify commits which are wrong.
If I skew backwards, it will find my commit. But if I skew forwards, it
will label my children as wrong.
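To make that asymmetry concrete, here is a small sketch (not git's actual code; commit names and timestamps are made up) of a skew finder that flags any commit dated earlier than its newest ancestor:

```python
# Hypothetical sketch: flag commits whose timestamp is older than the
# newest timestamp anywhere in their ancestry.
from collections import namedtuple

Commit = namedtuple("Commit", ["name", "timestamp", "parents"])

def find_skewed(commits):
    """Return names of commits dated earlier than their newest ancestor.

    `commits` must be in topological order, parents before children.
    """
    max_ancestor = {}  # name -> newest timestamp on any ancestor path
    skewed = []
    for c in commits:
        newest = max((max_ancestor[p] for p in c.parents), default=0)
        if c.timestamp < newest:
            skewed.append(c.name)
        # A child's "newest ancestor" includes this commit itself.
        max_ancestor[c.name] = max(c.timestamp, newest)
    return skewed

# Backward skew: B claims to predate its parent A, so B itself is flagged.
back = [Commit("A", 100, []), Commit("B", 50, ["A"]), Commit("C", 150, ["B"])]
print(find_skewed(back))  # -> ['B']

# Forward skew: B claims a future date; B looks fine, but its child C
# (with a perfectly sane timestamp) is the one that gets flagged.
fwd = [Commit("A", 100, []), Commit("B", 9999, ["A"]), Commit("C", 150, ["B"])]
print(find_skewed(fwd))   # -> ['C']
```

The forward-skew case is exactly the "it will label my children as wrong" problem: the detector sees the child's sane timestamp as a regression from the parent's bogus future one.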
> The logic implemented by the change is to identify these problematic
> commits, and we could record them, together with the timestamps
> they should have had (e.g. the timestamp of the newest ancestor of
> each such commit), in a notes tree. Then the traversal logic
> (commit-list-insert-by-date) could be updated to use that
> "corrected" timestamp instead, so it is not fooled by the clock skew.
> Such a notes tree can be built once and updated by only "appending",
> as a commit will never acquire more ancestors in its parents chain
> once it is made.
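A minimal sketch of that idea, with a plain dict standing in for the notes tree and made-up commit ids; the correction is consulted before a date-ordered insertion so a skewed commit sorts where its ancestry says it belongs:

```python
# Hypothetical sketch: a corrected-timestamp table (standing in for a
# notes tree) consulted by a commit-list-insert-by-date style helper.
corrected = {       # commit id -> timestamp of its newest ancestor
    "b": 100,       # "b" claims 50, but its parent is dated 100
}

def effective_timestamp(commit_id, claimed):
    # Use the recorded correction when one exists; fall back otherwise.
    return corrected.get(commit_id, claimed)

def insert_by_date(lst, commit_id, claimed):
    """Insert (timestamp, id) into `lst`, kept sorted newest-first."""
    ts = effective_timestamp(commit_id, claimed)
    i = 0
    while i < len(lst) and lst[i][0] >= ts:
        i += 1
    lst.insert(i, (ts, commit_id))

queue = []
insert_by_date(queue, "a", 100)
insert_by_date(queue, "b", 50)   # corrected to 100, sorts next to "a"
insert_by_date(queue, "c", 150)
print(queue)  # -> [(150, 'c'), (100, 'a'), (100, 'b')]
```

Without the correction, "b" would sink to the bottom of the queue and a cutoff-based traversal could give up before reaching newer commits behind it.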
> Is it too simplistic, or too costly? In git.git we have three such
> commits whose timestamp need to be corrected, while in the Linux
> kernel there were 2.2k skewed commits when I counted them a few
> months ago.
This came up in the big generation-numbers discussion last summer, and I
think I even implemented a proof of concept. I couldn't find the actual
code, though; all I found was a note that I got "pleasing performance
results using a notes tree to store a list of commits with bogus
timestamps".
It is a little wasteful in space if you have a lot of skewed commits
(the notes tree stores a 160-bit hash pointing to a blob object storing
a 32-bit integer).
My personal preference at this point would be:
1. Introduce an auxiliary metadata file that lives alongside the
pack index and contains generation numbers.
2. Generate the metadata file during pack indexing.
3. If we have a generation metadata file but a particular object is
not in it, compute the generation on the fly; this should be quick,
because we will eventually hit an object with a stored generation.
4. If we do not have any generation metadata files, or if grafts or
replace objects are in use, do not use cutoffs in algorithms. Be
safe but slow.
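A sketch of step 3, assuming the usual definition that a root commit has generation 1 and every other commit has generation 1 + the maximum generation of its parents (the on-disk format and commit ids here are made up):

```python
# Hypothetical sketch: generation numbers with a stored table as a
# shortcut. Commits missing from the table are computed on demand; the
# recursion bottoms out quickly once it reaches stored values.
parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["d"]}
stored = {"a": 1, "b": 2, "c": 2}  # pretend these came from the pack's
                                   # generation metadata file

def generation(commit, cache=None):
    cache = stored if cache is None else cache
    if commit in cache:            # stored or already computed: cheap
        return cache[commit]
    # Root commits (no parents) get generation 1 via default=0.
    g = 1 + max((generation(p, cache) for p in parents[commit]), default=0)
    cache[commit] = g
    return g

print(generation("e"))  # -> 4  (a=1, b=c=2, d=3, e=4)
```

The point of step 3 is that only the recently indexed tip of history ever needs this fallback walk, and it terminates as soon as it touches commits covered by an existing metadata file.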
On the other hand, just switching to a single traversal instead of one
merge-base computation per tag already got rid of the really awful
performance cases. Nobody has complained since that went in, so maybe
nobody cares about shaving a few seconds per operation down to a few
tens of milliseconds. The real win was shaving tens of seconds down to
a few seconds.