Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-10-25 Thread Derrick Stolee

On 10/25/2018 5:43 AM, Jeff King wrote:

On Thu, Oct 11, 2018 at 12:21:44PM -0400, Derrick Stolee wrote:


2. INDEGREE: using the indegree_queue priority queue (ordered
 by maximizing the generation number), add one to the in-
 degree of each parent for each commit that is walked. Since
 we walk in order of decreasing generation number, we know
 that discovering an in-degree value of 0 means the value for
 that commit was not initialized, so should be initialized to
 two. (Recall that in-degree value "1" is what we use to say a
 commit is ready for output.) As we iterate the parents of a
 commit during this walk, ensure the EXPLORE walk has walked
 beyond their generation numbers.

I wondered how this would work for INFINITY. We can't know the order of
a bunch of INFINITY nodes at all, so we never know when their in-degree
values are "done". But if I understand the EXPLORE walk, we'd basically
walk all of INFINITY down to something with a real generation number. Is
that right?

But after that, I'm not totally clear on why we need this INDEGREE walk.

The INDEGREE walk is an important element for Kahn's algorithm. The final
output order is dictated by peeling commits of "indegree zero" to ensure all
children are output before their parents. (Note: since we use literal 0 to
mean "uninitialized", we peel commits when the indegree slab has value 1.)

This walk replaces the indegree logic from sort_in_topological_order(). That
method performs one walk that fills the indegree slab, then another walk
that peels the commits with indegree 0 and inserts them into a list.

I guess my big question here was: if we have generation numbers, do we
need Kahn's algorithm? That is, in a fully populated set of generation
numbers (i.e., no INFINITY), we could always just pick a commit with the
highest generation number to show.

So if we EXPLORE down to a real generation number in phase 1, why do we
need to care about INDEGREE anymore? Or am I wrong that we always have a
real generation number (i.e., not INFINITY) after EXPLORE? (And if so,
why is exploring to a real generation number a bad idea; presumably
it's due to a worst-case that goes deeper than we'd otherwise need to
here).


The issue is that we our final order (used by level 3) is unrelated to 
generation number. Yes, if we prioritized by generation number then we 
could output the commits in _some_ order that doesn't violate 
topological constraints. However, we are asking for a different 
priority, which is different than the generation number priority.


In the case of "--topo-order", we want to output the commits reachable 
from the second parent of a merge before the commits reachable from the 
first parent. However, in most cases the generation number of the first 
parent is higher than the second parent (there are more things in the 
merge chain than in a small topic that got merged). The INDEGREE is what 
allows us to know when we can peel these commits while still respecting 
the priority we want at the end.


Thanks,
-Stolee


Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-10-25 Thread Jeff King
On Thu, Oct 11, 2018 at 12:21:44PM -0400, Derrick Stolee wrote:

> > > 2. INDEGREE: using the indegree_queue priority queue (ordered
> > > by maximizing the generation number), add one to the in-
> > > degree of each parent for each commit that is walked. Since
> > > we walk in order of decreasing generation number, we know
> > > that discovering an in-degree value of 0 means the value for
> > > that commit was not initialized, so should be initialized to
> > > two. (Recall that in-degree value "1" is what we use to say a
> > > commit is ready for output.) As we iterate the parents of a
> > > commit during this walk, ensure the EXPLORE walk has walked
> > > beyond their generation numbers.
> > I wondered how this would work for INFINITY. We can't know the order of
> > a bunch of INFINITY nodes at all, so we never know when their in-degree
> > values are "done". But if I understand the EXPLORE walk, we'd basically
> > walk all of INFINITY down to something with a real generation number. Is
> > that right?
> > 
> > But after that, I'm not totally clear on why we need this INDEGREE walk.
> 
> The INDEGREE walk is an important element for Kahn's algorithm. The final
> output order is dictated by peeling commits of "indegree zero" to ensure all
> children are output before their parents. (Note: since we use literal 0 to
> mean "uninitialized", we peel commits when the indegree slab has value 1.)
> 
> This walk replaces the indegree logic from sort_in_topological_order(). That
> method performs one walk that fills the indegree slab, then another walk
> that peels the commits with indegree 0 and inserts them into a list.

I guess my big question here was: if we have generation numbers, do we
need Kahn's algorithm? That is, in a fully populated set of generation
numbers (i.e., no INFINITY), we could always just pick a commit with the
highest generation number to show.

So if we EXPLORE down to a real generation number in phase 1, why do we
need to care about INDEGREE anymore? Or am I wrong that we always have a
real generation number (i.e., not INFINITY) after EXPLORE? (And if so,
why is exploring to a real generation number a bad idea; presumably
it's due to a worst-case that goes deeper than we'd otherwise need to
here).

> [...]

Everything else you said here made perfect sense.

-Peff


Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-10-11 Thread Stefan Beller
On Fri, Sep 21, 2018 at 10:39 AM Derrick Stolee via GitGitGadget
 wrote:
>
> From: Derrick Stolee 
[...]
> For the test above, I specifically selected a path that is changed
> frequently, including by merge commits. A less-frequently-changed
> path (such as 'README') has similar end-to-end time since we need
> to walk the same number of commits (before determining we do not
> have 100 hits). However, get get the benefit that the output is

"get get"


Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-10-11 Thread Derrick Stolee

On 10/11/2018 11:35 AM, Jeff King wrote:

On Fri, Sep 21, 2018 at 10:39:36AM -0700, Derrick Stolee via GitGitGadget wrote:


From: Derrick Stolee 

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
[...]
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,

A minor nit, but this commit message doesn't mention the most basic
thing up front: that its main purpose is to introduce a new algorithm
for topo-order. ;)

It's obvious in the context of reviewing the series, but somebody
reading "git log" later may want a bit more. Perhaps:

   revision.c: implement generation-based topo-order algorithm

as a subject, and/or an introductory paragraph like:

   The current --topo-order algorithm requires walking all commits we
   are going to output up front, topo-sorting them, all before
   outputting the first value. This patch introduces a new algorithm
   which uses stored generation numbers to incrementally walk in
   topo-order, outputting commits as we go.

Other than that, I find this to be a wonderfully explanatory commit
message. :)


Good idea. I'll make that change.




The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.

OK, this makes sense. If we know that everybody else in our queue is at
generation X, then it is safe to output a commit at generation greater
than X.

I think this by itself would allow us to implement "show no parents
before all of its children are shown", right? But --topo-order promises
a bit more: "avoid showing commits no multiple lines of history
intermixed".

I guess also INFINITY generation numbers need more. For a real
generation number, we know that "gen(A) == gen(B)" implies that there is
no ancestry relationship between the two. But not so for INFINITY.


Yeah, to deal with INFINITY (and ZERO, but that won't happen if 
generation_numbers_enabled() returns true), we treat gen(A) == gen(B) as 
a "no information" state. So, to output a commit at generation X, we 
need to have our maximum generation number in the unexplored area to be 
at most X - 1. You'll see strict inequality when checking generations.




2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.

I wondered how this would work for INFINITY. We can't know the order of
a bunch of INFINITY nodes at all, so we never know when their in-degree
values are "done". But if I understand the EXPLORE walk, we'd basically
walk all of INFINITY down to something with a real generation number. Is
that right?

But after that, I'm not totally clear on why we need this INDEGREE walk.


The INDEGREE walk is an important element for Kahn's algorithm. The 
final output order is dictated by peeling commits of "indegree zero" to 
ensure all children are output before their parents. (Note: since we use 
literal 0 to mean "uninitialized", we peel commits when the indegree 
slab has value 1.)


This walk replaces the indegree logic from sort_in_topological_order(). 
That method performs one walk that fills the indegree slab, then another 
walk that peels the commits with indegree 0 and inserts them into a list.



3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.

OK, this makes sense to make --author-date-order, etc, work. Potentially
those numbers might have no relationship at all with the graph
structure, but we promise "no parent before its children are shown", so
this is really just a tie-breaker after the topo-sort anyway. As long as
steps 1 and 2 are correct and produce a complete set of commits for one
"layer", this should be OK.

I guess I'm not 100% convinced that we don't have a case where we
haven't yet parsed or considered some commit that we know cannot have an
ancestry relationship with commits 

Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-10-11 Thread Jeff King
On Fri, Sep 21, 2018 at 10:39:36AM -0700, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee 
> 
> When running a command like 'git rev-list --topo-order HEAD',
> Git performed the following steps:
> [...]
> In the new algorithm, these three steps correspond to three
> different commit walks. We run these walks simultaneously,

A minor nit, but this commit message doesn't mention the most basic
thing up front: that its main purpose is to introduce a new algorithm
for topo-order. ;)

It's obvious in the context of reviewing the series, but somebody
reading "git log" later may want a bit more. Perhaps:

  revision.c: implement generation-based topo-order algorithm

as a subject, and/or an introductory paragraph like:

  The current --topo-order algorithm requires walking all commits we
  are going to output up front, topo-sorting them, all before
  outputting the first value. This patch introduces a new algorithm
  which uses stored generation numbers to incrementally walk in
  topo-order, outputting commits as we go.

Other than that, I find this to be a wonderfully explanatory commit
message. :)

> The walks are as follows:
> 
> 1. EXPLORE: using the explore_queue priority queue (ordered by
>maximizing the generation number), parse each reachable
>commit until all commits in the queue have generation
>number strictly lower than needed. During this walk, update
>the UNINTERESTING flags as necessary.

OK, this makes sense. If we know that everybody else in our queue is at
generation X, then it is safe to output a commit at generation greater
than X.

I think this by itself would allow us to implement "show no parents
before all of its children are shown", right? But --topo-order promises
a bit more: "avoid showing commits no multiple lines of history
intermixed".

I guess also INFINITY generation numbers need more. For a real
generation number, we know that "gen(A) == gen(B)" implies that there is
no ancestry relationship between the two. But not so for INFINITY.

> 2. INDEGREE: using the indegree_queue priority queue (ordered
>by maximizing the generation number), add one to the in-
>degree of each parent for each commit that is walked. Since
>we walk in order of decreasing generation number, we know
>that discovering an in-degree value of 0 means the value for
>that commit was not initialized, so should be initialized to
>two. (Recall that in-degree value "1" is what we use to say a
>commit is ready for output.) As we iterate the parents of a
>commit during this walk, ensure the EXPLORE walk has walked
>beyond their generation numbers.

I wondered how this would work for INFINITY. We can't know the order of
a bunch of INFINITY nodes at all, so we never know when their in-degree
values are "done". But if I understand the EXPLORE walk, we'd basically
walk all of INFINITY down to something with a real generation number. Is
that right?

But after that, I'm not totally clear on why we need this INDEGREE walk.

> 3. TOPO: using the topo_queue priority queue (ordered based on
>the sort_order given, which could be commit-date, author-
>date, or typical topo-order which treats the queue as a LIFO
>stack), remove a commit from the queue and decrement the
>in-degree of each parent. If a parent has an in-degree of
>one, then we add it to the topo_queue. Before we decrement
>the in-degree, however, ensure the INDEGREE walk has walked
>beyond that generation number.

OK, this makes sense to make --author-date-order, etc, work. Potentially
those numbers might have no relationship at all with the graph
structure, but we promise "no parent before its children are shown", so
this is really just a tie-breaker after the topo-sort anyway. As long as
steps 1 and 2 are correct and produce a complete set of commits for one
"layer", this should be OK.

I guess I'm not 100% convinced that we don't have a case where we
haven't yet parsed or considered some commit that we know cannot have an
ancestry relationship with commits we are outputting. But it may have an
author-date-order relationship.

(I'm not at all convinced that this _is_ a problem, and I suspect it
isn't; I'm only suggesting I haven't fully grokked the proof).

> ---
>  object.h   |   4 +-
>  revision.c | 196 +++--
>  revision.h |   2 +
>  3 files changed, 194 insertions(+), 8 deletions(-)

I'll pause here on evaluating the actual code. It looks sane from a
cursory read, but there's no point in digging further until I'm sure I
fully understand the algorithm. I think that needs a little more brain
power from me, and hopefully discussion around my comments above will
help trigger that.

-Peff


Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-10-06 Thread Jakub Narebski
Derrick Stolee  writes:

> On 9/21/2018 1:39 PM, Derrick Stolee via GitGitGadget wrote:

> Hello, Git contributors.
>
> I understand that this commit message and patch are pretty
> daunting. There is a lot to read and digest. I would like to see if
> anyone is willing to put the work in to review this patch, as I quite
> like what it does, and the performance numbers below.

I'll try to find time to review v3 of this patch series this week.

>> In my local testing, I used the following Git commands on the
>> Linux repository in three modes: HEAD~1 with no commit-graph,
>> HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
>> allows comparing the benefits we get from parsing commits from
>> the commit-graph and then again the benefits we get by
>> restricting the set of commits we walk.
>>
>> Test: git rev-list --topo-order -100 HEAD
>> HEAD~1, no commit-graph: 6.80 s
>> HEAD~1, w/ commit-graph: 0.77 s
>>HEAD, w/ commit-graph: 0.02 s
>>
>> Test: git rev-list --topo-order -100 HEAD -- tools
>> HEAD~1, no commit-graph: 9.63 s
>> HEAD~1, w/ commit-graph: 6.06 s
>>HEAD, w/ commit-graph: 0.06 s
>
> If there is something I can do to make this easier to review, then
> please let me know.
>
> Thanks,
> -Stolee


Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-09-27 Thread Derrick Stolee

On 9/21/2018 1:39 PM, Derrick Stolee via GitGitGadget wrote:

From: Derrick Stolee 

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.

2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Hello, Git contributors.

I understand that this commit message and patch are pretty daunting. 
There is a lot to read and digest. I would like to see if anyone is 
willing to put the work in to review this patch, as I quite like what it 
does, and the performance numbers below.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
   HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
   HEAD, w/ commit-graph: 0.06 s


If there is something I can do to make this easier to review, then 
please let me know.


Thanks,
-Stolee


[PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-09-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0x and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:

If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD