Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-23 Thread Derrick Stolee

On 4/21/2018 4:44 PM, Jakub Narebski wrote:

Jakub Narebski  writes:

Derrick Stolee  writes:

On 4/11/2018 3:32 PM, Jakub Narebski wrote:

What would you suggest as a good test that could imply performance? The
Google Colab notebook linked to above includes a function to count
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.

The two main questions to consider are:

1. Can X reach Y?

That is easy to do.  The function generic_is_reachable() does
that... though using direct translation of the pseudocode for
"Algorithm 3: Reachable" from FELINE paper, which is recursive and
doesn't check if vertex was already visited was not good idea for large
graphs such as Linux kernel commit graph, oops.  That is why
generic_is_reachable_large() was created.

[...]


And the thing to measure is a commit count. If possible, it would be
good to count commits walked (commits whose parent list is enumerated)
and commits inspected (commits that were listed as a parent of some
walked commit). Walked commits require a commit parse -- albeit from
the commit-graph instead of the ODB now -- while inspected commits
only check the in-memory cache.

[...]

For git.git and Linux, I like to use the release tags as tests. They
provide a realistic view of the linear history, and maintenance
releases have their own history from the major releases.

Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
FELINE index does not bring any improvements over using just level
(generation number) filter.  But that may be caused by narrowing od
commit DAG around releases.

I try do do the same between commits in wide part, with many commits
with the same level (same generation number) both for source and for
target commit.  Though this may be unfair to level filter, though...


Note however that FELINE index is not unabiguous, like generation
numbers are (modulo decision whether to start at 0 or at 1); it depends
on the topological ordering chosen for the X elements.

One can now test reachability on git.git repository; there is a form
where one can plug source and destination revisions at
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK=2=1

I have tried the case that is quite unfair to the generation numbers
filter, namely the check between one of recent tags, and one commit that
shares generation number among largest number of other commits.

Here level = generation number-1 (as it starts at 0 for root commit, not
1).

The results are:
  * src = 468165c1d = v2.17.0
  * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec

  * 468165c1d has level 18418 which it shares with 6 commits
  * 66d2e04ec has level 14776 which it shares with 93 commits
  * gen(468165c1d) - gen(66d2e04ec) = 3642

  algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
  ---+-++--+-+--+---+
  naive  | 48865   | 39599  | 244  | 9200|  |   |
  level  |  3086   |  2492  | 113  |  528| 285  |   |
  FELINE |   283   |   216  |  68  |0|  |  25   |
  lev+FELINE |   282   |   215  |  68  |0|   5  |  24   |
  ---+-++--+-+--+---+
  lev+FEL+mpi|79   |59  |  21  |0|   0  |   0   |

Here we have:
* 'naive' implementation means simple DFS walk, without any filters (cut-offs)
* 'level' means using levels / generation numbers based negative-cut filter
* 'FELINE' means using FELINE index based negative-cut filter
* 'lev+FELINE' means combining generation numbers filter with FELINE filter
* 'mpi' means min-post [smanning-tree] intervals for positive-cut filter;
   note that the code does not walk the path after cut, but it is easy to do

The stats have the following meaning:
* 'access' means accessing the node
* 'walk' is actual walking the node
* 'maxdepth' is maximum depth of the stack used for DFS
* 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
   were used for negative-cut; note that those are not disjoint; node can
   be rejected by both level filter and FELINE filter

For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
the results are the same, with 5 commits accessed and 6 walked compared
to 61574 accessed in naive algorithm.

The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
(different child-less commits) and 9 roots, and has average clustering
coefficient 0.000409217.


Thanks for these results. Now, write a patch. I'm sticking to generation 
numbers for my patch because of the simplified computation, but you can 
contribute a FELINE implementation.



P.S. Would it be better to move the discussion about possible extensions
to the commit-graph in the form of new chunks (topological order, FELINE
index, min-post 

Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-21 Thread Jakub Narebski
Jakub Narebski  writes:
> Derrick Stolee  writes:
>> On 4/11/2018 3:32 PM, Jakub Narebski wrote:
>
>>> What would you suggest as a good test that could imply performance? The
>>> Google Colab notebook linked to above includes a function to count
>>> number of commits (nodes / vertices in the commit graph) walked,
>>> currently in the worst case scenario.
>>
>> The two main questions to consider are:
>>
>> 1. Can X reach Y?
>
> That is easy to do.  The function generic_is_reachable() does
> that... though using direct translation of the pseudocode for
> "Algorithm 3: Reachable" from FELINE paper, which is recursive and
> doesn't check if vertex was already visited was not good idea for large
> graphs such as Linux kernel commit graph, oops.  That is why
> generic_is_reachable_large() was created.
[...]

>> And the thing to measure is a commit count. If possible, it would be
>> good to count commits walked (commits whose parent list is enumerated)
>> and commits inspected (commits that were listed as a parent of some
>> walked commit). Walked commits require a commit parse -- albeit from
>> the commit-graph instead of the ODB now -- while inspected commits
>> only check the in-memory cache.
[...]
>>
>> For git.git and Linux, I like to use the release tags as tests. They
>> provide a realistic view of the linear history, and maintenance
>> releases have their own history from the major releases.
>
> Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
> FELINE index does not bring any improvements over using just level
> (generation number) filter.  But that may be caused by narrowing od
> commit DAG around releases.
>
> I try do do the same between commits in wide part, with many commits
> with the same level (same generation number) both for source and for
> target commit.  Though this may be unfair to level filter, though...
>
>
> Note however that FELINE index is not unabiguous, like generation
> numbers are (modulo decision whether to start at 0 or at 1); it depends
> on the topological ordering chosen for the X elements.

One can now test reachability on git.git repository; there is a form
where one can plug source and destination revisions at
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK=2=1

I have tried the case that is quite unfair to the generation numbers
filter, namely the check between one of recent tags, and one commit that
shares generation number among largest number of other commits.

Here level = generation number-1 (as it starts at 0 for root commit, not
1).

The results are:
 * src = 468165c1d = v2.17.0
 * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec

 * 468165c1d has level 18418 which it shares with 6 commits
 * 66d2e04ec has level 14776 which it shares with 93 commits
 * gen(468165c1d) - gen(66d2e04ec) = 3642

 algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
 ---+-++--+-+--+---+
 naive  | 48865   | 39599  | 244  | 9200|  |   |
 level  |  3086   |  2492  | 113  |  528| 285  |   |
 FELINE |   283   |   216  |  68  |0|  |  25   |
 lev+FELINE |   282   |   215  |  68  |0|   5  |  24   |
 ---+-++--+-+--+---+
 lev+FEL+mpi|79   |59  |  21  |0|   0  |   0   |

Here we have:
* 'naive' implementation means simple DFS walk, without any filters (cut-offs)
* 'level' means using levels / generation numbers based negative-cut filter
* 'FELINE' means using FELINE index based negative-cut filter
* 'lev+FELINE' means combining generation numbers filter with FELINE filter
* 'mpi' means min-post [smanning-tree] intervals for positive-cut filter;
  note that the code does not walk the path after cut, but it is easy to do

The stats have the following meaning:
* 'access' means accessing the node
* 'walk' is actual walking the node
* 'maxdepth' is maximum depth of the stack used for DFS
* 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
  were used for negative-cut; note that those are not disjoint; node can
  be rejected by both level filter and FELINE filter

For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
the results are the same, with 5 commits accessed and 6 walked compared
to 61574 accessed in naive algorithm.

The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
(different child-less commits) and 9 roots, and has average clustering
coefficient 0.000409217.

P.S. Would it be better to move the discussion about possible extensions
to the commit-graph in the form of new chunks (topological order, FELINE
index, min-post intervals, bloom filter for changed files, etc.) be
moved into separate thread?
-- 
Jakub Narębski


Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-14 Thread Jakub Narebski
Derrick Stolee  writes:
> On 4/11/2018 3:32 PM, Jakub Narebski wrote:

>> What would you suggest as a good test that could imply performance? The
>> Google Colab notebook linked to above includes a function to count
>> number of commits (nodes / vertices in the commit graph) walked,
>> currently in the worst case scenario.
>
> The two main questions to consider are:
>
> 1. Can X reach Y?

That is easy to do.  The function generic_is_reachable() does
that... though using direct translation of the pseudocode for
"Algorithm 3: Reachable" from FELINE paper, which is recursive and
doesn't check if vertex was already visited was not good idea for large
graphs such as Linux kernel commit graph, oops.  That is why
generic_is_reachable_large() was created.

> 2. What is the set of merge-bases between X and Y?

I don't have an algorithm for that in the Google Colaboratory notebook.
Though I see that there exist algorithms for calculating lowest common
ancestors in DAGs...

I'll have to take a look how Git does that.

>
> And the thing to measure is a commit count. If possible, it would be
> good to count commits walked (commits whose parent list is enumerated)
> and commits inspected (commits that were listed as a parent of some
> walked commit). Walked commits require a commit parse -- albeit from
> the commit-graph instead of the ODB now -- while inspected commits
> only check the in-memory cache.

I don't quite see the distinction.  Whether we access generation number
of a commit (information about level of vertex in graph), or a parent
list (vertex successors / neighbours), it both needs accessing
commit-graph; well, accessing parents may be more costly for octopus
merges (due to having to go through EDGE chunk).

I can easily return the set of visited commits (vertices), or just size
of said set.

>
> For git.git and Linux, I like to use the release tags as tests. They
> provide a realistic view of the linear history, and maintenance
> releases have their own history from the major releases.

Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
FELINE index does not bring any improvements over using just level
(generation number) filter.  But that may be caused by narrowing od
commit DAG around releases.

I try do do the same between commits in wide part, with many commits
with the same level (same generation number) both for source and for
target commit.  Though this may be unfair to level filter, though...


Note however that FELINE index is not unabiguous, like generation
numbers are (modulo decision whether to start at 0 or at 1); it depends
on the topological ordering chosen for the X elements.

>> I have tried finding number of false positives for level (generation
>> number) filter and for FELINE index, and number of false negatives for
>> min-post intervals in the spanning tree (for DFS tree) for 1
>> randomly selected pairs of commits... but I don't think this is a good
>> benchmark.
>
> What is a false-positive? A case where gen(X) < gen(Y) but Y cannot
> reach X?

Yes.  (And equivalent for FELINE index, which is a pair of integers).

> I do not think that is a great benchmark, but I guess it is
> something to measure.

I have simply used it to have something to compare.

>> I Linux kernel sources 
>> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)
>> that has 750832 nodes and 811733 edges, and 563747941392 possible
>> directed pairs, we have for 1 randomly selected pairs of commits:
>>
>>level-filter has91 =  0.91% [all] false positives
>>FELINE index has78 =  0.78% [all] false positives
>>FELINE index has 1.16667 less false positives than level filter
>>
>>min-post spanning-tree intervals has  3641 = 36.41% [all] false
>>negatives
>
> Perhaps something you can do instead of sampling from N^2 commits in
> total is to select a pair of generations (say, G = 2, G' = 20100)
> or regions of generations ( 2 <= G <= 20050, 20100 <= G' <= 20150)
> and see how many false positives you see by testing all pairs (one
> from each level). The delta between the generations may need to be
> smaller to actually have a large proportion of unreachable pairs. Try
> different levels, since major version releases tend to "pinch" the
> commit graph to a common history.

That's a good idea.

>> For git.git repository (https://github.com/git/git.git) that has 52950
>> nodes and 65887 edges the numbers are slighly more in FELINE index
>> favor (also out of 1 random pairs):
>>
>>level-filter has   504 =  9.11% false positives
>>FELINE index has   125 =  2.26% false positives
>>FELINE index has 4.032 less false positives than level filter
>>
>> This is for FELINE which does not use level / generatio-numbers filter.

Best,
-- 
Jakub Narębski


Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-11 Thread Derrick Stolee

On 4/11/2018 3:32 PM, Jakub Narebski wrote:

What would you suggest as a good test that could imply performance? The
Google Colab notebook linked to above includes a function to count
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.


The two main questions to consider are:

1. Can X reach Y?
2. What is the set of merge-bases between X and Y?

And the thing to measure is a commit count. If possible, it would be 
good to count commits walked (commits whose parent list is enumerated) 
and commits inspected (commits that were listed as a parent of some 
walked commit). Walked commits require a commit parse -- albeit from the 
commit-graph instead of the ODB now -- while inspected commits only 
check the in-memory cache.


For git.git and Linux, I like to use the release tags as tests. They 
provide a realistic view of the linear history, and maintenance releases 
have their own history from the major releases.



I have tried finding number of false positives for level (generation
number) filter and for FELINE index, and number of false negatives for
min-post intervals in the spanning tree (for DFS tree) for 1
randomly selected pairs of commits... but I don't think this is a good
benchmark.


What is a false-positive? A case where gen(X) < gen(Y) but Y cannot 
reach X? I do not think that is a great benchmark, but I guess it is 
something to measure.



I Linux kernel sources 
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)
that has 750832 nodes and 811733 edges, and 563747941392 possible
directed pairs, we have for 1 randomly selected pairs of commits:

   level-filter has91 =  0.91% [all] false positives
   FELINE index has78 =  0.78% [all] false positives
   FELINE index has 1.16667 less false positives than level filter

   min-post spanning-tree intervals has  3641 = 36.41% [all] false
   negatives


Perhaps something you can do instead of sampling from N^2 commits in 
total is to select a pair of generations (say, G = 2, G' = 20100) or 
regions of generations ( 2 <= G <= 20050, 20100 <= G' <= 20150) and 
see how many false positives you see by testing all pairs (one from each 
level). The delta between the generations may need to be smaller to 
actually have a large proportion of unreachable pairs. Try different 
levels, since major version releases tend to "pinch" the commit graph to 
a common history.



For git.git repository (https://github.com/git/git.git) that has 52950
nodes and 65887 edges the numbers are slighly more in FELINE index
favor (also out of 1 random pairs):

   level-filter has   504 =  9.11% false positives
   FELINE index has   125 =  2.26% false positives
   FELINE index has 4.032 less false positives than level filter

This is for FELINE which does not use level / generatio-numbers filter.


Thanks,
-Stolee



Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-11 Thread Jakub Narebski
Derrick Stolee  writes:

> On 4/7/2018 12:55 PM, Jakub Narebski wrote:
>> Currently I am at the stage of reproducing results in FELINE paper:
>> "Reachability Queries in Very Large Graphs: A Fast Refined Online Search
>> Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed
>> J. Zaki (2014).  This paper is available in the PDF form at
>> https://openproceedings.org/EDBT/2014/paper_166.pdf
>>
>> The Jupyter Notebook (which runs on Google cloud, but can be also run
>> locally) uses Python kernel, NetworkX librabry for graph manipulation,
>> and matplotlib (via NetworkX) for display.
>>
>> Available at:
>> https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
>> https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing
>>
>> I hope that could be of help, or at least interesting
>
> Let me know when you can give numbers (either raw performance or # of
> commits walked) for real-world Git commit graphs. The Linux repo is a
> good example to use for benchmarking, but I also use the Kotlin repo
> sometimes as it has over a million objects and over 250K commits.

As I am curently converting git repository into commit graph, number of
objects doesn't matter.

Though Kotlin is nicely in largish size set, not as large as Linux
kernel which has 750K commits, but mich larger than git.git with 65K
commits.

> Of course, the only important statistic at the end of the day is the
> end-to-end time of a 'git ...' command. Your investigations should
> inform whether it is worth prototyping the feature in the git
> codebase.

What would you suggest as a good test that could imply performance?  The
Google Colab notebook linked to above includes a function to count
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.


I have tried finding number of false positives for level (generation
number) filter and for FELINE index, and number of false negatives for
min-post intervals in the spanning tree (for DFS tree) for 1
randomly selected pairs of commits... but I don't think this is a good
benchmark.

I Linux kernel sources 
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)
that has 750832 nodes and 811733 edges, and 563747941392 possible
directed pairs, we have for 1 randomly selected pairs of commits:

  level-filter has91 =  0.91% [all] false positives
  FELINE index has78 =  0.78% [all] false positives
  FELINE index has 1.16667 less false positives than level filter

  min-post spanning-tree intervals has  3641 = 36.41% [all] false
  negatives

For git.git repository (https://github.com/git/git.git) that has 52950
nodes and 65887 edges the numbers are slighly more in FELINE index
favor (also out of 1 random pairs):

  level-filter has   504 =  9.11% false positives
  FELINE index has   125 =  2.26% false positives
  FELINE index has 4.032 less false positives than level filter

This is for FELINE which does not use level / generatio-numbers filter.

Regards,
--
Jakub Narębski


Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-07 Thread Derrick Stolee

On 4/7/2018 12:55 PM, Jakub Narebski wrote:

Currently I am at the stage of reproducing results in FELINE paper:
"Reachability Queries in Very Large Graphs: A Fast Refined Online Search
Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed
J. Zaki (2014).  This paper is available in the PDF form at
https://openproceedings.org/EDBT/2014/paper_166.pdf

The Jupyter Notebook (which runs on Google cloud, but can be also run
locally) uses Python kernel, NetworkX librabry for graph manipulation,
and matplotlib (via NetworkX) for display.

Available at:
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

I hope that could be of help, or at least interesting


Let me know when you can give numbers (either raw performance or # of 
commits walked) for real-world Git commit graphs. The Linux repo is a 
good example to use for benchmarking, but I also use the Kotlin repo 
sometimes as it has over a million objects and over 250K commits.


Of course, the only important statistic at the end of the day is the 
end-to-end time of a 'git ...' command. Your investigations should 
inform whether it is worth prototyping the feature in the git codebase.


Thanks,

-Stolee



Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-07 Thread Jakub Narebski
Derrick Stolee  writes:

> On 4/3/2018 2:03 PM, Brandon Williams wrote:
>> On 04/03, Derrick Stolee wrote:
>>> This is the first of several "small" patches that follow the serialized
>>> Git commit graph patch (ds/commit-graph).
>>>
>>> As described in Documentation/technical/commit-graph.txt, the generation
>>> number of a commit is one more than the maximum generation number among
>>> its parents (trivially, a commit with no parents has generation number
>>> one).
[...]
>>> A more substantial refactoring of revision.c is required before making
>>> 'git log --graph' use generation numbers effectively.
>>
>> log --graph should benefit a lot more from this correct?  I know we've
>> talked a bit about negotiation and I wonder if these generation numbers
>> should be able to help out a little bit with that some day.
>
> 'log --graph' should be a HUGE speedup, when it is refactored. Since
> the topo-order can "stream" commits to the pager, it can be very
> responsive to return the graph in almost all conditions. (The case
> where generation numbers are not enough is when filters reduce the set
> of displayed commits to be very sparse, so many commits are walked
> anyway.)

I wonder if next big speedup would be to store [some] topological
ordering of commits in the commit graph... It could be done for example
in two chunks: a mapping to position in topological order, and list of
commits sorted in topological order.

Note also that FELINE index uses (or can use -- but it is supposedly the
optimal choice) position of vertex/node in topological order as one of
the two values in the pair that composes FELINE index.

> If we have generic "can X reach Y?" queries, then we can also use
> generation numbers there to great effect (by not walking commits Z
> with gen(Z) <= gen(Y)). Perhaps I should look at that "git branch
> --contains" thread for ideas.

This is something that is shown in the Google Colab [Jupyter] Notebook
I have mentioned:

  https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
  
https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

> For negotiation, there are some things we can do here. VSTS uses
> generation numbers as a heuristic for determining "all wants connected
> to haves" which is a condition for halting negotiation. The idea is
> very simple, and I'd be happy to discuss it on a separate thread.

Nice.  How much speedup it gives?

Best regards,
--
Jakub Narębski


Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-07 Thread Jakub Narebski
Hello,

Derrick Stolee  writes:

> This is the first of several "small" patches that follow the serialized
> Git commit graph patch (ds/commit-graph).
>
> As described in Documentation/technical/commit-graph.txt, the generation
> number of a commit is one more than the maximum generation number among
> its parents (trivially, a commit with no parents has generation number
> one).
>
> This series makes the computation of generation numbers part of the
> commit-graph write process.
>
> Finally, generation numbers are used [...].
>
> This does not have a significant performance benefit in repositories
> of normal size, but in the Windows repository, some merge-base
> calculations improve from 3.1s to 2.9s. A modest speedup, but provides
> an actual consumer of generation numbers as a starting point.
>
> A more substantial refactoring of revision.c is required before making
> 'git log --graph' use generation numbers effectively.

I have started working on Jupyter Notebook on Google Colaboratory to
find out how much speedup we can get using generation numbers (level
negative-cut filter), FELINE index (negative-cut filter) and min-post
intervals in some spanning tree (positive-cut filter, if I understand it
correctly the base of GRAIL method) in commit graphs.

Currently I am at the stage of reproducing results in FELINE paper:
"Reachability Queries in Very Large Graphs: A Fast Refined Online Search
Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed
J. Zaki (2014).  This paper is available in the PDF form at
https://openproceedings.org/EDBT/2014/paper_166.pdf

The Jupyter Notebook (which runs on Google cloud, but can be also run
locally) uses Python kernel, NetworkX librabry for graph manipulation,
and matplotlib (via NetworkX) for display.

Available at:
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

I hope that could be of help, or at least interesting
--
Jakub Narębski


Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-03 Thread Jeff King
On Tue, Apr 03, 2018 at 02:47:27PM -0400, Jeff King wrote:

> On Tue, Apr 03, 2018 at 02:29:01PM -0400, Derrick Stolee wrote:
> 
> > If we have generic "can X reach Y?" queries, then we can also use generation
> > numbers there to great effect (by not walking commits Z with gen(Z) <=
> > gen(Y)). Perhaps I should look at that "git branch --contains" thread for
> > ideas.
> 
> I think the gist of it is the patch below. Which I hastily adapted from
> the patch we run at GitHub that uses timestamps as a proxy. So it's
> possible I completely flubbed the logic. I'm assuming unavailable
> generation numbers are set to 0; the logic is actually a bit simpler if
> they end up as (uint32_t)-1.

Oh indeed, that is already the value of your UNDEF. So the patch is more
like this:

diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..b147b1d0ee 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
 {
enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit 
*candidate,
return CONTAINS_YES;
}
 
-   /* Otherwise, we don't know; prepare to recurse */
parse_commit_or_die(candidate);
+
+   if (candidate->generation < cutoff)
+   return CONTAINS_NO;
+
return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
  struct contains_cache *cache)
 {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_UNDEF;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   parse_commit_or_die(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }
+   if (cutoff == GENERATION_NUMBER_UNDEF)
+   cutoff = GENERATION_NUMBER_NONE;
 
+   result = contains_test(candidate, want, cache, cutoff);
if (result != CONTAINS_UNKNOWN)
return result;
 
@@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
 * If we just popped the stack, parents->item has been marked,
 * therefore contains_test will return a meaningful yes/no.
 */
-   else switch (contains_test(parents->item, want, cache)) {
+   else switch (contains_test(parents->item, want, cache, cutoff)) 
{
case CONTAINS_YES:
*contains_cache_at(cache, commit) = CONTAINS_YES;
contains_stack.nr--;
@@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
}
}
free(contains_stack.contains_stack);
-   return contains_test(candidate, want, cache);
+   return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,


Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-03 Thread Jeff King
On Tue, Apr 03, 2018 at 02:29:01PM -0400, Derrick Stolee wrote:

> If we have generic "can X reach Y?" queries, then we can also use generation
> numbers there to great effect (by not walking commits Z with gen(Z) <=
> gen(Y)). Perhaps I should look at that "git branch --contains" thread for
> ideas.

I think the gist of it is the patch below. Which I hastily adapted from
the patch we run at GitHub that uses timestamps as a proxy. So it's
possible I completely flubbed the logic. I'm assuming unavailable
generation numbers are set to 0; the logic is actually a bit simpler if
they end up as (uint32_t)-1.

Assuming it works, that would cover for-each-ref and tag. You'd probably
want to drop the "with_commit_tag_algo" flag in ref-filter.h, and just
use always use it by default (and that would cover "git branch").

---
diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..6bea6173d1 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
 {
enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit 
*candidate,
return CONTAINS_YES;
}
 
-   /* Otherwise, we don't know; prepare to recurse */
parse_commit_or_die(candidate);
+
+   if (candidate->generation && candidate->generation < cutoff)
+   return CONTAINS_NO;
+
return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
  struct contains_cache *cache)
 {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = -1;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   parse_commit_or_die(c);
+   if (c->generation && c->generation < cutoff )
+   cutoff = c->generation;
+   }
+   if (cutoff == -1)
+   cutoff = 0;
 
+   result = contains_test(candidate, want, cache, cutoff);
if (result != CONTAINS_UNKNOWN)
return result;
 
@@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
 * If we just popped the stack, parents->item has been marked,
 * therefore contains_test will return a meaningful yes/no.
 */
-   else switch (contains_test(parents->item, want, cache)) {
+   else switch (contains_test(parents->item, want, cache, cutoff)) 
{
case CONTAINS_YES:
*contains_cache_at(cache, commit) = CONTAINS_YES;
contains_stack.nr--;
@@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
}
}
free(contains_stack.contains_stack);
-   return contains_test(candidate, want, cache);
+   return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,


Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-03 Thread Derrick Stolee

On 4/3/2018 2:03 PM, Brandon Williams wrote:

On 04/03, Derrick Stolee wrote:

This is the first of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one).

Thanks for ensuring that this is defined and documented somewhere :)


This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

This does not have a significant performance benefit in repositories
of normal size, but in the Windows repository, some merge-base
calculations improve from 3.1s to 2.9s. A modest speedup, but provides
an actual consumer of generation numbers as a starting point.

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

log --graph should benefit a lot more from this correct?  I know we've
talked a bit about negotiation and I wonder if these generation numbers
should be able to help out a little bit with that some day.


'log --graph' should be a HUGE speedup, when it is refactored. Since the 
topo-order can "stream" commits to the pager, it can be very responsive 
to return the graph in almost all conditions. (The case where generation 
numbers are not enough is when filters reduce the set of displayed 
commits to be very sparse, so many commits are walked anyway.)


If we have generic "can X reach Y?" queries, then we can also use 
generation numbers there to great effect (by not walking commits Z with 
gen(Z) <= gen(Y)). Perhaps I should look at that "git branch --contains" 
thread for ideas.


For negotiation, there are some things we can do here. VSTS uses 
generation numbers as a heuristic for determining "all wants connected 
to haves" which is a condition for halting negotiation. The idea is very 
simple, and I'd be happy to discuss it on a separate thread.


Thanks,
-Stolee


Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-03 Thread Brandon Williams
On 04/03, Derrick Stolee wrote:
> This is the first of several "small" patches that follow the serialized
> Git commit graph patch (ds/commit-graph).
> 
> As described in Documentation/technical/commit-graph.txt, the generation
> number of a commit is one more than the maximum generation number among
> its parents (trivially, a commit with no parents has generation number
> one).

Thanks for ensuring that this is defined and documented somewhere :)

> 
> This series makes the computation of generation numbers part of the
> commit-graph write process.
> 
> Finally, generation numbers are used to order commits in the priority
> queue in paint_down_to_common(). This allows a constant-time check in
> queue_has_nonstale() instead of the previous linear-time check.
> 
> This does not have a significant performance benefit in repositories
> of normal size, but in the Windows repository, some merge-base
> calculations improve from 3.1s to 2.9s. A modest speedup, but provides
> an actual consumer of generation numbers as a starting point.
> 
> A more substantial refactoring of revision.c is required before making
> 'git log --graph' use generation numbers effectively.

log --graph should benefit a lot more from this correct?  I know we've
talked a bit about negotiation and I wonder if these generation numbers
should be able to help out a little bit with that some day.

> 
> This patch series depends on v7 of ds/commit-graph.
> 
> Derrick Stolee (6):
>   object.c: parse commit in graph first
>   commit: add generation number to struct commmit
>   commit-graph: compute generation numbers
>   commit: sort by generation number in paint_down_to_common()
>   commit.c: use generation number to stop merge-base walks
>   commit-graph.txt: update design doc with generation numbers
> 
>  Documentation/technical/commit-graph.txt |  7 +---
>  alloc.c  |  1 +
>  commit-graph.c   | 48 +
>  commit.c | 53 
>  commit.h |  7 +++-
>  object.c |  4 +-
>  6 files changed, 104 insertions(+), 16 deletions(-)
> 
> -- 
> 2.17.0.20.g9f30ba16e1
> 

-- 
Brandon Williams


Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-03 Thread Derrick Stolee

On 4/3/2018 12:51 PM, Derrick Stolee wrote:

This is the first of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

This does not have a significant performance benefit in repositories
of normal size, but in the Windows repository, some merge-base
calculations improve from 3.1s to 2.9s. A modest speedup, but provides
an actual consumer of generation numbers as a starting point.

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (6):
   object.c: parse commit in graph first
   commit: add generation number to struct commmit
   commit-graph: compute generation numbers
   commit: sort by generation number in paint_down_to_common()
   commit.c: use generation number to stop merge-base walks
   commit-graph.txt: update design doc with generation numbers


This patch is also available as a GitHub pull request [1]

[1] https://github.com/derrickstolee/git/pull/5


[PATCH 0/6] Compute and consume generation numbers

2018-04-03 Thread Derrick Stolee
This is the first of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

This does not have a significant performance benefit in repositories
of normal size, but in the Windows repository, some merge-base
calculations improve from 3.1s to 2.9s. A modest speedup, but provides
an actual consumer of generation numbers as a starting point.

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (6):
  object.c: parse commit in graph first
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: sort by generation number in paint_down_to_common()
  commit.c: use generation number to stop merge-base walks
  commit-graph.txt: update design doc with generation numbers

 Documentation/technical/commit-graph.txt |  7 +---
 alloc.c  |  1 +
 commit-graph.c   | 48 +
 commit.c | 53 
 commit.h |  7 +++-
 object.c |  4 +-
 6 files changed, 104 insertions(+), 16 deletions(-)

-- 
2.17.0.20.g9f30ba16e1