Re: [PATCH 0/6] Compute and consume generation numbers
On 4/21/2018 4:44 PM, Jakub Narebski wrote: Jakub Narebskiwrites: Derrick Stolee writes: On 4/11/2018 3:32 PM, Jakub Narebski wrote: What would you suggest as a good test that could imply performance? The Google Colab notebook linked to above includes a function to count number of commits (nodes / vertices in the commit graph) walked, currently in the worst case scenario. The two main questions to consider are: 1. Can X reach Y? That is easy to do. The function generic_is_reachable() does that... though using direct translation of the pseudocode for "Algorithm 3: Reachable" from FELINE paper, which is recursive and doesn't check if vertex was already visited was not good idea for large graphs such as Linux kernel commit graph, oops. That is why generic_is_reachable_large() was created. [...] And the thing to measure is a commit count. If possible, it would be good to count commits walked (commits whose parent list is enumerated) and commits inspected (commits that were listed as a parent of some walked commit). Walked commits require a commit parse -- albeit from the commit-graph instead of the ODB now -- while inspected commits only check the in-memory cache. [...] For git.git and Linux, I like to use the release tags as tests. They provide a realistic view of the linear history, and maintenance releases have their own history from the major releases. Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the FELINE index does not bring any improvements over using just level (generation number) filter. But that may be caused by narrowing od commit DAG around releases. I try do do the same between commits in wide part, with many commits with the same level (same generation number) both for source and for target commit. Though this may be unfair to level filter, though... Note however that FELINE index is not unabiguous, like generation numbers are (modulo decision whether to start at 0 or at 1); it depends on the topological ordering chosen for the X elements. One can now test reachability on git.git repository; there is a form where one can plug source and destination revisions at https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK=2=1 I have tried the case that is quite unfair to the generation numbers filter, namely the check between one of recent tags, and one commit that shares generation number among largest number of other commits. Here level = generation number-1 (as it starts at 0 for root commit, not 1). The results are: * src = 468165c1d = v2.17.0 * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec * 468165c1d has level 18418 which it shares with 6 commits * 66d2e04ec has level 14776 which it shares with 93 commits * gen(468165c1d) - gen(66d2e04ec) = 3642 algorithm | access | walk | maxdepth | visited | level-f | FELINE-f | ---+-++--+-+--+---+ naive | 48865 | 39599 | 244 | 9200| | | level | 3086 | 2492 | 113 | 528| 285 | | FELINE | 283 | 216 | 68 |0| | 25 | lev+FELINE | 282 | 215 | 68 |0| 5 | 24 | ---+-++--+-+--+---+ lev+FEL+mpi|79 |59 | 21 |0| 0 | 0 | Here we have: * 'naive' implementation means simple DFS walk, without any filters (cut-offs) * 'level' means using levels / generation numbers based negative-cut filter * 'FELINE' means using FELINE index based negative-cut filter * 'lev+FELINE' means combining generation numbers filter with FELINE filter * 'mpi' means min-post [smanning-tree] intervals for positive-cut filter; note that the code does not walk the path after cut, but it is easy to do The stats have the following meaning: * 'access' means accessing the node * 'walk' is actual walking the node * 'maxdepth' is maximum depth of the stack used for DFS * 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter were used for negative-cut; note that those are not disjoint; node can be rejected by both level filter and FELINE filter For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor: the results are the same, with 5 commits accessed and 6 walked compared to 61574 accessed in naive algorithm. The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads (different child-less commits) and 9 roots, and has average clustering coefficient 0.000409217. Thanks for these results. Now, write a patch. I'm sticking to generation numbers for my patch because of the simplified computation, but you can contribute a FELINE implementation. P.S. Would it be better to move the discussion about possible extensions to the commit-graph in the form of new chunks (topological order, FELINE index, min-post
Re: [PATCH 0/6] Compute and consume generation numbers
Jakub Narebskiwrites: > Derrick Stolee writes: >> On 4/11/2018 3:32 PM, Jakub Narebski wrote: > >>> What would you suggest as a good test that could imply performance? The >>> Google Colab notebook linked to above includes a function to count >>> number of commits (nodes / vertices in the commit graph) walked, >>> currently in the worst case scenario. >> >> The two main questions to consider are: >> >> 1. Can X reach Y? > > That is easy to do. The function generic_is_reachable() does > that... though using direct translation of the pseudocode for > "Algorithm 3: Reachable" from FELINE paper, which is recursive and > doesn't check if vertex was already visited was not good idea for large > graphs such as Linux kernel commit graph, oops. That is why > generic_is_reachable_large() was created. [...] >> And the thing to measure is a commit count. If possible, it would be >> good to count commits walked (commits whose parent list is enumerated) >> and commits inspected (commits that were listed as a parent of some >> walked commit). Walked commits require a commit parse -- albeit from >> the commit-graph instead of the ODB now -- while inspected commits >> only check the in-memory cache. [...] >> >> For git.git and Linux, I like to use the release tags as tests. They >> provide a realistic view of the linear history, and maintenance >> releases have their own history from the major releases. > > Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the > FELINE index does not bring any improvements over using just level > (generation number) filter. But that may be caused by narrowing od > commit DAG around releases. > > I try do do the same between commits in wide part, with many commits > with the same level (same generation number) both for source and for > target commit. Though this may be unfair to level filter, though... > > > Note however that FELINE index is not unabiguous, like generation > numbers are (modulo decision whether to start at 0 or at 1); it depends > on the topological ordering chosen for the X elements. One can now test reachability on git.git repository; there is a form where one can plug source and destination revisions at https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK=2=1 I have tried the case that is quite unfair to the generation numbers filter, namely the check between one of recent tags, and one commit that shares generation number among largest number of other commits. Here level = generation number-1 (as it starts at 0 for root commit, not 1). The results are: * src = 468165c1d = v2.17.0 * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec * 468165c1d has level 18418 which it shares with 6 commits * 66d2e04ec has level 14776 which it shares with 93 commits * gen(468165c1d) - gen(66d2e04ec) = 3642 algorithm | access | walk | maxdepth | visited | level-f | FELINE-f | ---+-++--+-+--+---+ naive | 48865 | 39599 | 244 | 9200| | | level | 3086 | 2492 | 113 | 528| 285 | | FELINE | 283 | 216 | 68 |0| | 25 | lev+FELINE | 282 | 215 | 68 |0| 5 | 24 | ---+-++--+-+--+---+ lev+FEL+mpi|79 |59 | 21 |0| 0 | 0 | Here we have: * 'naive' implementation means simple DFS walk, without any filters (cut-offs) * 'level' means using levels / generation numbers based negative-cut filter * 'FELINE' means using FELINE index based negative-cut filter * 'lev+FELINE' means combining generation numbers filter with FELINE filter * 'mpi' means min-post [smanning-tree] intervals for positive-cut filter; note that the code does not walk the path after cut, but it is easy to do The stats have the following meaning: * 'access' means accessing the node * 'walk' is actual walking the node * 'maxdepth' is maximum depth of the stack used for DFS * 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter were used for negative-cut; note that those are not disjoint; node can be rejected by both level filter and FELINE filter For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor: the results are the same, with 5 commits accessed and 6 walked compared to 61574 accessed in naive algorithm. The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads (different child-less commits) and 9 roots, and has average clustering coefficient 0.000409217. P.S. Would it be better to move the discussion about possible extensions to the commit-graph in the form of new chunks (topological order, FELINE index, min-post intervals, bloom filter for changed files, etc.) be moved into separate thread? -- Jakub Narębski
Re: [PATCH 0/6] Compute and consume generation numbers
Derrick Stoleewrites: > On 4/11/2018 3:32 PM, Jakub Narebski wrote: >> What would you suggest as a good test that could imply performance? The >> Google Colab notebook linked to above includes a function to count >> number of commits (nodes / vertices in the commit graph) walked, >> currently in the worst case scenario. > > The two main questions to consider are: > > 1. Can X reach Y? That is easy to do. The function generic_is_reachable() does that... though using direct translation of the pseudocode for "Algorithm 3: Reachable" from FELINE paper, which is recursive and doesn't check if vertex was already visited was not good idea for large graphs such as Linux kernel commit graph, oops. That is why generic_is_reachable_large() was created. > 2. What is the set of merge-bases between X and Y? I don't have an algorithm for that in the Google Colaboratory notebook. Though I see that there exist algorithms for calculating lowest common ancestors in DAGs... I'll have to take a look how Git does that. > > And the thing to measure is a commit count. If possible, it would be > good to count commits walked (commits whose parent list is enumerated) > and commits inspected (commits that were listed as a parent of some > walked commit). Walked commits require a commit parse -- albeit from > the commit-graph instead of the ODB now -- while inspected commits > only check the in-memory cache. I don't quite see the distinction. Whether we access generation number of a commit (information about level of vertex in graph), or a parent list (vertex successors / neighbours), it both needs accessing commit-graph; well, accessing parents may be more costly for octopus merges (due to having to go through EDGE chunk). I can easily return the set of visited commits (vertices), or just size of said set. > > For git.git and Linux, I like to use the release tags as tests. They > provide a realistic view of the linear history, and maintenance > releases have their own history from the major releases. Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the FELINE index does not bring any improvements over using just level (generation number) filter. But that may be caused by narrowing od commit DAG around releases. I try do do the same between commits in wide part, with many commits with the same level (same generation number) both for source and for target commit. Though this may be unfair to level filter, though... Note however that FELINE index is not unabiguous, like generation numbers are (modulo decision whether to start at 0 or at 1); it depends on the topological ordering chosen for the X elements. >> I have tried finding number of false positives for level (generation >> number) filter and for FELINE index, and number of false negatives for >> min-post intervals in the spanning tree (for DFS tree) for 1 >> randomly selected pairs of commits... but I don't think this is a good >> benchmark. > > What is a false-positive? A case where gen(X) < gen(Y) but Y cannot > reach X? Yes. (And equivalent for FELINE index, which is a pair of integers). > I do not think that is a great benchmark, but I guess it is > something to measure. I have simply used it to have something to compare. >> I Linux kernel sources >> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) >> that has 750832 nodes and 811733 edges, and 563747941392 possible >> directed pairs, we have for 1 randomly selected pairs of commits: >> >>level-filter has91 = 0.91% [all] false positives >>FELINE index has78 = 0.78% [all] false positives >>FELINE index has 1.16667 less false positives than level filter >> >>min-post spanning-tree intervals has 3641 = 36.41% [all] false >>negatives > > Perhaps something you can do instead of sampling from N^2 commits in > total is to select a pair of generations (say, G = 2, G' = 20100) > or regions of generations ( 2 <= G <= 20050, 20100 <= G' <= 20150) > and see how many false positives you see by testing all pairs (one > from each level). The delta between the generations may need to be > smaller to actually have a large proportion of unreachable pairs. Try > different levels, since major version releases tend to "pinch" the > commit graph to a common history. That's a good idea. >> For git.git repository (https://github.com/git/git.git) that has 52950 >> nodes and 65887 edges the numbers are slighly more in FELINE index >> favor (also out of 1 random pairs): >> >>level-filter has 504 = 9.11% false positives >>FELINE index has 125 = 2.26% false positives >>FELINE index has 4.032 less false positives than level filter >> >> This is for FELINE which does not use level / generatio-numbers filter. Best, -- Jakub Narębski
Re: [PATCH 0/6] Compute and consume generation numbers
On 4/11/2018 3:32 PM, Jakub Narebski wrote: What would you suggest as a good test that could imply performance? The Google Colab notebook linked to above includes a function to count number of commits (nodes / vertices in the commit graph) walked, currently in the worst case scenario. The two main questions to consider are: 1. Can X reach Y? 2. What is the set of merge-bases between X and Y? And the thing to measure is a commit count. If possible, it would be good to count commits walked (commits whose parent list is enumerated) and commits inspected (commits that were listed as a parent of some walked commit). Walked commits require a commit parse -- albeit from the commit-graph instead of the ODB now -- while inspected commits only check the in-memory cache. For git.git and Linux, I like to use the release tags as tests. They provide a realistic view of the linear history, and maintenance releases have their own history from the major releases. I have tried finding number of false positives for level (generation number) filter and for FELINE index, and number of false negatives for min-post intervals in the spanning tree (for DFS tree) for 1 randomly selected pairs of commits... but I don't think this is a good benchmark. What is a false-positive? A case where gen(X) < gen(Y) but Y cannot reach X? I do not think that is a great benchmark, but I guess it is something to measure. I Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) that has 750832 nodes and 811733 edges, and 563747941392 possible directed pairs, we have for 1 randomly selected pairs of commits: level-filter has91 = 0.91% [all] false positives FELINE index has78 = 0.78% [all] false positives FELINE index has 1.16667 less false positives than level filter min-post spanning-tree intervals has 3641 = 36.41% [all] false negatives Perhaps something you can do instead of sampling from N^2 commits in total is to select a pair of generations (say, G = 2, G' = 20100) or regions of generations ( 2 <= G <= 20050, 20100 <= G' <= 20150) and see how many false positives you see by testing all pairs (one from each level). The delta between the generations may need to be smaller to actually have a large proportion of unreachable pairs. Try different levels, since major version releases tend to "pinch" the commit graph to a common history. For git.git repository (https://github.com/git/git.git) that has 52950 nodes and 65887 edges the numbers are slighly more in FELINE index favor (also out of 1 random pairs): level-filter has 504 = 9.11% false positives FELINE index has 125 = 2.26% false positives FELINE index has 4.032 less false positives than level filter This is for FELINE which does not use level / generatio-numbers filter. Thanks, -Stolee
Re: [PATCH 0/6] Compute and consume generation numbers
Derrick Stoleewrites: > On 4/7/2018 12:55 PM, Jakub Narebski wrote: >> Currently I am at the stage of reproducing results in FELINE paper: >> "Reachability Queries in Very Large Graphs: A Fast Refined Online Search >> Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed >> J. Zaki (2014). This paper is available in the PDF form at >> https://openproceedings.org/EDBT/2014/paper_166.pdf >> >> The Jupyter Notebook (which runs on Google cloud, but can be also run >> locally) uses Python kernel, NetworkX librabry for graph manipulation, >> and matplotlib (via NetworkX) for display. >> >> Available at: >> https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg >> https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing >> >> I hope that could be of help, or at least interesting > > Let me know when you can give numbers (either raw performance or # of > commits walked) for real-world Git commit graphs. The Linux repo is a > good example to use for benchmarking, but I also use the Kotlin repo > sometimes as it has over a million objects and over 250K commits. As I am curently converting git repository into commit graph, number of objects doesn't matter. Though Kotlin is nicely in largish size set, not as large as Linux kernel which has 750K commits, but mich larger than git.git with 65K commits. > Of course, the only important statistic at the end of the day is the > end-to-end time of a 'git ...' command. Your investigations should > inform whether it is worth prototyping the feature in the git > codebase. What would you suggest as a good test that could imply performance? The Google Colab notebook linked to above includes a function to count number of commits (nodes / vertices in the commit graph) walked, currently in the worst case scenario. I have tried finding number of false positives for level (generation number) filter and for FELINE index, and number of false negatives for min-post intervals in the spanning tree (for DFS tree) for 1 randomly selected pairs of commits... but I don't think this is a good benchmark. I Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) that has 750832 nodes and 811733 edges, and 563747941392 possible directed pairs, we have for 1 randomly selected pairs of commits: level-filter has91 = 0.91% [all] false positives FELINE index has78 = 0.78% [all] false positives FELINE index has 1.16667 less false positives than level filter min-post spanning-tree intervals has 3641 = 36.41% [all] false negatives For git.git repository (https://github.com/git/git.git) that has 52950 nodes and 65887 edges the numbers are slighly more in FELINE index favor (also out of 1 random pairs): level-filter has 504 = 9.11% false positives FELINE index has 125 = 2.26% false positives FELINE index has 4.032 less false positives than level filter This is for FELINE which does not use level / generatio-numbers filter. Regards, -- Jakub Narębski
Re: [PATCH 0/6] Compute and consume generation numbers
On 4/7/2018 12:55 PM, Jakub Narebski wrote: Currently I am at the stage of reproducing results in FELINE paper: "Reachability Queries in Very Large Graphs: A Fast Refined Online Search Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed J. Zaki (2014). This paper is available in the PDF form at https://openproceedings.org/EDBT/2014/paper_166.pdf The Jupyter Notebook (which runs on Google cloud, but can be also run locally) uses Python kernel, NetworkX librabry for graph manipulation, and matplotlib (via NetworkX) for display. Available at: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing I hope that could be of help, or at least interesting Let me know when you can give numbers (either raw performance or # of commits walked) for real-world Git commit graphs. The Linux repo is a good example to use for benchmarking, but I also use the Kotlin repo sometimes as it has over a million objects and over 250K commits. Of course, the only important statistic at the end of the day is the end-to-end time of a 'git ...' command. Your investigations should inform whether it is worth prototyping the feature in the git codebase. Thanks, -Stolee
Re: [PATCH 0/6] Compute and consume generation numbers
Derrick Stoleewrites: > On 4/3/2018 2:03 PM, Brandon Williams wrote: >> On 04/03, Derrick Stolee wrote: >>> This is the first of several "small" patches that follow the serialized >>> Git commit graph patch (ds/commit-graph). >>> >>> As described in Documentation/technical/commit-graph.txt, the generation >>> number of a commit is one more than the maximum generation number among >>> its parents (trivially, a commit with no parents has generation number >>> one). [...] >>> A more substantial refactoring of revision.c is required before making >>> 'git log --graph' use generation numbers effectively. >> >> log --graph should benefit a lot more from this correct? I know we've >> talked a bit about negotiation and I wonder if these generation numbers >> should be able to help out a little bit with that some day. > > 'log --graph' should be a HUGE speedup, when it is refactored. Since > the topo-order can "stream" commits to the pager, it can be very > responsive to return the graph in almost all conditions. (The case > where generation numbers are not enough is when filters reduce the set > of displayed commits to be very sparse, so many commits are walked > anyway.) I wonder if next big speedup would be to store [some] topological ordering of commits in the commit graph... It could be done for example in two chunks: a mapping to position in topological order, and list of commits sorted in topological order. Note also that FELINE index uses (or can use -- but it is supposedly the optimal choice) position of vertex/node in topological order as one of the two values in the pair that composes FELINE index. > If we have generic "can X reach Y?" queries, then we can also use > generation numbers there to great effect (by not walking commits Z > with gen(Z) <= gen(Y)). Perhaps I should look at that "git branch > --contains" thread for ideas. This is something that is shown in the Google Colab [Jupyter] Notebook I have mentioned: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing > For negotiation, there are some things we can do here. VSTS uses > generation numbers as a heuristic for determining "all wants connected > to haves" which is a condition for halting negotiation. The idea is > very simple, and I'd be happy to discuss it on a separate thread. Nice. How much speedup it gives? Best regards, -- Jakub Narębski
Re: [PATCH 0/6] Compute and consume generation numbers
Hello, Derrick Stoleewrites: > This is the first of several "small" patches that follow the serialized > Git commit graph patch (ds/commit-graph). > > As described in Documentation/technical/commit-graph.txt, the generation > number of a commit is one more than the maximum generation number among > its parents (trivially, a commit with no parents has generation number > one). > > This series makes the computation of generation numbers part of the > commit-graph write process. > > Finally, generation numbers are used [...]. > > This does not have a significant performance benefit in repositories > of normal size, but in the Windows repository, some merge-base > calculations improve from 3.1s to 2.9s. A modest speedup, but provides > an actual consumer of generation numbers as a starting point. > > A more substantial refactoring of revision.c is required before making > 'git log --graph' use generation numbers effectively. I have started working on Jupyter Notebook on Google Colaboratory to find out how much speedup we can get using generation numbers (level negative-cut filter), FELINE index (negative-cut filter) and min-post intervals in some spanning tree (positive-cut filter, if I understand it correctly the base of GRAIL method) in commit graphs. Currently I am at the stage of reproducing results in FELINE paper: "Reachability Queries in Very Large Graphs: A Fast Refined Online Search Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed J. Zaki (2014). This paper is available in the PDF form at https://openproceedings.org/EDBT/2014/paper_166.pdf The Jupyter Notebook (which runs on Google cloud, but can be also run locally) uses Python kernel, NetworkX librabry for graph manipulation, and matplotlib (via NetworkX) for display. Available at: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing I hope that could be of help, or at least interesting -- Jakub Narębski
Re: [PATCH 0/6] Compute and consume generation numbers
On Tue, Apr 03, 2018 at 02:47:27PM -0400, Jeff King wrote: > On Tue, Apr 03, 2018 at 02:29:01PM -0400, Derrick Stolee wrote: > > > If we have generic "can X reach Y?" queries, then we can also use generation > > numbers there to great effect (by not walking commits Z with gen(Z) <= > > gen(Y)). Perhaps I should look at that "git branch --contains" thread for > > ideas. > > I think the gist of it is the patch below. Which I hastily adapted from > the patch we run at GitHub that uses timestamps as a proxy. So it's > possible I completely flubbed the logic. I'm assuming unavailable > generation numbers are set to 0; the logic is actually a bit simpler if > they end up as (uint32_t)-1. Oh indeed, that is already the value of your UNDEF. So the patch is more like this: diff --git a/ref-filter.c b/ref-filter.c index 45fc56216a..b147b1d0ee 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate, return CONTAINS_YES; } - /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_UNDEF; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + if (cutoff == GENERATION_NUMBER_UNDEF) + cutoff = GENERATION_NUMBER_NONE; + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit,
Re: [PATCH 0/6] Compute and consume generation numbers
On Tue, Apr 03, 2018 at 02:29:01PM -0400, Derrick Stolee wrote: > If we have generic "can X reach Y?" queries, then we can also use generation > numbers there to great effect (by not walking commits Z with gen(Z) <= > gen(Y)). Perhaps I should look at that "git branch --contains" thread for > ideas. I think the gist of it is the patch below. Which I hastily adapted from the patch we run at GitHub that uses timestamps as a proxy. So it's possible I completely flubbed the logic. I'm assuming unavailable generation numbers are set to 0; the logic is actually a bit simpler if they end up as (uint32_t)-1. Assuming it works, that would cover for-each-ref and tag. You'd probably want to drop the "with_commit_tag_algo" flag in ref-filter.h, and just use always use it by default (and that would cover "git branch"). --- diff --git a/ref-filter.c b/ref-filter.c index 45fc56216a..6bea6173d1 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate, return CONTAINS_YES; } - /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation && candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = -1; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation && c->generation < cutoff ) + cutoff = c->generation; + } + if (cutoff == -1) + cutoff = 0; + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit,
Re: [PATCH 0/6] Compute and consume generation numbers
On 4/3/2018 2:03 PM, Brandon Williams wrote: On 04/03, Derrick Stolee wrote: This is the first of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). Thanks for ensuring that this is defined and documented somewhere :) This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a constant-time check in queue_has_nonstale() instead of the previous linear-time check. This does not have a significant performance benefit in repositories of normal size, but in the Windows repository, some merge-base calculations improve from 3.1s to 2.9s. A modest speedup, but provides an actual consumer of generation numbers as a starting point. A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. log --graph should benefit a lot more from this correct? I know we've talked a bit about negotiation and I wonder if these generation numbers should be able to help out a little bit with that some day. 'log --graph' should be a HUGE speedup, when it is refactored. Since the topo-order can "stream" commits to the pager, it can be very responsive to return the graph in almost all conditions. (The case where generation numbers are not enough is when filters reduce the set of displayed commits to be very sparse, so many commits are walked anyway.) If we have generic "can X reach Y?" queries, then we can also use generation numbers there to great effect (by not walking commits Z with gen(Z) <= gen(Y)). Perhaps I should look at that "git branch --contains" thread for ideas. For negotiation, there are some things we can do here. VSTS uses generation numbers as a heuristic for determining "all wants connected to haves" which is a condition for halting negotiation. The idea is very simple, and I'd be happy to discuss it on a separate thread. Thanks, -Stolee
Re: [PATCH 0/6] Compute and consume generation numbers
On 04/03, Derrick Stolee wrote: > This is the first of several "small" patches that follow the serialized > Git commit graph patch (ds/commit-graph). > > As described in Documentation/technical/commit-graph.txt, the generation > number of a commit is one more than the maximum generation number among > its parents (trivially, a commit with no parents has generation number > one). Thanks for ensuring that this is defined and documented somewhere :) > > This series makes the computation of generation numbers part of the > commit-graph write process. > > Finally, generation numbers are used to order commits in the priority > queue in paint_down_to_common(). This allows a constant-time check in > queue_has_nonstale() instead of the previous linear-time check. > > This does not have a significant performance benefit in repositories > of normal size, but in the Windows repository, some merge-base > calculations improve from 3.1s to 2.9s. A modest speedup, but provides > an actual consumer of generation numbers as a starting point. > > A more substantial refactoring of revision.c is required before making > 'git log --graph' use generation numbers effectively. log --graph should benefit a lot more from this correct? I know we've talked a bit about negotiation and I wonder if these generation numbers should be able to help out a little bit with that some day. > > This patch series depends on v7 of ds/commit-graph. > > Derrick Stolee (6): > object.c: parse commit in graph first > commit: add generation number to struct commmit > commit-graph: compute generation numbers > commit: sort by generation number in paint_down_to_common() > commit.c: use generation number to stop merge-base walks > commit-graph.txt: update design doc with generation numbers > > Documentation/technical/commit-graph.txt | 7 +--- > alloc.c | 1 + > commit-graph.c | 48 + > commit.c | 53 > commit.h | 7 +++- > object.c | 4 +- > 6 files changed, 104 insertions(+), 16 deletions(-) > > -- > 2.17.0.20.g9f30ba16e1 > -- Brandon Williams
Re: [PATCH 0/6] Compute and consume generation numbers
On 4/3/2018 12:51 PM, Derrick Stolee wrote: This is the first of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a constant-time check in queue_has_nonstale() instead of the previous linear-time check. This does not have a significant performance benefit in repositories of normal size, but in the Windows repository, some merge-base calculations improve from 3.1s to 2.9s. A modest speedup, but provides an actual consumer of generation numbers as a starting point. A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series depends on v7 of ds/commit-graph. Derrick Stolee (6): object.c: parse commit in graph first commit: add generation number to struct commmit commit-graph: compute generation numbers commit: sort by generation number in paint_down_to_common() commit.c: use generation number to stop merge-base walks commit-graph.txt: update design doc with generation numbers This patch is also available as a GitHub pull request [1] [1] https://github.com/derrickstolee/git/pull/5
[PATCH 0/6] Compute and consume generation numbers
This is the first of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a constant-time check in queue_has_nonstale() instead of the previous linear-time check. This does not have a significant performance benefit in repositories of normal size, but in the Windows repository, some merge-base calculations improve from 3.1s to 2.9s. A modest speedup, but provides an actual consumer of generation numbers as a starting point. A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series depends on v7 of ds/commit-graph. Derrick Stolee (6): object.c: parse commit in graph first commit: add generation number to struct commmit commit-graph: compute generation numbers commit: sort by generation number in paint_down_to_common() commit.c: use generation number to stop merge-base walks commit-graph.txt: update design doc with generation numbers Documentation/technical/commit-graph.txt | 7 +--- alloc.c | 1 + commit-graph.c | 48 + commit.c | 53 commit.h | 7 +++- object.c | 4 +- 6 files changed, 104 insertions(+), 16 deletions(-) -- 2.17.0.20.g9f30ba16e1