[PATCH v2 5/6] pack-objects: create pack.useSparse setting
From: Derrick Stolee The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee --- builtin/pack-objects.c | 4 t/t5322-pack-objects-sparse.sh | 15 +++ 2 files changed, 19 insertions(+) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 7d5b0735e3..124b1bafc4 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) use_bitmap_index_default = git_config_bool(k, v); return 0; } + if (!strcmp(k, "pack.usesparse")) { + sparse = git_config_bool(k, v); + return 0; + } if (!strcmp(k, "pack.threads")) { delta_search_threads = git_config_int(k, v); if (delta_search_threads < 0) diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 45dba6e014..8f5699bd91 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_sparse_objects.txt sparse_objects.txt ' +test_expect_success 'pack.useSparse enables algorithm' ' + git config pack.useSparse true && + git pack-objects --stdout --revs sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_sparse_objects.txt sparse_objects.txt +' + +test_expect_success 'pack.useSparse overridden' ' + git pack-objects --stdout --revs --no-sparse sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + test_done -- gitgitgadget
[PATCH v2 0/6] Add a new "sparse" tree walk algorithm
One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating Objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern. Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit. Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects. The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2. As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects. In order to benefit from this approach, the user can opt-in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option. Update in V2: * Added GIT_TEST_PACK_SPARSE test option. * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null checks. Derrick Stolee (6): revision: add mark_tree_uninteresting_sparse list-objects: consume sparse tree walk pack-objects: add --sparse option revision: implement sparse algorithm pack-objects: create pack.useSparse setting pack-objects: create GIT_TEST_PACK_SPARSE Documentation/git-pack-objects.txt | 9 +- bisect.c | 2 +- builtin/pack-objects.c | 10 ++- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 55 +++- list-objects.h | 4 +- revision.c | 121 + revision.h | 2 + t/README | 4 + t/t5322-pack-objects-sparse.sh | 139 + 11 files changed, 340 insertions(+), 10 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/89 Range-diff vs v1: 1: b73b8de98c = 1: 60617681f7 revision: add mark_tree_uninteresting_sparse 2: 9bf04c748b ! 2: 4527addacb list-objects: consume sparse tree walk @@ -116,6 +116,10 @@ + for (parents = commit->parents; parents; parents = parents->next) { + struct commit *parent = parents->item; + struct tree *tree = get_commit_tree(parent); ++ ++ if (!tree) ++ continue; ++ + oidset_insert(set, >object.oid); + + if (!(parent->object.flags & UNINTERESTING)) @@ -142,14 +146,14 @@ + for (list = revs->commits; list; list = list->next) { struct commit *commit = list->item; -+ + +- if (commit->object.flags & UNINTERESTING) { + if (sparse) { + struct tree *tree = get_commit_tree(commit); -+ ++ + if (commit->object.flags & UNINTERESTING) + tree->object.flags |= UNINTERESTING; - -- if (commit->object.flags & UNINTERESTING) { ++ + oidset_insert(, >object.oid); + add_edge_parents(commit, revs, show_edge, ); + } else if (commit->object.flags & UNINTERESTING) { @@ -189,3 +193,17 @@ struct oidset; struct list_objects_filter_options; + +diff --git a/revision.c b/revision.c +--- a/revision.c b/revision.c +@@ + while ((oid = oidset_iter_next())) { + struct tree *tree = lookup_tree(r, oid); + ++ if (!tree) ++ continue; ++ + if (tree->object.flags & UNINTERESTING) { + /* + * Remove
[PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse
From: Derrick Stolee In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING. The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method. Signed-off-by: Derrick Stolee --- revision.c | 22 ++ revision.h | 2 ++ 2 files changed, 24 insertions(+) diff --git a/revision.c b/revision.c index 13e0519c02..3a62c7c187 100644 --- a/revision.c +++ b/revision.c @@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +void mark_trees_uninteresting_sparse(struct repository *r, +struct oidset *set) +{ + struct object_id *oid; + struct oidset_iter iter; + + oidset_iter_init(set, ); + while ((oid = oidset_iter_next())) { + struct tree *tree = lookup_tree(r, oid); + + if (tree->object.flags & UNINTERESTING) { + /* +* Remove the flag so the next call +* is not a no-op. The flag is added +* in mark_tree_unintersting(). +*/ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } + } +} + struct commit_stack { struct commit **items; size_t nr, alloc; diff --git a/revision.h b/revision.h index 7987bfcd2e..f828e91ae9 100644 --- a/revision.h +++ b/revision.h @@ -67,6 +67,7 @@ struct rev_cmdline_info { #define REVISION_WALK_NO_WALK_SORTED 1 #define REVISION_WALK_NO_WALK_UNSORTED 2 +struct oidset; struct topo_walk_info; struct rev_info { @@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs, void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); +void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set); void show_object_with_name(FILE *, struct object *, const char *); -- gitgitgadget
[PATCH v2 2/6] list-objects: consume sparse tree walk
From: Derrick Stolee When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. The mark_edges_unintersting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERSTING parent (and all objects that tree can reach) as UNINTERSTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly new way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. The sparse option is not used by any callers at the moment, but will be wired to 'git pack-objects' in the next change. Signed-off-by: Derrick Stolee --- bisect.c | 2 +- builtin/pack-objects.c | 2 +- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 55 +++--- list-objects.h | 4 ++- revision.c | 3 +++ 7 files changed, 61 insertions(+), 9 deletions(-) diff --git a/bisect.c b/bisect.c index 487675c672..842f8b4b8f 100644 --- a/bisect.c +++ b/bisect.c @@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs) if (prepare_revision_walk(revs)) die("revision walk setup failed"); if (revs->tree_objects) - mark_edges_uninteresting(revs, NULL); + mark_edges_uninteresting(revs, NULL, 0); } static void exit_if_skipped_commits(struct commit_list *tried, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 411aefd687..5f70d840a7 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk()) die(_("revision walk setup failed")); - mark_edges_uninteresting(, show_edge); + mark_edges_uninteresting(, show_edge, 0); if (!fn_show_object) fn_show_object = show_object; diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 2880ed37e3..9663cbfae0 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (prepare_revision_walk()) die("revision walk setup failed"); if (revs.tree_objects) - mark_edges_uninteresting(, show_edge); + mark_edges_uninteresting(, show_edge, 0); if (bisect_list) { int reaches, all; diff --git a/http-push.c b/http-push.c index cd48590912..ea52d6f9f6 100644 --- a/http-push.c +++ b/http-push.c @@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv) pushing = 0; if (prepare_revision_walk()) die("revision walk setup failed"); - mark_edges_uninteresting(, NULL); + mark_edges_uninteresting(, NULL, 0); objects_to_send = get_delta(, ref_lock); finish_all_active_slots(); diff --git a/list-objects.c b/list-objects.c index c41cc80db5..4fbdeca0a4 100644 --- a/list-objects.c +++ b/list-objects.c @@ -222,25 +222,72 @@ static void mark_edge_parents_uninteresting(struct commit *commit, } } -void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge) +static void add_edge_parents(struct commit *commit, +struct rev_info *revs, +show_edge_fn show_edge, +struct oidset *set) +{ + struct commit_list *parents; + + for (parents = commit->parents; parents; parents = parents->next) { + struct commit *parent = parents->item; +
[PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE
From: Derrick Stolee Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse object walk algorithm by default during the test suite. Enabling this variable ensures coverage in many interesting cases, such as shallow clones, partial clones, and missing objects. Signed-off-by: Derrick Stolee --- builtin/pack-objects.c | 1 + t/README | 4 t/t5322-pack-objects-sparse.sh | 6 +++--- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 124b1bafc4..507d381153 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) read_replace_refs = 0; + sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0); reset_pack_idx_option(_idx_opts); git_config(git_pack_config, NULL); diff --git a/t/README b/t/README index 28711cc508..8b6dfe1864 100644 --- a/t/README +++ b/t/README @@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION= exercises the index read/write code path for the index version specified. Can be set to any valid version (currently 2, 3, or 4). +GIT_TEST_PACK_SPARSE= if enabled will default the pack-objects +builtin to use the sparse object walk. This can still be overridden by +the --no-sparse command-line argument. + GIT_TEST_PRELOAD_INDEX= exercises the preload-index code path by overriding the minimum number of cache entries required per thread. diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 8f5699bd91..e8cf41d1c6 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -36,7 +36,7 @@ test_expect_success 'setup repo' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' ' topic1 \ topic1^{tree} \ topic1:f3 | sort >expect_objects.txt && - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt -- gitgitgadget
[PATCH v2 4/6] revision: implement sparse algorithm
From: Derrick Stolee When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commit we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees to find the interesting trees. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees. The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path. We already had taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. To actually improve the peformance, we need to terminate our recursion unless the oidset contains some intersting trees and some uninteresting trees. Technically, we only need one interesting tree for this to speed up in most cases, but we also will not mark anything UNINTERESTING if there are no uninteresting trees, so that would be wasted effort. There are a few ways that this is not a universally better option. First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. We add a test case that demonstrates this as a way to prove that the --sparse option is actually working. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries. Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. For git.git, I used the following input: v2.19.0 ^v2.19.0~10 Objects to pack: 550 Walked (old alg): 282 Walked (new alg): 130 For the Linux repo, I used the following input: v4.18 ^v4.18~10 Objects to pack: 518 Walked (old alg): 4,836 Walked (new alg): 188 The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo. Objects to pack:220 Walked (old alg): 22,804 Walked (new alg):129 I used the number of walked trees the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on the file system being warm. However, by repeating the same test on repeat I could get more consistent timing results. The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing. These improvements will have even larger benefits in the super- large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack. Signed-off-by: Derrick Stolee --- revision.c | 116 ++--- t/t5322-pack-objects-sparse.sh | 21 -- 2 files changed, 121 insertions(+), 16 deletions(-) diff --git a/revision.c b/revision.c index f9eb6400f1..971f1bb095 100644 --- a/revision.c +++ b/revision.c @@ -99,29 +99,125 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +struct paths_and_oids { + struct string_list list; +}; + +static void paths_and_oids_init(struct paths_and_oids *po) +{ + string_list_init(>list, 1); +} + +static void
[PATCH v2 3/6] pack-objects: add --sparse option
From: Derrick Stolee Add a '--sparse' option flag to the pack-objects builtin. This allows the user to specify that they want to use the new logic for walking trees. This logic currently does not differ from the existing output, but will in a later change. Create a new test script, t5322-pack-objects-sparse.sh, to ensure the object list that is selected matches what we expect. When we update the logic to walk in a sparse fashion, the final test will be updated to show the extra objects that are added. Signed-off-by: Derrick Stolee --- Documentation/git-pack-objects.txt | 9 ++- builtin/pack-objects.c | 5 +- t/t5322-pack-objects-sparse.sh | 115 + 3 files changed, 127 insertions(+), 2 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 40c825c381..ced2630eb3 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -14,7 +14,7 @@ SYNOPSIS [--local] [--incremental] [--window=] [--depth=] [--revs [--unpacked | --all]] [--keep-pack=] [--stdout [--filter=] | base-name] - [--shallow] [--keep-true-parents] < object-list + [--shallow] [--keep-true-parents] [--sparse] < object-list DESCRIPTION @@ -196,6 +196,13 @@ depth is 4095. Add --no-reuse-object if you want to force a uniform compression level on all data no matter the source. +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack. This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. + --thin:: Create a "thin" pack by omitting the common objects between a sender and a receiver in order to reduce network transfer. This diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 5f70d840a7..7d5b0735e3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; static int depth = 50; static int delta_search_threads; static int pack_to_stdout; +static int sparse; static int thin; static int num_preferred_base; static struct progress *progress_state; @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk()) die(_("revision walk setup failed")); - mark_edges_uninteresting(, show_edge, 0); + mark_edges_uninteresting(, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than "), PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, + OPT_BOOL(0, "sparse", , +N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", , N_("create thin packs")), OPT_BOOL(0, "shallow", , diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 00..81f6805bc3 --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,115 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1:f1/f1\ + topic1:f1/f1/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse sparse.pack && +
[PATCH 4/5] revision: implement sparse algorithm
From: Derrick Stolee When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commit we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees to find the interesting trees. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees. The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path. We already had taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. To actually improve the peformance, we need to terminate our recursion unless the oidset contains some intersting trees and some uninteresting trees. Technically, we only need one interesting tree for this to speed up in most cases, but we also will not mark anything UNINTERESTING if there are no uninteresting trees, so that would be wasted effort. There are a few ways that this is not a universally better option. First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. We add a test case that demonstrates this as a way to prove that the --sparse option is actually working. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries. Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. For git.git, I used the following input: v2.19.0 ^v2.19.0~10 Objects to pack: 550 Walked (old alg): 282 Walked (new alg): 130 For the Linux repo, I used the following input: v4.18 ^v4.18~10 Objects to pack: 518 Walked (old alg): 4,836 Walked (new alg): 188 The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo. Objects to pack:220 Walked (old alg): 22,804 Walked (new alg):129 I used the number of walked trees the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on the file system being warm. However, by repeating the same test on repeat I could get more consistent timing results. The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing. These improvements will have even larger benefits in the super- large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack. Signed-off-by: Derrick Stolee --- revision.c | 111 ++--- t/t5322-pack-objects-sparse.sh | 21 +-- 2 files changed, 116 insertions(+), 16 deletions(-) diff --git a/revision.c b/revision.c index 3a62c7c187..7e4bfe621a 100644 --- a/revision.c +++ b/revision.c @@ -99,26 +99,117 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +struct paths_and_oids { + struct string_list list; +}; + +static void paths_and_oids_init(struct paths_and_oids *po) +{ + string_list_init(>list, 1); +} + +static void
[PATCH 0/5] Add a new "sparse" tree walk algorithm
One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating Objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern. Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit. Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects. The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2. As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects. In order to benefit from this approach, the user can opt-in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option. Derrick Stolee (5): revision: add mark_tree_uninteresting_sparse list-objects: consume sparse tree walk pack-objects: add --sparse option revision: implement sparse algorithm pack-objects: create pack.useSparse setting Documentation/git-pack-objects.txt | 9 +- bisect.c | 2 +- builtin/pack-objects.c | 9 +- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 51 ++- list-objects.h | 4 +- revision.c | 113 +++ revision.h | 2 + t/t5322-pack-objects-sparse.sh | 139 + 10 files changed, 323 insertions(+), 10 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/89 -- gitgitgadget
[PATCH 3/5] pack-objects: add --sparse option
From: Derrick Stolee Add a '--sparse' option flag to the pack-objects builtin. This allows the user to specify that they want to use the new logic for walking trees. This logic currently does not differ from the existing output, but will in a later change. Create a new test script, t5322-pack-objects-sparse.sh, to ensure the object list that is selected matches what we expect. When we update the logic to walk in a sparse fashion, the final test will be updated to show the extra objects that are added. Signed-off-by: Derrick Stolee --- Documentation/git-pack-objects.txt | 9 ++- builtin/pack-objects.c | 5 +- t/t5322-pack-objects-sparse.sh | 115 + 3 files changed, 127 insertions(+), 2 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 40c825c381..ced2630eb3 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -14,7 +14,7 @@ SYNOPSIS [--local] [--incremental] [--window=] [--depth=] [--revs [--unpacked | --all]] [--keep-pack=] [--stdout [--filter=] | base-name] - [--shallow] [--keep-true-parents] < object-list + [--shallow] [--keep-true-parents] [--sparse] < object-list DESCRIPTION @@ -196,6 +196,13 @@ depth is 4095. Add --no-reuse-object if you want to force a uniform compression level on all data no matter the source. +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack. This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. + --thin:: Create a "thin" pack by omitting the common objects between a sender and a receiver in order to reduce network transfer. This diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 5f70d840a7..7d5b0735e3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; static int depth = 50; static int delta_search_threads; static int pack_to_stdout; +static int sparse; static int thin; static int num_preferred_base; static struct progress *progress_state; @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk()) die(_("revision walk setup failed")); - mark_edges_uninteresting(, show_edge, 0); + mark_edges_uninteresting(, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than "), PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, + OPT_BOOL(0, "sparse", , +N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", , N_("create thin packs")), OPT_BOOL(0, "shallow", , diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 00..81f6805bc3 --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,115 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1:f1/f1\ + topic1:f1/f1/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse sparse.pack && +
[PATCH 5/5] pack-objects: create pack.useSparse setting
From: Derrick Stolee The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee --- builtin/pack-objects.c | 4 t/t5322-pack-objects-sparse.sh | 15 +++ 2 files changed, 19 insertions(+) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 7d5b0735e3..124b1bafc4 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) use_bitmap_index_default = git_config_bool(k, v); return 0; } + if (!strcmp(k, "pack.usesparse")) { + sparse = git_config_bool(k, v); + return 0; + } if (!strcmp(k, "pack.threads")) { delta_search_threads = git_config_int(k, v); if (delta_search_threads < 0) diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 45dba6e014..8f5699bd91 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_sparse_objects.txt sparse_objects.txt ' +test_expect_success 'pack.useSparse enables algorithm' ' + git config pack.useSparse true && + git pack-objects --stdout --revs sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_sparse_objects.txt sparse_objects.txt +' + +test_expect_success 'pack.useSparse overridden' ' + git pack-objects --stdout --revs --no-sparse sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + test_done -- gitgitgadget
[PATCH 2/5] list-objects: consume sparse tree walk
From: Derrick Stolee When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. The mark_edges_unintersting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERSTING parent (and all objects that tree can reach) as UNINTERSTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly new way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. The sparse option is not used by any callers at the moment, but will be wired to 'git pack-objects' in the next change. Signed-off-by: Derrick Stolee --- bisect.c | 2 +- builtin/pack-objects.c | 2 +- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 51 ++ list-objects.h | 4 +++- 6 files changed, 54 insertions(+), 9 deletions(-) diff --git a/bisect.c b/bisect.c index 487675c672..842f8b4b8f 100644 --- a/bisect.c +++ b/bisect.c @@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs) if (prepare_revision_walk(revs)) die("revision walk setup failed"); if (revs->tree_objects) - mark_edges_uninteresting(revs, NULL); + mark_edges_uninteresting(revs, NULL, 0); } static void exit_if_skipped_commits(struct commit_list *tried, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 411aefd687..5f70d840a7 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk()) die(_("revision walk setup failed")); - mark_edges_uninteresting(, show_edge); + mark_edges_uninteresting(, show_edge, 0); if (!fn_show_object) fn_show_object = show_object; diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 2880ed37e3..9663cbfae0 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (prepare_revision_walk()) die("revision walk setup failed"); if (revs.tree_objects) - mark_edges_uninteresting(, show_edge); + mark_edges_uninteresting(, show_edge, 0); if (bisect_list) { int reaches, all; diff --git a/http-push.c b/http-push.c index cd48590912..ea52d6f9f6 100644 --- a/http-push.c +++ b/http-push.c @@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv) pushing = 0; if (prepare_revision_walk()) die("revision walk setup failed"); - mark_edges_uninteresting(, NULL); + mark_edges_uninteresting(, NULL, 0); objects_to_send = get_delta(, ref_lock); finish_all_active_slots(); diff --git a/list-objects.c b/list-objects.c index c41cc80db5..9bb93d1640 100644 --- a/list-objects.c +++ b/list-objects.c @@ -222,25 +222,68 @@ static void mark_edge_parents_uninteresting(struct commit *commit, } } -void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge) +static void add_edge_parents(struct commit *commit, +struct rev_info *revs, +show_edge_fn show_edge, +struct oidset *set) +{ + struct commit_list *parents; + + for (parents = commit->parents; parents; parents = parents->next) { + struct commit *parent = parents->item; + struct tree *tree
[PATCH 1/5] revision: add mark_tree_uninteresting_sparse
From: Derrick Stolee In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING. The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method. Signed-off-by: Derrick Stolee --- revision.c | 22 ++ revision.h | 2 ++ 2 files changed, 24 insertions(+) diff --git a/revision.c b/revision.c index 13e0519c02..3a62c7c187 100644 --- a/revision.c +++ b/revision.c @@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +void mark_trees_uninteresting_sparse(struct repository *r, +struct oidset *set) +{ + struct object_id *oid; + struct oidset_iter iter; + + oidset_iter_init(set, ); + while ((oid = oidset_iter_next())) { + struct tree *tree = lookup_tree(r, oid); + + if (tree->object.flags & UNINTERESTING) { + /* +* Remove the flag so the next call +* is not a no-op. The flag is added +* in mark_tree_unintersting(). +*/ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } + } +} + struct commit_stack { struct commit **items; size_t nr, alloc; diff --git a/revision.h b/revision.h index 7987bfcd2e..f828e91ae9 100644 --- a/revision.h +++ b/revision.h @@ -67,6 +67,7 @@ struct rev_cmdline_info { #define REVISION_WALK_NO_WALK_SORTED 1 #define REVISION_WALK_NO_WALK_UNSORTED 2 +struct oidset; struct topo_walk_info; struct rev_info { @@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs, void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); +void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set); void show_object_with_name(FILE *, struct object *, const char *); -- gitgitgadget
[PATCH 1/1] revision.c: use new topo-order logic in tests
From: Derrick Stolee The revision-walk machinery is being rewritten to use generation numbers in the commit-graph when availble. Due to some problematic commit histories, the new logic can be slower than the previous method due to how commit dates and generation numbers interact. Thus, the new logic is not used in comparison queries, such as git log --topo-order A..B The logic for these queries was implemented during the refactor, but is unreachable due to the potential performance penalty. The code came along with a larger block of code that was copied from the old code. When generation numbers are updated to v2 (corrected commit dates), then we will no longer have a performance penalty and this logic is what we will want to use. In the meantime, use the new logic when GIT_TEST_COMMIT_GRAPH is enabled. This will demonstrate that the new logic works for all comparison queries in the test suite, including these variants: git log --topo-order --ancestry-path A..B git log --topo-order A...B Signed-off-by: Derrick Stolee --- revision.c | 4 1 file changed, 4 insertions(+) diff --git a/revision.c b/revision.c index 4ef47d2fb4..d52da6e24f 100644 --- a/revision.c +++ b/revision.c @@ -27,6 +27,7 @@ #include "commit-reach.h" #include "commit-graph.h" #include "prio-queue.h" +#include "config.h" volatile show_early_output_fn_t show_early_output; @@ -3143,6 +3144,9 @@ int prepare_revision_walk(struct rev_info *revs) commit_list_sort_by_date(>commits); if (revs->no_walk) return 0; + if (revs->limited && + git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) + revs->limited = 0; if (revs->limited) { if (limit_list(revs) < 0) return -1; -- gitgitgadget
[PATCH 0/1] Use new topo-order logic with GIT_TEST_COMMIT_GRAPH
The recent Git test report for v2.20.0-rc0 shows that the logic around UNINTERESTING commits is not covered by the test suite. This is because the code is currently unreachable! See the commit message for details. An alternate approach would be to delete the code around UNINTERESTING commits, but that doesn't seem necessary. Thanks, -Stolee Derrick Stolee (1): revision.c: use new topo-order logic in tests revision.c | 4 1 file changed, 4 insertions(+) base-commit: 561b583749b7428f1790f03164d0d0e75be71d7b Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-83%2Fderrickstolee%2Ftopo-order-test-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-83/derrickstolee/topo-order-test-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/83 -- gitgitgadget
[PATCH v2 1/1] pack-objects: ignore ambiguous object warnings
From: Derrick Stolee A git push process runs several processes during its run, but one includes git send-pack which calls git pack-objects and passes the known have/wants into stdin using object ids. However, the default setting for core.warnAmbiguousRefs requires git pack-objects to check for ref names matching the ref_rev_parse_rules array in refs.c. This means that every object is triggering at least six "file exists?" queries. When there are a lot of refs, this can add up significantly! I observed a simple push spending three seconds checking these paths. The fix here is similar to 4c30d50 "rev-list: disable object/refname ambiguity check with --stdin". Save the value of the global warn_on_object_refname_ambiguity variable (which is usually set to the boolean config variable core.warnAmbiguousRefs) and change the state to false. Do this only during the get_object_list() method which reads the objects from stdin. Helped-by: Jeff King Signed-off-by: Derrick Stolee --- builtin/pack-objects.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index d1144a8f7e..f703e6df9b 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2988,6 +2988,7 @@ static void get_object_list(int ac, const char **av) struct rev_info revs; char line[1000]; int flags = 0; + int save_warning; init_revisions(, NULL); save_commit_buffer = 0; @@ -2996,6 +2997,9 @@ static void get_object_list(int ac, const char **av) /* make sure shallows are read */ is_repository_shallow(the_repository); + save_warning = warn_on_object_refname_ambiguity; + warn_on_object_refname_ambiguity = 0; + while (fgets(line, sizeof(line), stdin) != NULL) { int len = strlen(line); if (len && line[len - 1] == '\n') @@ -3022,6 +3026,8 @@ static void get_object_list(int ac, const char **av) die(_("bad revision '%s'"), line); } + warn_on_object_refname_ambiguity = save_warning; + if (use_bitmap_index && !get_object_list_from_bitmap()) return; -- gitgitgadget
[PATCH v2 0/1] send-pack: set core.warnAmbiguousRefs=false
I've been looking into the performance of git push for very large repos. Our users are reporting that 60-80% of git push time is spent during the "Enumerating objects" phase of git pack-objects. A git push process runs several processes during its run, but one includes git send-pack which calls git pack-objects and passes the known have/wants into stdin using object ids. However, the default setting for core.warnAmbiguousRefs requires git pack-objects to check for ref names matching the ref_rev_parse_rules array in refs.c. This means that every object is triggering at least six "file exists?" queries. When there are a lot of refs, this can add up significantly! My PerfView trace for a simple push measured 3 seconds spent checking these paths. The fix is to set the global warn_on_object_refname_ambiguity to 0 for the section that is performing these object reads. In addition to this patch submission, we are looking into merging it into our fork sooner [1]. [1] https://github.com/Microsoft/git/pull/67 Changes in V2: Instead of using the "-c" flag from send-pack, just set the global. I left the name of the cover letter the same to not confuse anyone viewing the message without threading. Derrick Stolee (1): pack-objects: ignore ambiguous object warnings builtin/pack-objects.c | 6 ++ 1 file changed, 6 insertions(+) base-commit: cae598d9980661a978e2df4fb338518f7bf09572 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-68%2Fderrickstolee%2Fsend-pack-config-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-68/derrickstolee/send-pack-config-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/68 Range-diff vs v1: 1: 1ef2c51550 < -: -- send-pack: set core.warnAmbiguousRefs=false -: -- > 1: 002868ee6b pack-objects: ignore ambiguous object warnings -- gitgitgadget
[PATCH 1/1] send-pack: set core.warnAmbiguousRefs=false
From: Derrick Stolee During a 'git push' command, we run 'git send-pack' inside of our transport helper. This creates a 'git pack-objects' process and passes a list of object ids. If this list is large, then the pack-objects process can spend a lot of time checking the possible refs these strings could represent. Remove this extra check by setting core.warnAmbiguousRefs to false as we call 'git pack-objects'. Signed-off-by: Derrick Stolee --- send-pack.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/send-pack.c b/send-pack.c index e920ca57df..5055150fe1 100644 --- a/send-pack.c +++ b/send-pack.c @@ -64,6 +64,8 @@ static int pack_objects(int fd, struct ref *refs, struct oid_array *extra, struc int i; int rc; + argv_array_push(, "-c"); + argv_array_push(, "core.warnAmbiguousRefs=false"); argv_array_push(, "pack-objects"); argv_array_push(, "--all-progress-implied"); argv_array_push(, "--revs"); -- gitgitgadget
[PATCH 0/1] send-pack: set core.warnAmbiguousRefs=false
I've been looking into the performance of git push for very large repos. Our users are reporting that 60-80% of git push time is spent during the "Enumerating objects" phase of git pack-objects. A git push process runs several processes during its run, but one includes git send-pack which calls git pack-objects and passes the known have/wants into stdin using object ids. However, the default setting for core.warnAmbiguousRefs requires git pack-objects to check for ref names matching the ref_rev_parse_rules array in refs.c. This means that every object is triggering at least six "file exists?" queries. When there are a lot of refs, this can add up significantly! My PerfView trace for a simple push measured 3 seconds spent checking these paths. The fix for this is simple: set core.warnAmbiguousRefs to false for this specific call of git pack-objects coming from git send-pack. We don't want to default it to false for all calls to git pack-objects, as it is valid to pass ref names instead of object ids. This helps regain these seconds during a push. In addition to this patch submission, we are looking into merging it into our fork sooner [1]. [1] https://github.com/Microsoft/git/pull/67 Derrick Stolee (1): send-pack: set core.warnAmbiguousRefs=false send-pack.c | 2 ++ 1 file changed, 2 insertions(+) base-commit: cae598d9980661a978e2df4fb338518f7bf09572 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-68%2Fderrickstolee%2Fsend-pack-config-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-68/derrickstolee/send-pack-config-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/68 -- gitgitgadget
[PATCH v2 0/3] Make add_missing_tags() linear
As reported earlier [1], the add_missing_tags() method in remote.c has quadratic performance. Some of that performance is curbed due to the generation-number cutoff in in_merge_bases_many(). However, that fix doesn't help users without a commit-graph, and it can still be painful if that cutoff is sufficiently low compared to the tags we are using for reachability testing. Add a new method in commit-reach.c called get_reachable_subset() which does a many-to-many reachability test. Starting at the 'from' commits, walk until the generation is below the smallest generation in the 'to' commits, or all 'to' commits have been discovered. This performs only one commit walk for the entire add_missing_tags() method, giving linear performance in the worst case. Tests are added in t6600-test-reach.sh to ensure get_reachable_subset() works independently of its application in add_missing_tags(). Thanks, -Stolee [1] https://public-inbox.org/git/cabpp-becpsoxudovjbdg_3w9wus102rw+e+qpmd4g3qyd-q...@mail.gmail.com/ Derrick Stolee (3): commit-reach: implement get_reachable_subset test-reach: test get_reachable_subset remote: make add_missing_tags() linear commit-reach.c| 70 +++ commit-reach.h| 13 remote.c | 34 - t/helper/test-reach.c | 34 ++--- t/t6600-test-reach.sh | 52 5 files changed, 198 insertions(+), 5 deletions(-) base-commit: c670b1f876521c9f7cd40184bf7ed05aad843433 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-60%2Fderrickstolee%2Fadd-missing-tags-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-60/derrickstolee/add-missing-tags-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/60 Range-diff vs v1: 1: 4c0c5c9143 ! 1: 9e570603bd commit-reach: implement get_reachable_subset @@ -49,7 +49,7 @@ + +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, + struct commit **to, int nr_to, -+ int reachable_flag) ++ unsigned int reachable_flag) +{ + struct commit **item; + struct commit *current; @@ -129,9 +129,12 @@ + * Return a list of commits containing the commits in the 'to' array + * that are reachable from at least one commit in the 'from' array. + * Also add the given 'flag' to each of the commits in the returned list. ++ * ++ * This method uses the PARENT1 and PARENT2 flags during its operation, ++ * so be sure these flags are not set before calling the method. + */ +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, + struct commit **to, int nr_to, -+ int reachable_flag); ++ unsigned int reachable_flag); + #endif 2: 382f4f4a5b = 2: 52e847b928 test-reach: test get_reachable_subset 3: ecbed3de5c = 3: dfaceb162f remote: make add_missing_tags() linear -- gitgitgadget
[PATCH v2 1/3] commit-reach: implement get_reachable_subset
From: Derrick Stolee The existing reachability algorithms in commit-reach.c focus on finding merge-bases or determining if all commits in a set X can reach at least one commit in a set Y. However, for two commits sets X and Y, we may also care about which commits in Y are reachable from at least one commit in X. Implement get_reachable_subset() which answers this question. Given two arrays of commits, 'from' and 'to', return a commit_list with every commit from the 'to' array that is reachable from at least one commit in the 'from' array. The algorithm is a simple walk starting at the 'from' commits, using the PARENT2 flag to indicate "this commit has already been added to the walk queue". By marking the 'to' commits with the PARENT1 flag, we can determine when we see a commit from the 'to' array. We remove the PARENT1 flag as we add that commit to the result list to avoid duplicates. The order of the resulting list is a reverse of the order that the commits are discovered in the walk. There are a couple shortcuts to avoid walking more than we need: 1. We determine the minimum generation number of commits in the 'to' array. We do not walk commits with generation number below this minimum. 2. We count how many distinct commits are in the 'to' array, and decrement this count when we discover a 'to' commit during the walk. If this number reaches zero, then we can terminate the walk. Tests will be added using the 'test-tool reach' helper in a subsequent commit. Signed-off-by: Derrick Stolee --- commit-reach.c | 70 ++ commit-reach.h | 13 ++ 2 files changed, 83 insertions(+) diff --git a/commit-reach.c b/commit-reach.c index 9f79ce0a22..8ad5352752 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to, object_array_clear(_objs); return result; } + +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, +struct commit **to, int nr_to, +unsigned int reachable_flag) +{ + struct commit **item; + struct commit *current; + struct commit_list *found_commits = NULL; + struct commit **to_last = to + nr_to; + struct commit **from_last = from + nr_from; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; + int num_to_find = 0; + + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; + + for (item = to; item < to_last; item++) { + struct commit *c = *item; + + parse_commit(c); + if (c->generation < min_generation) + min_generation = c->generation; + + if (!(c->object.flags & PARENT1)) { + c->object.flags |= PARENT1; + num_to_find++; + } + } + + for (item = from; item < from_last; item++) { + struct commit *c = *item; + if (!(c->object.flags & PARENT2)) { + c->object.flags |= PARENT2; + parse_commit(c); + + prio_queue_put(, *item); + } + } + + while (num_to_find && (current = prio_queue_get()) != NULL) { + struct commit_list *parents; + + if (current->object.flags & PARENT1) { + current->object.flags &= ~PARENT1; + current->object.flags |= reachable_flag; + commit_list_insert(current, _commits); + num_to_find--; + } + + for (parents = current->parents; parents; parents = parents->next) { + struct commit *p = parents->item; + + parse_commit(p); + + if (p->generation < min_generation) + continue; + + if (p->object.flags & PARENT2) + continue; + + p->object.flags |= PARENT2; + prio_queue_put(, p); + } + } + + clear_commit_marks_many(nr_to, to, PARENT1); + clear_commit_marks_many(nr_from, from, PARENT2); + + return found_commits; +} + diff --git a/commit-reach.h b/commit-reach.h index 7d313e2975..bb34af0269 100644 --- a/commit-reach.h +++ b/commit-reach.h @@ -74,4 +74,17 @@ int can_all_from_reach_with_flag(struct object_array *from, int can_all_from_reach(struct commit_list *from, struct commit_list *to, int commit_date_cutoff); + +/* + * Return a list of commits containing the commits in the 'to' array + * that are reachable from at least one commit in the 'from' array. + * Also add the given 'flag' to each of the commits in the returned list. + * + * This method uses the PARENT1 and PARENT2
[PATCH v2 3/3] remote: make add_missing_tags() linear
From: Derrick Stolee The add_missing_tags() method currently has quadratic behavior. This is due to a linear number (based on number of tags T) of calls to in_merge_bases_many, which has linear performance (based on number of commits C in the repository). Replace this O(T * C) algorithm with an O(T + C) algorithm by using get_reachable_subset(). We ignore the return list and focus instead on the reachable_flag assigned to the commits we care about, because we need to interact with the tag ref and not just the commit object. Signed-off-by: Derrick Stolee --- remote.c | 34 +- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/remote.c b/remote.c index 81f4f01b00..b850f2feb3 100644 --- a/remote.c +++ b/remote.c @@ -1205,9 +1205,36 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds * sent to the other side. */ if (sent_tips.nr) { + const int reachable_flag = 1; + struct commit_list *found_commits; + struct commit **src_commits; + int nr_src_commits = 0, alloc_src_commits = 16; + ALLOC_ARRAY(src_commits, alloc_src_commits); + for_each_string_list_item(item, _tag) { struct ref *ref = item->util; + struct commit *commit; + + if (is_null_oid(>new_oid)) + continue; + commit = lookup_commit_reference_gently(the_repository, + >new_oid, + 1); + if (!commit) + /* not pushing a commit, which is not an error */ + continue; + + ALLOC_GROW(src_commits, nr_src_commits + 1, alloc_src_commits); + src_commits[nr_src_commits++] = commit; + } + + found_commits = get_reachable_subset(sent_tips.tip, sent_tips.nr, +src_commits, nr_src_commits, +reachable_flag); + + for_each_string_list_item(item, _tag) { struct ref *dst_ref; + struct ref *ref = item->util; struct commit *commit; if (is_null_oid(>new_oid)) @@ -1223,7 +1250,7 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds * Is this tag, which they do not have, reachable from * any of the commits we are sending? */ - if (!in_merge_bases_many(commit, sent_tips.nr, sent_tips.tip)) + if (!(commit->object.flags & reachable_flag)) continue; /* Add it in */ @@ -1231,7 +1258,12 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds oidcpy(_ref->new_oid, >new_oid); dst_ref->peer_ref = copy_ref(ref); } + + clear_commit_marks_many(nr_src_commits, src_commits, reachable_flag); + free(src_commits); + free_commit_list(found_commits); } + string_list_clear(_tag, 0); free(sent_tips.tip); } -- gitgitgadget
[PATCH v2 2/3] test-reach: test get_reachable_subset
From: Derrick Stolee The get_reachable_subset() method returns the list of commits in the 'to' array that are reachable from at least one commit in the 'from' array. Add tests that check this method works in a few cases: 1. All commits in the 'to' list are reachable. This exercises the early-termination condition. 2. Some commits in the 'to' list are reachable. This exercises the loop-termination condition. 3. No commits in the 'to' list are reachable. This exercises the NULL return condition. Signed-off-by: Derrick Stolee --- t/helper/test-reach.c | 34 t/t6600-test-reach.sh | 52 +++ 2 files changed, 82 insertions(+), 4 deletions(-) diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c index 08d2ea68e8..a0272178b7 100644 --- a/t/helper/test-reach.c +++ b/t/helper/test-reach.c @@ -32,8 +32,8 @@ int cmd__reach(int ac, const char **av) struct commit *A, *B; struct commit_list *X, *Y; struct object_array X_obj = OBJECT_ARRAY_INIT; - struct commit **X_array; - int X_nr, X_alloc; + struct commit **X_array, **Y_array; + int X_nr, X_alloc, Y_nr, Y_alloc; struct strbuf buf = STRBUF_INIT; struct repository *r = the_repository; @@ -44,9 +44,10 @@ int cmd__reach(int ac, const char **av) A = B = NULL; X = Y = NULL; - X_nr = 0; - X_alloc = 16; + X_nr = Y_nr = 0; + X_alloc = Y_alloc = 16; ALLOC_ARRAY(X_array, X_alloc); + ALLOC_ARRAY(Y_array, Y_alloc); while (strbuf_getline(, stdin) != EOF) { struct object_id oid; @@ -92,6 +93,8 @@ int cmd__reach(int ac, const char **av) case 'Y': commit_list_insert(c, ); + ALLOC_GROW(Y_array, Y_nr + 1, Y_alloc); + Y_array[Y_nr++] = c; break; default: @@ -136,6 +139,29 @@ int cmd__reach(int ac, const char **av) filter.with_commit_tag_algo = 0; printf("%s(_,A,X,_):%d\n", av[1], commit_contains(, A, X, )); + } else if (!strcmp(av[1], "get_reachable_subset")) { + const int reachable_flag = 1; + int i, count = 0; + struct commit_list *current; + struct commit_list *list = get_reachable_subset(X_array, X_nr, + Y_array, Y_nr, + reachable_flag); + printf("get_reachable_subset(X,Y)\n"); + for (current = list; current; current = current->next) { + if (!(list->item->object.flags & reachable_flag)) + die(_("commit %s is not marked reachable"), + oid_to_hex(>item->object.oid)); + count++; + } + for (i = 0; i < Y_nr; i++) { + if (Y_array[i]->object.flags & reachable_flag) + count--; + } + + if (count < 0) + die(_("too many commits marked reachable")); + + print_sorted_commit_ids(list); } exit(0); diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index ae94b27f70..a0c64e617a 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -265,4 +265,56 @@ test_expect_success 'commit_contains:miss' ' test_three_modes commit_contains --tag ' +test_expect_success 'get_reachable_subset:all' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 + X:commit-6-6 + X:commit-1-7 + Y:commit-3-3 + Y:commit-1-7 + Y:commit-5-6 + EOF + ( + echo "get_reachable_subset(X,Y)" && + git rev-parse commit-3-3 \ + commit-1-7 \ + commit-5-6 | sort + ) >expect && + test_three_modes get_reachable_subset +' + +test_expect_success 'get_reachable_subset:some' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 + X:commit-1-7 + Y:commit-3-3 + Y:commit-1-7 + Y:commit-5-6 + EOF + ( + echo "get_reachable_subset(X,Y)" && + git rev-parse commit-3-3 \ + commit-1-7 | sort + ) >expect && + test_three_modes get_reachable_subset +' + +test_expect_success 'get_reachable_subset:none' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 + X:commit-1-7 + Y:commit-9-3 + Y:commit-7-6 + Y:commit-2-8 + EOF + echo "get_reachable_subset(X,Y)" >expect && + test_three_modes get_reachable_subset +' +
[PATCH 2/3] test-reach: test get_reachable_subset
From: Derrick Stolee The get_reachable_subset() method returns the list of commits in the 'to' array that are reachable from at least one commit in the 'from' array. Add tests that check this method works in a few cases: 1. All commits in the 'to' list are reachable. This exercises the early-termination condition. 2. Some commits in the 'to' list are reachable. This exercises the loop-termination condition. 3. No commits in the 'to' list are reachable. This exercises the NULL return condition. Signed-off-by: Derrick Stolee --- t/helper/test-reach.c | 34 t/t6600-test-reach.sh | 52 +++ 2 files changed, 82 insertions(+), 4 deletions(-) diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c index 08d2ea68e..a0272178b 100644 --- a/t/helper/test-reach.c +++ b/t/helper/test-reach.c @@ -32,8 +32,8 @@ int cmd__reach(int ac, const char **av) struct commit *A, *B; struct commit_list *X, *Y; struct object_array X_obj = OBJECT_ARRAY_INIT; - struct commit **X_array; - int X_nr, X_alloc; + struct commit **X_array, **Y_array; + int X_nr, X_alloc, Y_nr, Y_alloc; struct strbuf buf = STRBUF_INIT; struct repository *r = the_repository; @@ -44,9 +44,10 @@ int cmd__reach(int ac, const char **av) A = B = NULL; X = Y = NULL; - X_nr = 0; - X_alloc = 16; + X_nr = Y_nr = 0; + X_alloc = Y_alloc = 16; ALLOC_ARRAY(X_array, X_alloc); + ALLOC_ARRAY(Y_array, Y_alloc); while (strbuf_getline(, stdin) != EOF) { struct object_id oid; @@ -92,6 +93,8 @@ int cmd__reach(int ac, const char **av) case 'Y': commit_list_insert(c, ); + ALLOC_GROW(Y_array, Y_nr + 1, Y_alloc); + Y_array[Y_nr++] = c; break; default: @@ -136,6 +139,29 @@ int cmd__reach(int ac, const char **av) filter.with_commit_tag_algo = 0; printf("%s(_,A,X,_):%d\n", av[1], commit_contains(, A, X, )); + } else if (!strcmp(av[1], "get_reachable_subset")) { + const int reachable_flag = 1; + int i, count = 0; + struct commit_list *current; + struct commit_list *list = get_reachable_subset(X_array, X_nr, + Y_array, Y_nr, + reachable_flag); + printf("get_reachable_subset(X,Y)\n"); + for (current = list; current; current = current->next) { + if (!(list->item->object.flags & reachable_flag)) + die(_("commit %s is not marked reachable"), + oid_to_hex(>item->object.oid)); + count++; + } + for (i = 0; i < Y_nr; i++) { + if (Y_array[i]->object.flags & reachable_flag) + count--; + } + + if (count < 0) + die(_("too many commits marked reachable")); + + print_sorted_commit_ids(list); } exit(0); diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index ae94b27f7..a0c64e617 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -265,4 +265,56 @@ test_expect_success 'commit_contains:miss' ' test_three_modes commit_contains --tag ' +test_expect_success 'get_reachable_subset:all' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 + X:commit-6-6 + X:commit-1-7 + Y:commit-3-3 + Y:commit-1-7 + Y:commit-5-6 + EOF + ( + echo "get_reachable_subset(X,Y)" && + git rev-parse commit-3-3 \ + commit-1-7 \ + commit-5-6 | sort + ) >expect && + test_three_modes get_reachable_subset +' + +test_expect_success 'get_reachable_subset:some' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 + X:commit-1-7 + Y:commit-3-3 + Y:commit-1-7 + Y:commit-5-6 + EOF + ( + echo "get_reachable_subset(X,Y)" && + git rev-parse commit-3-3 \ + commit-1-7 | sort + ) >expect && + test_three_modes get_reachable_subset +' + +test_expect_success 'get_reachable_subset:none' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 + X:commit-1-7 + Y:commit-9-3 + Y:commit-7-6 + Y:commit-2-8 + EOF + echo "get_reachable_subset(X,Y)" >expect && + test_three_modes get_reachable_subset +' +
[PATCH 3/3] remote: make add_missing_tags() linear
From: Derrick Stolee The add_missing_tags() method currently has quadratic behavior. This is due to a linear number (based on number of tags T) of calls to in_merge_bases_many, which has linear performance (based on number of commits C in the repository). Replace this O(T * C) algorithm with an O(T + C) algorithm by using get_reachable_subset(). We ignore the return list and focus instead on the reachable_flag assigned to the commits we care about, because we need to interact with the tag ref and not just the commit object. Signed-off-by: Derrick Stolee --- remote.c | 34 +- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/remote.c b/remote.c index 81f4f01b0..b850f2feb 100644 --- a/remote.c +++ b/remote.c @@ -1205,9 +1205,36 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds * sent to the other side. */ if (sent_tips.nr) { + const int reachable_flag = 1; + struct commit_list *found_commits; + struct commit **src_commits; + int nr_src_commits = 0, alloc_src_commits = 16; + ALLOC_ARRAY(src_commits, alloc_src_commits); + for_each_string_list_item(item, _tag) { struct ref *ref = item->util; + struct commit *commit; + + if (is_null_oid(>new_oid)) + continue; + commit = lookup_commit_reference_gently(the_repository, + >new_oid, + 1); + if (!commit) + /* not pushing a commit, which is not an error */ + continue; + + ALLOC_GROW(src_commits, nr_src_commits + 1, alloc_src_commits); + src_commits[nr_src_commits++] = commit; + } + + found_commits = get_reachable_subset(sent_tips.tip, sent_tips.nr, +src_commits, nr_src_commits, +reachable_flag); + + for_each_string_list_item(item, _tag) { struct ref *dst_ref; + struct ref *ref = item->util; struct commit *commit; if (is_null_oid(>new_oid)) @@ -1223,7 +1250,7 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds * Is this tag, which they do not have, reachable from * any of the commits we are sending? */ - if (!in_merge_bases_many(commit, sent_tips.nr, sent_tips.tip)) + if (!(commit->object.flags & reachable_flag)) continue; /* Add it in */ @@ -1231,7 +1258,12 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds oidcpy(_ref->new_oid, >new_oid); dst_ref->peer_ref = copy_ref(ref); } + + clear_commit_marks_many(nr_src_commits, src_commits, reachable_flag); + free(src_commits); + free_commit_list(found_commits); } + string_list_clear(_tag, 0); free(sent_tips.tip); } -- gitgitgadget
[PATCH 0/3] Make add_missing_tags() linear
As reported earlier [1], the add_missing_tags() method in remote.c has quadratic performance. Some of that performance is curbed due to the generation-number cutoff in in_merge_bases_many(). However, that fix doesn't help users without a commit-graph, and it can still be painful if that cutoff is sufficiently low compared to the tags we are using for reachability testing. Add a new method in commit-reach.c called get_reachable_subset() which does a many-to-many reachability test. Starting at the 'from' commits, walk until the generation is below the smallest generation in the 'to' commits, or all 'to' commits have been discovered. This performs only one commit walk for the entire add_missing_tags() method, giving linear performance in the worst case. Tests are added in t6600-test-reach.sh to ensure get_reachable_subset() works independently of its application in add_missing_tags(). Thanks, -Stolee [1] https://public-inbox.org/git/cabpp-becpsoxudovjbdg_3w9wus102rw+e+qpmd4g3qyd-q...@mail.gmail.com/ Derrick Stolee (3): commit-reach: implement get_reachable_subset test-reach: test get_reachable_subset remote: make add_missing_tags() linear commit-reach.c| 70 +++ commit-reach.h| 10 +++ remote.c | 34 - t/helper/test-reach.c | 34 ++--- t/t6600-test-reach.sh | 52 5 files changed, 195 insertions(+), 5 deletions(-) base-commit: c670b1f876521c9f7cd40184bf7ed05aad843433 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-60%2Fderrickstolee%2Fadd-missing-tags-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-60/derrickstolee/add-missing-tags-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/60 -- gitgitgadget
[PATCH 1/3] commit-reach: implement get_reachable_subset
From: Derrick Stolee The existing reachability algorithms in commit-reach.c focus on finding merge-bases or determining if all commits in a set X can reach at least one commit in a set Y. However, for two commits sets X and Y, we may also care about which commits in Y are reachable from at least one commit in X. Implement get_reachable_subset() which answers this question. Given two arrays of commits, 'from' and 'to', return a commit_list with every commit from the 'to' array that is reachable from at least one commit in the 'from' array. The algorithm is a simple walk starting at the 'from' commits, using the PARENT2 flag to indicate "this commit has already been added to the walk queue". By marking the 'to' commits with the PARENT1 flag, we can determine when we see a commit from the 'to' array. We remove the PARENT1 flag as we add that commit to the result list to avoid duplicates. The order of the resulting list is a reverse of the order that the commits are discovered in the walk. There are a couple shortcuts to avoid walking more than we need: 1. We determine the minimum generation number of commits in the 'to' array. We do not walk commits with generation number below this minimum. 2. We count how many distinct commits are in the 'to' array, and decrement this count when we discover a 'to' commit during the walk. If this number reaches zero, then we can terminate the walk. Tests will be added using the 'test-tool reach' helper in a subsequent commit. Signed-off-by: Derrick Stolee --- commit-reach.c | 70 ++ commit-reach.h | 10 2 files changed, 80 insertions(+) diff --git a/commit-reach.c b/commit-reach.c index 9f79ce0a2..a98532ecc 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to, object_array_clear(_objs); return result; } + +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, +struct commit **to, int nr_to, +int reachable_flag) +{ + struct commit **item; + struct commit *current; + struct commit_list *found_commits = NULL; + struct commit **to_last = to + nr_to; + struct commit **from_last = from + nr_from; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; + int num_to_find = 0; + + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; + + for (item = to; item < to_last; item++) { + struct commit *c = *item; + + parse_commit(c); + if (c->generation < min_generation) + min_generation = c->generation; + + if (!(c->object.flags & PARENT1)) { + c->object.flags |= PARENT1; + num_to_find++; + } + } + + for (item = from; item < from_last; item++) { + struct commit *c = *item; + if (!(c->object.flags & PARENT2)) { + c->object.flags |= PARENT2; + parse_commit(c); + + prio_queue_put(, *item); + } + } + + while (num_to_find && (current = prio_queue_get()) != NULL) { + struct commit_list *parents; + + if (current->object.flags & PARENT1) { + current->object.flags &= ~PARENT1; + current->object.flags |= reachable_flag; + commit_list_insert(current, _commits); + num_to_find--; + } + + for (parents = current->parents; parents; parents = parents->next) { + struct commit *p = parents->item; + + parse_commit(p); + + if (p->generation < min_generation) + continue; + + if (p->object.flags & PARENT2) + continue; + + p->object.flags |= PARENT2; + prio_queue_put(, p); + } + } + + clear_commit_marks_many(nr_to, to, PARENT1); + clear_commit_marks_many(nr_from, from, PARENT2); + + return found_commits; +} + diff --git a/commit-reach.h b/commit-reach.h index 7d313e297..43bd50a70 100644 --- a/commit-reach.h +++ b/commit-reach.h @@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from, int can_all_from_reach(struct commit_list *from, struct commit_list *to, int commit_date_cutoff); + +/* + * Return a list of commits containing the commits in the 'to' array + * that are reachable from at least one commit in the 'from' array. + * Also add the given 'flag' to each of the commits in the returned list. + */ +struct commit_list *get_reachable_subset(struct commit
[PATCH 1/1] commit-reach: fix first-parent heuristic
From: Derrick Stolee The algorithm in can_all_from_reach_with_flags() performs a depth- first-search, terminated by generation number, intending to use a hueristic that "important" commits are found in the first-parent history. This heuristic is valuable in scenarios like fetch negotiation. However, there is a problem! After the search finds a target commit, it should pop all commits off the stack and mark them as "can reach". This logic is incorrect, so the algorithm instead walks all reachable commits above the generation-number cutoff. The existing algorithm is still an improvement over the previous algorithm, as the worst-case complexity went from quadratic to linear. The performance measurement at the time was good, but not dramatic. By fixing this heuristic, we reduce the number of walked commits. We can also re-run the performance tests from commit 4fbcca4e "commit-reach: make can_all_from_reach... linear". Performance was measured on the Linux repository using 'test-tool reach can_all_from_reach'. The input included rows seeded by tag values. The "small" case included X-rows as v4.[0-9]* and Y-rows as v3.[0-9]*. This mimics a (very large) fetch that says "I have all major v3 releases and want all major v4 releases." The "large" case included X-rows as "v4.*" and Y-rows as "v3.*". This adds all release-candidate tags to the set, which does not greatly increase the number of objects that are considered, but does increase the number of 'from' commits, demonstrating the quadratic nature of the previous code. Small Case: 4fbcca4e~1: 0.85 s 4fbcca4e: 0.26 s (num_walked: 1,011,035) HEAD: 0.14 s (num_walked: 8,601) Large Case: 4fbcca4e~1: 24.0 s 4fbcca4e: 0.12 s (num_walked: 503,925) HEAD: 0.06 s (num_walked: 217,243) Signed-off-by: Derrick Stolee --- commit-reach.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/commit-reach.c b/commit-reach.c index 9f79ce0a22..79419be8af 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -593,8 +593,10 @@ int can_all_from_reach_with_flag(struct object_array *from, while (stack) { struct commit_list *parent; - if (stack->item->object.flags & with_flag) { + if (stack->item->object.flags & (with_flag | RESULT)) { pop_commit(); + if (stack) + stack->item->object.flags |= RESULT; continue; } -- gitgitgadget
[PATCH 0/1] commit-reach: fix first-parent heuristic
I originally reported this fix [1] after playing around with the trace2 series for measuring performance. Since trace2 isn't merging quickly, I pulled the performance fix patch out and am sending it on its own. The only difference here is that we don't have the tracing to verify the performance fix in the test script. See the patch message for details about the fix. Thanks, -Stolee [1] https://public-inbox.org/git/20180906151309.66712-7-dsto...@microsoft.com/ [RFC PATCH 6/6] commit-reach: fix first-parent heuristic Derrick Stolee (1): commit-reach: fix first-parent heuristic commit-reach.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) base-commit: a4b8ab5363a32f283a61ef3a962853556d136c0e Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-51%2Fderrickstolee%2Ffirst-parent-heuristic-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-51/derrickstolee/first-parent-heuristic-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/51 -- gitgitgadget
[PATCH 3/3] commit-graph: Use commit-graph by default
From: Derrick Stolee The config setting "core.commitGraph" enables using the commit-graph file to accelerate commit walks through parsing speed and generation numbers. The setting "gc.writeCommitGraph" enables writing the commit-graph file on every non-trivial 'git gc' operation. Together, they help users automatically improve their performance. By setting these config variables to true by default, we make the commit-graph feature an "opt-out" feature instead of "opt-in". Signed-off-by: Derrick Stolee --- Documentation/config.txt | 4 ++-- builtin/gc.c | 2 +- commit-graph.c | 6 +++--- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index f6f4c21a54..dc5ee7c145 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -923,7 +923,7 @@ the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1]. core.commitGraph:: If true, then git will read the commit-graph file (if it exists) - to parse the graph structure of commits. Defaults to false. See + to parse the graph structure of commits. Defaults to true. See linkgit:git-commit-graph[1] for more information. core.useReplaceRefs:: @@ -1639,7 +1639,7 @@ gc.writeCommitGraph:: If true, then gc will rewrite the commit-graph file when linkgit:git-gc[1] is run. When using linkgit:git-gc[1] '--auto' the commit-graph will be updated if housekeeping is - required. Default is false. See linkgit:git-commit-graph[1] + required. Default is true. See linkgit:git-commit-graph[1] for details. gc.logExpiry:: diff --git a/builtin/gc.c b/builtin/gc.c index 871a56f1c5..77e7413f94 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -41,7 +41,7 @@ static int aggressive_depth = 50; static int aggressive_window = 250; static int gc_auto_threshold = 6700; static int gc_auto_pack_limit = 50; -static int gc_write_commit_graph; +static int gc_write_commit_graph = 1; static int detach_auto = 1; static timestamp_t gc_log_expire_time; static const char *gc_log_expire = "1.day.ago"; diff --git a/commit-graph.c b/commit-graph.c index a686758603..a459272466 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -232,15 +232,15 @@ static int prepare_commit_graph(struct repository *r) { struct alternate_object_database *alt; char *obj_dir; - int config_value; + int config_value = 1; if (r->objects->commit_graph_attempted) return !!r->objects->commit_graph; r->objects->commit_graph_attempted = 1; + repo_config_get_bool(r, "core.commitgraph", _value); if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) && - (repo_config_get_bool(r, "core.commitgraph", _value) || - !config_value)) + !config_value) /* * This repository is not configured to use commit graphs, so * do not load one. (But report commit_graph_attempted anyway -- gitgitgadget
[PATCH 1/3] t6501: use --quiet when testing gc stderr
From: Derrick Stolee The test script t6501-freshen-objects.sh has some tests that care if 'git gc' has any output to stderr. This is intended to say that no warnings occurred related to broken links. However, when we have operations that output progress (like writing the commit-graph) this causes the test to fail. Use 'git gc --quiet' to avoid these progress indicators from causing a test failure. Signed-off-by: Derrick Stolee --- t/t6501-freshen-objects.sh | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/t/t6501-freshen-objects.sh b/t/t6501-freshen-objects.sh index 033871ee5f..0973130f06 100755 --- a/t/t6501-freshen-objects.sh +++ b/t/t6501-freshen-objects.sh @@ -137,7 +137,7 @@ test_expect_success 'do not complain about existing broken links (commit)' ' some message EOF commit=$(git hash-object -t commit -w broken-commit) && - git gc 2>stderr && + git gc --quiet 2>stderr && verbose git cat-file -e $commit && test_must_be_empty stderr ' @@ -147,7 +147,7 @@ test_expect_success 'do not complain about existing broken links (tree)' ' 100644 blob 0003foo EOF tree=$(git mktree --missing stderr && + git gc --quiet 2>stderr && git cat-file -e $tree && test_must_be_empty stderr ' @@ -162,7 +162,7 @@ test_expect_success 'do not complain about existing broken links (tag)' ' this is a broken tag EOF tag=$(git hash-object -t tag -w broken-tag) && - git gc 2>stderr && + git gc --quiet 2>stderr && git cat-file -e $tag && test_must_be_empty stderr ' -- gitgitgadget
[PATCH 0/3] Use commit-graph by default
The commit-graph feature is starting to stabilize. Based on what is in master right now, we have: Git 2.18: * Ability to write commit-graph (requires user interaction). * Commit parsing is faster when commit-graph exists. * Must have core.commitGraph true to use. Git 2.19: * Ability to write commit-graph on GC with gc.writeCommitGraph. * Generation numbers written in commit-graph * A few reachability algorithms make use of generation numbers. (queued for) master: * The test suite passes with GIT_TEST_COMMIT_GRAPH=1 * 'git commit-graph write' has progress indicators. * The commit-graph is automatically disabled when grafts or replace-objects exist. There are some other things coming that are in review (like 'git log --graph' speedups), but it is probably time to consider enabling the commit-graph by default. This series does that. For timing, I'm happy to leave this queued for a merge after the Git 2.20 release. There are enough things in master to justify not enabling this by default until that goes out and more people use it. PATCH 3/3 is rather simple, and is the obvious thing to do to achieve enabling these config values by default. PATCH 1/3 is a required change to make the test suite work with this change. This change isn't needed with GIT_TEST_COMMIT_GRAPH=1 because the commit-graph is up-to-date for these 'git gc' calls, so no progress is output. PATCH 2/3 is also a necessary evil, since we already had to disable GIT_TEST_COMMIT_GRAPH for some tests, we now also need to turn off core.commitGraph. Thanks, -Stolee Derrick Stolee (3): t6501: use --quiet when testing gc stderr t: explicitly turn off core.commitGraph as needed commit-graph: Use commit-graph by default Documentation/config.txt| 4 ++-- builtin/gc.c| 2 +- commit-graph.c | 6 +++--- t/t0410-partial-clone.sh| 3 ++- t/t5307-pack-missing-commit.sh | 3 ++- t/t6011-rev-list-with-bad-commit.sh | 3 ++- t/t6024-recursive-merge.sh | 3 ++- t/t6501-freshen-objects.sh | 6 +++--- 8 files changed, 17 insertions(+), 13 deletions(-) base-commit: a4b8ab5363a32f283a61ef3a962853556d136c0e Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-50%2Fderrickstolee%2Fcommit-graph-default-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-50/derrickstolee/commit-graph-default-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/50 -- gitgitgadget
[PATCH 2/3] t: explicitly turn off core.commitGraph as needed
From: Derrick Stolee There are a few tests that already require GIT_TEST_COMMIT_GRAPH=0 as they rely on an interaction with the commits in the object store that is circumvented by parsing commit information from the commit-graph instead. Before enabling core.commitGraph as true by default, explicitly turn the setting off for these tests. Signed-off-by: Derrick Stolee --- t/t0410-partial-clone.sh| 3 ++- t/t5307-pack-missing-commit.sh | 3 ++- t/t6011-rev-list-with-bad-commit.sh | 3 ++- t/t6024-recursive-merge.sh | 3 ++- 4 files changed, 8 insertions(+), 4 deletions(-) diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh index cfd0655ea1..f5277fafb1 100755 --- a/t/t0410-partial-clone.sh +++ b/t/t0410-partial-clone.sh @@ -193,7 +193,8 @@ test_expect_success 'rev-list stops traversal at missing and promised commit' ' git -C repo config core.repositoryformatversion 1 && git -C repo config extensions.partialclone "arbitrary string" && - GIT_TEST_COMMIT_GRAPH=0 git -C repo rev-list --exclude-promisor-objects --objects bar >out && + GIT_TEST_COMMIT_GRAPH=0 git -c core.commitGraph=false \ + -C repo rev-list --exclude-promisor-objects --objects bar >out && grep $(git -C repo rev-parse bar) out && ! grep $FOO out ' diff --git a/t/t5307-pack-missing-commit.sh b/t/t5307-pack-missing-commit.sh index dacb440b27..dc4c19d0aa 100755 --- a/t/t5307-pack-missing-commit.sh +++ b/t/t5307-pack-missing-commit.sh @@ -16,7 +16,8 @@ test_expect_success setup ' obj=$(git rev-parse --verify tag3) && fanout=$(expr "$obj" : "\(..\)") && remainder=$(expr "$obj" : "..\(.*\)") && - rm -f ".git/objects/$fanout/$remainder" + rm -f ".git/objects/$fanout/$remainder" && + git config core.commitGraph false ' test_expect_success 'check corruption' ' diff --git a/t/t6011-rev-list-with-bad-commit.sh b/t/t6011-rev-list-with-bad-commit.sh index 545b461e51..da6949743d 100755 --- a/t/t6011-rev-list-with-bad-commit.sh +++ b/t/t6011-rev-list-with-bad-commit.sh @@ -42,7 +42,8 @@ test_expect_success 'corrupt second commit object' \ ' test_expect_success 'rev-list should fail' ' - test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git rev-list --all > /dev/null + test_must_fail env GIT_TEST_COMMIT_GRAPH=0 \ + git -c core.commitGraph=false rev-list --all > /dev/null ' test_expect_success 'git repack _MUST_ fail' \ diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh index 27c7de90ce..fccdf96f13 100755 --- a/t/t6024-recursive-merge.sh +++ b/t/t6024-recursive-merge.sh @@ -61,7 +61,8 @@ GIT_AUTHOR_DATE="2006-12-12 23:00:08" git commit -m F ' test_expect_success 'combined merge conflicts' ' - test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G + test_must_fail env GIT_TEST_COMMIT_GRAPH=0 \ + git -c core.commitGraph=false merge -m final G ' cat > expect << EOF -- gitgitgadget
[PATCH 0/1] Run GIT_TEST_COMMIT_GRAPH and GIT_TEST_MULTI_PACK_INDEX during CI
Our CI scripts include a step to run the test suite with certain optional variables enabled. Now that all branches should build and have tests succeed with GIT_TEST_COMMIT_GRAPH and GIT_TEST_MULTI_PACK_INDEX enabled, add those variables to that stage. Note: the GIT_TEST_MULTI_PACK_INDEX variable has not merged all the way down, so will be ignored if this series is merged faster than that one. However, it is safe to make these changes orthogonally as all (known) test breaks with GIT_TEST_MULTI_PACK_INDEX=1 are fixed in the topic that introduces the variable. I also created a build definition on Azure Pipelines that runs the test suite with different subsets of the test variables, split by the following types: 1) Index options 2) Commit-graph 3) Multi-pack-index These builds are found at [1]. There are benefits to testing the variables all together but also separately. I didn't want to create new stages in the CI scripts to avoid consuming extra resources. This series is based on js/vsts-ci to avoid conflicts with that series, but is not necessarily a hard dependence. Thanks, -Stolee [1] https://git.visualstudio.com/git/_build?definitionId=4Build definition that tests Git with different arrangements of GIT_TEST_* variables. Derrick Stolee (1): ci: add optional test variables ci/run-build-and-tests.sh | 2 ++ 1 file changed, 2 insertions(+) base-commit: d82963f34cf6921ed29d1fc2d96b16acf9005159 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-49%2Fderrickstolee%2Fci-vars-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-49/derrickstolee/ci-vars-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/49 -- gitgitgadget
[PATCH 1/1] ci: add optional test variables
From: Derrick Stolee The commit-graph and multi-pack-index features introduce optional data structures that are not required for normal Git operations. It is important to run the normal test suite without them enabled, but it is helpful to also run the test suite using them. Our continuous integration scripts include a second test stage that runs with optional GIT_TEST_* variables enabled. Add the following two variables to that stage: GIT_TEST_COMMIT_GRAPH GIT_TEST_MULTI_PACK_INDEX This will slow down the operation, as we build a commit-graph file after every 'git commit' operation and build a multi-pack-index during every 'git repack' operation. However, it is important that future changes are compatible with these features. Signed-off-by: Derrick Stolee --- ci/run-build-and-tests.sh | 2 ++ 1 file changed, 2 insertions(+) diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh index e28ac2fb9a..db342bb6a8 100755 --- a/ci/run-build-and-tests.sh +++ b/ci/run-build-and-tests.sh @@ -15,6 +15,8 @@ then export GIT_TEST_FULL_IN_PACK_ARRAY=true export GIT_TEST_OE_SIZE=10 export GIT_TEST_OE_DELTA_SIZE=5 + export GIT_TEST_COMMIT_GRAPH=1 + export GIT_TEST_MULTI_PACK_INDEX=1 make --quiet test fi -- gitgitgadget
[PATCH v4 4/7] revision.c: begin refactoring --topo-order logic
From: Derrick Stolee When running 'git rev-list --topo-order' and its kin, the topo_order setting in struct rev_info implies the limited setting. This means that the following things happen during prepare_revision_walk(): * revs->limited implies we run limit_list() to walk the entire reachable set. There are some short-cuts here, such as if we perform a range query like 'git rev-list COMPARE..HEAD' and we can stop limit_list() when all queued commits are uninteresting. * revs->topo_order implies we run sort_in_topological_order(). See the implementation of that method in commit.c. It implies that the full set of commits to order is in the given commit_list. These two methods imply that a 'git rev-list --topo-order HEAD' command must walk the entire reachable set of commits _twice_ before returning a single result. If we have a commit-graph file with generation numbers computed, then there is a better way. This patch introduces some necessary logic redirection when we are in this situation. In v2.18.0, the commit-graph file contains zero-valued bytes in the positions where the generation number is stored in v2.19.0 and later. Thus, we use generation_numbers_enabled() to check if the commit-graph is available and has non-zero generation numbers. When setting revs->limited only because revs->topo_order is true, only do so if generation numbers are not available. There is no reason to use the new logic as it will behave similarly when all generation numbers are INFINITY or ZERO. In prepare_revision_walk(), if we have revs->topo_order but not revs->limited, then we trigger the new logic. It breaks the logic into three pieces, to fit with the existing framework: 1. init_topo_walk() fills a new struct topo_walk_info in the rev_info struct. We use the presence of this struct as a signal to use the new methods during our walk. In this patch, this method simply calls limit_list() and sort_in_topological_order(). In the future, this method will set up a new data structure to perform that logic in-line. 2. next_topo_commit() provides get_revision_1() with the next topo- ordered commit in the list. Currently, this simply pops the commit from revs->commits. 3. expand_topo_walk() provides get_revision_1() with a way to signal walking beyond the latest commit. Currently, this calls add_parents_to_list() exactly like the old logic. While this commit presents method redirection for performing the exact same logic as before, it allows the next commit to focus only on the new logic. Signed-off-by: Derrick Stolee --- revision.c | 42 ++ revision.h | 4 2 files changed, 42 insertions(+), 4 deletions(-) diff --git a/revision.c b/revision.c index e18bd530e4..2dcde8a8ac 100644 --- a/revision.c +++ b/revision.c @@ -25,6 +25,7 @@ #include "worktree.h" #include "argv-array.h" #include "commit-reach.h" +#include "commit-graph.h" volatile show_early_output_fn_t show_early_output; @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s if (revs->diffopt.objfind) revs->simplify_history = 0; - if (revs->topo_order) + if (revs->topo_order && !generation_numbers_enabled(the_repository)) revs->limited = 1; if (revs->prune_data.nr) { @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid, return 0; } +struct topo_walk_info {}; + +static void init_topo_walk(struct rev_info *revs) +{ + struct topo_walk_info *info; + revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info)); + info = revs->topo_walk_info; + memset(info, 0, sizeof(struct topo_walk_info)); + + limit_list(revs); + sort_in_topological_order(>commits, revs->sort_order); +} + +static struct commit *next_topo_commit(struct rev_info *revs) +{ + return pop_commit(>commits); +} + +static void expand_topo_walk(struct rev_info *revs, struct commit *commit) +{ + if (add_parents_to_list(revs, commit, >commits, NULL) < 0) { + if (!revs->ignore_missing_links) + die("Failed to traverse parents of commit %s", + oid_to_hex(>object.oid)); + } +} + int prepare_revision_walk(struct rev_info *revs) { int i; @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs) commit_list_sort_by_date(>commits); if (revs->no_walk) return 0; - if (revs->limited) + if (revs->limited) { if (limit_list(revs) < 0) return -1; - if (revs->topo_order) - sort_in_topological_order(>commits, revs->sort_order); + if (revs->topo_order) + sort_in_topological_order(>commits, revs->sort_order); + } else if (revs->topo_order) + init_topo_walk(revs); if
[PATCH v4 6/7] revision.c: generation-based topo-order algorithm
From: Derrick Stolee The current --topo-order algorithm requires walking all reachable commits up front, topo-sorting them, all before outputting the first value. This patch introduces a new algorithm which uses stored generation numbers to incrementally walk in topo-order, outputting commits as we go. This can dramatically reduce the computation time to write a fixed number of commits, such as when limiting with "-n " or filling the first page of a pager. When running a command like 'git rev-list --topo-order HEAD', Git performed the following steps: 1. Run limit_list(), which parses all reachable commits, adds them to a linked list, and distributes UNINTERESTING flags. If all unprocessed commits are UNINTERESTING, then it may terminate without walking all reachable commits. This does not occur if we do not specify UNINTERESTING commits. 2. Run sort_in_topological_order(), which is an implementation of Kahn's algorithm. It first iterates through the entire set of important commits and computes the in-degree of each (plus one, as we use 'zero' as a special value here). Then, we walk the commits in priority order, adding them to the priority queue if and only if their in-degree is one. As we remove commits from this priority queue, we decrement the in-degree of their parents. 3. While we are peeling commits for output, get_revision_1() uses pop_commit on the full list of commits computed by sort_in_topological_order(). In the new algorithm, these three steps correspond to three different commit walks. We run these walks simultaneously, and advance each only as far as necessary to satisfy the requirements of the 'higher order' walk. We know when we can pause each walk by using generation numbers from the commit- graph feature. Recall that the generation number of a commit satisfies: * If the commit has at least one parent, then the generation number is one more than the maximum generation number among its parents. * If the commit has no parent, then the generation number is one. There are two special generation numbers: * GENERATION_NUMBER_INFINITY: this value is 0x and indicates that the commit is not stored in the commit-graph and the generation number was not previously calculated. * GENERATION_NUMBER_ZERO: this value (0) is a special indicator to say that the commit-graph was generated by a version of Git that does not compute generation numbers (such as v2.18.0). Since we use generation_numbers_enabled() before using the new algorithm, we do not need to worry about GENERATION_NUMBER_ZERO. However, the existence of GENERATION_NUMBER_INFINITY implies the following weaker statement than the usual we expect from generation numbers: If A and B are commits with generation numbers gen(A) and gen(B) and gen(A) < gen(B), then A cannot reach B. Thus, we will walk in each of our stages until the "maximum unexpanded generation number" is strictly lower than the generation number of a commit we are about to use. The walks are as follows: 1. EXPLORE: using the explore_queue priority queue (ordered by maximizing the generation number), parse each reachable commit until all commits in the queue have generation number strictly lower than needed. During this walk, update the UNINTERESTING flags as necessary. 2. INDEGREE: using the indegree_queue priority queue (ordered by maximizing the generation number), add one to the in- degree of each parent for each commit that is walked. Since we walk in order of decreasing generation number, we know that discovering an in-degree value of 0 means the value for that commit was not initialized, so should be initialized to two. (Recall that in-degree value "1" is what we use to say a commit is ready for output.) As we iterate the parents of a commit during this walk, ensure the EXPLORE walk has walked beyond their generation numbers. 3. TOPO: using the topo_queue priority queue (ordered based on the sort_order given, which could be commit-date, author- date, or typical topo-order which treats the queue as a LIFO stack), remove a commit from the queue and decrement the in-degree of each parent. If a parent has an in-degree of one, then we add it to the topo_queue. Before we decrement the in-degree, however, ensure the INDEGREE walk has walked beyond that generation number. The implementations of these walks are in the following methods: * explore_walk_step and explore_to_depth * indegree_walk_step and compute_indegrees_to_depth * next_topo_commit and expand_topo_walk These methods have some patterns that may seem strange at first, but they are probably carry-overs from their equivalents in limit_list and sort_in_topological_order. One thing that is missing from this implementation is a proper way to stop walking when the entire queue is UNINTERESTING, so this implementation is not enabled by comparisions, such as in 'git rev-list
[PATCH v4 3/7] test-reach: add rev-list tests
From: Derrick Stolee The rev-list command is critical to Git's functionality. Ensure it works in the three commit-graph environments constructed in t6600-test-reach.sh. Here are a few important types of rev-list operations: * Basic: git rev-list --topo-order HEAD * Range: git rev-list --topo-order compare..HEAD * Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD * Symmetric Difference: git rev-list --topo-order compare...HEAD Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 84 +++ 1 file changed, 84 insertions(+) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index 9d65b8b946..288f703b7b 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' ' test_three_modes commit_contains --tag ' +test_expect_success 'rev-list: basic topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \ + commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-6-6 +' + +test_expect_success 'rev-list: first-parent topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes git rev-list --first-parent --topo-order commit-6-6 +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-3..commit-6-6 +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-8..commit-6-6 +' + +test_expect_success 'rev-list: first-parent range topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6 +' + +test_expect_success 'rev-list: ancestry-path topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + >expect && + run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6 +' + +test_expect_success 'rev-list: symmetric difference topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + commit-3-8 commit-2-8 commit-1-8 \ + commit-3-7 commit-2-7 commit-1-7 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-8...commit-6-6 +' + test_done -- gitgitgadget
[PATCH v4 1/7] prio-queue: add 'peek' operation
From: Derrick Stolee When consuming a priority queue, it can be convenient to inspect the next object that will be dequeued without actually dequeueing it. Our existing library did not have such a 'peek' operation, so add it as prio_queue_peek(). Add a reference-level comparison in t/helper/test-prio-queue.c so this method is exercised by t0009-prio-queue.sh. Further, add a test that checks the behavior when the compare function is NULL (i.e. the queue becomes a stack). Signed-off-by: Derrick Stolee --- prio-queue.c | 9 + prio-queue.h | 6 ++ t/helper/test-prio-queue.c | 26 ++ t/t0009-prio-queue.sh | 14 ++ 4 files changed, 47 insertions(+), 8 deletions(-) diff --git a/prio-queue.c b/prio-queue.c index a078451872..d3f488cb05 100644 --- a/prio-queue.c +++ b/prio-queue.c @@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue) } return result; } + +void *prio_queue_peek(struct prio_queue *queue) +{ + if (!queue->nr) + return NULL; + if (!queue->compare) + return queue->array[queue->nr - 1].data; + return queue->array[0].data; +} diff --git a/prio-queue.h b/prio-queue.h index d030ec9dd6..682e51867a 100644 --- a/prio-queue.h +++ b/prio-queue.h @@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing); */ extern void *prio_queue_get(struct prio_queue *); +/* + * Gain access to the "thing" that would be returned by + * prio_queue_get, but do not remove it from the queue. + */ +extern void *prio_queue_peek(struct prio_queue *); + extern void clear_prio_queue(struct prio_queue *); /* Reverse the LIFO elements */ diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c index 9807b649b1..5bc9c46ea5 100644 --- a/t/helper/test-prio-queue.c +++ b/t/helper/test-prio-queue.c @@ -22,14 +22,24 @@ int cmd__prio_queue(int argc, const char **argv) struct prio_queue pq = { intcmp }; while (*++argv) { - if (!strcmp(*argv, "get")) - show(prio_queue_get()); - else if (!strcmp(*argv, "dump")) { - int *v; - while ((v = prio_queue_get())) - show(v); - } - else { + if (!strcmp(*argv, "get")) { + void *peek = prio_queue_peek(); + void *get = prio_queue_get(); + if (peek != get) + BUG("peek and get results do not match"); + show(get); + } else if (!strcmp(*argv, "dump")) { + void *peek; + void *get; + while ((peek = prio_queue_peek())) { + get = prio_queue_get(); + if (peek != get) + BUG("peek and get results do not match"); + show(get); + } + } else if (!strcmp(*argv, "stack")) { + pq.compare = NULL; + } else { int *v = malloc(sizeof(*v)); *v = atoi(*argv); prio_queue_put(, v); diff --git a/t/t0009-prio-queue.sh b/t/t0009-prio-queue.sh index e56dfce668..3941ad2528 100755 --- a/t/t0009-prio-queue.sh +++ b/t/t0009-prio-queue.sh @@ -47,4 +47,18 @@ test_expect_success 'notice empty queue' ' test_cmp expect actual ' +cat >expect <<'EOF' +3 +2 +6 +4 +5 +1 +8 +EOF +test_expect_success 'stack order' ' + test-tool prio-queue stack 8 1 5 4 6 2 3 dump >actual && + test_cmp expect actual +' + test_done -- gitgitgadget
[PATCH v4 7/7] t6012: make rev-list tests more interesting
From: Derrick Stolee As we are working to rewrite some of the revision-walk machinery, there could easily be some interesting interactions between the options that force topological constraints (--topo-order, --date-order, and --author-date-order) along with specifying a path. Add extra tests to t6012-rev-list-simplify.sh to add coverage of these interactions. To ensure interesting things occur, alter the repo data shape to have different orders depending on topo-, date-, or author-date-order. When testing using GIT_TEST_COMMIT_GRAPH, this assists in covering the new logic for topo-order walks using generation numbers. The extra tests can be added indepently. Signed-off-by: Derrick Stolee --- t/t6012-rev-list-simplify.sh | 45 1 file changed, 36 insertions(+), 9 deletions(-) diff --git a/t/t6012-rev-list-simplify.sh b/t/t6012-rev-list-simplify.sh index b5a1190ffe..a10f0df02b 100755 --- a/t/t6012-rev-list-simplify.sh +++ b/t/t6012-rev-list-simplify.sh @@ -12,6 +12,22 @@ unnote () { git name-rev --tags --stdin | sed -e "s|$OID_REGEX (tags/\([^)]*\)) |\1 |g" } +# +# Create a test repo with interesting commit graph: +# +# A--B--G--H--I--K--L +# \ \ / / +# \ \ / / +#C--E---F J +#\_/ +# +# The commits are laid out from left-to-right starting with +# the root commit A and terminating at the tip commit L. +# +# There are a few places where we adjust the commit date or +# author date to make the --topo-order, --date-order, and +# --author-date-order flags produce different output. + test_expect_success setup ' echo "Hi there" >file && echo "initial" >lost && @@ -21,10 +37,18 @@ test_expect_success setup ' git branch other-branch && + git symbolic-ref HEAD refs/heads/unrelated && + git rm -f "*" && + echo "Unrelated branch" >side && + git add side && + test_tick && git commit -m "Side root" && + note J && + git checkout master && + echo "Hello" >file && echo "second" >lost && git add file lost && - test_tick && git commit -m "Modified file and lost" && + test_tick && GIT_AUTHOR_DATE=$(($test_tick + 120)) git commit -m "Modified file and lost" && note B && git checkout other-branch && @@ -63,13 +87,6 @@ test_expect_success setup ' test_tick && git commit -a -m "Final change" && note I && - git symbolic-ref HEAD refs/heads/unrelated && - git rm -f "*" && - echo "Unrelated branch" >side && - git add side && - test_tick && git commit -m "Side root" && - note J && - git checkout master && test_tick && git merge --allow-unrelated-histories -m "Coolest" unrelated && note K && @@ -103,14 +120,24 @@ check_result () { check_outcome success "$@" } -check_result 'L K J I H G F E D C B A' --full-history +check_result 'L K J I H F E D C G B A' --full-history --topo-order +check_result 'L K I H G F E D C B J A' --full-history +check_result 'L K I H G F E D C B J A' --full-history --date-order +check_result 'L K I H G F E D B C J A' --full-history --author-date-order check_result 'K I H E C B A' --full-history -- file check_result 'K I H E C B A' --full-history --topo-order -- file check_result 'K I H E C B A' --full-history --date-order -- file +check_result 'K I H E B C A' --full-history --author-date-order -- file check_result 'I E C B A' --simplify-merges -- file +check_result 'I E C B A' --simplify-merges --topo-order -- file +check_result 'I E C B A' --simplify-merges --date-order -- file +check_result 'I E B C A' --simplify-merges --author-date-order -- file check_result 'I B A' -- file check_result 'I B A' --topo-order -- file +check_result 'I B A' --date-order -- file +check_result 'I B A' --author-date-order -- file check_result 'H' --first-parent -- another-file +check_result 'H' --first-parent --topo-order -- another-file check_result 'E C B A' --full-history E -- lost test_expect_success 'full history simplification without parent' ' -- gitgitgadget
[PATCH v4 5/7] commit/revisions: bookkeeping before refactoring
From: Derrick Stolee There are a few things that need to move around a little before making a big refactoring in the topo-order logic: 1. We need access to record_author_date() and compare_commits_by_author_date() in revision.c. These are used currently by sort_in_topological_order() in commit.c. 2. Moving these methods to commit.h requires adding the author_slab definition to commit.h. 3. The add_parents_to_list() method in revision.c performs logic around the UNINTERESTING flag and other special cases depending on the struct rev_info. Allow this method to ignore a NULL 'list' parameter, as we will not be populating the list for our walk. Also rename the method to the slightly more generic name process_parents() to make clear that this method does more than add to a list (and no list is required anymore). Helped-by: Jeff King Signed-off-by: Derrick Stolee --- commit.c | 11 +-- commit.h | 8 revision.c | 18 ++ 3 files changed, 23 insertions(+), 14 deletions(-) diff --git a/commit.c b/commit.c index d0f199e122..861a485e93 100644 --- a/commit.c +++ b/commit.c @@ -655,11 +655,10 @@ struct commit *pop_commit(struct commit_list **stack) /* count number of children that have not been emitted */ define_commit_slab(indegree_slab, int); -/* record author-date for each commit object */ -define_commit_slab(author_date_slab, timestamp_t); +implement_shared_commit_slab(author_date_slab, timestamp_t); -static void record_author_date(struct author_date_slab *author_date, - struct commit *commit) +void record_author_date(struct author_date_slab *author_date, + struct commit *commit) { const char *buffer = get_commit_buffer(commit, NULL); struct ident_split ident; @@ -684,8 +683,8 @@ fail_exit: unuse_commit_buffer(commit, buffer); } -static int compare_commits_by_author_date(const void *a_, const void *b_, - void *cb_data) +int compare_commits_by_author_date(const void *a_, const void *b_, + void *cb_data) { const struct commit *a = a_, *b = b_; struct author_date_slab *author_date = cb_data; diff --git a/commit.h b/commit.h index 2b1a734388..977d397356 100644 --- a/commit.h +++ b/commit.h @@ -8,6 +8,7 @@ #include "gpg-interface.h" #include "string-list.h" #include "pretty.h" +#include "commit-slab.h" #define COMMIT_NOT_FROM_GRAPH 0x #define GENERATION_NUMBER_INFINITY 0x @@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf); */ extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); +/* record author-date for each commit object */ +define_shared_commit_slab(author_date_slab, timestamp_t); + +void record_author_date(struct author_date_slab *author_date, + struct commit *commit); + +int compare_commits_by_author_date(const void *a_, const void *b_, void *unused); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); diff --git a/revision.c b/revision.c index 2dcde8a8ac..36458265a0 100644 --- a/revision.c +++ b/revision.c @@ -768,8 +768,8 @@ static void commit_list_insert_by_date_cached(struct commit *p, struct commit_li *cache = new_entry; } -static int add_parents_to_list(struct rev_info *revs, struct commit *commit, - struct commit_list **list, struct commit_list **cache_ptr) +static int process_parents(struct rev_info *revs, struct commit *commit, + struct commit_list **list, struct commit_list **cache_ptr) { struct commit_list *parent = commit->parents; unsigned left_flag; @@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, if (p->object.flags & SEEN) continue; p->object.flags |= SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } return 0; } @@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, p->object.flags |= left_flag; if (!(p->object.flags & SEEN)) { p->object.flags |= SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } if (revs->first_parent_only) break; @@ -1091,7 +1093,7 @@ static
[PATCH v4 0/7] Use generation numbers for --topo-order
This patch series performs a decently-sized refactoring of the revision-walk machinery. Well, "refactoring" is probably the wrong word, as I don't actually remove the old code. Instead, when we see certain options in the 'rev_info' struct, we redirect the commit-walk logic to a new set of methods that distribute the workload differently. By using generation numbers in the commit-graph, we can significantly improve 'git log --graph' commands (and the underlying 'git rev-list --topo-order'). On the Linux repository, I got the following performance results when comparing to the previous version with or without a commit-graph: Test: git rev-list --topo-order -100 HEAD HEAD~1, no commit-graph: 6.80 s HEAD~1, w/ commit-graph: 0.77 s HEAD, w/ commit-graph: 0.02 s Test: git rev-list --topo-order -100 HEAD -- tools HEAD~1, no commit-graph: 9.63 s HEAD~1, w/ commit-graph: 6.06 s HEAD, w/ commit-graph: 0.06 s If you want to read this series but are unfamiliar with the commit-graph and generation numbers, then I recommend reading Documentation/technical/commit-graph.txt or a blob post [1] I wrote on the subject. In particular, the three-part walk described in "revision.c: refactor basic topo-order logic" is present (but underexplained) as an animated PNG [2]. Since revision.c is an incredibly important (and old) portion of the codebase -- and because there are so many orthogonal options in 'struct rev_info' -- I consider this submission to be "RFC quality". That is, I am not confident that I am not missing anything, or that my solution is the best it can be. I did merge this branch with ds/commit-graph-with-grafts and the "DO-NOT-MERGE: write and read commit-graph always" commit that computes a commit-graph with every 'git commit' command. The test suite passed with that change, available on GitHub [3]. To ensure that I cover at least the case I think are interesting, I added tests to t6600-test-reach.sh to verify the walks report the correct results for the three cases there (no commit-graph, full commit-graph, and a partial commit-graph so the walk starts at GENERATION_NUMBER_INFINITY). One notable case that is not included in this series is the case of a history comparison such as 'git rev-list --topo-order A..B'. The existing code in limit_list() has ways to cut the walk short when all pending commits are UNINTERESTING. Since this code depends on commit_list instead of the prio_queue we are using here, I chose to leave it untouched for now. We can revisit it in a separate series later. Since handle_commit() turns on revs->limited when a commit is UNINTERESTING, we do not hit the new code in this case. Removing this 'revs->limited = 1;' line yields correct results, but the performance is worse. This series was based on ds/reachable, but is now based on 'master' to not conflict with 182070 "commit: use timestamp_t for author_date_slab". There is a small conflict with md/filter-trees, because it renamed a flag in revisions.h in the line before I add new flags. Hopefully this conflict is not too difficult to resolve. Changes in V3: I added a new patch that updates the tab-alignment for flags in revision.h before adding new ones (Thanks, Ævar!). Also, I squashed the recommended changes to run_three_modes and test_three_modes from Szeder and Junio. Thanks! Changes in V4: I'm sending a V4 to respond to the feedback so far. Still looking forward to more on the really big commit! * Removed the whitespace changes to the flags in revision.c that caused merge pain. * The prio-queue peek function is now covered by tests when in "stack" mode. * The "add_parents_to_list()" function is now renamed to "process_parents()" * Added a new commit that expands test coverage with alternate orders and file history (use GIT_TEST_COMMIT_GRAPH to have t6012-rev-list-simplify.sh cover the new logic). These tests found a problem with author dates (I forgot to record them during the explore walk). * Commit message edits. Thanks, -Stolee [1] https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/ Supercharging the Git Commit Graph III: Generations and Graph Algorithms [2] https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png Animation showing three-part walk [3] https://github.com/derrickstolee/git/tree/topo-order/testA branch containing this series along with commits to compute commit-graph in entire test suite. Cc: avarab@gmail.comCc: szeder@gmail.com Derrick Stolee (7): prio-queue: add 'peek' operation test-reach: add run_three_modes method test-reach: add rev-list tests revision.c: begin refactoring --topo-order logic commit/revisions: bookkeeping before refactoring revision.c: generation-based topo-order algorithm t6012: make rev-list tests more interesting commit.c | 11 +- commit.h | 8 ++ object.h
[PATCH v4 2/7] test-reach: add run_three_modes method
From: Derrick Stolee The 'test_three_modes' method assumes we are using the 'test-tool reach' command for our test. However, we may want to use the data shape of our commit graph and the three modes (no commit-graph, full commit-graph, partial commit-graph) for other git commands. Split test_three_modes to be a simple translation on a more general run_three_modes method that executes the given command and tests the actual output to the expected output. Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index d139a00d1d..9d65b8b946 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -53,18 +53,22 @@ test_expect_success 'setup' ' git config core.commitGraph true ' -test_three_modes () { +run_three_modes () { test_when_finished rm -rf .git/objects/info/commit-graph && - test-tool reach $1 actual && + "$@" actual && test_cmp expect actual && cp commit-graph-full .git/objects/info/commit-graph && - test-tool reach $1 actual && + "$@" actual && test_cmp expect actual && cp commit-graph-half .git/objects/info/commit-graph && - test-tool reach $1 actual && + "$@" actual && test_cmp expect actual } +test_three_modes () { + run_three_modes test-tool reach "$@" +} + test_expect_success 'ref_newer:miss' ' cat >input <<-\EOF && A:commit-5-7 -- gitgitgadget
[PATCH v2 2/3] midx: close multi-pack-index on repack
From: Derrick Stolee When repacking, we may remove pack-files. This invalidates the multi-pack-index (if it exists). Previously, we removed the multi-pack-index file before removing any pack-file. In some cases, the repack command may load the multi-pack-index into memory. This may lead to later in-memory references to the non-existent pack- files. Signed-off-by: Derrick Stolee --- builtin/repack.c | 3 +-- midx.c | 15 --- midx.h | 4 +++- 3 files changed, 16 insertions(+), 6 deletions(-) diff --git a/builtin/repack.c b/builtin/repack.c index c6a7943d5c..44965cbaa3 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -431,8 +431,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) char *fname, *fname_old; if (!midx_cleared) { - /* if we move a packfile, it will invalidated the midx */ - clear_midx_file(get_object_directory()); + clear_midx_file(the_repository); midx_cleared = 1; } diff --git a/midx.c b/midx.c index bf1f511862..22247a30ab 100644 --- a/midx.c +++ b/midx.c @@ -176,9 +176,13 @@ cleanup_fail: return NULL; } -static void close_midx(struct multi_pack_index *m) +void close_midx(struct multi_pack_index *m) { uint32_t i; + + if (!m) + return; + munmap((unsigned char *)m->data, m->data_len); close(m->fd); m->fd = -1; @@ -914,9 +918,14 @@ cleanup: return 0; } -void clear_midx_file(const char *object_dir) +void clear_midx_file(struct repository *r) { - char *midx = get_midx_filename(object_dir); + char *midx = get_midx_filename(r->objects->objectdir); + + if (r->objects && r->objects->multi_pack_index) { + close_midx(r->objects->multi_pack_index); + r->objects->multi_pack_index = NULL; + } if (remove_path(midx)) { UNLEAK(midx); diff --git a/midx.h b/midx.h index ce80b91c68..0f68bccdd5 100644 --- a/midx.h +++ b/midx.h @@ -42,7 +42,9 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_name); int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local); int write_midx_file(const char *object_dir); -void clear_midx_file(const char *object_dir); +void clear_midx_file(struct repository *r); int verify_midx_file(const char *object_dir); +void close_midx(struct multi_pack_index *m); + #endif -- gitgitgadget
[PATCH v2 3/3] multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX
From: Derrick Stolee The multi-pack-index feature is tested in isolation by t5319-multi-pack-index.sh, but there are many more interesting scenarios in the test suite surrounding pack-file data shapes and interactions. Since the multi-pack-index is an optional data structure, it does not make sense to include it by default in those tests. Instead, add a new GIT_TEST_MULTI_PACK_INDEX environment variable that enables core.multiPackIndex and writes a multi-pack-index after each 'git repack' command. This adds extra test coverage when needed. There are a few spots in the test suite that need to react to this change: * t5319-multi-pack-index.sh: there is a test that checks that 'git repack' deletes the multi-pack-index. Disable the environment variable to ensure this still happens. * t5310-pack-bitmaps.sh: One test moves a pack-file from the object directory to an alternate. This breaks the multi-pack-index, so delete the multi-pack-index at this point, if it exists. * t9300-fast-import.sh: One test verifies the number of files in the .git/objects/pack directory is exactly 8. Exclude the multi-pack-index from this count so it is still 8 in all cases. Signed-off-by: Derrick Stolee --- builtin/repack.c| 4 midx.c | 9 +++-- midx.h | 2 ++ t/README| 4 t/t5310-pack-bitmaps.sh | 1 + t/t5319-multi-pack-index.sh | 2 +- t/t9300-fast-import.sh | 2 +- 7 files changed, 20 insertions(+), 4 deletions(-) diff --git a/builtin/repack.c b/builtin/repack.c index 44965cbaa3..26dcccdafc 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -553,6 +553,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (!no_update_server_info) update_server_info(0); remove_temporary_files(); + + if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0)) + write_midx_file(get_object_directory()); + string_list_clear(, 0); string_list_clear(, 0); string_list_clear(_packs, 0); diff --git a/midx.c b/midx.c index 22247a30ab..02b2211e31 100644 --- a/midx.c +++ b/midx.c @@ -335,9 +335,14 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i struct multi_pack_index *m; struct multi_pack_index *m_search; int config_value; + static int env_value = -1; - if (repo_config_get_bool(r, "core.multipackindex", _value) || - !config_value) + if (env_value < 0) + env_value = git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0); + + if (!env_value && + (repo_config_get_bool(r, "core.multipackindex", _value) || + !config_value)) return 0; for (m_search = r->objects->multi_pack_index; m_search; m_search = m_search->next) diff --git a/midx.h b/midx.h index 0f68bccdd5..f2bb7e681c 100644 --- a/midx.h +++ b/midx.h @@ -3,6 +3,8 @@ #include "repository.h" +#define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX" + struct multi_pack_index { struct multi_pack_index *next; diff --git a/t/README b/t/README index 5e48a043ce..9bfdd3004c 100644 --- a/t/README +++ b/t/README @@ -327,6 +327,10 @@ GIT_TEST_COMMIT_GRAPH=, when true, forces the commit-graph to be written after every 'git commit' command, and overrides the 'core.commitGraph' setting to true. +GIT_TEST_MULTI_PACK_INDEX=, when true, forces the multi-pack- +index to be written after every 'git repack' command, and overrides the +'core.multiPackIndex' setting to true. + Naming Tests diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh index 1be3459c5b..82d7f7f6a5 100755 --- a/t/t5310-pack-bitmaps.sh +++ b/t/t5310-pack-bitmaps.sh @@ -191,6 +191,7 @@ test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pa test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' ' mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ && + rm -f .git/objects/pack/multi-pack-index && test_when_finished "mv alt.git/objects/pack/$packbitmap.* .git/objects/pack/" && echo HEAD | git pack-objects --local --stdout --revs >3b.pack && git index-pack 3b.pack && diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index bd8e841b81..70926b5bc0 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -271,7 +271,7 @@ test_expect_success 'git-fsck incorrect offset' ' test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && - git repack -adf && + GIT_TEST_MULTI_PACK_INDEX=0 git repack -adf && test_path_is_missing $objdir/pack/multi-pack-index ' diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh index 40fe7e4976..59a13b6a77 100755 --- a/t/t9300-fast-import.sh +++ b/t/t9300-fast-import.sh @@ -1558,7 +1558,7 @@ test_expect_success 'O:
[PATCH v2 1/3] midx: fix broken free() in close_midx()
From: Derrick Stolee When closing a multi-pack-index, we intend to close each pack-file and free the struct packed_git that represents it. However, this line was previously freeing the array of pointers, not the pointer itself. This leads to a double-free issue. Signed-off-by: Derrick Stolee --- midx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/midx.c b/midx.c index 713d6f9dde..bf1f511862 100644 --- a/midx.c +++ b/midx.c @@ -186,7 +186,7 @@ static void close_midx(struct multi_pack_index *m) for (i = 0; i < m->num_packs; i++) { if (m->packs[i]) { close_pack(m->packs[i]); - free(m->packs); + free(m->packs[i]); } } FREE_AND_NULL(m->packs); -- gitgitgadget
[PATCH v2 0/3] Add GIT_TEST_MULTI_PACK_INDEX environment variable
To increase coverage of the multi-pack-index feature, add a GIT_TEST_MULTI_PACK_INDEX environment variable similar to other GIT_TEST_* variables. After creating the environment variable and running the test suite with it enabled, I found a few bugs in the multi-pack-index implementation. These are handled by the first two patches. I have set up a CI build on Azure Pipelines [1] that runs the test suite with a few optional features enabled, including GIT_TEST_MULTI_PACK_INDEX and GIT_TEST_COMMIT_GRAPH. I'll use this to watch the features and ensure they work well with the rest of the ongoing work. Eventually, we can add these variables to the Travis CI scripts. [1] https://git.visualstudio.com/git/_build?definitionId=4 Derrick Stolee (3): midx: fix broken free() in close_midx() midx: close multi-pack-index on repack multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX builtin/repack.c| 7 +-- midx.c | 26 -- midx.h | 6 +- t/README| 4 t/t5310-pack-bitmaps.sh | 1 + t/t5319-multi-pack-index.sh | 2 +- t/t9300-fast-import.sh | 2 +- 7 files changed, 37 insertions(+), 11 deletions(-) base-commit: 5a0cc8aca797dbd7d2be3b67458ff880ed45cddf Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-27%2Fderrickstolee%2Fmidx-test%2Fupstream-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-27/derrickstolee/midx-test/upstream-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/27 Range-diff vs v1: 1: 9fcbbe336d = 1: 8bd672fe26 midx: fix broken free() in close_midx() 2: 725ebadc92 ! 2: 2d8f26679d midx: close multi-pack-index on repack @@ -15,16 +15,15 @@ --- a/builtin/repack.c +++ b/builtin/repack.c @@ + char *fname, *fname_old; if (!midx_cleared) { - /* if we move a packfile, it will invalidated the midx */ -+ if (the_repository->objects) { -+ close_midx(the_repository->objects->multi_pack_index); -+ the_repository->objects->multi_pack_index = NULL; -+ } - clear_midx_file(get_object_directory()); +- /* if we move a packfile, it will invalidated the midx */ +- clear_midx_file(get_object_directory()); ++ clear_midx_file(the_repository); midx_cleared = 1; } + diff --git a/midx.c b/midx.c --- a/midx.c @@ -44,13 +43,34 @@ munmap((unsigned char *)m->data, m->data_len); close(m->fd); m->fd = -1; +@@ + return 0; + } + +-void clear_midx_file(const char *object_dir) ++void clear_midx_file(struct repository *r) + { +- char *midx = get_midx_filename(object_dir); ++ char *midx = get_midx_filename(r->objects->objectdir); ++ ++ if (r->objects && r->objects->multi_pack_index) { ++ close_midx(r->objects->multi_pack_index); ++ r->objects->multi_pack_index = NULL; ++ } + + if (remove_path(midx)) { + UNLEAK(midx); diff --git a/midx.h b/midx.h --- a/midx.h +++ b/midx.h @@ + int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local); + int write_midx_file(const char *object_dir); - void clear_midx_file(const char *object_dir); +-void clear_midx_file(const char *object_dir); ++void clear_midx_file(struct repository *r); + int verify_midx_file(const char *object_dir); +void close_midx(struct multi_pack_index *m); + 3: 04e3e91082 = 3: 57c64e814c multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX -- gitgitgadget
[PATCH 1/3] midx: fix broken free() in close_midx()
From: Derrick Stolee When closing a multi-pack-index, we intend to close each pack-file and free the struct packed_git that represents it. However, this line was previously freeing the array of pointers, not the pointer itself. This leads to a double-free issue. Signed-off-by: Derrick Stolee --- midx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/midx.c b/midx.c index f3e8dbc108..999717b96f 100644 --- a/midx.c +++ b/midx.c @@ -190,7 +190,7 @@ static void close_midx(struct multi_pack_index *m) for (i = 0; i < m->num_packs; i++) { if (m->packs[i]) { close_pack(m->packs[i]); - free(m->packs); + free(m->packs[i]); } } FREE_AND_NULL(m->packs); -- gitgitgadget
[PATCH 0/3] Add GIT_TEST_MULTI_PACK_INDEX environment variable
To increase coverage of the multi-pack-index feature, add a GIT_TEST_MULTI_PACK_INDEX environment variable similar to other GIT_TEST_* variables. After creating the environment variable and running the test suite with it enabled, I found a few bugs in the multi-pack-index implementation. These are handled by the first two patches. I have set up a CI build on Azure Pipelines [1] that runs the test suite with a few optional features enabled, including GIT_TEST_MULTI_PACK_INDEX and GIT_TEST_COMMIT_GRAPH. I'll use this to watch the features and ensure they work well with the rest of the ongoing work. Eventually, we can add these variables to the Travis CI scripts. [1] https://git.visualstudio.com/git/_build?definitionId=4 Derrick Stolee (3): midx: fix broken free() in close_midx() midx: close multi-pack-index on repack multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX builtin/repack.c| 8 midx.c | 17 + midx.h | 4 t/README| 4 t/t5310-pack-bitmaps.sh | 1 + t/t5319-multi-pack-index.sh | 2 +- t/t9300-fast-import.sh | 2 +- 7 files changed, 32 insertions(+), 6 deletions(-) base-commit: f84b9b09d40408cf91bbc500d9f190a7866c3e0f Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-27%2Fderrickstolee%2Fmidx-test%2Fupstream-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-27/derrickstolee/midx-test/upstream-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/27 -- gitgitgadget
[PATCH 3/3] multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX
From: Derrick Stolee The multi-pack-index feature is tested in isolation by t5319-multi-pack-index.sh, but there are many more interesting scenarios in the test suite surrounding pack-file data shapes and interactions. Since the multi-pack-index is an optional data structure, it does not make sense to include it by default in those tests. Instead, add a new GIT_TEST_MULTI_PACK_INDEX environment variable that enables core.multiPackIndex and writes a multi-pack-index after each 'git repack' command. This adds extra test coverage when needed. There are a few spots in the test suite that need to react to this change: * t5319-multi-pack-index.sh: there is a test that checks that 'git repack' deletes the multi-pack-index. Disable the environment variable to ensure this still happens. * t5310-pack-bitmaps.sh: One test moves a pack-file from the object directory to an alternate. This breaks the multi-pack-index, so delete the multi-pack-index at this point, if it exists. * t9300-fast-import.sh: One test verifies the number of files in the .git/objects/pack directory is exactly 8. Exclude the multi-pack-index from this count so it is still 8 in all cases. Signed-off-by: Derrick Stolee --- builtin/repack.c| 4 midx.c | 9 +++-- midx.h | 2 ++ t/README| 4 t/t5310-pack-bitmaps.sh | 1 + t/t5319-multi-pack-index.sh | 2 +- t/t9300-fast-import.sh | 2 +- 7 files changed, 20 insertions(+), 4 deletions(-) diff --git a/builtin/repack.c b/builtin/repack.c index 7925bb976e..418442bfe2 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -558,6 +558,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (!no_update_server_info) update_server_info(0); remove_temporary_files(); + + if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0)) + write_midx_file(get_object_directory()); + string_list_clear(, 0); string_list_clear(, 0); string_list_clear(_packs, 0); diff --git a/midx.c b/midx.c index fe8532a9d1..aeafb58fa3 100644 --- a/midx.c +++ b/midx.c @@ -338,9 +338,14 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i struct multi_pack_index *m; struct multi_pack_index *m_search; int config_value; + static int env_value = -1; - if (repo_config_get_bool(r, "core.multipackindex", _value) || - !config_value) + if (env_value < 0) + env_value = git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0); + + if (!env_value && + (repo_config_get_bool(r, "core.multipackindex", _value) || + !config_value)) return 0; for (m_search = r->objects->multi_pack_index; m_search; m_search = m_search->next) diff --git a/midx.h b/midx.h index af6b5cb58f..bec8f73d28 100644 --- a/midx.h +++ b/midx.h @@ -3,6 +3,8 @@ #include "repository.h" +#define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX" + struct multi_pack_index { struct multi_pack_index *next; diff --git a/t/README b/t/README index 3ea6c85460..9d0277c338 100644 --- a/t/README +++ b/t/README @@ -327,6 +327,10 @@ GIT_TEST_COMMIT_GRAPH=, when true, forces the commit-graph to be written after every 'git commit' command, and overrides the 'core.commitGraph' setting to true. +GIT_TEST_MULTI_PACK_INDEX=, when true, forces the multi-pack- +index to be written after every 'git repack' command, and overrides the +'core.multiPackIndex' setting to true. + Naming Tests diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh index 1be3459c5b..82d7f7f6a5 100755 --- a/t/t5310-pack-bitmaps.sh +++ b/t/t5310-pack-bitmaps.sh @@ -191,6 +191,7 @@ test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pa test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' ' mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ && + rm -f .git/objects/pack/multi-pack-index && test_when_finished "mv alt.git/objects/pack/$packbitmap.* .git/objects/pack/" && echo HEAD | git pack-objects --local --stdout --revs >3b.pack && git index-pack 3b.pack && diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 6f56b38674..4024ff9a39 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -152,7 +152,7 @@ compare_results_with_midx "twelve packs" test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && - git repack -adf && + GIT_TEST_MULTI_PACK_INDEX=0 git repack -adf && test_path_is_missing $objdir/pack/multi-pack-index ' diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh index 40fe7e4976..59a13b6a77 100755 --- a/t/t9300-fast-import.sh +++ b/t/t9300-fast-import.sh @@ -1558,7 +1558,7 @@ test_expect_success 'O: blank lines
[PATCH 2/3] midx: close multi-pack-index on repack
From: Derrick Stolee When repacking, we may remove pack-files. This invalidates the multi-pack-index (if it exists). Previously, we removed the multi-pack-index file before removing any pack-file. In some cases, the repack command may load the multi-pack-index into memory. This may lead to later in-memory references to the non-existent pack- files. Signed-off-by: Derrick Stolee --- builtin/repack.c | 4 midx.c | 6 +- midx.h | 2 ++ 3 files changed, 11 insertions(+), 1 deletion(-) diff --git a/builtin/repack.c b/builtin/repack.c index c6a7943d5c..7925bb976e 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -432,6 +432,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (!midx_cleared) { /* if we move a packfile, it will invalidated the midx */ + if (the_repository->objects) { + close_midx(the_repository->objects->multi_pack_index); + the_repository->objects->multi_pack_index = NULL; + } clear_midx_file(get_object_directory()); midx_cleared = 1; } diff --git a/midx.c b/midx.c index 999717b96f..fe8532a9d1 100644 --- a/midx.c +++ b/midx.c @@ -180,9 +180,13 @@ cleanup_fail: return NULL; } -static void close_midx(struct multi_pack_index *m) +void close_midx(struct multi_pack_index *m) { uint32_t i; + + if (!m) + return; + munmap((unsigned char *)m->data, m->data_len); close(m->fd); m->fd = -1; diff --git a/midx.h b/midx.h index a210f1af2a..af6b5cb58f 100644 --- a/midx.h +++ b/midx.h @@ -44,4 +44,6 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i int write_midx_file(const char *object_dir); void clear_midx_file(const char *object_dir); +void close_midx(struct multi_pack_index *m); + #endif -- gitgitgadget
[PATCH v4 1/1] contrib: add coverage-diff script
From: Derrick Stolee We have coverage targets in our Makefile for using gcov to display line coverage based on our test suite. The way I like to do it is to run: make coverage-test make coverage-report This leaves the repo in a state where every X.c file that was covered has an X.c.gcov file containing the coverage counts for every line, and "#" at every uncovered line. There have been a few bugs in recent patches what would have been caught if the test suite covered those blocks (including a few of mine). I want to work towards a "sensible" amount of coverage on new topics. In my opinion, this means that any logic should be covered, but the 'die()' blocks covering very unlikely (or near-impossible) situations may not warrant coverage. It is important to not measure the coverage of the codebase by what old code is not covered. To help, I created the 'contrib/coverage-diff.sh' script. After creating the coverage statistics at a version (say, 'topic') you can then run contrib/coverage-diff.sh base topic to see the lines added between 'base' and 'topic' that are not covered by the test suite. The output uses 'git blame -s' format so you can find the commits responsible and view the line numbers for quick access to the context, but trims leading tabs in the file contents to reduce output width. Signed-off-by: Derrick Stolee --- contrib/coverage-diff.sh | 108 +++ 1 file changed, 108 insertions(+) create mode 100755 contrib/coverage-diff.sh diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh new file mode 100755 index 00..4ec419f900 --- /dev/null +++ b/contrib/coverage-diff.sh @@ -0,0 +1,108 @@ +#!/bin/sh + +# Usage: Run 'contrib/coverage-diff.sh ' from source-root +# after running +# +# make coverage-test +# make coverage-report +# +# while checked out at . This script combines the *.gcov files +# generated by the 'make' commands above with 'git diff ' +# to report new lines that are not covered by the test suite. + +V1=$1 +V2=$2 + +diff_lines () { + perl -e ' + my $line_num; + while (<>) { + # Hunk header? Grab the beginning in postimage. + if (/^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/) { + $line_num = $1; + next; + } + + # Have we seen a hunk? Ignore "diff --git" etc. + next unless defined $line_num; + + # Deleted line? Ignore. + if (/^-/) { + next; + } + + # Show only the line number of added lines. + if (/^\+/) { + print "$line_num\n"; + } + # Either common context or added line appear in + # the postimage. Count it. + $line_num++; + } + ' +} + +files=$(git diff --name-only "$V1" "$V2" -- \*.c) + +# create empty file +>coverage-data.txt + +for file in $files +do + git diff "$V1" "$V2" -- "$file" | + diff_lines | + sort >new_lines.txt + + if ! test -s new_lines.txt + then + continue + fi + + hash_file=$(echo $file | sed "s/\//\#/") + + if ! test -s "$hash_file.gcov" + then + continue + fi + + sed -ne '/#:/{ + s/#:// + s/:.*// + s/ //g + p + }' "$hash_file.gcov" | + sort >uncovered_lines.txt + + comm -12 uncovered_lines.txt new_lines.txt | + sed -e 's/$/\)/' | + sed -e 's/^/ /' >uncovered_new_lines.txt + + grep -q '[^[:space:]]' >coverage-data.txt && + git blame -s "$V2" -- "$file" | + sed 's/\t//g' | + grep -f uncovered_new_lines.txt >>coverage-data.txt && + echo >>coverage-data.txt + + rm -f new_lines.txt uncovered_lines.txt uncovered_new_lines.txt +done + +cat coverage-data.txt + +echo "Commits introducing uncovered code:" + +commit_list=$(cat coverage-data.txt | + grep -E '^[0-9a-f]{7,} ' | + awk '{print $1;}' | + sort | + uniq) + +( + for commit in $commit_list + do + git log --no-decorate --pretty=format:'%an %h: %s' -1 $commit + echo + done +) | sort + +rm coverage-data.txt -- gitgitgadget
[PATCH v4 0/1] contrib: Add script to show uncovered "new" lines
We have coverage targets in our Makefile for using gcov to display line coverage based on our test suite. The way I like to do it is to run: make coverage-test make coverage-report This leaves the repo in a state where every X.c file that was covered has an X.c.gcov file containing the coverage counts for every line, and "#" at every uncovered line. There have been a few bugs in recent patches what would have been caught if the test suite covered those blocks (including a few of mine). I want to work towards a "sensible" amount of coverage on new topics. In my opinion, this means that any logic should be covered, but the 'die()' blocks in error cases do not need to be covered. It is important to not measure the coverage of the codebase by what old code is not covered. To help, I created the 'contrib/coverage-diff.sh' script. After creating the coverage statistics at a version (say, 'topic') you can then run contrib/coverage-diff.sh base topic to see the lines added between 'base' and 'topic' that are not covered by the test suite. For example, I ran this against the 'next' branch (e82ca0) versus 'master' (f84b9b) and got the following output: builtin/commit.c 76f2f5c1e3 builtin/commit.c 1657) write_commit_graph_reachable(get_object_directory(), 0, 0); builtin/fsck.c 66ec0390e7 builtin/fsck.c 862) midx_argv[2] = "--object-dir"; 66ec0390e7 builtin/fsck.c 863) midx_argv[3] = alt->path; 66ec0390e7 builtin/fsck.c 864) if (run_command(_verify)) 66ec0390e7 builtin/fsck.c 865) errors_found |= ERROR_COMMIT_GRAPH; fsck.c fb8952077d 214) die_errno("Could not read '%s'", path); midx.c 56ee7ff156 949) return 0; cc6af73c02 990) midx_report(_("failed to load pack-index for packfile %s"), cc6af73c02 991) e.p->pack_name); cc6af73c02 992) break; Commits introducing uncovered code: Derrick Stolee 56ee7ff15: multi-pack-index: add 'verify' verb Derrick Stolee 66ec0390e: fsck: verify multi-pack-index Derrick Stolee cc6af73c0: multi-pack-index: verify object offsets Junio C Hamano 76f2f5c1e: Merge branch 'ab/commit-graph-progress' into next René Scharfe fb8952077: fsck: use strbuf_getline() to read skiplist file Thanks, -Stolee CHANGES IN V3: I took Junio's perl script verbatim, which speeds up the performance greatly. Some of the other sed commands needed some massaging, but also added extra cleanup. Thanks for the help! CHANGES IN V4: I reduced the blame output using -s which decreases the width. I include a summary of the commit authors at the end to help people see the lines they wrote. This version is also copied into a build definition in the public Git project on Azure Pipelines [1]. I'll use this build definition to generate the coverage report after each "What's Cooking" email. [1] https://git.visualstudio.com/git/_build?definitionId=5 Derrick Stolee (1): contrib: add coverage-diff script contrib/coverage-diff.sh | 108 +++ 1 file changed, 108 insertions(+) create mode 100755 contrib/coverage-diff.sh base-commit: 1d4361b0f344188ab5eec6dcea01f61a3a3a1670 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-40%2Fderrickstolee%2Fcoverage-v4 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-40/derrickstolee/coverage-v4 Pull-Request: https://github.com/gitgitgadget/git/pull/40 Range-diff vs v3: 1: 21214cc321 ! 1: 6daf310a43 contrib: add coverage-diff script @@ -26,10 +26,10 @@ contrib/coverage-diff.sh base topic to see the lines added between 'base' and 'topic' that are not covered by the -test suite. The output uses 'git blame -c' format so you can find the commits -responsible and view the line numbers for quick access to the context. +test suite. The output uses 'git blame -s' format so you can find the commits +responsible and view the line numbers for quick access to the context, but +trims leading tabs in the file contents to reduce output width. -Helped-by: Junio C Hamano Signed-off-by: Derrick Stolee diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh @@ -81,13 +81,16 @@ + ' +} + -+files=$(git diff --name-only $V1 $V2 -- *.c) ++files=$(git diff --name-only "$V1" "$V2" -- \*.c) ++ ++# create empty file ++>coverage-data.txt + +for file in $files +do -+ git diff $V1 $V2 -- $file \ -+ | diff_lines \ -+ | sort >new_lines.txt ++ git diff "$V1" "$V2" -- "$file" | ++ diff_lines | ++ sort >new_lines.txt + + if ! test -s new_lines.txt + then @@ -95,24 +98,50 @@ + fi + + hash_file=$(echo $file | sed "s/\//\#/") ++ ++ if ! test -s "$hash_file.gcov" ++ then ++ continue ++ fi ++ + sed -ne '/#:/{ + s/#:// + s/:.*//
[PATCH v2 1/3] commit-graph: clean up leaked memory during write
From: Derrick Stolee The write_commit_graph() method in commit-graph.c leaks some lits and strings during execution. In addition, a list of strings is leaked in write_commit_graph_reachable(). Clean these up so our memory checking is cleaner. Further, if we use a list of pack-files to find the commits, we can leak the packed_git structs after scanning them for commits. Running the following commands demonstrates the leak before and the fix after: * valgrind --leak-check=full ./git commit-graph write --reachable * valgrind --leak-check=full ./git commit-graph write --stdin-packs Signed-off-by: Martin Ågren Signed-off-by: Derrick Stolee --- commit-graph.c | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 2a24eb8b5a..ceca6026b0 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -693,11 +693,12 @@ static int add_ref_to_list(const char *refname, void write_commit_graph_reachable(const char *obj_dir, int append, int report_progress) { - struct string_list list; + struct string_list list = STRING_LIST_INIT_DUP; - string_list_init(, 1); for_each_ref(add_ref_to_list, ); write_commit_graph(obj_dir, NULL, , append, report_progress); + + string_list_clear(, 0); } void write_commit_graph(const char *obj_dir, @@ -764,6 +765,7 @@ void write_commit_graph(const char *obj_dir, die(_("error opening index for %s"), packname.buf); for_each_object_in_pack(p, add_packed_commits, , 0); close_pack(p); + free(p); } stop_progress(); strbuf_release(); @@ -846,9 +848,11 @@ void write_commit_graph(const char *obj_dir, compute_generation_numbers(, report_progress); graph_name = get_commit_graph_filename(obj_dir); - if (safe_create_leading_directories(graph_name)) + if (safe_create_leading_directories(graph_name)) { + UNLEAK(graph_name); die_errno(_("unable to create leading directories of %s"), graph_name); + } hold_lock_file_for_update(, graph_name, LOCK_DIE_ON_ERROR); f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf); @@ -893,9 +897,9 @@ void write_commit_graph(const char *obj_dir, finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC); commit_lock_file(); + free(graph_name); + free(commits.list); free(oids.list); - oids.alloc = 0; - oids.nr = 0; } #define VERIFY_COMMIT_GRAPH_ERROR_HASH 2 -- gitgitgadget
[PATCH v2 3/3] commit-graph: reduce initial oid allocation
From: Derrick Stolee While writing a commit-graph file, we store the full list of commits in a flat list. We use this list for sorting and ensuring we are closed under reachability. The initial allocation assumed that (at most) one in four objects is a commit. This is a dramatic over-count for many repos, especially large ones. Since we grow the repo dynamically, reduce this count by a factor of eight. We still set it to a minimum of 1024 before allocating. Signed-off-by: Derrick Stolee --- commit-graph.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index ceca6026b0..e773703e1d 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -720,7 +720,7 @@ void write_commit_graph(const char *obj_dir, struct progress *progress = NULL; oids.nr = 0; - oids.alloc = approximate_object_count() / 4; + oids.alloc = approximate_object_count() / 32; oids.progress = NULL; oids.progress_done = 0; -- gitgitgadget
[PATCH v2 0/3] Clean up leaks in commit-graph.c
While looking at the commit-graph code, I noticed some memory leaks. These can be found by running valgrind --leak-check=full ./git commit-graph write --reachable The impact of these leaks are small, as we never call write_commit_graph _reachable in a loop, but it is best to be diligent here. While looking at memory consumption within write_commit_graph(), I noticed that we initialize our oid list with "object count / 4", which seems to be a huge over-count. I reduce this by a factor of eight. I built off of ab/commit-graph-progress, because my patch involves lines close to those changes. V2 includes feedback from V1 along with Martin's additional patches. Thanks, -Stolee Derrick Stolee (2): commit-graph: clean up leaked memory during write commit-graph: reduce initial oid allocation Martin Ågren (1): builtin/commit-graph.c: UNLEAK variables builtin/commit-graph.c | 11 ++- commit-graph.c | 16 ++-- 2 files changed, 16 insertions(+), 11 deletions(-) base-commit: 6b89a34c89fc763292f06012318b852b74825619 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-42%2Fderrickstolee%2Fcommit-graph-leak-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-42/derrickstolee/commit-graph-leak-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/42 Range-diff vs v1: 1: 6906c25415 ! 1: ba65680b3d commit-graph: clean up leaked memory during write @@ -7,17 +7,29 @@ leaked in write_commit_graph_reachable(). Clean these up so our memory checking is cleaner. -Running 'valgrind --leak-check=full git commit-graph write ---reachable' demonstrates these leaks and how they are fixed after -this change. +Further, if we use a list of pack-files to find the commits, we +can leak the packed_git structs after scanning them for commits. +Running the following commands demonstrates the leak before and +the fix after: + +* valgrind --leak-check=full ./git commit-graph write --reachable +* valgrind --leak-check=full ./git commit-graph write --stdin-packs + +Signed-off-by: Martin Ågren Signed-off-by: Derrick Stolee diff --git a/commit-graph.c b/commit-graph.c --- a/commit-graph.c +++ b/commit-graph.c @@ - string_list_init(, 1); + void write_commit_graph_reachable(const char *obj_dir, int append, +int report_progress) + { +- struct string_list list; ++ struct string_list list = STRING_LIST_INIT_DUP; + +- string_list_init(, 1); for_each_ref(add_ref_to_list, ); write_commit_graph(obj_dir, NULL, , append, report_progress); + @@ -25,6 +37,14 @@ } void write_commit_graph(const char *obj_dir, +@@ + die(_("error opening index for %s"), packname.buf); + for_each_object_in_pack(p, add_packed_commits, , 0); + close_pack(p); ++ free(p); + } + stop_progress(); + strbuf_release(); @@ compute_generation_numbers(, report_progress); @@ -45,5 +65,8 @@ + free(graph_name); + free(commits.list); free(oids.list); - oids.alloc = 0; - oids.nr = 0; +- oids.alloc = 0; +- oids.nr = 0; + } + + #define VERIFY_COMMIT_GRAPH_ERROR_HASH 2 -: -- > 2: 13032d8475 builtin/commit-graph.c: UNLEAK variables 2: e29a0eaf03 = 3: 1002fd34fc commit-graph: reduce initial oid allocation -- gitgitgadget
[PATCH 2/2] commit-graph: reduce initial oid allocation
From: Derrick Stolee While writing a commit-graph file, we store the full list of commits in a flat list. We use this list for sorting and ensuring we are closed under reachability. The initial allocation assumed that (at most) one in four objects is a commit. This is a dramatic over-count for many repos, especially large ones. Since we grow the repo dynamically, reduce this count by a factor of eight. We still set it to a minimum of 1024 before allocating. Signed-off-by: Derrick Stolee --- commit-graph.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 7226bd6b58..a24cceb55f 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -721,7 +721,7 @@ void write_commit_graph(const char *obj_dir, struct progress *progress = NULL; oids.nr = 0; - oids.alloc = approximate_object_count() / 4; + oids.alloc = approximate_object_count() / 32; oids.progress = NULL; oids.progress_done = 0; -- gitgitgadget
[PATCH 1/2] commit-graph: clean up leaked memory during write
From: Derrick Stolee The write_commit_graph() method in commit-graph.c leaks some lits and strings during execution. In addition, a list of strings is leaked in write_commit_graph_reachable(). Clean these up so our memory checking is cleaner. Running 'valgrind --leak-check=full git commit-graph write --reachable' demonstrates these leaks and how they are fixed after this change. Signed-off-by: Derrick Stolee --- commit-graph.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 2a24eb8b5a..7226bd6b58 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -698,6 +698,8 @@ void write_commit_graph_reachable(const char *obj_dir, int append, string_list_init(, 1); for_each_ref(add_ref_to_list, ); write_commit_graph(obj_dir, NULL, , append, report_progress); + + string_list_clear(, 0); } void write_commit_graph(const char *obj_dir, @@ -846,9 +848,11 @@ void write_commit_graph(const char *obj_dir, compute_generation_numbers(, report_progress); graph_name = get_commit_graph_filename(obj_dir); - if (safe_create_leading_directories(graph_name)) + if (safe_create_leading_directories(graph_name)) { + UNLEAK(graph_name); die_errno(_("unable to create leading directories of %s"), graph_name); + } hold_lock_file_for_update(, graph_name, LOCK_DIE_ON_ERROR); f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf); @@ -893,6 +897,8 @@ void write_commit_graph(const char *obj_dir, finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC); commit_lock_file(); + free(graph_name); + free(commits.list); free(oids.list); oids.alloc = 0; oids.nr = 0; -- gitgitgadget
[PATCH 0/2] Clean up leaks in commit-graph.c
While looking at the commit-graph code, I noticed some memory leaks. These can be found by running valgrind --leak-check=full ./git commit-graph write --reachable The impact of these leaks are small, as we never call write_commit_graph _reachable in a loop, but it is best to be diligent here. While looking at memory consumption within write_commit_graph(), I noticed that we initialize our oid list with "object count / 4", which seems to be a huge over-count. I reduce this by a factor of eight. I built off of ab/commit-graph-progress, because my patch involves lines close to those changes. Thanks, -Stolee Derrick Stolee (2): commit-graph: clean up leaked memory during write commit-graph: reduce initial oid allocation commit-graph.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) base-commit: 6b89a34c89fc763292f06012318b852b74825619 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-42%2Fderrickstolee%2Fcommit-graph-leak-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-42/derrickstolee/commit-graph-leak-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/42 -- gitgitgadget
[PATCH 1/1] read-cache: update index format default to v4
From: Derrick Stolee The index v4 format has been available since 2012 with 9d22778 "reach-cache.c: write prefix-compressed names in the index". Since the format has been stable for so long, almost all versions of Git in use today understand version 4, removing one barrier to upgrade -- that someone may want to downgrade and needs a working repo. Despite being stable for a long time, this index version was never adopted as the default. This prefix-compressed version of the format can get significant space savings on repos with large working directories (which naturally tend to have deep nesting). This version is set as the default for some external tools, such as VFS for Git. Because of this external use, the format has had a lot of "testing in production" and also is subject to continuous integration in these environments. Previously, to test version 4 indexes, we needed to run the test suite with GIT_TEST_INDEX_VERSION=4 (or TEST_GIT_INDEX_VERSION=4). One potential, but short-term, downside is that we lose coverage of the version 3 indexes. The trade-off is that we may want to cover that version using GIT_TEST_INDEX_VERSION=3. Signed-off-by: Derrick Stolee --- read-cache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/read-cache.c b/read-cache.c index 372588260e..af6c8f2a67 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1484,7 +1484,7 @@ struct cache_entry *refresh_cache_entry(struct cache_entry *ce, * Index File I/O */ -#define INDEX_FORMAT_DEFAULT 3 +#define INDEX_FORMAT_DEFAULT 4 static unsigned int get_index_format_default(void) { -- gitgitgadget
[PATCH 0/1] read-cache: update index format default to v4
After discussing this with several people off-list, I thought I would open the question up to the list: Should we update the default index version to v4? The .git/index file stores the list of every path tracked by Git in the working directory, including merge information, staged paths, and information about the file system contents (based on modified time). The only major update in v4 is that the paths are prefix-compressed. This compression works best in repos with a lot of paths, especially deep paths. For this reason, we set the index to v4 in VFS for Git. Among VFS for Git contributors, we were talking about how the v4 format is not covered by the test suite by default. We are working to increase the number of CI builds that set extra GIT_TEST_* variables that we need. However, I thought it worth having a discussion of whether this is a good thing to recommend for all users of Git. Personally, I'm not an expert here, but I am happy to start the conversation. Thanks, -Stolee Derrick Stolee (1): read-cache: update index format default to v4 read-cache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) base-commit: 53f9a3e157dbbc901a02ac2c73346d375e24978c Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-41%2Fderrickstolee%2Findex-v4-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-41/derrickstolee/index-v4-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/41 -- gitgitgadget
[PATCH v4 0/1] Properly peel tags in can_all_from_reach_with_flags()
As Peff reported [1], the refactored can_all_from_reach_with_flags() method does not properly peel tags. Since the helper method can_all_from_reach() and code in t/helper/test-reach.c all peel tags before getting to this method, it is not super-simple to create a test that demonstrates this. I modified t/helper/test-reach.c to allow calling can_all_from_reach_with_flags() directly, and added a test in t6600-test-reach.sh that demonstrates the segfault without the fix. For V2, I compared the loop that inspects the 'from' commits in commit ba3ca1edce "commit-reach: move can_all_from_reach_with_flags" to the version here and got the following diff: 3c3 < if (from_one->flags & assign_flag) --- > if (!from_one || from_one->flags & assign_flag) 5c5,7 < from_one = deref_tag(the_repository, from_one, "a from object", 0); --- > > from_one = deref_tag(the_repository, from_one, > "a from object", 0); 14a17,22 > > list[nr_commits] = (struct commit *)from_one; > if (parse_commit(list[nr_commits]) || > list[nr_commits]->generation < min_generation) > return 0; /* is this a leak? */ > nr_commits++; This diff includes the early termination we had before 'deref_tag' and the comment for why we can ignore non-commit objects. [1] https://public-inbox.org/git/0bf9103c-9377-506b-7ad7-e5273d8e9...@gmail.com/T/#u Derrick Stolee (1): commit-reach: properly peel tags and clear flags commit-reach.c| 44 +-- t/helper/test-reach.c | 22 +- t/t6600-test-reach.sh | 30 +++-- 3 files changed, 79 insertions(+), 17 deletions(-) base-commit: 6621c838743812aaba96e55cfec8524ea1144c2d Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-39%2Fderrickstolee%2Ftag-fix-v4 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-39/derrickstolee/tag-fix-v4 Pull-Request: https://github.com/gitgitgadget/git/pull/39 Range-diff vs v3: 1: 0a1e661271 ! 1: a0a3cf0134 commit-reach: properly peel tags @@ -1,6 +1,6 @@ Author: Derrick Stolee -commit-reach: properly peel tags +commit-reach: properly peel tags and clear flags The can_all_from_reach_with_flag() algorithm was refactored in 4fbcca4e "commit-reach: make can_all_from_reach... linear" but incorrectly @@ -14,6 +14,19 @@ Correct the issue by peeling tags when investigating the initial list of objects in the 'from' array. +The can_all_from_reach_with_flag() method uses 'assign_flag' as a +value we can use to mark objects temporarily during our commit walk. +The intent is that these flags are removed from all objects before +returning. However, this is not the case. + +The 'from' array could also contain objects that are not commits, and +we mark those objects with 'assign_flag'. Add a loop to the 'cleanup' +section that removes these markers. + +Also, we forgot to free() the memory for 'list', so add that to the +'cleanup' section. Also, use a cleaner mechanism for clearing those +flags. + Signed-off-by: Jeff King Signed-off-by: Derrick Stolee @@ -74,10 +87,18 @@ cleanup: - for (i = 0; i < from->nr; i++) { -+ for (i = 0; i < nr_commits; i++) { - clear_commit_marks(list[i], RESULT); - clear_commit_marks(list[i], assign_flag); - } +- clear_commit_marks(list[i], RESULT); +- clear_commit_marks(list[i], assign_flag); +- } ++ clear_commit_marks_many(nr_commits, list, RESULT | assign_flag); ++ free(list); ++ ++ for (i = 0; i < from->nr; i++) ++ from->objects[i].item->flags &= ~assign_flag; ++ + return result; + } + diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c --- a/t/helper/test-reach.c 2: b2e0ee4978 < -: -- commit-reach: fix memory and flag leaks -- gitgitgadget
[PATCH v4 1/1] commit-reach: properly peel tags and clear flags
From: Derrick Stolee The can_all_from_reach_with_flag() algorithm was refactored in 4fbcca4e "commit-reach: make can_all_from_reach... linear" but incorrectly assumed that all objects provided were commits. During a fetch negotiation, ok_to_give_up() in upload-pack.c may provide unpeeled tags to the 'from' array. The current code creates a segfault. Add a direct call to can_all_from_reach_with_flag() in 'test-tool reach' and add a test in t6600-test-reach.sh that demonstrates this segfault. Correct the issue by peeling tags when investigating the initial list of objects in the 'from' array. The can_all_from_reach_with_flag() method uses 'assign_flag' as a value we can use to mark objects temporarily during our commit walk. The intent is that these flags are removed from all objects before returning. However, this is not the case. The 'from' array could also contain objects that are not commits, and we mark those objects with 'assign_flag'. Add a loop to the 'cleanup' section that removes these markers. Also, we forgot to free() the memory for 'list', so add that to the 'cleanup' section. Also, use a cleaner mechanism for clearing those flags. Signed-off-by: Jeff King Signed-off-by: Derrick Stolee --- commit-reach.c| 44 +-- t/helper/test-reach.c | 22 +- t/t6600-test-reach.sh | 30 +++-- 3 files changed, 79 insertions(+), 17 deletions(-) diff --git a/commit-reach.c b/commit-reach.c index 86715c103c..db84f85986 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -544,20 +544,42 @@ int can_all_from_reach_with_flag(struct object_array *from, { struct commit **list = NULL; int i; + int nr_commits; int result = 1; ALLOC_ARRAY(list, from->nr); + nr_commits = 0; for (i = 0; i < from->nr; i++) { - list[i] = (struct commit *)from->objects[i].item; + struct object *from_one = from->objects[i].item; - if (parse_commit(list[i]) || - list[i]->generation < min_generation) - return 0; + if (!from_one || from_one->flags & assign_flag) + continue; + + from_one = deref_tag(the_repository, from_one, +"a from object", 0); + if (!from_one || from_one->type != OBJ_COMMIT) { + /* no way to tell if this is reachable by +* looking at the ancestry chain alone, so +* leave a note to ourselves not to worry about +* this object anymore. +*/ + from->objects[i].item->flags |= assign_flag; + continue; + } + + list[nr_commits] = (struct commit *)from_one; + if (parse_commit(list[nr_commits]) || + list[nr_commits]->generation < min_generation) { + result = 0; + goto cleanup; + } + + nr_commits++; } - QSORT(list, from->nr, compare_commits_by_gen); + QSORT(list, nr_commits, compare_commits_by_gen); - for (i = 0; i < from->nr; i++) { + for (i = 0; i < nr_commits; i++) { /* DFS from list[i] */ struct commit_list *stack = NULL; @@ -600,10 +622,12 @@ int can_all_from_reach_with_flag(struct object_array *from, } cleanup: - for (i = 0; i < from->nr; i++) { - clear_commit_marks(list[i], RESULT); - clear_commit_marks(list[i], assign_flag); - } + clear_commit_marks_many(nr_commits, list, RESULT | assign_flag); + free(list); + + for (i = 0; i < from->nr; i++) + from->objects[i].item->flags &= ~assign_flag; + return result; } diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c index eb21103998..08d2ea68e8 100644 --- a/t/helper/test-reach.c +++ b/t/helper/test-reach.c @@ -31,6 +31,7 @@ int cmd__reach(int ac, const char **av) struct object_id oid_A, oid_B; struct commit *A, *B; struct commit_list *X, *Y; + struct object_array X_obj = OBJECT_ARRAY_INIT; struct commit **X_array; int X_nr, X_alloc; struct strbuf buf = STRBUF_INIT; @@ -49,7 +50,8 @@ int cmd__reach(int ac, const char **av) while (strbuf_getline(, stdin) != EOF) { struct object_id oid; - struct object *o; + struct object *orig; + struct object *peeled; struct commit *c; if (buf.len < 3) continue; @@ -57,14 +59,14 @@ int cmd__reach(int ac, const char **av) if (get_oid_committish(buf.buf + 2, )) die("failed to resolve %s", buf.buf + 2); - o = parse_object(r, ); -
[PATCH v3 5/7] commit/revisions: bookkeeping before refactoring
From: Derrick Stolee There are a few things that need to move around a little before making a big refactoring in the topo-order logic: 1. We need access to record_author_date() and compare_commits_by_author_date() in revision.c. These are used currently by sort_in_topological_order() in commit.c. 2. Moving these methods to commit.h requires adding the author_slab definition to commit.h. 3. The add_parents_to_list() method in revision.c performs logic around the UNINTERESTING flag and other special cases depending on the struct rev_info. Allow this method to ignore a NULL 'list' parameter, as we will not be populating the list for our walk. Signed-off-by: Derrick Stolee --- commit.c | 11 --- commit.h | 8 revision.c | 6 -- 3 files changed, 16 insertions(+), 9 deletions(-) diff --git a/commit.c b/commit.c index d0f199e122..f68e04b2f1 100644 --- a/commit.c +++ b/commit.c @@ -655,11 +655,8 @@ struct commit *pop_commit(struct commit_list **stack) /* count number of children that have not been emitted */ define_commit_slab(indegree_slab, int); -/* record author-date for each commit object */ -define_commit_slab(author_date_slab, timestamp_t); - -static void record_author_date(struct author_date_slab *author_date, - struct commit *commit) +void record_author_date(struct author_date_slab *author_date, + struct commit *commit) { const char *buffer = get_commit_buffer(commit, NULL); struct ident_split ident; @@ -684,8 +681,8 @@ fail_exit: unuse_commit_buffer(commit, buffer); } -static int compare_commits_by_author_date(const void *a_, const void *b_, - void *cb_data) +int compare_commits_by_author_date(const void *a_, const void *b_, + void *cb_data) { const struct commit *a = a_, *b = b_; struct author_date_slab *author_date = cb_data; diff --git a/commit.h b/commit.h index 2b1a734388..ff0eb5f8ef 100644 --- a/commit.h +++ b/commit.h @@ -8,6 +8,7 @@ #include "gpg-interface.h" #include "string-list.h" #include "pretty.h" +#include "commit-slab.h" #define COMMIT_NOT_FROM_GRAPH 0x #define GENERATION_NUMBER_INFINITY 0x @@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf); */ extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); +/* record author-date for each commit object */ +define_commit_slab(author_date_slab, timestamp_t); + +void record_author_date(struct author_date_slab *author_date, + struct commit *commit); + +int compare_commits_by_author_date(const void *a_, const void *b_, void *unused); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); diff --git a/revision.c b/revision.c index 2dcde8a8ac..92012d5f45 100644 --- a/revision.c +++ b/revision.c @@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, if (p->object.flags & SEEN) continue; p->object.flags |= SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } return 0; } @@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, p->object.flags |= left_flag; if (!(p->object.flags & SEEN)) { p->object.flags |= SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } if (revs->first_parent_only) break; -- gitgitgadget
[PATCH v3 7/7] revision.c: refactor basic topo-order logic
From: Derrick Stolee When running a command like 'git rev-list --topo-order HEAD', Git performed the following steps: 1. Run limit_list(), which parses all reachable commits, adds them to a linked list, and distributes UNINTERESTING flags. If all unprocessed commits are UNINTERESTING, then it may terminate without walking all reachable commits. This does not occur if we do not specify UNINTERESTING commits. 2. Run sort_in_topological_order(), which is an implementation of Kahn's algorithm. It first iterates through the entire set of important commits and computes the in-degree of each (plus one, as we use 'zero' as a special value here). Then, we walk the commits in priority order, adding them to the priority queue if and only if their in-degree is one. As we remove commits from this priority queue, we decrement the in-degree of their parents. 3. While we are peeling commits for output, get_revision_1() uses pop_commit on the full list of commits computed by sort_in_topological_order(). In the new algorithm, these three steps correspond to three different commit walks. We run these walks simultaneously, and advance each only as far as necessary to satisfy the requirements of the 'higher order' walk. We know when we can pause each walk by using generation numbers from the commit- graph feature. Recall that the generation number of a commit satisfies: * If the commit has at least one parent, then the generation number is one more than the maximum generation number among its parents. * If the commit has no parent, then the generation number is one. There are two special generation numbers: * GENERATION_NUMBER_INFINITY: this value is 0x and indicates that the commit is not stored in the commit-graph and the generation number was not previously calculated. * GENERATION_NUMBER_ZERO: this value (0) is a special indicator to say that the commit-graph was generated by a version of Git that does not compute generation numbers (such as v2.18.0). Since we use generation_numbers_enabled() before using the new algorithm, we do not need to worry about GENERATION_NUMBER_ZERO. However, the existence of GENERATION_NUMBER_INFINITY implies the following weaker statement than the usual we expect from generation numbers: If A and B are commits with generation numbers gen(A) and gen(B) and gen(A) < gen(B), then A cannot reach B. Thus, we will walk in each of our stages until the "maximum unexpanded generation number" is strictly lower than the generation number of a commit we are about to use. The walks are as follows: 1. EXPLORE: using the explore_queue priority queue (ordered by maximizing the generation number), parse each reachable commit until all commits in the queue have generation number strictly lower than needed. During this walk, update the UNINTERESTING flags as necessary. 2. INDEGREE: using the indegree_queue priority queue (ordered by maximizing the generation number), add one to the in- degree of each parent for each commit that is walked. Since we walk in order of decreasing generation number, we know that discovering an in-degree value of 0 means the value for that commit was not initialized, so should be initialized to two. (Recall that in-degree value "1" is what we use to say a commit is ready for output.) As we iterate the parents of a commit during this walk, ensure the EXPLORE walk has walked beyond their generation numbers. 3. TOPO: using the topo_queue priority queue (ordered based on the sort_order given, which could be commit-date, author- date, or typical topo-order which treats the queue as a LIFO stack), remove a commit from the queue and decrement the in-degree of each parent. If a parent has an in-degree of one, then we add it to the topo_queue. Before we decrement the in-degree, however, ensure the INDEGREE walk has walked beyond that generation number. The implementations of these walks are in the following methods: * explore_walk_step and explore_to_depth * indegree_walk_step and compute_indegrees_to_depth * next_topo_commit and expand_topo_walk These methods have some patterns that may seem strange at first, but they are probably carry-overs from their equivalents in limit_list and sort_in_topological_order. One thing that is missing from this implementation is a proper way to stop walking when the entire queue is UNINTERESTING, so this implementation is not enabled by comparisions, such as in 'git rev-list --topo-order A..B'. This can be updated in the future. In my local testing, I used the following Git commands on the Linux repository in three modes: HEAD~1 with no commit-graph, HEAD~1 with a commit-graph, and HEAD with a commit-graph. This allows comparing the benefits we get from parsing commits from the commit-graph and then again the benefits we get by restricting the set of commits we walk. Test: git rev-list --topo-order -100 HEAD
[PATCH v3 3/7] test-reach: add rev-list tests
From: Derrick Stolee The rev-list command is critical to Git's functionality. Ensure it works in the three commit-graph environments constructed in t6600-test-reach.sh. Here are a few important types of rev-list operations: * Basic: git rev-list --topo-order HEAD * Range: git rev-list --topo-order compare..HEAD * Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD * Symmetric Difference: git rev-list --topo-order compare...HEAD Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 84 +++ 1 file changed, 84 insertions(+) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index 9d65b8b946..288f703b7b 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' ' test_three_modes commit_contains --tag ' +test_expect_success 'rev-list: basic topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \ + commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-6-6 +' + +test_expect_success 'rev-list: first-parent topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes git rev-list --first-parent --topo-order commit-6-6 +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-3..commit-6-6 +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-8..commit-6-6 +' + +test_expect_success 'rev-list: first-parent range topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6 +' + +test_expect_success 'rev-list: ancestry-path topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + >expect && + run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6 +' + +test_expect_success 'rev-list: symmetric difference topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + commit-3-8 commit-2-8 commit-1-8 \ + commit-3-7 commit-2-7 commit-1-7 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-8...commit-6-6 +' + test_done -- gitgitgadget
[PATCH v3 6/7] revision.h: add whitespace in flag definitions
From: Derrick Stolee In anticipation of adding longer flag names in the next change, add an extra tab to each flag definition in revision.h. Signed-off-by: Derrick Stolee --- revision.h | 28 ++-- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/revision.h b/revision.h index fd4154ff75..e7bd059d80 100644 --- a/revision.h +++ b/revision.h @@ -10,20 +10,20 @@ #include "commit-slab-decl.h" /* Remember to update object flag allocation in object.h */ -#define SEEN (1u<<0) -#define UNINTERESTING (1u<<1) -#define TREESAME (1u<<2) -#define SHOWN (1u<<3) -#define TMP_MARK (1u<<4) /* for isolated cases; clean after use */ -#define BOUNDARY (1u<<5) -#define CHILD_SHOWN(1u<<6) -#define ADDED (1u<<7) /* Parents already parsed and added? */ -#define SYMMETRIC_LEFT (1u<<8) -#define PATCHSAME (1u<<9) -#define BOTTOM (1u<<10) -#define USER_GIVEN (1u<<25) /* given directly by the user */ -#define TRACK_LINEAR (1u<<26) -#define ALL_REV_FLAGS (((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR) +#define SEEN (1u<<0) +#define UNINTERESTING (1u<<1) +#define TREESAME (1u<<2) +#define SHOWN (1u<<3) +#define TMP_MARK (1u<<4) /* for isolated cases; clean after use */ +#define BOUNDARY (1u<<5) +#define CHILD_SHOWN(1u<<6) +#define ADDED (1u<<7) /* Parents already parsed and added? */ +#define SYMMETRIC_LEFT (1u<<8) +#define PATCHSAME (1u<<9) +#define BOTTOM (1u<<10) +#define USER_GIVEN (1u<<25) /* given directly by the user */ +#define TRACK_LINEAR (1u<<26) +#define ALL_REV_FLAGS (((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR) #define DECORATE_SHORT_REFS1 #define DECORATE_FULL_REFS 2 -- gitgitgadget
[PATCH v3 4/7] revision.c: begin refactoring --topo-order logic
From: Derrick Stolee When running 'git rev-list --topo-order' and its kin, the topo_order setting in struct rev_info implies the limited setting. This means that the following things happen during prepare_revision_walk(): * revs->limited implies we run limit_list() to walk the entire reachable set. There are some short-cuts here, such as if we perform a range query like 'git rev-list COMPARE..HEAD' and we can stop limit_list() when all queued commits are uninteresting. * revs->topo_order implies we run sort_in_topological_order(). See the implementation of that method in commit.c. It implies that the full set of commits to order is in the given commit_list. These two methods imply that a 'git rev-list --topo-order HEAD' command must walk the entire reachable set of commits _twice_ before returning a single result. If we have a commit-graph file with generation numbers computed, then there is a better way. This patch introduces some necessary logic redirection when we are in this situation. In v2.18.0, the commit-graph file contains zero-valued bytes in the positions where the generation number is stored in v2.19.0 and later. Thus, we use generation_numbers_enabled() to check if the commit-graph is available and has non-zero generation numbers. When setting revs->limited only because revs->topo_order is true, only do so if generation numbers are not available. There is no reason to use the new logic as it will behave similarly when all generation numbers are INFINITY or ZERO. In prepare_revision_walk(), if we have revs->topo_order but not revs->limited, then we trigger the new logic. It breaks the logic into three pieces, to fit with the existing framework: 1. init_topo_walk() fills a new struct topo_walk_info in the rev_info struct. We use the presence of this struct as a signal to use the new methods during our walk. In this patch, this method simply calls limit_list() and sort_in_topological_order(). In the future, this method will set up a new data structure to perform that logic in-line. 2. next_topo_commit() provides get_revision_1() with the next topo- ordered commit in the list. Currently, this simply pops the commit from revs->commits. 3. expand_topo_walk() provides get_revision_1() with a way to signal walking beyond the latest commit. Currently, this calls add_parents_to_list() exactly like the old logic. While this commit presents method redirection for performing the exact same logic as before, it allows the next commit to focus only on the new logic. Signed-off-by: Derrick Stolee --- revision.c | 42 ++ revision.h | 4 2 files changed, 42 insertions(+), 4 deletions(-) diff --git a/revision.c b/revision.c index e18bd530e4..2dcde8a8ac 100644 --- a/revision.c +++ b/revision.c @@ -25,6 +25,7 @@ #include "worktree.h" #include "argv-array.h" #include "commit-reach.h" +#include "commit-graph.h" volatile show_early_output_fn_t show_early_output; @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s if (revs->diffopt.objfind) revs->simplify_history = 0; - if (revs->topo_order) + if (revs->topo_order && !generation_numbers_enabled(the_repository)) revs->limited = 1; if (revs->prune_data.nr) { @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid, return 0; } +struct topo_walk_info {}; + +static void init_topo_walk(struct rev_info *revs) +{ + struct topo_walk_info *info; + revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info)); + info = revs->topo_walk_info; + memset(info, 0, sizeof(struct topo_walk_info)); + + limit_list(revs); + sort_in_topological_order(>commits, revs->sort_order); +} + +static struct commit *next_topo_commit(struct rev_info *revs) +{ + return pop_commit(>commits); +} + +static void expand_topo_walk(struct rev_info *revs, struct commit *commit) +{ + if (add_parents_to_list(revs, commit, >commits, NULL) < 0) { + if (!revs->ignore_missing_links) + die("Failed to traverse parents of commit %s", + oid_to_hex(>object.oid)); + } +} + int prepare_revision_walk(struct rev_info *revs) { int i; @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs) commit_list_sort_by_date(>commits); if (revs->no_walk) return 0; - if (revs->limited) + if (revs->limited) { if (limit_list(revs) < 0) return -1; - if (revs->topo_order) - sort_in_topological_order(>commits, revs->sort_order); + if (revs->topo_order) + sort_in_topological_order(>commits, revs->sort_order); + } else if (revs->topo_order) + init_topo_walk(revs); if
[PATCH v3 1/7] prio-queue: add 'peek' operation
From: Derrick Stolee When consuming a priority queue, it can be convenient to inspect the next object that will be dequeued without actually dequeueing it. Our existing library did not have such a 'peek' operation, so add it as prio_queue_peek(). Add a reference-level comparison in t/helper/test-prio-queue.c so this method is exercised by t0009-prio-queue.sh. Signed-off-by: Derrick Stolee --- prio-queue.c | 9 + prio-queue.h | 6 ++ t/helper/test-prio-queue.c | 10 +++--- 3 files changed, 22 insertions(+), 3 deletions(-) diff --git a/prio-queue.c b/prio-queue.c index a078451872..d3f488cb05 100644 --- a/prio-queue.c +++ b/prio-queue.c @@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue) } return result; } + +void *prio_queue_peek(struct prio_queue *queue) +{ + if (!queue->nr) + return NULL; + if (!queue->compare) + return queue->array[queue->nr - 1].data; + return queue->array[0].data; +} diff --git a/prio-queue.h b/prio-queue.h index d030ec9dd6..682e51867a 100644 --- a/prio-queue.h +++ b/prio-queue.h @@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing); */ extern void *prio_queue_get(struct prio_queue *); +/* + * Gain access to the "thing" that would be returned by + * prio_queue_get, but do not remove it from the queue. + */ +extern void *prio_queue_peek(struct prio_queue *); + extern void clear_prio_queue(struct prio_queue *); /* Reverse the LIFO elements */ diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c index 9807b649b1..e817bbf464 100644 --- a/t/helper/test-prio-queue.c +++ b/t/helper/test-prio-queue.c @@ -22,9 +22,13 @@ int cmd__prio_queue(int argc, const char **argv) struct prio_queue pq = { intcmp }; while (*++argv) { - if (!strcmp(*argv, "get")) - show(prio_queue_get()); - else if (!strcmp(*argv, "dump")) { + if (!strcmp(*argv, "get")) { + void *peek = prio_queue_peek(); + void *get = prio_queue_get(); + if (peek != get) + BUG("peek and get results do not match"); + show(get); + } else if (!strcmp(*argv, "dump")) { int *v; while ((v = prio_queue_get())) show(v); -- gitgitgadget
[PATCH v3 2/7] test-reach: add run_three_modes method
From: Derrick Stolee The 'test_three_modes' method assumes we are using the 'test-tool reach' command for our test. However, we may want to use the data shape of our commit graph and the three modes (no commit-graph, full commit-graph, partial commit-graph) for other git commands. Split test_three_modes to be a simple translation on a more general run_three_modes method that executes the given command and tests the actual output to the expected output. Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index d139a00d1d..9d65b8b946 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -53,18 +53,22 @@ test_expect_success 'setup' ' git config core.commitGraph true ' -test_three_modes () { +run_three_modes () { test_when_finished rm -rf .git/objects/info/commit-graph && - test-tool reach $1 actual && + "$@" actual && test_cmp expect actual && cp commit-graph-full .git/objects/info/commit-graph && - test-tool reach $1 actual && + "$@" actual && test_cmp expect actual && cp commit-graph-half .git/objects/info/commit-graph && - test-tool reach $1 actual && + "$@" actual && test_cmp expect actual } +test_three_modes () { + run_three_modes test-tool reach "$@" +} + test_expect_success 'ref_newer:miss' ' cat >input <<-\EOF && A:commit-5-7 -- gitgitgadget
[PATCH v3 0/7] Use generation numbers for --topo-order
This patch series performs a decently-sized refactoring of the revision-walk machinery. Well, "refactoring" is probably the wrong word, as I don't actually remove the old code. Instead, when we see certain options in the 'rev_info' struct, we redirect the commit-walk logic to a new set of methods that distribute the workload differently. By using generation numbers in the commit-graph, we can significantly improve 'git log --graph' commands (and the underlying 'git rev-list --topo-order'). On the Linux repository, I got the following performance results when comparing to the previous version with or without a commit-graph: Test: git rev-list --topo-order -100 HEAD HEAD~1, no commit-graph: 6.80 s HEAD~1, w/ commit-graph: 0.77 s HEAD, w/ commit-graph: 0.02 s Test: git rev-list --topo-order -100 HEAD -- tools HEAD~1, no commit-graph: 9.63 s HEAD~1, w/ commit-graph: 6.06 s HEAD, w/ commit-graph: 0.06 s If you want to read this series but are unfamiliar with the commit-graph and generation numbers, then I recommend reading Documentation/technical/commit-graph.txt or a blob post [1] I wrote on the subject. In particular, the three-part walk described in "revision.c: refactor basic topo-order logic" is present (but underexplained) as an animated PNG [2]. Since revision.c is an incredibly important (and old) portion of the codebase -- and because there are so many orthogonal options in 'struct rev_info' -- I consider this submission to be "RFC quality". That is, I am not confident that I am not missing anything, or that my solution is the best it can be. I did merge this branch with ds/commit-graph-with-grafts and the "DO-NOT-MERGE: write and read commit-graph always" commit that computes a commit-graph with every 'git commit' command. The test suite passed with that change, available on GitHub [3]. To ensure that I cover at least the case I think are interesting, I added tests to t6600-test-reach.sh to verify the walks report the correct results for the three cases there (no commit-graph, full commit-graph, and a partial commit-graph so the walk starts at GENERATION_NUMBER_INFINITY). One notable case that is not included in this series is the case of a history comparison such as 'git rev-list --topo-order A..B'. The existing code in limit_list() has ways to cut the walk short when all pending commits are UNINTERESTING. Since this code depends on commit_list instead of the prio_queue we are using here, I chose to leave it untouched for now. We can revisit it in a separate series later. Since handle_commit() turns on revs->limited when a commit is UNINTERESTING, we do not hit the new code in this case. Removing this 'revs->limited = 1;' line yields correct results, but the performance is worse. This series was based on ds/reachable, but is now based on 'master' to not conflict with 182070 "commit: use timestamp_t for author_date_slab". There is a small conflict with md/filter-trees, because it renamed a flag in revisions.h in the line before I add new flags. Hopefully this conflict is not too difficult to resolve. Changes in V3: I added a new patch that updates the tab-alignment for flags in revision.h before adding new ones (Thanks, Ævar!). Also, I squashed the recommended changes to run_three_modes and test_three_modes from Szeder and Junio. Thanks! Thanks, -Stolee [1] https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/ Supercharging the Git Commit Graph III: Generations and Graph Algorithms [2] https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png Animation showing three-part walk [3] https://github.com/derrickstolee/git/tree/topo-order/testA branch containing this series along with commits to compute commit-graph in entire test suite. Cc: avarab@gmail.comCc: szeder@gmail.com Derrick Stolee (7): prio-queue: add 'peek' operation test-reach: add run_three_modes method test-reach: add rev-list tests revision.c: begin refactoring --topo-order logic commit/revisions: bookkeeping before refactoring revision.h: add whitespace in flag definitions revision.c: refactor basic topo-order logic commit.c | 11 +- commit.h | 8 ++ object.h | 4 +- prio-queue.c | 9 ++ prio-queue.h | 6 + revision.c | 232 - revision.h | 34 +++--- t/helper/test-prio-queue.c | 10 +- t/t6600-test-reach.sh | 96 ++- 9 files changed, 374 insertions(+), 36 deletions(-) base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-25%2Fderrickstolee%2Ftopo-order%2Fprogress-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-25/derrickstolee/topo-order/progress-v3 Pull-Request: https://github.com/gitgitgadget/git/pull/25 Range-diff vs v2: 1: cc1ec4c270 = 1: cc1ec4c270
[PATCH v3 1/1] contrib: add coverage-diff script
From: Derrick Stolee We have coverage targets in our Makefile for using gcov to display line coverage based on our test suite. The way I like to do it is to run: make coverage-test make coverage-report This leaves the repo in a state where every X.c file that was covered has an X.c.gcov file containing the coverage counts for every line, and "#" at every uncovered line. There have been a few bugs in recent patches what would have been caught if the test suite covered those blocks (including a few of mine). I want to work towards a "sensible" amount of coverage on new topics. In my opinion, this means that any logic should be covered, but the 'die()' blocks covering very unlikely (or near-impossible) situations may not warrant coverage. It is important to not measure the coverage of the codebase by what old code is not covered. To help, I created the 'contrib/coverage-diff.sh' script. After creating the coverage statistics at a version (say, 'topic') you can then run contrib/coverage-diff.sh base topic to see the lines added between 'base' and 'topic' that are not covered by the test suite. The output uses 'git blame -c' format so you can find the commits responsible and view the line numbers for quick access to the context. Helped-by: Junio C Hamano Signed-off-by: Derrick Stolee --- contrib/coverage-diff.sh | 79 1 file changed, 79 insertions(+) create mode 100755 contrib/coverage-diff.sh diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh new file mode 100755 index 00..48b9a3ae96 --- /dev/null +++ b/contrib/coverage-diff.sh @@ -0,0 +1,79 @@ +#!/bin/sh + +# Usage: Run 'contrib/coverage-diff.sh ' from source-root +# after running +# +# make coverage-test +# make coverage-report +# +# while checked out at . This script combines the *.gcov files +# generated by the 'make' commands above with 'git diff ' +# to report new lines that are not covered by the test suite. + +V1=$1 +V2=$2 + +diff_lines () { + perl -e ' + my $line_num; + while (<>) { + # Hunk header? Grab the beginning in postimage. + if (/^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/) { + $line_num = $1; + next; + } + + # Have we seen a hunk? Ignore "diff --git" etc. + next unless defined $line_num; + + # Deleted line? Ignore. + if (/^-/) { + next; + } + + # Show only the line number of added lines. + if (/^\+/) { + print "$line_num\n"; + } + # Either common context or added line appear in + # the postimage. Count it. + $line_num++; + } + ' +} + +files=$(git diff --name-only $V1 $V2 -- *.c) + +for file in $files +do + git diff $V1 $V2 -- $file \ + | diff_lines \ + | sort >new_lines.txt + + if ! test -s new_lines.txt + then + continue + fi + + hash_file=$(echo $file | sed "s/\//\#/") + sed -ne '/#:/{ + s/#:// + s/:.*// + s/ //g + p + }' "$hash_file.gcov" \ + | sort >uncovered_lines.txt + + comm -12 uncovered_lines.txt new_lines.txt \ + | sed -e 's/$/\)/' \ + | sed -e 's/^/\t/' \ + >uncovered_new_lines.txt + + grep -q '[^[:space:]]' < uncovered_new_lines.txt && \ + echo $file && \ + git blame -c $file \ + | grep -f uncovered_new_lines.txt + + rm -f new_lines.txt uncovered_lines.txt uncovered_new_lines.txt +done + -- gitgitgadget
[PATCH v3 0/1] contrib: Add script to show uncovered "new" lines
We have coverage targets in our Makefile for using gcov to display line coverage based on our test suite. The way I like to do it is to run: make coverage-test make coverage-report This leaves the repo in a state where every X.c file that was covered has an X.c.gcov file containing the coverage counts for every line, and "#" at every uncovered line. There have been a few bugs in recent patches what would have been caught if the test suite covered those blocks (including a few of mine). I want to work towards a "sensible" amount of coverage on new topics. In my opinion, this means that any logic should be covered, but the 'die()' blocks in error cases do not need to be covered. It is important to not measure the coverage of the codebase by what old code is not covered. To help, I created the 'contrib/coverage-diff.sh' script. After creating the coverage statistics at a version (say, 'topic') you can then run contrib/coverage-diff.sh base topic to see the lines added between 'base' and 'topic' that are not covered by the test suite. For example, I ran this against the 'next' branch (22e244b) versus 'master' (150f307) and got the following output: fsck.c fb8952077df (René Scharfe 2018-09-03 14:49:26 + 212) die_errno("Could not read '%s'", path); list-objects-filter-options.c f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700 56) if (errbuf) { f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700 57) strbuf_init(errbuf, 0); f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700 58) strbuf_addstr( f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700 62) return 1; list-objects-filter.c 77d7a65d502 (Matthew DeVore 2018-09-13 17:55:26 -0700 47) BUG("unknown filter_situation: %d", filter_situation); f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700 100)default: f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700 101) BUG("unknown filter_situation: %d", filter_situation); 77d7a65d502 (Matthew DeVore 2018-09-13 17:55:26 -0700 152) BUG("unknown filter_situation: %d", filter_situation); 77d7a65d502 (Matthew DeVore 2018-09-13 17:55:26 -0700 257) BUG("unknown filter_situation: %d", filter_situation); 77d7a65d502 (Matthew DeVore 2018-09-13 17:55:26 -0700 438) BUG("invalid list-objects filter choice: %d", list-objects.c f447a499dbb (Matthew DeVore 2018-08-13 11:14:28 -0700 197) ctx->show_object(obj, base->buf, ctx->show_data); ll-merge.c d64324cb60e (Torsten Bögershausen 2018-09-12 21:32:02 +0200 379) marker_size = DEFAULT_CONFLICT_MARKER_SIZE; midx.c 56ee7ff1565 (Derrick Stolee 2018-09-13 11:02:13 -0700 949) return 0; cc6af73c029 (Derrick Stolee 2018-09-13 11:02:25 -0700 990) midx_report(_("failed to load pack-index for packfile %s"), cc6af73c029 (Derrick Stolee 2018-09-13 11:02:25 -0700 991) e.p->pack_name); cc6af73c029 (Derrick Stolee 2018-09-13 11:02:25 -0700 992) break; remote-curl.c c3b9bc94b9b (Elijah Newren 2018-09-05 10:03:07 -0700 181) options.filter = xstrdup(value); submodule.c df255b8cac7 (Brandon Williams 2018-08-08 15:33:22 -0700 1738) die(_("could not create directory '%s'"), new_gitdir.buf); Using this 'git blame' output, we can quickly inspect whether the uncovered lines are appropriate. For instance: 1. The line in builtin/commit.c is due to writing the commit-graph file when GIT_TEST_COMMIT_GRAPH is enabled, which is not on by default in the test suite. Being uncovered is expected here. 2. The lines in builtin/worktree.c are all related to error conditions. This is acceptable. 3. The line in builtin/rev-list.c is a flag replacement in a block that is otherwise unchanged. It must not be covered by the test suite normally. This could be worth adding a test to ensure the new logic maintains old behavior. 4. The lines in read-cache.c are part of a new block for the condition "if (expand_name_field)" as part of an optimization. These lines should probably be covered before that series is merged to 'next'. I understand that Ben and Duy are continuing work in this direction [1]. I used this approach for 'next' over 'master' and got a larger list, some of which I have already submitted tests to increase coverage [2] or will be covered by topics not in 'next' [3]. Thanks, -Stolee CHANGES IN V3: I took Junio's perl script verbatim, which speeds up the performance greatly. Some of the other sed commands needed some massaging, but also added extra cleanup. Thanks for the help! [1]
[PATCH v3 2/2] commit-reach: fix memory and flag leaks
From: Derrick Stolee The can_all_from_reach_with_flag() method uses 'assign_flag' as a value we can use to mark objects temporarily during our commit walk. The intent is that these flags are removed from all objects before returning. However, this is not the case. The 'from' array could also contain objects that are not commits, and we mark those objects with 'assign_flag'. Add a loop to the 'cleanup' section that removes these markers. Also, we forgot to free() the memory for 'list', so add that to the 'cleanup' section. Signed-off-by: Derrick Stolee --- commit-reach.c | 5 + 1 file changed, 5 insertions(+) diff --git a/commit-reach.c b/commit-reach.c index e748414d04..5a845440a9 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -626,6 +626,11 @@ cleanup: clear_commit_marks(list[i], RESULT); clear_commit_marks(list[i], assign_flag); } + free(list); + + for (i = 0; i < from->nr; i++) + from->objects[i].item->flags &= ~assign_flag; + return result; } -- gitgitgadget
[PATCH v3 0/2] Properly peel tags in can_all_from_reach_with_flags()
As Peff reported [1], the refactored can_all_from_reach_with_flags() method does not properly peel tags. Since the helper method can_all_from_reach() and code in t/helper/test-reach.c all peel tags before getting to this method, it is not super-simple to create a test that demonstrates this. I modified t/helper/test-reach.c to allow calling can_all_from_reach_with_flags() directly, and added a test in t6600-test-reach.sh that demonstrates the segfault without the fix. For V2, I compared the loop that inspects the 'from' commits in commit ba3ca1edce "commit-reach: move can_all_from_reach_with_flags" to the version here and got the following diff: 3c3 < if (from_one->flags & assign_flag) --- > if (!from_one || from_one->flags & assign_flag) 5c5,7 < from_one = deref_tag(the_repository, from_one, "a from object", 0); --- > > from_one = deref_tag(the_repository, from_one, > "a from object", 0); 14a17,22 > > list[nr_commits] = (struct commit *)from_one; > if (parse_commit(list[nr_commits]) || > list[nr_commits]->generation < min_generation) > return 0; /* is this a leak? */ > nr_commits++; This diff includes the early termination we had before 'deref_tag' and the comment for why we can ignore non-commit objects. [1] https://public-inbox.org/git/0bf9103c-9377-506b-7ad7-e5273d8e9...@gmail.com/T/#u Derrick Stolee (2): commit-reach: properly peel tags commit-reach: fix memory and flag leaks commit-reach.c| 41 ++--- t/helper/test-reach.c | 22 +- t/t6600-test-reach.sh | 30 -- 3 files changed, 79 insertions(+), 14 deletions(-) base-commit: 6621c838743812aaba96e55cfec8524ea1144c2d Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-39%2Fderrickstolee%2Ftag-fix-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-39/derrickstolee/tag-fix-v3 Pull-Request: https://github.com/gitgitgadget/git/pull/39 Range-diff vs v2: 1: 4bf21204dd ! 1: 0a1e661271 commit-reach: properly peel tags @@ -53,8 +53,11 @@ + + list[nr_commits] = (struct commit *)from_one; + if (parse_commit(list[nr_commits]) || -+ list[nr_commits]->generation < min_generation) -+ return 0; /* is this a leak? */ ++ list[nr_commits]->generation < min_generation) { ++ result = 0; ++ goto cleanup; ++ } ++ + nr_commits++; } -: -- > 2: b2e0ee4978 commit-reach: fix memory and flag leaks -- gitgitgadget
[PATCH v3 1/2] commit-reach: properly peel tags
From: Derrick Stolee The can_all_from_reach_with_flag() algorithm was refactored in 4fbcca4e "commit-reach: make can_all_from_reach... linear" but incorrectly assumed that all objects provided were commits. During a fetch negotiation, ok_to_give_up() in upload-pack.c may provide unpeeled tags to the 'from' array. The current code creates a segfault. Add a direct call to can_all_from_reach_with_flag() in 'test-tool reach' and add a test in t6600-test-reach.sh that demonstrates this segfault. Correct the issue by peeling tags when investigating the initial list of objects in the 'from' array. Signed-off-by: Jeff King Signed-off-by: Derrick Stolee --- commit-reach.c| 36 +--- t/helper/test-reach.c | 22 +- t/t6600-test-reach.sh | 30 -- 3 files changed, 74 insertions(+), 14 deletions(-) diff --git a/commit-reach.c b/commit-reach.c index 86715c103c..e748414d04 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -544,20 +544,42 @@ int can_all_from_reach_with_flag(struct object_array *from, { struct commit **list = NULL; int i; + int nr_commits; int result = 1; ALLOC_ARRAY(list, from->nr); + nr_commits = 0; for (i = 0; i < from->nr; i++) { - list[i] = (struct commit *)from->objects[i].item; + struct object *from_one = from->objects[i].item; - if (parse_commit(list[i]) || - list[i]->generation < min_generation) - return 0; + if (!from_one || from_one->flags & assign_flag) + continue; + + from_one = deref_tag(the_repository, from_one, +"a from object", 0); + if (!from_one || from_one->type != OBJ_COMMIT) { + /* no way to tell if this is reachable by +* looking at the ancestry chain alone, so +* leave a note to ourselves not to worry about +* this object anymore. +*/ + from->objects[i].item->flags |= assign_flag; + continue; + } + + list[nr_commits] = (struct commit *)from_one; + if (parse_commit(list[nr_commits]) || + list[nr_commits]->generation < min_generation) { + result = 0; + goto cleanup; + } + + nr_commits++; } - QSORT(list, from->nr, compare_commits_by_gen); + QSORT(list, nr_commits, compare_commits_by_gen); - for (i = 0; i < from->nr; i++) { + for (i = 0; i < nr_commits; i++) { /* DFS from list[i] */ struct commit_list *stack = NULL; @@ -600,7 +622,7 @@ int can_all_from_reach_with_flag(struct object_array *from, } cleanup: - for (i = 0; i < from->nr; i++) { + for (i = 0; i < nr_commits; i++) { clear_commit_marks(list[i], RESULT); clear_commit_marks(list[i], assign_flag); } diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c index eb21103998..08d2ea68e8 100644 --- a/t/helper/test-reach.c +++ b/t/helper/test-reach.c @@ -31,6 +31,7 @@ int cmd__reach(int ac, const char **av) struct object_id oid_A, oid_B; struct commit *A, *B; struct commit_list *X, *Y; + struct object_array X_obj = OBJECT_ARRAY_INIT; struct commit **X_array; int X_nr, X_alloc; struct strbuf buf = STRBUF_INIT; @@ -49,7 +50,8 @@ int cmd__reach(int ac, const char **av) while (strbuf_getline(, stdin) != EOF) { struct object_id oid; - struct object *o; + struct object *orig; + struct object *peeled; struct commit *c; if (buf.len < 3) continue; @@ -57,14 +59,14 @@ int cmd__reach(int ac, const char **av) if (get_oid_committish(buf.buf + 2, )) die("failed to resolve %s", buf.buf + 2); - o = parse_object(r, ); - o = deref_tag_noverify(o); + orig = parse_object(r, ); + peeled = deref_tag_noverify(orig); - if (!o) + if (!peeled) die("failed to load commit for input %s resulting in oid %s\n", buf.buf, oid_to_hex()); - c = object_as_type(r, o, OBJ_COMMIT, 0); + c = object_as_type(r, peeled, OBJ_COMMIT, 0); if (!c) die("failed to load commit for input %s resulting in oid %s\n", @@ -85,6 +87,7 @@ int cmd__reach(int ac, const char **av) commit_list_insert(c, ); ALLOC_GROW(X_array, X_nr + 1, X_alloc);
[PATCH v2 5/6] commit/revisions: bookkeeping before refactoring
From: Derrick Stolee There are a few things that need to move around a little before making a big refactoring in the topo-order logic: 1. We need access to record_author_date() and compare_commits_by_author_date() in revision.c. These are used currently by sort_in_topological_order() in commit.c. 2. Moving these methods to commit.h requires adding the author_slab definition to commit.h. 3. The add_parents_to_list() method in revision.c performs logic around the UNINTERESTING flag and other special cases depending on the struct rev_info. Allow this method to ignore a NULL 'list' parameter, as we will not be populating the list for our walk. Signed-off-by: Derrick Stolee --- commit.c | 11 --- commit.h | 8 revision.c | 6 -- 3 files changed, 16 insertions(+), 9 deletions(-) diff --git a/commit.c b/commit.c index d0f199e122..f68e04b2f1 100644 --- a/commit.c +++ b/commit.c @@ -655,11 +655,8 @@ struct commit *pop_commit(struct commit_list **stack) /* count number of children that have not been emitted */ define_commit_slab(indegree_slab, int); -/* record author-date for each commit object */ -define_commit_slab(author_date_slab, timestamp_t); - -static void record_author_date(struct author_date_slab *author_date, - struct commit *commit) +void record_author_date(struct author_date_slab *author_date, + struct commit *commit) { const char *buffer = get_commit_buffer(commit, NULL); struct ident_split ident; @@ -684,8 +681,8 @@ fail_exit: unuse_commit_buffer(commit, buffer); } -static int compare_commits_by_author_date(const void *a_, const void *b_, - void *cb_data) +int compare_commits_by_author_date(const void *a_, const void *b_, + void *cb_data) { const struct commit *a = a_, *b = b_; struct author_date_slab *author_date = cb_data; diff --git a/commit.h b/commit.h index 2b1a734388..ff0eb5f8ef 100644 --- a/commit.h +++ b/commit.h @@ -8,6 +8,7 @@ #include "gpg-interface.h" #include "string-list.h" #include "pretty.h" +#include "commit-slab.h" #define COMMIT_NOT_FROM_GRAPH 0x #define GENERATION_NUMBER_INFINITY 0x @@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf); */ extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); +/* record author-date for each commit object */ +define_commit_slab(author_date_slab, timestamp_t); + +void record_author_date(struct author_date_slab *author_date, + struct commit *commit); + +int compare_commits_by_author_date(const void *a_, const void *b_, void *unused); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); diff --git a/revision.c b/revision.c index 2dcde8a8ac..92012d5f45 100644 --- a/revision.c +++ b/revision.c @@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, if (p->object.flags & SEEN) continue; p->object.flags |= SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } return 0; } @@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, p->object.flags |= left_flag; if (!(p->object.flags & SEEN)) { p->object.flags |= SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } if (revs->first_parent_only) break; -- gitgitgadget
[PATCH v2 0/6] Use generation numbers for --topo-order
This patch series performs a decently-sized refactoring of the revision-walk machinery. Well, "refactoring" is probably the wrong word, as I don't actually remove the old code. Instead, when we see certain options in the 'rev_info' struct, we redirect the commit-walk logic to a new set of methods that distribute the workload differently. By using generation numbers in the commit-graph, we can significantly improve 'git log --graph' commands (and the underlying 'git rev-list --topo-order'). On the Linux repository, I got the following performance results when comparing to the previous version with or without a commit-graph: Test: git rev-list --topo-order -100 HEAD HEAD~1, no commit-graph: 6.80 s HEAD~1, w/ commit-graph: 0.77 s HEAD, w/ commit-graph: 0.02 s Test: git rev-list --topo-order -100 HEAD -- tools HEAD~1, no commit-graph: 9.63 s HEAD~1, w/ commit-graph: 6.06 s HEAD, w/ commit-graph: 0.06 s If you want to read this series but are unfamiliar with the commit-graph and generation numbers, then I recommend reading Documentation/technical/commit-graph.txt or a blob post [1] I wrote on the subject. In particular, the three-part walk described in "revision.c: refactor basic topo-order logic" is present (but underexplained) as an animated PNG [2]. Since revision.c is an incredibly important (and old) portion of the codebase -- and because there are so many orthogonal options in 'struct rev_info' -- I consider this submission to be "RFC quality". That is, I am not confident that I am not missing anything, or that my solution is the best it can be. I did merge this branch with ds/commit-graph-with-grafts and the "DO-NOT-MERGE: write and read commit-graph always" commit that computes a commit-graph with every 'git commit' command. The test suite passed with that change, available on GitHub [3]. To ensure that I cover at least the case I think are interesting, I added tests to t6600-test-reach.sh to verify the walks report the correct results for the three cases there (no commit-graph, full commit-graph, and a partial commit-graph so the walk starts at GENERATION_NUMBER_INFINITY). One notable case that is not included in this series is the case of a history comparison such as 'git rev-list --topo-order A..B'. The existing code in limit_list() has ways to cut the walk short when all pending commits are UNINTERESTING. Since this code depends on commit_list instead of the prio_queue we are using here, I chose to leave it untouched for now. We can revisit it in a separate series later. Since handle_commit() turns on revs->limited when a commit is UNINTERESTING, we do not hit the new code in this case. Removing this 'revs->limited = 1;' line yields correct results, but the performance is worse. This series was based on ds/reachable, but is now based on 'master' to not conflict with 182070 "commit: use timestamp_t for author_date_slab". There is a small conflict with md/filter-trees, because it renamed a flag in revisions.h in the line before I add new flags. Hopefully this conflict is not too difficult to resolve. Thanks, -Stolee [1] https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/ Supercharging the Git Commit Graph III: Generations and Graph Algorithms [2] https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png Animation showing three-part walk [3] https://github.com/derrickstolee/git/tree/topo-order/testA branch containing this series along with commits to compute commit-graph in entire test suite. Derrick Stolee (6): prio-queue: add 'peek' operation test-reach: add run_three_modes method test-reach: add rev-list tests revision.c: begin refactoring --topo-order logic commit/revisions: bookkeeping before refactoring revision.c: refactor basic topo-order logic commit.c | 11 +- commit.h | 8 ++ object.h | 4 +- prio-queue.c | 9 ++ prio-queue.h | 6 + revision.c | 232 - revision.h | 6 + t/helper/test-prio-queue.c | 10 +- t/t6600-test-reach.sh | 98 +++- 9 files changed, 361 insertions(+), 23 deletions(-) base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-25%2Fderrickstolee%2Ftopo-order%2Fprogress-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-25/derrickstolee/topo-order/progress-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/25 Range-diff vs v1: 1: 5e55669f4d = 1: cc1ec4c270 prio-queue: add 'peek' operation 2: 9628396af1 = 2: 404c918608 test-reach: add run_three_modes method 3: 708b4550a1 = 3: 30dee58c61 test-reach: add rev-list tests 4: 908442417d ! 4: a74ae13d4e revision.c: begin refactoring --topo-order logic @@ -168,4 +168,4 @@ + struct topo_walk_info *topo_walk_info; };
[PATCH v2 3/6] test-reach: add rev-list tests
From: Derrick Stolee The rev-list command is critical to Git's functionality. Ensure it works in the three commit-graph environments constructed in t6600-test-reach.sh. Here are a few important types of rev-list operations: * Basic: git rev-list --topo-order HEAD * Range: git rev-list --topo-order compare..HEAD * Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD * Symmetric Difference: git rev-list --topo-order compare...HEAD Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 84 +++ 1 file changed, 84 insertions(+) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index 1b18e12a4e..2fcaa39077 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' ' test_three_modes commit_contains --tag ' +test_expect_success 'rev-list: basic topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \ + commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes "git rev-list --topo-order commit-6-6" +' + +test_expect_success 'rev-list: first-parent topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes "git rev-list --first-parent --topo-order commit-6-6" +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes "git rev-list --topo-order commit-3-3..commit-6-6" +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes "git rev-list --topo-order commit-3-8..commit-6-6" +' + +test_expect_success 'rev-list: first-parent range topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes "git rev-list --first-parent --topo-order commit-3-8..commit-6-6" +' + +test_expect_success 'rev-list: ancestry-path topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + >expect && + run_three_modes "git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6" +' + +test_expect_success 'rev-list: symmetric difference topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + commit-3-8 commit-2-8 commit-1-8 \ + commit-3-7 commit-2-7 commit-1-7 \ + >expect && + run_three_modes "git rev-list --topo-order commit-3-8...commit-6-6" +' + test_done -- gitgitgadget
[PATCH v2 1/6] prio-queue: add 'peek' operation
From: Derrick Stolee When consuming a priority queue, it can be convenient to inspect the next object that will be dequeued without actually dequeueing it. Our existing library did not have such a 'peek' operation, so add it as prio_queue_peek(). Add a reference-level comparison in t/helper/test-prio-queue.c so this method is exercised by t0009-prio-queue.sh. Signed-off-by: Derrick Stolee --- prio-queue.c | 9 + prio-queue.h | 6 ++ t/helper/test-prio-queue.c | 10 +++--- 3 files changed, 22 insertions(+), 3 deletions(-) diff --git a/prio-queue.c b/prio-queue.c index a078451872..d3f488cb05 100644 --- a/prio-queue.c +++ b/prio-queue.c @@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue) } return result; } + +void *prio_queue_peek(struct prio_queue *queue) +{ + if (!queue->nr) + return NULL; + if (!queue->compare) + return queue->array[queue->nr - 1].data; + return queue->array[0].data; +} diff --git a/prio-queue.h b/prio-queue.h index d030ec9dd6..682e51867a 100644 --- a/prio-queue.h +++ b/prio-queue.h @@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing); */ extern void *prio_queue_get(struct prio_queue *); +/* + * Gain access to the "thing" that would be returned by + * prio_queue_get, but do not remove it from the queue. + */ +extern void *prio_queue_peek(struct prio_queue *); + extern void clear_prio_queue(struct prio_queue *); /* Reverse the LIFO elements */ diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c index 9807b649b1..e817bbf464 100644 --- a/t/helper/test-prio-queue.c +++ b/t/helper/test-prio-queue.c @@ -22,9 +22,13 @@ int cmd__prio_queue(int argc, const char **argv) struct prio_queue pq = { intcmp }; while (*++argv) { - if (!strcmp(*argv, "get")) - show(prio_queue_get()); - else if (!strcmp(*argv, "dump")) { + if (!strcmp(*argv, "get")) { + void *peek = prio_queue_peek(); + void *get = prio_queue_get(); + if (peek != get) + BUG("peek and get results do not match"); + show(get); + } else if (!strcmp(*argv, "dump")) { int *v; while ((v = prio_queue_get())) show(v); -- gitgitgadget
[PATCH v2 6/6] revision.c: refactor basic topo-order logic
From: Derrick Stolee When running a command like 'git rev-list --topo-order HEAD', Git performed the following steps: 1. Run limit_list(), which parses all reachable commits, adds them to a linked list, and distributes UNINTERESTING flags. If all unprocessed commits are UNINTERESTING, then it may terminate without walking all reachable commits. This does not occur if we do not specify UNINTERESTING commits. 2. Run sort_in_topological_order(), which is an implementation of Kahn's algorithm. It first iterates through the entire set of important commits and computes the in-degree of each (plus one, as we use 'zero' as a special value here). Then, we walk the commits in priority order, adding them to the priority queue if and only if their in-degree is one. As we remove commits from this priority queue, we decrement the in-degree of their parents. 3. While we are peeling commits for output, get_revision_1() uses pop_commit on the full list of commits computed by sort_in_topological_order(). In the new algorithm, these three steps correspond to three different commit walks. We run these walks simultaneously, and advance each only as far as necessary to satisfy the requirements of the 'higher order' walk. We know when we can pause each walk by using generation numbers from the commit- graph feature. Recall that the generation number of a commit satisfies: * If the commit has at least one parent, then the generation number is one more than the maximum generation number among its parents. * If the commit has no parent, then the generation number is one. There are two special generation numbers: * GENERATION_NUMBER_INFINITY: this value is 0x and indicates that the commit is not stored in the commit-graph and the generation number was not previously calculated. * GENERATION_NUMBER_ZERO: this value (0) is a special indicator to say that the commit-graph was generated by a version of Git that does not compute generation numbers (such as v2.18.0). Since we use generation_numbers_enabled() before using the new algorithm, we do not need to worry about GENERATION_NUMBER_ZERO. However, the existence of GENERATION_NUMBER_INFINITY implies the following weaker statement than the usual we expect from generation numbers: If A and B are commits with generation numbers gen(A) and gen(B) and gen(A) < gen(B), then A cannot reach B. Thus, we will walk in each of our stages until the "maximum unexpanded generation number" is strictly lower than the generation number of a commit we are about to use. The walks are as follows: 1. EXPLORE: using the explore_queue priority queue (ordered by maximizing the generation number), parse each reachable commit until all commits in the queue have generation number strictly lower than needed. During this walk, update the UNINTERESTING flags as necessary. 2. INDEGREE: using the indegree_queue priority queue (ordered by maximizing the generation number), add one to the in- degree of each parent for each commit that is walked. Since we walk in order of decreasing generation number, we know that discovering an in-degree value of 0 means the value for that commit was not initialized, so should be initialized to two. (Recall that in-degree value "1" is what we use to say a commit is ready for output.) As we iterate the parents of a commit during this walk, ensure the EXPLORE walk has walked beyond their generation numbers. 3. TOPO: using the topo_queue priority queue (ordered based on the sort_order given, which could be commit-date, author- date, or typical topo-order which treats the queue as a LIFO stack), remove a commit from the queue and decrement the in-degree of each parent. If a parent has an in-degree of one, then we add it to the topo_queue. Before we decrement the in-degree, however, ensure the INDEGREE walk has walked beyond that generation number. The implementations of these walks are in the following methods: * explore_walk_step and explore_to_depth * indegree_walk_step and compute_indegrees_to_depth * next_topo_commit and expand_topo_walk These methods have some patterns that may seem strange at first, but they are probably carry-overs from their equivalents in limit_list and sort_in_topological_order. One thing that is missing from this implementation is a proper way to stop walking when the entire queue is UNINTERESTING, so this implementation is not enabled by comparisions, such as in 'git rev-list --topo-order A..B'. This can be updated in the future. In my local testing, I used the following Git commands on the Linux repository in three modes: HEAD~1 with no commit-graph, HEAD~1 with a commit-graph, and HEAD with a commit-graph. This allows comparing the benefits we get from parsing commits from the commit-graph and then again the benefits we get by restricting the set of commits we walk. Test: git rev-list --topo-order -100 HEAD
[PATCH v2 4/6] revision.c: begin refactoring --topo-order logic
From: Derrick Stolee When running 'git rev-list --topo-order' and its kin, the topo_order setting in struct rev_info implies the limited setting. This means that the following things happen during prepare_revision_walk(): * revs->limited implies we run limit_list() to walk the entire reachable set. There are some short-cuts here, such as if we perform a range query like 'git rev-list COMPARE..HEAD' and we can stop limit_list() when all queued commits are uninteresting. * revs->topo_order implies we run sort_in_topological_order(). See the implementation of that method in commit.c. It implies that the full set of commits to order is in the given commit_list. These two methods imply that a 'git rev-list --topo-order HEAD' command must walk the entire reachable set of commits _twice_ before returning a single result. If we have a commit-graph file with generation numbers computed, then there is a better way. This patch introduces some necessary logic redirection when we are in this situation. In v2.18.0, the commit-graph file contains zero-valued bytes in the positions where the generation number is stored in v2.19.0 and later. Thus, we use generation_numbers_enabled() to check if the commit-graph is available and has non-zero generation numbers. When setting revs->limited only because revs->topo_order is true, only do so if generation numbers are not available. There is no reason to use the new logic as it will behave similarly when all generation numbers are INFINITY or ZERO. In prepare_revision_walk(), if we have revs->topo_order but not revs->limited, then we trigger the new logic. It breaks the logic into three pieces, to fit with the existing framework: 1. init_topo_walk() fills a new struct topo_walk_info in the rev_info struct. We use the presence of this struct as a signal to use the new methods during our walk. In this patch, this method simply calls limit_list() and sort_in_topological_order(). In the future, this method will set up a new data structure to perform that logic in-line. 2. next_topo_commit() provides get_revision_1() with the next topo- ordered commit in the list. Currently, this simply pops the commit from revs->commits. 3. expand_topo_walk() provides get_revision_1() with a way to signal walking beyond the latest commit. Currently, this calls add_parents_to_list() exactly like the old logic. While this commit presents method redirection for performing the exact same logic as before, it allows the next commit to focus only on the new logic. Signed-off-by: Derrick Stolee --- revision.c | 42 ++ revision.h | 4 2 files changed, 42 insertions(+), 4 deletions(-) diff --git a/revision.c b/revision.c index e18bd530e4..2dcde8a8ac 100644 --- a/revision.c +++ b/revision.c @@ -25,6 +25,7 @@ #include "worktree.h" #include "argv-array.h" #include "commit-reach.h" +#include "commit-graph.h" volatile show_early_output_fn_t show_early_output; @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s if (revs->diffopt.objfind) revs->simplify_history = 0; - if (revs->topo_order) + if (revs->topo_order && !generation_numbers_enabled(the_repository)) revs->limited = 1; if (revs->prune_data.nr) { @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid, return 0; } +struct topo_walk_info {}; + +static void init_topo_walk(struct rev_info *revs) +{ + struct topo_walk_info *info; + revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info)); + info = revs->topo_walk_info; + memset(info, 0, sizeof(struct topo_walk_info)); + + limit_list(revs); + sort_in_topological_order(>commits, revs->sort_order); +} + +static struct commit *next_topo_commit(struct rev_info *revs) +{ + return pop_commit(>commits); +} + +static void expand_topo_walk(struct rev_info *revs, struct commit *commit) +{ + if (add_parents_to_list(revs, commit, >commits, NULL) < 0) { + if (!revs->ignore_missing_links) + die("Failed to traverse parents of commit %s", + oid_to_hex(>object.oid)); + } +} + int prepare_revision_walk(struct rev_info *revs) { int i; @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs) commit_list_sort_by_date(>commits); if (revs->no_walk) return 0; - if (revs->limited) + if (revs->limited) { if (limit_list(revs) < 0) return -1; - if (revs->topo_order) - sort_in_topological_order(>commits, revs->sort_order); + if (revs->topo_order) + sort_in_topological_order(>commits, revs->sort_order); + } else if (revs->topo_order) + init_topo_walk(revs); if
[PATCH v2 2/6] test-reach: add run_three_modes method
From: Derrick Stolee The 'test_three_modes' method assumes we are using the 'test-tool reach' command for our test. However, we may want to use the data shape of our commit graph and the three modes (no commit-graph, full commit-graph, partial commit-graph) for other git commands. Split test_three_modes to be a simple translation on a more general run_three_modes method that executes the given command and tests the actual output to the expected output. While inspecting this code, I realized that the final test for 'commit_contains --tag' is silently dropping the '--tag' argument. It should be quoted to include both. Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index d139a00d1d..1b18e12a4e 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -53,18 +53,22 @@ test_expect_success 'setup' ' git config core.commitGraph true ' -test_three_modes () { +run_three_modes () { test_when_finished rm -rf .git/objects/info/commit-graph && - test-tool reach $1 actual && + $1 actual && test_cmp expect actual && cp commit-graph-full .git/objects/info/commit-graph && - test-tool reach $1 actual && + $1 actual && test_cmp expect actual && cp commit-graph-half .git/objects/info/commit-graph && - test-tool reach $1 actual && + $1 actual && test_cmp expect actual } +test_three_modes () { + run_three_modes "test-tool reach $1" +} + test_expect_success 'ref_newer:miss' ' cat >input <<-\EOF && A:commit-5-7 @@ -219,7 +223,7 @@ test_expect_success 'commit_contains:hit' ' EOF echo "commit_contains(_,A,X,_):1" >expect && test_three_modes commit_contains && - test_three_modes commit_contains --tag + test_three_modes "commit_contains --tag" ' test_expect_success 'commit_contains:miss' ' -- gitgitgadget
[PATCH v2 01/11] multi-pack-index: add 'verify' verb
From: Derrick Stolee The multi-pack-index builtin writes multi-pack-index files, and uses a 'write' verb to do so. Add a 'verify' verb that checks this file matches the contents of the pack-indexes it replaces. The current implementation is a no-op, but will be extended in small increments in later commits. Signed-off-by: Derrick Stolee --- Documentation/git-multi-pack-index.txt | 10 ++ builtin/multi-pack-index.c | 4 +++- midx.c | 13 + midx.h | 1 + t/t5319-multi-pack-index.sh| 8 5 files changed, 35 insertions(+), 1 deletion(-) diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index 1f97e79912..f7778a2c85 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -27,6 +27,10 @@ write:: When given as the verb, write a new MIDX file to `/packs/multi-pack-index`. +verify:: + When given as the verb, verify the contents of the MIDX file + at `/packs/multi-pack-index`. + EXAMPLES @@ -43,6 +47,12 @@ $ git multi-pack-index write $ git multi-pack-index --object-dir write --- +* Verify the MIDX file for the packfiles in the current .git folder. ++ +--- +$ git multi-pack-index verify +--- + SEE ALSO diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index 2633efd95d..fca70f8e4f 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -5,7 +5,7 @@ #include "midx.h" static char const * const builtin_multi_pack_index_usage[] = { - N_("git multi-pack-index [--object-dir=] write"), + N_("git multi-pack-index [--object-dir=] (write|verify)"), NULL }; @@ -42,6 +42,8 @@ int cmd_multi_pack_index(int argc, const char **argv, if (!strcmp(argv[0], "write")) return write_midx_file(opts.object_dir); + if (!strcmp(argv[0], "verify")) + return verify_midx_file(opts.object_dir); die(_("unrecognized verb: %s"), argv[0]); } diff --git a/midx.c b/midx.c index f3e8dbc108..b253bed517 100644 --- a/midx.c +++ b/midx.c @@ -928,3 +928,16 @@ void clear_midx_file(const char *object_dir) free(midx); } + +int verify_midx_error; + +int verify_midx_file(const char *object_dir) +{ + struct multi_pack_index *m = load_multi_pack_index(object_dir, 1); + verify_midx_error = 0; + + if (!m) + return 0; + + return verify_midx_error; +} diff --git a/midx.h b/midx.h index a210f1af2a..ce80b91c68 100644 --- a/midx.h +++ b/midx.h @@ -43,5 +43,6 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i int write_midx_file(const char *object_dir); void clear_midx_file(const char *object_dir); +int verify_midx_file(const char *object_dir); #endif diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 6f56b38674..1c4e0e6d31 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -150,6 +150,10 @@ test_expect_success 'write midx with twelve packs' ' compare_results_with_midx "twelve packs" +test_expect_success 'verify multi-pack-index success' ' + git multi-pack-index verify --object-dir=$objdir +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && @@ -214,4 +218,8 @@ test_expect_success 'force some 64-bit offsets with pack-objects' ' midx_read_expect 1 63 5 objects64 " large-offsets" ' +test_expect_success 'verify multi-pack-index with 64-bit offsets' ' + git multi-pack-index verify --object-dir=objects64 +' + test_done -- gitgitgadget
[PATCH v2 10/11] multi-pack-index: report progress during 'verify'
From: Derrick Stolee When verifying a multi-pack-index, the only action that takes significant time is checking the object offsets. For example, to verify a multi-pack-index containing 6.2 million objects in the Linux kernel repository takes 1.3 seconds on my machine. 99% of that time is spent looking up object offsets in each of the packfiles and comparing them to the multi-pack-index offset. Add a progress indicator for that section of the 'verify' verb. Signed-off-by: Derrick Stolee --- midx.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/midx.c b/midx.c index 47e7e6113a..4d4c930522 100644 --- a/midx.c +++ b/midx.c @@ -7,6 +7,7 @@ #include "object-store.h" #include "sha1-lookup.h" #include "midx.h" +#include "progress.h" #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */ #define MIDX_VERSION 1 @@ -940,6 +941,7 @@ static void midx_report(const char *fmt, ...) int verify_midx_file(const char *object_dir) { uint32_t i; + struct progress *progress = NULL; struct multi_pack_index *m = load_multi_pack_index(object_dir, 1); verify_midx_error = 0; @@ -971,6 +973,7 @@ int verify_midx_file(const char *object_dir) i, oid_to_hex(), oid_to_hex(), i + 1); } + progress = start_progress(_("Verifying object offsets"), m->num_objects); for (i = 0; i < m->num_objects; i++) { struct object_id oid; struct pack_entry e; @@ -995,7 +998,10 @@ int verify_midx_file(const char *object_dir) if (m_offset != p_offset) midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64), i, oid_to_hex(), m_offset, p_offset); + + display_progress(progress, i + 1); } + stop_progress(); return verify_midx_error; } -- gitgitgadget
[PATCH v2 04/11] multi-pack-index: verify packname order
From: Derrick Stolee The final check we make while loading a multi-pack-index is that the packfile names are in lexicographical order. Make this error be a die() instead. In order to test this condition, we need multiple packfiles. Earlier in t5319-multi-pack-index.sh, we tested the interaction with 'git repack' but this limits us to one packfile in our object dir. Move these repack tests until after the 'verify' tests. Signed-off-by: Derrick Stolee --- midx.c | 6 ++ t/t5319-multi-pack-index.sh | 10 ++ 2 files changed, 12 insertions(+), 4 deletions(-) diff --git a/midx.c b/midx.c index 8b054b39ab..e655a15aed 100644 --- a/midx.c +++ b/midx.c @@ -157,12 +157,10 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local cur_pack_name += strlen(cur_pack_name) + 1; - if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) { - error(_("multi-pack-index pack names out of order: '%s' before '%s'"), + if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) + die(_("multi-pack-index pack names out of order: '%s' before '%s'"), m->pack_names[i - 1], m->pack_names[i]); - goto cleanup_fail; - } } return m; diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index c54b6e7188..01a3cd6b00 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -181,6 +181,11 @@ MIDX_BYTE_CHUNK_COUNT=6 MIDX_HEADER_SIZE=12 MIDX_BYTE_CHUNK_ID=$MIDX_HEADER_SIZE MIDX_BYTE_CHUNK_OFFSET=$(($MIDX_HEADER_SIZE + 4)) +MIDX_NUM_CHUNKS=5 +MIDX_CHUNK_LOOKUP_WIDTH=12 +MIDX_OFFSET_PACKNAMES=$(($MIDX_HEADER_SIZE + \ +$MIDX_NUM_CHUNKS * $MIDX_CHUNK_LOOKUP_WIDTH)) +MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2)) test_expect_success 'verify bad version' ' corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ @@ -212,6 +217,11 @@ test_expect_success 'verify invalid chunk offset' ' "invalid chunk offset (too large)" ' +test_expect_success 'verify packnames out of order' ' + corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "z" $objdir \ + "pack names out of order" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH v2 09/11] multi-pack-index: verify object offsets
From: Derrick Stolee The 'git multi-pack-index verify' command must verify the object offsets stored in the multi-pack-index are correct. There are two ways the offset chunk can be incorrect: the pack-int-id and the object offset. Replace the BUG() statement with a die() statement, now that we may hit a bad pack-int-id during a 'verify' command on a corrupt multi-pack-index, and it is covered by a test. Signed-off-by: Derrick Stolee --- midx.c | 29 - t/t5319-multi-pack-index.sh | 27 +++ 2 files changed, 55 insertions(+), 1 deletion(-) diff --git a/midx.c b/midx.c index 80094c02a7..47e7e6113a 100644 --- a/midx.c +++ b/midx.c @@ -197,7 +197,8 @@ int prepare_midx_pack(struct multi_pack_index *m, uint32_t pack_int_id) struct strbuf pack_name = STRBUF_INIT; if (pack_int_id >= m->num_packs) - BUG("bad pack-int-id"); + die(_("bad pack-int-id: %u (%u total packs"), + pack_int_id, m->num_packs); if (m->packs[pack_int_id]) return 0; @@ -970,5 +971,31 @@ int verify_midx_file(const char *object_dir) i, oid_to_hex(), oid_to_hex(), i + 1); } + for (i = 0; i < m->num_objects; i++) { + struct object_id oid; + struct pack_entry e; + off_t m_offset, p_offset; + + nth_midxed_object_oid(, m, i); + if (!fill_midx_entry(, , m)) { + midx_report(_("failed to load pack entry for oid[%d] = %s"), + i, oid_to_hex()); + continue; + } + + if (open_pack_index(e.p)) { + midx_report(_("failed to load pack-index for packfile %s"), + e.p->pack_name); + break; + } + + m_offset = e.offset; + p_offset = find_pack_entry_one(oid.hash, e.p); + + if (m_offset != p_offset) + midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64), + i, oid_to_hex(), m_offset, p_offset); + } + return verify_midx_error; } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index a968b9a959..828c240389 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -176,6 +176,7 @@ test_expect_success 'verify bad signature' ' ' HASH_LEN=20 +NUM_OBJECTS=74 MIDX_BYTE_VERSION=4 MIDX_BYTE_OID_VERSION=5 MIDX_BYTE_CHUNK_COUNT=6 @@ -192,6 +193,10 @@ MIDX_OID_FANOUT_WIDTH=4 MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * $MIDX_OID_FANOUT_WIDTH + 1)) MIDX_OFFSET_OID_LOOKUP=$(($MIDX_OFFSET_OID_FANOUT + 256 * $MIDX_OID_FANOUT_WIDTH)) MIDX_BYTE_OID_LOOKUP=$(($MIDX_OFFSET_OID_LOOKUP + 16 * $HASH_LEN)) +MIDX_OFFSET_OBJECT_OFFSETS=$(($MIDX_OFFSET_OID_LOOKUP + $NUM_OBJECTS * $HASH_LEN)) +MIDX_OFFSET_WIDTH=8 +MIDX_BYTE_PACK_INT_ID=$(($MIDX_OFFSET_OBJECT_OFFSETS + 16 * $MIDX_OFFSET_WIDTH + 2)) +MIDX_BYTE_OFFSET=$(($MIDX_OFFSET_OBJECT_OFFSETS + 16 * $MIDX_OFFSET_WIDTH + 6)) test_expect_success 'verify bad version' ' corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ @@ -243,6 +248,16 @@ test_expect_success 'verify oid lookup out of order' ' "oid lookup out of order" ' +test_expect_success 'verify incorrect pack-int-id' ' + corrupt_midx_and_verify $MIDX_BYTE_PACK_INT_ID "\07" $objdir \ + "bad pack-int-id" +' + +test_expect_success 'verify incorrect offset' ' + corrupt_midx_and_verify $MIDX_BYTE_OFFSET "\07" $objdir \ + "incorrect object offset" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && @@ -310,4 +325,16 @@ test_expect_success 'verify multi-pack-index with 64-bit offsets' ' git multi-pack-index verify --object-dir=objects64 ' +NUM_OBJECTS=63 +MIDX_OFFSET_OID_FANOUT=$((MIDX_OFFSET_PACKNAMES + 54)) +MIDX_OFFSET_OID_LOOKUP=$((MIDX_OFFSET_OID_FANOUT + 256 * $MIDX_OID_FANOUT_WIDTH)) +MIDX_OFFSET_OBJECT_OFFSETS=$(($MIDX_OFFSET_OID_LOOKUP + $NUM_OBJECTS * $HASH_LEN)) +MIDX_OFFSET_LARGE_OFFSETS=$(($MIDX_OFFSET_OBJECT_OFFSETS + $NUM_OBJECTS * $MIDX_OFFSET_WIDTH)) +MIDX_BYTE_LARGE_OFFSET=$(($MIDX_OFFSET_LARGE_OFFSETS + 3)) + +test_expect_success 'verify incorrect 64-bit offset' ' + corrupt_midx_and_verify $MIDX_BYTE_LARGE_OFFSET "\07" objects64 \ + "incorrect object offset" +' + test_done -- gitgitgadget
[PATCH v2 08/11] multi-pack-index: fix 32-bit vs 64-bit size check
From: Derrick Stolee When loading a 64-bit offset, we intend to check that off_t can store the resulting offset. However, the condition accidentally checks the 32-bit offset to see if it is smaller than a 64-bit value. Fix it, and this will be covered by a test in the 'git multi-pack-index verify' command in a later commit. Signed-off-by: Derrick Stolee --- midx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/midx.c b/midx.c index 06d5cfc826..80094c02a7 100644 --- a/midx.c +++ b/midx.c @@ -236,7 +236,7 @@ static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos) offset32 = get_be32(offset_data + sizeof(uint32_t)); if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) { - if (sizeof(offset32) < sizeof(uint64_t)) + if (sizeof(off_t) < sizeof(uint64_t)) die(_("multi-pack-index stores a 64-bit offset, but off_t is too small")); offset32 ^= MIDX_LARGE_OFFSET_NEEDED; -- gitgitgadget
[PATCH v2 05/11] multi-pack-index: verify missing pack
From: Derrick Stolee Signed-off-by: Derrick Stolee --- midx.c | 16 t/t5319-multi-pack-index.sh | 5 + 2 files changed, 21 insertions(+) diff --git a/midx.c b/midx.c index e655a15aed..a02b19efc1 100644 --- a/midx.c +++ b/midx.c @@ -926,13 +926,29 @@ void clear_midx_file(const char *object_dir) int verify_midx_error; +static void midx_report(const char *fmt, ...) +{ + va_list ap; + verify_midx_error = 1; + va_start(ap, fmt); + vfprintf(stderr, fmt, ap); + fprintf(stderr, "\n"); + va_end(ap); +} + int verify_midx_file(const char *object_dir) { + uint32_t i; struct multi_pack_index *m = load_multi_pack_index(object_dir, 1); verify_midx_error = 0; if (!m) return 0; + for (i = 0; i < m->num_packs; i++) { + if (prepare_midx_pack(m, i)) + midx_report("failed to load pack in position %d", i); + } + return verify_midx_error; } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 01a3cd6b00..0a566afb05 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -222,6 +222,11 @@ test_expect_success 'verify packnames out of order' ' "pack names out of order" ' +test_expect_success 'verify packnames out of order' ' + corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "a" $objdir \ + "failed to load pack" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH v2 11/11] fsck: verify multi-pack-index
From: Derrick Stolee When core.multiPackIndex is true, we may have a multi-pack-index in our object directory. Add calls to 'git multi-pack-index verify' at the end of 'git fsck' if so. Signed-off-by: Derrick Stolee --- builtin/fsck.c | 18 ++ t/t5319-multi-pack-index.sh | 13 - 2 files changed, 30 insertions(+), 1 deletion(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 63c8578cc1..06eb421720 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -848,5 +848,23 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) } } + if (!git_config_get_bool("core.multipackindex", ) && i) { + struct child_process midx_verify = CHILD_PROCESS_INIT; + const char *midx_argv[] = { "multi-pack-index", "verify", NULL, NULL, NULL }; + + midx_verify.argv = midx_argv; + midx_verify.git_cmd = 1; + if (run_command(_verify)) + errors_found |= ERROR_COMMIT_GRAPH; + + prepare_alt_odb(the_repository); + for (alt = the_repository->objects->alt_odb_list; alt; alt = alt->next) { + midx_argv[2] = "--object-dir"; + midx_argv[3] = alt->path; + if (run_command(_verify)) + errors_found |= ERROR_COMMIT_GRAPH; + } + } + return errors_found; } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 828c240389..bd8e841b81 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -160,12 +160,17 @@ corrupt_midx_and_verify() { DATA="${2:-\0}" && OBJDIR=$3 && GREPSTR="$4" && + COMMAND="$5" && + if test -z "$COMMAND" + then + COMMAND="git multi-pack-index verify --object-dir=$OBJDIR" + fi && FILE=$OBJDIR/pack/multi-pack-index && chmod a+w $FILE && test_when_finished mv midx-backup $FILE && cp $FILE midx-backup && printf "$DATA" | dd of="$FILE" bs=1 seek="$POS" conv=notrunc && - test_must_fail git multi-pack-index verify --object-dir=$OBJDIR 2>test_err && + test_must_fail $COMMAND 2>test_err && grep -v "^+" test_err >err && test_i18ngrep "$GREPSTR" err } @@ -258,6 +263,12 @@ test_expect_success 'verify incorrect offset' ' "incorrect object offset" ' +test_expect_success 'git-fsck incorrect offset' ' + corrupt_midx_and_verify $MIDX_BYTE_OFFSET "\07" $objdir \ + "incorrect object offset" \ + "git -c core.multipackindex=true fsck" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH v2 07/11] multi-pack-index: verify oid lookup order
From: Derrick Stolee Signed-off-by: Derrick Stolee --- midx.c | 11 +++ t/t5319-multi-pack-index.sh | 8 2 files changed, 19 insertions(+) diff --git a/midx.c b/midx.c index dfd26b4d74..06d5cfc826 100644 --- a/midx.c +++ b/midx.c @@ -959,5 +959,16 @@ int verify_midx_file(const char *object_dir) i, oid_fanout1, oid_fanout2, i + 1); } + for (i = 0; i < m->num_objects - 1; i++) { + struct object_id oid1, oid2; + + nth_midxed_object_oid(, m, i); + nth_midxed_object_oid(, m, i + 1); + + if (oidcmp(, ) >= 0) + midx_report(_("oid lookup out of order: oid[%d] = %s >= %s = oid[%d]"), + i, oid_to_hex(), oid_to_hex(), i + 1); + } + return verify_midx_error; } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 47a54e138d..a968b9a959 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -175,6 +175,7 @@ test_expect_success 'verify bad signature' ' "multi-pack-index signature" ' +HASH_LEN=20 MIDX_BYTE_VERSION=4 MIDX_BYTE_OID_VERSION=5 MIDX_BYTE_CHUNK_COUNT=6 @@ -189,6 +190,8 @@ MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2)) MIDX_OFFSET_OID_FANOUT=$(($MIDX_OFFSET_PACKNAMES + 652)) MIDX_OID_FANOUT_WIDTH=4 MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * $MIDX_OID_FANOUT_WIDTH + 1)) +MIDX_OFFSET_OID_LOOKUP=$(($MIDX_OFFSET_OID_FANOUT + 256 * $MIDX_OID_FANOUT_WIDTH)) +MIDX_BYTE_OID_LOOKUP=$(($MIDX_OFFSET_OID_LOOKUP + 16 * $HASH_LEN)) test_expect_success 'verify bad version' ' corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ @@ -235,6 +238,11 @@ test_expect_success 'verify oid fanout out of order' ' "oid fanout out of order" ' +test_expect_success 'verify oid lookup out of order' ' + corrupt_midx_and_verify $MIDX_BYTE_OID_LOOKUP "\00" $objdir \ + "oid lookup out of order" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH v2 03/11] multi-pack-index: verify corrupt chunk lookup table
From: Derrick Stolee Signed-off-by: Derrick Stolee --- midx.c | 3 +++ t/t5319-multi-pack-index.sh | 13 + 2 files changed, 16 insertions(+) diff --git a/midx.c b/midx.c index ec78254bb6..8b054b39ab 100644 --- a/midx.c +++ b/midx.c @@ -100,6 +100,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 + MIDX_CHUNKLOOKUP_WIDTH * i); + if (chunk_offset >= m->data_len) + die(_("invalid chunk offset (too large)")); + switch (chunk_id) { case MIDX_CHUNKID_PACKNAMES: m->chunk_pack_names = m->data + chunk_offset; diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index e04b5f43a2..c54b6e7188 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -178,6 +178,9 @@ test_expect_success 'verify bad signature' ' MIDX_BYTE_VERSION=4 MIDX_BYTE_OID_VERSION=5 MIDX_BYTE_CHUNK_COUNT=6 +MIDX_HEADER_SIZE=12 +MIDX_BYTE_CHUNK_ID=$MIDX_HEADER_SIZE +MIDX_BYTE_CHUNK_OFFSET=$(($MIDX_HEADER_SIZE + 4)) test_expect_success 'verify bad version' ' corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ @@ -199,6 +202,16 @@ test_expect_success 'verify extended chunk count' ' "terminating multi-pack-index chunk id appears earlier than expected" ' +test_expect_success 'verify missing required chunk' ' + corrupt_midx_and_verify $MIDX_BYTE_CHUNK_ID "\01" $objdir \ + "missing required" +' + +test_expect_success 'verify invalid chunk offset' ' + corrupt_midx_and_verify $MIDX_BYTE_CHUNK_OFFSET "\01" $objdir \ + "invalid chunk offset (too large)" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH v2 02/11] multi-pack-index: verify bad header
From: Derrick Stolee When verifying if a multi-pack-index file is valid, we want the command to fail to signal an invalid file. Previously, we wrote an error to stderr and continued as if we had no multi-pack-index. Now, die() instead of error(). Add tests that check corrupted headers in a few ways: * Bad signature * Bad file version * Bad hash version * Truncated hash count * Extended hash count Signed-off-by: Derrick Stolee --- midx.c | 18 +-- t/t5319-multi-pack-index.sh | 46 - 2 files changed, 51 insertions(+), 13 deletions(-) diff --git a/midx.c b/midx.c index b253bed517..ec78254bb6 100644 --- a/midx.c +++ b/midx.c @@ -76,24 +76,18 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local m->local = local; m->signature = get_be32(m->data); - if (m->signature != MIDX_SIGNATURE) { - error(_("multi-pack-index signature 0x%08x does not match signature 0x%08x"), + if (m->signature != MIDX_SIGNATURE) + die(_("multi-pack-index signature 0x%08x does not match signature 0x%08x"), m->signature, MIDX_SIGNATURE); - goto cleanup_fail; - } m->version = m->data[MIDX_BYTE_FILE_VERSION]; - if (m->version != MIDX_VERSION) { - error(_("multi-pack-index version %d not recognized"), + if (m->version != MIDX_VERSION) + die(_("multi-pack-index version %d not recognized"), m->version); - goto cleanup_fail; - } hash_version = m->data[MIDX_BYTE_HASH_VERSION]; - if (hash_version != MIDX_HASH_VERSION) { - error(_("hash version %u does not match"), hash_version); - goto cleanup_fail; - } + if (hash_version != MIDX_HASH_VERSION) + die(_("hash version %u does not match"), hash_version); m->hash_len = MIDX_HASH_LEN; m->num_chunks = m->data[MIDX_BYTE_NUM_CHUNKS]; diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 1c4e0e6d31..e04b5f43a2 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -154,6 +154,51 @@ test_expect_success 'verify multi-pack-index success' ' git multi-pack-index verify --object-dir=$objdir ' +# usage: corrupt_midx_and_verify +corrupt_midx_and_verify() { + POS=$1 && + DATA="${2:-\0}" && + OBJDIR=$3 && + GREPSTR="$4" && + FILE=$OBJDIR/pack/multi-pack-index && + chmod a+w $FILE && + test_when_finished mv midx-backup $FILE && + cp $FILE midx-backup && + printf "$DATA" | dd of="$FILE" bs=1 seek="$POS" conv=notrunc && + test_must_fail git multi-pack-index verify --object-dir=$OBJDIR 2>test_err && + grep -v "^+" test_err >err && + test_i18ngrep "$GREPSTR" err +} + +test_expect_success 'verify bad signature' ' + corrupt_midx_and_verify 0 "\00" $objdir \ + "multi-pack-index signature" +' + +MIDX_BYTE_VERSION=4 +MIDX_BYTE_OID_VERSION=5 +MIDX_BYTE_CHUNK_COUNT=6 + +test_expect_success 'verify bad version' ' + corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ + "multi-pack-index version" +' + +test_expect_success 'verify bad OID version' ' + corrupt_midx_and_verify $MIDX_BYTE_OID_VERSION "\02" $objdir \ + "hash version" +' + +test_expect_success 'verify truncated chunk count' ' + corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\01" $objdir \ + "missing required" +' + +test_expect_success 'verify extended chunk count' ' + corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\07" $objdir \ + "terminating multi-pack-index chunk id appears earlier than expected" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && @@ -191,7 +236,6 @@ test_expect_success 'multi-pack-index in an alternate' ' compare_results_with_midx "with alternate (remote midx)" - # usage: corrupt_data [] corrupt_data () { file=$1 -- gitgitgadget
[PATCH v2 06/11] multi-pack-index: verify oid fanout order
From: Derrick Stolee Signed-off-by: Derrick Stolee --- midx.c | 9 + t/t5319-multi-pack-index.sh | 8 2 files changed, 17 insertions(+) diff --git a/midx.c b/midx.c index a02b19efc1..dfd26b4d74 100644 --- a/midx.c +++ b/midx.c @@ -950,5 +950,14 @@ int verify_midx_file(const char *object_dir) midx_report("failed to load pack in position %d", i); } + for (i = 0; i < 255; i++) { + uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]); + uint32_t oid_fanout2 = ntohl(m->chunk_oid_fanout[i + 1]); + + if (oid_fanout1 > oid_fanout2) + midx_report(_("oid fanout out of order: fanout[%d] = %"PRIx32" > %"PRIx32" = fanout[%d]"), + i, oid_fanout1, oid_fanout2, i + 1); + } + return verify_midx_error; } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 0a566afb05..47a54e138d 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -186,6 +186,9 @@ MIDX_CHUNK_LOOKUP_WIDTH=12 MIDX_OFFSET_PACKNAMES=$(($MIDX_HEADER_SIZE + \ $MIDX_NUM_CHUNKS * $MIDX_CHUNK_LOOKUP_WIDTH)) MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2)) +MIDX_OFFSET_OID_FANOUT=$(($MIDX_OFFSET_PACKNAMES + 652)) +MIDX_OID_FANOUT_WIDTH=4 +MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * $MIDX_OID_FANOUT_WIDTH + 1)) test_expect_success 'verify bad version' ' corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ @@ -227,6 +230,11 @@ test_expect_success 'verify packnames out of order' ' "failed to load pack" ' +test_expect_success 'verify oid fanout out of order' ' + corrupt_midx_and_verify $MIDX_BYTE_OID_FANOUT_ORDER "\01" $objdir \ + "oid fanout out of order" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH v2 00/11] Add 'git multi-pack-index verify' command
The multi-pack-index file provides faster lookups in repos with many packfiles by duplicating the information from multiple pack-indexes into a single file. This series allows us to verify a multi-pack-index using 'git multi-pack-index verify' and 'git fsck' (when core.multiPackIndex is true). The pattern for the tests is similar to that found in t5318-commit-graph.sh. During testing, I found a bug in how we check for the size of off_t (we are not actually checking off_t, but instead uint32_t). See "multi-pack-index: fix 32-bit vs 64-bit size check". Thanks to Ævar [1], I added a commit that provides progress updates when checking object offsets. Based on ds/multi-pack-index [1] https://public-inbox.org/git/20180904202729.13900-1-ava...@gmail.com/T/#u Derrick Stolee (11): multi-pack-index: add 'verify' verb multi-pack-index: verify bad header multi-pack-index: verify corrupt chunk lookup table multi-pack-index: verify packname order multi-pack-index: verify missing pack multi-pack-index: verify oid fanout order multi-pack-index: verify oid lookup order multi-pack-index: fix 32-bit vs 64-bit size check multi-pack-index: verify object offsets multi-pack-index: report progress during 'verify' fsck: verify multi-pack-index Documentation/git-multi-pack-index.txt | 10 ++ builtin/fsck.c | 18 builtin/multi-pack-index.c | 4 +- midx.c | 113 midx.h | 1 + t/t5319-multi-pack-index.sh| 136 - 6 files changed, 262 insertions(+), 20 deletions(-) base-commit: 6a22d521260f86dff8fe6f23ab329cebb62ba4f0 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-34%2Fderrickstolee%2Fmidx%2Fverify-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-34/derrickstolee/midx/verify-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/34 Range-diff vs v1: 1: 8dc38afe2b ! 1: d8ffd84d67 multi-pack-index: add 'verify' verb @@ -47,7 +47,7 @@ static char const * const builtin_multi_pack_index_usage[] = { - N_("git multi-pack-index [--object-dir=] write"), -+ N_("git multi-pack-index [--object-dir=] [write|verify]"), ++ N_("git multi-pack-index [--object-dir=] (write|verify)"), NULL }; 2: 787e1fb616 ! 2: 9590895830 multi-pack-index: verify bad header @@ -61,10 +61,10 @@ +# usage: corrupt_midx_and_verify +corrupt_midx_and_verify() { -+ POS=$1 -+ DATA="${2:-\0}" -+ OBJDIR=$3 -+ GREPSTR="$4" ++ POS=$1 && ++ DATA="${2:-\0}" && ++ OBJDIR=$3 && ++ GREPSTR="$4" && + FILE=$OBJDIR/pack/multi-pack-index && + chmod a+w $FILE && + test_when_finished mv midx-backup $FILE && 3: b385aa2abf = 3: 2448173844 multi-pack-index: verify corrupt chunk lookup table 4: 37ee24c82b = 4: 947241bfdc multi-pack-index: verify packname order 5: b747da415c = 5: 4058867380 multi-pack-index: verify missing pack 6: 58e5c09468 = 6: ea1c522702 multi-pack-index: verify oid fanout order 7: b21772d054 = 7: 511791de91 multi-pack-index: verify oid lookup order 8: b08d3f0055 = 8: 210649bf83 multi-pack-index: fix 32-bit vs 64-bit size check 9: e1498aea45 ! 9: ef20193d59 multi-pack-index: verify object offsets @@ -21,7 +21,8 @@ if (pack_int_id >= m->num_packs) - BUG("bad pack-int-id"); -+ die(_("bad pack-int-id")); ++ die(_("bad pack-int-id: %u (%u total packs"), ++ pack_int_id, m->num_packs); if (m->packs[pack_int_id]) return 0; 10: acf8cfd632 = 10: 29ebc17161 multi-pack-index: report progress during 'verify' 11: 09d16aff20 ! 11: 406c88b456 fsck: verify multi-pack-index @@ -40,14 +40,14 @@ --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ - DATA="${2:-\0}" - OBJDIR=$3 - GREPSTR="$4" -+ COMMAND="$5" + DATA="${2:-\0}" && + OBJDIR=$3 && + GREPSTR="$4" && ++ COMMAND="$5" && + if test -z "$COMMAND" + then + COMMAND="git multi-pack-index verify --object-dir=$OBJDIR" -+ fi ++ fi && FILE=$OBJDIR/pack/multi-pack-index && chmod a+w $FILE && test_when_finished mv midx-backup $FILE && -- gitgitgadget
[PATCH v2 1/1] commit-reach: properly peel tags
From: Derrick Stolee The can_all_from_reach_with_flag() algorithm was refactored in 4fbcca4e "commit-reach: make can_all_from_reach... linear" but incorrectly assumed that all objects provided were commits. During a fetch negotiation, ok_to_give_up() in upload-pack.c may provide unpeeled tags to the 'from' array. The current code creates a segfault. Add a direct call to can_all_from_reach_with_flag() in 'test-tool reach' and add a test in t6600-test-reach.sh that demonstrates this segfault. Correct the issue by peeling tags when investigating the initial list of objects in the 'from' array. Signed-off-by: Jeff King Signed-off-by: Derrick Stolee --- commit-reach.c| 33 ++--- t/helper/test-reach.c | 22 +- t/t6600-test-reach.sh | 30 -- 3 files changed, 71 insertions(+), 14 deletions(-) diff --git a/commit-reach.c b/commit-reach.c index 86715c103c..4048a2132a 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -544,20 +544,39 @@ int can_all_from_reach_with_flag(struct object_array *from, { struct commit **list = NULL; int i; + int nr_commits; int result = 1; ALLOC_ARRAY(list, from->nr); + nr_commits = 0; for (i = 0; i < from->nr; i++) { - list[i] = (struct commit *)from->objects[i].item; + struct object *from_one = from->objects[i].item; - if (parse_commit(list[i]) || - list[i]->generation < min_generation) - return 0; + if (!from_one || from_one->flags & assign_flag) + continue; + + from_one = deref_tag(the_repository, from_one, +"a from object", 0); + if (!from_one || from_one->type != OBJ_COMMIT) { + /* no way to tell if this is reachable by +* looking at the ancestry chain alone, so +* leave a note to ourselves not to worry about +* this object anymore. +*/ + from->objects[i].item->flags |= assign_flag; + continue; + } + + list[nr_commits] = (struct commit *)from_one; + if (parse_commit(list[nr_commits]) || + list[nr_commits]->generation < min_generation) + return 0; /* is this a leak? */ + nr_commits++; } - QSORT(list, from->nr, compare_commits_by_gen); + QSORT(list, nr_commits, compare_commits_by_gen); - for (i = 0; i < from->nr; i++) { + for (i = 0; i < nr_commits; i++) { /* DFS from list[i] */ struct commit_list *stack = NULL; @@ -600,7 +619,7 @@ int can_all_from_reach_with_flag(struct object_array *from, } cleanup: - for (i = 0; i < from->nr; i++) { + for (i = 0; i < nr_commits; i++) { clear_commit_marks(list[i], RESULT); clear_commit_marks(list[i], assign_flag); } diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c index eb21103998..08d2ea68e8 100644 --- a/t/helper/test-reach.c +++ b/t/helper/test-reach.c @@ -31,6 +31,7 @@ int cmd__reach(int ac, const char **av) struct object_id oid_A, oid_B; struct commit *A, *B; struct commit_list *X, *Y; + struct object_array X_obj = OBJECT_ARRAY_INIT; struct commit **X_array; int X_nr, X_alloc; struct strbuf buf = STRBUF_INIT; @@ -49,7 +50,8 @@ int cmd__reach(int ac, const char **av) while (strbuf_getline(, stdin) != EOF) { struct object_id oid; - struct object *o; + struct object *orig; + struct object *peeled; struct commit *c; if (buf.len < 3) continue; @@ -57,14 +59,14 @@ int cmd__reach(int ac, const char **av) if (get_oid_committish(buf.buf + 2, )) die("failed to resolve %s", buf.buf + 2); - o = parse_object(r, ); - o = deref_tag_noverify(o); + orig = parse_object(r, ); + peeled = deref_tag_noverify(orig); - if (!o) + if (!peeled) die("failed to load commit for input %s resulting in oid %s\n", buf.buf, oid_to_hex()); - c = object_as_type(r, o, OBJ_COMMIT, 0); + c = object_as_type(r, peeled, OBJ_COMMIT, 0); if (!c) die("failed to load commit for input %s resulting in oid %s\n", @@ -85,6 +87,7 @@ int cmd__reach(int ac, const char **av) commit_list_insert(c, ); ALLOC_GROW(X_array, X_nr + 1, X_alloc); X_array[X_nr++] = c; +
[PATCH v2 0/1] Properly peel tags in can_all_from_reach_with_flags()
As Peff reported [1], the refactored can_all_from_reach_with_flags() method does not properly peel tags. Since the helper method can_all_from_reach() and code in t/helper/test-reach.c all peel tags before getting to this method, it is not super-simple to create a test that demonstrates this. I modified t/helper/test-reach.c to allow calling can_all_from_reach_with_flags() directly, and added a test in t6600-test-reach.sh that demonstrates the segfault without the fix. For V2, I compared the loop that inspects the 'from' commits in commit ba3ca1edce "commit-reach: move can_all_from_reach_with_flags" to the version here and got the following diff: 3c3 < if (from_one->flags & assign_flag) --- > if (!from_one || from_one->flags & assign_flag) 5c5,7 < from_one = deref_tag(the_repository, from_one, "a from object", 0); --- > > from_one = deref_tag(the_repository, from_one, > "a from object", 0); 14a17,22 > > list[nr_commits] = (struct commit *)from_one; > if (parse_commit(list[nr_commits]) || > list[nr_commits]->generation < min_generation) > return 0; /* is this a leak? */ > nr_commits++; This diff includes the early termination we had before 'deref_tag' and the comment for why we can ignore non-commit objects. [1] https://public-inbox.org/git/0bf9103c-9377-506b-7ad7-e5273d8e9...@gmail.com/T/#u Derrick Stolee (1): commit-reach: properly peel tags commit-reach.c| 33 ++--- t/helper/test-reach.c | 22 +- t/t6600-test-reach.sh | 30 -- 3 files changed, 71 insertions(+), 14 deletions(-) base-commit: 6621c838743812aaba96e55cfec8524ea1144c2d Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-39%2Fderrickstolee%2Ftag-fix-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-39/derrickstolee/tag-fix-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/39 Range-diff vs v1: 1: 948e28 ! 1: 4bf21204dd commit-reach: properly peel tags @@ -36,9 +36,17 @@ - if (parse_commit(list[i]) || - list[i]->generation < min_generation) - return 0; ++ if (!from_one || from_one->flags & assign_flag) ++ continue; ++ + from_one = deref_tag(the_repository, from_one, + "a from object", 0); + if (!from_one || from_one->type != OBJ_COMMIT) { ++ /* no way to tell if this is reachable by ++ * looking at the ancestry chain alone, so ++ * leave a note to ourselves not to worry about ++ * this object anymore. ++ */ + from->objects[i].item->flags |= assign_flag; + continue; + } @@ -187,7 +195,7 @@ + echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect && + test_three_modes can_all_from_reach_with_flag +' -+ ++ test_expect_success 'commit_contains:hit' ' cat >input <<-\EOF && A:commit-7-7 -- gitgitgadget
[PATCH v2 1/1] contrib: add coverage-diff script
From: Derrick Stolee We have coverage targets in our Makefile for using gcov to display line coverage based on our test suite. The way I like to do it is to run: make coverage-test make coverage-report This leaves the repo in a state where every X.c file that was covered has an X.c.gcov file containing the coverage counts for every line, and "#" at every uncovered line. There have been a few bugs in recent patches what would have been caught if the test suite covered those blocks (including a few of mine). I want to work towards a "sensible" amount of coverage on new topics. In my opinion, this means that any logic should be covered, but the 'die()' blocks in error cases do not need to be covered. It is important to not measure the coverage of the codebase by what old code is not covered. To help, I created the 'contrib/coverage-diff.sh' script. After creating the coverage statistics at a version (say, 'topic') you can then run contrib/coverage-diff.sh base topic to see the lines added between 'base' and 'topic' that are not covered by the test suite. The output uses 'git blame -c' format so you can find the commits responsible and view the line numbers for quick access to the context. Signed-off-by: Derrick Stolee --- contrib/coverage-diff.sh | 63 1 file changed, 63 insertions(+) create mode 100755 contrib/coverage-diff.sh diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh new file mode 100755 index 00..0f226f038c --- /dev/null +++ b/contrib/coverage-diff.sh @@ -0,0 +1,63 @@ +#!/bin/sh + +# Usage: 'contrib/coverage-diff.sh +# Outputs a list of new lines in version2 compared to version1 that are +# not covered by the test suite. Assumes you ran +# 'make coverage-test coverage-report' from root first, so .gcov files exist. + +V1=$1 +V2=$2 + +diff_lines () { + while read line + do + if echo $line | grep -q -e "^@@ -([0-9]+)(,[0-9]+)? \\+([0-9]+)(,[0-9]+)? @@.*" + then + line_num=$(echo $line \ + | awk 'match($0, "@@ -([0-9]+)(,[0-9]+)? \\+([0-9]+)(,[0-9]+)? @@.*", m) { print m[3] }') + else + echo "$line_num:$line" + if ! echo $line | grep -q -e "^-" + then + line_num=$(($line_num + 1)) + fi + fi + done +} + +files=$(git diff --raw $V1 $V2 \ + | grep \.c$ \ + | awk 'NF>1{print $NF}') + +for file in $files +do + git diff $V1 $V2 -- $file \ + | diff_lines \ + | grep ":+" \ + | sed 's/:/ /g' \ + | awk '{print $1}' \ + | sort \ + >new_lines.txt + + hash_file=$(echo $file | sed "s/\//\#/") + cat "$hash_file.gcov" \ + | grep \#\#\#\#\# \ + | sed 's/#: //g' \ + | sed 's/\:/ /g' \ + | awk '{print $1}' \ + | sort \ + >uncovered_lines.txt + + comm -12 uncovered_lines.txt new_lines.txt \ + | sed -e 's/$/\)/' \ + | sed -e 's/^/\t/' \ + >uncovered_new_lines.txt + + grep -q '[^[:space:]]' < uncovered_new_lines.txt && \ + echo $file && \ + git blame -c $file \ + | grep -f uncovered_new_lines.txt + + rm -f new_lines.txt uncovered_lines.txt uncovered_new_lines.txt +done + -- gitgitgadget
[PATCH v2 0/1] contrib: Add script to show uncovered "new" lines
We have coverage targets in our Makefile for using gcov to display line coverage based on our test suite. The way I like to do it is to run: make coverage-test make coverage-report This leaves the repo in a state where every X.c file that was covered has an X.c.gcov file containing the coverage counts for every line, and "#" at every uncovered line. There have been a few bugs in recent patches what would have been caught if the test suite covered those blocks (including a few of mine). I want to work towards a "sensible" amount of coverage on new topics. In my opinion, this means that any logic should be covered, but the 'die()' blocks in error cases do not need to be covered. It is important to not measure the coverage of the codebase by what old code is not covered. To help, I created the 'contrib/coverage-diff.sh' script. After creating the coverage statistics at a version (say, 'topic') you can then run contrib/coverage-diff.sh base topic to see the lines added between 'base' and 'topic' that are not covered by the test suite. For example, I ran this against the 'jch' branch (d3c0046) versus 'next' (dd90340) and got the following output: builtin/commit.c 859fdc0c3cf (Derrick Stolee 2018-08-29 05:49:04 -0700 1657) write_commit_graph_reachable(get_object_directory(), 0); builtin/rev-list.c 250edfa8c87 (Harald Nordgren2018-04-18 23:05:35 +0200 431) bisect_flags |= BISECT_FIND_ALL; builtin/worktree.c e5353bef550 (Eric Sunshine 2018-08-28 17:20:19 -0400 60) error_errno(_("failed to delete '%s'"), sb.buf); e19831c94f9 (Eric Sunshine 2018-08-28 17:20:23 -0400 251) die(_("unable to re-add worktree '%s'"), path); 68a6b3a1bd4 (Eric Sunshine 2018-08-28 17:20:24 -0400 793) die(_("cannot move a locked working tree, lock reason: %s\nuse 'move -f -f' to override or unlock first"), f4143101cbb (Eric Sunshine 2018-08-28 17:20:25 -0400 906) die(_("cannot remove a locked working tree, lock reason: %s\nuse 'remove -f -f' to override or unlock first"), read-cache.c 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1754) const unsigned char *cp = (const unsigned char *)name; 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1757) previous_len = previous_ce ? previous_ce->ce_namelen : 0; 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1758) strip_len = decode_varint(); 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1759) if (previous_len < strip_len) { 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1760) if (previous_ce) 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1761) die(_("malformed name field in the index, near path '%s'"), 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1762) previous_ce->name); 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1764) die(_("malformed name field in the index in the first path")); 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1766) copy_len = previous_len - strip_len; 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1767) name = (const char *)cp; 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1773) len += copy_len; 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1794) if (copy_len) 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1795) memcpy(ce->name, previous_ce->name, copy_len); 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1796) memcpy(ce->name + copy_len, name, len + 1 - copy_len); 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1797) *ent_size = (name - ((char *)ondisk)) + len + 1 - copy_len; remote-curl.c c3b9bc94b9b (Elijah Newren 2018-09-05 10:03:07 -0700 181) options.filter = xstrdup(value); Using this 'git blame' output, we can quickly inspect whether the uncovered lines are appropriate. For instance: 1. The line in builtin/commit.c is due to writing the commit-graph file when GIT_TEST_COMMIT_GRAPH is enabled, which is not on by default in the test suite. Being uncovered is expected here. 2. The lines in builtin/worktree.c are all related to error conditions. This is acceptable. 3. The line in builtin/rev-list.c is a flag replacement in a block that is otherwise unchanged. It must not be covered by the test suite normally. This could be worth adding a test to ensure the new logic maintains old behavior. 4. The lines in
[PATCH 0/1] contrib: Add script to show uncovered "new" lines
We have coverage targets in our Makefile for using gcov to display line coverage based on our test suite. The way I like to do it is to run: make coverage-test make coverage-report This leaves the repo in a state where every X.c file that was covered has an X.c.gcov file containing the coverage counts for every line, and "#" at every uncovered line. There have been a few bugs in recent patches what would have been caught if the test suite covered those blocks (including a few of mine). I want to work towards a "sensible" amount of coverage on new topics. In my opinion, this means that any logic should be covered, but the 'die()' blocks in error cases do not need to be covered. It is important to not measure the coverage of the codebase by what old code is not covered. To help, I created the 'contrib/coverage-diff.sh' script. After creating the coverage statistics at a version (say, 'topic') you can then run contrib/coverage-diff.sh base topic to see the lines added between 'base' and 'topic' that are not covered by the test suite. For example, I ran this against the 'jch' branch (d3c0046) versus 'next' (dd90340) and got the following output: builtin/commit.c 859fdc0c3cf (Derrick Stolee 2018-08-29 05:49:04 -0700 1657) write_commit_graph_reachable(get_object_directory(), 0); builtin/rev-list.c 250edfa8c87 (Harald Nordgren2018-04-18 23:05:35 +0200 431) bisect_flags |= BISECT_FIND_ALL; builtin/worktree.c e5353bef550 (Eric Sunshine 2018-08-28 17:20:19 -0400 60) error_errno(_("failed to delete '%s'"), sb.buf); e19831c94f9 (Eric Sunshine 2018-08-28 17:20:23 -0400 251) die(_("unable to re-add worktree '%s'"), path); 68a6b3a1bd4 (Eric Sunshine 2018-08-28 17:20:24 -0400 793) die(_("cannot move a locked working tree, lock reason: %s\nuse 'move -f -f' to override or unlock first"), f4143101cbb (Eric Sunshine 2018-08-28 17:20:25 -0400 906) die(_("cannot remove a locked working tree, lock reason: %s\nuse 'remove -f -f' to override or unlock first"), read-cache.c 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1754) const unsigned char *cp = (const unsigned char *)name; 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1757) previous_len = previous_ce ? previous_ce->ce_namelen : 0; 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1758) strip_len = decode_varint(); 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1759) if (previous_len < strip_len) { 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1760) if (previous_ce) 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1761) die(_("malformed name field in the index, near path '%s'"), 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1762) previous_ce->name); 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1764) die(_("malformed name field in the index in the first path")); 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1766) copy_len = previous_len - strip_len; 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1767) name = (const char *)cp; 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1773) len += copy_len; 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1794) if (copy_len) 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1795) memcpy(ce->name, previous_ce->name, copy_len); 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1796) memcpy(ce->name + copy_len, name, len + 1 - copy_len); 67922a3 (Nguyễn Thái Ngọc Duy 2018-09-02 15:19:33 +0200 1797) *ent_size = (name - ((char *)ondisk)) + len + 1 - copy_len; remote-curl.c c3b9bc94b9b (Elijah Newren 2018-09-05 10:03:07 -0700 181) options.filter = xstrdup(value); Using this 'git blame' output, we can quickly inspect whether the uncovered lines are appropriate. For instance: 1. The line in builtin/commit.c is due to writing the commit-graph file when GIT_TEST_COMMIT_GRAPH is enabled, which is not on by default in the test suite. Being uncovered is expected here. 2. The lines in builtin/worktree.c are all related to error conditions. This is acceptable. 3. The line in builtin/rev-list.c is a flag replacement in a block that is otherwise unchanged. It must not be covered by the test suite normally. This could be worth adding a test to ensure the new logic maintains old behavior. 4. The lines in