[PATCH v2 01/12] commit-graph: add 'verify' subcommand
If the commit-graph file becomes corrupt, we need a way to verify that its contents match the object database. In the manner of 'git fsck' we will implement a 'git commit-graph verify' subcommand to report all issues with the file. Add the 'verify' subcommand to the 'commit-graph' builtin and its documentation. The subcommand is currently a no-op except for loading the commit-graph into memory, which may trigger run-time errors that would be caught by normal use. Add a simple test that ensures the command returns a zero error code. If no commit-graph file exists, this is an acceptable state. Do not report any errors. During review, we noticed that a FREE_AND_NULL(graph_name) was placed after a possible 'return', and this pattern was also in graph_read(). Fix that case, too. Signed-off-by: Derrick Stolee --- Documentation/git-commit-graph.txt | 6 ++ builtin/commit-graph.c | 40 +- commit-graph.c | 5 + commit-graph.h | 2 ++ t/t5318-commit-graph.sh| 10 ++ 5 files changed, 62 insertions(+), 1 deletion(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 4c97b555cc..a222cfab08 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -10,6 +10,7 @@ SYNOPSIS [verse] 'git commit-graph read' [--object-dir ] +'git commit-graph verify' [--object-dir ] 'git commit-graph write' [--object-dir ] @@ -52,6 +53,11 @@ existing commit-graph file. Read a graph file given by the commit-graph file and output basic details about the graph file. Used for debugging purposes. +'verify':: + +Read the commit-graph file and verify its contents against the object +database. Used to check for corrupted data. + EXAMPLES diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 37420ae0fd..af3101291f 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,10 +8,16 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), + N_("git commit-graph verify [--object-dir ]"), N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), NULL }; +static const char * const builtin_commit_graph_verify_usage[] = { + N_("git commit-graph verify [--object-dir ]"), + NULL +}; + static const char * const builtin_commit_graph_read_usage[] = { N_("git commit-graph read [--object-dir ]"), NULL @@ -29,6 +35,36 @@ static struct opts_commit_graph { int append; } opts; + +static int graph_verify(int argc, const char **argv) +{ + struct commit_graph *graph = 0; + char *graph_name; + + static struct option builtin_commit_graph_verify_options[] = { + OPT_STRING(0, "object-dir", &opts.obj_dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_verify_options, +builtin_commit_graph_verify_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + graph_name = get_commit_graph_filename(opts.obj_dir); + graph = load_commit_graph_one(graph_name); + FREE_AND_NULL(graph_name); + + if (!graph) + return 0; + + return verify_commit_graph(graph); +} + static int graph_read(int argc, const char **argv) { struct commit_graph *graph = NULL; @@ -50,10 +86,10 @@ static int graph_read(int argc, const char **argv) graph_name = get_commit_graph_filename(opts.obj_dir); graph = load_commit_graph_one(graph_name); + FREE_AND_NULL(graph_name); if (!graph) die("graph file %s does not exist", graph_name); - FREE_AND_NULL(graph_name); printf("header: %08x %d %d %d %d\n", ntohl(*(uint32_t*)graph->data), @@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) PARSE_OPT_STOP_AT_NON_OPTION); if (argc > 0) { + if (!strcmp(argv[0], "verify")) + return graph_verify(argc, argv); if (!strcmp(argv[0], "read")) return graph_read(argc, argv); if (!strcmp(argv[0], "write")) diff --git a/commit-graph.c b/commit-graph.c index a8c337dd77..b25aaed128 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -817,3 +817,8 @@ vo
[PATCH v2 03/12] commit-graph: test that 'verify' finds corruption
Add test cases to t5318-commit-graph.sh that corrupt the commit-graph file and check that the 'git commit-graph verify' command fails. These tests verify the header and chunk information is checked carefully. Helped-by: Martin Ågren Signed-off-by: Derrick Stolee --- t/t5318-commit-graph.sh | 53 + 1 file changed, 53 insertions(+) diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 6ca451dfd2..0cb88232fa 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -240,4 +240,57 @@ test_expect_success 'git commit-graph verify' ' git commit-graph verify >output ' +# usage: corrupt_data [] +corrupt_data() { + file=$1 + pos=$2 + data="${3:-\0}" + printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc +} + +test_expect_success 'detect bad signature' ' + cd "$TRASH_DIRECTORY/full" && + cp $objdir/info/commit-graph commit-graph-backup && + test_when_finished mv commit-graph-backup $objdir/info/commit-graph && + corrupt_data $objdir/info/commit-graph 0 "\0" && + test_must_fail git commit-graph verify 2>err && + grep -v "^\+" err > verify-errors && + test_line_count = 1 verify-errors && + grep "graph signature" verify-errors +' + +test_expect_success 'detect bad version number' ' + cd "$TRASH_DIRECTORY/full" && + cp $objdir/info/commit-graph commit-graph-backup && + test_when_finished mv commit-graph-backup $objdir/info/commit-graph && + corrupt_data $objdir/info/commit-graph 4 "\02" && + test_must_fail git commit-graph verify 2>err && + grep -v "^\+" err > verify-errors && + test_line_count = 1 verify-errors && + grep "graph version" verify-errors +' + +test_expect_success 'detect bad hash version' ' + cd "$TRASH_DIRECTORY/full" && + cp $objdir/info/commit-graph commit-graph-backup && + test_when_finished mv commit-graph-backup $objdir/info/commit-graph && + corrupt_data $objdir/info/commit-graph 5 "\02" && + test_must_fail git commit-graph verify 2>err && + grep -v "^\+" err > verify-errors && + test_line_count = 1 verify-errors && + grep "hash version" verify-errors +' + +test_expect_success 'detect too small chunk-count' ' + cd "$TRASH_DIRECTORY/full" && + cp $objdir/info/commit-graph commit-graph-backup && + test_when_finished mv commit-graph-backup $objdir/info/commit-graph && + corrupt_data $objdir/info/commit-graph 6 "\01" && + test_must_fail git commit-graph verify 2>err && + grep -v "^\+" err > verify-errors && + test_line_count = 2 verify-errors && + grep "missing the OID Lookup chunk" verify-errors && + grep "missing the Commit Data chunk" verify-errors +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v2 00/12] Integrate commit-graph into fsck and gc
I'm sending this v2 re-roll rather quickly after the previous version because Martin provided a framework to add tests to the 'verify' subcommand. I took that framework and added tests for the other checks of the commit-graph data. This also found some interesting things about the command: 1. There were some segfaults because we were not checking for bad data carefully enough. 2. To avoid segfaults, we will now terminate the check early if we find problems earlier in the file, such as in the header, or OID lookup. 3. We were not writing newlines between reports. This now happens by default in graph_report(). The integration into 'fetch' is dropped (thanks Ævar!). Derrick Stolee (12): commit-graph: add 'verify' subcommand commit-graph: verify file header information commit-graph: test that 'verify' finds corruption commit-graph: parse commit from chosen graph commit-graph: verify fanout and lookup table commit: force commit to parse from object database commit-graph: load a root tree from specific graph commit-graph: verify commit contents against odb fsck: verify commit-graph commit-graph: add '--reachable' option gc: automatically write commit-graph files commit-graph: update design document Documentation/config.txt | 6 + Documentation/git-commit-graph.txt | 14 ++- Documentation/git-fsck.txt | 3 + Documentation/git-gc.txt | 4 + Documentation/technical/commit-graph.txt | 22 builtin/commit-graph.c | 81 - builtin/fsck.c | 21 builtin/gc.c | 8 ++ commit-graph.c | 199 ++- commit-graph.h | 2 + commit.c | 13 +- commit.h | 1 + t/t5318-commit-graph.sh | 173 +++ 13 files changed, 509 insertions(+), 38 deletions(-) base-commit: 34fdd433396ee0e3ef4de02eb2189f8226eafe4e -- 2.16.2.329.gfb62395de6
[PATCH v2 11/12] gc: automatically write commit-graph files
The commit-graph file is a very helpful feature for speeding up git operations. In order to make it more useful, write the commit-graph file by default during standard garbage collection operations. Add a 'gc.commitGraph' config setting that triggers writing a commit-graph file after any non-trivial 'git gc' command. Defaults to false while the commit-graph feature matures. We specifically do not want to turn this on by default until the commit-graph feature is fully integrated with history-modifying features like shallow clones. Signed-off-by: Derrick Stolee --- Documentation/config.txt | 6 ++ Documentation/git-gc.txt | 4 builtin/gc.c | 8 3 files changed, 18 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index 11f027194e..9a3abd87e7 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -1553,6 +1553,12 @@ gc.autoDetach:: Make `git gc --auto` return immediately and run in background if the system supports it. Default is true. +gc.commitGraph:: + If true, then gc will rewrite the commit-graph file after any + change to the object database. If '--auto' is used, then the + commit-graph will not be updated unless the threshold is met. + See linkgit:git-commit-graph[1] for details. + gc.logExpiry:: If the file gc.log exists, then `git gc --auto` won't run unless that file is more than 'gc.logExpiry' old. Default is diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt index 571b5a7e3c..17dd654a59 100644 --- a/Documentation/git-gc.txt +++ b/Documentation/git-gc.txt @@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if it within all non-bare repos or it can be set to a boolean value. This defaults to true. +The optional configuration variable 'gc.commitGraph' determines if +'git gc' runs 'git commit-graph write'. This can be set to a boolean +value. This defaults to false. + The optional configuration variable `gc.aggressiveWindow` controls how much time is spent optimizing the delta compression of the objects in the repository when the --aggressive option is specified. The larger diff --git a/builtin/gc.c b/builtin/gc.c index 77fa720bd0..8403445738 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -34,6 +34,7 @@ static int aggressive_depth = 50; static int aggressive_window = 250; static int gc_auto_threshold = 6700; static int gc_auto_pack_limit = 50; +static int gc_commit_graph = 0; static int detach_auto = 1; static timestamp_t gc_log_expire_time; static const char *gc_log_expire = "1.day.ago"; @@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT; static struct argv_array prune = ARGV_ARRAY_INIT; static struct argv_array prune_worktrees = ARGV_ARRAY_INIT; static struct argv_array rerere = ARGV_ARRAY_INIT; +static struct argv_array commit_graph = ARGV_ARRAY_INIT; static struct tempfile *pidfile; static struct lock_file log_lock; @@ -121,6 +123,7 @@ static void gc_config(void) git_config_get_int("gc.aggressivedepth", &aggressive_depth); git_config_get_int("gc.auto", &gc_auto_threshold); git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit); + git_config_get_bool("gc.commitgraph", &gc_commit_graph); git_config_get_bool("gc.autodetach", &detach_auto); git_config_get_expiry("gc.pruneexpire", &prune_expire); git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire); @@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix) argv_array_pushl(&prune, "prune", "--expire", NULL); argv_array_pushl(&prune_worktrees, "worktree", "prune", "--expire", NULL); argv_array_pushl(&rerere, "rerere", "gc", NULL); + argv_array_pushl(&commit_graph, "commit-graph", "write", "--reachable", NULL); /* default expiry time, overwritten in gc_config */ gc_config(); @@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix) if (pack_garbage.nr > 0) clean_pack_garbage(); + if (gc_commit_graph) + if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD)) + return error(FAILED_RUN, commit_graph.argv[0]); + if (auto_gc && too_many_loose_objects()) warning(_("There are too many unreachable loose objects; " "run 'git prune' to remove them.")); -- 2.16.2.329.gfb62395de6
[PATCH v2 10/12] commit-graph: add '--reachable' option
When writing commit-graph files, it can be convenient to ask for all reachable commits (starting at the ref set) in the resulting file. This is particularly helpful when writing to stdin is complicated, such as a future integration with 'git gc' which will call 'git commit-graph write --reachable' after performing cleanup of the object database. Signed-off-by: Derrick Stolee --- Documentation/git-commit-graph.txt | 8 ++-- builtin/commit-graph.c | 41 ++ 2 files changed, 43 insertions(+), 6 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index a222cfab08..cc1715a823 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in packfiles. + With the `--stdin-packs` option, generate the new commit graph by walking objects only in the specified pack-indexes. (Cannot be combined -with --stdin-commits.) +with --stdin-commits or --reachable.) + With the `--stdin-commits` option, generate the new commit graph by walking commits starting at the commits specified in stdin as a list of OIDs in hex, one OID per line. (Cannot be combined with ---stdin-packs.) +--stdin-packs or --reachable.) ++ +With the `--reachable` option, generate the new commit graph by walking +commits starting at all refs. (Cannot be combined with --stdin-commits +or --stind-packs.) + With the `--append` option, include all commits that are present in the existing commit-graph file. diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index af3101291f..7cb94a4813 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -3,13 +3,14 @@ #include "dir.h" #include "lockfile.h" #include "parse-options.h" +#include "refs.h" #include "commit-graph.h" static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), N_("git commit-graph verify [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits]"), NULL }; @@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits]"), NULL }; static struct opts_commit_graph { const char *obj_dir; + int reachable; int stdin_packs; int stdin_commits; int append; @@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv) return 0; } +struct hex_list { + char **hex_strs; + int hex_nr; + int hex_alloc; +}; + +static int add_ref_to_list(const char *refname, + const struct object_id *oid, + int flags, void *cb_data) +{ + struct hex_list *list = (struct hex_list*)cb_data; + + ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc); + list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1); + strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid)); + list->hex_nr++; + return 0; +} + static int graph_write(int argc, const char **argv) { const char **pack_indexes = NULL; @@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv) OPT_STRING(0, "object-dir", &opts.obj_dir, N_("dir"), N_("The object directory to store the graph")), + OPT_BOOL(0, "reachable", &opts.reachable, + N_("start walk at all refs")), OPT_BOOL(0, "stdin-packs", &opts.stdin_packs, N_("scan pack-indexes listed by stdin for commits")), OPT_BOOL(0, "stdin-commits", &opts.stdin_commits, @@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv) builtin_commit_graph_write_options, builtin_commit_graph_write_usage, 0); - if (opts.stdin_packs && opts.stdin_commits) - die(_("cannot use both --stdin-commits and --stdin-packs")); + if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1) + die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs"));
[PATCH v2 08/12] commit-graph: verify commit contents against odb
When running 'git commit-graph verify', compare the contents of the commits that are loaded from the commit-graph file with commits that are loaded directly from the object database. This includes checking the root tree object ID, commit date, and parents. Parse the commit from the graph during the initial loop through the object IDs to guarantee we parse from the commit-graph file. In addition, verify the generation number calculation is correct for all commits in the commit-graph file. While testing, we discovered that mutating the integer value for a parent to be outside the accepted range causes a segmentation fault. Add a new check in insert_parent_or_die() that prevents this fault. Check for that error during the test, both in the typical parents and in the list of parents for octopus merges. Signed-off-by: Derrick Stolee --- commit-graph.c | 100 t/t5318-commit-graph.sh | 64 +++ 2 files changed, 164 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 5bb93e533c..a15ad9710d 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -237,6 +237,10 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, { struct commit *c; struct object_id oid; + + if (pos >= g->num_commits) + die("invalide parent position %"PRIu64, pos); + hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos); c = lookup_commit(&oid); if (!c) @@ -875,6 +879,8 @@ int verify_commit_graph(struct commit_graph *g) return 1; for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit; + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); if (i && oidcmp(&prev_oid, &cur_oid) >= 0) @@ -892,6 +898,10 @@ int verify_commit_graph(struct commit_graph *g) cur_fanout_pos++; } + + graph_commit = lookup_commit(&cur_oid); + if (!parse_commit_in_graph_one(g, graph_commit)) + graph_report("failed to parse %s from commit-graph", oid_to_hex(&cur_oid)); } while (cur_fanout_pos < 256) { @@ -904,5 +914,95 @@ int verify_commit_graph(struct commit_graph *g) cur_fanout_pos++; } + if (verify_commit_graph_error) + return 1; + + for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit, *odb_commit; + struct commit_list *graph_parents, *odb_parents; + int num_parents = 0; + + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + graph_commit = lookup_commit(&cur_oid); + odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node()); + if (parse_commit_internal(odb_commit, 0, 0)) { + graph_report("failed to parse %s from object database", oid_to_hex(&cur_oid)); + continue; + } + + if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid, + get_commit_tree_oid(odb_commit))) + graph_report("root tree object ID for commit %s in commit-graph is %s != %s", +oid_to_hex(&cur_oid), + oid_to_hex(get_commit_tree_oid(graph_commit)), + oid_to_hex(get_commit_tree_oid(odb_commit))); + + if (graph_commit->date != odb_commit->date) + graph_report("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime"", +oid_to_hex(&cur_oid), +graph_commit->date, +odb_commit->date); + + + graph_parents = graph_commit->parents; + odb_parents = odb_commit->parents; + + while (graph_parents) { + num_parents++; + + if (odb_parents == NULL) + graph_report("commit-graph parent list for commit %s is too long (%d)", +oid_to_hex(&cur_oid), +num_parents); + + if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid)) + graph_report("commit-graph parent for %s is %s != %s", +oid_to_hex(&cur_oid), + oid_to_hex(&
[PATCH v2 07/12] commit-graph: load a root tree from specific graph
When lazy-loading a tree for a commit, it will be important to select the tree from a specific struct commit_graph. Create a new method that specifies the commit-graph file and use that in get_commit_tree_in_graph(). Signed-off-by: Derrick Stolee --- commit-graph.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index b0fd1d5320..5bb93e533c 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -357,14 +357,20 @@ static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit * return c->maybe_tree; } -struct tree *get_commit_tree_in_graph(const struct commit *c) +static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g, +const struct commit *c) { if (c->maybe_tree) return c->maybe_tree; if (c->graph_pos == COMMIT_NOT_FROM_GRAPH) BUG("get_commit_tree_in_graph called from non-commit-graph commit"); - return load_tree_for_commit(commit_graph, (struct commit *)c); + return load_tree_for_commit(g, (struct commit *)c); +} + +struct tree *get_commit_tree_in_graph(const struct commit *c) +{ + return get_commit_tree_in_graph_one(commit_graph, c); } static void write_graph_chunk_fanout(struct hashfile *f, -- 2.16.2.329.gfb62395de6
[PATCH v2 06/12] commit: force commit to parse from object database
In anticipation of verifying commit-graph file contents against the object database, create parse_commit_internal() to allow side-stepping the commit-graph file and parse directly from the object database. Due to the use of generation numbers, this method should not be called unless the intention is explicit in avoiding commits from the commit-graph file. Signed-off-by: Derrick Stolee --- commit.c | 13 + commit.h | 1 + 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index 1d28677dfb..7c92350373 100644 --- a/commit.c +++ b/commit.c @@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s return 0; } -int parse_commit_gently(struct commit *item, int quiet_on_missing) +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph) { enum object_type type; void *buffer; @@ -403,17 +403,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return -1; if (item->object.parsed) return 0; - if (parse_commit_in_graph(item)) + if (use_commit_graph && parse_commit_in_graph(item)) return 0; buffer = read_sha1_file(item->object.oid.hash, &type, &size); if (!buffer) return quiet_on_missing ? -1 : error("Could not read %s", -oid_to_hex(&item->object.oid)); + oid_to_hex(&item->object.oid)); if (type != OBJ_COMMIT) { free(buffer); return error("Object %s not a commit", -oid_to_hex(&item->object.oid)); + oid_to_hex(&item->object.oid)); } ret = parse_commit_buffer(item, buffer, size, 0); if (save_commit_buffer && !ret) { @@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return ret; } +int parse_commit_gently(struct commit *item, int quiet_on_missing) +{ + return parse_commit_internal(item, quiet_on_missing, 1); +} + void parse_commit_or_die(struct commit *item) { if (parse_commit(item)) diff --git a/commit.h b/commit.h index b5afde1ae9..5fde74fcd7 100644 --- a/commit.h +++ b/commit.h @@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph); int parse_commit_gently(struct commit *item, int quiet_on_missing); static inline int parse_commit(struct commit *item) { -- 2.16.2.329.gfb62395de6
[PATCH v2 09/12] fsck: verify commit-graph
If core.commitGraph is true, verify the contents of the commit-graph during 'git fsck' using the 'git commit-graph verify' subcommand. Run this check on all alternates, as well. We use a new process for two reasons: 1. The subcommand decouples the details of loading and verifying a commit-graph file from the other fsck details. 2. The commit-graph verification requires the commits to be loaded in a specific order to guarantee we parse from the commit-graph file for some objects and from the object database for others. Signed-off-by: Derrick Stolee --- Documentation/git-fsck.txt | 3 +++ builtin/fsck.c | 21 + t/t5318-commit-graph.sh| 21 ++--- 3 files changed, 42 insertions(+), 3 deletions(-) diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt index b9f060e3b2..ab9a93fb9b 100644 --- a/Documentation/git-fsck.txt +++ b/Documentation/git-fsck.txt @@ -110,6 +110,9 @@ Any corrupt objects you will have to find in backups or other archives (i.e., you can just remove them and do an 'rsync' with some other site in the hopes that somebody else has the object you have corrupted). +If core.commitGraph is true, the commit-graph file will also be inspected +using 'git commit-graph verify'. See linkgit:git-commit-graph[1]. + Extracted Diagnostics - diff --git a/builtin/fsck.c b/builtin/fsck.c index ef78c6c00c..a6d5045b77 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -16,6 +16,7 @@ #include "streaming.h" #include "decorate.h" #include "packfile.h" +#include "run-command.h" #define REACHABLE 0x0001 #define SEEN 0x0002 @@ -45,6 +46,7 @@ static int name_objects; #define ERROR_REACHABLE 02 #define ERROR_PACK 04 #define ERROR_REFS 010 +#define ERROR_COMMIT_GRAPH 020 static const char *describe_object(struct object *obj) { @@ -815,5 +817,24 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) } check_connectivity(); + + if (core_commit_graph) { + struct child_process commit_graph_verify = CHILD_PROCESS_INIT; + const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL, NULL }; + commit_graph_verify.argv = verify_argv; + commit_graph_verify.git_cmd = 1; + + if (run_command(&commit_graph_verify)) + errors_found |= ERROR_COMMIT_GRAPH; + + prepare_alt_odb(); + for (alt = alt_odb_list; alt; alt = alt->next) { + verify_argv[2] = "--object-dir"; + verify_argv[3] = alt->path; + if (run_command(&commit_graph_verify)) + errors_found |= ERROR_COMMIT_GRAPH; + } + } + return errors_found; } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 5ab268a024..91c8406d97 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -205,6 +205,16 @@ test_expect_success 'build graph from commits with append' ' graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1 graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2 +test_expect_success 'build graph using --reachable' ' + cd "$TRASH_DIRECTORY/full" && + git commit-graph write --reachable && + test_path_is_file $objdir/info/commit-graph && + graph_read_expect "11" "large_edges" +' + +graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1 +graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2 + test_expect_success 'setup bare repo' ' cd "$TRASH_DIRECTORY" && git clone --bare --no-local full bare && @@ -335,7 +345,7 @@ test_expect_success 'detect OID not in object database' ' cd "$TRASH_DIRECTORY/full" && cp $objdir/info/commit-graph commit-graph-backup && test_when_finished mv commit-graph-backup $objdir/info/commit-graph && - corrupt_data $objdir/info/commit-graph 1134 "\01" && + corrupt_data $objdir/info/commit-graph 1134 "\00" && test_must_fail git commit-graph verify 2>err && grep -v "^\+" err > verify-errors && test_line_count = 3 verify-errors && @@ -348,7 +358,7 @@ test_expect_success 'detect incorrect tree OID' ' cd "$TRASH_DIRECTORY/full" && cp $objdir/info/commit-graph commit-graph-backup && test_when_finished mv commit-graph-backup $objdir/info/commit-graph &
[PATCH v2 12/12] commit-graph: update design document
The commit-graph feature is now integrated with 'fsck' and 'gc', so remove those items from the "Future Work" section of the commit-graph design document. Also remove the section on lazy-loading trees, as that was completed in an earlier patch series. Signed-off-by: Derrick Stolee --- Documentation/technical/commit-graph.txt | 22 -- 1 file changed, 22 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index e1a883eb46..c664acbd76 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -118,9 +118,6 @@ Future Work - The commit graph feature currently does not honor commit grafts. This can be remedied by duplicating or refactoring the current graft logic. -- The 'commit-graph' subcommand does not have a "verify" mode that is - necessary for integration with fsck. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered @@ -130,25 +127,6 @@ Future Work - 'log --topo-order' - 'tag --merged' -- Currently, parse_commit_gently() requires filling in the root tree - object for a commit. This passes through lookup_tree() and consequently - lookup_object(). Also, it calls lookup_commit() when loading the parents. - These method calls check the ODB for object existence, even if the - consumer does not need the content. For example, we do not need the - tree contents when computing merge bases. Now that commit parsing is - removed from the computation time, these lookup operations are the - slowest operations keeping graph walks from being fast. Consider - loading these objects without verifying their existence in the ODB and - only loading them fully when consumers need them. Consider a method - such as "ensure_tree_loaded(commit)" that fully loads a tree before - using commit->tree. - -- The current design uses the 'commit-graph' subcommand to generate the graph. - When this feature stabilizes enough to recommend to most users, we should - add automatic graph writes to common operations that create many commits. - For example, one could compute a graph on 'clone', 'fetch', or 'repack' - commands. - - A server could provide a commit graph file as part of the network protocol to avoid extra calculations by clients. This feature is only of benefit if the user is willing to trust the file, because verifying the file is correct -- 2.16.2.329.gfb62395de6
[PATCH v2 05/12] commit-graph: verify fanout and lookup table
While running 'git commit-graph verify', verify that the object IDs are listed in lexicographic order and that the fanout table correctly navigates into that list of object IDs. Add tests that check these corruptions are caught by the verify subcommand. Most of the tests check the full output matches the exact error we inserted, but since our OID order test triggers incorrect fanout values (with possibly different numbers of output lines) we focus only that the correct error is written in that case. Signed-off-by: Derrick Stolee --- commit-graph.c | 36 t/t5318-commit-graph.sh | 31 +++ 2 files changed, 67 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 4dfff7e752..b0fd1d5320 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -848,6 +848,9 @@ static void graph_report(const char *fmt, ...) int verify_commit_graph(struct commit_graph *g) { + uint32_t i, cur_fanout_pos = 0; + struct object_id prev_oid, cur_oid; + if (!g) { graph_report("no commit-graph file loaded"); return 1; @@ -862,5 +865,38 @@ int verify_commit_graph(struct commit_graph *g) if (!g->chunk_commit_data) graph_report("commit-graph is missing the Commit Data chunk"); + if (verify_commit_graph_error) + return 1; + + for (i = 0; i < g->num_commits; i++) { + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + if (i && oidcmp(&prev_oid, &cur_oid) >= 0) + graph_report("commit-graph has incorrect oid order: %s then %s", +oid_to_hex(&prev_oid), +oid_to_hex(&cur_oid)); + + oidcpy(&prev_oid, &cur_oid); + + while (cur_oid.hash[0] > cur_fanout_pos) { + uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos); + if (i != fanout_value) + graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u", +cur_fanout_pos, fanout_value, i); + + cur_fanout_pos++; + } + } + + while (cur_fanout_pos < 256) { + uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos); + + if (g->num_commits != fanout_value) + graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u", +cur_fanout_pos, fanout_value, i); + + cur_fanout_pos++; + } + return verify_commit_graph_error; } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 0cb88232fa..6fb306b0da 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -293,4 +293,35 @@ test_expect_success 'detect too small chunk-count' ' grep "missing the Commit Data chunk" verify-errors ' +test_expect_success 'detect incorrect chunk lookup value' ' + cd "$TRASH_DIRECTORY/full" && + cp $objdir/info/commit-graph commit-graph-backup && + test_when_finished mv commit-graph-backup $objdir/info/commit-graph && + corrupt_data $objdir/info/commit-graph 25 "\01" && + test_must_fail git commit-graph verify 2>err && + grep -v "^\+" err > verify-errors && + test_line_count = 1 verify-errors && + grep "improper chunk offset" verify-errors +' + +test_expect_success 'detect incorrect fanout' ' + cd "$TRASH_DIRECTORY/full" && + cp $objdir/info/commit-graph commit-graph-backup && + test_when_finished mv commit-graph-backup $objdir/info/commit-graph && + corrupt_data $objdir/info/commit-graph 128 "\01" && + test_must_fail git commit-graph verify 2>err && + grep -v "^\+" err > verify-errors && + test_line_count = 1 verify-errors && + grep "fanout value" verify-errors +' + +test_expect_success 'detect incorrect OID order' ' + cd "$TRASH_DIRECTORY/full" && + cp $objdir/info/commit-graph commit-graph-backup && + test_when_finished mv commit-graph-backup $objdir/info/commit-graph && + corrupt_data $objdir/info/commit-graph 1272 "\01" && + test_must_fail git commit-graph verify 2>err && + grep "incorrect oid order" err +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v2 04/12] commit-graph: parse commit from chosen graph
Before verifying a commit-graph file against the object database, we need to parse all commits from the given commit-graph file. Create parse_commit_in_graph_one() to target a given struct commit_graph. Signed-off-by: Derrick Stolee --- commit-graph.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index d2db20e49a..4dfff7e752 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -309,7 +309,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin } } -int parse_commit_in_graph(struct commit *item) +int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item) { uint32_t pos; @@ -317,9 +317,21 @@ int parse_commit_in_graph(struct commit *item) return 0; if (item->object.parsed) return 1; + + if (find_commit_in_graph(item, g, &pos)) + return fill_commit_in_graph(item, g, pos); + + return 0; +} + +int parse_commit_in_graph(struct commit *item) +{ + if (!core_commit_graph) + return 0; + prepare_commit_graph(); - if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) - return fill_commit_in_graph(item, commit_graph, pos); + if (commit_graph) + return parse_commit_in_graph_one(commit_graph, item); return 0; } -- 2.16.2.329.gfb62395de6
Re: [PATCH 4/4] mark_parents_uninteresting(): avoid most allocation
On 5/11/2018 2:03 PM, Jeff King wrote: Commit 941ba8db57 (Eliminate recursion in setting/clearing marks in commit list, 2012-01-14) used a clever double-loop to avoid allocations for single-parent chains of history. However, it did so only when following parents of parents (which was an uncommon case), and _always_ incurred at least one allocation to populate the list of pending parents in the first place. We can turn this into zero-allocation in the common case by iterating directly over the initial parent list, and then following up on any pending items we might have discovered. This change appears to improve performance, but I was unable to measure any difference between this commit and the one ahead, even when merging ds/generation-numbers (which significantly reduces the other costs). I was testing 'git status' and 'git rev-list --boundary master...origin/master' in the Linux repo with my copy of master 70,000+ commits behind origin/master. It's still a good change, but I was hoping to find a measurable benefit :( Signed-off-by: Jeff King --- Again, try "-w" for more readability. revision.c | 44 +--- 1 file changed, 25 insertions(+), 19 deletions(-) diff --git a/revision.c b/revision.c index 89ff9a99ce..cbe041128e 100644 --- a/revision.c +++ b/revision.c @@ -115,32 +115,38 @@ static void commit_stack_clear(struct commit_stack *stack) stack->nr = stack->alloc = 0; } -void mark_parents_uninteresting(struct commit *commit) +static void mark_one_parent_uninteresting(struct commit *commit, + struct commit_stack *pending) { - struct commit_stack pending = COMMIT_STACK_INIT; struct commit_list *l; + if (commit->object.flags & UNINTERESTING) + return; + commit->object.flags |= UNINTERESTING; + + /* +* Normally we haven't parsed the parent +* yet, so we won't have a parent of a parent +* here. However, it may turn out that we've +* reached this commit some other way (where it +* wasn't uninteresting), in which case we need +* to mark its parents recursively too.. +*/ for (l = commit->parents; l; l = l->next) - commit_stack_push(&pending, l->item); + commit_stack_push(pending, l->item); +} - while (pending.nr > 0) { - struct commit *commit = commit_stack_pop(&pending); +void mark_parents_uninteresting(struct commit *commit) +{ + struct commit_stack pending = COMMIT_STACK_INIT; + struct commit_list *l; - if (commit->object.flags & UNINTERESTING) - return; - commit->object.flags |= UNINTERESTING; + for (l = commit->parents; l; l = l->next) + mark_one_parent_uninteresting(l->item, &pending); - /* -* Normally we haven't parsed the parent -* yet, so we won't have a parent of a parent -* here. However, it may turn out that we've -* reached this commit some other way (where it -* wasn't uninteresting), in which case we need -* to mark its parents recursively too.. -*/ - for (l = commit->parents; l; l = l->next) - commit_stack_push(&pending, l->item); - } + while (pending.nr > 0) + mark_one_parent_uninteresting(commit_stack_pop(&pending), + &pending); commit_stack_clear(&pending); }
Re: [PATCH 0/4] a few mark_parents_uninteresting cleanups
On 5/11/2018 2:00 PM, Jeff King wrote: This is a follow-up to the discussion from February: https://public-inbox.org/git/1519522496-73090-1-git-send-email-dsto...@microsoft.com/ There I theorized that some of these extra has_object_file() checks were unnecessary. After poking around a bit, I've convinced myself that this is the case, so here are some patches. After Stolee's fix in ebbed3ba04 (revision.c: reduce object database queries, 2018-02-24), I doubt this will provide any noticeable speedup. IMHO the value is just in simplifying the logic. The first two patches are the actual has_object_file simplifications. The second two are an attempt to fix some head-scratching I had while reading mark_parents_uninteresting(). I hope the result is easier to follow, but I may have just shuffled one confusion for another. :) [1/4]: mark_tree_contents_uninteresting(): drop missing object check [2/4]: mark_parents_uninteresting(): drop missing object check [3/4]: mark_parents_uninteresting(): replace list with stack [4/4]: mark_parents_uninteresting(): avoid most allocation revision.c | 90 ++ 1 file changed, 50 insertions(+), 40 deletions(-) -Peff This series looks good to me. I found Patch 3 hard to read in the diff, so I just looked at the final result and the new arrangement is very clear about how it should behave. Reviewed-by: Derrick Stolee
Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
On 5/12/2018 10:00 AM, Jakub Narebski wrote: Derrick Stolee writes: On 5/4/2018 3:40 PM, Jakub Narebski wrote: With early parts of commit-graph feature (ds/commit-graph and ds/lazy-load-trees) close to being merged into "master", see https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/ I think it would be good idea to think what other data could be added there to make Git even faster. Before thinking about adding more data to the commit-graph, I think instead we need to finish taking advantage of the data that is already there. This means landing the generation number patch [1] (I think this is close, so I'll send a v6 this week if there is no new feedback.) and the auto-compute patch [2] (this could use more feedback, but I'll send a v1 based on the RFC feedback if no one chimes in). [1] https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/ [PATCH v5 00/11] Compute and consume generation numbers [2] https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/ [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Right, so the RFC might be a bit premature; I wanted the discussion to be out there to think about when adding new uses of existing features. DIGRESSION: it is commendable that you are sending patches in small, easy digestible chunks / patch series. It is much easier to review 10+ series than 80+ behemoth (though I understand it is not always possible to subdivide patch series into smaller self-contained sub-series). On the other hand, it would be nice to have some roadmap about series and features to be sent in the future, if possible. Something like what was done when 'git rebase --interactive' was getting builtinified: moved (in parts) to C. It would be great to have such roadmap with milestones achieved and milestones to be done in the cover letter for series. I suppose that is what I intended in the "Future Work" section of Documentation/technical/commit-graph.txt. It gives a set of things that need to be done in order to make this a default feature, not just an expert-level feature. When I wrote the section, I was focused entirely on "commit-graph version 1.0" and I didn't know how long that would take. The series have been getting interest and review (in great part to your interest, Jakub) so they have been moving to 'next' faster than I anticipated. I'll plan on writing a more detailed list of future directions, but for now I'll list the things I know about and how I see them in priority order: Commit-graph v1.0: * ds/generation-numbers * 'verify' and fsck/gc integration * correct behavior with shallow clones, replace-objects, and grafts Commit-graph v1.1: * Place commit-graph storage in the_repository * 'git tag --merged' use generation numbers * 'git log --graph' use generation numbers Commit-graph v1.X: * Protocol v2 verb for sending commit-graph * Bloom filters for changed paths The big wins remaining from this data are `git tag --merged` and `git log --graph`. The `tag` scenario is probably easier: this can be done by replacing the revision-walk underlying the call to use paint_down_to_common() instead. Requires adding an external method to commit.c, but not too much code. I wonder if there is some significant reason behind `git tag --merged` using its own codepath, beside historical reasons. Maybe performance is better with current code? Utilizing paint_down_to_common() there, beside reducing amount of code you would have to modify, would also unify code (and possibly reduce amount of code). That's very nice. The tougher challenge is `git log --graph`. The revision walk machinery currently uses two precompute phases before iterating results to the pager: limit_list() and sort_in_topological_order(); these correspond to two phases of Kahn's algorithm for topo-sort (compute in-degrees, then walk by peeling commits with in-degree zero). This requires O(N) time, where N is the number of reachable commits. Instead, we could make this be O(W) time to output one page of results, where W is (roughly) the number of reachable commits with generation number above the last reported result. A reminder: Kahn's algorithm (described for example in [1] and [3]) looks like this: L ← Empty list that will contain the sorted elements S ← Collection of all nodes with no incoming edge while S is non-empty do remove a node 'n' from S add 'n' to tail of L for each parent 'm' of 'n' do decrease the in-degree of 'm' if 'm' has in-degree of 0 then insert 'm' into S [1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm [2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-soluti
Re: [PATCH 4/4] mark_parents_uninteresting(): avoid most allocation
On 5/14/2018 9:09 AM, Jeff King wrote: On Mon, May 14, 2018 at 08:47:33AM -0400, Derrick Stolee wrote: On 5/11/2018 2:03 PM, Jeff King wrote: Commit 941ba8db57 (Eliminate recursion in setting/clearing marks in commit list, 2012-01-14) used a clever double-loop to avoid allocations for single-parent chains of history. However, it did so only when following parents of parents (which was an uncommon case), and _always_ incurred at least one allocation to populate the list of pending parents in the first place. We can turn this into zero-allocation in the common case by iterating directly over the initial parent list, and then following up on any pending items we might have discovered. This change appears to improve performance, but I was unable to measure any difference between this commit and the one ahead, even when merging ds/generation-numbers (which significantly reduces the other costs). I was testing 'git status' and 'git rev-list --boundary master...origin/master' in the Linux repo with my copy of master 70,000+ commits behind origin/master. I think you'd want to go the other way: this is marking uninteresting commits, so you'd want origin/master..master, which would make those 70k commits uninteresting. Thanks for the tip. Running 'git rev-list origin/master..master' 100 times gave the following times, on average (with ds/generation-numbers): PATCH 3/4: 0.19 s PATCH 4/4: 0.16 s Rel %: -16% I still doubt it will create that much difference, though. It's more or less saving a single malloc per commit we traverse. Certainly without the .commits file that's a drop in the bucket compared to inflating the object data, but I wouldn't be surprised if we end up with a few mallocs even after your patches. Anyway, thanks for poking at it. :) -Peff
Re: [PATCH v2 01/12] commit-graph: add 'verify' subcommand
On 5/12/2018 9:31 AM, Martin Ågren wrote: On 11 May 2018 at 23:15, Derrick Stolee wrote: graph_name = get_commit_graph_filename(opts.obj_dir); graph = load_commit_graph_one(graph_name); + FREE_AND_NULL(graph_name); if (!graph) die("graph file %s does not exist", graph_name); - FREE_AND_NULL(graph_name); This is probably because of something I said, but this does not look correct. The `die()` would typically print "(null)" or segfault. If the `die()` means we don't free `graph_name`, that should be fine. You're still leaking `graph` here (possibly nothing this patch should worry about) and in `graph_verify()`. UNLEAK-ing it immediately before calling `verify_commit_graph()` should be ok. I also think punting on this UNLEAK-business entirely would be ok; I was just a bit surprised to see one variable getting freed and the other one ignored. Thanks, Martin. I was just blindly searching for FREE_AND_NULL() and shouldn't have been so careless. -Stolee
Re: [PATCH v2 02/12] commit-graph: verify file header information
On 5/12/2018 9:35 AM, Martin Ågren wrote: +static int verify_commit_graph_error; + +static void graph_report(const char *fmt, ...) +{ + va_list ap; + struct strbuf sb = STRBUF_INIT; + verify_commit_graph_error = 1; + + va_start(ap, fmt); + strbuf_vaddf(&sb, fmt, ap); + + fprintf(stderr, "%s\n", sb.buf); + strbuf_release(&sb); + va_end(ap); +} That's a good idea. Makes that patch a bit less trivial and this one a bit less difficult.
Re: [PATCH v2 08/12] commit-graph: verify commit contents against odb
On 5/12/2018 5:17 PM, Martin Ågren wrote: On 11 May 2018 at 23:15, Derrick Stolee wrote: When running 'git commit-graph verify', compare the contents of the commits that are loaded from the commit-graph file with commits that are loaded directly from the object database. This includes checking the root tree object ID, commit date, and parents. Parse the commit from the graph during the initial loop through the object IDs to guarantee we parse from the commit-graph file. In addition, verify the generation number calculation is correct for all commits in the commit-graph file. While testing, we discovered that mutating the integer value for a parent to be outside the accepted range causes a segmentation fault. Add a new check in insert_parent_or_die() that prevents this fault. Check for that error during the test, both in the typical parents and in the list of parents for octopus merges. This paragraph and the corresponding fix and test feel like a separate patch to me. (The commit message of it could be "To test the next patch, we threw invalid data at `git commit-graph verify, and it crashed in pre-existing code, so let's fix that first" -- there is definitely a connection.) Is this important enough to fast-track to master in time for 2.18? My guess would be no. + + if (pos >= g->num_commits) + die("invalide parent position %"PRIu64, pos); s/invalide/invalid/ @@ -875,6 +879,8 @@ int verify_commit_graph(struct commit_graph *g) return 1; for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit; + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); if (i && oidcmp(&prev_oid, &cur_oid) >= 0) @@ -892,6 +898,10 @@ int verify_commit_graph(struct commit_graph *g) cur_fanout_pos++; } + + graph_commit = lookup_commit(&cur_oid); + if (!parse_commit_in_graph_one(g, graph_commit)) + graph_report("failed to parse %s from commit-graph", oid_to_hex(&cur_oid)); } Could this end up giving ridiculous amounts of output? It would depend on the input, I guess. while (cur_fanout_pos < 256) { @@ -904,5 +914,95 @@ int verify_commit_graph(struct commit_graph *g) cur_fanout_pos++; } + if (verify_commit_graph_error) + return 1; Well, here we give up before running into *too* much problem. + for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit, *odb_commit; + struct commit_list *graph_parents, *odb_parents; + int num_parents = 0; + + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + graph_commit = lookup_commit(&cur_oid); + odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node()); + if (parse_commit_internal(odb_commit, 0, 0)) { + graph_report("failed to parse %s from object database", oid_to_hex(&cur_oid)); + continue; + } + + if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid, + get_commit_tree_oid(odb_commit))) + graph_report("root tree object ID for commit %s in commit-graph is %s != %s", +oid_to_hex(&cur_oid), + oid_to_hex(get_commit_tree_oid(graph_commit)), + oid_to_hex(get_commit_tree_oid(odb_commit))); + + if (graph_commit->date != odb_commit->date) + graph_report("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime"", +oid_to_hex(&cur_oid), +graph_commit->date, +odb_commit->date); + + + graph_parents = graph_commit->parents; + odb_parents = odb_commit->parents; + + while (graph_parents) { + num_parents++; + + if (odb_parents == NULL) + graph_report("commit-graph parent list for commit %s is too long (%d)", +oid_to_hex(&cur_oid), +num_parents); + + if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid)) + graph_report("commit-graph parent for %s is %s != %s", +oid_to_hex(&cur_oid), +
Re: [PATCH v2 14/14] commit.h: delete 'util' field in struct commit
On 5/14/2018 1:30 PM, Duy Nguyen wrote: On Mon, May 14, 2018 at 6:07 PM, Duy Nguyen wrote: On Mon, May 14, 2018 at 04:52:29PM +0900, Junio C Hamano wrote: Nguyễn Thái Ngọc Duy writes: diff --git a/commit.h b/commit.h index 838f6a6b26..70371e111e 100644 --- a/commit.h +++ b/commit.h @@ -18,12 +18,16 @@ struct commit_list { struct commit { struct object object; - void *util; unsigned int index; timestamp_t date; struct commit_list *parents; struct tree *tree; uint32_t graph_pos; + /* +* Do not add more fields here unless it's _very_ often +* used. Use commit-slab to associate more data with a commit +* instead. +*/ }; That's a logical consequence and a natural endgame of this pleasent-to-read series. THanks. Unfortunately we are gaining more and more stuff in "struct commit" with recent topics, and we may want to see if we can evict some of them out to shrink it again. Sigh.. ds/lazy-load-trees already enters 'next' so a fixup patch is something like this. Sorry I take this patch back. I didn't realize it was a rename and the old field named 'tree' was already there (I vaguely recalled some "tree" in this struct but didn't stop to think about it). Moving graph_pos out is an option, but only 32-bit arch gains from it (64-bit arch will have 4 bytes padding anyway) so probably not worth the effort. "generation" field should probably be moved out though. I'm happy to take a look at this patch and figure out the best way to integrate it with the changes I've been doing. I disagree with the removal of "generation". My intention is to make the commit-graph feature a standard feature that most repos (of reasonable size) want to have. In that case, 'graph_pos' and 'generation' will be set during every parse and used by every commit walk. This is just my gut reaction. As I investigate these changes, I'll try to see what performance hit is caused by converting the graph_pos and/or generation to commit slabs. (Again, I'm assuming the slabs will make this slower. I may be wrong here.) Thanks, -Stolee
Re: worktrees vs. alternates
On 5/16/2018 6:33 AM, Ævar Arnfjörð Bjarmason wrote: [big snip] And here's where this isn't at all like "worktree", each of those 100 will have their own "master" branch, and they can all create 100 different branches called "topic" that can be different. This is the biggest difference. You cannot have the same ref checked out in multiple worktrees, as they both may edit that ref. The alternates allow you to share data in a "read only" fashion. If you have one repo that is the "base" repo that manages that objects dir, then that is probably a good way to reduce the duplication. I'm not familiar with what happens when a "child" repo does 'git gc' or 'git repack', will it delete the local objects that is sees exist in the alternate? GVFS uses alternates in this same way: we create a drive-wide "shared object cache" that GVFS manages. We put our prefetch packs filled with commits and trees in there, and any loose objects that are downloaded via the object virtualization are placed as loose objects in the alternate. We also store the multi-pack-index and commit-graph in that alternate. This means that the only objects in each src dir are those created by the developer doing their normal work. Thanks, -Stolee
Re: Git log range reverse bug
Hi Mendi, On 5/16/2018 2:19 PM, Mehdi Zeinali wrote: Git Version: Version: 2.14.2 When reversing a range in git log, it does not start from the expected commit: $ git show 8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b commit 8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b Author: Some Name Date: Mon Nov 3 19:01:53 2014 + . . . $ git show Author: Some Other Name Date: Wed May 16 16:49:10 2018 + . . . $ git log --reverse 8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b..HEAD This command is asking for the commits reachable from HEAD but NOT reachable from 8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b. To see 8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b in the results, you would need to add "--boundary" to the command. That may still not show 8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b as the first commit, as there may be multiple, earlier boundary commits. Thanks, -Stolee
Re: What's cooking in git.git (May 2018, #02; Thu, 17)
On 5/17/2018 2:01 AM, Junio C Hamano wrote: * ds/generation-numbers (2018-05-02) 11 commits - commit-graph.txt: update design document - merge: check config before loading commits - commit: use generation number in remove_redundant() - commit: add short-circuit to paint_down_to_common() - commit: use generation numbers for in_merge_bases() - ref-filter: use generation number for --contains - commit-graph: always load commit-graph information - commit: use generations in paint_down_to_common() - commit-graph: compute generation numbers - commit: add generation number to struct commmit - ref-filter: fix outdated comment on in_commit_list (this branch is used by ds/commit-graph-lockfile-fix; uses ds/lazy-load-trees.) A recently added "commit-graph" datafile has learned to store pre-computed generation numbers to speed up the decisions to stop history traversal. Is this ready for 'next' with ds/commit-graph-lockfile-fix? A commit with triple 'm' needs its title amended, though. With the lockfile fix, it should be ready. I've been giving this significant testing on my machine and a few other developers here. The next version of GVFS is shipping with this code and with GVFS controlling the maintenance of the commit-graph file. That code has been cooking with our CI builds for a while, with full functional tests against the Windows repository. The only bugs we've found are the fix in "merge: check config before loading commits" and in ds/commit-graph-lockfile-fix. Sorry for the triple-m. Thanks, -Stolee
commit-graph: change in "best" merge-base when ambiguous
Hello all, While working on the commit-graph feature, I made a test commit that sets core.commitGraph and gc.commitGraph to true by default AND runs 'git commit-graph write --reachable' after each 'git commit' command. This helped me find instances in the test suite where the commit-graph feature changes existing functionality. Most of these were in regards to grafts, replace-objects, and shallow-clones (as expected) or when trying to find a corrupt or hidden commit (the commit-graph hides this corrupt/missing data). However, there was one interesting case that I'd like to mention on-list. In t6024-recursive-merge.sh, we have the following commit structure: # 1 - A - D - F # \ X / # B X # X \ # 2 - C - E - G When merging F to G, there are two "best" merge-bases, A and C. With core.commitGraph=false, 'git merge-base F G' returns A, while it returns C when core.commitGraph=true. This is due to the new walk order when using generation numbers, although I have not dug deep into the code to point out exactly where the choice between A and C is made. Likely it's just whatever order they are inserted into a list. In the Discussion section of the `git merge-base` docs [1], we have the following: When the history involves criss-cross merges, there can be more than one best common ancestor for two commits. For example, with this topology: ---1---o---A \ / X / \ ---2---o---o---B both 1 and 2 are merge-bases of A and B. Neither one is better than the other (both are best merge bases). When the --all option is not given, it is unspecified which best one is output. This means our official documentation mentions that we do not have a concrete way to differentiate between these choices. This makes me think that this change in behavior is not a bug, but it _is_ a change in behavior. It's worth mentioning, but I don't think there is any value in making sure `git merge-base` returns the same output. Does anyone disagree? Is this something we should solidify so we always have a "definitive" merge-base? The biggest reason I think we should avoid sticking to the existing behavior is that the current behavior depends on the walk order. That means we would not be able to concretely define a tie-breaker without changing the existing behavior anyway. Thanks, -Stolee [1] https://git-scm.com/docs/git-merge-base#_discussion
Re: commit-graph: change in "best" merge-base when ambiguous
On 5/22/2018 1:39 AM, Michael Haggerty wrote: On 05/21/2018 08:10 PM, Derrick Stolee wrote: [...] In the Discussion section of the `git merge-base` docs [1], we have the following: When the history involves criss-cross merges, there can be more than one best common ancestor for two commits. For example, with this topology: ---1---o---A \ / X / \ ---2---o---o---B both 1 and 2 are merge-bases of A and B. Neither one is better than the other (both are best merge bases). When the --all option is not given, it is unspecified which best one is output. This means our official documentation mentions that we do not have a concrete way to differentiate between these choices. This makes me think that this change in behavior is not a bug, but it _is_ a change in behavior. It's worth mentioning, but I don't think there is any value in making sure `git merge-base` returns the same output. Does anyone disagree? Is this something we should solidify so we always have a "definitive" merge-base? [...] This may be beyond the scope of what you are working on, but there are significant advantages to selecting a "best" merge base from among the candidates. Long ago [1] I proposed that the "best" merge base is the merge base candidate that minimizes the number of non-merge commits that are in git rev-list $candidate..$branch that are already in master: git rev-list $master (assuming merging branch into master), which is equivalent to choosing the merge base that minimizes git rev-list --count $candidate..$branch In fact, this criterion is symmetric if you exchange branch ↔ master, which is a nice property, and indeed generalizes pretty simply to computing the merge base of more than two commits. In that email I also included some data showing that the "best" merge base almost always results in either the same or a shorter diff than the more or less arbitrary algorithm that we currently use. Sometimes the difference in diff length is dramatic. To me it feels like the best *deterministic* merge base would be based on the above criterion, maybe with first-parent reachability, commit times, and SHA-1s used (in that order) to break ties. Thanks, everyone, for your perspective on this. I'm walking away with these conclusions: 1. While this is a change in behavior, it is not a regression. We do not need to act immediately to preserve old behavior in these ambiguous cases. 2. We should (eventually) define tie-breaking conditions. I like Michael's suggestion above. Thanks, -Stolee
[PATCH v3 07/20] commit-graph: verify catches corrupt signature
This is the first of several commits that add a test to check that 'git commit-graph verify' catches corruption in the commit-graph file. The first test checks that the command catches an error in the file signature. This is a check that exists in the existing commit-graph reading code. Add a helper method 'corrupt_graph_and_verify' to the test script t5318-commit-graph.sh. This helper corrupts the commit-graph file at a certain location, runs 'git commit-graph verify', and reports the output to the 'err' file. This data is filtered to remove the lines added by 'test_must_fail' when the test is run verbosely. Then, the output is checked to contain a specific error message. Signed-off-by: Derrick Stolee --- t/t5318-commit-graph.sh | 43 +++ 1 file changed, 43 insertions(+) diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 6ca451dfd2..bd64481c7a 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -235,9 +235,52 @@ test_expect_success 'perform fast-forward merge in full repo' ' test_cmp expect output ' +# the verify tests below expect the commit-graph to contain +# exactly the commits reachable from the commits/8 branch. +# If the file changes the set of commits in the list, then the +# offsets into the binary file will result in different edits +# and the tests will likely break. + test_expect_success 'git commit-graph verify' ' cd "$TRASH_DIRECTORY/full" && + git rev-parse commits/8 | git commit-graph write --stdin-commits && git commit-graph verify >output ' +GRAPH_BYTE_VERSION=4 +GRAPH_BYTE_HASH=5 + +# usage: corrupt_graph_and_verify +# Manipulates the commit-graph file at the position +# by inserting the data, then runs 'git commit-graph verify' +# and places the output in the file 'err'. Test 'err' for +# the given string. +corrupt_graph_and_verify() { + pos=$1 + data="${2:-\0}" + grepstr=$3 + cd "$TRASH_DIRECTORY/full" && + test_when_finished mv commit-graph-backup $objdir/info/commit-graph && + cp $objdir/info/commit-graph commit-graph-backup && + printf "$data" | dd of="$objdir/info/commit-graph" bs=1 seek="$pos" conv=notrunc && + test_must_fail git commit-graph verify 2>test_err && + grep -v "^+" test_err >err + grep "$grepstr" err +} + +test_expect_success 'detect bad signature' ' + corrupt_graph_and_verify 0 "\0" \ + "graph signature" +' + +test_expect_success 'detect bad version' ' + corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \ + "graph version" +' + +test_expect_success 'detect bad hash version' ' + corrupt_graph_and_verify $GRAPH_BYTE_HASH "\02" \ + "hash version" +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v3 03/20] commit-graph: parse commit from chosen graph
Before verifying a commit-graph file against the object database, we need to parse all commits from the given commit-graph file. Create parse_commit_in_graph_one() to target a given struct commit_graph. Signed-off-by: Derrick Stolee --- commit-graph.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 82295f0975..78ba0edc80 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -310,7 +310,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin } } -int parse_commit_in_graph(struct commit *item) +static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item) { uint32_t pos; @@ -318,9 +318,21 @@ int parse_commit_in_graph(struct commit *item) return 0; if (item->object.parsed) return 1; + + if (find_commit_in_graph(item, g, &pos)) + return fill_commit_in_graph(item, g, pos); + + return 0; +} + +int parse_commit_in_graph(struct commit *item) +{ + if (!core_commit_graph) + return 0; + prepare_commit_graph(); - if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) - return fill_commit_in_graph(item, commit_graph, pos); + if (commit_graph) + return parse_commit_in_graph_one(commit_graph, item); return 0; } -- 2.16.2.329.gfb62395de6
[PATCH v3 04/20] commit: force commit to parse from object database
In anticipation of verifying commit-graph file contents against the object database, create parse_commit_internal() to allow side-stepping the commit-graph file and parse directly from the object database. Due to the use of generation numbers, this method should not be called unless the intention is explicit in avoiding commits from the commit-graph file. Signed-off-by: Derrick Stolee --- commit.c | 9 +++-- commit.h | 1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/commit.c b/commit.c index 1d28677dfb..6eaed0174c 100644 --- a/commit.c +++ b/commit.c @@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s return 0; } -int parse_commit_gently(struct commit *item, int quiet_on_missing) +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph) { enum object_type type; void *buffer; @@ -403,7 +403,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return -1; if (item->object.parsed) return 0; - if (parse_commit_in_graph(item)) + if (use_commit_graph && parse_commit_in_graph(item)) return 0; buffer = read_sha1_file(item->object.oid.hash, &type, &size); if (!buffer) @@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return ret; } +int parse_commit_gently(struct commit *item, int quiet_on_missing) +{ + return parse_commit_internal(item, quiet_on_missing, 1); +} + void parse_commit_or_die(struct commit *item) { if (parse_commit(item)) diff --git a/commit.h b/commit.h index b5afde1ae9..5fde74fcd7 100644 --- a/commit.h +++ b/commit.h @@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph); int parse_commit_gently(struct commit *item, int quiet_on_missing); static inline int parse_commit(struct commit *item) { -- 2.16.2.329.gfb62395de6
[PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc'
Thanks for all the feedback on v2. I've tried to make this round's review a bit easier by splitting up the commits into smaller pieces. Also, the test script now has less boilerplate and uses variables and clear arithmetic to explain which bytes are being modified. One other change worth mentioning: in "commit-graph: add '--reachable' option" I put the ref-iteration into a new external 'write_commit_graph_reachable()' method inside commit-graph.c. This makes the 'gc: automatically write commit-graph files' a simpler change. Thanks, -Stolee Derrick Stolee (20): commit-graph: UNLEAK before die() commit-graph: fix GRAPH_MIN_SIZE commit-graph: parse commit from chosen graph commit: force commit to parse from object database commit-graph: load a root tree from specific graph commit-graph: add 'verify' subcommand commit-graph: verify catches corrupt signature commit-graph: verify required chunks are present commit-graph: verify corrupt OID fanout and lookup commit-graph: verify objects exist commit-graph: verify root tree OIDs commit-graph: verify parent list commit-graph: verify generation number commit-graph: verify commit date commit-graph: test for corrupted octopus edge commit-graph: verify contents match checksum fsck: verify commit-graph commit-graph: add '--reachable' option gc: automatically write commit-graph files commit-graph: update design document Documentation/config.txt | 6 + Documentation/git-commit-graph.txt | 14 +- Documentation/git-fsck.txt | 3 + Documentation/git-gc.txt | 4 + Documentation/technical/commit-graph.txt | 22 --- builtin/commit-graph.c | 59 +++- builtin/fsck.c | 21 +++ builtin/gc.c | 6 + commit-graph.c | 234 +-- commit-graph.h | 3 + commit.c | 9 +- commit.h | 1 + t/t5318-commit-graph.sh | 196 ++ 13 files changed, 539 insertions(+), 39 deletions(-) base-commit: 34fdd433396ee0e3ef4de02eb2189f8226eafe4e -- 2.16.2.329.gfb62395de6
[PATCH v3 06/20] commit-graph: add 'verify' subcommand
If the commit-graph file becomes corrupt, we need a way to verify that its contents match the object database. In the manner of 'git fsck' we will implement a 'git commit-graph verify' subcommand to report all issues with the file. Add the 'verify' subcommand to the 'commit-graph' builtin and its documentation. The subcommand is currently a no-op except for loading the commit-graph into memory, which may trigger run-time errors that would be caught by normal use. Add a simple test that ensures the command returns a zero error code. If no commit-graph file exists, this is an acceptable state. Do not report any errors. Signed-off-by: Derrick Stolee --- Documentation/git-commit-graph.txt | 6 ++ builtin/commit-graph.c | 38 ++ commit-graph.c | 26 ++ commit-graph.h | 2 ++ t/t5318-commit-graph.sh| 10 ++ 5 files changed, 82 insertions(+) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 4c97b555cc..a222cfab08 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -10,6 +10,7 @@ SYNOPSIS [verse] 'git commit-graph read' [--object-dir ] +'git commit-graph verify' [--object-dir ] 'git commit-graph write' [--object-dir ] @@ -52,6 +53,11 @@ existing commit-graph file. Read a graph file given by the commit-graph file and output basic details about the graph file. Used for debugging purposes. +'verify':: + +Read the commit-graph file and verify its contents against the object +database. Used to check for corrupted data. + EXAMPLES diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index f0875b8bf3..0433dd6e20 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,10 +8,16 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), + N_("git commit-graph verify [--object-dir ]"), N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), NULL }; +static const char * const builtin_commit_graph_verify_usage[] = { + N_("git commit-graph verify [--object-dir ]"), + NULL +}; + static const char * const builtin_commit_graph_read_usage[] = { N_("git commit-graph read [--object-dir ]"), NULL @@ -29,6 +35,36 @@ static struct opts_commit_graph { int append; } opts; + +static int graph_verify(int argc, const char **argv) +{ + struct commit_graph *graph = 0; + char *graph_name; + + static struct option builtin_commit_graph_verify_options[] = { + OPT_STRING(0, "object-dir", &opts.obj_dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_verify_options, +builtin_commit_graph_verify_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + graph_name = get_commit_graph_filename(opts.obj_dir); + graph = load_commit_graph_one(graph_name); + FREE_AND_NULL(graph_name); + + if (!graph) + return 0; + + return verify_commit_graph(graph); +} + static int graph_read(int argc, const char **argv) { struct commit_graph *graph = NULL; @@ -163,6 +199,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) PARSE_OPT_STOP_AT_NON_OPTION); if (argc > 0) { + if (!strcmp(argv[0], "verify")) + return graph_verify(argc, argv); if (!strcmp(argv[0], "read")) return graph_read(argc, argv); if (!strcmp(argv[0], "write")) diff --git a/commit-graph.c b/commit-graph.c index 25893ec096..55b41664ee 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -836,3 +836,29 @@ void write_commit_graph(const char *obj_dir, oids.alloc = 0; oids.nr = 0; } + +static int verify_commit_graph_error; + +static void graph_report(const char *fmt, ...) +{ + va_list ap; + struct strbuf sb = STRBUF_INIT; + verify_commit_graph_error = 1; + + va_start(ap, fmt); + strbuf_vaddf(&sb, fmt, ap); + + fprintf(stderr, "%s\n", sb.buf); + strbuf_release(&sb); + va_end(ap); +} + +int verify_commit_graph(struct commit_graph *g) +{ + if (!g) { + graph_report("no commit-graph file loaded"); + return
[PATCH v3 12/20] commit-graph: verify parent list
The commit-graph file stores parents in a two-column portion of the commit data chunk. If there is only one parent, then the second column stores 0x to indicate no second parent. The 'verify' subcommand checks the parent list for the commit loaded from the commit-graph and the one parsed from the object database. Test these checks for corrupt parents, too many parents, and wrong parents. The octopus merge will be tested in a later commit. Signed-off-by: Derrick Stolee --- commit-graph.c | 25 + t/t5318-commit-graph.sh | 18 ++ 2 files changed, 43 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 19ea369fc6..fff22dc0c3 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -921,6 +921,7 @@ int verify_commit_graph(struct commit_graph *g) for (i = 0; i < g->num_commits; i++) { struct commit *graph_commit, *odb_commit; + struct commit_list *graph_parents, *odb_parents; hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); @@ -938,6 +939,30 @@ int verify_commit_graph(struct commit_graph *g) oid_to_hex(&cur_oid), oid_to_hex(get_commit_tree_oid(graph_commit)), oid_to_hex(get_commit_tree_oid(odb_commit))); + + graph_parents = graph_commit->parents; + odb_parents = odb_commit->parents; + + while (graph_parents) { + if (odb_parents == NULL) { + graph_report("commit-graph parent list for commit %s is too long", +oid_to_hex(&cur_oid)); + break; + } + + if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid)) + graph_report("commit-graph parent for %s is %s != %s", +oid_to_hex(&cur_oid), + oid_to_hex(&graph_parents->item->object.oid), + oid_to_hex(&odb_parents->item->object.oid)); + + graph_parents = graph_parents->next; + odb_parents = odb_parents->next; + } + + if (odb_parents != NULL) + graph_report("commit-graph parent list for commit %s terminates early", +oid_to_hex(&cur_oid)); } return verify_commit_graph_error; diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 21cc8e82f3..12f0d7f54d 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -269,6 +269,9 @@ GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8` GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10` GRAPH_COMMIT_DATA_OFFSET=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* $NUM_COMMITS` GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET +GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN` +GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4` +GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3` # usage: corrupt_graph_and_verify # Manipulates the commit-graph file at the position @@ -348,4 +351,19 @@ test_expect_success 'detect incorrect tree OID' ' "root tree OID for commit" ' +test_expect_success 'detect incorrect parent int-id' ' + corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_PARENT "\01" \ + "invalid parent" +' + +test_expect_success 'detect extra parent int-id' ' + corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_EXTRA_PARENT "\00" \ + "is too long" +' + +test_expect_success 'detect incorrect tree OID' ' + corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_WRONG_PARENT "\01" \ + "commit-graph parent for" +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v3 01/20] commit-graph: UNLEAK before die()
Signed-off-by: Derrick Stolee --- builtin/commit-graph.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 37420ae0fd..f0875b8bf3 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -51,8 +51,11 @@ static int graph_read(int argc, const char **argv) graph_name = get_commit_graph_filename(opts.obj_dir); graph = load_commit_graph_one(graph_name); - if (!graph) + if (!graph) { + UNLEAK(graph_name); die("graph file %s does not exist", graph_name); + } + FREE_AND_NULL(graph_name); printf("header: %08x %d %d %d %d\n", -- 2.16.2.329.gfb62395de6
[PATCH v3 14/20] commit-graph: verify commit date
Signed-off-by: Derrick Stolee --- commit-graph.c | 6 ++ t/t5318-commit-graph.sh | 6 ++ 2 files changed, 12 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index ead92460c1..d2b291aca2 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -981,6 +981,12 @@ int verify_commit_graph(struct commit_graph *g) oid_to_hex(&cur_oid), graph_commit->generation, max_generation + 1); + + if (graph_commit->date != odb_commit->date) + graph_report("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime, +oid_to_hex(&cur_oid), +graph_commit->date, +odb_commit->date); } return verify_commit_graph_error; diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 673b0d37d5..58adb8246d 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -273,6 +273,7 @@ GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN` GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4` GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3` GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8` +GRAPH_BYTE_COMMIT_DATE=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12` # usage: corrupt_graph_and_verify # Manipulates the commit-graph file at the position @@ -372,4 +373,9 @@ test_expect_success 'detect incorrect generation number' ' "generation" ' +test_expect_success 'detect incorrect commit date' ' + corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_DATE "\01" \ + "commit date" +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v3 11/20] commit-graph: verify root tree OIDs
The 'verify' subcommand must compare the commit content parsed from the commit-graph and compare it against the content in the object database. Use lookup_commit() and parse_commit_in_graph_one() to parse the commits from the graph and compare against a commit that is loaded separately and parsed directly from the object database. Add checks for the root tree OID. Signed-off-by: Derrick Stolee --- commit-graph.c | 17 - t/t5318-commit-graph.sh | 7 +++ 2 files changed, 23 insertions(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 0420ebcd87..19ea369fc6 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -880,6 +880,8 @@ int verify_commit_graph(struct commit_graph *g) return verify_commit_graph_error; for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit; + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); if (i && oidcmp(&prev_oid, &cur_oid) >= 0) @@ -897,6 +899,11 @@ int verify_commit_graph(struct commit_graph *g) cur_fanout_pos++; } + + graph_commit = lookup_commit(&cur_oid); + if (!parse_commit_in_graph_one(g, graph_commit)) + graph_report("failed to parse %s from commit-graph", +oid_to_hex(&cur_oid)); } while (cur_fanout_pos < 256) { @@ -913,16 +920,24 @@ int verify_commit_graph(struct commit_graph *g) return verify_commit_graph_error; for (i = 0; i < g->num_commits; i++) { - struct commit *odb_commit; + struct commit *graph_commit, *odb_commit; hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + graph_commit = lookup_commit(&cur_oid); odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node()); if (parse_commit_internal(odb_commit, 0, 0)) { graph_report("failed to parse %s from object database", oid_to_hex(&cur_oid)); continue; } + + if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid, + get_commit_tree_oid(odb_commit))) + graph_report("root tree OID for commit %s in commit-graph is %s != %s", +oid_to_hex(&cur_oid), + oid_to_hex(get_commit_tree_oid(graph_commit)), + oid_to_hex(get_commit_tree_oid(odb_commit))); } return verify_commit_graph_error; diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 996a016239..21cc8e82f3 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -267,6 +267,8 @@ GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255` GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256` GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8` GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10` +GRAPH_COMMIT_DATA_OFFSET=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* $NUM_COMMITS` +GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET # usage: corrupt_graph_and_verify # Manipulates the commit-graph file at the position @@ -341,4 +343,9 @@ test_expect_success 'detect OID not in object database' ' "from object database" ' +test_expect_success 'detect incorrect tree OID' ' + corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_TREE "\01" \ + "root tree OID for commit" +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v3 10/20] commit-graph: verify objects exist
In the 'verify' subcommand, load commits directly from the object database to ensure they exist. Parse by skipping the commit-graph. Signed-off-by: Derrick Stolee --- commit-graph.c | 20 t/t5318-commit-graph.sh | 7 +++ 2 files changed, 27 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index cbd1aae514..0420ebcd87 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -238,6 +238,10 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, { struct commit *c; struct object_id oid; + + if (pos >= g->num_commits) + die("invalid parent position %"PRIu64, pos); + hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos); c = lookup_commit(&oid); if (!c) @@ -905,5 +909,21 @@ int verify_commit_graph(struct commit_graph *g) cur_fanout_pos++; } + if (verify_commit_graph_error) + return verify_commit_graph_error; + + for (i = 0; i < g->num_commits; i++) { + struct commit *odb_commit; + + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node()); + if (parse_commit_internal(odb_commit, 0, 0)) { + graph_report("failed to parse %s from object database", +oid_to_hex(&cur_oid)); + continue; + } + } + return verify_commit_graph_error; } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index c050ef980b..996a016239 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' ' git commit-graph verify >output ' +NUM_COMMITS=9 HASH_LEN=20 GRAPH_BYTE_VERSION=4 GRAPH_BYTE_HASH=5 @@ -265,6 +266,7 @@ GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4` GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255` GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256` GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8` +GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10` # usage: corrupt_graph_and_verify # Manipulates the commit-graph file at the position @@ -334,4 +336,9 @@ test_expect_success 'detect incorrect OID order' ' "incorrect OID order" ' +test_expect_success 'detect OID not in object database' ' + corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_MISSING "\01" \ + "from object database" +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v3 19/20] gc: automatically write commit-graph files
The commit-graph file is a very helpful feature for speeding up git operations. In order to make it more useful, write the commit-graph file by default during standard garbage collection operations. Add a 'gc.commitGraph' config setting that triggers writing a commit-graph file after any non-trivial 'git gc' command. Defaults to false while the commit-graph feature matures. We specifically do not want to turn this on by default until the commit-graph feature is fully integrated with history-modifying features like shallow clones. Signed-off-by: Derrick Stolee --- Documentation/config.txt | 6 ++ Documentation/git-gc.txt | 4 builtin/gc.c | 6 ++ t/t5318-commit-graph.sh | 14 ++ 4 files changed, 30 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index 11f027194e..9a3abd87e7 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -1553,6 +1553,12 @@ gc.autoDetach:: Make `git gc --auto` return immediately and run in background if the system supports it. Default is true. +gc.commitGraph:: + If true, then gc will rewrite the commit-graph file after any + change to the object database. If '--auto' is used, then the + commit-graph will not be updated unless the threshold is met. + See linkgit:git-commit-graph[1] for details. + gc.logExpiry:: If the file gc.log exists, then `git gc --auto` won't run unless that file is more than 'gc.logExpiry' old. Default is diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt index 571b5a7e3c..17dd654a59 100644 --- a/Documentation/git-gc.txt +++ b/Documentation/git-gc.txt @@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if it within all non-bare repos or it can be set to a boolean value. This defaults to true. +The optional configuration variable 'gc.commitGraph' determines if +'git gc' runs 'git commit-graph write'. This can be set to a boolean +value. This defaults to false. + The optional configuration variable `gc.aggressiveWindow` controls how much time is spent optimizing the delta compression of the objects in the repository when the --aggressive option is specified. The larger diff --git a/builtin/gc.c b/builtin/gc.c index 77fa720bd0..efd214a59f 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -20,6 +20,7 @@ #include "argv-array.h" #include "commit.h" #include "packfile.h" +#include "commit-graph.h" #define FAILED_RUN "failed to run %s" @@ -34,6 +35,7 @@ static int aggressive_depth = 50; static int aggressive_window = 250; static int gc_auto_threshold = 6700; static int gc_auto_pack_limit = 50; +static int gc_commit_graph = 0; static int detach_auto = 1; static timestamp_t gc_log_expire_time; static const char *gc_log_expire = "1.day.ago"; @@ -121,6 +123,7 @@ static void gc_config(void) git_config_get_int("gc.aggressivedepth", &aggressive_depth); git_config_get_int("gc.auto", &gc_auto_threshold); git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit); + git_config_get_bool("gc.commitgraph", &gc_commit_graph); git_config_get_bool("gc.autodetach", &detach_auto); git_config_get_expiry("gc.pruneexpire", &prune_expire); git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire); @@ -480,6 +483,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix) if (pack_garbage.nr > 0) clean_pack_garbage(); + if (gc_commit_graph) + write_commit_graph_reachable(get_object_directory(), 0); + if (auto_gc && too_many_loose_objects()) warning(_("There are too many unreachable loose objects; " "run 'git prune' to remove them.")); diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a659620332..d20b17586f 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -245,6 +245,20 @@ test_expect_success 'perform fast-forward merge in full repo' ' test_cmp expect output ' +test_expect_success 'check that gc clears commit-graph' ' + cd "$TRASH_DIRECTORY/full" && + git commit --allow-empty -m "blank" && + git commit-graph write --reachable && + cp $objdir/info/commit-graph commit-graph-before-gc && + git reset --hard HEAD~1 && + git config gc.commitGraph true && + git gc && + cp $objdir/info/commit-graph commit-graph-after-gc && + ! test_cmp commit-graph-before-gc commit-graph-after-gc && + git commit-graph write --reachable && + test_cmp commit-graph-after-gc $objdir/info/commit-graph +' + # the verify tests below expect the commit-graph to contain # exactly the commits reachable from the commits/8 branch. # If the file changes the set of commits in the list, then the -- 2.16.2.329.gfb62395de6
[PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup
In the commit-graph file, the OID fanout chunk provides an index into the OID lookup. The 'verify' subcommand should find incorrect values in the fanout. Similarly, the 'verify' subcommand should find out-of-order values in the OID lookup. Signed-off-by: Derrick Stolee --- commit-graph.c | 36 t/t5318-commit-graph.sh | 22 ++ 2 files changed, 58 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 06e3e4f9ba..cbd1aae514 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -855,6 +855,9 @@ static void graph_report(const char *fmt, ...) int verify_commit_graph(struct commit_graph *g) { + uint32_t i, cur_fanout_pos = 0; + struct object_id prev_oid, cur_oid; + if (!g) { graph_report("no commit-graph file loaded"); return 1; @@ -869,5 +872,38 @@ int verify_commit_graph(struct commit_graph *g) if (!g->chunk_commit_data) graph_report("commit-graph is missing the Commit Data chunk"); + if (verify_commit_graph_error) + return verify_commit_graph_error; + + for (i = 0; i < g->num_commits; i++) { + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + if (i && oidcmp(&prev_oid, &cur_oid) >= 0) + graph_report("commit-graph has incorrect OID order: %s then %s", +oid_to_hex(&prev_oid), +oid_to_hex(&cur_oid)); + + oidcpy(&prev_oid, &cur_oid); + + while (cur_oid.hash[0] > cur_fanout_pos) { + uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos); + if (i != fanout_value) + graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u", +cur_fanout_pos, fanout_value, i); + + cur_fanout_pos++; + } + } + + while (cur_fanout_pos < 256) { + uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos); + + if (g->num_commits != fanout_value) + graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u", +cur_fanout_pos, fanout_value, i); + + cur_fanout_pos++; + } + return verify_commit_graph_error; } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 4ef3fe3dc2..c050ef980b 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' ' git commit-graph verify >output ' +HASH_LEN=20 GRAPH_BYTE_VERSION=4 GRAPH_BYTE_HASH=5 GRAPH_BYTE_CHUNK_COUNT=6 @@ -258,6 +259,12 @@ GRAPH_BYTE_OID_LOOKUP_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \ 1 \* $GRAPH_CHUNK_LOOKUP_WIDTH` GRAPH_BYTE_COMMIT_DATA_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \ 2 \* $GRAPH_CHUNK_LOOKUP_WIDTH` +GRAPH_FANOUT_OFFSET=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \ + $GRAPH_CHUNK_LOOKUP_WIDTH \* $GRAPH_CHUNK_LOOKUP_ROWS` +GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4` +GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255` +GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256` +GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8` # usage: corrupt_graph_and_verify # Manipulates the commit-graph file at the position @@ -312,4 +319,19 @@ test_expect_success 'detect missing commit data chunk' ' "missing the Commit Data chunk" ' +test_expect_success 'detect incorrect fanout' ' + corrupt_graph_and_verify $GRAPH_BYTE_FANOUT1 "\01" \ + "fanout value" +' + +test_expect_success 'detect incorrect fanout' ' + corrupt_graph_and_verify $GRAPH_BYTE_FANOUT2 "\01" \ + "fanout value" +' + +test_expect_success 'detect incorrect OID order' ' + corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_ORDER "\01" \ + "incorrect OID order" +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v3 16/20] commit-graph: verify contents match checksum
The commit-graph file ends with a SHA1 hash of the previous contents. If a commit-graph file has errors but the checksum hash is correct, then we know that the problem is a bug in Git and not simply file corruption after-the-fact. Compute the checksum right away so it is the first error that appears, and make the message translatable since this error can be "corrected" by a user by simply deleting the file and recomputing. The rest of the errors are useful only to developers. Be sure to continue checking the rest of the file data if the checksum is wrong. This is important for our tests, as we break the checksum as we modify bytes of the commit-graph file. Signed-off-by: Derrick Stolee --- commit-graph.c | 16 ++-- t/t5318-commit-graph.sh | 6 ++ 2 files changed, 20 insertions(+), 2 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index d2b291aca2..a33600c584 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -841,6 +841,7 @@ void write_commit_graph(const char *obj_dir, oids.nr = 0; } +#define VERIFY_COMMIT_GRAPH_ERROR_HASH 2 static int verify_commit_graph_error; static void graph_report(const char *fmt, ...) @@ -860,7 +861,9 @@ static void graph_report(const char *fmt, ...) int verify_commit_graph(struct commit_graph *g) { uint32_t i, cur_fanout_pos = 0; - struct object_id prev_oid, cur_oid; + struct object_id prev_oid, cur_oid, checksum; + struct hashfile *f; + int devnull; if (!g) { graph_report("no commit-graph file loaded"); @@ -879,6 +882,15 @@ int verify_commit_graph(struct commit_graph *g) if (verify_commit_graph_error) return verify_commit_graph_error; + devnull = open("/dev/null", O_WRONLY); + f = hashfd(devnull, NULL); + hashwrite(f, g->data, g->data_len - g->hash_len); + finalize_hashfile(f, checksum.hash, CSUM_CLOSE); + if (hashcmp(checksum.hash, g->data + g->data_len - g->hash_len)) { + graph_report(_("the commit-graph file has incorrect checksum and is likely corrupt")); + verify_commit_graph_error = VERIFY_COMMIT_GRAPH_ERROR_HASH; + } + for (i = 0; i < g->num_commits; i++) { struct commit *graph_commit; @@ -916,7 +928,7 @@ int verify_commit_graph(struct commit_graph *g) cur_fanout_pos++; } - if (verify_commit_graph_error) + if (verify_commit_graph_error & ~VERIFY_COMMIT_GRAPH_ERROR_HASH) return verify_commit_graph_error; for (i = 0; i < g->num_commits; i++) { diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 240aef6add..2680a2ebff 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -279,6 +279,7 @@ GRAPH_COMMIT_DATA_WIDTH=`expr $HASH_LEN + 16` GRAPH_OCTOPUS_DATA_OFFSET=`expr $GRAPH_COMMIT_DATA_OFFSET + \ $GRAPH_COMMIT_DATA_WIDTH \* $NUM_COMMITS` GRAPH_BYTE_OCTOPUS=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4` +GRAPH_BYTE_FOOTER=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4 \* $NUM_OCTOPUS_EDGES` # usage: corrupt_graph_and_verify # Manipulates the commit-graph file at the position @@ -388,4 +389,9 @@ test_expect_success 'detect incorrect parent for octopus merge' ' "invalid parent" ' +test_expect_success 'detect invalid checksum hash' ' + corrupt_graph_and_verify $GRAPH_BYTE_FOOTER "\00" \ + "incorrect checksum" +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v3 02/20] commit-graph: fix GRAPH_MIN_SIZE
The GRAPH_MIN_SIZE macro should be the smallest size of a parsable commit-graph file. However, the minimum number of chunks was wrong. It is possible to write a commit-graph file with zero commits, and that violates this macro's value. Rewrite the macro, and use extra macros to better explain the magic constants. Signed-off-by: Derrick Stolee --- commit-graph.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index a8c337dd77..82295f0975 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -33,10 +33,11 @@ #define GRAPH_LAST_EDGE 0x8000 +#define GRAPH_HEADER_SIZE 8 #define GRAPH_FANOUT_SIZE (4 * 256) #define GRAPH_CHUNKLOOKUP_WIDTH 12 -#define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \ - GRAPH_OID_LEN + 8) +#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \ + + GRAPH_FANOUT_SIZE + GRAPH_OID_LEN) char *get_commit_graph_filename(const char *obj_dir) { -- 2.16.2.329.gfb62395de6
[PATCH v3 05/20] commit-graph: load a root tree from specific graph
When lazy-loading a tree for a commit, it will be important to select the tree from a specific struct commit_graph. Create a new method that specifies the commit-graph file and use that in get_commit_tree_in_graph(). Signed-off-by: Derrick Stolee --- commit-graph.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 78ba0edc80..25893ec096 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -358,14 +358,20 @@ static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit * return c->maybe_tree; } -struct tree *get_commit_tree_in_graph(const struct commit *c) +static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g, +const struct commit *c) { if (c->maybe_tree) return c->maybe_tree; if (c->graph_pos == COMMIT_NOT_FROM_GRAPH) - BUG("get_commit_tree_in_graph called from non-commit-graph commit"); + BUG("get_commit_tree_in_graph_one called from non-commit-graph commit"); + + return load_tree_for_commit(g, (struct commit *)c); +} - return load_tree_for_commit(commit_graph, (struct commit *)c); +struct tree *get_commit_tree_in_graph(const struct commit *c) +{ + return get_commit_tree_in_graph_one(commit_graph, c); } static void write_graph_chunk_fanout(struct hashfile *f, -- 2.16.2.329.gfb62395de6
[PATCH v3 08/20] commit-graph: verify required chunks are present
The commit-graph file requires the following three chunks: * OID Fanout * OID Lookup * Commit Data If any of these are missing, then the 'verify' subcommand should report a failure. This includes the chunk IDs malformed or the chunk count is truncated. Signed-off-by: Derrick Stolee --- commit-graph.c | 9 + t/t5318-commit-graph.sh | 29 + 2 files changed, 38 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 55b41664ee..06e3e4f9ba 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -860,5 +860,14 @@ int verify_commit_graph(struct commit_graph *g) return 1; } + verify_commit_graph_error = 0; + + if (!g->chunk_oid_fanout) + graph_report("commit-graph is missing the OID Fanout chunk"); + if (!g->chunk_oid_lookup) + graph_report("commit-graph is missing the OID Lookup chunk"); + if (!g->chunk_commit_data) + graph_report("commit-graph is missing the Commit Data chunk"); + return verify_commit_graph_error; } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index bd64481c7a..4ef3fe3dc2 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -249,6 +249,15 @@ test_expect_success 'git commit-graph verify' ' GRAPH_BYTE_VERSION=4 GRAPH_BYTE_HASH=5 +GRAPH_BYTE_CHUNK_COUNT=6 +GRAPH_CHUNK_LOOKUP_OFFSET=8 +GRAPH_CHUNK_LOOKUP_WIDTH=12 +GRAPH_CHUNK_LOOKUP_ROWS=5 +GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET +GRAPH_BYTE_OID_LOOKUP_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \ + 1 \* $GRAPH_CHUNK_LOOKUP_WIDTH` +GRAPH_BYTE_COMMIT_DATA_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \ + 2 \* $GRAPH_CHUNK_LOOKUP_WIDTH` # usage: corrupt_graph_and_verify # Manipulates the commit-graph file at the position @@ -283,4 +292,24 @@ test_expect_success 'detect bad hash version' ' "hash version" ' +test_expect_success 'detect bad chunk count' ' + corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\02" \ + "missing the Commit Data chunk" +' + +test_expect_success 'detect missing OID fanout chunk' ' + corrupt_graph_and_verify $GRAPH_BYTE_OID_FANOUT_ID "\0" \ + "missing the OID Fanout chunk" +' + +test_expect_success 'detect missing OID lookup chunk' ' + corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_ID "\0" \ + "missing the OID Lookup chunk" +' + +test_expect_success 'detect missing commit data chunk' ' + corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_DATA_ID "\0" \ + "missing the Commit Data chunk" +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v3 20/20] commit-graph: update design document
The commit-graph feature is now integrated with 'fsck' and 'gc', so remove those items from the "Future Work" section of the commit-graph design document. Also remove the section on lazy-loading trees, as that was completed in an earlier patch series. Signed-off-by: Derrick Stolee --- Documentation/technical/commit-graph.txt | 22 -- 1 file changed, 22 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index e1a883eb46..c664acbd76 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -118,9 +118,6 @@ Future Work - The commit graph feature currently does not honor commit grafts. This can be remedied by duplicating or refactoring the current graft logic. -- The 'commit-graph' subcommand does not have a "verify" mode that is - necessary for integration with fsck. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered @@ -130,25 +127,6 @@ Future Work - 'log --topo-order' - 'tag --merged' -- Currently, parse_commit_gently() requires filling in the root tree - object for a commit. This passes through lookup_tree() and consequently - lookup_object(). Also, it calls lookup_commit() when loading the parents. - These method calls check the ODB for object existence, even if the - consumer does not need the content. For example, we do not need the - tree contents when computing merge bases. Now that commit parsing is - removed from the computation time, these lookup operations are the - slowest operations keeping graph walks from being fast. Consider - loading these objects without verifying their existence in the ODB and - only loading them fully when consumers need them. Consider a method - such as "ensure_tree_loaded(commit)" that fully loads a tree before - using commit->tree. - -- The current design uses the 'commit-graph' subcommand to generate the graph. - When this feature stabilizes enough to recommend to most users, we should - add automatic graph writes to common operations that create many commits. - For example, one could compute a graph on 'clone', 'fetch', or 'repack' - commands. - - A server could provide a commit graph file as part of the network protocol to avoid extra calculations by clients. This feature is only of benefit if the user is willing to trust the file, because verifying the file is correct -- 2.16.2.329.gfb62395de6
[PATCH v3 15/20] commit-graph: test for corrupted octopus edge
The commit-graph file has an extra chunk to store the parent int-ids for parents beyond the first parent for octopus merges. Our test repo has a single octopus merge that we can manipulate to demonstrate the 'verify' subcommand detects incorrect values in that chunk. Signed-off-by: Derrick Stolee --- t/t5318-commit-graph.sh | 10 ++ 1 file changed, 10 insertions(+) diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 58adb8246d..240aef6add 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -248,6 +248,7 @@ test_expect_success 'git commit-graph verify' ' ' NUM_COMMITS=9 +NUM_OCTOPUS_EDGES=2 HASH_LEN=20 GRAPH_BYTE_VERSION=4 GRAPH_BYTE_HASH=5 @@ -274,6 +275,10 @@ GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4` GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3` GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8` GRAPH_BYTE_COMMIT_DATE=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12` +GRAPH_COMMIT_DATA_WIDTH=`expr $HASH_LEN + 16` +GRAPH_OCTOPUS_DATA_OFFSET=`expr $GRAPH_COMMIT_DATA_OFFSET + \ + $GRAPH_COMMIT_DATA_WIDTH \* $NUM_COMMITS` +GRAPH_BYTE_OCTOPUS=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4` # usage: corrupt_graph_and_verify # Manipulates the commit-graph file at the position @@ -378,4 +383,9 @@ test_expect_success 'detect incorrect commit date' ' "commit date" ' +test_expect_success 'detect incorrect parent for octopus merge' ' + corrupt_graph_and_verify $GRAPH_BYTE_OCTOPUS "\01" \ + "invalid parent" +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v3 17/20] fsck: verify commit-graph
If core.commitGraph is true, verify the contents of the commit-graph during 'git fsck' using the 'git commit-graph verify' subcommand. Run this check on all alternates, as well. We use a new process for two reasons: 1. The subcommand decouples the details of loading and verifying a commit-graph file from the other fsck details. 2. The commit-graph verification requires the commits to be loaded in a specific order to guarantee we parse from the commit-graph file for some objects and from the object database for others. Signed-off-by: Derrick Stolee --- Documentation/git-fsck.txt | 3 +++ builtin/fsck.c | 21 + t/t5318-commit-graph.sh| 8 3 files changed, 32 insertions(+) diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt index b9f060e3b2..ab9a93fb9b 100644 --- a/Documentation/git-fsck.txt +++ b/Documentation/git-fsck.txt @@ -110,6 +110,9 @@ Any corrupt objects you will have to find in backups or other archives (i.e., you can just remove them and do an 'rsync' with some other site in the hopes that somebody else has the object you have corrupted). +If core.commitGraph is true, the commit-graph file will also be inspected +using 'git commit-graph verify'. See linkgit:git-commit-graph[1]. + Extracted Diagnostics - diff --git a/builtin/fsck.c b/builtin/fsck.c index ef78c6c00c..a6d5045b77 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -16,6 +16,7 @@ #include "streaming.h" #include "decorate.h" #include "packfile.h" +#include "run-command.h" #define REACHABLE 0x0001 #define SEEN 0x0002 @@ -45,6 +46,7 @@ static int name_objects; #define ERROR_REACHABLE 02 #define ERROR_PACK 04 #define ERROR_REFS 010 +#define ERROR_COMMIT_GRAPH 020 static const char *describe_object(struct object *obj) { @@ -815,5 +817,24 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) } check_connectivity(); + + if (core_commit_graph) { + struct child_process commit_graph_verify = CHILD_PROCESS_INIT; + const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL, NULL }; + commit_graph_verify.argv = verify_argv; + commit_graph_verify.git_cmd = 1; + + if (run_command(&commit_graph_verify)) + errors_found |= ERROR_COMMIT_GRAPH; + + prepare_alt_odb(); + for (alt = alt_odb_list; alt; alt = alt->next) { + verify_argv[2] = "--object-dir"; + verify_argv[3] = alt->path; + if (run_command(&commit_graph_verify)) + errors_found |= ERROR_COMMIT_GRAPH; + } + } + return errors_found; } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 2680a2ebff..4941937163 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -394,4 +394,12 @@ test_expect_success 'detect invalid checksum hash' ' "incorrect checksum" ' +test_expect_success 'git fsck (checks commit-graph)' ' + cd "$TRASH_DIRECTORY/full" && + git fsck && + corrupt_graph_and_verify $GRAPH_BYTE_FOOTER "\00" \ + "incorrect checksum" && + test_must_fail git fsck +' + test_done -- 2.16.2.329.gfb62395de6
[PATCH v3 18/20] commit-graph: add '--reachable' option
When writing commit-graph files, it can be convenient to ask for all reachable commits (starting at the ref set) in the resulting file. This is particularly helpful when writing to stdin is complicated, such as a future integration with 'git gc' which will call write_commit_graph_reachable() after performing cleanup of the object database. Signed-off-by: Derrick Stolee --- Documentation/git-commit-graph.txt | 8 ++-- builtin/commit-graph.c | 16 commit-graph.c | 32 commit-graph.h | 1 + t/t5318-commit-graph.sh| 10 ++ 5 files changed, 61 insertions(+), 6 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index a222cfab08..dececb79d7 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in packfiles. + With the `--stdin-packs` option, generate the new commit graph by walking objects only in the specified pack-indexes. (Cannot be combined -with --stdin-commits.) +with `--stdin-commits` or `--reachable`.) + With the `--stdin-commits` option, generate the new commit graph by walking commits starting at the commits specified in stdin as a list of OIDs in hex, one OID per line. (Cannot be combined with ---stdin-packs.) +`--stdin-packs` or `--reachable`.) ++ +With the `--reachable` option, generate the new commit graph by walking +commits starting at all refs. (Cannot be combined with `--stdin-commits` +or `--stdin-packs`.) + With the `--append` option, include all commits that are present in the existing commit-graph file. diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 0433dd6e20..20ce6437ae 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -9,7 +9,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), N_("git commit-graph verify [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits]"), NULL }; @@ -24,12 +24,13 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits]"), NULL }; static struct opts_commit_graph { const char *obj_dir; + int reachable; int stdin_packs; int stdin_commits; int append; @@ -130,6 +131,8 @@ static int graph_write(int argc, const char **argv) OPT_STRING(0, "object-dir", &opts.obj_dir, N_("dir"), N_("The object directory to store the graph")), + OPT_BOOL(0, "reachable", &opts.reachable, + N_("start walk at all refs")), OPT_BOOL(0, "stdin-packs", &opts.stdin_packs, N_("scan pack-indexes listed by stdin for commits")), OPT_BOOL(0, "stdin-commits", &opts.stdin_commits, @@ -143,11 +146,16 @@ static int graph_write(int argc, const char **argv) builtin_commit_graph_write_options, builtin_commit_graph_write_usage, 0); - if (opts.stdin_packs && opts.stdin_commits) - die(_("cannot use both --stdin-commits and --stdin-packs")); + if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1) + die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs")); if (!opts.obj_dir) opts.obj_dir = get_object_directory(); + if (opts.reachable) { + write_commit_graph_reachable(opts.obj_dir, opts.append); + return 0; + } + if (opts.stdin_packs || opts.stdin_commits) { struct strbuf buf = STRBUF_INIT; lines_nr = 0; diff --git a/commit-graph.c b/commit-graph.c index a33600c584..057d734926 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -6,6 +6,7 @@ #include "packfile.h" #include "commit.h" #include "object.h" +#include "refs.h" #include "revision.h" #include "sha1-lookup.h" #include "commit-graph.h" @@ -651,6 +652,37 @@ static void compute_generation_numbers(struct packed_commit_list* commits)
[PATCH v3 13/20] commit-graph: verify generation number
While iterating through the commit parents, perform the generation number calculation and compare against the value stored in the commit-graph. The tests demonstrate that having a different set of parents affects the generation number calculation, and this value propagates to descendants. Hence, we drop the single-line condition on the output. Signed-off-by: Derrick Stolee --- commit-graph.c | 18 ++ t/t5318-commit-graph.sh | 6 ++ 2 files changed, 24 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index fff22dc0c3..ead92460c1 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -922,6 +922,7 @@ int verify_commit_graph(struct commit_graph *g) for (i = 0; i < g->num_commits; i++) { struct commit *graph_commit, *odb_commit; struct commit_list *graph_parents, *odb_parents; + uint32_t max_generation = 0; hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); @@ -956,6 +957,9 @@ int verify_commit_graph(struct commit_graph *g) oid_to_hex(&graph_parents->item->object.oid), oid_to_hex(&odb_parents->item->object.oid)); + if (graph_parents->item->generation > max_generation) + max_generation = graph_parents->item->generation; + graph_parents = graph_parents->next; odb_parents = odb_parents->next; } @@ -963,6 +967,20 @@ int verify_commit_graph(struct commit_graph *g) if (odb_parents != NULL) graph_report("commit-graph parent list for commit %s terminates early", oid_to_hex(&cur_oid)); + + /* +* If one of our parents has generation GENERATION_NUMBER_MAX, then +* our generation is also GENERATION_NUMBER_MAX. Decrement to avoid +* extra logic in the following condition. +*/ + if (max_generation == GENERATION_NUMBER_MAX) + max_generation--; + + if (graph_commit->generation != max_generation + 1) + graph_report("commit-graph generation for commit %s is %u != %u", +oid_to_hex(&cur_oid), +graph_commit->generation, +max_generation + 1); } return verify_commit_graph_error; diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 12f0d7f54d..673b0d37d5 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -272,6 +272,7 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN` GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4` GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3` +GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8` # usage: corrupt_graph_and_verify # Manipulates the commit-graph file at the position @@ -366,4 +367,9 @@ test_expect_success 'detect incorrect tree OID' ' "commit-graph parent for" ' +test_expect_success 'detect incorrect generation number' ' + corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \ + "generation" +' + test_done -- 2.16.2.329.gfb62395de6
Re: [PATCH v2 03/12] commit-graph: test that 'verify' finds corruption
On 5/21/2018 2:53 PM, Jakub Narebski wrote: +corrupt_data() { + file=$1 + pos=$2 + data="${3:-\0}" + printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc +} First, if we do this that way (and not by adding a test helper), the use of this function should be, I think, protected using appropriate test prerequisite. Not everyone has 'dd' tool installed, for example on MS Windows. Windows does not, but it is also missing many things this test suite needs. 'dd' is included in the Git for Windows SDK. I rebased this series onto Git for Windows and the tests passed when run in an SDK shell. Thanks, -Stolee
Re: [PATCH v3 01/20] commit-graph: UNLEAK before die()
On 5/24/2018 6:47 PM, Stefan Beller wrote: On Thu, May 24, 2018 at 9:25 AM, Derrick Stolee wrote: Signed-off-by: Derrick Stolee --- builtin/commit-graph.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 37420ae0fd..f0875b8bf3 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -51,8 +51,11 @@ static int graph_read(int argc, const char **argv) graph_name = get_commit_graph_filename(opts.obj_dir); graph = load_commit_graph_one(graph_name); - if (!graph) + if (!graph) { + UNLEAK(graph_name); die("graph file %s does not exist", graph_name); Unrelated to this patch: Is the command that ends up die()ing here a plumbing or porcelain, or: Do we want to translate the message here? In a lot of commands that show paths we single quote them '%s', (speaking from experience with a lot of submodule path code) This is for the 'git commit-graph read' command, which is plumbing (and 'read' is really only for testing). I don't think this message requires translation. I'll keep the quotes in mind for the future. Thanks, -Stolee
Re: [PATCH] t990X: use '.git/objects' as 'deep inside .git' path
On 5/27/2018 12:49 AM, Michael Haggerty wrote: On Sat, May 26, 2018 at 8:47 AM, Christian Couder wrote: Tests t9902-completion.sh and t9903-bash-prompt.sh each have tests that check what happens when we are "in the '.git' directory" and when we are "deep inside the '.git' directory". To test the case when we are "deep inside the '.git' directory" the test scripts used to perform a `cd .git/refs/heads`. As there are plans to implement other ref storage systems, let's use '.git/objects' instead of '.git/refs/heads' as the "deep inside the '.git' directory" path. Seems reasonable to me. +1. Michael Looks good to me, too. Thanks, -Stolee
Re: [PATCH v3 03/20] commit-graph: parse commit from chosen graph
On 5/27/2018 6:23 AM, Jakub Narebski wrote: Derrick Stolee writes: Before verifying a commit-graph file against the object database, we need to parse all commits from the given commit-graph file. Create parse_commit_in_graph_one() to target a given struct commit_graph. If I understand it properly the problem is that when verifying against the object database we want to check one single commit-graph file, not concatenation of data from commit-graph file for the repository and commit-graph files from its alternates -- like prepare_commit_graph() does; which is called by parse_commit_in_graph(). Signed-off-by: Derrick Stolee O.K., so you introduce here a layer of indirection; parse_commit_in_graph() now just uses parse_commit_in_graph_one(), passing core_commit_graph (or the_commit_graph) to it, after checking that core_commit_graph is set (which handles the case when feature is not turned off) and loading commit-graph file. Nice and simple 'split function' refactoring, with new function taking over when there is commit graph file prepared. So, after the changes: * parse_commit_in_graph() is responsible for checking whether to use commit-graph feature and ensuring that data from commit-graph is loaded, where it passes the control to parse_commit_in_graph_one() * parse_commit_in_graph_one() checks whether commit-graph feature is turned on, whether commit we are interested in was already parsed, and then uses fill_commit_in_graph() to actually get the data * fill_commit_in_graph() gets data out of commit-graph file, extracting it from commit data chunk (and if needed large edges chunk). All those functions return 1 if they got data from commit-graph, and 0 if they didn't. One minor nitpick / complaint / question is about naming of global variables used here, namely: * static struct commit_graph *commit_graph from commit-graph.c for global storage of commit-graph[s] data * int core_commit_graph from environment.c for storing core.commitGraph config But I see that at least the latter is common convention in Git source code; I guess that the former maybe follows convention as used for "the index" and "the repository" - additionally it is static / file-local. See also `struct packed_git *packed_git;` from cache.h. --- commit-graph.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 82295f0975..78ba0edc80 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -310,7 +310,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin } } -int parse_commit_in_graph(struct commit *item) +static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item) { uint32_t pos; @@ -318,9 +318,21 @@ static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item) if (!core_commit_graph) return 0; All right, we check that commit-graph feature is enabled because parse_commit_in_graph_one() will be used standalone, not by being invoked from parse_commit_in_graph(). This check is fast. if (item->object.parsed) return 1; Sidenote: I just wonder why object.parsed and not for example object.graph_pos is used to checck whether the object was filled if possible with commit-graph data... Here we are filling all data, including commit date and parents. We use load_commit_graph_info() when we only need the graph_pos and generation values. + + if (find_commit_in_graph(item, g, &pos)) + return fill_commit_in_graph(item, g, pos); + + return 0; +} + +int parse_commit_in_graph(struct commit *item) +{ + if (!core_commit_graph) + return 0; All right, this check is here to short-circuit and make it so git does not even try to lod commit-graph file[s] if the feature is disabled. + prepare_commit_graph(); - if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) - return fill_commit_in_graph(item, commit_graph, pos); + if (commit_graph) + return parse_commit_in_graph_one(commit_graph, item); return 0; }
Re: [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc'
On 5/29/2018 12:27 AM, Junio C Hamano wrote: Derrick Stolee writes: Thanks for all the feedback on v2. I've tried to make this round's review a bit easier by splitting up the commits into smaller pieces. Also, the test script now has less boilerplate and uses variables and clear arithmetic to explain which bytes are being modified. One other change worth mentioning: in "commit-graph: add '--reachable' option" I put the ref-iteration into a new external 'write_commit_graph_reachable()' method inside commit-graph.c. This makes the 'gc: automatically write commit-graph files' a simpler change. I finally managed to find time to resolve conflicts this topic has with other topics (of your own included, if I am not mistaken). Please double check the resolution when I push out the day's integration result later today. Sorry about the confusion. I was operating on my copy of ds/generation-numbers (34fdd43339) but fetching just now I see you updated that branch to 1472978ec6. I reset my local branch to ds/commit-graph-fsk (53dd1e6600). The only diff I see between my v3 branch and that commit are the changes from ds/commit-graph-lockfile-fix (33286dcd6d). Thanks, -Stolee
Re: [PATCH v3 07/20] commit-graph: verify catches corrupt signature
On 5/28/2018 10:05 AM, Jakub Narebski wrote: Derrick Stolee writes: This is the first of several commits that add a test to check that 'git commit-graph verify' catches corruption in the commit-graph file. The first test checks that the command catches an error in the file signature. This is a check that exists in the existing commit-graph reading code. Good start. This handles 3 out of 5 checks in load_commit_graph_one(). The remaining are: * too short file (length smaller than minimal commit-graph size) * more than one chunk of one of 4 defined types Add a helper method 'corrupt_graph_and_verify' to the test script t5318-commit-graph.sh. This helper corrupts the commit-graph file at a certain location, runs 'git commit-graph verify', and reports the output to the 'err' file. This data is filtered to remove the lines added by 'test_must_fail' when the test is run verbosely. Then, the output is checked to contain a specific error message. Thanks for an explanation. Signed-off-by: Derrick Stolee --- t/t5318-commit-graph.sh | 43 +++ 1 file changed, 43 insertions(+) diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 6ca451dfd2..bd64481c7a 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -235,9 +235,52 @@ test_expect_success 'perform fast-forward merge in full repo' ' test_cmp expect output ' +# the verify tests below expect the commit-graph to contain +# exactly the commits reachable from the commits/8 branch. +# If the file changes the set of commits in the list, then the +# offsets into the binary file will result in different edits +# and the tests will likely break. + test_expect_success 'git commit-graph verify' ' cd "$TRASH_DIRECTORY/full" && + git rev-parse commits/8 | git commit-graph write --stdin-commits && git commit-graph verify >output ' I don't quite understand what the change is meant to do. This gives us a constant commit-graph file to work with in the later tests. To get the "independent test" structure you want for the tests that are coming, we need to do one of the following: 1. Write a new commit-graph file for every test (slows things down). 2. Do all corruption/verify checks in a single test (reduces the information from a failed test, as it only reports the first failure). I don't like either of these options, so I went with this "prepare" step. Also, as I said earlier, I'd prefer if tests were as indpendent of each other as possible, to make running individual tests (e.g. only previously falling tests) easier. I especially do not like mixing running actual test with setting up the repository for future tests, as here. +GRAPH_BYTE_VERSION=4 +GRAPH_BYTE_HASH=5 + +# usage: corrupt_graph_and_verify +# Manipulates the commit-graph file at the position +# by inserting the data, then runs 'git commit-graph verify' +# and places the output in the file 'err'. Test 'err' for +# the given string. Very nice to have this description. +corrupt_graph_and_verify() { + pos=$1 + data="${2:-\0}" + grepstr=$3 + cd "$TRASH_DIRECTORY/full" && + test_when_finished mv commit-graph-backup $objdir/info/commit-graph && + cp $objdir/info/commit-graph commit-graph-backup && + printf "$data" | dd of="$objdir/info/commit-graph" bs=1 seek="$pos" conv=notrunc && Using 'printf' with octal is much more portable than relying on 'echo' supporting octal escape sequences (or supporting escape sequences at all). + test_must_fail git commit-graph verify 2>test_err && + grep -v "^+" test_err >err + grep "$grepstr" err Shouldn't this last 'grep' be 'test_i18ngrep' instead, to allow for translated messages from 'git commit-graph verify' / 'git fsck'? +} This function makes actual tests short and simple, without duplicated code. Very good. + +test_expect_success 'detect bad signature' ' + corrupt_graph_and_verify 0 "\0" \ + "graph signature" +' + +test_expect_success 'detect bad version' ' + corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \ + "graph version" +' + +test_expect_success 'detect bad hash version' ' + corrupt_graph_and_verify $GRAPH_BYTE_HASH "\02" \ When we move from SHA-1 (hash version 1) to NewHash (hash version 2), this test would start failing... which is actually not a bad idea. + "hash version" +' + test_done
Re: [PATCH] commit-graph: fix a sparse 'integer as NULL pointer' warning
n 5/29/2018 4:01 PM, Ramsay Jones wrote: Signed-off-by: Ramsay Jones --- Hi Derrick, If you need to re-roll your 'ds/commit-graph-fsck' branch (pu@a84e06bc0f), could you please squash this into the relevant patch (commit 80453b4529, "commit-graph: add 'verify' subcommand", 2018-05-24). [No, No, that was the one in graph_read(). :-D ] Thanks for helping rid me of my bad habits! I'll get this one in v4. Thanks! ATB, Ramsay Jones builtin/commit-graph.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 20ce6437a..6abfde5e6 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -39,7 +39,7 @@ static struct opts_commit_graph { static int graph_verify(int argc, const char **argv) { - struct commit_graph *graph = 0; + struct commit_graph *graph = NULL; char *graph_name; static struct option builtin_commit_graph_verify_options[] = {
Re: [RFC PATCH 00/35] object-store: lookup_commit
On 5/29/2018 8:47 PM, Stefan Beller wrote: This applies on the merge of nd/commit-util-to-slab and sb/object-store-grafts, and is available at http://github.com/stefanbeller/ as branch object-store-lookup-commit as the merge has some merge conflicts as well as syntactical conflicts (upload-pack.c and fetch-pack.c introduce new calls of functions that would want to take a repository struct in the object-store-grafts series) As layed out in https://public-inbox.org/git/20180517225154.9200-1-sbel...@google.com/ this is getting close to finishing the set of object store series though the last unfinished part of this RFC hints at new work on the plate: * To give this series a nice polish, we'd want to convert parse_commit, too. But that requires the conversion of the new commit graph. Maybe we need to split this series into 2. I'll take a look at this series tomorrow. I've been working in ds/commit-graph-fsck to make many of the methods take a 'struct commit_graph *' parameter, which could easily be 'r->commit_graph' for a 'struct the_repository *r' instead of the global 'commit_graph'. Those graph-local methods will make the transformation to be repo-local a lot easier. (There still may be some need to insert a repository, though.) Since you are working on the commit-slab stuff in this patch, I'll (continue to) delay my patch series on using the commit-slab for generation numbers to avoid collisions with your work. Thanks, -Stolee
Re: [PATCH v3 06/20] commit-graph: add 'verify' subcommand
On 5/27/2018 6:55 PM, Jakub Narebski wrote: Derrick Stolee writes: If the commit-graph file becomes corrupt, we need a way to verify that its contents match the object database. In the manner of 'git fsck' we will implement a 'git commit-graph verify' subcommand to report all issues with the file. Add the 'verify' subcommand to the 'commit-graph' builtin and its documentation. The subcommand is currently a no-op except for loading the commit-graph into memory, which may trigger run-time errors that would be caught by normal use. So this commit is simply getting the boilerplate out of the way for implementing 'git commit-graph verify' subcommand. Good. Add a simple test that ensures the command returns a zero error code. Nice. If no commit-graph file exists, this is an acceptable state. Do not report any errors. All right. I assume that as it is explicit verification call, it does ignore core.commitGraph setting, isn't it? Signed-off-by: Derrick Stolee --- Documentation/git-commit-graph.txt | 6 ++ builtin/commit-graph.c | 38 ++ commit-graph.c | 26 ++ commit-graph.h | 2 ++ t/t5318-commit-graph.sh| 10 ++ 5 files changed, 82 insertions(+) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 4c97b555cc..a222cfab08 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -10,6 +10,7 @@ SYNOPSIS [verse] 'git commit-graph read' [--object-dir ] +'git commit-graph verify' [--object-dir ] 'git commit-graph write' [--object-dir ] In alphabetical order, good. @@ -52,6 +53,11 @@ existing commit-graph file. Read a graph file given by the commit-graph file and output basic details about the graph file. Used for debugging purposes. +'verify':: + +Read the commit-graph file and verify its contents against the object +database. Used to check for corrupted data. + All right, good enough description. EXAMPLES diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index f0875b8bf3..0433dd6e20 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,10 +8,16 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), + N_("git commit-graph verify [--object-dir ]"), N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), NULL }; In alphabetical order, same as in the manpage for git-commit-graph. +static const char * const builtin_commit_graph_verify_usage[] = { + N_("git commit-graph verify [--object-dir ]"), + NULL +}; + static const char * const builtin_commit_graph_read_usage[] = { N_("git commit-graph read [--object-dir ]"), NULL @@ -29,6 +35,36 @@ static struct opts_commit_graph { int append; } opts; + +static int graph_verify(int argc, const char **argv) +{ + struct commit_graph *graph = 0; + char *graph_name; + + static struct option builtin_commit_graph_verify_options[] = { + OPT_STRING(0, "object-dir", &opts.obj_dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_verify_options, +builtin_commit_graph_verify_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); Getting the boilerplate of implementing the command mostly out of the way. Good. + + graph_name = get_commit_graph_filename(opts.obj_dir); + graph = load_commit_graph_one(graph_name); So we are verifying only the commit-graph file belonging directly to current repository, as I have expected. This is needed to for warnings and error messages from the 'verify' action, to be able to tell in which file there are problems. This means that it is possible that there would be problems with commit-graph files that running 'git commit-graph verify' would not find, because they are in commit-graph file in one of the alternates. It is very easy, though, to check all commit-graph files that would be read and its data concatenated when using commit-graph feature (e.g. 'git commit-graph read', IIRC): $ git commit-graph verify $ for obj_dir in $(cat .git/objects/info/alternates) do; git commit-graph --object-dir="$obj_dir"; done Note: I have not checked
Re: [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup
On 5/30/2018 9:34 AM, Jakub Narebski wrote: Derrick Stolee writes: In the commit-graph file, the OID fanout chunk provides an index into the OID lookup. The 'verify' subcommand should find incorrect values in the fanout. Similarly, the 'verify' subcommand should find out-of-order values in the OID lookup. O.K., the OID Lookup chunk is checked together with OID Fanout chunk because those two chunks are tightly connected: OID Fanout is fanout into OID Lookup. Signed-off-by: Derrick Stolee --- commit-graph.c | 36 t/t5318-commit-graph.sh | 22 ++ 2 files changed, 58 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 06e3e4f9ba..cbd1aae514 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -855,6 +855,9 @@ static void graph_report(const char *fmt, ...) int verify_commit_graph(struct commit_graph *g) { + uint32_t i, cur_fanout_pos = 0; + struct object_id prev_oid, cur_oid; Minor nitpick about the naming convention: why cur_*, and not curr_*? + if (!g) { graph_report("no commit-graph file loaded"); return 1; @@ -869,5 +872,38 @@ int verify_commit_graph(struct commit_graph *g) if (!g->chunk_commit_data) graph_report("commit-graph is missing the Commit Data chunk"); + if (verify_commit_graph_error) + return verify_commit_graph_error; + Before checking that commits in OID Lookup are sorted, and that OID Fanout correctly points into OID Lookup (thus also checking that OID Lookup is sorted), it would be good to verify that sizes of those chunks are correct. Out of 4 standrd chunks, 1 of them (OIDF) has constant size, and 2 of them have size given by number of commits and hash size - OID Fanout is (256 * 4 bytes) - OID Lookup is (N * H bytes), where N is number of commits, and H is hash size The one that is more significant is if number of commits gets corrupted upwards, and walking through OID Lookup would lead us out of bounds, outside the file size. IIRC we have checked that all chunks fit within file size, isn't it? + for (i = 0; i < g->num_commits; i++) { + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); Why do you use hashcpy() here, but oidcpy() below? We are copying from a section of binary data, not from a struct object_id *. + + if (i && oidcmp(&prev_oid, &cur_oid) >= 0) All right, OIDs needs to be sorted in ascending lexicographical order, and the above condition checks that this constraint is fullfilled. + graph_report("commit-graph has incorrect OID order: %s then %s", +oid_to_hex(&prev_oid), +oid_to_hex(&cur_oid)); Nice informative error message. + + oidcpy(&prev_oid, &cur_oid); + + while (cur_oid.hash[0] > cur_fanout_pos) { + uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos); + if (i != fanout_value) + graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u", +cur_fanout_pos, fanout_value, i); + + cur_fanout_pos++; + } The k-th entry of fanout, F[k], stores the number of OIDs with first byte at most k. Let's examine it in detail on a simple example: fanout lookup -- -- 0 : 0 ---> 0: 1 1 : 2 -\ 1: 1 2 : 2 --\> 2: 3 3 : 3 ---> 3: 4 We are checking that after most significant byte (first byte) changes, then all fanout position up to the current first byte value (exclusive) point to current position in OID Lookup chunk. Seems all right; it would be nice to come up with rigorous proof that it is all right. + } + + while (cur_fanout_pos < 256) { + uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos); + + if (g->num_commits != fanout_value) + graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u", +cur_fanout_pos, fanout_value, i); + + cur_fanout_pos++; + } All right, this checks that all remaining fanout entries, up and including the 255-ith entry store the total number of commits, N. + return verify_commit_graph_error; } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 4ef3fe3dc2..c050ef980b 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -247,6 +247,7 @@ test_expect_success 'git comm
Re: [PATCH 03/35] object: add repository argument to lookup_unknown_object
On 5/29/2018 8:47 PM, Stefan Beller wrote: From: Jonathan Nieder Add a repository argument to allow callers of lookup_unknown_object to be more specific about which repository to handle. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repositor other than the_repository at compile time. The included coccinelle semantic patch will adapt any new callers in the diff produced by `make coccicheck`. I don't see any coccinelle change in this patch. Perhaps it is in a later commit? Thanks, -Stolee Signed-off-by: Jonathan Nieder Signed-off-by: Stefan Beller --- builtin/fsck.c | 2 +- builtin/pack-objects.c | 2 +- http-push.c | 2 +- object.c | 2 +- object.h | 3 ++- refs.c | 2 +- t/helper/test-example-decorate.c | 6 +++--- upload-pack.c| 2 +- walker.c | 2 +- 9 files changed, 12 insertions(+), 11 deletions(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 98fdeef5407..700739804fc 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -638,7 +638,7 @@ static int fsck_cache_tree(struct cache_tree *it) static void mark_object_for_connectivity(const struct object_id *oid) { - struct object *obj = lookup_unknown_object(oid->hash); + struct object *obj = lookup_unknown_object(the_repository, oid->hash); obj->flags |= HAS_OBJ; } diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 8c108109985..6eae39cf858 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2689,7 +2689,7 @@ static void add_objects_in_unpacked_packs(struct rev_info *revs) for (i = 0; i < p->num_objects; i++) { nth_packed_object_oid(&oid, p, i); - o = lookup_unknown_object(oid.hash); + o = lookup_unknown_object(the_repository, oid.hash); if (!(o->flags & OBJECT_ADDED)) mark_in_pack_object(o, p, &in_pack); o->flags |= OBJECT_ADDED; diff --git a/http-push.c b/http-push.c index 2615c823d60..04d95bd5da8 100644 --- a/http-push.c +++ b/http-push.c @@ -1429,7 +1429,7 @@ static void one_remote_ref(const char *refname) * may be required for updating server info later. */ if (repo->can_update_info_refs && !has_object_file(&ref->old_oid)) { - obj = lookup_unknown_object(ref->old_oid.hash); + obj = lookup_unknown_object(the_repository, ref->old_oid.hash); fprintf(stderr, " fetch %s for %s\n", oid_to_hex(&ref->old_oid), refname); add_fetch_request(obj); diff --git a/object.c b/object.c index 4de4fa58d59..def3c71cac2 100644 --- a/object.c +++ b/object.c @@ -177,7 +177,7 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet) } } -struct object *lookup_unknown_object(const unsigned char *sha1) +struct object *lookup_unknown_object_the_repository(const unsigned char *sha1) { struct object *obj = lookup_object(the_repository, sha1); if (!obj) diff --git a/object.h b/object.h index fa41d711f44..778f83bf0f7 100644 --- a/object.h +++ b/object.h @@ -139,7 +139,8 @@ struct object *parse_object_or_die(const struct object_id *oid, const char *name struct object *parse_object_buffer(const struct object_id *oid, enum object_type type, unsigned long size, void *buffer, int *eaten_p); /** Returns the object, with potentially excess memory allocated. **/ -struct object *lookup_unknown_object(const unsigned char *sha1); +#define lookup_unknown_object(r, s) lookup_unknown_object_##r(s) +struct object *lookup_unknown_object_the_repository(const unsigned char *sha1); struct object_list *object_list_insert(struct object *item, struct object_list **list_p); diff --git a/refs.c b/refs.c index 23d53957deb..3b9e8463656 100644 --- a/refs.c +++ b/refs.c @@ -301,7 +301,7 @@ static int filter_refs(const char *refname, const struct object_id *oid, enum peel_status peel_object(const struct object_id *name, struct object_id *oid) { - struct object *o = lookup_unknown_object(name->hash); + struct object *o = lookup_unknown_object(the_repository, name->hash); if (o->type == OBJ_NONE) { int type = oid_object_info(the_repository, name, NULL); diff --git a/t/helper/test-example-decorate.c b/t/helper/test-example-decorate.c index 081115bf8eb..33e727f7fc5 100644 --- a/t/helper/test-example-decorate.c +++ b/t/helper/test-example-decorate.c @@ -26,8 +26,8 @@ int cmd__example_decorate(int argc, const char **argv) * Add 2 objects, one with a non-NULL decoration and one with a NULL
Re: [PATCH 04/35] object: add repository argument to parse_object_buffer
On 5/29/2018 8:47 PM, Stefan Beller wrote: Add a repository argument to allow the callers of parse_object_buffer to be more specific about which repository to act on. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Add the cocci patch that converted the callers. Again, I'm missing the cocci patch here. Signed-off-by: Jonathan Nieder Signed-off-by: Stefan Beller --- builtin/fast-export.c| 3 ++- builtin/fsck.c | 6 -- builtin/index-pack.c | 3 ++- builtin/unpack-objects.c | 3 ++- object.c | 5 +++-- object.h | 3 ++- ref-filter.c | 3 ++- 7 files changed, 17 insertions(+), 9 deletions(-) diff --git a/builtin/fast-export.c b/builtin/fast-export.c index 24d42842f9d..a34ab9768f4 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -245,7 +245,8 @@ static void export_blob(const struct object_id *oid) die ("Could not read blob %s", oid_to_hex(oid)); if (check_object_signature(oid, buf, size, type_name(type)) < 0) die("sha1 mismatch in blob %s", oid_to_hex(oid)); - object = parse_object_buffer(oid, type, size, buf, &eaten); + object = parse_object_buffer(the_repository, oid, type, +size, buf, &eaten); } if (!object) diff --git a/builtin/fsck.c b/builtin/fsck.c index 700739804fc..e6d6eb266eb 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -392,7 +392,8 @@ static int fsck_obj_buffer(const struct object_id *oid, enum object_type type, * verify_packfile(), data_valid variable for details. */ struct object *obj; - obj = parse_object_buffer(oid, type, size, buffer, eaten); + obj = parse_object_buffer(the_repository, oid, type, size, buffer, + eaten); if (!obj) { errors_found |= ERROR_OBJECT; return error("%s: object corrupt or missing", oid_to_hex(oid)); @@ -522,7 +523,8 @@ static struct object *parse_loose_object(const struct object_id *oid, if (!contents && type != OBJ_BLOB) die("BUG: read_loose_object streamed a non-blob"); - obj = parse_object_buffer(oid, type, size, contents, &eaten); + obj = parse_object_buffer(the_repository, oid, type, size, + contents, &eaten); if (!eaten) free(contents); diff --git a/builtin/index-pack.c b/builtin/index-pack.c index e2f670bef9e..0dd10693597 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -848,7 +848,8 @@ static void sha1_object(const void *data, struct object_entry *obj_entry, * we do not need to free the memory here, as the * buf is deleted by the caller. */ - obj = parse_object_buffer(oid, type, size, buf, + obj = parse_object_buffer(the_repository, oid, type, + size, buf, &eaten); if (!obj) die(_("invalid %s"), type_name(type)); diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c index 9a4d2708123..8e454c48649 100644 --- a/builtin/unpack-objects.c +++ b/builtin/unpack-objects.c @@ -265,7 +265,8 @@ static void write_object(unsigned nr, enum object_type type, int eaten; hash_object_file(buf, size, type_name(type), &obj_list[nr].oid); added_object(nr, type, buf, size); - obj = parse_object_buffer(&obj_list[nr].oid, type, size, buf, + obj = parse_object_buffer(the_repository, &obj_list[nr].oid, + type, size, buf, &eaten); if (!obj) die("invalid %s", type_name(type)); diff --git a/object.c b/object.c index def3c71cac2..4250ddd3f7f 100644 --- a/object.c +++ b/object.c @@ -186,7 +186,7 @@ struct object *lookup_unknown_object_the_repository(const unsigned char *sha1) return obj; } -struct object *parse_object_buffer(const struct object_id *oid, enum object_type type, unsigned long size, void *buffer, int *eaten_p) +struct object *parse_object_buffer_the_repository(const struct object_id *oid, enum object_type type, unsigned long size, void *buffer, int *eaten_p) { struct object *obj; *eaten_p = 0; @@ -278,7 +278,8 @@ struct object *parse_object_the_repository(const struct object_id *oid) return NULL; } - obj = parse_object_buffer(oid, type, size, buffer, &eaten); +
Re: [PATCH 06/35] blob: add repository argument to lookup_blob
On 5/29/2018 8:47 PM, Stefan Beller wrote: Add a repository argument to allow the callers of lookup_blob to be more specific about which repository to act on. This is a small mechanical change; it doesn't change the implementation to handle repositories other than the_repository yet. As with the previous commits, use a macro to catch callers passing a repository other than the_repository at compile time. Add the cocci patch that converted the callers. (this is the last time I'll point this out. I'll wait until I read to the end to see if the coccinelle patch is included in the series.) Signed-off-by: Jonathan Nieder Signed-off-by: Stefan Beller --- blob.c | 2 +- blob.h | 3 ++- builtin/fast-export.c| 2 +- builtin/fsck.c | 3 ++- builtin/index-pack.c | 2 +- builtin/merge-tree.c | 3 ++- builtin/unpack-objects.c | 2 +- fsck.c | 2 +- http-push.c | 3 ++- list-objects.c | 2 +- object.c | 4 ++-- reachable.c | 2 +- revision.c | 4 ++-- tag.c| 2 +- walker.c | 3 ++- 15 files changed, 22 insertions(+), 17 deletions(-) diff --git a/blob.c b/blob.c index dada295698c..17b9314f0a0 100644 --- a/blob.c +++ b/blob.c @@ -5,7 +5,7 @@ const char *blob_type = "blob"; -struct blob *lookup_blob(const struct object_id *oid) +struct blob *lookup_blob_the_repository(const struct object_id *oid) { struct object *obj = lookup_object(the_repository, oid->hash); if (!obj) diff --git a/blob.h b/blob.h index 44606168310..08bc34487a0 100644 --- a/blob.h +++ b/blob.h @@ -9,7 +9,8 @@ struct blob { struct object object; }; -struct blob *lookup_blob(const struct object_id *oid); +#define lookup_blob(r, o) lookup_blob_##r(o) +struct blob *lookup_blob_the_repository(const struct object_id *oid); int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size); diff --git a/builtin/fast-export.c b/builtin/fast-export.c index a34ab9768f4..23ca46e6008 100644 --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -237,7 +237,7 @@ static void export_blob(const struct object_id *oid) if (anonymize) { buf = anonymize_blob(&size); - object = (struct object *)lookup_blob(oid); + object = (struct object *)lookup_blob(the_repository, oid); eaten = 0; } else { buf = read_object_file(oid, &type, &size); diff --git a/builtin/fsck.c b/builtin/fsck.c index 82a5acaf76e..6e2b204e938 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -808,7 +808,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) mode = active_cache[i]->ce_mode; if (S_ISGITLINK(mode)) continue; - blob = lookup_blob(&active_cache[i]->oid); + blob = lookup_blob(the_repository, + &active_cache[i]->oid); if (!blob) continue; obj = &blob->object; diff --git a/builtin/index-pack.c b/builtin/index-pack.c index 0dd10693597..51c5244a15a 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -832,7 +832,7 @@ static void sha1_object(const void *data, struct object_entry *obj_entry, if (strict || do_fsck_object) { read_lock(); if (type == OBJ_BLOB) { - struct blob *blob = lookup_blob(oid); + struct blob *blob = lookup_blob(the_repository, oid); if (blob) blob->object.flags |= FLAG_CHECKED; else diff --git a/builtin/merge-tree.c b/builtin/merge-tree.c index 8a8d5797520..f8023bae1e2 100644 --- a/builtin/merge-tree.c +++ b/builtin/merge-tree.c @@ -2,6 +2,7 @@ #include "tree-walk.h" #include "xdiff-interface.h" #include "object-store.h" +#include "repository.h" #include "blob.h" #include "exec-cmd.h" #include "merge-blobs.h" @@ -170,7 +171,7 @@ static struct merge_list *create_entry(unsigned stage, unsigned mode, const stru res->stage = stage; res->path = path; res->mode = mode; - res->blob = lookup_blob(oid); + res->blob = lookup_blob(the_repository, oid); return res; } diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c index 8e454c48649..dfea7790fde 100644 --- a/builtin/unpack-objects.c +++ b/builtin/unpack-objects.c @@ -254,7 +254,7 @@ static void write_object(unsigned nr, enum object_type type, added_object(nr, type, buf, size); free(buf); - blob = lookup_blob(&obj_list[nr].oid); + blob = lookup_blob(the_repository, &obj_list[nr].oid); if (blob) blob->
Re: [PATCH 27/35] commit-slabs: remove realloc counter outside of slab struct
On 5/29/2018 8:48 PM, Stefan Beller wrote: The realloc counter is declared outside the struct for the given slabname, which makes it harder for a follow up patch to move the declaration of the struct around as then the counter variable would need special treatment. As the reallocation counter is currently unused we can just remove it. If we ever need to count the reallocations again, we can reintroduce the counter as part of 'struct slabname' in commit-slab-decl.h. It's worth noting that this patch is different from the other mechanical patches. I do agree that we should remove this unused portion of the slab API. Likely this was built for testing purposes. CC'ing Junio, who I found introduced this in a84b794ad "commit-slab: introduce a macro to define a slab for new type". Looking at that diff, the realloc portion did not appear to exist in the previous in-degree-specific version of slabs. Signed-off-by: Stefan Beller --- commit-slab-impl.h | 3 --- 1 file changed, 3 deletions(-) diff --git a/commit-slab-impl.h b/commit-slab-impl.h index 87a9cadfcca..ac1e6d409ad 100644 --- a/commit-slab-impl.h +++ b/commit-slab-impl.h @@ -11,8 +11,6 @@ #define implement_commit_slab(slabname, elemtype, scope) \ \ -static int stat_ ##slabname## realloc; \ - \ scope void init_ ##slabname## _with_stride(struct slabname *s, \ unsigned stride) \ { \ @@ -54,7 +52,6 @@ scope elemtype *slabname## _at_peek(struct slabname *s, \ if (!add_if_missing)\ return NULL;\ REALLOC_ARRAY(s->slab, nth_slab + 1);\ - stat_ ##slabname## realloc++; \ for (i = s->slab_count; i <= nth_slab; i++) \ s->slab[i] = NULL; \ s->slab_count = nth_slab + 1;\
Re: [RFC PATCH 00/35] object-store: lookup_commit
On 5/29/2018 11:18 PM, Stefan Beller wrote: On Tue, May 29, 2018 at 6:05 PM, Derrick Stolee wrote: On 5/29/2018 8:47 PM, Stefan Beller wrote: This applies on the merge of nd/commit-util-to-slab and sb/object-store-grafts, and is available at http://github.com/stefanbeller/ as branch object-store-lookup-commit as the merge has some merge conflicts as well as syntactical conflicts (upload-pack.c and fetch-pack.c introduce new calls of functions that would want to take a repository struct in the object-store-grafts series) As layed out in https://public-inbox.org/git/20180517225154.9200-1-sbel...@google.com/ this is getting close to finishing the set of object store series though the last unfinished part of this RFC hints at new work on the plate: * To give this series a nice polish, we'd want to convert parse_commit, too. But that requires the conversion of the new commit graph. Maybe we need to split this series into 2. I'll take a look at this series tomorrow. I've been working in ds/commit-graph-fsck to make many of the methods take a 'struct commit_graph *' parameter, which could easily be 'r->commit_graph' for a 'struct the_repository *r' instead of the global 'commit_graph'. Those graph-local methods will make the transformation to be repo-local a lot easier. (There still may be some need to insert a repository, though.) Since you are working on the commit-slab stuff in this patch, I'll (continue to) delay my patch series on using the commit-slab for generation numbers to avoid collisions with your work. This series touches the commit slabs in the last few commits only. I don't think there will be huge conflicts, as all I had to do was move the definition from a global to inside the parsed object store. I went through the whole series and didn't find any issue with the code. The commit messages reference coccinelle scripts that I never saw, so those should be struck from the messages. If anyone else plans to review, these two commits are worth special attention compared to the others: [PATCH 27/35] commit-slabs: remove realloc counter outside of slab struct [PATCH 28/35] commit.c: migrate the commit buffer to the parsed object store I found no issue with them, but they do actually change behavior slightly as opposed to the other commits which have no change other than to relax the dependence on the_repository as a global. I also like the pattern of storing the slabs inside the parsed object structure in the_repository. I'll use a similar pattern when looking into using slabs for generation number. For Junio: I know this will conflict with ds/commit-graph-fsck (specifically, I add new references to lookup_commit() and manipulate parse_commit_gently()). If this series is ready to be queued quickly, I can rebase my v4 onto these commits as they are placed into 'pu' (and please eject ds/commit-graph-fsck at that time). Thanks, -Stolee
Re: [PATCH v3 10/20] commit-graph: verify objects exist
On 5/30/2018 3:22 PM, Jakub Narebski wrote: Derrick Stolee writes: In the 'verify' subcommand, load commits directly from the object database to ensure they exist. Parse by skipping the commit-graph. All right, before we check that the commit data matches, we need to check that all the commits in cache (in the serialized commit graph) are present in real data (in the object database of the repository). What's nice of this series is that the operation that actually removes unreachable commits from the object database, that is `git gc`, would also update commit-gaph file. Signed-off-by: Derrick Stolee --- commit-graph.c | 20 t/t5318-commit-graph.sh | 7 +++ 2 files changed, 27 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index cbd1aae514..0420ebcd87 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -238,6 +238,10 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, { struct commit *c; struct object_id oid; + + if (pos >= g->num_commits) + die("invalid parent position %"PRIu64, pos); + This change is not described in the commit message. This change should go in "commit-graph: verify parent list" which adds a test that fails without it. Thanks. hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos); c = lookup_commit(&oid); if (!c) @@ -905,5 +909,21 @@ int verify_commit_graph(struct commit_graph *g) cur_fanout_pos++; } + if (verify_commit_graph_error) + return verify_commit_graph_error; All right, so we by default short-circuit so that errors found earlier wouldn't cause crash when checking other things. Is it needed, though, in this case? Though I guess it is better to be conservative; lso by terminating after a batch of one type of errors we don't get many different error messages from the same error (i.e. error propagation). + + for (i = 0; i < g->num_commits; i++) { + struct commit *odb_commit; + + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node()); Do we really need to keep all those commits from the object database in memory (in the object::obj_hash hash table)? Perhaps using something like Flywheel / Recycler pattern would be a better solution (if possible)? Though perhaps this doesn't matter much with respect to memory use. + if (parse_commit_internal(odb_commit, 0, 0)) { Just a reminder to myself: the signature is int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph) Hmmm... I wonder if with two boolean paramaters wouldn't it be better to use flags parameter, i.e. int parse_commit_internal(struct commit *item, int flags) ... parse_commit_internal(commit, QUIET_ON_MISSING | USE_COMMIT_GRAPH) But I guess that it is not worth it (especially for internal-ish function). + graph_report("failed to parse %s from object database", +oid_to_hex(&cur_oid)); Wouldn't parse_commit_internal() show it's own error message, in addition to the one above? + continue; + } + } + return verify_commit_graph_error; } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index c050ef980b..996a016239 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' ' git commit-graph verify >output ' +NUM_COMMITS=9 HASH_LEN=20 GRAPH_BYTE_VERSION=4 GRAPH_BYTE_HASH=5 @@ -265,6 +266,7 @@ GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4` GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255` GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256` GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8` +GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10` All right, so we modify 10-the byte of OID of 5-th commit out of 9, am I correct here? # usage: corrupt_graph_and_verify # Manipulates the commit-graph file at the position @@ -334,4 +336,9 @@ test_expect_success 'detect incorrect OID order' ' "incorrect OID order" ' +test_expect_success 'detect OID not in object database' ' + corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_MISSING "\01" \ + "from object database" +' I guess that if we ensure that OIDs are constant, you have chosen the change to actually corrupt the OID in OID Lookup chunk to point to OID that is not in the object database (while still not violating the constraint that OID in OID Lookup chunk needs to be sorted). + test_done All right (well, except for `expr ... ` --> $(( ... )) change).
Re: [PATCH v3 11/20] commit-graph: verify root tree OIDs
On 5/30/2018 6:24 PM, Jakub Narebski wrote: Derrick Stolee writes: The 'verify' subcommand must compare the commit content parsed from the commit-graph and compare it against the content in the object database. You have "compare" twice in the above sentence. Use lookup_commit() and parse_commit_in_graph_one() to parse the commits from the graph and compare against a commit that is loaded separately and parsed directly from the object database. All right, that looks like a nice extension of what was done in previous patch. We want to check that cache (serialized commit graph) matches reality (object database). Add checks for the root tree OID. All right; isn't it that now we check almost all information from commit-graph that hs match in object database (with exception of commit parents, I think). Signed-off-by: Derrick Stolee --- commit-graph.c | 17 - t/t5318-commit-graph.sh | 7 +++ 2 files changed, 23 insertions(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 0420ebcd87..19ea369fc6 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -880,6 +880,8 @@ int verify_commit_graph(struct commit_graph *g) return verify_commit_graph_error; NOTE: we will be checking Commit Data chunk; I think it would be good idea to verify that size of Commit Data chunk matches (N * (H + 16) bytes) that format gives us, so that we don't accidentally red outside of memory if something got screwed up (like number of commits being wrong, or file truncated). This is actually how we calculate 'num_commits' during load_commit_graph_one(): if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP) { graph->num_commits = (chunk_offset - last_chunk_offset) / graph->hash_len; } So, if the chunk doesn't match N*(H+16), we detect this because FANOUT[255] != N. (There is one caveat here: (chunk_offset - last_chunk_offset) may not be a multiple of hash_len, and "accidentally" truncate to N in the division. I'll add more careful checks for this.) We also check out-of-bounds offsets in that method. for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit; + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); if (i && oidcmp(&prev_oid, &cur_oid) >= 0) @@ -897,6 +899,11 @@ int verify_commit_graph(struct commit_graph *g) cur_fanout_pos++; } + + graph_commit = lookup_commit(&cur_oid); So now I see why we add all commits to memory (to hash structure). + if (!parse_commit_in_graph_one(g, graph_commit)) + graph_report("failed to parse %s from commit-graph", +oid_to_hex(&cur_oid)); All right, this verifies that commit in OID Lookup chunk has parse-able data in Commit Data chunk, isn't it? } while (cur_fanout_pos < 256) { @@ -913,16 +920,24 @@ int verify_commit_graph(struct commit_graph *g) return verify_commit_graph_error; for (i = 0; i < g->num_commits; i++) { - struct commit *odb_commit; + struct commit *graph_commit, *odb_commit; hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + graph_commit = lookup_commit(&cur_oid); odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node()); All right, so we have commit data from graph, and commit data from the object database. if (parse_commit_internal(odb_commit, 0, 0)) { graph_report("failed to parse %s from object database", oid_to_hex(&cur_oid)); continue; } + + if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid, + get_commit_tree_oid(odb_commit))) + graph_report("root tree OID for commit %s in commit-graph is %s != %s", +oid_to_hex(&cur_oid), + oid_to_hex(get_commit_tree_oid(graph_commit)), + oid_to_hex(get_commit_tree_oid(odb_commit))); Maybe explicitly say that it doesn't match the value from the object database; OTOH this may be too verbose. } return verify_commit_graph_error; diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 996a016239..21cc8e82f3 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -267,6 +267,8 @@ GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255` GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256` GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN
[RFC PATCH 0/6] Fix commit-graph/graft/replace/shallow combo
The commit-graph file stores a condensed version of the commit history. This helps speed up several operations involving commit walks. This feature does not work well if those commits "change" using features like commit grafts, replace objects, or shallow clones. Since the commit-graph feature is optional, hidden behind the 'core.commitGraph' config option, and requires manual maintenance (until ds/commit-graph-fsck' is merged), these issues only arise for expert users who decided to opt-in. This RFC is a first attempt at rectify the issues that come up when these features interact. I have not succeeded entirely, as I will discuss below. The first two "DO NOT MERGE" commits are not intended to be part of the main product, but are ways to help the full test suite run while computing and consuming commit-graph files. This is acheived by calling write_commit_graph_reachable() from `git fetch` and `git commit`. I believe this is the only dependence on ds/commit-graph-fsck. The other commits should only depend on ds/commit-graph-lockfile-fix. Running the full test suite after these DO NO MERGE commits results in the following test failures, which I categorize by type of failure. The following tests are red herrings. Most work by deleting a commit from the object database and trying to demonstrate a failure. Others rely on how certain objects are loaded. These are not bugs, but will add noise if you run the tests suite with this patch. t0410-partial-clone.sh Failed tests: 5, 8 t5307-pack-missing-commit.shFailed tests: 3-4 t6011-rev-list-with-bad-commit.sh Failed test: 4 t6024-recursive-merge.shFailed test: 4 The following tests are fixed in "commit-graph: enable replace-object and grafts". t6001-rev-list-graft.sh Failed tests: 3, 5, 7, 9, 11, 13 t6050-replace.shFailed tests: 11-15, 23-24, 29 The following tests involve shallow clones. t5500-fetch-pack.sh Failed tests: 30-31, 34, 348-349 t5537-fetch-shallow.sh Failed tests: 4-7, 9 t5802-connect-helper.sh Failed test: 3 These tests are mostly fixed by three commits: * commit-graph: avoid writing when repo is shallow * fetch: destroy commit graph on shallow parameters * commit-graph: revert to odb on missing parents Each of these cases cover a different interaction that occurs with shallow clones. Some are due to a commit becoming shallow, while others are due to a commit becoming unshallow (and hence invalidating its generation number). After these changes, there is one test case that still fails, and I cannot understand why: t5500-fetch-pack.sh Failed test: 348 This test fails when performing a `git fetch --shallow-exclude`. When I halt the test using `t5500-fetch-pack.sh -d -i` and navigate to the directory to replay the fetch it performs as expected. After struggling with it for a while, I figured I should just put this series up for discussion. Maybe someone with more experience in shallow clones can point out the obvious issues I'm having. Thanks, -Stolee Derrick Stolee (6): DO NOT MERGE: compute commit-graph on every commit DO NOT MERGE: write commit-graph on every fetch commit-graph: enable replace-object and grafts commit-graph: avoid writing when repo is shallow fetch: destroy commit graph on shallow parameters commit-graph: revert to odb on missing parents builtin/commit.c | 5 + builtin/fetch.c | 10 + builtin/gc.c | 2 +- builtin/replace.c | 3 +++ commit-graph.c| 65 +++ commit-graph.h| 9 commit.c | 5 + environment.c | 2 +- 8 files changed, 95 insertions(+), 6 deletions(-) -- 2.16.2.338.gcfe06ae955
[RFC PATCH 5/6] fetch: destroy commit graph on shallow parameters
The commit-graph feature is not compatible with history-rewriting features such as shallow clones. When running a 'git fetch' with any of the shallow/unshallow options, destroy the commit-graph file and override core.commitGraph to be false. Signed-off-by: Derrick Stolee --- builtin/fetch.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/builtin/fetch.c b/builtin/fetch.c index af896e4b74..2001dca881 100644 --- a/builtin/fetch.c +++ b/builtin/fetch.c @@ -1452,6 +1452,12 @@ int cmd_fetch(int argc, const char **argv, const char *prefix) } } + if (update_shallow || depth || deepen_since || deepen_not.nr || + deepen_relative || unshallow || update_shallow || is_repository_shallow()) { + destroy_commit_graph(get_object_directory()); + core_commit_graph = 0; + } + if (remote) { if (filter_options.choice || repository_format_partial_clone) fetch_one_setup_partial(remote); -- 2.16.2.338.gcfe06ae955
[RFC PATCH 2/6] DO NOT MERGE: write commit-graph on every fetch
THIS IS ONLY FOR TESTING TO INCREASE TEST COVERAGE Signed-off-by: Derrick Stolee --- builtin/fetch.c | 4 1 file changed, 4 insertions(+) diff --git a/builtin/fetch.c b/builtin/fetch.c index 8ee998ea2e..af896e4b74 100644 --- a/builtin/fetch.c +++ b/builtin/fetch.c @@ -20,6 +20,7 @@ #include "utf8.h" #include "packfile.h" #include "list-objects-filter-options.h" +#include "commit-graph.h" static const char * const builtin_fetch_usage[] = { N_("git fetch [] [ [...]]"), @@ -1480,6 +1481,9 @@ int cmd_fetch(int argc, const char **argv, const char *prefix) close_all_packs(); + if (core_commit_graph) + write_commit_graph_reachable(get_object_directory(), 1); + argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL); if (verbosity < 0) argv_array_push(&argv_gc_auto, "--quiet"); -- 2.16.2.338.gcfe06ae955
[RFC PATCH 1/6] DO NOT MERGE: compute commit-graph on every commit
Also enable core.commitGraph and gc.commitGraph. Helps to test the commit-graph feature with the rest of the test infrastructure. Signed-off-by: Derrick Stolee --- builtin/commit.c | 5 + builtin/gc.c | 2 +- environment.c| 2 +- 3 files changed, 7 insertions(+), 2 deletions(-) diff --git a/builtin/commit.c b/builtin/commit.c index e8e8d13be4..8751b816c1 100644 --- a/builtin/commit.c +++ b/builtin/commit.c @@ -32,6 +32,7 @@ #include "column.h" #include "sequencer.h" #include "mailmap.h" +#include "commit-graph.h" static const char * const builtin_commit_usage[] = { N_("git commit [] [--] ..."), @@ -1623,5 +1624,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix) UNLEAK(err); UNLEAK(sb); + + if (core_commit_graph) + write_commit_graph_reachable(get_object_directory(), 1); + return 0; } diff --git a/builtin/gc.c b/builtin/gc.c index efd214a59f..999c09d8e2 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -35,7 +35,7 @@ static int aggressive_depth = 50; static int aggressive_window = 250; static int gc_auto_threshold = 6700; static int gc_auto_pack_limit = 50; -static int gc_commit_graph = 0; +static int gc_commit_graph = 1; static int detach_auto = 1; static timestamp_t gc_log_expire_time; static const char *gc_log_expire = "1.day.ago"; diff --git a/environment.c b/environment.c index 8853e2f0dd..fdb2d56d90 100644 --- a/environment.c +++ b/environment.c @@ -62,7 +62,7 @@ enum push_default_type push_default = PUSH_DEFAULT_UNSPECIFIED; enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE; char *notes_ref_name; int grafts_replace_parents = 1; -int core_commit_graph; +int core_commit_graph = 1; int core_apply_sparse_checkout; int merge_log_config = -1; int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */ -- 2.16.2.338.gcfe06ae955
[RFC PATCH 4/6] commit-graph: avoid writing when repo is shallow
Shallow clones do not interact well with the commit-graph feature for several reasons. Instead of doing the hard thing to fix those interactions, instead prevent reading or writing a commit-graph file for shallow repositories. Signed-off-by: Derrick Stolee --- commit-graph.c | 12 1 file changed, 12 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 95af4ed519..80e377b90f 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -208,6 +208,9 @@ static void prepare_commit_graph(void) return; prepare_commit_graph_run_once = 1; + if (is_repository_shallow()) + return; + obj_dir = get_object_directory(); prepare_commit_graph_one(obj_dir); prepare_alt_odb(); @@ -711,6 +714,15 @@ void write_commit_graph(const char *obj_dir, int num_extra_edges; struct commit_list *parent; + /* +* Shallow clones are not supproted, as they create bad +* generation skips as they are un-shallowed. +*/ + if (is_repository_shallow()) { + warning("writing a commit-graph in a shallow repository is not supported"); + return; + } + oids.nr = 0; oids.alloc = approximate_object_count() / 4; -- 2.16.2.338.gcfe06ae955
[RFC PATCH 6/6] commit-graph: revert to odb on missing parents
The commit-graph format includes a way to specify a parent is "missing" from the commit-graph (i.e. we do not have a record of that parent in our list of object IDs, and hence cannot provide a graph position). For mose cases, this does not occur due to the close_reachable() method adding all reachable commits. However, in a shallow clone, we will try to record the parents of a commit on the shallow boundary, but the parents are not in the repository. The GRAPH_PARENT_MISSING value that is stored in the format is purposeful, especially for future plans to make the commit-graph file incremental or transporting sections of a commit-graph file across the network. In the meantime, check if a commit has a missing parent while filling its details from the commit-graph. If a parent is missing, still assign the generation number and graph position for that item, but report that the commit-graph failed to fill the contents. Then the caller is responsible for filling the rest of the data from a commit buffer. Signed-off-by: Derrick Stolee --- commit-graph.c | 33 ++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 80e377b90f..3e33d061fe 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -278,17 +278,44 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin struct commit_list **pptr; const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; - item->object.parsed = 1; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; item->graph_pos = pos; + /* +* If we have any edges marked as GRAPH_PARENT_MISSING, we must not parse any +* more of this object and leave it to the commit buffer to parse. +*/ + edge_value = get_be32(commit_data + g->hash_len); + if (edge_value == GRAPH_PARENT_MISSING) + return 0; + if (edge_value == GRAPH_PARENT_NONE) + goto continue_parsing; + + edge_value = get_be32(commit_data + g->hash_len + 4); + if (edge_value == GRAPH_PARENT_MISSING) + return 0; + if (edge_value == GRAPH_PARENT_NONE) + goto continue_parsing; + if (!(edge_value & GRAPH_OCTOPUS_EDGES_NEEDED)) + goto continue_parsing; + + parent_data_ptr = (uint32_t*)(g->chunk_large_edges + + 4 * (uint64_t)(edge_value & GRAPH_EDGE_LAST_MASK)); + do { + edge_value = get_be32(parent_data_ptr); + if (edge_value == GRAPH_PARENT_MISSING) + return 0; + parent_data_ptr++; + } while (!(edge_value & GRAPH_LAST_EDGE)); + +continue_parsing: + item->object.parsed = 1; item->maybe_tree = NULL; date_high = get_be32(commit_data + g->hash_len + 8) & 0x3; date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); - item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; - pptr = &item->parents; edge_value = get_be32(commit_data + g->hash_len); -- 2.16.2.338.gcfe06ae955
[RFC PATCH 3/6] commit-graph: enable replace-object and grafts
Create destroy_commit_graph() method to delete the commit-graph file when history is altered by a replace-object call. If the commit-graph is rebuilt after that, we will load the correct object while reading the commit-graph. When parsing a commit, first check if the commit was grafted. If so, then ignore the commit-graph for that commit and insted use the parents loaded by parsing the commit buffer and comparing against the graft file. Signed-off-by: Derrick Stolee --- builtin/replace.c | 3 +++ commit-graph.c| 20 +++- commit-graph.h| 9 + commit.c | 5 + 4 files changed, 36 insertions(+), 1 deletion(-) diff --git a/builtin/replace.c b/builtin/replace.c index 9f01f3fc7f..d553aadcdc 100644 --- a/builtin/replace.c +++ b/builtin/replace.c @@ -15,6 +15,7 @@ #include "parse-options.h" #include "run-command.h" #include "tag.h" +#include "commit-graph.h" static const char * const git_replace_usage[] = { N_("git replace [-f] "), @@ -468,6 +469,8 @@ int cmd_replace(int argc, const char **argv, const char *prefix) usage_msg_opt("--raw only makes sense with --edit", git_replace_usage, options); + destroy_commit_graph(get_object_directory()); + switch (cmdmode) { case MODE_DELETE: if (argc < 1) diff --git a/commit-graph.c b/commit-graph.c index e9195dfb17..95af4ed519 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -6,6 +6,7 @@ #include "pack.h" #include "packfile.h" #include "commit.h" +#include "dir.h" #include "object.h" #include "refs.h" #include "revision.h" @@ -240,15 +241,22 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, { struct commit *c; struct object_id oid; + const unsigned char *real; if (pos >= g->num_commits) die("invalid parent position %"PRIu64, pos); hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos); + + real = lookup_replace_object(oid.hash); + c = lookup_commit(&oid); if (!c) die("could not find commit %s", oid_to_hex(&oid)); - c->graph_pos = pos; + + if (!hashcmp(real, oid.hash)) + c->graph_pos = pos; + return &commit_list_insert(c, pptr)->next; } @@ -1019,3 +1027,13 @@ int verify_commit_graph(struct commit_graph *g) return verify_commit_graph_error; } + +void destroy_commit_graph(const char *obj_dir) +{ + char *graph_name; + close_commit_graph(); + + graph_name = get_commit_graph_filename(obj_dir); + remove_path(graph_name); + FREE_AND_NULL(graph_name); +} diff --git a/commit-graph.h b/commit-graph.h index 9a06a5f188..1d17da1582 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -56,4 +56,13 @@ void write_commit_graph(const char *obj_dir, int verify_commit_graph(struct commit_graph *g); +/* + * Delete the commit-graph file in the given object directory. + * + * WARNING: this deletes data, so should only be used when + * performing history-altering actions like replace-object + * or grafts. + */ +void destroy_commit_graph(const char *obj_dir); + #endif diff --git a/commit.c b/commit.c index 6eaed0174c..2fe31cde77 100644 --- a/commit.c +++ b/commit.c @@ -403,6 +403,11 @@ int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_com return -1; if (item->object.parsed) return 0; + + prepare_commit_graft(); + if (commit_graft_pos(item->object.oid.hash) >= 0) + use_commit_graph = 0; + if (use_commit_graph && parse_commit_in_graph(item)) return 0; buffer = read_sha1_file(item->object.oid.hash, &type, &size); -- 2.16.2.338.gcfe06ae955
Re: [RFC PATCH 0/6] Fix commit-graph/graft/replace/shallow combo
On 5/31/2018 2:33 PM, Stefan Beller wrote: On Thu, May 31, 2018 at 10:40 AM, Derrick Stolee wrote: The commit-graph file stores a condensed version of the commit history. This helps speed up several operations involving commit walks. This feature does not work well if those commits "change" using features like commit grafts, replace objects, or shallow clones. Since the commit-graph feature is optional, hidden behind the 'core.commitGraph' config option, and requires manual maintenance (until ds/commit-graph-fsck' is merged), these issues only arise for expert users who decided to opt-in. This RFC is a first attempt at rectify the issues that come up when these features interact. I have not succeeded entirely, as I will discuss below. The first two "DO NOT MERGE" commits are not intended to be part of the main product, but are ways to help the full test suite run while computing and consuming commit-graph files. This is acheived by calling write_commit_graph_reachable() from `git fetch` and `git commit`. I believe this is the only dependence on ds/commit-graph-fsck. The other commits should only depend on ds/commit-graph-lockfile-fix. I applied these patches on top of ds/commit-graph-fsck (it would have been nice if you mentioned that it applies there) Running the test suite with all patches applied results in: ./t0410-partial-clone.sh(Wstat: 256 Tests: 15 Failed: 2) Failed tests: 5, 8 ./t5307-pack-missing-commit.sh (Wstat: 256 Tests: 5 Failed: 2) Failed tests: 3-4 ./t5500-fetch-pack.sh (Wstat: 256 Tests: 353 Failed: 1) Failed test: 348 ./t6011-rev-list-with-bad-commit.sh (Wstat: 256 Tests: 6 Failed: 1) Failed test: 4 ./t6024-recursive-merge.sh (Wstat: 256 Tests: 6 Failed: 1) Failed test: 4 which you identified as 4x noise and t5500 as not understood. $ GIT_TRACE=1 ./t5500-fetch-pack.sh -d -i -v -x [...] expecting success: git -C shallow12 fetch --shallow-exclude one origin && git -C shallow12 log --pretty=tformat:%s origin/master >actual && test_write_lines three two >expected && test_cmp expected actual ++ git -C shallow12 fetch --shallow-exclude one origin trace: built-in: git fetch --shallow-exclude one origin trace: run_command: unset GIT_PREFIX; 'git-upload-pack '\''/u/git/t/trash directory.t5500-fetch-pack/shallow-exclude/.'\''' trace: run_command: git --shallow-file pack-objects --revs --thin --stdout --shallow --progress --delta-base-offset --include-tag trace: built-in: git pack-objects --revs --thin --stdout --shallow --progress --delta-base-offset --include-tag remote: Counting objects: 4, done. remote: Compressing objects: 100% (3/3), done. remote: Total 4 (delta 0), reused 0 (delta 0) trace: run_command: git --shallow-file unpack-objects --pack_header=2,4 trace: built-in: git unpack-objects --pack_header=2,4 Unpacking objects: 100% (4/4), done. trace: run_command: git rev-list --objects --stdin --not --all --quiet trace: built-in: git rev-list --objects --stdin --not --all --quiet trace: run_command: unset GIT_PREFIX; 'git-upload-pack '\''/u/git/t/trash directory.t5500-fetch-pack/shallow-exclude/.'\''' trace: run_command: git pack-objects --revs --thin --stdout --progress --delta-base-offset trace: built-in: git pack-objects --revs --thin --stdout --progress --delta-base-offset remote: Total 0 (delta 0), reused 0 (delta 0) trace: run_command: git unpack-objects --pack_header=2,0 trace: built-in: git unpack-objects --pack_header=2,0 trace: run_command: git rev-list --objects --stdin --not --all --quiet trace: built-in: git rev-list --objects --stdin --not --all --quiet From file:///u/git/t/trash directory.t5500-fetch-pack/shallow-exclude/. * [new tag] one-> one * [new tag] two-> two run_processes_parallel: preparing to run up to 1 tasks run_processes_parallel: done trace: run_command: git gc --auto trace: built-in: git gc --auto ++ git -C shallow12 log --pretty=tformat:%s origin/master trace: built-in: git log '--pretty=tformat:%s' origin/master ++ test_write_lines three two ++ printf '%s\n' three two ++ test_cmp expected actual ++ diff -u expected actual --- expected 2018-05-31 18:14:25.944540810 + +++ actual 2018-05-31 18:14:25.944540810 + @@ -1,2 +1,3 @@ three two +one error: last command exited with $?=1 not ok 348 - fetch exclude tag one # # git -C shallow12 fetch --shallow-exclude one origin && # git -C shallow12 log --pretty=tformat:%s origin/master >actual && # test_write_lines three two >expected && # test_cmp expected actual # After these changes, there is one test case that still fails, and I cannot understand why: t5500-fetch-pack.sh Failed test: 348 This test fails when performing a `git fetch -
ds/generation-numbers (was Re: What's cooking in git.git (Jun 2018, #01; Fri, 1))
On 6/1/2018 3:21 AM, Junio C Hamano wrote: * ds/commit-graph-lockfile-fix (2018-05-22) 1 commit (merged to 'next' on 2018-05-24 at 3d12a02b0c) + commit-graph: fix UX issue when .lock file exists (this branch is used by ds/commit-graph-fsck; uses ds/generation-numbers.) Update to ds/generation-numbers topic. Wait for ds/generation-numbers * ds/generation-numbers (2018-05-22) 11 commits (merged to 'next' on 2018-05-24 at 56fc38a1b6) + commit-graph.txt: update design document + merge: check config before loading commits + commit: use generation number in remove_redundant() + commit: add short-circuit to paint_down_to_common() + commit: use generation numbers for in_merge_bases() + ref-filter: use generation number for --contains + commit-graph: always load commit-graph information + commit: use generations in paint_down_to_common() + commit-graph: compute generation numbers + commit: add generation number to struct commit + ref-filter: fix outdated comment on in_commit_list (this branch is used by ds/commit-graph-fsck and ds/commit-graph-lockfile-fix.) A recently added "commit-graph" datafile has learned to store pre-computed generation numbers to speed up the decisions to stop history traversal. Will cook in 'next'. On Wednesday, these were marked as "Will merge to 'master'" What changed? Thanks, -Stolee
Re: [RFC PATCH 4/6] commit-graph: avoid writing when repo is shallow
On 5/31/2018 10:30 PM, Junio C Hamano wrote: Derrick Stolee writes: Shallow clones do not interact well with the commit-graph feature for several reasons. Instead of doing the hard thing to fix those interactions, instead prevent reading or writing a commit-graph file for shallow repositories. The latter instead would want to vanish, I would guess. Do you mean that we should call destroy_commit_graph() if we detect a shallow repository during write_commit_graph(), then I can make that change. Signed-off-by: Derrick Stolee --- commit-graph.c | 12 1 file changed, 12 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 95af4ed519..80e377b90f 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -208,6 +208,9 @@ static void prepare_commit_graph(void) return; prepare_commit_graph_run_once = 1; + if (is_repository_shallow()) + return; + obj_dir = get_object_directory(); prepare_commit_graph_one(obj_dir); prepare_alt_odb(); @@ -711,6 +714,15 @@ void write_commit_graph(const char *obj_dir, int num_extra_edges; struct commit_list *parent; + /* +* Shallow clones are not supproted, as they create bad +* generation skips as they are un-shallowed. +*/ + if (is_repository_shallow()) { + warning("writing a commit-graph in a shallow repository is not supported"); + return; + } + oids.nr = 0; oids.alloc = approximate_object_count() / 4;
Re: t5318-commit-graph.sh breaks travis gettext poison job
On 6/1/2018 12:17 PM, Duy Nguyen wrote: In case you're not checking travis, [1] reports Test Summary Report --- t5318-commit-graph.sh(Wstat: 256 Tests: 62 Failed: 2) Failed tests: 61-62 Non-zero exit status: 1 This usually means you're grepping an i18n string [1] https://travis-ci.org/git/git/jobs/386532326 Thanks, Duy. I believe this is related to the place Szeder pointed out [1] with the one translatable string in verify_commit_graph(). I've fixed this in my local branch, and plan to push v4 when review of v3 is complete. Thanks, -Stolee [1] https://public-inbox.org/git/20180530123529.32090-1-szeder@gmail.com/ Re: [PATCH v3 16/20] commit-graph: verify contents match checksum
Re: ds/generation-numbers (was Re: What's cooking in git.git (Jun 2018, #01; Fri, 1))
On 6/1/2018 7:15 PM, Junio C Hamano wrote: Derrick Stolee writes: A recently added "commit-graph" datafile has learned to store pre-computed generation numbers to speed up the decisions to stop history traversal. Will cook in 'next'. On Wednesday, these were marked as "Will merge to 'master'" What changed? Nothing has changed. "Will merge to 'master'" means "This is now in 'next', and unless there is some blocking breakage found, this topic will be merged to 'master' eventually". It does not even say if that eventuality is before or after the release the current cycle is working towards. When it comes near pre-release feature freeze, things in 'next' need to be sifted into various bins, and their labels are updated. The ones that are too big to be comfortably merged to the upcoming release after -rc0 has passed (i.e. biggies are better merged early to the 'master' in the cycle to give it full cycle of larger exposure) will be kept in 'next' so that it can go first after the final, the ones that are low risk but with higher importance will be merged to 'master' before the release, the ones that are trivial, distracting and lower value (i.e. the ones that force i18n teams extra work) may be held in 'next', and the ones that deserve a chance to freshly restart are marked to be kicked back to 'pu'. Etc. etc. Thanks, Junio. This explanation is what I expected. I suppose the small extra bit of information of "Will cook in 'next' until after next release" would have answered my question in advance. Thanks for the patience as I get used to your workflow. I am a little disappointed this didn't make 2.18 because this gives some of the biggest speedups for typically painful computations (like 'git branch --contains'). The generation numbers are what give us more than a constant-multiple speedup; what is in master only parses commit relationships faster, doesn't reduce the number of commits walked. It also means we will release a version of Git that writes commit-graph file with GENERATION_ZERO and so we will never be able to deprecate that logic in the code. Thanks, -Stolee
[PATCH 01/14] graph: add packed graph design document
Add Documentation/technical/packed-graph.txt with details of the planned packed graph feature, including future plans. Signed-off-by: Derrick Stolee --- Documentation/technical/packed-graph.txt | 185 +++ 1 file changed, 185 insertions(+) create mode 100644 Documentation/technical/packed-graph.txt diff --git a/Documentation/technical/packed-graph.txt b/Documentation/technical/packed-graph.txt new file mode 100644 index 00..fcc0c83874 --- /dev/null +++ b/Documentation/technical/packed-graph.txt @@ -0,0 +1,185 @@ +Git Packed Graph Design Notes += + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows above 100K. +The merge base calculation shows up in many user-facing commands, such +as 'status' and 'fetch' and can take minutes to compute depending on +data shape. There are two main costs here: + +1. Decompressing and parsing commits. +2. Walking the entire graph to avoid topological order mistakes. + +The packed graph is a file that stores the commit graph structure along +with some extra metadata to speed up graph walks. This format allows a +consumer to load the following info for a commit: + +1. The commit OID. +2. The list of parents. +3. The commit date. +4. The root tree OID. +5. An integer ID for fast lookups in the graph. +6. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +By providing an integer ID we can avoid lookups in the graph as we walk +commits. Specifically, we need to provide the integer ID of the parent +commits so we navigate directly to their information on request. + +Define the "generation number" of a commit recursively as follows: + * A commit with no parents (a root commit) has generation number 1. + * A commit with at least one parent has generation number 1 more than + the largest generation number among its parents. +Equivalently, the generation number is one more than the length of a +longest path from the commit to a root commit. The recursive definition +is easier to use for computation and the following property: + +If A and B are commits with generation numbers N and M, respectively, +and N <= M, then A cannot reach B. That is, we know without searching +that B is not an ancestor of A because it is further from a root commit +than A. + +Conversely, when checking if A is an ancestor of B, then we only need +to walk commits until all commits on the walk boundary have generation +number at most N. If we walk commits using a priority queue seeded by +generation numbers, then we always expand the boundary commit with highest +generation number and can easily detect the stopping condition. + +This property can be used to significantly reduce the time it takes to +walk commits and determine topological relationships. Without generation +numbers, the general heuristic is the following: + +If A and B are commits with commit time X and Y, respectively, and +X < Y, then A _probably_ cannot reach B. + +This heuristic is currently used whenever the computation can make +mistakes with topological orders (such as "git log" with default order), +but is not used when the topological order is required (such as merge +base calculations, "git log --graph"). + +Design Details +-- + +- A graph file is stored in a file named 'graph-.graph' in the pack + directory. This could be stored in an alternate. + +- The most-recent graph file OID is stored in a 'graph-head' file for + immediate access and storing backup graphs. This could be stored in an + alternate, and refers to a 'graph-.graph' file in the same pack + directory. + +- The core.graph config setting must be on to create or consume graph files. + +- The graph file is only a supplemental structure. If a user downgrades + or disables the 'core.graph' config setting, then the existing ODB is + sufficient. + +- The file format includes parameters for the object id length + and hash algorithm, so a future change of hash algorithm does + not require a change in format. + +Current Limitations +--- + +- Only one graph file is used at one time. This allows the integer ID to + seek into the single graph file. It is possible to extend the model + for multiple graph files, but that is currently not part of the design. + +- .graph files are managed only by the 'graph' builtin. These are not + updated automatically during clone or fetch. + +- There is no '--verify' option for the 'graph' builtin to verify the + contents of the graph file. + +- The graph only considers commits existing in packfiles and does not + walk to fill in reachable commits. [Small] + +- When rewriting
[PATCH 03/14] packed-graph: create git-graph builtin
Teach Git the 'graph' builtin that will be used for writing and reading packed graph files. The current implementation is mostly empty, except for a check that the core.graph setting is enabled and a '--pack-dir' option. Signed-off-by: Derrick Stolee --- Documentation/git-graph.txt | 7 +++ Makefile| 1 + builtin.h | 1 + builtin/graph.c | 36 command-list.txt| 1 + git.c | 1 + 6 files changed, 47 insertions(+) create mode 100644 Documentation/git-graph.txt create mode 100644 builtin/graph.c diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt new file mode 100644 index 00..de5a3c07e6 --- /dev/null +++ b/Documentation/git-graph.txt @@ -0,0 +1,7 @@ +git-graph(1) + + +NAME + +git-graph - Write and verify Git commit graphs (.graph files) + diff --git a/Makefile b/Makefile index 1a9b23b679..d8b0d0457a 100644 --- a/Makefile +++ b/Makefile @@ -965,6 +965,7 @@ BUILTIN_OBJS += builtin/for-each-ref.o BUILTIN_OBJS += builtin/fsck.o BUILTIN_OBJS += builtin/gc.o BUILTIN_OBJS += builtin/get-tar-commit-id.o +BUILTIN_OBJS += builtin/graph.o BUILTIN_OBJS += builtin/grep.o BUILTIN_OBJS += builtin/hash-object.o BUILTIN_OBJS += builtin/help.o diff --git a/builtin.h b/builtin.h index 42378f3aa4..ae7e816908 100644 --- a/builtin.h +++ b/builtin.h @@ -168,6 +168,7 @@ extern int cmd_format_patch(int argc, const char **argv, const char *prefix); extern int cmd_fsck(int argc, const char **argv, const char *prefix); extern int cmd_gc(int argc, const char **argv, const char *prefix); extern int cmd_get_tar_commit_id(int argc, const char **argv, const char *prefix); +extern int cmd_graph(int argc, const char **argv, const char *prefix); extern int cmd_grep(int argc, const char **argv, const char *prefix); extern int cmd_hash_object(int argc, const char **argv, const char *prefix); extern int cmd_help(int argc, const char **argv, const char *prefix); diff --git a/builtin/graph.c b/builtin/graph.c new file mode 100644 index 00..a902dc8646 --- /dev/null +++ b/builtin/graph.c @@ -0,0 +1,36 @@ +#include "builtin.h" +#include "cache.h" +#include "config.h" +#include "dir.h" +#include "git-compat-util.h" +#include "lockfile.h" +#include "packfile.h" +#include "parse-options.h" + +static char const * const builtin_graph_usage[] ={ + N_("git graph [--pack-dir ]"), + NULL +}; + +static struct opts_graph { + const char *pack_dir; +} opts; + +int cmd_graph(int argc, const char **argv, const char *prefix) +{ + static struct option builtin_graph_options[] = { + { OPTION_STRING, 'p', "pack-dir", &opts.pack_dir, + N_("dir"), + N_("The pack directory to store the graph") }, + OPT_END(), + }; + + if (!core_graph) + die("core.graph is false"); + + if (argc == 2 && !strcmp(argv[1], "-h")) + usage_with_options(builtin_graph_usage, builtin_graph_options); + + return 0; +} + diff --git a/command-list.txt b/command-list.txt index a1fad28fd8..d9c17cb9f8 100644 --- a/command-list.txt +++ b/command-list.txt @@ -61,6 +61,7 @@ git-format-patchmainporcelain git-fsckancillaryinterrogators git-gc mainporcelain git-get-tar-commit-id ancillaryinterrogators +git-graph plumbingmanipulators git-grepmainporcelain info git-gui mainporcelain git-hash-object plumbingmanipulators diff --git a/git.c b/git.c index c870b9719c..29f8b6e7dd 100644 --- a/git.c +++ b/git.c @@ -408,6 +408,7 @@ static struct cmd_struct commands[] = { { "fsck-objects", cmd_fsck, RUN_SETUP }, { "gc", cmd_gc, RUN_SETUP }, { "get-tar-commit-id", cmd_get_tar_commit_id }, + { "graph", cmd_graph, RUN_SETUP_GENTLY }, { "grep", cmd_grep, RUN_SETUP_GENTLY }, { "hash-object", cmd_hash_object }, { "help", cmd_help }, -- 2.16.0
[PATCH 00/14] Serialized Commit Graph
As promised [1], this patch contains a way to serialize the commit graph. The current implementation defines a new file format to store the graph structure (parent relationships) and basic commit metadata (commit date, root tree OID) in order to prevent parsing raw commits while performing basic graph walks. For example, we do not need to parse the full commit when performing these walks: * 'git log --topo-order -1000' walks all reachable commits to avoid incorrect topological orders, but only needs the commit message for the top 1000 commits. * 'git merge-base ' may walk many commits to find the correct boundary between the commits reachable from A and those reachable from B. No commit messages are needed. * 'git branch -vv' checks ahead/behind status for all local branches compared to their upstream remote branches. This is essentially as hard as computing merge bases for each. The current patch speeds up these calculations by injecting a check in parse_commit_gently() to check if there is a graph file and using that to provide the required metadata to the struct commit. The file format has room to store generation numbers, which will be provided as a patch after this framework is merged. Generation numbers are referenced by the design document but not implemented in order to make the current patch focus on the graph construction process. Once that is stable, it will be easier to add generation numbers and make graph walks aware of generation numbers one-by-one. Here are some performance results for a copy of the Linux repository where 'master' has 704,766 reachable commits and is behind 'origin/master' by 19,610 commits. | Command | Before | After | Rel % | |--|||---| | log --oneline --topo-order -1000 | 5.9s | 0.7s | -88% | | branch -vv | 0.42s | 0.27s | -35% | | rev-list --all | 6.4s | 1.0s | -84% | | rev-list --all --objects | 32.6s | 27.6s | -15% | To test this yourself, run the following on your repo: git config core.graph true git show-ref -s | git graph --write --update-head The second command writes a commit graph file containing every commit reachable from your refs. Now, all git commands that walk commits will check your graph first before consulting the ODB. You can run your own performance comparisions by toggling the 'core.graph' setting. [1] https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a...@jeffhostetler.com/ Re: What's cooking in git.git (Jan 2018, #03; Tue, 23) [2] https://github.com/derrickstolee/git/pull/2 A GitHub pull request containing the latest version of this patch. P.S. I'm sending this patch from my gmail address to avoid Outlook munging the URLs included in the design document. Derrick Stolee (14): graph: add packed graph design document packed-graph: add core.graph setting packed-graph: create git-graph builtin packed-graph: add format document packed-graph: implement construct_graph() packed-graph: implement git-graph --write packed-graph: implement git-graph --read graph: implement git-graph --update-head packed-graph: implement git-graph --clear packed-graph: teach git-graph --delete-expired commit: integrate packed graph with commit parsing packed-graph: read only from specific pack-indexes packed-graph: close under reachability packed-graph: teach git-graph to read commits Documentation/config.txt | 3 + Documentation/git-graph.txt | 102 Documentation/technical/graph-format.txt | 88 Documentation/technical/packed-graph.txt | 185 +++ Makefile | 2 + alloc.c | 1 + builtin.h| 1 + builtin/graph.c | 231 + cache.h | 1 + command-list.txt | 1 + commit.c | 20 +- commit.h | 2 + config.c | 5 + environment.c| 1 + git.c| 1 + log-tree.c | 3 +- packed-graph.c | 840 +++ packed-graph.h | 65 +++ packfile.c | 4 +- packfile.h | 2 + t/t5319-graph.sh | 271 ++ 21 files changed, 1822 insertions(+), 7 deletions(-) create mode 100644 Documentation/git-graph.txt create mode 100644 Documentation/technical/graph-format.txt create mode 100644 Documentation/technical/packed-graph.txt create mode 100644 builtin/graph.c create mode 100644 packed-graph.c
[PATCH 04/14] packed-graph: add format document
Add document specifying the binary format for packed graphs. This format allows for: * New versions. * New hash functions and hash lengths. * Optional extensions. Basic header information is followed by a binary table of contents into "chunks" that include: * An ordered list of commit object IDs. * A 256-entry fanout into that list of OIDs. * A list of metadata for the commits. * A list of "large edges" to enable octopus merges. Signed-off-by: Derrick Stolee --- Documentation/technical/graph-format.txt | 88 1 file changed, 88 insertions(+) create mode 100644 Documentation/technical/graph-format.txt diff --git a/Documentation/technical/graph-format.txt b/Documentation/technical/graph-format.txt new file mode 100644 index 00..a15e1036d7 --- /dev/null +++ b/Documentation/technical/graph-format.txt @@ -0,0 +1,88 @@ +Git commit graph format +=== + +The Git commit graph stores a list of commit OIDs and some associated +metadata, including: + +- The generation number of the commit. Commits with no parents have + generation number 1; commits with parents have generation number + one more than the maximum generation number of its parents. We + reserve zero as special, and can be used to mark a generation + number invalid or as "not computed". + +- The root tree OID. + +- The commit date. + +- The parents of the commit, stored using positional references within + the graph file. + +== graph-*.graph files have the following format: + +In order to allow extensions that add extra data to the graph, we organize +the body into "chunks" and provide a binary lookup table at the beginning +of the body. The header includes certain values, such as number of chunks, +hash lengths and types. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'C', 'G', 'P', 'H'} + + 1-byte version number: + Currently, the only valid version is 1. + + 1-byte Object Id Version (1 = SHA-1) + + 1-byte Object Id Length (H) + + 1-byte number (C) of "chunks" + +CHUNK LOOKUP: + + (C + 1) * 12 bytes listing the table of contents for the chunks: + First 4 bytes describe chunk id. Value 0 is a terminating label. + Other 8 bytes provide offset in current file for chunk to start. + (Chunks are ordered contiguously in the file, so you can infer + the length using the next chunk position if necessary.) + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph. + + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) + * The first H bytes are for the OID of the root tree. + * The next 8 bytes are for the int-ids of the first two parents of + the ith commit. Stores value 0x if no parent in that position. + If there are more than two parents, the second value has its most- + significant bit on and the other bits store an offset into the Large + Edge List chunk. + * The next 8 bytes store the generation number of the commit and the + commit time in seconds since EPOCH. The generation number uses the + higher 30 bits of the first 4 bytes, while the commit time uses the + 32 bits of the second 4 bytes, along with the lowest 2 bits of the + lowest byte, storing the 33rd and 34th bit of the commit time. + + [Optional] Large Edge List (ID: {'E', 'D', 'G', 'E'}) + This list of 4-byte values store the second through nth parents for + all octoput merges. The second parent value in the commit data is a + negative number pointing into this list. Then iterate through this + list starting at that position until reaching a value with the most- + significant bit on. The other bits correspond to the int-id of the + last parent. + +TRAILER: + + H-byte HASH-checksum of all of the above. -- 2.16.0
[PATCH 02/14] packed-graph: add core.graph setting
The packed graph feature is controlled by the new core.graph config setting. This defaults to 0, so the feature is opt-in. The intention of core.graph is that a user can always stop checking for or parsing packed graph files if core.graph=0. Signed-off-by: Derrick Stolee --- Documentation/config.txt | 3 +++ cache.h | 1 + config.c | 5 + environment.c| 1 + 4 files changed, 10 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index 0e25b2c92b..e7b98fa14f 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -898,6 +898,9 @@ core.notesRef:: This setting defaults to "refs/notes/commits", and it can be overridden by the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1]. +core.graph:: + Enable git commit graph feature. Allows writing and reading from .graph files. + core.sparseCheckout:: Enable "sparse checkout" feature. See section "Sparse checkout" in linkgit:git-read-tree[1] for more information. diff --git a/cache.h b/cache.h index d8b975a571..655a81ac90 100644 --- a/cache.h +++ b/cache.h @@ -825,6 +825,7 @@ extern char *git_replace_ref_base; extern int fsync_object_files; extern int core_preload_index; extern int core_apply_sparse_checkout; +extern int core_graph; extern int precomposed_unicode; extern int protect_hfs; extern int protect_ntfs; diff --git a/config.c b/config.c index e617c2018d..fee90912d8 100644 --- a/config.c +++ b/config.c @@ -1223,6 +1223,11 @@ static int git_default_core_config(const char *var, const char *value) return 0; } + if (!strcmp(var, "core.graph")) { + core_graph = git_config_bool(var, value); + return 0; + } + if (!strcmp(var, "core.sparsecheckout")) { core_apply_sparse_checkout = git_config_bool(var, value); return 0; diff --git a/environment.c b/environment.c index 63ac38a46f..0c56a3d869 100644 --- a/environment.c +++ b/environment.c @@ -61,6 +61,7 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE; char *notes_ref_name; int grafts_replace_parents = 1; int core_apply_sparse_checkout; +int core_graph; int merge_log_config = -1; int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */ unsigned long pack_size_limit_cfg; -- 2.16.0
[PATCH 09/14] packed-graph: implement git-graph --clear
Teach Git to delete the current 'graph_head' file and the packed graph it references. This is a good safety valve if somehow the file is corrupted and needs to be recalculated. Since the packed graph is a summary of contents already in the ODB, it can be regenerated. Signed-off-by: Derrick Stolee --- Documentation/git-graph.txt | 16 ++-- builtin/graph.c | 31 ++- t/t5319-graph.sh| 7 ++- 3 files changed, 50 insertions(+), 4 deletions(-) diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index ac20aa67a9..f690699570 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -11,6 +11,7 @@ SYNOPSIS [verse] 'git graph' --write [--pack-dir ] 'git graph' --read [--pack-dir ] +'git graph' --clear [--pack-dir ] OPTIONS --- @@ -18,16 +19,21 @@ OPTIONS Use given directory for the location of packfiles, graph-head, and graph files. +--clear:: + Delete the graph-head file and the graph file it references. + (Cannot be combined with --read or --write.) + --read:: Read a graph file given by the graph-head file and output basic - details about the graph file. (Cannot be combined with --write.) + details about the graph file. (Cannot be combined with --clear + or --write.) --graph-id:: When used with --read, consider the graph file graph-.graph. --write:: Write a new graph file to the pack directory. (Cannot be combined - with --read.) + with --clear or --read.) --update-head:: When used with --write, update the graph-head file to point to @@ -61,6 +67,12 @@ $ git graph --write --update-head $ git graph --read --graph-id= +* Delete /graph-head and the file it references. ++ + +$ git graph --clear --pack-dir= + + CONFIGURATION - diff --git a/builtin/graph.c b/builtin/graph.c index 0760d99f43..ac15febc46 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -10,6 +10,7 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), + N_("git graph --clear [--pack-dir ]"), N_("git graph --read [--graph-id=]"), N_("git graph --write [--pack-dir ] [--update-head]"), NULL @@ -17,6 +18,7 @@ static char const * const builtin_graph_usage[] ={ static struct opts_graph { const char *pack_dir; + int clear; int read; const char *graph_id; int write; @@ -25,6 +27,29 @@ static struct opts_graph { struct object_id old_graph_oid; } opts; +static int graph_clear(void) +{ + struct strbuf head_path = STRBUF_INIT; + char *old_path; + + if (!opts.has_existing) + return 0; + + strbuf_addstr(&head_path, opts.pack_dir); + strbuf_addstr(&head_path, "/"); + strbuf_addstr(&head_path, "graph-head"); + if (remove_path(head_path.buf)) + die("failed to remove path %s", head_path.buf); + strbuf_release(&head_path); + + old_path = get_graph_filename_oid(opts.pack_dir, &opts.old_graph_oid); + if (remove_path(old_path)) + die("failed to remove path %s", old_path); + free(old_path); + + return 0; +} + static int graph_read(void) { struct object_id graph_oid; @@ -105,6 +130,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) { OPTION_STRING, 'p', "pack-dir", &opts.pack_dir, N_("dir"), N_("The pack directory to store the graph") }, + OPT_BOOL('c', "clear", &opts.clear, + N_("clear graph file and graph-head")), OPT_BOOL('r', "read", &opts.read, N_("read graph file")), OPT_BOOL('w', "write", &opts.write, @@ -129,7 +156,7 @@ int cmd_graph(int argc, const char **argv, const char *prefix) builtin_graph_options, builtin_graph_usage, 0); - if (opts.write + opts.read > 1) + if (opts.write + opts.read + opts.clear > 1) usage_with_options(builtin_graph_usage, builtin_graph_options); if (!opts.pack_dir) { @@ -141,6 +168,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) opts.has_existing = !!get_graph_head_oid(opts.pack_dir, &opts.old_graph_oid); + if (opts.clear) + return graph_clear(); if (opts.read) return graph_read();
[PATCH 08/14] graph: implement git-graph --update-head
It is possible to have multiple packed graph files in a pack directory, but only one is important at a time. Use a 'graph_head' file to point to the important file. Teach git-graph to write 'graph_head' upon writing a new packed graph file. Signed-off-by: Derrick Stolee --- Documentation/git-graph.txt | 38 -- builtin/graph.c | 38 +++--- packed-graph.c | 25 + packed-graph.h | 1 + t/t5319-graph.sh| 12 ++-- 5 files changed, 107 insertions(+), 7 deletions(-) diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index 0939c3f1be..ac20aa67a9 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -12,19 +12,53 @@ SYNOPSIS 'git graph' --write [--pack-dir ] 'git graph' --read [--pack-dir ] +OPTIONS +--- +--pack-dir:: + Use given directory for the location of packfiles, graph-head, + and graph files. + +--read:: + Read a graph file given by the graph-head file and output basic + details about the graph file. (Cannot be combined with --write.) + +--graph-id:: + When used with --read, consider the graph file graph-.graph. + +--write:: + Write a new graph file to the pack directory. (Cannot be combined + with --read.) + +--update-head:: + When used with --write, update the graph-head file to point to + the written graph file. + EXAMPLES +* Output the OID of the graph file pointed to by /graph-head. ++ + +$ git graph --pack-dir= + + * Write a graph file for the packed commits in your local .git folder. + -$ git midx --write +$ git graph --write + + +* Write a graph file for the packed commits in your local .git folder, +* and update graph-head. ++ + +$ git graph --write --update-head * Read basic information from a graph file. + -$ git midx --read --graph-id= +$ git graph --read --graph-id= CONFIGURATION diff --git a/builtin/graph.c b/builtin/graph.c index bc66722924..0760d99f43 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -11,7 +11,7 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), N_("git graph --read [--graph-id=]"), - N_("git graph --write [--pack-dir ]"), + N_("git graph --write [--pack-dir ] [--update-head]"), NULL }; @@ -20,6 +20,9 @@ static struct opts_graph { int read; const char *graph_id; int write; + int update_head; + int has_existing; + struct object_id old_graph_oid; } opts; static int graph_read(void) @@ -30,8 +33,8 @@ static int graph_read(void) if (opts.graph_id && strlen(opts.graph_id) == GIT_MAX_HEXSZ) get_oid_hex(opts.graph_id, &graph_oid); - else - die("no graph id specified"); + else if (!get_graph_head_oid(opts.pack_dir, &graph_oid)) + die("no graph-head exists."); graph_file = get_graph_filename_oid(opts.pack_dir, &graph_oid); graph = load_packed_graph_one(graph_file, opts.pack_dir); @@ -62,10 +65,33 @@ static int graph_read(void) return 0; } +static void update_head_file(const char *pack_dir, const struct object_id *graph_id) +{ + struct strbuf head_path = STRBUF_INIT; + int fd; + struct lock_file lk = LOCK_INIT; + + strbuf_addstr(&head_path, pack_dir); + strbuf_addstr(&head_path, "/"); + strbuf_addstr(&head_path, "graph-head"); + + fd = hold_lock_file_for_update(&lk, head_path.buf, LOCK_DIE_ON_ERROR); + strbuf_release(&head_path); + + if (fd < 0) + die_errno("unable to open graph-head"); + + write_in_full(fd, oid_to_hex(graph_id), GIT_MAX_HEXSZ); + commit_lock_file(&lk); +} + static int graph_write(void) { struct object_id *graph_id = construct_graph(opts.pack_dir); + if (opts.update_head) + update_head_file(opts.pack_dir, graph_id); + if (graph_id) printf("%s\n", oid_to_hex(graph_id)); @@ -83,6 +109,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) N_("read graph file")), OPT_BOOL('w', "write", &opts.write, N_("write graph file")), +
[PATCH 06/14] packed-graph: implement git-graph --write
Teach git-graph to write graph files. Create new test script to verify this command succeeds without failure. Signed-off-by: Derrick Stolee --- Documentation/git-graph.txt | 26 ++ builtin/graph.c | 37 ++-- t/t5319-graph.sh| 83 + 3 files changed, 143 insertions(+), 3 deletions(-) create mode 100755 t/t5319-graph.sh diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index de5a3c07e6..be6bc38814 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -5,3 +5,29 @@ NAME git-graph - Write and verify Git commit graphs (.graph files) + +SYNOPSIS + +[verse] +'git graph' --write [--pack-dir ] + +EXAMPLES + + +* Write a graph file for the packed commits in your local .git folder. ++ + +$ git midx --write + + +CONFIGURATION +- + +core.graph:: + The graph command will fail if core.graph is false. + Also, the written graph files will be ignored by other commands + unless core.graph is true. + +GIT +--- +Part of the linkgit:git[1] suite \ No newline at end of file diff --git a/builtin/graph.c b/builtin/graph.c index a902dc8646..09f5552338 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -6,31 +6,62 @@ #include "lockfile.h" #include "packfile.h" #include "parse-options.h" +#include "packed-graph.h" static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), + N_("git graph --write [--pack-dir ]"), NULL }; static struct opts_graph { const char *pack_dir; + int write; } opts; +static int graph_write(void) +{ + struct object_id *graph_id = construct_graph(opts.pack_dir); + + if (graph_id) + printf("%s\n", oid_to_hex(graph_id)); + + free(graph_id); + return 0; +} + int cmd_graph(int argc, const char **argv, const char *prefix) { static struct option builtin_graph_options[] = { { OPTION_STRING, 'p', "pack-dir", &opts.pack_dir, N_("dir"), N_("The pack directory to store the graph") }, + OPT_BOOL('w', "write", &opts.write, + N_("write graph file")), OPT_END(), }; - if (!core_graph) - die("core.graph is false"); - if (argc == 2 && !strcmp(argv[1], "-h")) usage_with_options(builtin_graph_usage, builtin_graph_options); + git_config(git_default_config, NULL); + if (!core_graph) + die("git-graph requires core.graph=true."); + + argc = parse_options(argc, argv, prefix, +builtin_graph_options, +builtin_graph_usage, 0); + + if (!opts.pack_dir) { + struct strbuf path = STRBUF_INIT; + strbuf_addstr(&path, get_object_directory()); + strbuf_addstr(&path, "/pack"); + opts.pack_dir = strbuf_detach(&path, NULL); + } + + if (opts.write) + return graph_write(); + return 0; } diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh new file mode 100755 index 00..52e979dfd3 --- /dev/null +++ b/t/t5319-graph.sh @@ -0,0 +1,83 @@ +#!/bin/sh + +test_description='packed graph' +. ./test-lib.sh + +test_expect_success \ +'setup full repo' \ +'rm -rf .git && + mkdir full && + cd full && + git init && + git config core.graph true && + git config pack.threads 1 && + packdir=".git/objects/pack"' + +test_expect_success \ +'write graph with no packs' \ +'git graph --write --pack-dir .' + +test_expect_success \ +'create commits and repack' \ +'for i in $(test_seq 5) + do +echo $i >$i.txt && +git add $i.txt && +git commit -m "commit $i" && +git branch commits/$i + done && + git repack' + +test_expect_success \ +'write graph' \ +'graph1=$(git graph --write) && + test_path_is_file ${packdir}/graph-${graph1}.graph' + +test_expect_success \ +'Add more commits' \ +'git reset --hard commits/3 && + for i in $(test_seq 6 10) + do +echo $i >$i.txt && +git add $i.txt && +git commit -m "commit $i" && +git branch commits/$i + done && + git reset --hard commits/7
[PATCH 14/14] packed-graph: teach git-graph to read commits
Teach git-graph to read commits from stdin when the --stdin-commits flag is specified. Commits reachable from these commits are added to the graph. This is a much faster way to construct the graph than inspecting all packed objects, but is restricted to known tips. For the Linux repository, 700,000+ commits were added to the graph file starting from 'master' in 7-9 seconds, depending on the number of packfiles in the repo (1, 24, or 120). Signed-off-by: Derrick Stolee --- builtin/graph.c | 33 + packed-graph.c | 18 +++--- packed-graph.h | 3 ++- t/t5319-graph.sh | 18 ++ 4 files changed, 60 insertions(+), 12 deletions(-) diff --git a/builtin/graph.c b/builtin/graph.c index 3cace3a18c..708889677b 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), N_("git graph --clear [--pack-dir ]"), N_("git graph --read [--graph-id=]"), - N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs]"), + N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs|--stdin-commits]"), NULL }; @@ -25,6 +25,7 @@ static struct opts_graph { int update_head; int delete_expired; int stdin_packs; + int stdin_commits; int has_existing; struct object_id old_graph_oid; } opts; @@ -116,22 +117,36 @@ static int graph_write(void) { struct object_id *graph_id; char **pack_indexes = NULL; + char **commits = NULL; int num_packs = 0; - int size_packs = 0; + int num_commits = 0; + char **lines = NULL; + int num_lines = 0; + int size_lines = 0; - if (opts.stdin_packs) { + if (opts.stdin_packs || opts.stdin_commits) { struct strbuf buf = STRBUF_INIT; - size_packs = 128; - ALLOC_ARRAY(pack_indexes, size_packs); + size_lines = 128; + ALLOC_ARRAY(lines, size_lines); while (strbuf_getline(&buf, stdin) != EOF) { - ALLOC_GROW(pack_indexes, num_packs + 1, size_packs); - pack_indexes[num_packs++] = buf.buf; + ALLOC_GROW(lines, num_lines + 1, size_lines); + lines[num_lines++] = buf.buf; strbuf_detach(&buf, NULL); } + + if (opts.stdin_packs) { + pack_indexes = lines; + num_packs = num_lines; + } + if (opts.stdin_commits) { + commits = lines; + num_commits = num_lines; + } } - graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs); + graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs, + commits, num_commits); if (opts.update_head) update_head_file(opts.pack_dir, graph_id); @@ -170,6 +185,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) N_("delete expired head graph file")), OPT_BOOL('s', "stdin-packs", &opts.stdin_packs, N_("only scan packfiles listed by stdin")), + OPT_BOOL('C', "stdin-commits", &opts.stdin_commits, + N_("start walk at commits listed by stdin")), { OPTION_STRING, 'G', "graph-id", &opts.graph_id, N_("oid"), N_("An OID for a specific graph file in the pack-dir."), diff --git a/packed-graph.c b/packed-graph.c index c93515f18e..94e1a97000 100644 --- a/packed-graph.c +++ b/packed-graph.c @@ -662,7 +662,8 @@ static void close_reachable(struct packed_oid_list *oids) } } -struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int nr_packs) +struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int nr_packs, + char **commit_hex, int nr_commits) { // Find a list of oids, adding the pointer to a list. struct packed_oid_list oids; @@ -719,10 +720,21 @@ struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int for_each_object_in_pack(p, if_packed_commit_add_to_list, &oids); close_pack(p); } - } else { - for_each_packed_object(if_packed_commit_add_to_list, &oids, 0); } + if (commit_hex) { + for (i = 0; i < nr_commits; i++) { + const char *end; +
[PATCH 05/14] packed-graph: implement construct_graph()
Teach Git to write a packed graph file by checking all packed objects to see if they are commits, then store the file in the given pack directory. Signed-off-by: Derrick Stolee --- Makefile | 1 + packed-graph.c | 375 + packed-graph.h | 20 +++ 3 files changed, 396 insertions(+) create mode 100644 packed-graph.c create mode 100644 packed-graph.h diff --git a/Makefile b/Makefile index d8b0d0457a..59439e13a1 100644 --- a/Makefile +++ b/Makefile @@ -841,6 +841,7 @@ LIB_OBJS += notes-utils.o LIB_OBJS += object.o LIB_OBJS += oidmap.o LIB_OBJS += oidset.o +LIB_OBJS += packed-graph.o LIB_OBJS += packfile.o LIB_OBJS += pack-bitmap.o LIB_OBJS += pack-bitmap-write.o diff --git a/packed-graph.c b/packed-graph.c new file mode 100644 index 00..9be9811667 --- /dev/null +++ b/packed-graph.c @@ -0,0 +1,375 @@ +#include "cache.h" +#include "config.h" +#include "git-compat-util.h" +#include "pack.h" +#include "packfile.h" +#include "commit.h" +#include "object.h" +#include "packed-graph.h" + +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */ + +#define GRAPH_DATA_WIDTH 36 + +#define GRAPH_VERSION_1 0x1 +#define GRAPH_VERSION GRAPH_VERSION_1 + +#define GRAPH_OID_VERSION_SHA1 1 +#define GRAPH_OID_LEN_SHA1 20 +#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1 +#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1 + +#define GRAPH_LARGE_EDGES_NEEDED 0x8000 +#define GRAPH_PARENT_MISSING 0x7fff +#define GRAPH_EDGE_LAST_MASK 0x7fff +#define GRAPH_PARENT_NONE 0x7000 + +#define GRAPH_LAST_EDGE 0x8000 + +char* get_graph_filename_oid(const char *pack_dir, + struct object_id *oid) +{ + size_t len; + struct strbuf head_path = STRBUF_INIT; + strbuf_addstr(&head_path, pack_dir); + strbuf_addstr(&head_path, "/graph-"); + strbuf_addstr(&head_path, oid_to_hex(oid)); + strbuf_addstr(&head_path, ".graph"); + + return strbuf_detach(&head_path, &len); +} + +static void write_graph_chunk_fanout( + struct sha1file *f, + struct commit **commits, int nr_commits) +{ + uint32_t i, count = 0; + struct commit **list = commits; + struct commit **last = commits + nr_commits; + + /* + * Write the first-level table (the list is sorted, + * but we use a 256-entry lookup to be able to avoid + * having to do eight extra binary search iterations). + */ + for (i = 0; i < 256; i++) { + uint32_t swap_count; + + while (list < last) { + if ((*list)->object.oid.hash[0] != i) + break; + count++; + list++; + } + + swap_count = htonl(count); + sha1write(f, &swap_count, 4); + } +} + +static void write_graph_chunk_oids( + struct sha1file *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + uint32_t i; + for (i = 0; i < nr_commits; i++) { + sha1write(f, (*list)->object.oid.hash, (int)hash_len); + list++; + } +} + +static int commit_pos(struct commit **commits, int nr_commits, const struct object_id *oid, uint32_t *pos) +{ + uint32_t first = 0, last = nr_commits; + + while (first < last) { + uint32_t mid = first + (last - first) / 2; + struct object_id *current; + int cmp; + + current = &(commits[mid]->object.oid); + cmp = oidcmp(oid, current); + if (!cmp) { + *pos = mid; + return 1; + } + if (cmp > 0) { + first = mid + 1; + continue; + } + last = mid; + } + + *pos = first; + return 0; +} + +static void write_graph_chunk_data( + struct sha1file *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + struct commit **last = commits + nr_commits; + uint32_t num_large_edges = 0; + + while (list < last) { + struct commit_list *parent; + uint32_t intId, swapIntId; + uint32_t packedDate[2]; + + parse_commit(*list); + sha1write(f, (*list)->tree->object.oid.hash, hash_len); + + parent = (*list)->parents; + +
[PATCH 07/14] packed-graph: implement git-graph --read
Teach git-graph to read packed graph files and summarize their contents. Use the --read option to verify the contents of a graph file in the graph tests. Signed-off-by: Derrick Stolee --- Documentation/git-graph.txt | 7 +++ builtin/graph.c | 54 packed-graph.c | 147 +++- packed-graph.h | 25 t/t5319-graph.sh| 50 +-- 5 files changed, 260 insertions(+), 23 deletions(-) diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index be6bc38814..0939c3f1be 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -10,6 +10,7 @@ SYNOPSIS [verse] 'git graph' --write [--pack-dir ] +'git graph' --read [--pack-dir ] EXAMPLES @@ -20,6 +21,12 @@ EXAMPLES $ git midx --write +* Read basic information from a graph file. ++ + +$ git midx --read --graph-id= + + CONFIGURATION - diff --git a/builtin/graph.c b/builtin/graph.c index 09f5552338..bc66722924 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -10,15 +10,58 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), + N_("git graph --read [--graph-id=]"), N_("git graph --write [--pack-dir ]"), NULL }; static struct opts_graph { const char *pack_dir; + int read; + const char *graph_id; int write; } opts; +static int graph_read(void) +{ + struct object_id graph_oid; + struct packed_graph *graph = 0; + const char *graph_file; + + if (opts.graph_id && strlen(opts.graph_id) == GIT_MAX_HEXSZ) + get_oid_hex(opts.graph_id, &graph_oid); + else + die("no graph id specified"); + + graph_file = get_graph_filename_oid(opts.pack_dir, &graph_oid); + graph = load_packed_graph_one(graph_file, opts.pack_dir); + + if (!graph) + die("graph file %s does not exist.\n", graph_file); + + printf("header: %08x %02x %02x %02x %02x\n", + ntohl(graph->hdr->graph_signature), + graph->hdr->graph_version, + graph->hdr->hash_version, + graph->hdr->hash_len, + graph->hdr->num_chunks); + printf("num_commits: %u\n", graph->num_commits); + printf("chunks:"); + + if (graph->chunk_oid_fanout) + printf(" oid_fanout"); + if (graph->chunk_oid_lookup) + printf(" oid_lookup"); + if (graph->chunk_commit_data) + printf(" commit_metadata"); + if (graph->chunk_large_edges) + printf(" large_edges"); + printf("\n"); + + printf("pack_dir: %s\n", graph->pack_dir); + return 0; +} + static int graph_write(void) { struct object_id *graph_id = construct_graph(opts.pack_dir); @@ -36,8 +79,14 @@ int cmd_graph(int argc, const char **argv, const char *prefix) { OPTION_STRING, 'p', "pack-dir", &opts.pack_dir, N_("dir"), N_("The pack directory to store the graph") }, + OPT_BOOL('r', "read", &opts.read, + N_("read graph file")), OPT_BOOL('w', "write", &opts.write, N_("write graph file")), + { OPTION_STRING, 'M', "graph-id", &opts.graph_id, + N_("oid"), + N_("An OID for a specific graph file in the pack-dir."), + PARSE_OPT_OPTARG, NULL, (intptr_t) "" }, OPT_END(), }; @@ -52,6 +101,9 @@ int cmd_graph(int argc, const char **argv, const char *prefix) builtin_graph_options, builtin_graph_usage, 0); + if (opts.write + opts.read > 1) + usage_with_options(builtin_graph_usage, builtin_graph_options); + if (!opts.pack_dir) { struct strbuf path = STRBUF_INIT; strbuf_addstr(&path, get_object_directory()); @@ -59,6 +111,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) opts.pack_dir = strbuf_detach(&path, NULL); } + if (opts.read) + return graph_read(); if (opts.write) return graph_write(); diff --git a/packed-graph.c b/packed-graph.c index 9be981
[PATCH 10/14] packed-graph: teach git-graph --delete-expired
Teach git-graph to delete the graph previously referenced by 'graph_head' when writing a new graph file and updating 'graph_head'. This prevents data creep by storing a list of useless graphs. Be careful to not delete the graph if the file did not change. Signed-off-by: Derrick Stolee --- Documentation/git-graph.txt | 8 ++-- builtin/graph.c | 14 +- t/t5319-graph.sh| 37 +++-- 3 files changed, 54 insertions(+), 5 deletions(-) diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index f690699570..f4f1048d28 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -39,6 +39,10 @@ OPTIONS When used with --write, update the graph-head file to point to the written graph file. +--delete-expired:: + When used with --write and --update-head, delete the graph file + previously referenced by graph-head. + EXAMPLES @@ -55,10 +59,10 @@ $ git graph --write * Write a graph file for the packed commits in your local .git folder, -* and update graph-head. +* update graph-head, and delete the old graph-.graph file. + -$ git graph --write --update-head +$ git graph --write --update-head --delete-expired * Read basic information from a graph file. diff --git a/builtin/graph.c b/builtin/graph.c index ac15febc46..adf779b601 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), N_("git graph --clear [--pack-dir ]"), N_("git graph --read [--graph-id=]"), - N_("git graph --write [--pack-dir ] [--update-head]"), + N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired]"), NULL }; @@ -23,6 +23,7 @@ static struct opts_graph { const char *graph_id; int write; int update_head; + int delete_expired; int has_existing; struct object_id old_graph_oid; } opts; @@ -120,6 +121,15 @@ static int graph_write(void) if (graph_id) printf("%s\n", oid_to_hex(graph_id)); + if (opts.delete_expired && opts.update_head && opts.has_existing && + oidcmp(graph_id, &opts.old_graph_oid)) { + char *old_path = get_graph_filename_oid(opts.pack_dir, &opts.old_graph_oid); + if (remove_path(old_path)) + die("failed to remove path %s", old_path); + + free(old_path); + } + free(graph_id); return 0; } @@ -138,6 +148,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) N_("write graph file")), OPT_BOOL('u', "update-head", &opts.update_head, N_("update graph-head to written graph file")), + OPT_BOOL('d', "delete-expired", &opts.delete_expired, + N_("delete expired head graph file")), { OPTION_STRING, 'M', "graph-id", &opts.graph_id, N_("oid"), N_("An OID for a specific graph file in the pack-dir."), diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh index 311fb9dd67..a70c7bbb02 100755 --- a/t/t5319-graph.sh +++ b/t/t5319-graph.sh @@ -80,9 +80,42 @@ test_expect_success 'write graph with merges' \ _graph_read_expect "18" "${packdir}" && cmp expect output' +test_expect_success 'Add more commits' \ +'git reset --hard commits/3 && + for i in $(test_seq 16 20) + do +git commit --allow-empty -m "commit $i" && +git branch commits/$i + done && + git repack' + +test_expect_success 'write graph with merges' \ +'graph3=$(git graph --write --update-head --delete-expired) && + test_path_is_file ${packdir}/graph-${graph3}.graph && + test_path_is_missing ${packdir}/graph-${graph2}.graph && + test_path_is_file ${packdir}/graph-${graph1}.graph && + test_path_is_file ${packdir}/graph-head && + echo ${graph3} >expect && + cmp -n 40 expect ${packdir}/graph-head && + git graph --read --graph-id=${graph3} >output && + _graph_read_expect "23" "${packdir}" && + cmp expect output' + +test_expect_success 'write graph with nothing new' \ +'graph4=$(git graph --write --update-head --delete-expired) && + test_
[PATCH 12/14] packed-graph: read only from specific pack-indexes
Teach git-graph to inspect the objects only in a certain list of pack-indexes within the given pack directory. This allows updating the graph iteratively, since we add all commits stored in a previous packed graph. Signed-off-by: Derrick Stolee --- Documentation/git-graph.txt | 12 builtin/graph.c | 26 +++--- packed-graph.c | 27 +++ packed-graph.h | 2 +- packfile.c | 4 ++-- packfile.h | 2 ++ t/t5319-graph.sh| 10 ++ 7 files changed, 69 insertions(+), 14 deletions(-) diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index f4f1048d28..b68a61ddea 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -43,6 +43,11 @@ OPTIONS When used with --write and --update-head, delete the graph file previously referenced by graph-head. +--stdin-packs:: + When used with --write, generate the new graph by walking objects + only in the specified packfiles and any commits in the + existing graph-head. + EXAMPLES @@ -65,6 +70,13 @@ $ git graph --write $ git graph --write --update-head --delete-expired +* Write a graph file, extending the current graph file using commits +* in , update graph-head, and delete the old graph-.graph file. ++ + +$ echo | git graph --write --update-head --delete-expired --stdin-packs + + * Read basic information from a graph file. + diff --git a/builtin/graph.c b/builtin/graph.c index adf779b601..3cace3a18c 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), N_("git graph --clear [--pack-dir ]"), N_("git graph --read [--graph-id=]"), - N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired]"), + N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs]"), NULL }; @@ -24,6 +24,7 @@ static struct opts_graph { int write; int update_head; int delete_expired; + int stdin_packs; int has_existing; struct object_id old_graph_oid; } opts; @@ -113,7 +114,24 @@ static void update_head_file(const char *pack_dir, const struct object_id *graph static int graph_write(void) { - struct object_id *graph_id = construct_graph(opts.pack_dir); + struct object_id *graph_id; + char **pack_indexes = NULL; + int num_packs = 0; + int size_packs = 0; + + if (opts.stdin_packs) { + struct strbuf buf = STRBUF_INIT; + size_packs = 128; + ALLOC_ARRAY(pack_indexes, size_packs); + + while (strbuf_getline(&buf, stdin) != EOF) { + ALLOC_GROW(pack_indexes, num_packs + 1, size_packs); + pack_indexes[num_packs++] = buf.buf; + strbuf_detach(&buf, NULL); + } + } + + graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs); if (opts.update_head) update_head_file(opts.pack_dir, graph_id); @@ -150,7 +168,9 @@ int cmd_graph(int argc, const char **argv, const char *prefix) N_("update graph-head to written graph file")), OPT_BOOL('d', "delete-expired", &opts.delete_expired, N_("delete expired head graph file")), - { OPTION_STRING, 'M', "graph-id", &opts.graph_id, + OPT_BOOL('s', "stdin-packs", &opts.stdin_packs, + N_("only scan packfiles listed by stdin")), + { OPTION_STRING, 'G', "graph-id", &opts.graph_id, N_("oid"), N_("An OID for a specific graph file in the pack-dir."), PARSE_OPT_OPTARG, NULL, (intptr_t) "" }, diff --git a/packed-graph.c b/packed-graph.c index 343b231973..0dc68a077e 100644 --- a/packed-graph.c +++ b/packed-graph.c @@ -401,7 +401,7 @@ static int fill_packed_commit(struct commit *item, struct packed_graph *g, uint3 * 2. date * 3. parents. * - * Returns 1 iff the commit was found in the packed graph. + * Returns 1 if and only if the commit was found in the packed graph. * * See parse_commit_buffer() for the fallback after this call. */ @@ -427,7 +427,7 @@ int parse_packed_commit(struct commit *item) return fill_packed_commit(item, packed_graph, pos); } -
[PATCH 13/14] packed-graph: close under reachability
Teach construct_graph() to walk all parents from the commits discovered in packfiles. This prevents gaps given by loose objects or previously-missed packfiles. Signed-off-by: Derrick Stolee --- packed-graph.c | 26 ++ t/t5319-graph.sh | 14 ++ 2 files changed, 40 insertions(+) diff --git a/packed-graph.c b/packed-graph.c index 0dc68a077e..c93515f18e 100644 --- a/packed-graph.c +++ b/packed-graph.c @@ -5,6 +5,7 @@ #include "packfile.h" #include "commit.h" #include "object.h" +#include "revision.h" #include "packed-graph.h" #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ @@ -638,6 +639,29 @@ static int if_packed_commit_add_to_list(const struct object_id *oid, return 0; } +static void close_reachable(struct packed_oid_list *oids) +{ + int i; + struct rev_info revs; + struct commit *commit; + init_revisions(&revs, NULL); + + for (i = 0; i < oids->num; i++) { + commit = lookup_commit(oids->list[i]); + if (commit && !parse_commit(commit)) + revs.commits = commit_list_insert(commit, &revs.commits); + } + + if (prepare_revision_walk(&revs)) + die(_("revision walk setup failed")); + + while ((commit = get_revision(&revs)) != NULL) { + ALLOC_GROW(oids->list, oids->num + 1, oids->size); + oids->list[oids->num] = &(commit->object.oid); + (oids->num)++; + } +} + struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int nr_packs) { // Find a list of oids, adding the pointer to a list. @@ -698,6 +722,8 @@ struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int } else { for_each_packed_object(if_packed_commit_add_to_list, &oids, 0); } + + close_reachable(&oids); QSORT(oids.list, oids.num, commit_compare); count_distinct = 1; diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh index 969150cd21..8bf5a0c993 100755 --- a/t/t5319-graph.sh +++ b/t/t5319-graph.sh @@ -212,6 +212,20 @@ test_expect_success 'clear graph' \ _graph_git_behavior commits/20 merge/1 _graph_git_behavior commits/20 merge/2 +test_expect_success 'build graph from latest pack with closure' \ +'graph5=$(cat new-idx | git graph --write --update-head --stdin-packs) && + test_path_is_file ${packdir}/graph-${graph5}.graph && + test_path_is_file ${packdir}/graph-${graph1}.graph && + test_path_is_file ${packdir}/graph-head && + echo ${graph5} >expect && + cmp -n 40 expect ${packdir}/graph-head && + git graph --read --graph-id=${graph5} >output && + _graph_read_expect "21" "${packdir}" && + cmp expect output' + +_graph_git_behavior commits/20 merge/1 +_graph_git_behavior commits/20 merge/2 + test_expect_success 'setup bare repo' \ 'cd .. && git clone --bare full bare && -- 2.16.0
[PATCH 11/14] commit: integrate packed graph with commit parsing
Teach Git to inspect a packed graph to supply the contents of a struct commit when calling parse_commit_gently(). This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date. The only loosely-expected condition is that the commit buffer is loaded into the cache. This was checked in log-tree.c:show_log(), but the "return;" on failure produced unexpected results (i.e. the message line was never terminated). The new behavior of loading the buffer when needed prevents the unexpected behavior. If core.graph is false, then do not load the graph and behave as usual. In test script t5319-graph.sh, add output-matching conditions on read- only graph operations. By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commits walks. Here are some performance results for a copy of the Linux repository where 'master' has 704,766 reachable commits and is behind 'origin/master' by 19,610 commits. | Command | Before | After | Rel % | |--|||---| | log --oneline --topo-order -1000 | 5.9s | 0.7s | -88% | | branch -vv | 0.42s | 0.27s | -35% | | rev-list --all | 6.4s | 1.0s | -84% | | rev-list --all --objects | 32.6s | 27.6s | -15% | Signed-off-by: Derrick Stolee --- alloc.c | 1 + commit.c | 20 - commit.h | 2 + log-tree.c | 3 +- packed-graph.c | 242 +++ packed-graph.h | 18 + t/t5319-graph.sh | 114 -- 7 files changed, 387 insertions(+), 13 deletions(-) diff --git a/alloc.c b/alloc.c index 12afadfacd..4a4dcfa2b7 100644 --- a/alloc.c +++ b/alloc.c @@ -93,6 +93,7 @@ void *alloc_commit_node(void) struct commit *c = alloc_node(&commit_state, sizeof(struct commit)); c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); + c->graphId = 0x; return c; } diff --git a/commit.c b/commit.c index cab8d4455b..253c102808 100644 --- a/commit.c +++ b/commit.c @@ -12,6 +12,7 @@ #include "prio-queue.h" #include "sha1-lookup.h" #include "wt-status.h" +#include "packed-graph.h" static struct commit_extra_header *read_commit_extra_header_lines(const char *buf, size_t len, const char **); @@ -374,7 +375,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s return 0; } -int parse_commit_gently(struct commit *item, int quiet_on_missing) +int parse_commit_internal(struct commit *item, int quiet_on_missing, int check_packed) { enum object_type type; void *buffer; @@ -383,19 +384,27 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) if (!item) return -1; + + // If we already parsed, but got it from the graph, then keep going! if (item->object.parsed) return 0; + + if (check_packed && parse_packed_commit(item)) + return 0; + buffer = read_sha1_file(item->object.oid.hash, &type, &size); if (!buffer) return quiet_on_missing ? -1 : error("Could not read %s", -oid_to_hex(&item->object.oid)); + oid_to_hex(&item->object.oid)); if (type != OBJ_COMMIT) { free(buffer); return error("Object %s not a commit", -oid_to_hex(&item->object.oid)); + oid_to_hex(&item->object.oid)); } + ret = parse_commit_buffer(item, buffer, size); + if (save_commit_buffer && !ret) { set_commit_buffer(item, buffer, size); return 0; @@ -404,6 +413,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return ret; } +int parse_commit_gently(struct commit *item, int quiet_on_missing) +{ + return parse_commit_internal(item, quiet_on_missing, 1); +} + void parse_commit_or_die(struct commit *item) { if (parse_commit(item)) diff --git a/commit.h b/commit.h index 8c68ca1a5a..02f5f2a182 100644 --- a/commit.h +++ b/commit.h @@ -21,6 +21,7 @@ struct commit { timestamp_t date; struct commit_list *parents; struct tree *tree; + uint32_t graphId; }; extern int save_commit_buffer; @@ -60,6 +61,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size); +extern int parse_commit_internal(struct commit *item, int quiet_on_mis
Re: [PATCH 00/14] Serialized Commit Graph
On 1/25/2018 10:46 AM, Ævar Arnfjörð Bjarmason wrote: On Thu, Jan 25 2018, Derrick Stolee jotted: * 'git log --topo-order -1000' walks all reachable commits to avoid incorrect topological orders, but only needs the commit message for the top 1000 commits. * 'git merge-base ' may walk many commits to find the correct boundary between the commits reachable from A and those reachable from B. No commit messages are needed. * 'git branch -vv' checks ahead/behind status for all local branches compared to their upstream remote branches. This is essentially as hard as computing merge bases for each. This is great, spotted / questions so far: * git graph --blah says you need to enable the config, should say "unknown option --blah ". I.e. overzelous config guard. This is a good point. * On a big repo (git show-ref -s | ~/g/git/git-graph --write --update-head) is as of writing this still hanging for me, but strace shows it's brk()-ing. Presumably just still busy, a progress bar would be very nice. Oops! This is my mistake. The correct command should be: git show-ref -s | git graph --write --update-head --stdin-commits Without "--stdin-commits" the command will walk all packed objects to look for commits and then build the graph. That's why it's taking so long. That method takes several minutes on the Linux repo, but with --stdin-commits it should take as long as "git log >/dev/null". * Shouldn't there be a pack.useGraph option so this gets auto-updated on repack? I understand this series is a WIP, so that's more a "is that the UI" than "it needs now". This will definitely be part of a follow-up patch. Thanks, -Stolee
Re: [PATCH 02/14] packed-graph: add core.graph setting
On 1/25/2018 3:17 PM, Stefan Beller wrote: On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee wrote: The packed graph feature is controlled by the new core.graph config setting. This defaults to 0, so the feature is opt-in. The intention of core.graph is that a user can always stop checking for or parsing packed graph files if core.graph=0. @@ -825,6 +825,7 @@ extern char *git_replace_ref_base; extern int fsync_object_files; extern int core_preload_index; extern int core_apply_sparse_checkout; +extern int core_graph; Putting it here instead of say the_repository makes sense as you'd want to use this feature globally. However you can still have the config different per repository (e.g. version number of the graph setting, as one might be optimized for speed and the other for file size of the .graph file or such). So not sure if we'd rather want to put this into the repository struct. But then again the other core settings aren't there either and this feature sounds like it is repository specific only in the experimental phase; later it is expected to be on everywhere? I do think that more things should go in the repository struct. Unfortunately, that is not the world we live in. However, to make things clearer I'm following the pattern currently in master. You'll see the same with the global 'packed_graph' pointer, similar to 'packed_git'. I think these should be paired together until the repository absorbs them. If other 'core_*' variables move to the repository, I'm happy to move core_graph. If 'packed_git' moves to the repository, I'm happy to move 'packed_git'. However, if there is significant interest in moving all new state to the repository, then I'll move these values there. Let's have that discussion here instead of spread around the rest of the patch. Thanks, -Stolee
Re: [PATCH 00/14] Serialized Commit Graph
On 1/25/2018 6:06 PM, Ævar Arnfjörð Bjarmason wrote: On Thu, Jan 25 2018, Derrick Stolee jotted: Oops! This is my mistake. The correct command should be: git show-ref -s | git graph --write --update-head --stdin-commits Without "--stdin-commits" the command will walk all packed objects to look for commits and then build the graph. That's why it's taking so long. That method takes several minutes on the Linux repo, but with --stdin-commits it should take as long as "git log >/dev/null". Thanks, it took around 15m to finish with the command I initially ran on my test repo. Then the `merge-base --is-ancestor` performance problem I was complaining about in https://public-inbox.org/git/87608bawoa@evledraar.gmail.com/ takes around 1s with your series, 5s without it. Nice. Thanks for testing this! May I ask how many commits are in your repo? One way to find out is to run 'git graph --read' and it will tell you how many commits are in the serialized graph. Thanks, -Stolee
Re: [PATCH 01/14] graph: add packed graph design document
On 1/25/2018 3:04 PM, Stefan Beller wrote: On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee wrote: Add Documentation/technical/packed-graph.txt with details of the planned packed graph feature, including future plans. Signed-off-by: Derrick Stolee --- Documentation/technical/packed-graph.txt | 185 +++ 1 file changed, 185 insertions(+) create mode 100644 Documentation/technical/packed-graph.txt diff --git a/Documentation/technical/packed-graph.txt b/Documentation/technical/packed-graph.txt new file mode 100644 index 00..fcc0c83874 --- /dev/null +++ b/Documentation/technical/packed-graph.txt @@ -0,0 +1,185 @@ +Git Packed Graph Design Notes += + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows above 100K. How did you come up with that specific number? (Is it platform dependent?) I'd avoid a specific number to not derail the reader here into wondering how this got measured. Using a specific number was a mistake. Git can walk ~100K commits per second by parsing commits, in my tests on my machine. I'll instead say "commit count grows." +The merge base calculation shows up in many user-facing commands, such +as 'status' and 'fetch' and can take minutes to compute depending on +data shape. There are two main costs here: status needs the walk for the ahead/behind computation which is (1)? (I forget how status would need to compute a merge-base) 'status' computes the ahead/behind counts using paint_down_to_common(). This is a more robust method than simply computing merge bases, but the possible merge bases are found as a result. fetch is a networked command, which traditionally in Git is understood as "can be slow" because you might be in Australia, or the connection is slow otherwise. So giving this as an example it is not obvious that the DAG walking is the bottleneck. Maybe git-merge or "git show --remerge-diff" [1] are better examples for walk-intensive commands? [1] https://public-inbox.org/git/cover.1409860234.git...@thomasrast.ch/ never landed, so maybe that is a bad example. But for me that command is more obviously dependent on cheap walking the DAG compared to fetch. So, take my comments with a grain of salt! Actually, a 'fetch' command does the same ahead/behind calculation as 'status', and in GVFS repos we have seen that walk take 30s per branch when comparing local and remote copies a fast-moving branch. Yes, there are other (usually) more expensive things in 'fetch' so I'll drop that reference.. +1. Decompressing and parsing commits. +2. Walking the entire graph to avoid topological order mistakes. + +The packed graph is a file that stores the commit graph structure along +with some extra metadata to speed up graph walks. This format allows a +consumer to load the following info for a commit: + +1. The commit OID. +2. The list of parents. +3. The commit date. +4. The root tree OID. +5. An integer ID for fast lookups in the graph. +6. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). This new format is specifically removing the cost of decompression and parsing (1) completely, whereas (2) we still have to walk the entire graph for now as the generation numbers are not fully used as of yet, but provided. A major goal of this work is to provide a place to store computed generation numbers so we can not walk the entire graph. I mention this here because 'git log -' is O(n) (due to commit-date heuristics that prevent walking the entire graph) while 'git log --topo-order -' is O(T) where T is the total number of reachable commits. +By providing an integer ID we can avoid lookups in the graph as we walk +commits. Specifically, we need to provide the integer ID of the parent +commits so we navigate directly to their information on request. Does this mean we decrease the pressure on fast lookups in packfiles/loose objects? Yes, we do. In fact, when profiling 'git log --topo-order -1000', I noticed that 30-50% of the time (after this patch) is spent in lookup_tree(). If we can prevent checking the ODB for the existence of these trees until they are needed, we can get additional speedups. It is a bit wasteful that we are loading these trees even when we will never use them (such as computing merge bases). +Define the "generation number" of a commit recursively as follows: + * A commit with no parents (a root commit) has generation number 1. + * A commit with at least one parent has generation number 1 more than + the largest generation number among its parents. +Equivalently, the generation number is one more than the length of a
Re: [PATCH 01/14] graph: add packed graph design document
On 1/25/2018 4:14 PM, Junio C Hamano wrote: Derrick Stolee writes: Add Documentation/technical/packed-graph.txt with details of the planned packed graph feature, including future plans. Signed-off-by: Derrick Stolee --- Documentation/technical/packed-graph.txt | 185 +++ 1 file changed, 185 insertions(+) create mode 100644 Documentation/technical/packed-graph.txt I really wanted to like having this patch at the beginning, but unfortunatelly I didn't see the actual file format description, which was a bit disappointing. An example of the things that I was curious about was how the "integer ID" is used to access into the file. If we could somehow use "integer ID" as an index into an array of fixed size elements, it would be ideal to gain "fast lookups", but because of the "list of parents" thing, it needs some trickery to do so, and that was among the things that I wanted to see how much thought went into the design, for example. There is definitely a chicken-or-the-egg situation here. I'm happy to start with the format before the design document. I can try to expand this "integer ID" concept, but you can see how I use it in the following method from patch 11/14: +int parse_packed_commit(struct commit *item) +{ + if (!core_graph) + return 0; + if (item->object.parsed) + return 1; + + prepare_packed_graph(); + if (packed_graph) { + uint32_t pos; + int found; + if (item->graphId != 0x) { + pos = item->graphId; + found = 1; + } else { + found = bsearch_graph(packed_graph, &(item->object.oid), &pos); + } + + if (found) + return fill_packed_commit(item, packed_graph, pos); + } + + return 0; +} Note that if item->graphId has a "real" value (not 0x which in hindsight should be a macro) then we navigate directly to that position in the graph. Otherwise, we use binary search to query the graph's commit list to find the position (if the commit is packed). diff --git a/Documentation/technical/packed-graph.txt b/Documentation/technical/packed-graph.txt new file mode 100644 index 00..fcc0c83874 --- /dev/null +++ b/Documentation/technical/packed-graph.txt @@ -0,0 +1,185 @@ +Git Packed Graph Design Notes += + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows above 100K. +The merge base calculation shows up in many user-facing commands, such +as 'status' and 'fetch' and can take minutes to compute depending on +data shape. There are two main costs here: s/data shape/history shape/ may make it even clearer. +1. The commit OID. +2. The list of parents. +3. The commit date. +4. The root tree OID. +5. An integer ID for fast lookups in the graph. +6. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +By providing an integer ID we can avoid lookups in the graph as we walk +commits. Specifically, we need to provide the integer ID of the parent +commits so we navigate directly to their information on request. Commits created after a packed graph file is built may of course not appear in a packed graph file, but that is OK because they never need to be listed as parents of commits in the file. So "list of parents" can always refer to the parents using the "integer ID for fast lookup". One thing I need to test locally is what happens with boundary commits of a shallow clone. The commit's parents are not in the repo, so they will not be in the graph. I think that parse_commit_buffer() drops the parents, so the graph will treat them as root commits. Makes sense. Item 2. might want to say "The list of parents, using the fast lookup integer ID (see 5.) as reference instead of OID", though. That will be more specific, thanks. +Define the "generation number" of a commit recursively as follows: + * A commit with no parents (a root commit) has generation number 1. + * A commit with at least one parent has generation number 1 more than + the largest generation number among its parents. +Equivalently, the generation number is one more than the length of a +longest path from the commit to a root commit. When a commit A can reach roots X and Y, and Y is further than X, the distance between Y and A becomes A's generation number. "One more than the length of the path from the commit to the furthest root commit it can reach", in other words. My "Equivalently,..." sentence is mang
Re: [PATCH 02/14] packed-graph: add core.graph setting
On 1/25/2018 4:43 PM, Junio C Hamano wrote: Derrick Stolee writes: The packed graph feature is controlled by the new core.graph config setting. This defaults to 0, so the feature is opt-in. The intention of core.graph is that a user can always stop checking for or parsing packed graph files if core.graph=0. Signed-off-by: Derrick Stolee --- Documentation/config.txt | 3 +++ cache.h | 1 + config.c | 5 + environment.c| 1 + 4 files changed, 10 insertions(+) Before you get too married to the name "graph", is it reasonable to assume that the commit ancestry graph is the primary "graph" that should come to users' minds when a simple word "graph" is used in the context of discussing Git? I suspect not. Let's not just call this "core.graph" and "packed-graph", and in addition give some adjective before "graph". I was too focused that I wanted the word "graph" but "graph.c" already existed in source root that I came up with "packed-graph.c" just to have a separate filename. Clearly, "commit-graph" should be used instead. In v2, I'll use "/commit-graph.c" and "/builtin/commit-graph.c". Thanks, -Stolee
Re: [PATCH 03/14] packed-graph: create git-graph builtin
On 1/25/2018 4:45 PM, Stefan Beller wrote: On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee wrote: Teach Git the 'graph' builtin that will be used for writing and reading packed graph files. The current implementation is mostly empty, except for a check that the core.graph setting is enabled and a '--pack-dir' option. I wonder if this builtin should not respect the boolean core graph, as this new builtin commands' whole existence is to deal with these new files? As you assume this builtin as a plumbing command, I would expect it to pay less attention to config rather than more. My thought was to alert the caller "This graph isn't going to be good for anything!" and fail quickly before doing work. You do have a good point, and I think we can remove that condition here. When we integrate with other commands ('repack', 'fetch', 'clone') we will want a different setting that signals automatically writing the graph and we don't want those to fail because they are not aware of a second config setting. @@ -408,6 +408,7 @@ static struct cmd_struct commands[] = { { "fsck-objects", cmd_fsck, RUN_SETUP }, { "gc", cmd_gc, RUN_SETUP }, { "get-tar-commit-id", cmd_get_tar_commit_id }, + { "graph", cmd_graph, RUN_SETUP_GENTLY }, Why gently, though? From reading the docs (and assumptions on further patches) we'd want to abort if there is no .git dir to be found? Or is a future patch having manual logic? (e.g. if pack-dir is given, the command may be invoked from outside a git dir) You are right. I inherited this from my MIDX patch which can operate on a list of IDX files without a .git folder. The commit graph operations need an ODB. Thanks, -Stolee
Re: [PATCH 03/14] packed-graph: create git-graph builtin
On 1/25/2018 6:01 PM, Junio C Hamano wrote: Derrick Stolee writes: Teach Git the 'graph' builtin that will be used for writing and reading packed graph files. The current implementation is mostly empty, except for a check that the core.graph setting is enabled and a '--pack-dir' option. Just to set my expectation straight. Is it fair to say that in the ideal endgame state, this will be like "git pack-objects" in that end users won't have to know about it, but would serve as a crucial building block that is invoked by other front-end commands that are more familiar to end users (just like pack-objects are used behind the scenes by repack, push, etc.)? That is my hope. Leaving that integration for later, after this feature has proven itself.
Re: [PATCH 04/14] packed-graph: add format document
On 1/25/2018 5:06 PM, Junio C Hamano wrote: Derrick Stolee writes: Add document specifying the binary format for packed graphs. This format allows for: * New versions. * New hash functions and hash lengths. * Optional extensions. Basic header information is followed by a binary table of contents into "chunks" that include: * An ordered list of commit object IDs. * A 256-entry fanout into that list of OIDs. * A list of metadata for the commits. * A list of "large edges" to enable octopus merges. Signed-off-by: Derrick Stolee --- Documentation/technical/graph-format.txt | 88 1 file changed, 88 insertions(+) create mode 100644 Documentation/technical/graph-format.txt diff --git a/Documentation/technical/graph-format.txt b/Documentation/technical/graph-format.txt new file mode 100644 index 00..a15e1036d7 --- /dev/null +++ b/Documentation/technical/graph-format.txt @@ -0,0 +1,88 @@ +Git commit graph format +=== Good that this is not saying "graph format" but is explicit that it is about "commit". Do the same for the previous steps. Especially, builtin/graph.c that does not have much to do with graph.c is not a good way forward ;-) :+1: I do like the fact that later parents of octopus merges are moved out of way to make the majority of records fixed length, but I am not sure if the "up to two parents are recorded in line" is truly the best arrangement. Aren't majority of commits single-parent, thereby wasting 4 bytes almost always? Will 32-bit stay to be enough for everybody? Wouldn't it make sense to at least define them to be indices into arrays (i.e. scaled to element size), not "offsets", to recover a few lost bits? I incorrectly used the word "offset" when I mean "array position" for the edge values. What's the point of storing object id length? If you do not understand the object ID scheme, knowing only the length would not do you much good anyway, no? And if you know the hashing scheme specified by Object ID version, you already know the length, no? I'll go read the OID transition document to learn more, but I didn't know if there were plans for things like "Use SHA3 but with different hash lengths depending on user requirements". One side benefit is that we immediately know the width of our commit and tree references within the commit graph file without needing to consult a table of hash definitions. On 1/25/2018 5:18 PM, Stefan Beller wrote: git.git has ~37k non-merge commits and ~12k merge commits, (35 of them have 3 or more parents). So 75% would waste the 4 bytes of the second parent. However the first parent is still there, so any operation that only needs the first parent (git bisect --first-parent?) would still be fast. Not sure if that is common. The current API boundary does not support this, as parse_commit_gently() is not aware of the --first-parent option. The benefits of injecting that information are probably not worth the complication. On 1/25/2018 5:29 PM, Junio C Hamano wrote: Stefan Beller writes: The downside of just having one parent or pointer into the edge list would be to penalize 25% of the commit lookups with an indirection compared to ~0% (the 35 octopus'). I'd rather want to optimize for speed than disk size? (4 bytes for 37k is 145kB for git.git, which I find is not a lot). My comment is not about disk size but is about the size of working set (or "size of array element"). I do want to optimize for speed over space, at least for two-parent commits. Hopefully my clarification about offset/array position clarifies Junio's concerns here. Thanks, -Stolee
Re: [PATCH 04/14] packed-graph: add format document
On 1/25/2018 5:07 PM, Stefan Beller wrote: On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee wrote: Add document specifying the binary format for packed graphs. This format allows for: * New versions. * New hash functions and hash lengths. * Optional extensions. Basic header information is followed by a binary table of contents into "chunks" that include: * An ordered list of commit object IDs. * A 256-entry fanout into that list of OIDs. * A list of metadata for the commits. * A list of "large edges" to enable octopus merges. Signed-off-by: Derrick Stolee --- Documentation/technical/graph-format.txt | 88 So this is different from Documentation/technical/packed-graph.txt, which gives high level design and this gives the details on how to set bits. 1 file changed, 88 insertions(+) create mode 100644 Documentation/technical/graph-format.txt diff --git a/Documentation/technical/graph-format.txt b/Documentation/technical/graph-format.txt new file mode 100644 index 00..a15e1036d7 --- /dev/null +++ b/Documentation/technical/graph-format.txt @@ -0,0 +1,88 @@ +Git commit graph format +=== + +The Git commit graph stores a list of commit OIDs and some associated +metadata, including: + +- The generation number of the commit. Commits with no parents have + generation number 1; commits with parents have generation number + one more than the maximum generation number of its parents. We + reserve zero as special, and can be used to mark a generation + number invalid or as "not computed". + +- The root tree OID. + +- The commit date. + +- The parents of the commit, stored using positional references within + the graph file. + +== graph-*.graph files have the following format: + +In order to allow extensions that add extra data to the graph, we organize +the body into "chunks" and provide a binary lookup table at the beginning +of the body. The header includes certain values, such as number of chunks, +hash lengths and types. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'C', 'G', 'P', 'H'} + + 1-byte version number: + Currently, the only valid version is 1. + + 1-byte Object Id Version (1 = SHA-1) + + 1-byte Object Id Length (H) This is 20 or 40 for sha1 ? (binary or text representation?) 20 for binary. + 1-byte number (C) of "chunks" + +CHUNK LOOKUP: + + (C + 1) * 12 bytes listing the table of contents for the chunks: + First 4 bytes describe chunk id. Value 0 is a terminating label. + Other 8 bytes provide offset in current file for chunk to start. ... offset [in bytes/words/4k blocks?] in ... bytes. + (Chunks are ordered contiguously in the file, so you can infer + the length using the next chunk position if necessary.) + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). So F[0] > 0 for git.git for example. Or another way: To lookup a 01xxx, I need to look at entry(F[00] + 1 )...entry(F[01]). Makes sense. + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph. ... sorted ascending. + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) + * The first H bytes are for the OID of the root tree. + * The next 8 bytes are for the int-ids of the first two parents of + the ith commit. Stores value 0x if no parent in that position. + If there are more than two parents, the second value has its most- + significant bit on and the other bits store an offset into the Large + Edge List chunk. s/an offset into/position in/ ? (otherwise offset in bytes?) + * The next 8 bytes store the generation number of the commit and the + commit time in seconds since EPOCH. The generation number uses the + higher 30 bits of the first 4 bytes, while the commit time uses the + 32 bits of the second 4 bytes, along with the lowest 2 bits of the + lowest byte, storing the 33rd and 34th bit of the commit time. This allows for a maximum generation number of 1.073.741.823 (2^30 -1) = 1 billion, and a max time stamp of later than 2100. Do you allow negative time stamps? + + [Optional] Large Edge List (ID: