[PATCH v2 01/12] commit-graph: add 'verify' subcommand

2018-05-11 Thread Derrick Stolee
If the commit-graph file becomes corrupt, we need a way to verify
that its contents match the object database. In the manner of
'git fsck' we will implement a 'git commit-graph verify' subcommand
to report all issues with the file.

Add the 'verify' subcommand to the 'commit-graph' builtin and its
documentation. The subcommand is currently a no-op except for
loading the commit-graph into memory, which may trigger run-time
errors that would be caught by normal use. Add a simple test that
ensures the command returns a zero error code.

If no commit-graph file exists, this is an acceptable state. Do
not report any errors.

During review, we noticed that a FREE_AND_NULL(graph_name) was
placed after a possible 'return', and this pattern was also in
graph_read(). Fix that case, too.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-commit-graph.txt |  6 ++
 builtin/commit-graph.c | 40 +-
 commit-graph.c |  5 +
 commit-graph.h |  2 ++
 t/t5318-commit-graph.sh| 10 ++
 5 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 4c97b555cc..a222cfab08 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 
 [verse]
 'git commit-graph read' [--object-dir ]
+'git commit-graph verify' [--object-dir ]
 'git commit-graph write'  [--object-dir ]
 
 
@@ -52,6 +53,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'verify'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to check for corrupted data.
+
 
 EXAMPLES
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..af3101291f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,10 +8,16 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
+   N_("git commit-graph verify [--object-dir ]"),
N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
+static const char * const builtin_commit_graph_verify_usage[] = {
+   N_("git commit-graph verify [--object-dir ]"),
+   NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
N_("git commit-graph read [--object-dir ]"),
NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
int append;
 } opts;
 
+
+static int graph_verify(int argc, const char **argv)
+{
+   struct commit_graph *graph = 0;
+   char *graph_name;
+
+   static struct option builtin_commit_graph_verify_options[] = {
+   OPT_STRING(0, "object-dir", &opts.obj_dir,
+  N_("dir"),
+  N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_verify_options,
+builtin_commit_graph_verify_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   graph_name = get_commit_graph_filename(opts.obj_dir);
+   graph = load_commit_graph_one(graph_name);
+   FREE_AND_NULL(graph_name);
+
+   if (!graph)
+   return 0;
+
+   return verify_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
struct commit_graph *graph = NULL;
@@ -50,10 +86,10 @@ static int graph_read(int argc, const char **argv)
 
graph_name = get_commit_graph_filename(opts.obj_dir);
graph = load_commit_graph_one(graph_name);
+   FREE_AND_NULL(graph_name);
 
if (!graph)
die("graph file %s does not exist", graph_name);
-   FREE_AND_NULL(graph_name);
 
printf("header: %08x %d %d %d %d\n",
ntohl(*(uint32_t*)graph->data),
@@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const 
char *prefix)
 PARSE_OPT_STOP_AT_NON_OPTION);
 
if (argc > 0) {
+   if (!strcmp(argv[0], "verify"))
+   return graph_verify(argc, argv);
if (!strcmp(argv[0], "read"))
return graph_read(argc, argv);
if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..b25aaed128 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -817,3 +817,8 @@ vo

[PATCH v2 03/12] commit-graph: test that 'verify' finds corruption

2018-05-11 Thread Derrick Stolee
Add test cases to t5318-commit-graph.sh that corrupt the commit-graph
file and check that the 'git commit-graph verify' command fails. These
tests verify the header and chunk information is checked carefully.

Helped-by: Martin Ågren 
Signed-off-by: Derrick Stolee 
---
 t/t5318-commit-graph.sh | 53 +
 1 file changed, 53 insertions(+)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6ca451dfd2..0cb88232fa 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -240,4 +240,57 @@ test_expect_success 'git commit-graph verify' '
git commit-graph verify >output
 '
 
+# usage: corrupt_data   []
+corrupt_data() {
+   file=$1
+   pos=$2
+   data="${3:-\0}"
+   printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}
+
+test_expect_success 'detect bad signature' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 0 "\0" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 1 verify-errors &&
+   grep "graph signature" verify-errors
+'
+
+test_expect_success 'detect bad version number' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 4 "\02" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 1 verify-errors &&
+   grep "graph version" verify-errors
+'
+
+test_expect_success 'detect bad hash version' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 5 "\02" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 1 verify-errors &&
+   grep "hash version" verify-errors
+'
+
+test_expect_success 'detect too small chunk-count' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 6 "\01" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 2 verify-errors &&
+   grep "missing the OID Lookup chunk" verify-errors &&
+   grep "missing the Commit Data chunk" verify-errors
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v2 00/12] Integrate commit-graph into fsck and gc

2018-05-11 Thread Derrick Stolee
I'm sending this v2 re-roll rather quickly after the previous version
because Martin provided a framework to add tests to the 'verify'
subcommand. I took that framework and added tests for the other checks
of the commit-graph data. This also found some interesting things about
the command:

1. There were some segfaults because we were not checking for bad data
   carefully enough.

2. To avoid segfaults, we will now terminate the check early if we find
   problems earlier in the file, such as in the header, or OID lookup.

3. We were not writing newlines between reports. This now happens by
   default in graph_report().

The integration into 'fetch' is dropped (thanks Ævar!).

Derrick Stolee (12):
  commit-graph: add 'verify' subcommand
  commit-graph: verify file header information
  commit-graph: test that 'verify' finds corruption
  commit-graph: parse commit from chosen graph
  commit-graph: verify fanout and lookup table
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: verify commit contents against odb
  fsck: verify commit-graph
  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  commit-graph: update design document

 Documentation/config.txt |   6 +
 Documentation/git-commit-graph.txt   |  14 ++-
 Documentation/git-fsck.txt   |   3 +
 Documentation/git-gc.txt |   4 +
 Documentation/technical/commit-graph.txt |  22 
 builtin/commit-graph.c   |  81 -
 builtin/fsck.c   |  21 
 builtin/gc.c |   8 ++
 commit-graph.c   | 199 ++-
 commit-graph.h   |   2 +
 commit.c |  13 +-
 commit.h |   1 +
 t/t5318-commit-graph.sh  | 173 +++
 13 files changed, 509 insertions(+), 38 deletions(-)


base-commit: 34fdd433396ee0e3ef4de02eb2189f8226eafe4e
-- 
2.16.2.329.gfb62395de6



[PATCH v2 11/12] gc: automatically write commit-graph files

2018-05-11 Thread Derrick Stolee
The commit-graph file is a very helpful feature for speeding up git
operations. In order to make it more useful, write the commit-graph file
by default during standard garbage collection operations.

Add a 'gc.commitGraph' config setting that triggers writing a
commit-graph file after any non-trivial 'git gc' command. Defaults to
false while the commit-graph feature matures. We specifically do not
want to turn this on by default until the commit-graph feature is fully
integrated with history-modifying features like shallow clones.

Signed-off-by: Derrick Stolee 
---
 Documentation/config.txt | 6 ++
 Documentation/git-gc.txt | 4 
 builtin/gc.c | 8 
 3 files changed, 18 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 11f027194e..9a3abd87e7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1553,6 +1553,12 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
+gc.commitGraph::
+   If true, then gc will rewrite the commit-graph file after any
+   change to the object database. If '--auto' is used, then the
+   commit-graph will not be updated unless the threshold is met.
+   See linkgit:git-commit-graph[1] for details.
+
 gc.logExpiry::
If the file gc.log exists, then `git gc --auto` won't run
unless that file is more than 'gc.logExpiry' old.  Default is
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..17dd654a59 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` 
determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable 'gc.commitGraph' determines if
+'git gc' runs 'git commit-graph write'. This can be set to a boolean
+value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified.  The larger
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..8403445738 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -34,6 +34,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_commit_graph = 0;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT;
 static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
+static struct argv_array commit_graph = ARGV_ARRAY_INIT;
 
 static struct tempfile *pidfile;
 static struct lock_file log_lock;
@@ -121,6 +123,7 @@ static void gc_config(void)
git_config_get_int("gc.aggressivedepth", &aggressive_depth);
git_config_get_int("gc.auto", &gc_auto_threshold);
git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
+   git_config_get_bool("gc.commitgraph", &gc_commit_graph);
git_config_get_bool("gc.autodetach", &detach_auto);
git_config_get_expiry("gc.pruneexpire", &prune_expire);
git_config_get_expiry("gc.worktreepruneexpire", 
&prune_worktrees_expire);
@@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
argv_array_pushl(&prune, "prune", "--expire", NULL);
argv_array_pushl(&prune_worktrees, "worktree", "prune", "--expire", 
NULL);
argv_array_pushl(&rerere, "rerere", "gc", NULL);
+   argv_array_pushl(&commit_graph, "commit-graph", "write", "--reachable", 
NULL);
 
/* default expiry time, overwritten in gc_config */
gc_config();
@@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
if (pack_garbage.nr > 0)
clean_pack_garbage();
 
+   if (gc_commit_graph)
+   if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD))
+   return error(FAILED_RUN, commit_graph.argv[0]);
+
if (auto_gc && too_many_loose_objects())
warning(_("There are too many unreachable loose objects; "
"run 'git prune' to remove them."));
-- 
2.16.2.329.gfb62395de6



[PATCH v2 10/12] commit-graph: add '--reachable' option

2018-05-11 Thread Derrick Stolee
When writing commit-graph files, it can be convenient to ask for all
reachable commits (starting at the ref set) in the resulting file. This
is particularly helpful when writing to stdin is complicated, such as a
future integration with 'git gc' which will call
'git commit-graph write --reachable' after performing cleanup of the
object database.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-commit-graph.txt |  8 ++--
 builtin/commit-graph.c | 41 ++
 2 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index a222cfab08..cc1715a823 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in 
packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified pack-indexes. (Cannot be combined
-with --stdin-commits.)
+with --stdin-commits or --reachable.)
 +
 With the `--stdin-commits` option, generate the new commit graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
---stdin-packs.)
+--stdin-packs or --reachable.)
++
+With the `--reachable` option, generate the new commit graph by walking
+commits starting at all refs. (Cannot be combined with --stdin-commits
+or --stind-packs.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index af3101291f..7cb94a4813 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -3,13 +3,14 @@
 #include "dir.h"
 #include "lockfile.h"
 #include "parse-options.h"
+#include "refs.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
N_("git commit-graph verify [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
+   int reachable;
int stdin_packs;
int stdin_commits;
int append;
@@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv)
return 0;
 }
 
+struct hex_list {
+   char **hex_strs;
+   int hex_nr;
+   int hex_alloc;
+};
+
+static int add_ref_to_list(const char *refname,
+  const struct object_id *oid,
+  int flags, void *cb_data)
+{
+   struct hex_list *list = (struct hex_list*)cb_data;
+
+   ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
+   list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
+   strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));
+   list->hex_nr++;
+   return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
const char **pack_indexes = NULL;
@@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv)
OPT_STRING(0, "object-dir", &opts.obj_dir,
N_("dir"),
N_("The object directory to store the graph")),
+   OPT_BOOL(0, "reachable", &opts.reachable,
+   N_("start walk at all refs")),
OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
N_("scan pack-indexes listed by stdin for commits")),
OPT_BOOL(0, "stdin-commits", &opts.stdin_commits,
@@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv)
 builtin_commit_graph_write_options,
 builtin_commit_graph_write_usage, 0);
 
-   if (opts.stdin_packs && opts.stdin_commits)
-   die(_("cannot use both --stdin-commits and --stdin-packs"));
+   if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
+   die(_("use at most one of --reachable, --stdin-commits, or 
--stdin-packs"));

[PATCH v2 08/12] commit-graph: verify commit contents against odb

2018-05-11 Thread Derrick Stolee
When running 'git commit-graph verify', compare the contents of the
commits that are loaded from the commit-graph file with commits that are
loaded directly from the object database. This includes checking the
root tree object ID, commit date, and parents.

Parse the commit from the graph during the initial loop through the
object IDs to guarantee we parse from the commit-graph file.

In addition, verify the generation number calculation is correct for all
commits in the commit-graph file.

While testing, we discovered that mutating the integer value for a
parent to be outside the accepted range causes a segmentation fault. Add
a new check in insert_parent_or_die() that prevents this fault. Check
for that error during the test, both in the typical parents and in the
list of parents for octopus merges.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c  | 100 
 t/t5318-commit-graph.sh |  64 +++
 2 files changed, 164 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 5bb93e533c..a15ad9710d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -237,6 +237,10 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
 {
struct commit *c;
struct object_id oid;
+
+   if (pos >= g->num_commits)
+   die("invalide parent position %"PRIu64, pos);
+
hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
c = lookup_commit(&oid);
if (!c)
@@ -875,6 +879,8 @@ int verify_commit_graph(struct commit_graph *g)
return 1;
 
for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit;
+
hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
@@ -892,6 +898,10 @@ int verify_commit_graph(struct commit_graph *g)
 
cur_fanout_pos++;
}
+
+   graph_commit = lookup_commit(&cur_oid);
+   if (!parse_commit_in_graph_one(g, graph_commit))
+   graph_report("failed to parse %s from commit-graph", 
oid_to_hex(&cur_oid));
}
 
while (cur_fanout_pos < 256) {
@@ -904,5 +914,95 @@ int verify_commit_graph(struct commit_graph *g)
cur_fanout_pos++;
}
 
+   if (verify_commit_graph_error)
+   return 1;
+
+   for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit, *odb_commit;
+   struct commit_list *graph_parents, *odb_parents;
+   int num_parents = 0;
+
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   graph_commit = lookup_commit(&cur_oid);
+   odb_commit = (struct commit *)create_object(cur_oid.hash, 
alloc_commit_node());
+   if (parse_commit_internal(odb_commit, 0, 0)) {
+   graph_report("failed to parse %s from object database", 
oid_to_hex(&cur_oid));
+   continue;
+   }
+
+   if (oidcmp(&get_commit_tree_in_graph_one(g, 
graph_commit)->object.oid,
+  get_commit_tree_oid(odb_commit)))
+   graph_report("root tree object ID for commit %s in 
commit-graph is %s != %s",
+oid_to_hex(&cur_oid),
+
oid_to_hex(get_commit_tree_oid(graph_commit)),
+
oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+   if (graph_commit->date != odb_commit->date)
+   graph_report("commit date for commit %s in commit-graph 
is %"PRItime" != %"PRItime"",
+oid_to_hex(&cur_oid),
+graph_commit->date,
+odb_commit->date);
+
+
+   graph_parents = graph_commit->parents;
+   odb_parents = odb_commit->parents;
+
+   while (graph_parents) {
+   num_parents++;
+
+   if (odb_parents == NULL)
+   graph_report("commit-graph parent list for 
commit %s is too long (%d)",
+oid_to_hex(&cur_oid),
+num_parents);
+
+   if (oidcmp(&graph_parents->item->object.oid, 
&odb_parents->item->object.oid))
+   graph_report("commit-graph parent for %s is %s 
!= %s",
+oid_to_hex(&cur_oid),
+
oid_to_hex(&

[PATCH v2 07/12] commit-graph: load a root tree from specific graph

2018-05-11 Thread Derrick Stolee
When lazy-loading a tree for a commit, it will be important to select
the tree from a specific struct commit_graph. Create a new method that
specifies the commit-graph file and use that in
get_commit_tree_in_graph().

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index b0fd1d5320..5bb93e533c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -357,14 +357,20 @@ static struct tree *load_tree_for_commit(struct 
commit_graph *g, struct commit *
return c->maybe_tree;
 }
 
-struct tree *get_commit_tree_in_graph(const struct commit *c)
+static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
+const struct commit *c)
 {
if (c->maybe_tree)
return c->maybe_tree;
if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
BUG("get_commit_tree_in_graph called from non-commit-graph 
commit");
 
-   return load_tree_for_commit(commit_graph, (struct commit *)c);
+   return load_tree_for_commit(g, (struct commit *)c);
+}
+
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+   return get_commit_tree_in_graph_one(commit_graph, c);
 }
 
 static void write_graph_chunk_fanout(struct hashfile *f,
-- 
2.16.2.329.gfb62395de6



[PATCH v2 06/12] commit: force commit to parse from object database

2018-05-11 Thread Derrick Stolee
In anticipation of verifying commit-graph file contents against the
object database, create parse_commit_internal() to allow side-stepping
the commit-graph file and parse directly from the object database.

Due to the use of generation numbers, this method should not be called
unless the intention is explicit in avoiding commits from the
commit-graph file.

Signed-off-by: Derrick Stolee 
---
 commit.c | 13 +
 commit.h |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 1d28677dfb..7c92350373 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void 
*buffer, unsigned long s
return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
use_commit_graph)
 {
enum object_type type;
void *buffer;
@@ -403,17 +403,17 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return -1;
if (item->object.parsed)
return 0;
-   if (parse_commit_in_graph(item))
+   if (use_commit_graph && parse_commit_in_graph(item))
return 0;
buffer = read_sha1_file(item->object.oid.hash, &type, &size);
if (!buffer)
return quiet_on_missing ? -1 :
error("Could not read %s",
-oid_to_hex(&item->object.oid));
+   oid_to_hex(&item->object.oid));
if (type != OBJ_COMMIT) {
free(buffer);
return error("Object %s not a commit",
-oid_to_hex(&item->object.oid));
+   oid_to_hex(&item->object.oid));
}
ret = parse_commit_buffer(item, buffer, size, 0);
if (save_commit_buffer && !ret) {
@@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+   return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
@@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char 
*name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char 
*ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.16.2.329.gfb62395de6



[PATCH v2 09/12] fsck: verify commit-graph

2018-05-11 Thread Derrick Stolee
If core.commitGraph is true, verify the contents of the commit-graph
during 'git fsck' using the 'git commit-graph verify' subcommand. Run
this check on all alternates, as well.

We use a new process for two reasons:

1. The subcommand decouples the details of loading and verifying a
   commit-graph file from the other fsck details.

2. The commit-graph verification requires the commits to be loaded
   in a specific order to guarantee we parse from the commit-graph
   file for some objects and from the object database for others.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-fsck.txt |  3 +++
 builtin/fsck.c | 21 +
 t/t5318-commit-graph.sh| 21 ++---
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt
index b9f060e3b2..ab9a93fb9b 100644
--- a/Documentation/git-fsck.txt
+++ b/Documentation/git-fsck.txt
@@ -110,6 +110,9 @@ Any corrupt objects you will have to find in backups or 
other archives
 (i.e., you can just remove them and do an 'rsync' with some other site in
 the hopes that somebody else has the object you have corrupted).
 
+If core.commitGraph is true, the commit-graph file will also be inspected
+using 'git commit-graph verify'. See linkgit:git-commit-graph[1].
+
 Extracted Diagnostics
 -
 
diff --git a/builtin/fsck.c b/builtin/fsck.c
index ef78c6c00c..a6d5045b77 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -16,6 +16,7 @@
 #include "streaming.h"
 #include "decorate.h"
 #include "packfile.h"
+#include "run-command.h"
 
 #define REACHABLE 0x0001
 #define SEEN  0x0002
@@ -45,6 +46,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_COMMIT_GRAPH 020
 
 static const char *describe_object(struct object *obj)
 {
@@ -815,5 +817,24 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
}
 
check_connectivity();
+
+   if (core_commit_graph) {
+   struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
+   const char *verify_argv[] = { "commit-graph", "verify", NULL, 
NULL, NULL, NULL };
+   commit_graph_verify.argv = verify_argv;
+   commit_graph_verify.git_cmd = 1;
+
+   if (run_command(&commit_graph_verify))
+   errors_found |= ERROR_COMMIT_GRAPH;
+
+   prepare_alt_odb();
+   for (alt = alt_odb_list; alt; alt = alt->next) {
+   verify_argv[2] = "--object-dir";
+   verify_argv[3] = alt->path;
+   if (run_command(&commit_graph_verify))
+   errors_found |= ERROR_COMMIT_GRAPH;
+   }
+   }
+
return errors_found;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 5ab268a024..91c8406d97 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -205,6 +205,16 @@ test_expect_success 'build graph from commits with append' 
'
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
 graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
 
+test_expect_success 'build graph using --reachable' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git commit-graph write --reachable &&
+   test_path_is_file $objdir/info/commit-graph &&
+   graph_read_expect "11" "large_edges"
+'
+
+graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
+graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
cd "$TRASH_DIRECTORY" &&
git clone --bare --no-local full bare &&
@@ -335,7 +345,7 @@ test_expect_success 'detect OID not in object database' '
cd "$TRASH_DIRECTORY/full" &&
cp $objdir/info/commit-graph commit-graph-backup &&
test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
-   corrupt_data $objdir/info/commit-graph 1134 "\01" &&
+   corrupt_data $objdir/info/commit-graph 1134 "\00" &&
test_must_fail git commit-graph verify 2>err &&
grep -v "^\+" err > verify-errors &&
test_line_count = 3 verify-errors &&
@@ -348,7 +358,7 @@ test_expect_success 'detect incorrect tree OID' '
cd "$TRASH_DIRECTORY/full" &&
cp $objdir/info/commit-graph commit-graph-backup &&
test_when_finished mv commit-graph-backup $objdir/info/commit-graph &

[PATCH v2 12/12] commit-graph: update design document

2018-05-11 Thread Derrick Stolee
The commit-graph feature is now integrated with 'fsck' and 'gc',
so remove those items from the "Future Work" section of the
commit-graph design document.

Also remove the section on lazy-loading trees, as that was completed
in an earlier patch series.

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/commit-graph.txt | 22 --
 1 file changed, 22 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index e1a883eb46..c664acbd76 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -118,9 +118,6 @@ Future Work
 - The commit graph feature currently does not honor commit grafts. This can
   be remedied by duplicating or refactoring the current graft logic.
 
-- The 'commit-graph' subcommand does not have a "verify" mode that is
-  necessary for integration with fsck.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
@@ -130,25 +127,6 @@ Future Work
 - 'log --topo-order'
 - 'tag --merged'
 
-- Currently, parse_commit_gently() requires filling in the root tree
-  object for a commit. This passes through lookup_tree() and consequently
-  lookup_object(). Also, it calls lookup_commit() when loading the parents.
-  These method calls check the ODB for object existence, even if the
-  consumer does not need the content. For example, we do not need the
-  tree contents when computing merge bases. Now that commit parsing is
-  removed from the computation time, these lookup operations are the
-  slowest operations keeping graph walks from being fast. Consider
-  loading these objects without verifying their existence in the ODB and
-  only loading them fully when consumers need them. Consider a method
-  such as "ensure_tree_loaded(commit)" that fully loads a tree before
-  using commit->tree.
-
-- The current design uses the 'commit-graph' subcommand to generate the graph.
-  When this feature stabilizes enough to recommend to most users, we should
-  add automatic graph writes to common operations that create many commits.
-  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
-  commands.
-
 - A server could provide a commit graph file as part of the network protocol
   to avoid extra calculations by clients. This feature is only of benefit if
   the user is willing to trust the file, because verifying the file is correct
-- 
2.16.2.329.gfb62395de6



[PATCH v2 05/12] commit-graph: verify fanout and lookup table

2018-05-11 Thread Derrick Stolee
While running 'git commit-graph verify', verify that the object IDs
are listed in lexicographic order and that the fanout table correctly
navigates into that list of object IDs.

Add tests that check these corruptions are caught by the verify
subcommand. Most of the tests check the full output matches the exact
error we inserted, but since our OID order test triggers incorrect
fanout values (with possibly different numbers of output lines) we
focus only that the correct error is written in that case.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c  | 36 
 t/t5318-commit-graph.sh | 31 +++
 2 files changed, 67 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 4dfff7e752..b0fd1d5320 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -848,6 +848,9 @@ static void graph_report(const char *fmt, ...)
 
 int verify_commit_graph(struct commit_graph *g)
 {
+   uint32_t i, cur_fanout_pos = 0;
+   struct object_id prev_oid, cur_oid;
+
if (!g) {
graph_report("no commit-graph file loaded");
return 1;
@@ -862,5 +865,38 @@ int verify_commit_graph(struct commit_graph *g)
if (!g->chunk_commit_data)
graph_report("commit-graph is missing the Commit Data chunk");
 
+   if (verify_commit_graph_error)
+   return 1;
+
+   for (i = 0; i < g->num_commits; i++) {
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
+   graph_report("commit-graph has incorrect oid order: %s 
then %s",
+oid_to_hex(&prev_oid),
+oid_to_hex(&cur_oid));
+
+   oidcpy(&prev_oid, &cur_oid);
+
+   while (cur_oid.hash[0] > cur_fanout_pos) {
+   uint32_t fanout_value = get_be32(g->chunk_oid_fanout + 
cur_fanout_pos);
+   if (i != fanout_value)
+   graph_report("commit-graph has incorrect fanout 
value: fanout[%d] = %u != %u",
+cur_fanout_pos, fanout_value, i);
+
+   cur_fanout_pos++;
+   }
+   }
+
+   while (cur_fanout_pos < 256) {
+   uint32_t fanout_value = get_be32(g->chunk_oid_fanout + 
cur_fanout_pos);
+
+   if (g->num_commits != fanout_value)
+   graph_report("commit-graph has incorrect fanout value: 
fanout[%d] = %u != %u",
+cur_fanout_pos, fanout_value, i);
+
+   cur_fanout_pos++;
+   }
+
return verify_commit_graph_error;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 0cb88232fa..6fb306b0da 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -293,4 +293,35 @@ test_expect_success 'detect too small chunk-count' '
grep "missing the Commit Data chunk" verify-errors
 '
 
+test_expect_success 'detect incorrect chunk lookup value' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 25 "\01" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 1 verify-errors &&
+   grep "improper chunk offset" verify-errors
+'
+
+test_expect_success 'detect incorrect fanout' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 128 "\01" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 1 verify-errors &&
+   grep "fanout value" verify-errors
+'
+
+test_expect_success 'detect incorrect OID order' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 1272 "\01" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep "incorrect oid order" err
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v2 04/12] commit-graph: parse commit from chosen graph

2018-05-11 Thread Derrick Stolee
Before verifying a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d2db20e49a..4dfff7e752 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -309,7 +309,7 @@ static int find_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
}
 }
 
-int parse_commit_in_graph(struct commit *item)
+int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
 {
uint32_t pos;
 
@@ -317,9 +317,21 @@ int parse_commit_in_graph(struct commit *item)
return 0;
if (item->object.parsed)
return 1;
+
+   if (find_commit_in_graph(item, g, &pos))
+   return fill_commit_in_graph(item, g, pos);
+
+   return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
+   if (!core_commit_graph)
+   return 0;
+
prepare_commit_graph();
-   if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
-   return fill_commit_in_graph(item, commit_graph, pos);
+   if (commit_graph)
+   return parse_commit_in_graph_one(commit_graph, item);
return 0;
 }
 
-- 
2.16.2.329.gfb62395de6



Re: [PATCH 4/4] mark_parents_uninteresting(): avoid most allocation

2018-05-14 Thread Derrick Stolee

On 5/11/2018 2:03 PM, Jeff King wrote:

Commit 941ba8db57 (Eliminate recursion in setting/clearing
marks in commit list, 2012-01-14) used a clever double-loop
to avoid allocations for single-parent chains of history.
However, it did so only when following parents of parents
(which was an uncommon case), and _always_ incurred at least
one allocation to populate the list of pending parents in
the first place.

We can turn this into zero-allocation in the common case by
iterating directly over the initial parent list, and then
following up on any pending items we might have discovered.


This change appears to improve performance, but I was unable to measure 
any difference between this commit and the one ahead, even when merging 
ds/generation-numbers (which significantly reduces the other costs). I 
was testing 'git status' and 'git rev-list --boundary 
master...origin/master' in the Linux repo with my copy of master 70,000+ 
commits behind origin/master.


It's still a good change, but I was hoping to find a measurable benefit :(



Signed-off-by: Jeff King 
---
Again, try "-w" for more readability.

  revision.c | 44 +---
  1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/revision.c b/revision.c
index 89ff9a99ce..cbe041128e 100644
--- a/revision.c
+++ b/revision.c
@@ -115,32 +115,38 @@ static void commit_stack_clear(struct commit_stack *stack)
stack->nr = stack->alloc = 0;
  }
  
-void mark_parents_uninteresting(struct commit *commit)

+static void mark_one_parent_uninteresting(struct commit *commit,
+ struct commit_stack *pending)
  {
-   struct commit_stack pending = COMMIT_STACK_INIT;
struct commit_list *l;
  
+	if (commit->object.flags & UNINTERESTING)

+   return;
+   commit->object.flags |= UNINTERESTING;
+
+   /*
+* Normally we haven't parsed the parent
+* yet, so we won't have a parent of a parent
+* here. However, it may turn out that we've
+* reached this commit some other way (where it
+* wasn't uninteresting), in which case we need
+* to mark its parents recursively too..
+*/
for (l = commit->parents; l; l = l->next)
-   commit_stack_push(&pending, l->item);
+   commit_stack_push(pending, l->item);
+}
  
-	while (pending.nr > 0) {

-   struct commit *commit = commit_stack_pop(&pending);
+void mark_parents_uninteresting(struct commit *commit)
+{
+   struct commit_stack pending = COMMIT_STACK_INIT;
+   struct commit_list *l;
  
-		if (commit->object.flags & UNINTERESTING)

-   return;
-   commit->object.flags |= UNINTERESTING;
+   for (l = commit->parents; l; l = l->next)
+   mark_one_parent_uninteresting(l->item, &pending);
  
-		/*

-* Normally we haven't parsed the parent
-* yet, so we won't have a parent of a parent
-* here. However, it may turn out that we've
-* reached this commit some other way (where it
-* wasn't uninteresting), in which case we need
-* to mark its parents recursively too..
-*/
-   for (l = commit->parents; l; l = l->next)
-   commit_stack_push(&pending, l->item);
-   }
+   while (pending.nr > 0)
+   mark_one_parent_uninteresting(commit_stack_pop(&pending),
+ &pending);
  
  	commit_stack_clear(&pending);

  }




Re: [PATCH 0/4] a few mark_parents_uninteresting cleanups

2018-05-14 Thread Derrick Stolee

On 5/11/2018 2:00 PM, Jeff King wrote:

This is a follow-up to the discussion from February:

   
https://public-inbox.org/git/1519522496-73090-1-git-send-email-dsto...@microsoft.com/

There I theorized that some of these extra has_object_file() checks were
unnecessary. After poking around a bit, I've convinced myself that this
is the case, so here are some patches.

After Stolee's fix in ebbed3ba04 (revision.c: reduce object database
queries, 2018-02-24), I doubt this will provide any noticeable speedup.
IMHO the value is just in simplifying the logic.

The first two patches are the actual has_object_file simplifications.
The second two are an attempt to fix some head-scratching I had while
reading mark_parents_uninteresting(). I hope the result is easier to
follow, but I may have just shuffled one confusion for another. :)

   [1/4]: mark_tree_contents_uninteresting(): drop missing object check
   [2/4]: mark_parents_uninteresting(): drop missing object check
   [3/4]: mark_parents_uninteresting(): replace list with stack
   [4/4]: mark_parents_uninteresting(): avoid most allocation

  revision.c | 90 ++
  1 file changed, 50 insertions(+), 40 deletions(-)

-Peff


This series looks good to me. I found Patch 3 hard to read in the diff, 
so I just looked at the final result and the new arrangement is very 
clear about how it should behave.


Reviewed-by: Derrick Stolee 


Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-14 Thread Derrick Stolee

On 5/12/2018 10:00 AM, Jakub Narebski wrote:

Derrick Stolee  writes:

On 5/4/2018 3:40 PM, Jakub Narebski wrote:

With early parts of commit-graph feature (ds/commit-graph and
ds/lazy-load-trees) close to being merged into "master", see
https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/
I think it would be good idea to think what other data could be added
there to make Git even faster.

Before thinking about adding more data to the commit-graph, I think
instead we need to finish taking advantage of the data that is already
there. This means landing the generation number patch [1] (I think
this is close, so I'll send a v6 this week if there is no new
feedback.) and the auto-compute patch [2] (this could use more
feedback, but I'll send a v1 based on the RFC feedback if no one
chimes in).

[1]
https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/
     [PATCH v5 00/11] Compute and consume generation numbers

[2]
https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/
     [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

Right, so the RFC might be a bit premature; I wanted the discussion to
be out there to think about when adding new uses of existing features.


DIGRESSION: it is commendable that you are sending patches in small,
easy digestible chunks / patch series.  It is much easier to review 10+
series than 80+ behemoth (though I understand it is not always possible
to subdivide patch series into smaller self-contained sub-series).

On the other hand, it would be nice to have some roadmap about series
and features to be sent in the future, if possible.  Something like what
was done when 'git rebase --interactive' was getting builtinified: moved
(in parts) to C.  It would be great to have such roadmap with milestones
achieved and milestones to be done in the cover letter for series.


I suppose that is what I intended in the "Future Work" section of 
Documentation/technical/commit-graph.txt. It gives a set of things that 
need to be done in order to make this a default feature, not just an 
expert-level feature. When I wrote the section, I was focused entirely 
on "commit-graph version 1.0" and I didn't know how long that would 
take. The series have been getting interest and review (in great part to 
your interest, Jakub) so they have been moving to 'next' faster than I 
anticipated.


I'll plan on writing a more detailed list of future directions, but for 
now I'll list the things I know about and how I see them in priority order:


Commit-graph v1.0:
* ds/generation-numbers
* 'verify' and fsck/gc integration
* correct behavior with shallow clones, replace-objects, and grafts

Commit-graph v1.1:
* Place commit-graph storage in the_repository
* 'git tag --merged' use generation numbers
* 'git log --graph' use generation numbers

Commit-graph v1.X:
* Protocol v2 verb for sending commit-graph
* Bloom filters for changed paths




The big wins remaining from this data are `git tag --merged` and `git
log --graph`. The `tag` scenario is probably easier: this can be done
by replacing the revision-walk underlying the call to use
paint_down_to_common() instead. Requires adding an external method to
commit.c, but not too much code.

I wonder if there is some significant reason behind `git tag --merged`
using its own codepath, beside historical reasons.  Maybe performance is
better with current code?

Utilizing paint_down_to_common() there, beside reducing amount of code
you would have to modify, would also unify code (and possibly reduce
amount of code).  That's very nice.


The tougher challenge is `git log --graph`. The revision walk
machinery currently uses two precompute phases before iterating
results to the pager: limit_list() and sort_in_topological_order();
these correspond to two phases of Kahn's algorithm for topo-sort
(compute in-degrees, then walk by peeling commits with in-degree
zero). This requires O(N) time, where N is the number of reachable
commits. Instead, we could make this be O(W) time to output one page
of results, where W is (roughly) the number of reachable commits with
generation number above the last reported result.

A reminder: Kahn's algorithm (described for example in [1] and [3])
looks like this:

   L ← Empty list that will contain the sorted elements
   S ← Collection of all nodes with no incoming edge
   while S is non-empty do
   remove a node 'n' from S
   add 'n' to tail of L
   for each parent 'm' of 'n' do
   decrease the in-degree of 'm'
   if 'm' has in-degree of 0 then
   insert 'm' into S

[1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
[2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-soluti

Re: [PATCH 4/4] mark_parents_uninteresting(): avoid most allocation

2018-05-14 Thread Derrick Stolee

On 5/14/2018 9:09 AM, Jeff King wrote:

On Mon, May 14, 2018 at 08:47:33AM -0400, Derrick Stolee wrote:


On 5/11/2018 2:03 PM, Jeff King wrote:

Commit 941ba8db57 (Eliminate recursion in setting/clearing
marks in commit list, 2012-01-14) used a clever double-loop
to avoid allocations for single-parent chains of history.
However, it did so only when following parents of parents
(which was an uncommon case), and _always_ incurred at least
one allocation to populate the list of pending parents in
the first place.

We can turn this into zero-allocation in the common case by
iterating directly over the initial parent list, and then
following up on any pending items we might have discovered.

This change appears to improve performance, but I was unable to measure any
difference between this commit and the one ahead, even when merging
ds/generation-numbers (which significantly reduces the other costs). I was
testing 'git status' and 'git rev-list --boundary master...origin/master' in
the Linux repo with my copy of master 70,000+ commits behind origin/master.

I think you'd want to go the other way: this is marking uninteresting
commits, so you'd want origin/master..master, which would make those 70k
commits uninteresting.


Thanks for the tip. Running 'git rev-list origin/master..master' 100 
times gave the following times, on average (with ds/generation-numbers):


PATCH 3/4: 0.19 s
PATCH 4/4: 0.16 s
    Rel %: -16%



I still doubt it will create that much difference, though. It's
more or less saving a single malloc per commit we traverse. Certainly
without the .commits file that's a drop in the bucket compared to
inflating the object data, but I wouldn't be surprised if we end up with
a few mallocs even after your patches.

Anyway, thanks for poking at it. :)

-Peff




Re: [PATCH v2 01/12] commit-graph: add 'verify' subcommand

2018-05-14 Thread Derrick Stolee

On 5/12/2018 9:31 AM, Martin Ågren wrote:

On 11 May 2018 at 23:15, Derrick Stolee  wrote:


 graph_name = get_commit_graph_filename(opts.obj_dir);
 graph = load_commit_graph_one(graph_name);
+   FREE_AND_NULL(graph_name);

 if (!graph)
 die("graph file %s does not exist", graph_name);
-   FREE_AND_NULL(graph_name);

This is probably because of something I said, but this does not look
correct. The `die()` would typically print "(null)" or segfault. If the
`die()` means we don't free `graph_name`, that should be fine.

You're still leaking `graph` here (possibly nothing this patch should
worry about) and in `graph_verify()`. UNLEAK-ing it immediately before
calling `verify_commit_graph()` should be ok. I also think punting on
this UNLEAK-business entirely would be ok; I was just a bit surprised to
see one variable getting freed and the other one ignored.


Thanks, Martin. I was just blindly searching for FREE_AND_NULL() and 
shouldn't have been so careless.


-Stolee


Re: [PATCH v2 02/12] commit-graph: verify file header information

2018-05-14 Thread Derrick Stolee

On 5/12/2018 9:35 AM, Martin Ågren wrote:

+static int verify_commit_graph_error;
+
+static void graph_report(const char *fmt, ...)
+{
+   va_list ap;
+   struct strbuf sb = STRBUF_INIT;
+   verify_commit_graph_error = 1;
+
+   va_start(ap, fmt);
+   strbuf_vaddf(&sb, fmt, ap);
+
+   fprintf(stderr, "%s\n", sb.buf);
+   strbuf_release(&sb);
+   va_end(ap);
+}


That's a good idea. Makes that patch a bit less trivial and this one a 
bit less difficult.


Re: [PATCH v2 08/12] commit-graph: verify commit contents against odb

2018-05-14 Thread Derrick Stolee

On 5/12/2018 5:17 PM, Martin Ågren wrote:

On 11 May 2018 at 23:15, Derrick Stolee  wrote:

When running 'git commit-graph verify', compare the contents of the
commits that are loaded from the commit-graph file with commits that are
loaded directly from the object database. This includes checking the
root tree object ID, commit date, and parents.

Parse the commit from the graph during the initial loop through the
object IDs to guarantee we parse from the commit-graph file.

In addition, verify the generation number calculation is correct for all
commits in the commit-graph file.

While testing, we discovered that mutating the integer value for a
parent to be outside the accepted range causes a segmentation fault. Add
a new check in insert_parent_or_die() that prevents this fault. Check
for that error during the test, both in the typical parents and in the
list of parents for octopus merges.

This paragraph and the corresponding fix and test feel like a separate
patch to me. (The commit message of it could be "To test the next patch,
we threw invalid data at `git commit-graph verify, and it crashed in
pre-existing code, so let's fix that first" -- there is definitely a
connection.) Is this important enough to fast-track to master in time
for 2.18? My guess would be no.


+
+   if (pos >= g->num_commits)
+   die("invalide parent position %"PRIu64, pos);

s/invalide/invalid/


@@ -875,6 +879,8 @@ int verify_commit_graph(struct commit_graph *g)
 return 1;

 for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit;
+
 hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);

 if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
@@ -892,6 +898,10 @@ int verify_commit_graph(struct commit_graph *g)

 cur_fanout_pos++;
 }
+
+   graph_commit = lookup_commit(&cur_oid);
+   if (!parse_commit_in_graph_one(g, graph_commit))
+   graph_report("failed to parse %s from commit-graph", 
oid_to_hex(&cur_oid));
 }

Could this end up giving ridiculous amounts of output? It would depend
on the input, I guess.


 while (cur_fanout_pos < 256) {
@@ -904,5 +914,95 @@ int verify_commit_graph(struct commit_graph *g)
 cur_fanout_pos++;
 }

+   if (verify_commit_graph_error)
+   return 1;

Well, here we give up before running into *too* much problem.


+   for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit, *odb_commit;
+   struct commit_list *graph_parents, *odb_parents;
+   int num_parents = 0;
+
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   graph_commit = lookup_commit(&cur_oid);
+   odb_commit = (struct commit *)create_object(cur_oid.hash, 
alloc_commit_node());
+   if (parse_commit_internal(odb_commit, 0, 0)) {
+   graph_report("failed to parse %s from object database", 
oid_to_hex(&cur_oid));
+   continue;
+   }
+
+   if (oidcmp(&get_commit_tree_in_graph_one(g, 
graph_commit)->object.oid,
+  get_commit_tree_oid(odb_commit)))
+   graph_report("root tree object ID for commit %s in 
commit-graph is %s != %s",
+oid_to_hex(&cur_oid),
+
oid_to_hex(get_commit_tree_oid(graph_commit)),
+
oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+   if (graph_commit->date != odb_commit->date)
+   graph_report("commit date for commit %s in commit-graph is %"PRItime" 
!= %"PRItime"",
+oid_to_hex(&cur_oid),
+graph_commit->date,
+odb_commit->date);
+
+
+   graph_parents = graph_commit->parents;
+   odb_parents = odb_commit->parents;
+
+   while (graph_parents) {
+   num_parents++;
+
+   if (odb_parents == NULL)
+   graph_report("commit-graph parent list for commit %s 
is too long (%d)",
+oid_to_hex(&cur_oid),
+num_parents);
+
+   if (oidcmp(&graph_parents->item->object.oid, 
&odb_parents->item->object.oid))
+   graph_report("commit-graph parent for %s is %s != 
%s",
+oid_to_hex(&cur_oid),
+

Re: [PATCH v2 14/14] commit.h: delete 'util' field in struct commit

2018-05-14 Thread Derrick Stolee

On 5/14/2018 1:30 PM, Duy Nguyen wrote:

On Mon, May 14, 2018 at 6:07 PM, Duy Nguyen  wrote:

On Mon, May 14, 2018 at 04:52:29PM +0900, Junio C Hamano wrote:

Nguyễn Thái Ngọc Duy   writes:


diff --git a/commit.h b/commit.h
index 838f6a6b26..70371e111e 100644
--- a/commit.h
+++ b/commit.h
@@ -18,12 +18,16 @@ struct commit_list {

  struct commit {
 struct object object;
-   void *util;
 unsigned int index;
 timestamp_t date;
 struct commit_list *parents;
 struct tree *tree;
 uint32_t graph_pos;
+   /*
+* Do not add more fields here unless it's _very_ often
+* used. Use commit-slab to associate more data with a commit
+* instead.
+*/
  };

That's a logical consequence and a natural endgame of this
pleasent-to-read series.  THanks.

Unfortunately we are gaining more and more stuff in "struct commit"
with recent topics, and we may want to see if we can evict some of
them out to shrink it again.

Sigh.. ds/lazy-load-trees already enters 'next' so a fixup patch is
something like this.

Sorry I take this patch back. I didn't realize it was a rename and the
old field named 'tree' was already there (I vaguely recalled some
"tree" in this struct but didn't stop to think about it). Moving
graph_pos out is an option, but only 32-bit arch gains from it (64-bit
arch will have 4 bytes padding anyway) so probably not worth the
effort. "generation" field should probably be moved out though.


I'm happy to take a look at this patch and figure out the best way to 
integrate it with the changes I've been doing.


I disagree with the removal of "generation". My intention is to make the 
commit-graph feature a standard feature that most repos (of reasonable 
size) want to have. In that case, 'graph_pos' and 'generation' will be 
set during every parse and used by every commit walk. This is just my 
gut reaction.


As I investigate these changes, I'll try to see what performance hit is 
caused by converting the graph_pos and/or generation to commit slabs. 
(Again, I'm assuming the slabs will make this slower. I may be wrong here.)


Thanks,
-Stolee


Re: worktrees vs. alternates

2018-05-16 Thread Derrick Stolee

On 5/16/2018 6:33 AM, Ævar Arnfjörð Bjarmason wrote:
[big snip]


And here's where this isn't at all like "worktree", each of those 100
will have their own "master" branch, and they can all create 100
different branches called "topic" that can be different.


This is the biggest difference. You cannot have the same ref checked out 
in multiple worktrees, as they both may edit that ref. The alternates 
allow you to share data in a "read only" fashion. If you have one repo 
that is the "base" repo that manages that objects dir, then that is 
probably a good way to reduce the duplication. I'm not familiar with 
what happens when a "child" repo does 'git gc' or 'git repack', will it 
delete the local objects that is sees exist in the alternate?


GVFS uses alternates in this same way: we create a drive-wide "shared 
object cache" that GVFS manages. We put our prefetch packs filled with 
commits and trees in there, and any loose objects that are downloaded 
via the object virtualization are placed as loose objects in the 
alternate. We also store the multi-pack-index and commit-graph in that 
alternate. This means that the only objects in each src dir are those 
created by the developer doing their normal work.


Thanks,
-Stolee



Re: Git log range reverse bug

2018-05-16 Thread Derrick Stolee

Hi Mendi,

On 5/16/2018 2:19 PM, Mehdi Zeinali wrote:

Git Version: Version: 2.14.2

When reversing a range in git log, it does not start from the expected commit:

$ git show 8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b
commit 8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b
Author: Some Name 
Date:   Mon Nov 3 19:01:53 2014 +
.
.
.

$ git show
Author: Some Other Name 
Date:   Wed May 16 16:49:10 2018 +
.
.
.

$ git log --reverse 8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b..HEAD


This command is asking for the commits reachable from HEAD but NOT 
reachable from 8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b. To see 
8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b in the results, you would need 
to add "--boundary" to the command. That may still not show 
8e11b4a41ec21e47fb0bf8b76e1edba739f57a9b as the first commit, as there 
may be multiple, earlier boundary commits.


Thanks,
-Stolee


Re: What's cooking in git.git (May 2018, #02; Thu, 17)

2018-05-17 Thread Derrick Stolee

On 5/17/2018 2:01 AM, Junio C Hamano wrote:

* ds/generation-numbers (2018-05-02) 11 commits
  - commit-graph.txt: update design document
  - merge: check config before loading commits
  - commit: use generation number in remove_redundant()
  - commit: add short-circuit to paint_down_to_common()
  - commit: use generation numbers for in_merge_bases()
  - ref-filter: use generation number for --contains
  - commit-graph: always load commit-graph information
  - commit: use generations in paint_down_to_common()
  - commit-graph: compute generation numbers
  - commit: add generation number to struct commmit
  - ref-filter: fix outdated comment on in_commit_list
  (this branch is used by ds/commit-graph-lockfile-fix; uses 
ds/lazy-load-trees.)

  A recently added "commit-graph" datafile has learned to store
  pre-computed generation numbers to speed up the decisions to stop
  history traversal.

  Is this ready for 'next' with ds/commit-graph-lockfile-fix?
  A commit with triple 'm' needs its title amended, though.


With the lockfile fix, it should be ready. I've been giving this 
significant testing on my machine and a few other developers here. The 
next version of GVFS is shipping with this code and with GVFS 
controlling the maintenance of the commit-graph file. That code has been 
cooking with our CI builds for a while, with full functional tests 
against the Windows repository. The only bugs we've found are the fix in 
"merge: check config before loading commits" and in 
ds/commit-graph-lockfile-fix.


Sorry for the triple-m.

Thanks,
-Stolee


commit-graph: change in "best" merge-base when ambiguous

2018-05-21 Thread Derrick Stolee

Hello all,

While working on the commit-graph feature, I made a test commit that 
sets core.commitGraph and gc.commitGraph to true by default AND runs 
'git commit-graph write --reachable' after each 'git commit' command. 
This helped me find instances in the test suite where the commit-graph 
feature changes existing functionality. Most of these were in regards to 
grafts, replace-objects, and shallow-clones (as expected) or when trying 
to find a corrupt or hidden commit (the commit-graph hides this 
corrupt/missing data). However, there was one interesting case that I'd 
like to mention on-list.


In t6024-recursive-merge.sh, we have the following commit structure:

    # 1 - A - D - F
    #   \   X   /
    # B   X
    #   X   \
    # 2 - C - E - G

When merging F to G, there are two "best" merge-bases, A and C. With 
core.commitGraph=false, 'git merge-base F G' returns A, while it returns 
C when core.commitGraph=true. This is due to the new walk order when 
using generation numbers, although I have not dug deep into the code to 
point out exactly where the choice between A and C is made. Likely it's 
just whatever order they are inserted into a list.


In the Discussion section of the `git merge-base` docs [1], we have the 
following:


    When the history involves criss-cross merges, there can be more 
than one best common ancestor for two commits. For example, with this 
topology:


    ---1---o---A
        \ /
     X
        / \
    ---2---o---o---B

    both 1 and 2 are merge-bases of A and B. Neither one is better than 
the other (both are best merge bases). When the --all option is not 
given,     it is unspecified which best one is output.


This means our official documentation mentions that we do not have a 
concrete way to differentiate between these choices. This makes me think 
that this change in behavior is not a bug, but it _is_ a change in 
behavior. It's worth mentioning, but I don't think there is any value in 
making sure `git merge-base` returns the same output.


Does anyone disagree? Is this something we should solidify so we always 
have a "definitive" merge-base?


The biggest reason I think we should avoid sticking to the existing 
behavior is that the current behavior depends on the walk order. That 
means we would not be able to concretely define a tie-breaker without 
changing the existing behavior anyway.


Thanks,
-Stolee

[1] https://git-scm.com/docs/git-merge-base#_discussion



Re: commit-graph: change in "best" merge-base when ambiguous

2018-05-22 Thread Derrick Stolee

On 5/22/2018 1:39 AM, Michael Haggerty wrote:

On 05/21/2018 08:10 PM, Derrick Stolee wrote:

[...]
In the Discussion section of the `git merge-base` docs [1], we have the
following:

     When the history involves criss-cross merges, there can be more than
one best common ancestor for two commits. For example, with this topology:

     ---1---o---A
         \ /
      X
         / \
     ---2---o---o---B

     both 1 and 2 are merge-bases of A and B. Neither one is better than
the other (both are best merge bases). When the --all option is not
given,     it is unspecified which best one is output.

This means our official documentation mentions that we do not have a
concrete way to differentiate between these choices. This makes me think
that this change in behavior is not a bug, but it _is_ a change in
behavior. It's worth mentioning, but I don't think there is any value in
making sure `git merge-base` returns the same output.

Does anyone disagree? Is this something we should solidify so we always
have a "definitive" merge-base?
[...]

This may be beyond the scope of what you are working on, but there are
significant advantages to selecting a "best" merge base from among the
candidates. Long ago [1] I proposed that the "best" merge base is the
merge base candidate that minimizes the number of non-merge commits that
are in

 git rev-list $candidate..$branch

that are already in master:

 git rev-list $master

(assuming merging branch into master), which is equivalent to choosing
the merge base that minimizes

 git rev-list --count $candidate..$branch

In fact, this criterion is symmetric if you exchange branch ↔ master,
which is a nice property, and indeed generalizes pretty simply to
computing the merge base of more than two commits.

In that email I also included some data showing that the "best" merge
base almost always results in either the same or a shorter diff than the
more or less arbitrary algorithm that we currently use. Sometimes the
difference in diff length is dramatic.

To me it feels like the best *deterministic* merge base would be based
on the above criterion, maybe with first-parent reachability, commit
times, and SHA-1s used (in that order) to break ties.


Thanks, everyone, for your perspective on this. I'm walking away with 
these conclusions:


1. While this is a change in behavior, it is not a regression. We do not 
need to act immediately to preserve old behavior in these ambiguous cases.


2. We should (eventually) define tie-breaking conditions. I like 
Michael's suggestion above.


Thanks,
-Stolee


[PATCH v3 07/20] commit-graph: verify catches corrupt signature

2018-05-24 Thread Derrick Stolee
This is the first of several commits that add a test to check that
'git commit-graph verify' catches corruption in the commit-graph
file. The first test checks that the command catches an error in
the file signature. This is a check that exists in the existing
commit-graph reading code.

Add a helper method 'corrupt_graph_and_verify' to the test script
t5318-commit-graph.sh. This helper corrupts the commit-graph file
at a certain location, runs 'git commit-graph verify', and reports
the output to the 'err' file. This data is filtered to remove the
lines added by 'test_must_fail' when the test is run verbosely.
Then, the output is checked to contain a specific error message.

Signed-off-by: Derrick Stolee 
---
 t/t5318-commit-graph.sh | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6ca451dfd2..bd64481c7a 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -235,9 +235,52 @@ test_expect_success 'perform fast-forward merge in full 
repo' '
test_cmp expect output
 '
 
+# the verify tests below expect the commit-graph to contain
+# exactly the commits reachable from the commits/8 branch.
+# If the file changes the set of commits in the list, then the
+# offsets into the binary file will result in different edits
+# and the tests will likely break.
+
 test_expect_success 'git commit-graph verify' '
cd "$TRASH_DIRECTORY/full" &&
+   git rev-parse commits/8 | git commit-graph write --stdin-commits &&
git commit-graph verify >output
 '
 
+GRAPH_BYTE_VERSION=4
+GRAPH_BYTE_HASH=5
+
+# usage: corrupt_graph_and_verify   
+# Manipulates the commit-graph file at the position
+# by inserting the data, then runs 'git commit-graph verify'
+# and places the output in the file 'err'. Test 'err' for
+# the given string.
+corrupt_graph_and_verify() {
+   pos=$1
+   data="${2:-\0}"
+   grepstr=$3
+   cd "$TRASH_DIRECTORY/full" &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   printf "$data" | dd of="$objdir/info/commit-graph" bs=1 seek="$pos" 
conv=notrunc &&
+   test_must_fail git commit-graph verify 2>test_err &&
+   grep -v "^+" test_err >err
+   grep "$grepstr" err
+}
+
+test_expect_success 'detect bad signature' '
+   corrupt_graph_and_verify 0 "\0" \
+   "graph signature"
+'
+
+test_expect_success 'detect bad version' '
+   corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \
+   "graph version"
+'
+
+test_expect_success 'detect bad hash version' '
+   corrupt_graph_and_verify $GRAPH_BYTE_HASH "\02" \
+   "hash version"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v3 03/20] commit-graph: parse commit from chosen graph

2018-05-24 Thread Derrick Stolee
Before verifying a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 82295f0975..78ba0edc80 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -310,7 +310,7 @@ static int find_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
}
 }
 
-int parse_commit_in_graph(struct commit *item)
+static int parse_commit_in_graph_one(struct commit_graph *g, struct commit 
*item)
 {
uint32_t pos;
 
@@ -318,9 +318,21 @@ int parse_commit_in_graph(struct commit *item)
return 0;
if (item->object.parsed)
return 1;
+
+   if (find_commit_in_graph(item, g, &pos))
+   return fill_commit_in_graph(item, g, pos);
+
+   return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
+   if (!core_commit_graph)
+   return 0;
+
prepare_commit_graph();
-   if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
-   return fill_commit_in_graph(item, commit_graph, pos);
+   if (commit_graph)
+   return parse_commit_in_graph_one(commit_graph, item);
return 0;
 }
 
-- 
2.16.2.329.gfb62395de6



[PATCH v3 04/20] commit: force commit to parse from object database

2018-05-24 Thread Derrick Stolee
In anticipation of verifying commit-graph file contents against the
object database, create parse_commit_internal() to allow side-stepping
the commit-graph file and parse directly from the object database.

Due to the use of generation numbers, this method should not be called
unless the intention is explicit in avoiding commits from the
commit-graph file.

Signed-off-by: Derrick Stolee 
---
 commit.c | 9 +++--
 commit.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/commit.c b/commit.c
index 1d28677dfb..6eaed0174c 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void 
*buffer, unsigned long s
return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
use_commit_graph)
 {
enum object_type type;
void *buffer;
@@ -403,7 +403,7 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return -1;
if (item->object.parsed)
return 0;
-   if (parse_commit_in_graph(item))
+   if (use_commit_graph && parse_commit_in_graph(item))
return 0;
buffer = read_sha1_file(item->object.oid.hash, &type, &size);
if (!buffer)
@@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+   return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
@@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char 
*name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char 
*ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.16.2.329.gfb62395de6



[PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc'

2018-05-24 Thread Derrick Stolee
Thanks for all the feedback on v2. I've tried to make this round's
review a bit easier by splitting up the commits into smaller pieces.
Also, the test script now has less boilerplate and uses variables and
clear arithmetic to explain which bytes are being modified.

One other change worth mentioning: in "commit-graph: add '--reachable'
option" I put the ref-iteration into a new external
'write_commit_graph_reachable()' method inside commit-graph.c. This
makes the 'gc: automatically write commit-graph files' a simpler change.

Thanks,
-Stolee

Derrick Stolee (20):
  commit-graph: UNLEAK before die()
  commit-graph: fix GRAPH_MIN_SIZE
  commit-graph: parse commit from chosen graph
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: add 'verify' subcommand
  commit-graph: verify catches corrupt signature
  commit-graph: verify required chunks are present
  commit-graph: verify corrupt OID fanout and lookup
  commit-graph: verify objects exist
  commit-graph: verify root tree OIDs
  commit-graph: verify parent list
  commit-graph: verify generation number
  commit-graph: verify commit date
  commit-graph: test for corrupted octopus edge
  commit-graph: verify contents match checksum
  fsck: verify commit-graph
  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  commit-graph: update design document

 Documentation/config.txt |   6 +
 Documentation/git-commit-graph.txt   |  14 +-
 Documentation/git-fsck.txt   |   3 +
 Documentation/git-gc.txt |   4 +
 Documentation/technical/commit-graph.txt |  22 ---
 builtin/commit-graph.c   |  59 +++-
 builtin/fsck.c   |  21 +++
 builtin/gc.c |   6 +
 commit-graph.c   | 234 +--
 commit-graph.h   |   3 +
 commit.c |   9 +-
 commit.h |   1 +
 t/t5318-commit-graph.sh  | 196 ++
 13 files changed, 539 insertions(+), 39 deletions(-)


base-commit: 34fdd433396ee0e3ef4de02eb2189f8226eafe4e
-- 
2.16.2.329.gfb62395de6



[PATCH v3 06/20] commit-graph: add 'verify' subcommand

2018-05-24 Thread Derrick Stolee
If the commit-graph file becomes corrupt, we need a way to verify
that its contents match the object database. In the manner of
'git fsck' we will implement a 'git commit-graph verify' subcommand
to report all issues with the file.

Add the 'verify' subcommand to the 'commit-graph' builtin and its
documentation. The subcommand is currently a no-op except for
loading the commit-graph into memory, which may trigger run-time
errors that would be caught by normal use. Add a simple test that
ensures the command returns a zero error code.

If no commit-graph file exists, this is an acceptable state. Do
not report any errors.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-commit-graph.txt |  6 ++
 builtin/commit-graph.c | 38 ++
 commit-graph.c | 26 ++
 commit-graph.h |  2 ++
 t/t5318-commit-graph.sh| 10 ++
 5 files changed, 82 insertions(+)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 4c97b555cc..a222cfab08 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 
 [verse]
 'git commit-graph read' [--object-dir ]
+'git commit-graph verify' [--object-dir ]
 'git commit-graph write'  [--object-dir ]
 
 
@@ -52,6 +53,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'verify'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to check for corrupted data.
+
 
 EXAMPLES
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f0875b8bf3..0433dd6e20 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,10 +8,16 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
+   N_("git commit-graph verify [--object-dir ]"),
N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
+static const char * const builtin_commit_graph_verify_usage[] = {
+   N_("git commit-graph verify [--object-dir ]"),
+   NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
N_("git commit-graph read [--object-dir ]"),
NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
int append;
 } opts;
 
+
+static int graph_verify(int argc, const char **argv)
+{
+   struct commit_graph *graph = 0;
+   char *graph_name;
+
+   static struct option builtin_commit_graph_verify_options[] = {
+   OPT_STRING(0, "object-dir", &opts.obj_dir,
+  N_("dir"),
+  N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_verify_options,
+builtin_commit_graph_verify_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   graph_name = get_commit_graph_filename(opts.obj_dir);
+   graph = load_commit_graph_one(graph_name);
+   FREE_AND_NULL(graph_name);
+
+   if (!graph)
+   return 0;
+
+   return verify_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
struct commit_graph *graph = NULL;
@@ -163,6 +199,8 @@ int cmd_commit_graph(int argc, const char **argv, const 
char *prefix)
 PARSE_OPT_STOP_AT_NON_OPTION);
 
if (argc > 0) {
+   if (!strcmp(argv[0], "verify"))
+   return graph_verify(argc, argv);
if (!strcmp(argv[0], "read"))
return graph_read(argc, argv);
if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index 25893ec096..55b41664ee 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -836,3 +836,29 @@ void write_commit_graph(const char *obj_dir,
oids.alloc = 0;
oids.nr = 0;
 }
+
+static int verify_commit_graph_error;
+
+static void graph_report(const char *fmt, ...)
+{
+   va_list ap;
+   struct strbuf sb = STRBUF_INIT;
+   verify_commit_graph_error = 1;
+
+   va_start(ap, fmt);
+   strbuf_vaddf(&sb, fmt, ap);
+
+   fprintf(stderr, "%s\n", sb.buf);
+   strbuf_release(&sb);
+   va_end(ap);
+}
+
+int verify_commit_graph(struct commit_graph *g)
+{
+   if (!g) {
+   graph_report("no commit-graph file loaded");
+   return 

[PATCH v3 12/20] commit-graph: verify parent list

2018-05-24 Thread Derrick Stolee
The commit-graph file stores parents in a two-column portion of the
commit data chunk. If there is only one parent, then the second column
stores 0x to indicate no second parent.

The 'verify' subcommand checks the parent list for the commit loaded
from the commit-graph and the one parsed from the object database. Test
these checks for corrupt parents, too many parents, and wrong parents.

The octopus merge will be tested in a later commit.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c  | 25 +
 t/t5318-commit-graph.sh | 18 ++
 2 files changed, 43 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 19ea369fc6..fff22dc0c3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -921,6 +921,7 @@ int verify_commit_graph(struct commit_graph *g)
 
for (i = 0; i < g->num_commits; i++) {
struct commit *graph_commit, *odb_commit;
+   struct commit_list *graph_parents, *odb_parents;
 
hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
@@ -938,6 +939,30 @@ int verify_commit_graph(struct commit_graph *g)
 oid_to_hex(&cur_oid),
 
oid_to_hex(get_commit_tree_oid(graph_commit)),
 
oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+   graph_parents = graph_commit->parents;
+   odb_parents = odb_commit->parents;
+
+   while (graph_parents) {
+   if (odb_parents == NULL) {
+   graph_report("commit-graph parent list for 
commit %s is too long",
+oid_to_hex(&cur_oid));
+   break;
+   }
+
+   if (oidcmp(&graph_parents->item->object.oid, 
&odb_parents->item->object.oid))
+   graph_report("commit-graph parent for %s is %s 
!= %s",
+oid_to_hex(&cur_oid),
+
oid_to_hex(&graph_parents->item->object.oid),
+
oid_to_hex(&odb_parents->item->object.oid));
+
+   graph_parents = graph_parents->next;
+   odb_parents = odb_parents->next;
+   }
+
+   if (odb_parents != NULL)
+   graph_report("commit-graph parent list for commit %s 
terminates early",
+oid_to_hex(&cur_oid));
}
 
return verify_commit_graph_error;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 21cc8e82f3..12f0d7f54d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -269,6 +269,9 @@ GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET 
+ $HASH_LEN \* 8`
 GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 
+ 10`
 GRAPH_COMMIT_DATA_OFFSET=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 
$NUM_COMMITS`
 GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
+GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN`
+GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
+GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
 
 # usage: corrupt_graph_and_verify   
 # Manipulates the commit-graph file at the position
@@ -348,4 +351,19 @@ test_expect_success 'detect incorrect tree OID' '
"root tree OID for commit"
 '
 
+test_expect_success 'detect incorrect parent int-id' '
+   corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_PARENT "\01" \
+   "invalid parent"
+'
+
+test_expect_success 'detect extra parent int-id' '
+   corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_EXTRA_PARENT "\00" \
+   "is too long"
+'
+
+test_expect_success 'detect incorrect tree OID' '
+   corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_WRONG_PARENT "\01" \
+   "commit-graph parent for"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v3 01/20] commit-graph: UNLEAK before die()

2018-05-24 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 builtin/commit-graph.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..f0875b8bf3 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -51,8 +51,11 @@ static int graph_read(int argc, const char **argv)
graph_name = get_commit_graph_filename(opts.obj_dir);
graph = load_commit_graph_one(graph_name);
 
-   if (!graph)
+   if (!graph) {
+   UNLEAK(graph_name);
die("graph file %s does not exist", graph_name);
+   }
+
FREE_AND_NULL(graph_name);
 
printf("header: %08x %d %d %d %d\n",
-- 
2.16.2.329.gfb62395de6



[PATCH v3 14/20] commit-graph: verify commit date

2018-05-24 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 commit-graph.c  | 6 ++
 t/t5318-commit-graph.sh | 6 ++
 2 files changed, 12 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index ead92460c1..d2b291aca2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -981,6 +981,12 @@ int verify_commit_graph(struct commit_graph *g)
 oid_to_hex(&cur_oid),
 graph_commit->generation,
 max_generation + 1);
+
+   if (graph_commit->date != odb_commit->date)
+   graph_report("commit date for commit %s in commit-graph 
is %"PRItime" != %"PRItime,
+oid_to_hex(&cur_oid),
+graph_commit->date,
+odb_commit->date);
}
 
return verify_commit_graph_error;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 673b0d37d5..58adb8246d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -273,6 +273,7 @@ GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + 
$HASH_LEN`
 GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
 GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
 GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8`
+GRAPH_BYTE_COMMIT_DATE=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12`
 
 # usage: corrupt_graph_and_verify   
 # Manipulates the commit-graph file at the position
@@ -372,4 +373,9 @@ test_expect_success 'detect incorrect generation number' '
"generation"
 '
 
+test_expect_success 'detect incorrect commit date' '
+   corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_DATE "\01" \
+   "commit date"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v3 11/20] commit-graph: verify root tree OIDs

2018-05-24 Thread Derrick Stolee
The 'verify' subcommand must compare the commit content parsed from the
commit-graph and compare it against the content in the object database.
Use lookup_commit() and parse_commit_in_graph_one() to parse the commits
from the graph and compare against a commit that is loaded separately
and parsed directly from the object database.

Add checks for the root tree OID.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c  | 17 -
 t/t5318-commit-graph.sh |  7 +++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0420ebcd87..19ea369fc6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -880,6 +880,8 @@ int verify_commit_graph(struct commit_graph *g)
return verify_commit_graph_error;
 
for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit;
+
hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
@@ -897,6 +899,11 @@ int verify_commit_graph(struct commit_graph *g)
 
cur_fanout_pos++;
}
+
+   graph_commit = lookup_commit(&cur_oid);
+   if (!parse_commit_in_graph_one(g, graph_commit))
+   graph_report("failed to parse %s from commit-graph",
+oid_to_hex(&cur_oid));
}
 
while (cur_fanout_pos < 256) {
@@ -913,16 +920,24 @@ int verify_commit_graph(struct commit_graph *g)
return verify_commit_graph_error;
 
for (i = 0; i < g->num_commits; i++) {
-   struct commit *odb_commit;
+   struct commit *graph_commit, *odb_commit;
 
hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
+   graph_commit = lookup_commit(&cur_oid);
odb_commit = (struct commit *)create_object(cur_oid.hash, 
alloc_commit_node());
if (parse_commit_internal(odb_commit, 0, 0)) {
graph_report("failed to parse %s from object database",
 oid_to_hex(&cur_oid));
continue;
}
+
+   if (oidcmp(&get_commit_tree_in_graph_one(g, 
graph_commit)->object.oid,
+  get_commit_tree_oid(odb_commit)))
+   graph_report("root tree OID for commit %s in 
commit-graph is %s != %s",
+oid_to_hex(&cur_oid),
+
oid_to_hex(get_commit_tree_oid(graph_commit)),
+
oid_to_hex(get_commit_tree_oid(odb_commit)));
}
 
return verify_commit_graph_error;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 996a016239..21cc8e82f3 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -267,6 +267,8 @@ GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
 GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
 GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
 GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 
+ 10`
+GRAPH_COMMIT_DATA_OFFSET=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 
$NUM_COMMITS`
+GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
 
 # usage: corrupt_graph_and_verify   
 # Manipulates the commit-graph file at the position
@@ -341,4 +343,9 @@ test_expect_success 'detect OID not in object database' '
"from object database"
 '
 
+test_expect_success 'detect incorrect tree OID' '
+   corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_TREE "\01" \
+   "root tree OID for commit"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v3 10/20] commit-graph: verify objects exist

2018-05-24 Thread Derrick Stolee
In the 'verify' subcommand, load commits directly from the object
database to ensure they exist. Parse by skipping the commit-graph.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c  | 20 
 t/t5318-commit-graph.sh |  7 +++
 2 files changed, 27 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index cbd1aae514..0420ebcd87 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -238,6 +238,10 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
 {
struct commit *c;
struct object_id oid;
+
+   if (pos >= g->num_commits)
+   die("invalid parent position %"PRIu64, pos);
+
hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
c = lookup_commit(&oid);
if (!c)
@@ -905,5 +909,21 @@ int verify_commit_graph(struct commit_graph *g)
cur_fanout_pos++;
}
 
+   if (verify_commit_graph_error)
+   return verify_commit_graph_error;
+
+   for (i = 0; i < g->num_commits; i++) {
+   struct commit *odb_commit;
+
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   odb_commit = (struct commit *)create_object(cur_oid.hash, 
alloc_commit_node());
+   if (parse_commit_internal(odb_commit, 0, 0)) {
+   graph_report("failed to parse %s from object database",
+oid_to_hex(&cur_oid));
+   continue;
+   }
+   }
+
return verify_commit_graph_error;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index c050ef980b..996a016239 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' '
git commit-graph verify >output
 '
 
+NUM_COMMITS=9
 HASH_LEN=20
 GRAPH_BYTE_VERSION=4
 GRAPH_BYTE_HASH=5
@@ -265,6 +266,7 @@ GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4`
 GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
 GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
 GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
+GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 
+ 10`
 
 # usage: corrupt_graph_and_verify   
 # Manipulates the commit-graph file at the position
@@ -334,4 +336,9 @@ test_expect_success 'detect incorrect OID order' '
"incorrect OID order"
 '
 
+test_expect_success 'detect OID not in object database' '
+   corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_MISSING "\01" \
+   "from object database"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v3 19/20] gc: automatically write commit-graph files

2018-05-24 Thread Derrick Stolee
The commit-graph file is a very helpful feature for speeding up git
operations. In order to make it more useful, write the commit-graph file
by default during standard garbage collection operations.

Add a 'gc.commitGraph' config setting that triggers writing a
commit-graph file after any non-trivial 'git gc' command. Defaults to
false while the commit-graph feature matures. We specifically do not
want to turn this on by default until the commit-graph feature is fully
integrated with history-modifying features like shallow clones.

Signed-off-by: Derrick Stolee 
---
 Documentation/config.txt |  6 ++
 Documentation/git-gc.txt |  4 
 builtin/gc.c |  6 ++
 t/t5318-commit-graph.sh  | 14 ++
 4 files changed, 30 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 11f027194e..9a3abd87e7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1553,6 +1553,12 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
+gc.commitGraph::
+   If true, then gc will rewrite the commit-graph file after any
+   change to the object database. If '--auto' is used, then the
+   commit-graph will not be updated unless the threshold is met.
+   See linkgit:git-commit-graph[1] for details.
+
 gc.logExpiry::
If the file gc.log exists, then `git gc --auto` won't run
unless that file is more than 'gc.logExpiry' old.  Default is
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..17dd654a59 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` 
determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable 'gc.commitGraph' determines if
+'git gc' runs 'git commit-graph write'. This can be set to a boolean
+value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified.  The larger
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..efd214a59f 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -20,6 +20,7 @@
 #include "argv-array.h"
 #include "commit.h"
 #include "packfile.h"
+#include "commit-graph.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -34,6 +35,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_commit_graph = 0;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -121,6 +123,7 @@ static void gc_config(void)
git_config_get_int("gc.aggressivedepth", &aggressive_depth);
git_config_get_int("gc.auto", &gc_auto_threshold);
git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
+   git_config_get_bool("gc.commitgraph", &gc_commit_graph);
git_config_get_bool("gc.autodetach", &detach_auto);
git_config_get_expiry("gc.pruneexpire", &prune_expire);
git_config_get_expiry("gc.worktreepruneexpire", 
&prune_worktrees_expire);
@@ -480,6 +483,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
if (pack_garbage.nr > 0)
clean_pack_garbage();
 
+   if (gc_commit_graph)
+   write_commit_graph_reachable(get_object_directory(), 0);
+
if (auto_gc && too_many_loose_objects())
warning(_("There are too many unreachable loose objects; "
"run 'git prune' to remove them."));
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a659620332..d20b17586f 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -245,6 +245,20 @@ test_expect_success 'perform fast-forward merge in full 
repo' '
test_cmp expect output
 '
 
+test_expect_success 'check that gc clears commit-graph' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git commit --allow-empty -m "blank" &&
+   git commit-graph write --reachable &&
+   cp $objdir/info/commit-graph commit-graph-before-gc &&
+   git reset --hard HEAD~1 &&
+   git config gc.commitGraph true &&
+   git gc &&
+   cp $objdir/info/commit-graph commit-graph-after-gc &&
+   ! test_cmp commit-graph-before-gc commit-graph-after-gc &&
+   git commit-graph write --reachable &&
+   test_cmp commit-graph-after-gc $objdir/info/commit-graph
+'
+
 # the verify tests below expect the commit-graph to contain
 # exactly the commits reachable from the commits/8 branch.
 # If the file changes the set of commits in the list, then the
-- 
2.16.2.329.gfb62395de6



[PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup

2018-05-24 Thread Derrick Stolee
In the commit-graph file, the OID fanout chunk provides an index into
the OID lookup. The 'verify' subcommand should find incorrect values
in the fanout.

Similarly, the 'verify' subcommand should find out-of-order values in
the OID lookup.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c  | 36 
 t/t5318-commit-graph.sh | 22 ++
 2 files changed, 58 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 06e3e4f9ba..cbd1aae514 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -855,6 +855,9 @@ static void graph_report(const char *fmt, ...)
 
 int verify_commit_graph(struct commit_graph *g)
 {
+   uint32_t i, cur_fanout_pos = 0;
+   struct object_id prev_oid, cur_oid;
+
if (!g) {
graph_report("no commit-graph file loaded");
return 1;
@@ -869,5 +872,38 @@ int verify_commit_graph(struct commit_graph *g)
if (!g->chunk_commit_data)
graph_report("commit-graph is missing the Commit Data chunk");
 
+   if (verify_commit_graph_error)
+   return verify_commit_graph_error;
+
+   for (i = 0; i < g->num_commits; i++) {
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
+   graph_report("commit-graph has incorrect OID order: %s 
then %s",
+oid_to_hex(&prev_oid),
+oid_to_hex(&cur_oid));
+
+   oidcpy(&prev_oid, &cur_oid);
+
+   while (cur_oid.hash[0] > cur_fanout_pos) {
+   uint32_t fanout_value = get_be32(g->chunk_oid_fanout + 
cur_fanout_pos);
+   if (i != fanout_value)
+   graph_report("commit-graph has incorrect fanout 
value: fanout[%d] = %u != %u",
+cur_fanout_pos, fanout_value, i);
+
+   cur_fanout_pos++;
+   }
+   }
+
+   while (cur_fanout_pos < 256) {
+   uint32_t fanout_value = get_be32(g->chunk_oid_fanout + 
cur_fanout_pos);
+
+   if (g->num_commits != fanout_value)
+   graph_report("commit-graph has incorrect fanout value: 
fanout[%d] = %u != %u",
+cur_fanout_pos, fanout_value, i);
+
+   cur_fanout_pos++;
+   }
+
return verify_commit_graph_error;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 4ef3fe3dc2..c050ef980b 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' '
git commit-graph verify >output
 '
 
+HASH_LEN=20
 GRAPH_BYTE_VERSION=4
 GRAPH_BYTE_HASH=5
 GRAPH_BYTE_CHUNK_COUNT=6
@@ -258,6 +259,12 @@ GRAPH_BYTE_OID_LOOKUP_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET 
+ \
  1 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
 GRAPH_BYTE_COMMIT_DATA_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
2 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
+GRAPH_FANOUT_OFFSET=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
+ $GRAPH_CHUNK_LOOKUP_WIDTH \* $GRAPH_CHUNK_LOOKUP_ROWS`
+GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4`
+GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
+GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
+GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
 
 # usage: corrupt_graph_and_verify   
 # Manipulates the commit-graph file at the position
@@ -312,4 +319,19 @@ test_expect_success 'detect missing commit data chunk' '
"missing the Commit Data chunk"
 '
 
+test_expect_success 'detect incorrect fanout' '
+   corrupt_graph_and_verify $GRAPH_BYTE_FANOUT1 "\01" \
+   "fanout value"
+'
+
+test_expect_success 'detect incorrect fanout' '
+   corrupt_graph_and_verify $GRAPH_BYTE_FANOUT2 "\01" \
+   "fanout value"
+'
+
+test_expect_success 'detect incorrect OID order' '
+   corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_ORDER "\01" \
+   "incorrect OID order"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v3 16/20] commit-graph: verify contents match checksum

2018-05-24 Thread Derrick Stolee
The commit-graph file ends with a SHA1 hash of the previous contents. If
a commit-graph file has errors but the checksum hash is correct, then we
know that the problem is a bug in Git and not simply file corruption
after-the-fact.

Compute the checksum right away so it is the first error that appears,
and make the message translatable since this error can be "corrected" by
a user by simply deleting the file and recomputing. The rest of the
errors are useful only to developers.

Be sure to continue checking the rest of the file data if the checksum
is wrong. This is important for our tests, as we break the checksum as
we modify bytes of the commit-graph file.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c  | 16 ++--
 t/t5318-commit-graph.sh |  6 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d2b291aca2..a33600c584 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -841,6 +841,7 @@ void write_commit_graph(const char *obj_dir,
oids.nr = 0;
 }
 
+#define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
 static int verify_commit_graph_error;
 
 static void graph_report(const char *fmt, ...)
@@ -860,7 +861,9 @@ static void graph_report(const char *fmt, ...)
 int verify_commit_graph(struct commit_graph *g)
 {
uint32_t i, cur_fanout_pos = 0;
-   struct object_id prev_oid, cur_oid;
+   struct object_id prev_oid, cur_oid, checksum;
+   struct hashfile *f;
+   int devnull;
 
if (!g) {
graph_report("no commit-graph file loaded");
@@ -879,6 +882,15 @@ int verify_commit_graph(struct commit_graph *g)
if (verify_commit_graph_error)
return verify_commit_graph_error;
 
+   devnull = open("/dev/null", O_WRONLY);
+   f = hashfd(devnull, NULL);
+   hashwrite(f, g->data, g->data_len - g->hash_len);
+   finalize_hashfile(f, checksum.hash, CSUM_CLOSE);
+   if (hashcmp(checksum.hash, g->data + g->data_len - g->hash_len)) {
+   graph_report(_("the commit-graph file has incorrect checksum 
and is likely corrupt"));
+   verify_commit_graph_error = VERIFY_COMMIT_GRAPH_ERROR_HASH;
+   }
+
for (i = 0; i < g->num_commits; i++) {
struct commit *graph_commit;
 
@@ -916,7 +928,7 @@ int verify_commit_graph(struct commit_graph *g)
cur_fanout_pos++;
}
 
-   if (verify_commit_graph_error)
+   if (verify_commit_graph_error & ~VERIFY_COMMIT_GRAPH_ERROR_HASH)
return verify_commit_graph_error;
 
for (i = 0; i < g->num_commits; i++) {
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 240aef6add..2680a2ebff 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -279,6 +279,7 @@ GRAPH_COMMIT_DATA_WIDTH=`expr $HASH_LEN + 16`
 GRAPH_OCTOPUS_DATA_OFFSET=`expr $GRAPH_COMMIT_DATA_OFFSET + \
$GRAPH_COMMIT_DATA_WIDTH \* $NUM_COMMITS`
 GRAPH_BYTE_OCTOPUS=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4`
+GRAPH_BYTE_FOOTER=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4 \* $NUM_OCTOPUS_EDGES`
 
 # usage: corrupt_graph_and_verify   
 # Manipulates the commit-graph file at the position
@@ -388,4 +389,9 @@ test_expect_success 'detect incorrect parent for octopus 
merge' '
"invalid parent"
 '
 
+test_expect_success 'detect invalid checksum hash' '
+   corrupt_graph_and_verify $GRAPH_BYTE_FOOTER "\00" \
+   "incorrect checksum"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v3 02/20] commit-graph: fix GRAPH_MIN_SIZE

2018-05-24 Thread Derrick Stolee
The GRAPH_MIN_SIZE macro should be the smallest size of a parsable
commit-graph file. However, the minimum number of chunks was wrong.
It is possible to write a commit-graph file with zero commits, and
that violates this macro's value.

Rewrite the macro, and use extra macros to better explain the magic
constants.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..82295f0975 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -33,10 +33,11 @@
 
 #define GRAPH_LAST_EDGE 0x8000
 
+#define GRAPH_HEADER_SIZE 8
 #define GRAPH_FANOUT_SIZE (4 * 256)
 #define GRAPH_CHUNKLOOKUP_WIDTH 12
-#define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \
-   GRAPH_OID_LEN + 8)
+#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
+   + GRAPH_FANOUT_SIZE + GRAPH_OID_LEN)
 
 char *get_commit_graph_filename(const char *obj_dir)
 {
-- 
2.16.2.329.gfb62395de6



[PATCH v3 05/20] commit-graph: load a root tree from specific graph

2018-05-24 Thread Derrick Stolee
When lazy-loading a tree for a commit, it will be important to select
the tree from a specific struct commit_graph. Create a new method that
specifies the commit-graph file and use that in
get_commit_tree_in_graph().

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 78ba0edc80..25893ec096 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -358,14 +358,20 @@ static struct tree *load_tree_for_commit(struct 
commit_graph *g, struct commit *
return c->maybe_tree;
 }
 
-struct tree *get_commit_tree_in_graph(const struct commit *c)
+static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
+const struct commit *c)
 {
if (c->maybe_tree)
return c->maybe_tree;
if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
-   BUG("get_commit_tree_in_graph called from non-commit-graph 
commit");
+   BUG("get_commit_tree_in_graph_one called from non-commit-graph 
commit");
+
+   return load_tree_for_commit(g, (struct commit *)c);
+}
 
-   return load_tree_for_commit(commit_graph, (struct commit *)c);
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+   return get_commit_tree_in_graph_one(commit_graph, c);
 }
 
 static void write_graph_chunk_fanout(struct hashfile *f,
-- 
2.16.2.329.gfb62395de6



[PATCH v3 08/20] commit-graph: verify required chunks are present

2018-05-24 Thread Derrick Stolee
The commit-graph file requires the following three chunks:

* OID Fanout
* OID Lookup
* Commit Data

If any of these are missing, then the 'verify' subcommand should
report a failure. This includes the chunk IDs malformed or the
chunk count is truncated.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c  |  9 +
 t/t5318-commit-graph.sh | 29 +
 2 files changed, 38 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 55b41664ee..06e3e4f9ba 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -860,5 +860,14 @@ int verify_commit_graph(struct commit_graph *g)
return 1;
}
 
+   verify_commit_graph_error = 0;
+
+   if (!g->chunk_oid_fanout)
+   graph_report("commit-graph is missing the OID Fanout chunk");
+   if (!g->chunk_oid_lookup)
+   graph_report("commit-graph is missing the OID Lookup chunk");
+   if (!g->chunk_commit_data)
+   graph_report("commit-graph is missing the Commit Data chunk");
+
return verify_commit_graph_error;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index bd64481c7a..4ef3fe3dc2 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -249,6 +249,15 @@ test_expect_success 'git commit-graph verify' '
 
 GRAPH_BYTE_VERSION=4
 GRAPH_BYTE_HASH=5
+GRAPH_BYTE_CHUNK_COUNT=6
+GRAPH_CHUNK_LOOKUP_OFFSET=8
+GRAPH_CHUNK_LOOKUP_WIDTH=12
+GRAPH_CHUNK_LOOKUP_ROWS=5
+GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
+GRAPH_BYTE_OID_LOOKUP_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
+ 1 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
+GRAPH_BYTE_COMMIT_DATA_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
+   2 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
 
 # usage: corrupt_graph_and_verify   
 # Manipulates the commit-graph file at the position
@@ -283,4 +292,24 @@ test_expect_success 'detect bad hash version' '
"hash version"
 '
 
+test_expect_success 'detect bad chunk count' '
+   corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\02" \
+   "missing the Commit Data chunk"
+'
+
+test_expect_success 'detect missing OID fanout chunk' '
+   corrupt_graph_and_verify $GRAPH_BYTE_OID_FANOUT_ID "\0" \
+   "missing the OID Fanout chunk"
+'
+
+test_expect_success 'detect missing OID lookup chunk' '
+   corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_ID "\0" \
+   "missing the OID Lookup chunk"
+'
+
+test_expect_success 'detect missing commit data chunk' '
+   corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_DATA_ID "\0" \
+   "missing the Commit Data chunk"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v3 20/20] commit-graph: update design document

2018-05-24 Thread Derrick Stolee
The commit-graph feature is now integrated with 'fsck' and 'gc',
so remove those items from the "Future Work" section of the
commit-graph design document.

Also remove the section on lazy-loading trees, as that was completed
in an earlier patch series.

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/commit-graph.txt | 22 --
 1 file changed, 22 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index e1a883eb46..c664acbd76 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -118,9 +118,6 @@ Future Work
 - The commit graph feature currently does not honor commit grafts. This can
   be remedied by duplicating or refactoring the current graft logic.
 
-- The 'commit-graph' subcommand does not have a "verify" mode that is
-  necessary for integration with fsck.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
@@ -130,25 +127,6 @@ Future Work
 - 'log --topo-order'
 - 'tag --merged'
 
-- Currently, parse_commit_gently() requires filling in the root tree
-  object for a commit. This passes through lookup_tree() and consequently
-  lookup_object(). Also, it calls lookup_commit() when loading the parents.
-  These method calls check the ODB for object existence, even if the
-  consumer does not need the content. For example, we do not need the
-  tree contents when computing merge bases. Now that commit parsing is
-  removed from the computation time, these lookup operations are the
-  slowest operations keeping graph walks from being fast. Consider
-  loading these objects without verifying their existence in the ODB and
-  only loading them fully when consumers need them. Consider a method
-  such as "ensure_tree_loaded(commit)" that fully loads a tree before
-  using commit->tree.
-
-- The current design uses the 'commit-graph' subcommand to generate the graph.
-  When this feature stabilizes enough to recommend to most users, we should
-  add automatic graph writes to common operations that create many commits.
-  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
-  commands.
-
 - A server could provide a commit graph file as part of the network protocol
   to avoid extra calculations by clients. This feature is only of benefit if
   the user is willing to trust the file, because verifying the file is correct
-- 
2.16.2.329.gfb62395de6



[PATCH v3 15/20] commit-graph: test for corrupted octopus edge

2018-05-24 Thread Derrick Stolee
The commit-graph file has an extra chunk to store the parent int-ids for
parents beyond the first parent for octopus merges. Our test repo has a
single octopus merge that we can manipulate to demonstrate the 'verify'
subcommand detects incorrect values in that chunk.

Signed-off-by: Derrick Stolee 
---
 t/t5318-commit-graph.sh | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 58adb8246d..240aef6add 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -248,6 +248,7 @@ test_expect_success 'git commit-graph verify' '
 '
 
 NUM_COMMITS=9
+NUM_OCTOPUS_EDGES=2
 HASH_LEN=20
 GRAPH_BYTE_VERSION=4
 GRAPH_BYTE_HASH=5
@@ -274,6 +275,10 @@ GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr 
$GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
 GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
 GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8`
 GRAPH_BYTE_COMMIT_DATE=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12`
+GRAPH_COMMIT_DATA_WIDTH=`expr $HASH_LEN + 16`
+GRAPH_OCTOPUS_DATA_OFFSET=`expr $GRAPH_COMMIT_DATA_OFFSET + \
+   $GRAPH_COMMIT_DATA_WIDTH \* $NUM_COMMITS`
+GRAPH_BYTE_OCTOPUS=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4`
 
 # usage: corrupt_graph_and_verify   
 # Manipulates the commit-graph file at the position
@@ -378,4 +383,9 @@ test_expect_success 'detect incorrect commit date' '
"commit date"
 '
 
+test_expect_success 'detect incorrect parent for octopus merge' '
+   corrupt_graph_and_verify $GRAPH_BYTE_OCTOPUS "\01" \
+   "invalid parent"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v3 17/20] fsck: verify commit-graph

2018-05-24 Thread Derrick Stolee
If core.commitGraph is true, verify the contents of the commit-graph
during 'git fsck' using the 'git commit-graph verify' subcommand. Run
this check on all alternates, as well.

We use a new process for two reasons:

1. The subcommand decouples the details of loading and verifying a
   commit-graph file from the other fsck details.

2. The commit-graph verification requires the commits to be loaded
   in a specific order to guarantee we parse from the commit-graph
   file for some objects and from the object database for others.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-fsck.txt |  3 +++
 builtin/fsck.c | 21 +
 t/t5318-commit-graph.sh|  8 
 3 files changed, 32 insertions(+)

diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt
index b9f060e3b2..ab9a93fb9b 100644
--- a/Documentation/git-fsck.txt
+++ b/Documentation/git-fsck.txt
@@ -110,6 +110,9 @@ Any corrupt objects you will have to find in backups or 
other archives
 (i.e., you can just remove them and do an 'rsync' with some other site in
 the hopes that somebody else has the object you have corrupted).
 
+If core.commitGraph is true, the commit-graph file will also be inspected
+using 'git commit-graph verify'. See linkgit:git-commit-graph[1].
+
 Extracted Diagnostics
 -
 
diff --git a/builtin/fsck.c b/builtin/fsck.c
index ef78c6c00c..a6d5045b77 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -16,6 +16,7 @@
 #include "streaming.h"
 #include "decorate.h"
 #include "packfile.h"
+#include "run-command.h"
 
 #define REACHABLE 0x0001
 #define SEEN  0x0002
@@ -45,6 +46,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_COMMIT_GRAPH 020
 
 static const char *describe_object(struct object *obj)
 {
@@ -815,5 +817,24 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
}
 
check_connectivity();
+
+   if (core_commit_graph) {
+   struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
+   const char *verify_argv[] = { "commit-graph", "verify", NULL, 
NULL, NULL, NULL };
+   commit_graph_verify.argv = verify_argv;
+   commit_graph_verify.git_cmd = 1;
+
+   if (run_command(&commit_graph_verify))
+   errors_found |= ERROR_COMMIT_GRAPH;
+
+   prepare_alt_odb();
+   for (alt = alt_odb_list; alt; alt = alt->next) {
+   verify_argv[2] = "--object-dir";
+   verify_argv[3] = alt->path;
+   if (run_command(&commit_graph_verify))
+   errors_found |= ERROR_COMMIT_GRAPH;
+   }
+   }
+
return errors_found;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2680a2ebff..4941937163 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -394,4 +394,12 @@ test_expect_success 'detect invalid checksum hash' '
"incorrect checksum"
 '
 
+test_expect_success 'git fsck (checks commit-graph)' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git fsck &&
+   corrupt_graph_and_verify $GRAPH_BYTE_FOOTER "\00" \
+   "incorrect checksum" &&
+   test_must_fail git fsck
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v3 18/20] commit-graph: add '--reachable' option

2018-05-24 Thread Derrick Stolee
When writing commit-graph files, it can be convenient to ask for all
reachable commits (starting at the ref set) in the resulting file. This
is particularly helpful when writing to stdin is complicated, such as a
future integration with 'git gc' which will call
write_commit_graph_reachable() after performing cleanup of the object
database.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-commit-graph.txt |  8 ++--
 builtin/commit-graph.c | 16 
 commit-graph.c | 32 
 commit-graph.h |  1 +
 t/t5318-commit-graph.sh| 10 ++
 5 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index a222cfab08..dececb79d7 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in 
packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified pack-indexes. (Cannot be combined
-with --stdin-commits.)
+with `--stdin-commits` or `--reachable`.)
 +
 With the `--stdin-commits` option, generate the new commit graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
---stdin-packs.)
+`--stdin-packs` or `--reachable`.)
++
+With the `--reachable` option, generate the new commit graph by walking
+commits starting at all refs. (Cannot be combined with `--stdin-commits`
+or `--stdin-packs`.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 0433dd6e20..20ce6437ae 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@ static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
N_("git commit-graph verify [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -24,12 +24,13 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
+   int reachable;
int stdin_packs;
int stdin_commits;
int append;
@@ -130,6 +131,8 @@ static int graph_write(int argc, const char **argv)
OPT_STRING(0, "object-dir", &opts.obj_dir,
N_("dir"),
N_("The object directory to store the graph")),
+   OPT_BOOL(0, "reachable", &opts.reachable,
+   N_("start walk at all refs")),
OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
N_("scan pack-indexes listed by stdin for commits")),
OPT_BOOL(0, "stdin-commits", &opts.stdin_commits,
@@ -143,11 +146,16 @@ static int graph_write(int argc, const char **argv)
 builtin_commit_graph_write_options,
 builtin_commit_graph_write_usage, 0);
 
-   if (opts.stdin_packs && opts.stdin_commits)
-   die(_("cannot use both --stdin-commits and --stdin-packs"));
+   if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
+   die(_("use at most one of --reachable, --stdin-commits, or 
--stdin-packs"));
if (!opts.obj_dir)
opts.obj_dir = get_object_directory();
 
+   if (opts.reachable) {
+   write_commit_graph_reachable(opts.obj_dir, opts.append);
+   return 0;
+   }
+
if (opts.stdin_packs || opts.stdin_commits) {
struct strbuf buf = STRBUF_INIT;
lines_nr = 0;
diff --git a/commit-graph.c b/commit-graph.c
index a33600c584..057d734926 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -6,6 +6,7 @@
 #include "packfile.h"
 #include "commit.h"
 #include "object.h"
+#include "refs.h"
 #include "revision.h"
 #include "sha1-lookup.h"
 #include "commit-graph.h"
@@ -651,6 +652,37 @@ static void compute_generation_numbers(struct 
packed_commit_list* commits)
  

[PATCH v3 13/20] commit-graph: verify generation number

2018-05-24 Thread Derrick Stolee
While iterating through the commit parents, perform the generation
number calculation and compare against the value stored in the
commit-graph.

The tests demonstrate that having a different set of parents affects
the generation number calculation, and this value propagates to
descendants. Hence, we drop the single-line condition on the output.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c  | 18 ++
 t/t5318-commit-graph.sh |  6 ++
 2 files changed, 24 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index fff22dc0c3..ead92460c1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -922,6 +922,7 @@ int verify_commit_graph(struct commit_graph *g)
for (i = 0; i < g->num_commits; i++) {
struct commit *graph_commit, *odb_commit;
struct commit_list *graph_parents, *odb_parents;
+   uint32_t max_generation = 0;
 
hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
@@ -956,6 +957,9 @@ int verify_commit_graph(struct commit_graph *g)
 
oid_to_hex(&graph_parents->item->object.oid),
 
oid_to_hex(&odb_parents->item->object.oid));
 
+   if (graph_parents->item->generation > max_generation)
+   max_generation = 
graph_parents->item->generation;
+
graph_parents = graph_parents->next;
odb_parents = odb_parents->next;
}
@@ -963,6 +967,20 @@ int verify_commit_graph(struct commit_graph *g)
if (odb_parents != NULL)
graph_report("commit-graph parent list for commit %s 
terminates early",
 oid_to_hex(&cur_oid));
+
+   /*
+* If one of our parents has generation GENERATION_NUMBER_MAX, 
then
+* our generation is also GENERATION_NUMBER_MAX. Decrement to 
avoid
+* extra logic in the following condition.
+*/
+   if (max_generation == GENERATION_NUMBER_MAX)
+   max_generation--;
+
+   if (graph_commit->generation != max_generation + 1)
+   graph_report("commit-graph generation for commit %s is 
%u != %u",
+oid_to_hex(&cur_oid),
+graph_commit->generation,
+max_generation + 1);
}
 
return verify_commit_graph_error;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 12f0d7f54d..673b0d37d5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -272,6 +272,7 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
 GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN`
 GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
 GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
+GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8`
 
 # usage: corrupt_graph_and_verify   
 # Manipulates the commit-graph file at the position
@@ -366,4 +367,9 @@ test_expect_success 'detect incorrect tree OID' '
"commit-graph parent for"
 '
 
+test_expect_success 'detect incorrect generation number' '
+   corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
+   "generation"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



Re: [PATCH v2 03/12] commit-graph: test that 'verify' finds corruption

2018-05-24 Thread Derrick Stolee

On 5/21/2018 2:53 PM, Jakub Narebski wrote:

+corrupt_data() {
+   file=$1
+   pos=$2
+   data="${3:-\0}"
+   printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}

First, if we do this that way (and not by adding a test helper), the use
of this function should be, I think, protected using appropriate test
prerequisite.  Not everyone has 'dd' tool installed, for example on
MS Windows.


Windows does not, but it is also missing many things this test suite 
needs. 'dd' is included in the Git for Windows SDK. I rebased this 
series onto Git for Windows and the tests passed when run in an SDK shell.


Thanks,
-Stolee


Re: [PATCH v3 01/20] commit-graph: UNLEAK before die()

2018-05-24 Thread Derrick Stolee

On 5/24/2018 6:47 PM, Stefan Beller wrote:

On Thu, May 24, 2018 at 9:25 AM, Derrick Stolee  wrote:

Signed-off-by: Derrick Stolee 
---
  builtin/commit-graph.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..f0875b8bf3 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -51,8 +51,11 @@ static int graph_read(int argc, const char **argv)
 graph_name = get_commit_graph_filename(opts.obj_dir);
 graph = load_commit_graph_one(graph_name);

-   if (!graph)
+   if (!graph) {
+   UNLEAK(graph_name);
 die("graph file %s does not exist", graph_name);

Unrelated to this patch: Is the command that ends up die()ing here
a plumbing or porcelain, or: Do we want to translate the message here?

In a lot of commands that show paths we single quote them '%s',
(speaking from experience with a lot of submodule path code)


This is for the 'git commit-graph read' command, which is plumbing (and 
'read' is really only for testing). I don't think this message requires 
translation.


I'll keep the quotes in mind for the future.

Thanks,

-Stolee



Re: [PATCH] t990X: use '.git/objects' as 'deep inside .git' path

2018-05-29 Thread Derrick Stolee

On 5/27/2018 12:49 AM, Michael Haggerty wrote:

On Sat, May 26, 2018 at 8:47 AM, Christian Couder
 wrote:

Tests t9902-completion.sh and t9903-bash-prompt.sh each have tests
that check what happens when we are "in the '.git' directory" and
when we are "deep inside the '.git' directory".

To test the case when we are "deep inside the '.git' directory" the
test scripts used to perform a `cd .git/refs/heads`.

As there are plans to implement other ref storage systems, let's
use '.git/objects' instead of '.git/refs/heads' as the "deep inside
the '.git' directory" path.

Seems reasonable to me. +1.

Michael

Looks good to me, too.

Thanks,
-Stolee


Re: [PATCH v3 03/20] commit-graph: parse commit from chosen graph

2018-05-29 Thread Derrick Stolee

On 5/27/2018 6:23 AM, Jakub Narebski wrote:

Derrick Stolee  writes:


Before verifying a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.

If I understand it properly the problem is that when verifying against
the object database we want to check one single commit-graph file, not
concatenation of data from commit-graph file for the repository and
commit-graph files from its alternates -- like prepare_commit_graph()
does; which is called by parse_commit_in_graph().


Signed-off-by: Derrick Stolee 

O.K., so you introduce here a layer of indirection; parse_commit_in_graph()
now just uses parse_commit_in_graph_one(), passing core_commit_graph
(or the_commit_graph) to it, after checking that core_commit_graph is set
(which handles the case when feature is not turned off) and loading
commit-graph file.

Nice and simple 'split function' refactoring, with new function taking
over when there is commit graph file prepared.


So, after the changes:
* parse_commit_in_graph() is responsible for checking whether to use
   commit-graph feature and ensuring that data from commit-graph is
   loaded, where it passes the control to parse_commit_in_graph_one()
* parse_commit_in_graph_one() checks whether commit-graph feature is
   turned on, whether commit we are interested in was already parsed,
   and then uses fill_commit_in_graph() to actually get the data
* fill_commit_in_graph() gets data out of commit-graph file, extracting
   it from commit data chunk (and if needed large edges chunk).

All those functions return 1 if they got data from commit-graph, and 0
if they didn't.


One minor nitpick / complaint / question is about naming of global
variables used here, namely:
* static struct commit_graph *commit_graph
   from commit-graph.c for global storage of commit-graph[s] data
* int core_commit_graph
   from environment.c for storing core.commitGraph config

But I see that at least the latter is common convention in Git source
code; I guess that the former maybe follows convention as used for "the
index" and "the repository" - additionally it is static / file-local.


See also `struct packed_git *packed_git;` from cache.h.




---
  commit-graph.c | 18 +++---
  1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 82295f0975..78ba0edc80 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -310,7 +310,7 @@ static int find_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
}
  }
  
-int parse_commit_in_graph(struct commit *item)

+static int parse_commit_in_graph_one(struct commit_graph *g, struct commit 
*item)
  {
uint32_t pos;
  
@@ -318,9 +318,21 @@ static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)

if (!core_commit_graph)
return 0;

All right, we check that commit-graph feature is enabled because
parse_commit_in_graph_one() will be used standalone, not by being
invoked from parse_commit_in_graph().

This check is fast.


if (item->object.parsed)
return 1;

Sidenote: I just wonder why object.parsed and not for example
object.graph_pos is used to checck whether the object was filled if
possible with commit-graph data...


Here we are filling all data, including commit date and parents. We use 
load_commit_graph_info() when we only need the graph_pos and generation 
values.





+
+   if (find_commit_in_graph(item, g, &pos))
+   return fill_commit_in_graph(item, g, pos);
+
+   return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
+   if (!core_commit_graph)
+   return 0;

All right, this check is here to short-circuit and make it so git does
not even try to lod commit-graph file[s] if the feature is disabled.


+
prepare_commit_graph();
-   if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
-   return fill_commit_in_graph(item, commit_graph, pos);
+   if (commit_graph)
+   return parse_commit_in_graph_one(commit_graph, item);
return 0;
  }




Re: [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc'

2018-05-29 Thread Derrick Stolee

On 5/29/2018 12:27 AM, Junio C Hamano wrote:

Derrick Stolee  writes:


Thanks for all the feedback on v2. I've tried to make this round's
review a bit easier by splitting up the commits into smaller pieces.
Also, the test script now has less boilerplate and uses variables and
clear arithmetic to explain which bytes are being modified.

One other change worth mentioning: in "commit-graph: add '--reachable'
option" I put the ref-iteration into a new external
'write_commit_graph_reachable()' method inside commit-graph.c. This
makes the 'gc: automatically write commit-graph files' a simpler change.

I finally managed to find time to resolve conflicts this topic has
with other topics (of your own included, if I am not mistaken).
Please double check the resolution when I push out the day's
integration result later today.


Sorry about the confusion. I was operating on my copy of 
ds/generation-numbers (34fdd43339) but fetching just now I see you 
updated that branch to 1472978ec6. I reset my local branch to 
ds/commit-graph-fsk (53dd1e6600). The only diff I see between my v3 
branch and that commit are the changes from ds/commit-graph-lockfile-fix 
(33286dcd6d).


Thanks,
-Stolee


Re: [PATCH v3 07/20] commit-graph: verify catches corrupt signature

2018-05-29 Thread Derrick Stolee

On 5/28/2018 10:05 AM, Jakub Narebski wrote:

Derrick Stolee  writes:


This is the first of several commits that add a test to check that
'git commit-graph verify' catches corruption in the commit-graph
file. The first test checks that the command catches an error in
the file signature. This is a check that exists in the existing
commit-graph reading code.

Good start.

This handles 3 out of 5 checks in load_commit_graph_one().  The
remaining are:
* too short file (length smaller than minimal commit-graph size)
* more than one chunk of one of 4 defined types


Add a helper method 'corrupt_graph_and_verify' to the test script
t5318-commit-graph.sh. This helper corrupts the commit-graph file
at a certain location, runs 'git commit-graph verify', and reports
the output to the 'err' file. This data is filtered to remove the
lines added by 'test_must_fail' when the test is run verbosely.
Then, the output is checked to contain a specific error message.

Thanks for an explanation.


Signed-off-by: Derrick Stolee 
---
  t/t5318-commit-graph.sh | 43 +++
  1 file changed, 43 insertions(+)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6ca451dfd2..bd64481c7a 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -235,9 +235,52 @@ test_expect_success 'perform fast-forward merge in full 
repo' '
test_cmp expect output
  '
  
+# the verify tests below expect the commit-graph to contain

+# exactly the commits reachable from the commits/8 branch.
+# If the file changes the set of commits in the list, then the
+# offsets into the binary file will result in different edits
+# and the tests will likely break.
+
  test_expect_success 'git commit-graph verify' '
cd "$TRASH_DIRECTORY/full" &&
+   git rev-parse commits/8 | git commit-graph write --stdin-commits &&
git commit-graph verify >output
  '

I don't quite understand what the change is meant to do.


This gives us a constant commit-graph file to work with in the later tests.

To get the "independent test" structure you want for the tests that are 
coming, we need to do one of the following:


1. Write a new commit-graph file for every test (slows things down).
2. Do all corruption/verify checks in a single test (reduces the 
information from a failed test, as it only reports the first failure).


I don't like either of these options, so I went with this "prepare" step.


Also, as I said earlier, I'd prefer if tests were as indpendent of each
other as possible, to make running individual tests (e.g. only
previously falling tests) easier.

I especially do not like mixing running actual test with setting up the
repository for future tests, as here.

  
+GRAPH_BYTE_VERSION=4

+GRAPH_BYTE_HASH=5
+
+# usage: corrupt_graph_and_verify   
+# Manipulates the commit-graph file at the position
+# by inserting the data, then runs 'git commit-graph verify'
+# and places the output in the file 'err'. Test 'err' for
+# the given string.

Very nice to have this description.


+corrupt_graph_and_verify() {
+   pos=$1
+   data="${2:-\0}"
+   grepstr=$3
+   cd "$TRASH_DIRECTORY/full" &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   printf "$data" | dd of="$objdir/info/commit-graph" bs=1 seek="$pos" 
conv=notrunc &&

Using 'printf' with octal is much more portable than relying on 'echo'
supporting octal escape sequences (or supporting escape sequences at
all).


+   test_must_fail git commit-graph verify 2>test_err &&
+   grep -v "^+" test_err >err
+   grep "$grepstr" err

Shouldn't this last 'grep' be 'test_i18ngrep' instead, to allow for
translated messages from 'git commit-graph verify' / 'git fsck'?


+}

This function makes actual tests short and simple, without duplicated
code.  Very good.


+
+test_expect_success 'detect bad signature' '
+   corrupt_graph_and_verify 0 "\0" \
+   "graph signature"
+'
+
+test_expect_success 'detect bad version' '
+   corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \
+   "graph version"
+'
+
+test_expect_success 'detect bad hash version' '
+   corrupt_graph_and_verify $GRAPH_BYTE_HASH "\02" \

When we move from SHA-1 (hash version 1) to NewHash (hash version 2),
this test would start failing... which is actually not a bad idea.


+   "hash version"
+'
+
  test_done




Re: [PATCH] commit-graph: fix a sparse 'integer as NULL pointer' warning

2018-05-29 Thread Derrick Stolee

n 5/29/2018 4:01 PM, Ramsay Jones wrote:

Signed-off-by: Ramsay Jones 
---

Hi Derrick,

If you need to re-roll your 'ds/commit-graph-fsck' branch (pu@a84e06bc0f),
could you please squash this into the relevant patch (commit 80453b4529,
"commit-graph: add 'verify' subcommand", 2018-05-24).

[No, No, that was the one in graph_read(). :-D ]


Thanks for helping rid me of my bad habits! I'll get this one in v4.



Thanks!

ATB,
Ramsay Jones

  builtin/commit-graph.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 20ce6437a..6abfde5e6 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -39,7 +39,7 @@ static struct opts_commit_graph {
  
  static int graph_verify(int argc, const char **argv)

  {
-   struct commit_graph *graph = 0;
+   struct commit_graph *graph = NULL;
char *graph_name;
  
  	static struct option builtin_commit_graph_verify_options[] = {


Re: [RFC PATCH 00/35] object-store: lookup_commit

2018-05-29 Thread Derrick Stolee

On 5/29/2018 8:47 PM, Stefan Beller wrote:

This applies on the merge of nd/commit-util-to-slab and sb/object-store-grafts,
and is available at http://github.com/stefanbeller/ as branch 
object-store-lookup-commit
as the merge has some merge conflicts as well as syntactical conflicts 
(upload-pack.c
and fetch-pack.c introduce new calls of functions that would want to take a 
repository struct
in the object-store-grafts series)

As layed out in 
https://public-inbox.org/git/20180517225154.9200-1-sbel...@google.com/
this is getting close to finishing the set of object store series though the 
last
unfinished part of this RFC hints at new work on the plate:
* To give this series a nice polish, we'd want to convert parse_commit, too.
   But that requires the conversion of the new commit graph. Maybe we need
   to split this series into 2.


I'll take a look at this series tomorrow. I've been working in 
ds/commit-graph-fsck to make many of the methods take a 'struct 
commit_graph *' parameter, which could easily be 'r->commit_graph' for a 
'struct the_repository *r' instead of the global 'commit_graph'. Those 
graph-local methods will make the transformation to be repo-local a lot 
easier. (There still may be some need to insert a repository, though.)


Since you are working on the commit-slab stuff in this patch, I'll 
(continue to) delay my patch series on using the commit-slab for 
generation numbers to avoid collisions with your work.


Thanks,

-Stolee



Re: [PATCH v3 06/20] commit-graph: add 'verify' subcommand

2018-05-30 Thread Derrick Stolee

On 5/27/2018 6:55 PM, Jakub Narebski wrote:

Derrick Stolee  writes:


If the commit-graph file becomes corrupt, we need a way to verify
that its contents match the object database. In the manner of
'git fsck' we will implement a 'git commit-graph verify' subcommand
to report all issues with the file.

Add the 'verify' subcommand to the 'commit-graph' builtin and its
documentation. The subcommand is currently a no-op except for
loading the commit-graph into memory, which may trigger run-time
errors that would be caught by normal use.

So this commit is simply getting the boilerplate out of the way for
implementing 'git commit-graph verify' subcommand.  Good.


Add a simple test that
ensures the command returns a zero error code.

Nice.


If no commit-graph file exists, this is an acceptable state. Do
not report any errors.

All right.  I assume that as it is explicit verification call, it does
ignore core.commitGraph setting, isn't it?


Signed-off-by: Derrick Stolee 
---
  Documentation/git-commit-graph.txt |  6 ++
  builtin/commit-graph.c | 38 ++
  commit-graph.c | 26 ++
  commit-graph.h |  2 ++
  t/t5318-commit-graph.sh| 10 ++
  5 files changed, 82 insertions(+)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 4c97b555cc..a222cfab08 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
  
  [verse]
  'git commit-graph read' [--object-dir ]
+'git commit-graph verify' [--object-dir ]
  'git commit-graph write'  [--object-dir ]

In alphabetical order, good.

  
  
@@ -52,6 +53,11 @@ existing commit-graph file.

  Read a graph file given by the commit-graph file and output basic
  details about the graph file. Used for debugging purposes.
  
+'verify'::

+
+Read the commit-graph file and verify its contents against the object
+database. Used to check for corrupted data.
+

All right, good enough description.

  
  EXAMPLES

  
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f0875b8bf3..0433dd6e20 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,10 +8,16 @@
  static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
+   N_("git commit-graph verify [--object-dir ]"),
N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
  };

In alphabetical order, same as in the manpage for git-commit-graph.

  
+static const char * const builtin_commit_graph_verify_usage[] = {

+   N_("git commit-graph verify [--object-dir ]"),
+   NULL
+};
+
  static const char * const builtin_commit_graph_read_usage[] = {
N_("git commit-graph read [--object-dir ]"),
NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
int append;
  } opts;
  
+

+static int graph_verify(int argc, const char **argv)
+{
+   struct commit_graph *graph = 0;
+   char *graph_name;
+
+   static struct option builtin_commit_graph_verify_options[] = {
+   OPT_STRING(0, "object-dir", &opts.obj_dir,
+  N_("dir"),
+  N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_verify_options,
+builtin_commit_graph_verify_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();

Getting the boilerplate of implementing the command mostly out of the
way.  Good.


+
+   graph_name = get_commit_graph_filename(opts.obj_dir);
+   graph = load_commit_graph_one(graph_name);

So we are verifying only the commit-graph file belonging directly to
current repository, as I have expected.  This is needed to for warnings
and error messages from the 'verify' action, to be able to tell in which
file there are problems.

This means that it is possible that there would be problems with
commit-graph files that running 'git commit-graph verify' would not
find, because they are in commit-graph file in one of the alternates.

It is very easy, though, to check all commit-graph files that would be
read and its data concatenated when using commit-graph feature
(e.g. 'git commit-graph read', IIRC):

   $ git commit-graph verify
   $ for obj_dir in $(cat .git/objects/info/alternates) do;
 git commit-graph --object-dir="$obj_dir";
 done

Note: I have not checked

Re: [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup

2018-05-30 Thread Derrick Stolee



On 5/30/2018 9:34 AM, Jakub Narebski wrote:

Derrick Stolee  writes:


In the commit-graph file, the OID fanout chunk provides an index into
the OID lookup. The 'verify' subcommand should find incorrect values
in the fanout.

Similarly, the 'verify' subcommand should find out-of-order values in
the OID lookup.

O.K., the OID Lookup chunk is checked together with OID Fanout chunk
because those two chunks are tightly connected: OID Fanout is fanout
into OID Lookup.


Signed-off-by: Derrick Stolee 
---
  commit-graph.c  | 36 
  t/t5318-commit-graph.sh | 22 ++
  2 files changed, 58 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 06e3e4f9ba..cbd1aae514 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -855,6 +855,9 @@ static void graph_report(const char *fmt, ...)
  
  int verify_commit_graph(struct commit_graph *g)

  {
+   uint32_t i, cur_fanout_pos = 0;
+   struct object_id prev_oid, cur_oid;

Minor nitpick about the naming convention: why cur_*, and not curr_*?


+
if (!g) {
graph_report("no commit-graph file loaded");
return 1;
@@ -869,5 +872,38 @@ int verify_commit_graph(struct commit_graph *g)
if (!g->chunk_commit_data)
graph_report("commit-graph is missing the Commit Data chunk");
  
+	if (verify_commit_graph_error)

+   return verify_commit_graph_error;
+

Before checking that commits in OID Lookup are sorted, and that OID
Fanout correctly points into OID Lookup (thus also checking that OID
Lookup is sorted), it would be good to verify that sizes of those chunks
are correct.

Out of 4 standrd chunks, 1 of them (OIDF) has constant size, and 2 of
them have size given by number of commits and hash size
  - OID Fanout is (256 * 4 bytes)
  - OID Lookup is (N * H bytes),
where N is number of commits, and H is hash size

The one that is more significant is if number of commits gets corrupted
upwards, and walking through OID Lookup would lead us out of bounds,
outside the file size.

IIRC we have checked that all chunks fit within file size, isn't it?


+   for (i = 0; i < g->num_commits; i++) {
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);

Why do you use hashcpy() here, but oidcpy() below?


We are copying from a section of binary data, not from a struct object_id *.




+
+   if (i && oidcmp(&prev_oid, &cur_oid) >= 0)

All right, OIDs needs to be sorted in ascending lexicographical order,
and the above condition checks that this constraint is fullfilled.


+   graph_report("commit-graph has incorrect OID order: %s then 
%s",
+oid_to_hex(&prev_oid),
+oid_to_hex(&cur_oid));

Nice informative error message.


+
+   oidcpy(&prev_oid, &cur_oid);
+
+   while (cur_oid.hash[0] > cur_fanout_pos) {
+   uint32_t fanout_value = get_be32(g->chunk_oid_fanout + 
cur_fanout_pos);
+   if (i != fanout_value)
+   graph_report("commit-graph has incorrect fanout 
value: fanout[%d] = %u != %u",
+cur_fanout_pos, fanout_value, i);
+
+   cur_fanout_pos++;
+   }

The k-th entry of fanout, F[k], stores the number of OIDs with first
byte at most k.

Let's examine it in detail on a simple example:

fanout lookup
-- --
0 : 0  ---> 0: 1
1 : 2  -\   1: 1
2 : 2  --\> 2: 3
3 : 3  ---> 3: 4

We are checking that after most significant byte (first byte) changes,
then all fanout position up to the current first byte value (exclusive)
point to current position in OID Lookup chunk.

Seems all right; it would be nice to come up with rigorous proof that it
is all right.


+   }
+
+   while (cur_fanout_pos < 256) {
+   uint32_t fanout_value = get_be32(g->chunk_oid_fanout + 
cur_fanout_pos);
+
+   if (g->num_commits != fanout_value)
+   graph_report("commit-graph has incorrect fanout value: 
fanout[%d] = %u != %u",
+cur_fanout_pos, fanout_value, i);
+
+   cur_fanout_pos++;
+   }

All right, this checks that all remaining fanout entries, up and
including the 255-ith entry store the total number of commits, N.


+
return verify_commit_graph_error;
  }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 4ef3fe3dc2..c050ef980b 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -247,6 +247,7 @@ test_expect_success 'git comm

Re: [PATCH 03/35] object: add repository argument to lookup_unknown_object

2018-05-30 Thread Derrick Stolee

On 5/29/2018 8:47 PM, Stefan Beller wrote:

From: Jonathan Nieder 

Add a repository argument to allow callers of lookup_unknown_object to
be more specific about which repository to handle. This is a small
mechanical change; it doesn't change the implementation to handle
repositories other than the_repository yet.

As with the previous commits, use a macro to catch callers passing a
repositor other than the_repository at compile time.

The included coccinelle semantic patch will adapt any new callers in
the diff produced by `make coccicheck`.


I don't see any coccinelle change in this patch.

Perhaps it is in a later commit?

Thanks,

-Stolee



Signed-off-by: Jonathan Nieder 
Signed-off-by: Stefan Beller 
---
  builtin/fsck.c   | 2 +-
  builtin/pack-objects.c   | 2 +-
  http-push.c  | 2 +-
  object.c | 2 +-
  object.h | 3 ++-
  refs.c   | 2 +-
  t/helper/test-example-decorate.c | 6 +++---
  upload-pack.c| 2 +-
  walker.c | 2 +-
  9 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 98fdeef5407..700739804fc 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -638,7 +638,7 @@ static int fsck_cache_tree(struct cache_tree *it)
  
  static void mark_object_for_connectivity(const struct object_id *oid)

  {
-   struct object *obj = lookup_unknown_object(oid->hash);
+   struct object *obj = lookup_unknown_object(the_repository, oid->hash);
obj->flags |= HAS_OBJ;
  }
  
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c

index 8c108109985..6eae39cf858 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2689,7 +2689,7 @@ static void add_objects_in_unpacked_packs(struct rev_info 
*revs)
  
  		for (i = 0; i < p->num_objects; i++) {

nth_packed_object_oid(&oid, p, i);
-   o = lookup_unknown_object(oid.hash);
+   o = lookup_unknown_object(the_repository, oid.hash);
if (!(o->flags & OBJECT_ADDED))
mark_in_pack_object(o, p, &in_pack);
o->flags |= OBJECT_ADDED;
diff --git a/http-push.c b/http-push.c
index 2615c823d60..04d95bd5da8 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1429,7 +1429,7 @@ static void one_remote_ref(const char *refname)
 * may be required for updating server info later.
 */
if (repo->can_update_info_refs && !has_object_file(&ref->old_oid)) {
-   obj = lookup_unknown_object(ref->old_oid.hash);
+   obj = lookup_unknown_object(the_repository, ref->old_oid.hash);
fprintf(stderr, "  fetch %s for %s\n",
oid_to_hex(&ref->old_oid), refname);
add_fetch_request(obj);
diff --git a/object.c b/object.c
index 4de4fa58d59..def3c71cac2 100644
--- a/object.c
+++ b/object.c
@@ -177,7 +177,7 @@ void *object_as_type(struct object *obj, enum object_type 
type, int quiet)
}
  }
  
-struct object *lookup_unknown_object(const unsigned char *sha1)

+struct object *lookup_unknown_object_the_repository(const unsigned char *sha1)
  {
struct object *obj = lookup_object(the_repository, sha1);
if (!obj)
diff --git a/object.h b/object.h
index fa41d711f44..778f83bf0f7 100644
--- a/object.h
+++ b/object.h
@@ -139,7 +139,8 @@ struct object *parse_object_or_die(const struct object_id 
*oid, const char *name
  struct object *parse_object_buffer(const struct object_id *oid, enum 
object_type type, unsigned long size, void *buffer, int *eaten_p);
  
  /** Returns the object, with potentially excess memory allocated. **/

-struct object *lookup_unknown_object(const unsigned  char *sha1);
+#define lookup_unknown_object(r, s) lookup_unknown_object_##r(s)
+struct object *lookup_unknown_object_the_repository(const unsigned char *sha1);
  
  struct object_list *object_list_insert(struct object *item,

   struct object_list **list_p);
diff --git a/refs.c b/refs.c
index 23d53957deb..3b9e8463656 100644
--- a/refs.c
+++ b/refs.c
@@ -301,7 +301,7 @@ static int filter_refs(const char *refname, const struct 
object_id *oid,
  
  enum peel_status peel_object(const struct object_id *name, struct object_id *oid)

  {
-   struct object *o = lookup_unknown_object(name->hash);
+   struct object *o = lookup_unknown_object(the_repository, name->hash);
  
  	if (o->type == OBJ_NONE) {

int type = oid_object_info(the_repository, name, NULL);
diff --git a/t/helper/test-example-decorate.c b/t/helper/test-example-decorate.c
index 081115bf8eb..33e727f7fc5 100644
--- a/t/helper/test-example-decorate.c
+++ b/t/helper/test-example-decorate.c
@@ -26,8 +26,8 @@ int cmd__example_decorate(int argc, const char **argv)
 * Add 2 objects, one with a non-NULL decoration and one with a NULL
  

Re: [PATCH 04/35] object: add repository argument to parse_object_buffer

2018-05-30 Thread Derrick Stolee

On 5/29/2018 8:47 PM, Stefan Beller wrote:

Add a repository argument to allow the callers of parse_object_buffer
to be more specific about which repository to act on. This is a small
mechanical change; it doesn't change the implementation to handle
repositories other than the_repository yet.

As with the previous commits, use a macro to catch callers passing a
repository other than the_repository at compile time.

Add the cocci patch that converted the callers.


Again, I'm missing the cocci patch here.



Signed-off-by: Jonathan Nieder 
Signed-off-by: Stefan Beller 
---
  builtin/fast-export.c| 3 ++-
  builtin/fsck.c   | 6 --
  builtin/index-pack.c | 3 ++-
  builtin/unpack-objects.c | 3 ++-
  object.c | 5 +++--
  object.h | 3 ++-
  ref-filter.c | 3 ++-
  7 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 24d42842f9d..a34ab9768f4 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -245,7 +245,8 @@ static void export_blob(const struct object_id *oid)
die ("Could not read blob %s", oid_to_hex(oid));
if (check_object_signature(oid, buf, size, type_name(type)) < 0)
die("sha1 mismatch in blob %s", oid_to_hex(oid));
-   object = parse_object_buffer(oid, type, size, buf, &eaten);
+   object = parse_object_buffer(the_repository, oid, type,
+size, buf, &eaten);
}
  
  	if (!object)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 700739804fc..e6d6eb266eb 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -392,7 +392,8 @@ static int fsck_obj_buffer(const struct object_id *oid, 
enum object_type type,
 * verify_packfile(), data_valid variable for details.
 */
struct object *obj;
-   obj = parse_object_buffer(oid, type, size, buffer, eaten);
+   obj = parse_object_buffer(the_repository, oid, type, size, buffer,
+ eaten);
if (!obj) {
errors_found |= ERROR_OBJECT;
return error("%s: object corrupt or missing", oid_to_hex(oid));
@@ -522,7 +523,8 @@ static struct object *parse_loose_object(const struct 
object_id *oid,
if (!contents && type != OBJ_BLOB)
die("BUG: read_loose_object streamed a non-blob");
  
-	obj = parse_object_buffer(oid, type, size, contents, &eaten);

+   obj = parse_object_buffer(the_repository, oid, type, size,
+ contents, &eaten);
  
  	if (!eaten)

free(contents);
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index e2f670bef9e..0dd10693597 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -848,7 +848,8 @@ static void sha1_object(const void *data, struct 
object_entry *obj_entry,
 * we do not need to free the memory here, as the
 * buf is deleted by the caller.
 */
-   obj = parse_object_buffer(oid, type, size, buf,
+   obj = parse_object_buffer(the_repository, oid, type,
+ size, buf,
  &eaten);
if (!obj)
die(_("invalid %s"), type_name(type));
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 9a4d2708123..8e454c48649 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -265,7 +265,8 @@ static void write_object(unsigned nr, enum object_type type,
int eaten;
hash_object_file(buf, size, type_name(type), &obj_list[nr].oid);
added_object(nr, type, buf, size);
-   obj = parse_object_buffer(&obj_list[nr].oid, type, size, buf,
+   obj = parse_object_buffer(the_repository, &obj_list[nr].oid,
+ type, size, buf,
  &eaten);
if (!obj)
die("invalid %s", type_name(type));
diff --git a/object.c b/object.c
index def3c71cac2..4250ddd3f7f 100644
--- a/object.c
+++ b/object.c
@@ -186,7 +186,7 @@ struct object *lookup_unknown_object_the_repository(const 
unsigned char *sha1)
return obj;
  }
  
-struct object *parse_object_buffer(const struct object_id *oid, enum object_type type, unsigned long size, void *buffer, int *eaten_p)

+struct object *parse_object_buffer_the_repository(const struct object_id *oid, 
enum object_type type, unsigned long size, void *buffer, int *eaten_p)
  {
struct object *obj;
*eaten_p = 0;
@@ -278,7 +278,8 @@ struct object *parse_object_the_repository(const struct 
object_id *oid)
return NULL;
}
  
-		obj = parse_object_buffer(oid, type, size, buffer, &eaten);

+ 

Re: [PATCH 06/35] blob: add repository argument to lookup_blob

2018-05-30 Thread Derrick Stolee

On 5/29/2018 8:47 PM, Stefan Beller wrote:

Add a repository argument to allow the callers of lookup_blob
to be more specific about which repository to act on. This is a small
mechanical change; it doesn't change the implementation to handle
repositories other than the_repository yet.

As with the previous commits, use a macro to catch callers passing a
repository other than the_repository at compile time.

Add the cocci patch that converted the callers.
(this is the last time I'll point this out. I'll wait until I read to 
the end to see if the coccinelle patch is included in the series.)


Signed-off-by: Jonathan Nieder 
Signed-off-by: Stefan Beller 
---
  blob.c   | 2 +-
  blob.h   | 3 ++-
  builtin/fast-export.c| 2 +-
  builtin/fsck.c   | 3 ++-
  builtin/index-pack.c | 2 +-
  builtin/merge-tree.c | 3 ++-
  builtin/unpack-objects.c | 2 +-
  fsck.c   | 2 +-
  http-push.c  | 3 ++-
  list-objects.c   | 2 +-
  object.c | 4 ++--
  reachable.c  | 2 +-
  revision.c   | 4 ++--
  tag.c| 2 +-
  walker.c | 3 ++-
  15 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/blob.c b/blob.c
index dada295698c..17b9314f0a0 100644
--- a/blob.c
+++ b/blob.c
@@ -5,7 +5,7 @@
  
  const char *blob_type = "blob";
  
-struct blob *lookup_blob(const struct object_id *oid)

+struct blob *lookup_blob_the_repository(const struct object_id *oid)
  {
struct object *obj = lookup_object(the_repository, oid->hash);
if (!obj)
diff --git a/blob.h b/blob.h
index 44606168310..08bc34487a0 100644
--- a/blob.h
+++ b/blob.h
@@ -9,7 +9,8 @@ struct blob {
struct object object;
  };
  
-struct blob *lookup_blob(const struct object_id *oid);

+#define lookup_blob(r, o) lookup_blob_##r(o)
+struct blob *lookup_blob_the_repository(const struct object_id *oid);
  
  int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size);
  
diff --git a/builtin/fast-export.c b/builtin/fast-export.c

index a34ab9768f4..23ca46e6008 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -237,7 +237,7 @@ static void export_blob(const struct object_id *oid)
  
  	if (anonymize) {

buf = anonymize_blob(&size);
-   object = (struct object *)lookup_blob(oid);
+   object = (struct object *)lookup_blob(the_repository, oid);
eaten = 0;
} else {
buf = read_object_file(oid, &type, &size);
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 82a5acaf76e..6e2b204e938 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -808,7 +808,8 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
mode = active_cache[i]->ce_mode;
if (S_ISGITLINK(mode))
continue;
-   blob = lookup_blob(&active_cache[i]->oid);
+   blob = lookup_blob(the_repository,
+  &active_cache[i]->oid);
if (!blob)
continue;
obj = &blob->object;
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 0dd10693597..51c5244a15a 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -832,7 +832,7 @@ static void sha1_object(const void *data, struct 
object_entry *obj_entry,
if (strict || do_fsck_object) {
read_lock();
if (type == OBJ_BLOB) {
-   struct blob *blob = lookup_blob(oid);
+   struct blob *blob = lookup_blob(the_repository, oid);
if (blob)
blob->object.flags |= FLAG_CHECKED;
else
diff --git a/builtin/merge-tree.c b/builtin/merge-tree.c
index 8a8d5797520..f8023bae1e2 100644
--- a/builtin/merge-tree.c
+++ b/builtin/merge-tree.c
@@ -2,6 +2,7 @@
  #include "tree-walk.h"
  #include "xdiff-interface.h"
  #include "object-store.h"
+#include "repository.h"
  #include "blob.h"
  #include "exec-cmd.h"
  #include "merge-blobs.h"
@@ -170,7 +171,7 @@ static struct merge_list *create_entry(unsigned stage, 
unsigned mode, const stru
res->stage = stage;
res->path = path;
res->mode = mode;
-   res->blob = lookup_blob(oid);
+   res->blob = lookup_blob(the_repository, oid);
return res;
  }
  
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c

index 8e454c48649..dfea7790fde 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -254,7 +254,7 @@ static void write_object(unsigned nr, enum object_type type,
added_object(nr, type, buf, size);
free(buf);
  
-		blob = lookup_blob(&obj_list[nr].oid);

+   blob = lookup_blob(the_repository, &obj_list[nr].oid);
if (blob)
blob->

Re: [PATCH 27/35] commit-slabs: remove realloc counter outside of slab struct

2018-05-30 Thread Derrick Stolee

On 5/29/2018 8:48 PM, Stefan Beller wrote:

The realloc counter is declared outside the struct for the given slabname,
which makes it harder for a follow up patch to move the declaration of the
struct around as then the counter variable would need special treatment.

As the reallocation counter is currently unused we can just remove it.
If we ever need to count the reallocations again, we can reintroduce
the counter as part of 'struct slabname' in commit-slab-decl.h.


It's worth noting that this patch is different from the other mechanical 
patches. I do agree that we should remove this unused portion of the 
slab API. Likely this was built for testing purposes.


CC'ing Junio, who I found introduced this in a84b794ad "commit-slab: 
introduce a macro to define a slab for new type". Looking at that diff, 
the realloc portion did not appear to exist in the previous 
in-degree-specific version of slabs.




Signed-off-by: Stefan Beller 
---
  commit-slab-impl.h | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/commit-slab-impl.h b/commit-slab-impl.h
index 87a9cadfcca..ac1e6d409ad 100644
--- a/commit-slab-impl.h
+++ b/commit-slab-impl.h
@@ -11,8 +11,6 @@
  
  #define implement_commit_slab(slabname, elemtype, scope)		\

\
-static int stat_ ##slabname## realloc; \
-   \
  scope void init_ ##slabname## _with_stride(struct slabname *s,
\
   unsigned stride) \
  { \
@@ -54,7 +52,6 @@ scope elemtype *slabname## _at_peek(struct slabname *s,   
\
if (!add_if_missing)\
return NULL;\
REALLOC_ARRAY(s->slab, nth_slab + 1);\
-   stat_ ##slabname## realloc++;   \
for (i = s->slab_count; i <= nth_slab; i++)   \
s->slab[i] = NULL;   \
s->slab_count = nth_slab + 1;\


Re: [RFC PATCH 00/35] object-store: lookup_commit

2018-05-30 Thread Derrick Stolee

On 5/29/2018 11:18 PM, Stefan Beller wrote:

On Tue, May 29, 2018 at 6:05 PM, Derrick Stolee  wrote:

On 5/29/2018 8:47 PM, Stefan Beller wrote:

This applies on the merge of nd/commit-util-to-slab and
sb/object-store-grafts,
and is available at http://github.com/stefanbeller/ as branch
object-store-lookup-commit
as the merge has some merge conflicts as well as syntactical conflicts
(upload-pack.c
and fetch-pack.c introduce new calls of functions that would want to take
a repository struct
in the object-store-grafts series)

As layed out in
https://public-inbox.org/git/20180517225154.9200-1-sbel...@google.com/
this is getting close to finishing the set of object store series though
the last
unfinished part of this RFC hints at new work on the plate:
* To give this series a nice polish, we'd want to convert parse_commit,
too.
But that requires the conversion of the new commit graph. Maybe we need
to split this series into 2.


I'll take a look at this series tomorrow. I've been working in
ds/commit-graph-fsck to make many of the methods take a 'struct commit_graph
*' parameter, which could easily be 'r->commit_graph' for a 'struct
the_repository *r' instead of the global 'commit_graph'. Those graph-local
methods will make the transformation to be repo-local a lot easier. (There
still may be some need to insert a repository, though.)

Since you are working on the commit-slab stuff in this patch, I'll (continue
to) delay my patch series on using the commit-slab for generation numbers to
avoid collisions with your work.

This series touches the commit slabs in the last few commits only.
I don't think there will be huge conflicts, as all I had to do was
move the definition from a global to inside the parsed object store.


I went through the whole series and didn't find any issue with the code. 
The commit messages reference coccinelle scripts that I never saw, so 
those should be struck from the messages.


If anyone else plans to review, these two commits are worth special 
attention compared to the others:


[PATCH 27/35] commit-slabs: remove realloc counter outside of slab struct
[PATCH 28/35] commit.c: migrate the commit buffer to the parsed object store

I found no issue with them, but they do actually change behavior slightly as 
opposed to the other commits which have no change other than to relax the 
dependence on the_repository as a global. I also like the pattern of storing 
the slabs inside the parsed object structure in the_repository. I'll use a 
similar pattern when looking into using slabs for generation number.

For Junio: I know this will conflict with ds/commit-graph-fsck (specifically, I 
add new references to lookup_commit() and manipulate parse_commit_gently()). If 
this series is ready to be queued quickly, I can rebase my v4 onto these 
commits as they are placed into 'pu' (and please eject ds/commit-graph-fsck at 
that time).

Thanks,
-Stolee



Re: [PATCH v3 10/20] commit-graph: verify objects exist

2018-05-31 Thread Derrick Stolee

On 5/30/2018 3:22 PM, Jakub Narebski wrote:

Derrick Stolee  writes:


In the 'verify' subcommand, load commits directly from the object
database to ensure they exist. Parse by skipping the commit-graph.

All right, before we check that the commit data matches, we need to
check that all the commits in cache (in the serialized commit graph) are
present in real data (in the object database of the repository).

What's nice of this series is that the operation that actually removes
unreachable commits from the object database, that is `git gc`, would
also update commit-gaph file.


Signed-off-by: Derrick Stolee 
---
  commit-graph.c  | 20 
  t/t5318-commit-graph.sh |  7 +++
  2 files changed, 27 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index cbd1aae514..0420ebcd87 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -238,6 +238,10 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
  {
struct commit *c;
struct object_id oid;
+
+   if (pos >= g->num_commits)
+   die("invalid parent position %"PRIu64, pos);
+

This change is not described in the commit message.
This change should go in "commit-graph: verify parent list" which adds a 
test that fails without it. Thanks.



hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
c = lookup_commit(&oid);
if (!c)
@@ -905,5 +909,21 @@ int verify_commit_graph(struct commit_graph *g)
cur_fanout_pos++;
}
  
+	if (verify_commit_graph_error)

+   return verify_commit_graph_error;

All right, so we by default short-circuit so that errors found earlier
wouldn't cause crash when checking other things.

Is it needed, though, in this case?  Though I guess it is better to be
conservative; lso by terminating after a batch of one type of errors we
don't get many different error messages from the same error (i.e. error
propagation).


+
+   for (i = 0; i < g->num_commits; i++) {
+   struct commit *odb_commit;
+
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   odb_commit = (struct commit *)create_object(cur_oid.hash, 
alloc_commit_node());

Do we really need to keep all those commits from the object database in
memory (in the object::obj_hash hash table)?  Perhaps using something
like Flywheel / Recycler pattern would be a better solution (if
possible)?

Though perhaps this doesn't matter much with respect to memory use.


+   if (parse_commit_internal(odb_commit, 0, 0)) {

Just a reminder to myself: the signature is

   int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
use_commit_graph)


Hmmm... I wonder if with two boolean paramaters wouldn't it be better to
use flags parameter, i.e.

   int parse_commit_internal(struct commit *item, int flags)

   ...

   parse_commit_internal(commit, QUIET_ON_MISSING | USE_COMMIT_GRAPH)

But I guess that it is not worth it (especially for internal-ish
function).


+   graph_report("failed to parse %s from object database",
+oid_to_hex(&cur_oid));

Wouldn't parse_commit_internal() show it's own error message, in
addition to the one above?


+   continue;
+   }
+   }
+
return verify_commit_graph_error;
  }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index c050ef980b..996a016239 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' '
git commit-graph verify >output
  '
  
+NUM_COMMITS=9

  HASH_LEN=20
  GRAPH_BYTE_VERSION=4
  GRAPH_BYTE_HASH=5
@@ -265,6 +266,7 @@ GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4`
  GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
  GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
  GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
+GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 
+ 10`

All right, so we modify 10-the byte of OID of 5-th commit out of 9,
am I correct here?

  
  # usage: corrupt_graph_and_verify   

  # Manipulates the commit-graph file at the position
@@ -334,4 +336,9 @@ test_expect_success 'detect incorrect OID order' '
"incorrect OID order"
  '
  
+test_expect_success 'detect OID not in object database' '

+   corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_MISSING "\01" \
+   "from object database"
+'

I guess that if we ensure that OIDs are constant, you have chosen the
change to actually corrupt the OID in OID Lookup chunk to point to OID
that is not in the object database (while still not violating the
constraint that OID in OID Lookup chunk needs to be sorted).


+
  test_done

All right (well, except for `expr ... ` --> $(( ... )) change).





Re: [PATCH v3 11/20] commit-graph: verify root tree OIDs

2018-05-31 Thread Derrick Stolee

On 5/30/2018 6:24 PM, Jakub Narebski wrote:

Derrick Stolee  writes:


The 'verify' subcommand must compare the commit content parsed from the
commit-graph and compare it against the content in the object database.

You have "compare" twice in the above sentence.


Use lookup_commit() and parse_commit_in_graph_one() to parse the commits
from the graph and compare against a commit that is loaded separately
and parsed directly from the object database.

All right, that looks like a nice extension of what was done in previous
patch.  We want to check that cache (serialized commit graph) matches
reality (object database).


Add checks for the root tree OID.

All right; isn't it that now we check almost all information from
commit-graph that hs match in object database (with exception of commit
parents, I think).


Signed-off-by: Derrick Stolee 
---
  commit-graph.c  | 17 -
  t/t5318-commit-graph.sh |  7 +++
  2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0420ebcd87..19ea369fc6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -880,6 +880,8 @@ int verify_commit_graph(struct commit_graph *g)
return verify_commit_graph_error;

NOTE: we will be checking Commit Data chunk; I think it would be good
idea to verify that size of Commit Data chunk matches (N * (H + 16) bytes)
that format gives us, so that we don't accidentally red outside of
memory if something got screwed up (like number of commits being wrong,
or file truncated).


This is actually how we calculate 'num_commits' during 
load_commit_graph_one():


    if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP)
    {
        graph->num_commits = (chunk_offset - last_chunk_offset)
                     / graph->hash_len;
    }

So, if the chunk doesn't match N*(H+16), we detect this because 
FANOUT[255] != N.


(There is one caveat here: (chunk_offset - last_chunk_offset) may not be 
a multiple of hash_len, and "accidentally" truncate to N in the 
division. I'll add more careful checks for this.)


We also check out-of-bounds offsets in that method.



  
  	for (i = 0; i < g->num_commits; i++) {

+   struct commit *graph_commit;
+
hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
  
  		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)

@@ -897,6 +899,11 @@ int verify_commit_graph(struct commit_graph *g)
  
  			cur_fanout_pos++;

}
+
+   graph_commit = lookup_commit(&cur_oid);

So now I see why we add all commits to memory (to hash structure).


+   if (!parse_commit_in_graph_one(g, graph_commit))
+   graph_report("failed to parse %s from commit-graph",
+oid_to_hex(&cur_oid));

All right, this verifies that commit in OID Lookup chunk has parse-able
data in Commit Data chunk, isn't it?


}
  
  	while (cur_fanout_pos < 256) {

@@ -913,16 +920,24 @@ int verify_commit_graph(struct commit_graph *g)
return verify_commit_graph_error;
  
  	for (i = 0; i < g->num_commits; i++) {

-   struct commit *odb_commit;
+   struct commit *graph_commit, *odb_commit;
  
  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
  
+		graph_commit = lookup_commit(&cur_oid);

odb_commit = (struct commit *)create_object(cur_oid.hash, 
alloc_commit_node());

All right, so we have commit data from graph, and commit data from the
object database.


if (parse_commit_internal(odb_commit, 0, 0)) {
graph_report("failed to parse %s from object database",
 oid_to_hex(&cur_oid));
continue;
}
+
+   if (oidcmp(&get_commit_tree_in_graph_one(g, 
graph_commit)->object.oid,
+  get_commit_tree_oid(odb_commit)))
+   graph_report("root tree OID for commit %s in commit-graph is 
%s != %s",
+oid_to_hex(&cur_oid),
+
oid_to_hex(get_commit_tree_oid(graph_commit)),
+
oid_to_hex(get_commit_tree_oid(odb_commit)));

Maybe explicitly say that it doesn't match the value from the object
database; OTOH this may be too verbose.


}
  
  	return verify_commit_graph_error;

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 996a016239..21cc8e82f3 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -267,6 +267,8 @@ GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
  GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
  GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN

[RFC PATCH 0/6] Fix commit-graph/graft/replace/shallow combo

2018-05-31 Thread Derrick Stolee
The commit-graph file stores a condensed version of the commit history.
This helps speed up several operations involving commit walks. This
feature does not work well if those commits "change" using features like
commit grafts, replace objects, or shallow clones.

Since the commit-graph feature is optional, hidden behind the
'core.commitGraph' config option, and requires manual maintenance (until
ds/commit-graph-fsck' is merged), these issues only arise for expert
users who decided to opt-in.

This RFC is a first attempt at rectify the issues that come up when
these features interact. I have not succeeded entirely, as I will
discuss below.

The first two "DO NOT MERGE" commits are not intended to be part of the
main product, but are ways to help the full test suite run while
computing and consuming commit-graph files. This is acheived by calling
write_commit_graph_reachable() from `git fetch` and `git commit`. I
believe this is the only dependence on ds/commit-graph-fsck. The other
commits should only depend on ds/commit-graph-lockfile-fix.

Running the full test suite after these DO NO MERGE commits results in
the following test failures, which I categorize by type of failure.

The following tests are red herrings. Most work by deleting a commit
from the object database and trying to demonstrate a failure. Others
rely on how certain objects are loaded. These are not bugs, but will
add noise if you run the tests suite with this patch.

t0410-partial-clone.sh  Failed tests:  5, 8
t5307-pack-missing-commit.shFailed tests:  3-4
t6011-rev-list-with-bad-commit.sh   Failed test:  4
t6024-recursive-merge.shFailed test:  4

The following tests are fixed in "commit-graph: enable replace-object
and grafts".

t6001-rev-list-graft.sh Failed tests:  3, 5, 7, 9, 11, 13
t6050-replace.shFailed tests:  11-15, 23-24, 29

The following tests involve shallow clones.

t5500-fetch-pack.sh Failed tests:  30-31, 34, 348-349
t5537-fetch-shallow.sh  Failed tests:  4-7, 9
t5802-connect-helper.sh Failed test:  3

These tests are mostly fixed by three commits:

* commit-graph: avoid writing when repo is shallow
* fetch: destroy commit graph on shallow parameters
* commit-graph: revert to odb on missing parents

Each of these cases cover a different interaction that occurs with
shallow clones. Some are due to a commit becoming shallow, while others
are due to a commit becoming unshallow (and hence invalidating its
generation number).

After these changes, there is one test case that still fails, and I
cannot understand why:

t5500-fetch-pack.sh Failed test:  348

This test fails when performing a `git fetch --shallow-exclude`. When I
halt the test using `t5500-fetch-pack.sh -d -i` and navigate to the
directory to replay the fetch it performs as expected. After struggling
with it for a while, I figured I should just put this series up for
discussion. Maybe someone with more experience in shallow clones can
point out the obvious issues I'm having.

Thanks,
-Stolee

Derrick Stolee (6):
  DO NOT MERGE: compute commit-graph on every commit
  DO NOT MERGE: write commit-graph on every fetch
  commit-graph: enable replace-object and grafts
  commit-graph: avoid writing when repo is shallow
  fetch: destroy commit graph on shallow parameters
  commit-graph: revert to odb on missing parents

 builtin/commit.c  |  5 +
 builtin/fetch.c   | 10 +
 builtin/gc.c  |  2 +-
 builtin/replace.c |  3 +++
 commit-graph.c| 65 +++
 commit-graph.h|  9 
 commit.c  |  5 +
 environment.c |  2 +-
 8 files changed, 95 insertions(+), 6 deletions(-)

-- 
2.16.2.338.gcfe06ae955



[RFC PATCH 5/6] fetch: destroy commit graph on shallow parameters

2018-05-31 Thread Derrick Stolee
The commit-graph feature is not compatible with history-rewriting
features such as shallow clones. When running a 'git fetch' with
any of the shallow/unshallow options, destroy the commit-graph
file and override core.commitGraph to be false.

Signed-off-by: Derrick Stolee 
---
 builtin/fetch.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index af896e4b74..2001dca881 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1452,6 +1452,12 @@ int cmd_fetch(int argc, const char **argv, const char 
*prefix)
}
}
 
+   if (update_shallow || depth || deepen_since || deepen_not.nr ||
+   deepen_relative || unshallow || update_shallow || 
is_repository_shallow()) {
+   destroy_commit_graph(get_object_directory());
+   core_commit_graph = 0;
+   }
+
if (remote) {
if (filter_options.choice || repository_format_partial_clone)
fetch_one_setup_partial(remote);
-- 
2.16.2.338.gcfe06ae955



[RFC PATCH 2/6] DO NOT MERGE: write commit-graph on every fetch

2018-05-31 Thread Derrick Stolee
THIS IS ONLY FOR TESTING TO INCREASE TEST COVERAGE

Signed-off-by: Derrick Stolee 
---
 builtin/fetch.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 8ee998ea2e..af896e4b74 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -20,6 +20,7 @@
 #include "utf8.h"
 #include "packfile.h"
 #include "list-objects-filter-options.h"
+#include "commit-graph.h"
 
 static const char * const builtin_fetch_usage[] = {
N_("git fetch [] [ [...]]"),
@@ -1480,6 +1481,9 @@ int cmd_fetch(int argc, const char **argv, const char 
*prefix)
 
close_all_packs();
 
+   if (core_commit_graph)
+   write_commit_graph_reachable(get_object_directory(), 1);
+
argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);
if (verbosity < 0)
argv_array_push(&argv_gc_auto, "--quiet");
-- 
2.16.2.338.gcfe06ae955



[RFC PATCH 1/6] DO NOT MERGE: compute commit-graph on every commit

2018-05-31 Thread Derrick Stolee
Also enable core.commitGraph and gc.commitGraph. Helps to test the
commit-graph feature with the rest of the test infrastructure.

Signed-off-by: Derrick Stolee 
---
 builtin/commit.c | 5 +
 builtin/gc.c | 2 +-
 environment.c| 2 +-
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index e8e8d13be4..8751b816c1 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -32,6 +32,7 @@
 #include "column.h"
 #include "sequencer.h"
 #include "mailmap.h"
+#include "commit-graph.h"
 
 static const char * const builtin_commit_usage[] = {
N_("git commit [] [--] ..."),
@@ -1623,5 +1624,9 @@ int cmd_commit(int argc, const char **argv, const char 
*prefix)
 
UNLEAK(err);
UNLEAK(sb);
+
+   if (core_commit_graph)
+   write_commit_graph_reachable(get_object_directory(), 1);
+
return 0;
 }
diff --git a/builtin/gc.c b/builtin/gc.c
index efd214a59f..999c09d8e2 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -35,7 +35,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
-static int gc_commit_graph = 0;
+static int gc_commit_graph = 1;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
diff --git a/environment.c b/environment.c
index 8853e2f0dd..fdb2d56d90 100644
--- a/environment.c
+++ b/environment.c
@@ -62,7 +62,7 @@ enum push_default_type push_default = 
PUSH_DEFAULT_UNSPECIFIED;
 enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
-int core_commit_graph;
+int core_commit_graph = 1;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
-- 
2.16.2.338.gcfe06ae955



[RFC PATCH 4/6] commit-graph: avoid writing when repo is shallow

2018-05-31 Thread Derrick Stolee
Shallow clones do not interact well with the commit-graph feature for
several reasons. Instead of doing the hard thing to fix those
interactions, instead prevent reading or writing a commit-graph file for
shallow repositories.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 95af4ed519..80e377b90f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -208,6 +208,9 @@ static void prepare_commit_graph(void)
return;
prepare_commit_graph_run_once = 1;
 
+   if (is_repository_shallow())
+   return;
+
obj_dir = get_object_directory();
prepare_commit_graph_one(obj_dir);
prepare_alt_odb();
@@ -711,6 +714,15 @@ void write_commit_graph(const char *obj_dir,
int num_extra_edges;
struct commit_list *parent;
 
+   /*
+* Shallow clones are not supproted, as they create bad
+* generation skips as they are un-shallowed.
+*/
+   if (is_repository_shallow()) {
+   warning("writing a commit-graph in a shallow repository is not 
supported");
+   return;
+   }
+
oids.nr = 0;
oids.alloc = approximate_object_count() / 4;
 
-- 
2.16.2.338.gcfe06ae955



[RFC PATCH 6/6] commit-graph: revert to odb on missing parents

2018-05-31 Thread Derrick Stolee
The commit-graph format includes a way to specify a parent is
"missing" from the commit-graph (i.e. we do not have a record of
that parent in our list of object IDs, and hence cannot provide
a graph position). For mose cases, this does not occur due to
the close_reachable() method adding all reachable commits. However,
in a shallow clone, we will try to record the parents of a commit
on the shallow boundary, but the parents are not in the repository.

The GRAPH_PARENT_MISSING value that is stored in the format is
purposeful, especially for future plans to make the commit-graph file
incremental or transporting sections of a commit-graph file across
the network.

In the meantime, check if a commit has a missing parent while filling
its details from the commit-graph. If a parent is missing, still
assign the generation number and graph position for that item, but
report that the commit-graph failed to fill the contents. Then the
caller is responsible for filling the rest of the data from a commit
buffer.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 33 ++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 80e377b90f..3e33d061fe 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -278,17 +278,44 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
struct commit_list **pptr;
const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
 
-   item->object.parsed = 1;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
item->graph_pos = pos;
 
+   /*
+* If we have any edges marked as GRAPH_PARENT_MISSING, we must not 
parse any
+* more of this object and leave it to the commit buffer to parse.
+*/
+   edge_value = get_be32(commit_data + g->hash_len);
+   if (edge_value == GRAPH_PARENT_MISSING)
+   return 0;
+   if (edge_value == GRAPH_PARENT_NONE)
+   goto continue_parsing;
+
+   edge_value = get_be32(commit_data + g->hash_len + 4);
+   if (edge_value == GRAPH_PARENT_MISSING)
+   return 0;
+   if (edge_value == GRAPH_PARENT_NONE)
+   goto continue_parsing;
+   if (!(edge_value & GRAPH_OCTOPUS_EDGES_NEEDED))
+   goto continue_parsing;
+
+   parent_data_ptr = (uint32_t*)(g->chunk_large_edges +
+ 4 * (uint64_t)(edge_value & GRAPH_EDGE_LAST_MASK));
+   do {
+   edge_value = get_be32(parent_data_ptr);
+   if (edge_value == GRAPH_PARENT_MISSING)
+   return 0;
+   parent_data_ptr++;
+   } while (!(edge_value & GRAPH_LAST_EDGE));
+   
+continue_parsing:
+   item->object.parsed = 1;
item->maybe_tree = NULL;
 
date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
 
-   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
pptr = &item->parents;
 
edge_value = get_be32(commit_data + g->hash_len);
-- 
2.16.2.338.gcfe06ae955



[RFC PATCH 3/6] commit-graph: enable replace-object and grafts

2018-05-31 Thread Derrick Stolee
Create destroy_commit_graph() method to delete the commit-graph file
when history is altered by a replace-object call. If the commit-graph
is rebuilt after that, we will load the correct object while reading
the commit-graph.

When parsing a commit, first check if the commit was grafted. If so,
then ignore the commit-graph for that commit and insted use the parents
loaded by parsing the commit buffer and comparing against the graft
file. 

Signed-off-by: Derrick Stolee 
---
 builtin/replace.c |  3 +++
 commit-graph.c| 20 +++-
 commit-graph.h|  9 +
 commit.c  |  5 +
 4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/builtin/replace.c b/builtin/replace.c
index 9f01f3fc7f..d553aadcdc 100644
--- a/builtin/replace.c
+++ b/builtin/replace.c
@@ -15,6 +15,7 @@
 #include "parse-options.h"
 #include "run-command.h"
 #include "tag.h"
+#include "commit-graph.h"
 
 static const char * const git_replace_usage[] = {
N_("git replace [-f]  "),
@@ -468,6 +469,8 @@ int cmd_replace(int argc, const char **argv, const char 
*prefix)
usage_msg_opt("--raw only makes sense with --edit",
  git_replace_usage, options);
 
+   destroy_commit_graph(get_object_directory());
+
switch (cmdmode) {
case MODE_DELETE:
if (argc < 1)
diff --git a/commit-graph.c b/commit-graph.c
index e9195dfb17..95af4ed519 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -6,6 +6,7 @@
 #include "pack.h"
 #include "packfile.h"
 #include "commit.h"
+#include "dir.h"
 #include "object.h"
 #include "refs.h"
 #include "revision.h"
@@ -240,15 +241,22 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
 {
struct commit *c;
struct object_id oid;
+   const unsigned char *real;
 
if (pos >= g->num_commits)
die("invalid parent position %"PRIu64, pos);
 
hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
+
+   real = lookup_replace_object(oid.hash);
+
c = lookup_commit(&oid);
if (!c)
die("could not find commit %s", oid_to_hex(&oid));
-   c->graph_pos = pos;
+
+   if (!hashcmp(real, oid.hash))
+   c->graph_pos = pos;
+
return &commit_list_insert(c, pptr)->next;
 }
 
@@ -1019,3 +1027,13 @@ int verify_commit_graph(struct commit_graph *g)
 
return verify_commit_graph_error;
 }
+
+void destroy_commit_graph(const char *obj_dir)
+{
+   char *graph_name;
+   close_commit_graph();
+
+   graph_name = get_commit_graph_filename(obj_dir);
+   remove_path(graph_name);
+   FREE_AND_NULL(graph_name);
+}
diff --git a/commit-graph.h b/commit-graph.h
index 9a06a5f188..1d17da1582 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -56,4 +56,13 @@ void write_commit_graph(const char *obj_dir,
 
 int verify_commit_graph(struct commit_graph *g);
 
+/*
+ * Delete the commit-graph file in the given object directory.
+ *
+ * WARNING: this deletes data, so should only be used when
+ * performing history-altering actions like replace-object
+ * or grafts.
+ */
+void destroy_commit_graph(const char *obj_dir);
+
 #endif
diff --git a/commit.c b/commit.c
index 6eaed0174c..2fe31cde77 100644
--- a/commit.c
+++ b/commit.c
@@ -403,6 +403,11 @@ int parse_commit_internal(struct commit *item, int 
quiet_on_missing, int use_com
return -1;
if (item->object.parsed)
return 0;
+
+   prepare_commit_graft();
+   if (commit_graft_pos(item->object.oid.hash) >= 0)
+   use_commit_graph = 0;
+
if (use_commit_graph && parse_commit_in_graph(item))
return 0;
buffer = read_sha1_file(item->object.oid.hash, &type, &size);
-- 
2.16.2.338.gcfe06ae955



Re: [RFC PATCH 0/6] Fix commit-graph/graft/replace/shallow combo

2018-05-31 Thread Derrick Stolee

On 5/31/2018 2:33 PM, Stefan Beller wrote:

On Thu, May 31, 2018 at 10:40 AM, Derrick Stolee  wrote:

The commit-graph file stores a condensed version of the commit history.
This helps speed up several operations involving commit walks. This
feature does not work well if those commits "change" using features like
commit grafts, replace objects, or shallow clones.

Since the commit-graph feature is optional, hidden behind the
'core.commitGraph' config option, and requires manual maintenance (until
ds/commit-graph-fsck' is merged), these issues only arise for expert
users who decided to opt-in.

This RFC is a first attempt at rectify the issues that come up when
these features interact. I have not succeeded entirely, as I will
discuss below.

The first two "DO NOT MERGE" commits are not intended to be part of the
main product, but are ways to help the full test suite run while
computing and consuming commit-graph files. This is acheived by calling
write_commit_graph_reachable() from `git fetch` and `git commit`. I
believe this is the only dependence on ds/commit-graph-fsck. The other
commits should only depend on ds/commit-graph-lockfile-fix.

I applied these patches on top of ds/commit-graph-fsck
(it would have been nice if you mentioned that it applies there)
Running the test suite with all patches applied results in:

./t0410-partial-clone.sh(Wstat: 256 Tests: 15 Failed: 2)
   Failed tests:  5, 8
./t5307-pack-missing-commit.sh  (Wstat: 256 Tests: 5 Failed: 2)
   Failed tests:  3-4
./t5500-fetch-pack.sh   (Wstat: 256 Tests: 353 Failed: 1)
   Failed test:  348
./t6011-rev-list-with-bad-commit.sh (Wstat: 256 Tests: 6 Failed: 1)
   Failed test:  4
./t6024-recursive-merge.sh  (Wstat: 256 Tests: 6 Failed: 1)
   Failed test:  4

which you identified as 4x noise and t5500 as not understood.

$ GIT_TRACE=1 ./t5500-fetch-pack.sh -d -i -v -x
[...]
expecting success:
git -C shallow12 fetch --shallow-exclude one origin &&
git -C shallow12 log --pretty=tformat:%s origin/master >actual &&
test_write_lines three two >expected &&
test_cmp expected actual

++ git -C shallow12 fetch --shallow-exclude one origin
trace: built-in: git fetch --shallow-exclude one origin
trace: run_command: unset GIT_PREFIX; 'git-upload-pack
'\''/u/git/t/trash directory.t5500-fetch-pack/shallow-exclude/.'\'''
trace: run_command: git --shallow-file  pack-objects --revs --thin
--stdout --shallow --progress --delta-base-offset --include-tag
trace: built-in: git pack-objects --revs --thin --stdout --shallow
--progress --delta-base-offset --include-tag
remote: Counting objects: 4, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 4 (delta 0), reused 0 (delta 0)
trace: run_command: git --shallow-file  unpack-objects --pack_header=2,4
trace: built-in: git unpack-objects --pack_header=2,4
Unpacking objects: 100% (4/4), done.
trace: run_command: git rev-list --objects --stdin --not --all --quiet
trace: built-in: git rev-list --objects --stdin --not --all --quiet
trace: run_command: unset GIT_PREFIX; 'git-upload-pack
'\''/u/git/t/trash directory.t5500-fetch-pack/shallow-exclude/.'\'''
trace: run_command: git pack-objects --revs --thin --stdout --progress
--delta-base-offset
trace: built-in: git pack-objects --revs --thin --stdout --progress
--delta-base-offset
remote: Total 0 (delta 0), reused 0 (delta 0)
trace: run_command: git unpack-objects --pack_header=2,0
trace: built-in: git unpack-objects --pack_header=2,0
trace: run_command: git rev-list --objects --stdin --not --all --quiet
trace: built-in: git rev-list --objects --stdin --not --all --quiet
 From file:///u/git/t/trash directory.t5500-fetch-pack/shallow-exclude/.
  * [new tag] one-> one
  * [new tag] two-> two
run_processes_parallel: preparing to run up to 1 tasks
run_processes_parallel: done
trace: run_command: git gc --auto
trace: built-in: git gc --auto
++ git -C shallow12 log --pretty=tformat:%s origin/master
trace: built-in: git log '--pretty=tformat:%s' origin/master
++ test_write_lines three two
++ printf '%s\n' three two
++ test_cmp expected actual
++ diff -u expected actual
--- expected 2018-05-31 18:14:25.944540810 +
+++ actual 2018-05-31 18:14:25.944540810 +
@@ -1,2 +1,3 @@
  three
  two
+one
error: last command exited with $?=1
not ok 348 - fetch exclude tag one
#
# git -C shallow12 fetch --shallow-exclude one origin &&
# git -C shallow12 log --pretty=tformat:%s origin/master >actual &&
# test_write_lines three two >expected &&
# test_cmp expected actual
#



After these changes, there is one test case that still fails, and I
cannot understand why:

t5500-fetch-pack.sh Failed test:  348

This test fails when performing a `git fetch -

ds/generation-numbers (was Re: What's cooking in git.git (Jun 2018, #01; Fri, 1))

2018-06-01 Thread Derrick Stolee

On 6/1/2018 3:21 AM, Junio C Hamano wrote:


* ds/commit-graph-lockfile-fix (2018-05-22) 1 commit
   (merged to 'next' on 2018-05-24 at 3d12a02b0c)
  + commit-graph: fix UX issue when .lock file exists
  (this branch is used by ds/commit-graph-fsck; uses ds/generation-numbers.)

  Update to ds/generation-numbers topic.

  Wait for ds/generation-numbers


* ds/generation-numbers (2018-05-22) 11 commits
   (merged to 'next' on 2018-05-24 at 56fc38a1b6)
  + commit-graph.txt: update design document
  + merge: check config before loading commits
  + commit: use generation number in remove_redundant()
  + commit: add short-circuit to paint_down_to_common()
  + commit: use generation numbers for in_merge_bases()
  + ref-filter: use generation number for --contains
  + commit-graph: always load commit-graph information
  + commit: use generations in paint_down_to_common()
  + commit-graph: compute generation numbers
  + commit: add generation number to struct commit
  + ref-filter: fix outdated comment on in_commit_list
  (this branch is used by ds/commit-graph-fsck and 
ds/commit-graph-lockfile-fix.)

  A recently added "commit-graph" datafile has learned to store
  pre-computed generation numbers to speed up the decisions to stop
  history traversal.

  Will cook in 'next'.


On Wednesday, these were marked as "Will merge to 'master'" What changed?

Thanks,
-Stolee


Re: [RFC PATCH 4/6] commit-graph: avoid writing when repo is shallow

2018-06-01 Thread Derrick Stolee

On 5/31/2018 10:30 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


Shallow clones do not interact well with the commit-graph feature for
several reasons. Instead of doing the hard thing to fix those
interactions, instead prevent reading or writing a commit-graph file for
shallow repositories.

The latter instead would want to vanish, I would guess.


Do you mean that we should call destroy_commit_graph() if we detect a 
shallow repository during write_commit_graph(), then I can make that change.





Signed-off-by: Derrick Stolee 
---
  commit-graph.c | 12 
  1 file changed, 12 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 95af4ed519..80e377b90f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -208,6 +208,9 @@ static void prepare_commit_graph(void)
return;
prepare_commit_graph_run_once = 1;
  
+	if (is_repository_shallow())

+   return;
+
obj_dir = get_object_directory();
prepare_commit_graph_one(obj_dir);
prepare_alt_odb();
@@ -711,6 +714,15 @@ void write_commit_graph(const char *obj_dir,
int num_extra_edges;
struct commit_list *parent;
  
+	/*

+* Shallow clones are not supproted, as they create bad
+* generation skips as they are un-shallowed.
+*/
+   if (is_repository_shallow()) {
+   warning("writing a commit-graph in a shallow repository is not 
supported");
+   return;
+   }
+
oids.nr = 0;
oids.alloc = approximate_object_count() / 4;




Re: t5318-commit-graph.sh breaks travis gettext poison job

2018-06-01 Thread Derrick Stolee

On 6/1/2018 12:17 PM, Duy Nguyen wrote:

In case you're not checking travis, [1] reports

Test Summary Report
---
t5318-commit-graph.sh(Wstat: 256 Tests: 62
Failed: 2)
   Failed tests:  61-62
   Non-zero exit status: 1

This usually means you're grepping an i18n string

[1] https://travis-ci.org/git/git/jobs/386532326



Thanks, Duy. I believe this is related to the place Szeder pointed out 
[1] with the one translatable string in verify_commit_graph().


I've fixed this in my local branch, and plan to push v4 when review of 
v3 is complete.


Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20180530123529.32090-1-szeder@gmail.com/

    Re: [PATCH v3 16/20] commit-graph: verify contents match checksum


Re: ds/generation-numbers (was Re: What's cooking in git.git (Jun 2018, #01; Fri, 1))

2018-06-01 Thread Derrick Stolee

On 6/1/2018 7:15 PM, Junio C Hamano wrote:

Derrick Stolee  writes:

   A recently added "commit-graph" datafile has learned to store
   pre-computed generation numbers to speed up the decisions to stop
   history traversal.

   Will cook in 'next'.

On Wednesday, these were marked as "Will merge to 'master'" What changed?

Nothing has changed.  "Will merge to 'master'" means "This is now in
'next', and unless there is some blocking breakage found, this topic
will be merged to 'master' eventually".  It does not even say if
that eventuality is before or after the release the current cycle is
working towards.

When it comes near pre-release feature freeze, things in 'next' need
to be sifted into various bins, and their labels are updated.  The
ones that are too big to be comfortably merged to the upcoming
release after -rc0 has passed (i.e. biggies are better merged early
to the 'master' in the cycle to give it full cycle of larger
exposure) will be kept in 'next' so that it can go first after the
final, the ones that are low risk but with higher importance will be
merged to 'master' before the release, the ones that are trivial,
distracting and lower value (i.e.  the ones that force i18n teams
extra work) may be held in 'next', and the ones that deserve a
chance to freshly restart are marked to be kicked back to 'pu'.
Etc. etc.


Thanks, Junio. This explanation is what I expected. I suppose the small 
extra bit of information of "Will cook in 'next' until after next 
release" would have answered my question in advance. Thanks for the 
patience as I get used to your workflow.


I am a little disappointed this didn't make 2.18 because this gives some 
of the biggest speedups for typically painful computations (like 'git 
branch --contains'). The generation numbers are what give us more than a 
constant-multiple speedup; what is in master only parses commit 
relationships faster, doesn't reduce the number of commits walked. It 
also means we will release a version of Git that writes commit-graph 
file with GENERATION_ZERO and so we will never be able to deprecate that 
logic in the code.


Thanks,

-Stolee




[PATCH 01/14] graph: add packed graph design document

2018-01-25 Thread Derrick Stolee
Add Documentation/technical/packed-graph.txt with details of the planned
packed graph feature, including future plans.

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/packed-graph.txt | 185 +++
 1 file changed, 185 insertions(+)
 create mode 100644 Documentation/technical/packed-graph.txt

diff --git a/Documentation/technical/packed-graph.txt 
b/Documentation/technical/packed-graph.txt
new file mode 100644
index 00..fcc0c83874
--- /dev/null
+++ b/Documentation/technical/packed-graph.txt
@@ -0,0 +1,185 @@
+Git Packed Graph Design Notes
+=
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows above 100K.
+The merge base calculation shows up in many user-facing commands, such
+as 'status' and 'fetch' and can take minutes to compute depending on
+data shape. There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The packed graph is a file that stores the commit graph structure along
+with some extra metadata to speed up graph walks. This format allows a
+consumer to load the following info for a commit:
+
+1. The commit OID.
+2. The list of parents.
+3. The commit date.
+4. The root tree OID.
+5. An integer ID for fast lookups in the graph.
+6. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+By providing an integer ID we can avoid lookups in the graph as we walk
+commits. Specifically, we need to provide the integer ID of the parent
+commits so we navigate directly to their information on request.
+
+Define the "generation number" of a commit recursively as follows:
+ * A commit with no parents (a root commit) has generation number 1.
+ * A commit with at least one parent has generation number 1 more than
+   the largest generation number among its parents.
+Equivalently, the generation number is one more than the length of a
+longest path from the commit to a root commit. The recursive definition
+is easier to use for computation and the following property:
+
+If A and B are commits with generation numbers N and M, respectively,
+and N <= M, then A cannot reach B. That is, we know without searching
+that B is not an ancestor of A because it is further from a root commit
+than A.
+
+Conversely, when checking if A is an ancestor of B, then we only need
+to walk commits until all commits on the walk boundary have generation
+number at most N. If we walk commits using a priority queue seeded by
+generation numbers, then we always expand the boundary commit with highest
+generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+If A and B are commits with commit time X and Y, respectively, and
+X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation can make
+mistakes with topological orders (such as "git log" with default order),
+but is not used when the topological order is required (such as merge
+base calculations, "git log --graph").
+
+Design Details
+--
+
+- A graph file is stored in a file named 'graph-.graph' in the pack
+  directory. This could be stored in an alternate.
+
+- The most-recent graph file OID is stored in a 'graph-head' file for
+  immediate access and storing backup graphs. This could be stored in an
+  alternate, and refers to a 'graph-.graph' file in the same pack
+  directory.
+
+- The core.graph config setting must be on to create or consume graph files.
+
+- The graph file is only a supplemental structure. If a user downgrades
+  or disables the 'core.graph' config setting, then the existing ODB is
+  sufficient.
+
+- The file format includes parameters for the object id length
+  and hash algorithm, so a future change of hash algorithm does
+  not require a change in format.
+
+Current Limitations
+---
+
+- Only one graph file is used at one time. This allows the integer ID to
+  seek into the single graph file. It is possible to extend the model
+  for multiple graph files, but that is currently not part of the design.
+
+- .graph files are managed only by the 'graph' builtin. These are not
+  updated automatically during clone or fetch.
+
+- There is no '--verify' option for the 'graph' builtin to verify the
+  contents of the graph file.
+
+- The graph only considers commits existing in packfiles and does not
+  walk to fill in reachable commits. [Small]
+
+- When rewriting 

[PATCH 03/14] packed-graph: create git-graph builtin

2018-01-25 Thread Derrick Stolee
Teach Git the 'graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for a check that the core.graph setting is enabled
and a '--pack-dir' option.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-graph.txt |  7 +++
 Makefile|  1 +
 builtin.h   |  1 +
 builtin/graph.c | 36 
 command-list.txt|  1 +
 git.c   |  1 +
 6 files changed, 47 insertions(+)
 create mode 100644 Documentation/git-graph.txt
 create mode 100644 builtin/graph.c

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
new file mode 100644
index 00..de5a3c07e6
--- /dev/null
+++ b/Documentation/git-graph.txt
@@ -0,0 +1,7 @@
+git-graph(1)
+
+
+NAME
+
+git-graph - Write and verify Git commit graphs (.graph files)
+
diff --git a/Makefile b/Makefile
index 1a9b23b679..d8b0d0457a 100644
--- a/Makefile
+++ b/Makefile
@@ -965,6 +965,7 @@ BUILTIN_OBJS += builtin/for-each-ref.o
 BUILTIN_OBJS += builtin/fsck.o
 BUILTIN_OBJS += builtin/gc.o
 BUILTIN_OBJS += builtin/get-tar-commit-id.o
+BUILTIN_OBJS += builtin/graph.o
 BUILTIN_OBJS += builtin/grep.o
 BUILTIN_OBJS += builtin/hash-object.o
 BUILTIN_OBJS += builtin/help.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..ae7e816908 100644
--- a/builtin.h
+++ b/builtin.h
@@ -168,6 +168,7 @@ extern int cmd_format_patch(int argc, const char **argv, 
const char *prefix);
 extern int cmd_fsck(int argc, const char **argv, const char *prefix);
 extern int cmd_gc(int argc, const char **argv, const char *prefix);
 extern int cmd_get_tar_commit_id(int argc, const char **argv, const char 
*prefix);
+extern int cmd_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_grep(int argc, const char **argv, const char *prefix);
 extern int cmd_hash_object(int argc, const char **argv, const char *prefix);
 extern int cmd_help(int argc, const char **argv, const char *prefix);
diff --git a/builtin/graph.c b/builtin/graph.c
new file mode 100644
index 00..a902dc8646
--- /dev/null
+++ b/builtin/graph.c
@@ -0,0 +1,36 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "dir.h"
+#include "git-compat-util.h"
+#include "lockfile.h"
+#include "packfile.h"
+#include "parse-options.h"
+
+static char const * const builtin_graph_usage[] ={
+   N_("git graph [--pack-dir ]"),
+   NULL
+};
+
+static struct opts_graph {
+   const char *pack_dir;
+} opts;
+
+int cmd_graph(int argc, const char **argv, const char *prefix)
+{
+   static struct option builtin_graph_options[] = {
+   { OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
+   N_("dir"),
+   N_("The pack directory to store the graph") },
+   OPT_END(),
+   };
+
+   if (!core_graph)
+   die("core.graph is false");
+
+   if (argc == 2 && !strcmp(argv[1], "-h"))
+   usage_with_options(builtin_graph_usage, builtin_graph_options);
+
+   return 0;
+}
+
diff --git a/command-list.txt b/command-list.txt
index a1fad28fd8..d9c17cb9f8 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -61,6 +61,7 @@ git-format-patchmainporcelain
 git-fsckancillaryinterrogators
 git-gc  mainporcelain
 git-get-tar-commit-id   ancillaryinterrogators
+git-graph   plumbingmanipulators
 git-grepmainporcelain   info
 git-gui mainporcelain
 git-hash-object plumbingmanipulators
diff --git a/git.c b/git.c
index c870b9719c..29f8b6e7dd 100644
--- a/git.c
+++ b/git.c
@@ -408,6 +408,7 @@ static struct cmd_struct commands[] = {
{ "fsck-objects", cmd_fsck, RUN_SETUP },
{ "gc", cmd_gc, RUN_SETUP },
{ "get-tar-commit-id", cmd_get_tar_commit_id },
+   { "graph", cmd_graph, RUN_SETUP_GENTLY },
{ "grep", cmd_grep, RUN_SETUP_GENTLY },
{ "hash-object", cmd_hash_object },
{ "help", cmd_help },
-- 
2.16.0



[PATCH 00/14] Serialized Commit Graph

2018-01-25 Thread Derrick Stolee
As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base  ' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.

| Command  | Before | After  | Rel % |
|--|||---|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv   |  0.42s |  0.27s | -35%  |
| rev-list --all   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects | 32.6s  | 27.6s  | -15%  |

To test this yourself, run the following on your repo:

git config core.graph true
git show-ref -s | git graph --write --update-head

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisions by toggling the 'core.graph' setting.

[1] 
https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a...@jeffhostetler.com/
Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
A GitHub pull request containing the latest version of this patch.

P.S. I'm sending this patch from my gmail address to avoid Outlook
munging the URLs included in the design document.

Derrick Stolee (14):
  graph: add packed graph design document
  packed-graph: add core.graph setting
  packed-graph: create git-graph builtin
  packed-graph: add format document
  packed-graph: implement construct_graph()
  packed-graph: implement git-graph --write
  packed-graph: implement git-graph --read
  graph: implement git-graph --update-head
  packed-graph: implement git-graph --clear
  packed-graph: teach git-graph --delete-expired
  commit: integrate packed graph with commit parsing
  packed-graph: read only from specific pack-indexes
  packed-graph: close under reachability
  packed-graph: teach git-graph to read commits

 Documentation/config.txt |   3 +
 Documentation/git-graph.txt  | 102 
 Documentation/technical/graph-format.txt |  88 
 Documentation/technical/packed-graph.txt | 185 +++
 Makefile |   2 +
 alloc.c  |   1 +
 builtin.h|   1 +
 builtin/graph.c  | 231 +
 cache.h  |   1 +
 command-list.txt |   1 +
 commit.c |  20 +-
 commit.h |   2 +
 config.c |   5 +
 environment.c|   1 +
 git.c|   1 +
 log-tree.c   |   3 +-
 packed-graph.c   | 840 +++
 packed-graph.h   |  65 +++
 packfile.c   |   4 +-
 packfile.h   |   2 +
 t/t5319-graph.sh | 271 ++
 21 files changed, 1822 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/git-graph.txt
 create mode 100644 Documentation/technical/graph-format.txt
 create mode 100644 Documentation/technical/packed-graph.txt
 create mode 100644 builtin/graph.c
 create mode 100644 packed-graph.c
 

[PATCH 04/14] packed-graph: add format document

2018-01-25 Thread Derrick Stolee
Add document specifying the binary format for packed graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/graph-format.txt | 88 
 1 file changed, 88 insertions(+)
 create mode 100644 Documentation/technical/graph-format.txt

diff --git a/Documentation/technical/graph-format.txt 
b/Documentation/technical/graph-format.txt
new file mode 100644
index 00..a15e1036d7
--- /dev/null
+++ b/Documentation/technical/graph-format.txt
@@ -0,0 +1,88 @@
+Git commit graph format
+===
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of its parents. We
+  reserve zero as special, and can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+   4-byte signature:
+   The signature is: {'C', 'G', 'P', 'H'}
+
+   1-byte version number:
+   Currently, the only valid version is 1.
+
+   1-byte Object Id Version (1 = SHA-1)
+
+   1-byte Object Id Length (H)
+
+   1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+   (C + 1) * 12 bytes listing the table of contents for the chunks:
+   First 4 bytes describe chunk id. Value 0 is a terminating label.
+   Other 8 bytes provide offset in current file for chunk to start.
+   (Chunks are ordered contiguously in the file, so you can infer
+   the length using the next chunk position if necessary.)
+
+   The remaining data in the body is described one chunk at a time, and
+   these chunks may be given in any order. Chunks are required unless
+   otherwise specified.
+
+CHUNK DATA:
+
+   OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+   The ith entry, F[i], stores the number of OIDs with first
+   byte at most i. Thus F[255] stores the total
+   number of commits (N).
+
+   OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+   The OIDs for all commits in the graph.
+
+   Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+   * The first H bytes are for the OID of the root tree.
+   * The next 8 bytes are for the int-ids of the first two parents of
+ the ith commit. Stores value 0x if no parent in that 
position.
+ If there are more than two parents, the second value has its most-
+ significant bit on and the other bits store an offset into the 
Large
+ Edge List chunk.
+   * The next 8 bytes store the generation number of the commit and the
+ commit time in seconds since EPOCH. The generation number uses the
+ higher 30 bits of the first 4 bytes, while the commit time uses 
the
+ 32 bits of the second 4 bytes, along with the lowest 2 bits of the
+ lowest byte, storing the 33rd and 34th bit of the commit time.
+
+   [Optional] Large Edge List (ID: {'E', 'D', 'G', 'E'})
+   This list of 4-byte values store the second through nth parents for
+   all octoput merges. The second parent value in the commit data is a
+   negative number pointing into this list. Then iterate through this
+   list starting at that position until reaching a value with the most-
+   significant bit on. The other bits correspond to the int-id of the
+   last parent.
+
+TRAILER:
+
+   H-byte HASH-checksum of all of the above.
-- 
2.16.0



[PATCH 02/14] packed-graph: add core.graph setting

2018-01-25 Thread Derrick Stolee
The packed graph feature is controlled by the new core.graph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.graph is that a user can always stop checking
for or parsing packed graph files if core.graph=0.

Signed-off-by: Derrick Stolee 
---
 Documentation/config.txt | 3 +++
 cache.h  | 1 +
 config.c | 5 +
 environment.c| 1 +
 4 files changed, 10 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..e7b98fa14f 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -898,6 +898,9 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.graph::
+   Enable git commit graph feature. Allows writing and reading from .graph 
files.
+
 core.sparseCheckout::
Enable "sparse checkout" feature. See section "Sparse checkout" in
linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index d8b975a571..655a81ac90 100644
--- a/cache.h
+++ b/cache.h
@@ -825,6 +825,7 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_apply_sparse_checkout;
+extern int core_graph;
 extern int precomposed_unicode;
 extern int protect_hfs;
 extern int protect_ntfs;
diff --git a/config.c b/config.c
index e617c2018d..fee90912d8 100644
--- a/config.c
+++ b/config.c
@@ -1223,6 +1223,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.graph")) {
+   core_graph = git_config_bool(var, value);
+   return 0;
+   }
+
if (!strcmp(var, "core.sparsecheckout")) {
core_apply_sparse_checkout = git_config_bool(var, value);
return 0;
diff --git a/environment.c b/environment.c
index 63ac38a46f..0c56a3d869 100644
--- a/environment.c
+++ b/environment.c
@@ -61,6 +61,7 @@ enum object_creation_mode object_creation_mode = 
OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
+int core_graph;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;
-- 
2.16.0



[PATCH 09/14] packed-graph: implement git-graph --clear

2018-01-25 Thread Derrick Stolee
Teach Git to delete the current 'graph_head' file and the packed graph
it references. This is a good safety valve if somehow the file is
corrupted and needs to be recalculated. Since the packed graph is a
summary of contents already in the ODB, it can be regenerated.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-graph.txt | 16 ++--
 builtin/graph.c | 31 ++-
 t/t5319-graph.sh|  7 ++-
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index ac20aa67a9..f690699570 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -11,6 +11,7 @@ SYNOPSIS
 [verse]
 'git graph' --write  [--pack-dir ]
 'git graph' --read  [--pack-dir ]
+'git graph' --clear [--pack-dir ]
 
 OPTIONS
 ---
@@ -18,16 +19,21 @@ OPTIONS
Use given directory for the location of packfiles, graph-head,
and graph files.
 
+--clear::
+   Delete the graph-head file and the graph file it references.
+   (Cannot be combined with --read or --write.)
+
 --read::
Read a graph file given by the graph-head file and output basic
-   details about the graph file. (Cannot be combined with --write.)
+   details about the graph file. (Cannot be combined with --clear
+   or --write.)
 
 --graph-id::
When used with --read, consider the graph file graph-.graph.
 
 --write::
Write a new graph file to the pack directory. (Cannot be combined
-   with --read.)
+   with --clear or --read.)
 
 --update-head::
When used with --write, update the graph-head file to point to
@@ -61,6 +67,12 @@ $ git graph --write --update-head
 $ git graph --read --graph-id=
 
 
+* Delete /graph-head and the file it references.
++
+
+$ git graph --clear --pack-dir=
+
+
 CONFIGURATION
 -
 
diff --git a/builtin/graph.c b/builtin/graph.c
index 0760d99f43..ac15febc46 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -10,6 +10,7 @@
 
 static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
+   N_("git graph --clear [--pack-dir ]"),
N_("git graph --read [--graph-id=]"),
N_("git graph --write [--pack-dir ] [--update-head]"),
NULL
@@ -17,6 +18,7 @@ static char const * const builtin_graph_usage[] ={
 
 static struct opts_graph {
const char *pack_dir;
+   int clear;
int read;
const char *graph_id;
int write;
@@ -25,6 +27,29 @@ static struct opts_graph {
struct object_id old_graph_oid;
 } opts;
 
+static int graph_clear(void)
+{
+   struct strbuf head_path = STRBUF_INIT;
+   char *old_path;
+
+   if (!opts.has_existing)
+   return 0;
+
+   strbuf_addstr(&head_path, opts.pack_dir);
+   strbuf_addstr(&head_path, "/");
+   strbuf_addstr(&head_path, "graph-head");
+   if (remove_path(head_path.buf))
+   die("failed to remove path %s", head_path.buf);
+   strbuf_release(&head_path);
+
+   old_path = get_graph_filename_oid(opts.pack_dir, &opts.old_graph_oid);
+   if (remove_path(old_path))
+   die("failed to remove path %s", old_path);
+   free(old_path);
+
+   return 0;
+}
+
 static int graph_read(void)
 {
struct object_id graph_oid;
@@ -105,6 +130,8 @@ int cmd_graph(int argc, const char **argv, const char 
*prefix)
{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
N_("dir"),
N_("The pack directory to store the graph") },
+   OPT_BOOL('c', "clear", &opts.clear,
+   N_("clear graph file and graph-head")),
OPT_BOOL('r', "read", &opts.read,
N_("read graph file")),
OPT_BOOL('w', "write", &opts.write,
@@ -129,7 +156,7 @@ int cmd_graph(int argc, const char **argv, const char 
*prefix)
 builtin_graph_options,
 builtin_graph_usage, 0);
 
-   if (opts.write + opts.read > 1)
+   if (opts.write + opts.read + opts.clear > 1)
usage_with_options(builtin_graph_usage, builtin_graph_options);
 
if (!opts.pack_dir) {
@@ -141,6 +168,8 @@ int cmd_graph(int argc, const char **argv, const char 
*prefix)
 
opts.has_existing = !!get_graph_head_oid(opts.pack_dir, 
&opts.old_graph_oid);
 
+   if (opts.clear)
+   return graph_clear();
if (opts.read)
return graph_read();
 

[PATCH 08/14] graph: implement git-graph --update-head

2018-01-25 Thread Derrick Stolee
It is possible to have multiple packed graph files in a pack directory,
but only one is important at a time. Use a 'graph_head' file to point
to the important file. Teach git-graph to write 'graph_head' upon
writing a new packed graph file.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-graph.txt | 38 --
 builtin/graph.c | 38 +++---
 packed-graph.c  | 25 +
 packed-graph.h  |  1 +
 t/t5319-graph.sh| 12 ++--
 5 files changed, 107 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index 0939c3f1be..ac20aa67a9 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -12,19 +12,53 @@ SYNOPSIS
 'git graph' --write  [--pack-dir ]
 'git graph' --read  [--pack-dir ]
 
+OPTIONS
+---
+--pack-dir::
+   Use given directory for the location of packfiles, graph-head,
+   and graph files.
+
+--read::
+   Read a graph file given by the graph-head file and output basic
+   details about the graph file. (Cannot be combined with --write.)
+
+--graph-id::
+   When used with --read, consider the graph file graph-.graph.
+
+--write::
+   Write a new graph file to the pack directory. (Cannot be combined
+   with --read.)
+
+--update-head::
+   When used with --write, update the graph-head file to point to
+   the written graph file.
+
 EXAMPLES
 
 
+* Output the OID of the graph file pointed to by /graph-head.
++
+
+$ git graph --pack-dir=
+
+
 * Write a graph file for the packed commits in your local .git folder.
 +
 
-$ git midx --write
+$ git graph --write
+
+
+* Write a graph file for the packed commits in your local .git folder,
+* and update graph-head.
++
+
+$ git graph --write --update-head
 
 
 * Read basic information from a graph file.
 +
 
-$ git midx --read --graph-id=
+$ git graph --read --graph-id=
 
 
 CONFIGURATION
diff --git a/builtin/graph.c b/builtin/graph.c
index bc66722924..0760d99f43 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -11,7 +11,7 @@
 static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
N_("git graph --read [--graph-id=]"),
-   N_("git graph --write [--pack-dir ]"),
+   N_("git graph --write [--pack-dir ] [--update-head]"),
NULL
 };
 
@@ -20,6 +20,9 @@ static struct opts_graph {
int read;
const char *graph_id;
int write;
+   int update_head;
+   int has_existing;
+   struct object_id old_graph_oid;
 } opts;
 
 static int graph_read(void)
@@ -30,8 +33,8 @@ static int graph_read(void)
 
if (opts.graph_id && strlen(opts.graph_id) == GIT_MAX_HEXSZ)
get_oid_hex(opts.graph_id, &graph_oid);
-   else
-   die("no graph id specified");
+   else if (!get_graph_head_oid(opts.pack_dir, &graph_oid))
+   die("no graph-head exists.");
 
graph_file = get_graph_filename_oid(opts.pack_dir, &graph_oid);
graph = load_packed_graph_one(graph_file, opts.pack_dir);
@@ -62,10 +65,33 @@ static int graph_read(void)
return 0;
 }
 
+static void update_head_file(const char *pack_dir, const struct object_id 
*graph_id)
+{
+   struct strbuf head_path = STRBUF_INIT;
+   int fd;
+   struct lock_file lk = LOCK_INIT;
+
+   strbuf_addstr(&head_path, pack_dir);
+   strbuf_addstr(&head_path, "/");
+   strbuf_addstr(&head_path, "graph-head");
+
+   fd = hold_lock_file_for_update(&lk, head_path.buf, LOCK_DIE_ON_ERROR);
+   strbuf_release(&head_path);
+
+   if (fd < 0)
+   die_errno("unable to open graph-head");
+
+   write_in_full(fd, oid_to_hex(graph_id), GIT_MAX_HEXSZ);
+   commit_lock_file(&lk);
+}
+
 static int graph_write(void)
 {
struct object_id *graph_id = construct_graph(opts.pack_dir);
 
+   if (opts.update_head)
+   update_head_file(opts.pack_dir, graph_id);
+
if (graph_id)
printf("%s\n", oid_to_hex(graph_id));
 
@@ -83,6 +109,8 @@ int cmd_graph(int argc, const char **argv, const char 
*prefix)
N_("read graph file")),
OPT_BOOL('w', "write", &opts.write,
N_("write graph file")),
+ 

[PATCH 06/14] packed-graph: implement git-graph --write

2018-01-25 Thread Derrick Stolee
Teach git-graph to write graph files. Create new test script to verify
this command succeeds without failure.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-graph.txt | 26 ++
 builtin/graph.c | 37 ++--
 t/t5319-graph.sh| 83 +
 3 files changed, 143 insertions(+), 3 deletions(-)
 create mode 100755 t/t5319-graph.sh

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index de5a3c07e6..be6bc38814 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -5,3 +5,29 @@ NAME
 
 git-graph - Write and verify Git commit graphs (.graph files)
 
+
+SYNOPSIS
+
+[verse]
+'git graph' --write  [--pack-dir ]
+
+EXAMPLES
+
+
+* Write a graph file for the packed commits in your local .git folder.
++
+
+$ git midx --write
+
+
+CONFIGURATION
+-
+
+core.graph::
+   The graph command will fail if core.graph is false.
+   Also, the written graph files will be ignored by other commands
+   unless core.graph is true.
+
+GIT
+---
+Part of the linkgit:git[1] suite
\ No newline at end of file
diff --git a/builtin/graph.c b/builtin/graph.c
index a902dc8646..09f5552338 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -6,31 +6,62 @@
 #include "lockfile.h"
 #include "packfile.h"
 #include "parse-options.h"
+#include "packed-graph.h"
 
 static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
+   N_("git graph --write [--pack-dir ]"),
NULL
 };
 
 static struct opts_graph {
const char *pack_dir;
+   int write;
 } opts;
 
+static int graph_write(void)
+{
+   struct object_id *graph_id = construct_graph(opts.pack_dir);
+
+   if (graph_id)
+   printf("%s\n", oid_to_hex(graph_id));
+
+   free(graph_id);
+   return 0;
+}
+
 int cmd_graph(int argc, const char **argv, const char *prefix)
 {
static struct option builtin_graph_options[] = {
{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
N_("dir"),
N_("The pack directory to store the graph") },
+   OPT_BOOL('w', "write", &opts.write,
+   N_("write graph file")),
OPT_END(),
};
 
-   if (!core_graph)
-   die("core.graph is false");
-
if (argc == 2 && !strcmp(argv[1], "-h"))
usage_with_options(builtin_graph_usage, builtin_graph_options);
 
+   git_config(git_default_config, NULL);
+   if (!core_graph)
+   die("git-graph requires core.graph=true.");
+
+   argc = parse_options(argc, argv, prefix,
+builtin_graph_options,
+builtin_graph_usage, 0);
+
+   if (!opts.pack_dir) {
+   struct strbuf path = STRBUF_INIT;
+   strbuf_addstr(&path, get_object_directory());
+   strbuf_addstr(&path, "/pack");
+   opts.pack_dir = strbuf_detach(&path, NULL);
+   }
+
+   if (opts.write)
+   return graph_write();
+
return 0;
 }
 
diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh
new file mode 100755
index 00..52e979dfd3
--- /dev/null
+++ b/t/t5319-graph.sh
@@ -0,0 +1,83 @@
+#!/bin/sh
+
+test_description='packed graph'
+. ./test-lib.sh
+
+test_expect_success \
+'setup full repo' \
+'rm -rf .git &&
+ mkdir full &&
+ cd full &&
+ git init &&
+ git config core.graph true &&
+ git config pack.threads 1 &&
+ packdir=".git/objects/pack"'
+
+test_expect_success \
+'write graph with no packs' \
+'git graph --write --pack-dir .'
+
+test_expect_success \
+'create commits and repack' \
+'for i in $(test_seq 5)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git repack'
+
+test_expect_success \
+'write graph' \
+'graph1=$(git graph --write) &&
+ test_path_is_file ${packdir}/graph-${graph1}.graph'
+
+test_expect_success \
+'Add more commits' \
+'git reset --hard commits/3 &&
+ for i in $(test_seq 6 10)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git reset --hard commits/7

[PATCH 14/14] packed-graph: teach git-graph to read commits

2018-01-25 Thread Derrick Stolee
Teach git-graph to read commits from stdin when the --stdin-commits
flag is specified. Commits reachable from these commits are added to
the graph. This is a much faster way to construct the graph than
inspecting all packed objects, but is restricted to known tips.

For the Linux repository, 700,000+ commits were added to the graph
file starting from 'master' in 7-9 seconds, depending on the number
of packfiles in the repo (1, 24, or 120).

Signed-off-by: Derrick Stolee 
---
 builtin/graph.c  | 33 +
 packed-graph.c   | 18 +++---
 packed-graph.h   |  3 ++-
 t/t5319-graph.sh | 18 ++
 4 files changed, 60 insertions(+), 12 deletions(-)

diff --git a/builtin/graph.c b/builtin/graph.c
index 3cace3a18c..708889677b 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
N_("git graph --clear [--pack-dir ]"),
N_("git graph --read [--graph-id=]"),
-   N_("git graph --write [--pack-dir ] [--update-head] 
[--delete-expired] [--stdin-packs]"),
+   N_("git graph --write [--pack-dir ] [--update-head] 
[--delete-expired] [--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -25,6 +25,7 @@ static struct opts_graph {
int update_head;
int delete_expired;
int stdin_packs;
+   int stdin_commits;
int has_existing;
struct object_id old_graph_oid;
 } opts;
@@ -116,22 +117,36 @@ static int graph_write(void)
 {
struct object_id *graph_id;
char **pack_indexes = NULL;
+   char **commits = NULL;
int num_packs = 0;
-   int size_packs = 0;
+   int num_commits = 0;
+   char **lines = NULL;
+   int num_lines = 0;
+   int size_lines = 0;
 
-   if (opts.stdin_packs) {
+   if (opts.stdin_packs || opts.stdin_commits) {
struct strbuf buf = STRBUF_INIT;
-   size_packs = 128;
-   ALLOC_ARRAY(pack_indexes, size_packs);
+   size_lines = 128;
+   ALLOC_ARRAY(lines, size_lines);
 
while (strbuf_getline(&buf, stdin) != EOF) {
-   ALLOC_GROW(pack_indexes, num_packs + 1, size_packs);
-   pack_indexes[num_packs++] = buf.buf;
+   ALLOC_GROW(lines, num_lines + 1, size_lines);
+   lines[num_lines++] = buf.buf;
strbuf_detach(&buf, NULL);
}
+
+   if (opts.stdin_packs) {
+   pack_indexes = lines;
+   num_packs = num_lines;
+   }
+   if (opts.stdin_commits) {
+   commits = lines;
+   num_commits = num_lines;
+   }
}
 
-   graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs);
+   graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs,
+  commits, num_commits);
 
if (opts.update_head)
update_head_file(opts.pack_dir, graph_id);
@@ -170,6 +185,8 @@ int cmd_graph(int argc, const char **argv, const char 
*prefix)
N_("delete expired head graph file")),
OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
N_("only scan packfiles listed by stdin")),
+   OPT_BOOL('C', "stdin-commits", &opts.stdin_commits,
+   N_("start walk at commits listed by stdin")),
{ OPTION_STRING, 'G', "graph-id", &opts.graph_id,
N_("oid"),
N_("An OID for a specific graph file in the pack-dir."),
diff --git a/packed-graph.c b/packed-graph.c
index c93515f18e..94e1a97000 100644
--- a/packed-graph.c
+++ b/packed-graph.c
@@ -662,7 +662,8 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
-struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, 
int nr_packs)
+struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, 
int nr_packs,
+ char **commit_hex, int nr_commits)
 {
// Find a list of oids, adding the pointer to a list.
struct packed_oid_list oids;
@@ -719,10 +720,21 @@ struct object_id *construct_graph(const char *pack_dir, 
char **pack_indexes, int
for_each_object_in_pack(p, 
if_packed_commit_add_to_list, &oids);
close_pack(p);
}
-   } else {
-   for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
}
 
+   if (commit_hex) {
+   for (i = 0; i < nr_commits; i++) {
+   const char *end;
+

[PATCH 05/14] packed-graph: implement construct_graph()

2018-01-25 Thread Derrick Stolee
Teach Git to write a packed graph file by checking all packed objects
to see if they are commits, then store the file in the given pack
directory.

Signed-off-by: Derrick Stolee 
---
 Makefile   |   1 +
 packed-graph.c | 375 +
 packed-graph.h |  20 +++
 3 files changed, 396 insertions(+)
 create mode 100644 packed-graph.c
 create mode 100644 packed-graph.h

diff --git a/Makefile b/Makefile
index d8b0d0457a..59439e13a1 100644
--- a/Makefile
+++ b/Makefile
@@ -841,6 +841,7 @@ LIB_OBJS += notes-utils.o
 LIB_OBJS += object.o
 LIB_OBJS += oidmap.o
 LIB_OBJS += oidset.o
+LIB_OBJS += packed-graph.o
 LIB_OBJS += packfile.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-bitmap-write.o
diff --git a/packed-graph.c b/packed-graph.c
new file mode 100644
index 00..9be9811667
--- /dev/null
+++ b/packed-graph.c
@@ -0,0 +1,375 @@
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "pack.h"
+#include "packfile.h"
+#include "commit.h"
+#include "object.h"
+#include "packed-graph.h"
+
+#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
+#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
+#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */
+
+#define GRAPH_DATA_WIDTH 36
+
+#define GRAPH_VERSION_1 0x1
+#define GRAPH_VERSION GRAPH_VERSION_1
+
+#define GRAPH_OID_VERSION_SHA1 1
+#define GRAPH_OID_LEN_SHA1 20
+#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1
+#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1
+
+#define GRAPH_LARGE_EDGES_NEEDED 0x8000
+#define GRAPH_PARENT_MISSING 0x7fff
+#define GRAPH_EDGE_LAST_MASK 0x7fff
+#define GRAPH_PARENT_NONE 0x7000
+
+#define GRAPH_LAST_EDGE 0x8000
+
+char* get_graph_filename_oid(const char *pack_dir,
+ struct object_id *oid)
+{
+   size_t len;
+   struct strbuf head_path = STRBUF_INIT;
+   strbuf_addstr(&head_path, pack_dir);
+   strbuf_addstr(&head_path, "/graph-");
+   strbuf_addstr(&head_path, oid_to_hex(oid));
+   strbuf_addstr(&head_path, ".graph");
+
+   return strbuf_detach(&head_path, &len);
+}
+
+static void write_graph_chunk_fanout(
+   struct sha1file *f,
+   struct commit **commits, int nr_commits)
+{
+   uint32_t i, count = 0;
+   struct commit **list = commits;
+   struct commit **last = commits + nr_commits;
+
+   /*
+   * Write the first-level table (the list is sorted,
+   * but we use a 256-entry lookup to be able to avoid
+   * having to do eight extra binary search iterations).
+   */
+   for (i = 0; i < 256; i++) {
+   uint32_t swap_count;
+
+   while (list < last) {
+   if ((*list)->object.oid.hash[0] != i)
+   break;
+   count++;
+   list++;
+   }
+
+   swap_count = htonl(count);
+   sha1write(f, &swap_count, 4);
+   }
+}
+
+static void write_graph_chunk_oids(
+   struct sha1file *f, int hash_len,
+   struct commit **commits, int nr_commits)
+{
+   struct commit **list = commits;
+   uint32_t i;
+   for (i = 0; i < nr_commits; i++) {
+   sha1write(f, (*list)->object.oid.hash, (int)hash_len);
+   list++;
+   }
+}
+
+static int commit_pos(struct commit **commits, int nr_commits, const struct 
object_id *oid, uint32_t *pos)
+{
+   uint32_t first = 0, last = nr_commits;
+
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   struct object_id *current;
+   int cmp;
+
+   current = &(commits[mid]->object.oid);
+   cmp = oidcmp(oid, current);
+   if (!cmp) {
+   *pos = mid;
+   return 1;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   *pos = first;
+   return 0;
+}
+
+static void write_graph_chunk_data(
+   struct sha1file *f, int hash_len,
+   struct commit **commits, int nr_commits)
+{
+   struct commit **list = commits;
+   struct commit **last = commits + nr_commits;
+   uint32_t num_large_edges = 0;
+
+   while (list < last) {
+   struct commit_list *parent;
+   uint32_t intId, swapIntId;
+   uint32_t packedDate[2];
+
+   parse_commit(*list);
+   sha1write(f, (*list)->tree->object.oid.hash, hash_len);
+
+   parent = (*list)->parents;
+
+   

[PATCH 07/14] packed-graph: implement git-graph --read

2018-01-25 Thread Derrick Stolee
Teach git-graph to read packed graph files and summarize their contents.

Use the --read option to verify the contents of a graph file in the
graph tests.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-graph.txt |   7 +++
 builtin/graph.c |  54 
 packed-graph.c  | 147 +++-
 packed-graph.h  |  25 
 t/t5319-graph.sh|  50 +--
 5 files changed, 260 insertions(+), 23 deletions(-)

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index be6bc38814..0939c3f1be 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 
 [verse]
 'git graph' --write  [--pack-dir ]
+'git graph' --read  [--pack-dir ]
 
 EXAMPLES
 
@@ -20,6 +21,12 @@ EXAMPLES
 $ git midx --write
 
 
+* Read basic information from a graph file.
++
+
+$ git midx --read --graph-id=
+
+
 CONFIGURATION
 -
 
diff --git a/builtin/graph.c b/builtin/graph.c
index 09f5552338..bc66722924 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -10,15 +10,58 @@
 
 static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
+   N_("git graph --read [--graph-id=]"),
N_("git graph --write [--pack-dir ]"),
NULL
 };
 
 static struct opts_graph {
const char *pack_dir;
+   int read;
+   const char *graph_id;
int write;
 } opts;
 
+static int graph_read(void)
+{
+   struct object_id graph_oid;
+   struct packed_graph *graph = 0;
+   const char *graph_file;
+
+   if (opts.graph_id && strlen(opts.graph_id) == GIT_MAX_HEXSZ)
+   get_oid_hex(opts.graph_id, &graph_oid);
+   else
+   die("no graph id specified");
+
+   graph_file = get_graph_filename_oid(opts.pack_dir, &graph_oid);
+   graph = load_packed_graph_one(graph_file, opts.pack_dir);
+
+   if (!graph)
+   die("graph file %s does not exist.\n", graph_file);
+
+   printf("header: %08x %02x %02x %02x %02x\n",
+   ntohl(graph->hdr->graph_signature),
+   graph->hdr->graph_version,
+   graph->hdr->hash_version,
+   graph->hdr->hash_len,
+   graph->hdr->num_chunks);
+   printf("num_commits: %u\n", graph->num_commits);
+   printf("chunks:");
+
+   if (graph->chunk_oid_fanout)
+   printf(" oid_fanout");
+   if (graph->chunk_oid_lookup)
+   printf(" oid_lookup");
+   if (graph->chunk_commit_data)
+   printf(" commit_metadata");
+   if (graph->chunk_large_edges)
+   printf(" large_edges");
+   printf("\n");
+
+   printf("pack_dir: %s\n", graph->pack_dir);
+   return 0;
+}
+
 static int graph_write(void)
 {
struct object_id *graph_id = construct_graph(opts.pack_dir);
@@ -36,8 +79,14 @@ int cmd_graph(int argc, const char **argv, const char 
*prefix)
{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
N_("dir"),
N_("The pack directory to store the graph") },
+   OPT_BOOL('r', "read", &opts.read,
+   N_("read graph file")),
OPT_BOOL('w', "write", &opts.write,
N_("write graph file")),
+   { OPTION_STRING, 'M', "graph-id", &opts.graph_id,
+   N_("oid"),
+   N_("An OID for a specific graph file in the pack-dir."),
+   PARSE_OPT_OPTARG, NULL, (intptr_t) "" },
OPT_END(),
};
 
@@ -52,6 +101,9 @@ int cmd_graph(int argc, const char **argv, const char 
*prefix)
 builtin_graph_options,
 builtin_graph_usage, 0);
 
+   if (opts.write + opts.read > 1)
+   usage_with_options(builtin_graph_usage, builtin_graph_options);
+
if (!opts.pack_dir) {
struct strbuf path = STRBUF_INIT;
strbuf_addstr(&path, get_object_directory());
@@ -59,6 +111,8 @@ int cmd_graph(int argc, const char **argv, const char 
*prefix)
opts.pack_dir = strbuf_detach(&path, NULL);
}
 
+   if (opts.read)
+   return graph_read();
if (opts.write)
return graph_write();
 
diff --git a/packed-graph.c b/packed-graph.c
index 9be981

[PATCH 10/14] packed-graph: teach git-graph --delete-expired

2018-01-25 Thread Derrick Stolee
Teach git-graph to delete the graph previously referenced by 'graph_head'
when writing a new graph file and updating 'graph_head'. This prevents
data creep by storing a list of useless graphs. Be careful to not delete
the graph if the file did not change.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-graph.txt |  8 ++--
 builtin/graph.c | 14 +-
 t/t5319-graph.sh| 37 +++--
 3 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index f690699570..f4f1048d28 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -39,6 +39,10 @@ OPTIONS
When used with --write, update the graph-head file to point to
the written graph file.
 
+--delete-expired::
+   When used with --write and --update-head, delete the graph file
+   previously referenced by graph-head.
+
 EXAMPLES
 
 
@@ -55,10 +59,10 @@ $ git graph --write
 
 
 * Write a graph file for the packed commits in your local .git folder,
-* and update graph-head.
+* update graph-head, and delete the old graph-.graph file.
 +
 
-$ git graph --write --update-head
+$ git graph --write --update-head --delete-expired
 
 
 * Read basic information from a graph file.
diff --git a/builtin/graph.c b/builtin/graph.c
index ac15febc46..adf779b601 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
N_("git graph --clear [--pack-dir ]"),
N_("git graph --read [--graph-id=]"),
-   N_("git graph --write [--pack-dir ] [--update-head]"),
+   N_("git graph --write [--pack-dir ] [--update-head] 
[--delete-expired]"),
NULL
 };
 
@@ -23,6 +23,7 @@ static struct opts_graph {
const char *graph_id;
int write;
int update_head;
+   int delete_expired;
int has_existing;
struct object_id old_graph_oid;
 } opts;
@@ -120,6 +121,15 @@ static int graph_write(void)
if (graph_id)
printf("%s\n", oid_to_hex(graph_id));
 
+   if (opts.delete_expired && opts.update_head && opts.has_existing &&
+   oidcmp(graph_id, &opts.old_graph_oid)) {
+   char *old_path = get_graph_filename_oid(opts.pack_dir, 
&opts.old_graph_oid);
+   if (remove_path(old_path))
+   die("failed to remove path %s", old_path);
+
+   free(old_path);
+   }
+
free(graph_id);
return 0;
 }
@@ -138,6 +148,8 @@ int cmd_graph(int argc, const char **argv, const char 
*prefix)
N_("write graph file")),
OPT_BOOL('u', "update-head", &opts.update_head,
N_("update graph-head to written graph file")),
+   OPT_BOOL('d', "delete-expired", &opts.delete_expired,
+   N_("delete expired head graph file")),
{ OPTION_STRING, 'M', "graph-id", &opts.graph_id,
N_("oid"),
N_("An OID for a specific graph file in the pack-dir."),
diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh
index 311fb9dd67..a70c7bbb02 100755
--- a/t/t5319-graph.sh
+++ b/t/t5319-graph.sh
@@ -80,9 +80,42 @@ test_expect_success 'write graph with merges' \
  _graph_read_expect "18" "${packdir}" &&
  cmp expect output'
 
+test_expect_success 'Add more commits' \
+'git reset --hard commits/3 &&
+ for i in $(test_seq 16 20)
+ do
+git commit --allow-empty -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git repack'
+
+test_expect_success 'write graph with merges' \
+'graph3=$(git graph --write --update-head --delete-expired) &&
+ test_path_is_file ${packdir}/graph-${graph3}.graph &&
+ test_path_is_missing ${packdir}/graph-${graph2}.graph &&
+ test_path_is_file ${packdir}/graph-${graph1}.graph &&
+ test_path_is_file ${packdir}/graph-head &&
+ echo ${graph3} >expect &&
+ cmp -n 40 expect ${packdir}/graph-head &&
+ git graph --read --graph-id=${graph3} >output &&
+ _graph_read_expect "23" "${packdir}" &&
+ cmp expect output'
+
+test_expect_success 'write graph with nothing new' \
+'graph4=$(git graph --write --update-head --delete-expired) &&
+ test_

[PATCH 12/14] packed-graph: read only from specific pack-indexes

2018-01-25 Thread Derrick Stolee
Teach git-graph to inspect the objects only in a certain list of
pack-indexes within the given pack directory. This allows updating
the graph iteratively, since we add all commits stored in a previous
packed graph.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-graph.txt | 12 
 builtin/graph.c | 26 +++---
 packed-graph.c  | 27 +++
 packed-graph.h  |  2 +-
 packfile.c  |  4 ++--
 packfile.h  |  2 ++
 t/t5319-graph.sh| 10 ++
 7 files changed, 69 insertions(+), 14 deletions(-)

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index f4f1048d28..b68a61ddea 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -43,6 +43,11 @@ OPTIONS
When used with --write and --update-head, delete the graph file
previously referenced by graph-head.
 
+--stdin-packs::
+   When used with --write, generate the new graph by walking objects
+   only in the specified packfiles and any commits in the
+   existing graph-head.
+
 EXAMPLES
 
 
@@ -65,6 +70,13 @@ $ git graph --write
 $ git graph --write --update-head --delete-expired
 
 
+* Write a graph file, extending the current graph file using commits
+* in , update graph-head, and delete the old graph-.graph file.
++
+
+$ echo  | git graph --write --update-head --delete-expired 
--stdin-packs
+
+
 * Read basic information from a graph file.
 +
 
diff --git a/builtin/graph.c b/builtin/graph.c
index adf779b601..3cace3a18c 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
N_("git graph --clear [--pack-dir ]"),
N_("git graph --read [--graph-id=]"),
-   N_("git graph --write [--pack-dir ] [--update-head] 
[--delete-expired]"),
+   N_("git graph --write [--pack-dir ] [--update-head] 
[--delete-expired] [--stdin-packs]"),
NULL
 };
 
@@ -24,6 +24,7 @@ static struct opts_graph {
int write;
int update_head;
int delete_expired;
+   int stdin_packs;
int has_existing;
struct object_id old_graph_oid;
 } opts;
@@ -113,7 +114,24 @@ static void update_head_file(const char *pack_dir, const 
struct object_id *graph
 
 static int graph_write(void)
 {
-   struct object_id *graph_id = construct_graph(opts.pack_dir);
+   struct object_id *graph_id;
+   char **pack_indexes = NULL;
+   int num_packs = 0;
+   int size_packs = 0;
+
+   if (opts.stdin_packs) {
+   struct strbuf buf = STRBUF_INIT;
+   size_packs = 128;
+   ALLOC_ARRAY(pack_indexes, size_packs);
+
+   while (strbuf_getline(&buf, stdin) != EOF) {
+   ALLOC_GROW(pack_indexes, num_packs + 1, size_packs);
+   pack_indexes[num_packs++] = buf.buf;
+   strbuf_detach(&buf, NULL);
+   }
+   }
+
+   graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs);
 
if (opts.update_head)
update_head_file(opts.pack_dir, graph_id);
@@ -150,7 +168,9 @@ int cmd_graph(int argc, const char **argv, const char 
*prefix)
N_("update graph-head to written graph file")),
OPT_BOOL('d', "delete-expired", &opts.delete_expired,
N_("delete expired head graph file")),
-   { OPTION_STRING, 'M', "graph-id", &opts.graph_id,
+   OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
+   N_("only scan packfiles listed by stdin")),
+   { OPTION_STRING, 'G', "graph-id", &opts.graph_id,
N_("oid"),
N_("An OID for a specific graph file in the pack-dir."),
PARSE_OPT_OPTARG, NULL, (intptr_t) "" },
diff --git a/packed-graph.c b/packed-graph.c
index 343b231973..0dc68a077e 100644
--- a/packed-graph.c
+++ b/packed-graph.c
@@ -401,7 +401,7 @@ static int fill_packed_commit(struct commit *item, struct 
packed_graph *g, uint3
  *  2. date
  *  3. parents.
  *
- * Returns 1 iff the commit was found in the packed graph.
+ * Returns 1 if and only if the commit was found in the packed graph.
  *
  * See parse_commit_buffer() for the fallback after this call.
  */
@@ -427,7 +427,7 @@ int parse_packed_commit(struct commit *item)
return fill_packed_commit(item, packed_graph, pos);
}
 
-   

[PATCH 13/14] packed-graph: close under reachability

2018-01-25 Thread Derrick Stolee
Teach construct_graph() to walk all parents from the commits discovered in
packfiles. This prevents gaps given by loose objects or previously-missed
packfiles.

Signed-off-by: Derrick Stolee 
---
 packed-graph.c   | 26 ++
 t/t5319-graph.sh | 14 ++
 2 files changed, 40 insertions(+)

diff --git a/packed-graph.c b/packed-graph.c
index 0dc68a077e..c93515f18e 100644
--- a/packed-graph.c
+++ b/packed-graph.c
@@ -5,6 +5,7 @@
 #include "packfile.h"
 #include "commit.h"
 #include "object.h"
+#include "revision.h"
 #include "packed-graph.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
@@ -638,6 +639,29 @@ static int if_packed_commit_add_to_list(const struct 
object_id *oid,
return 0;
 }
 
+static void close_reachable(struct packed_oid_list *oids)
+{
+   int i;
+   struct rev_info revs;
+   struct commit *commit;
+   init_revisions(&revs, NULL);
+
+   for (i = 0; i < oids->num; i++) {
+   commit = lookup_commit(oids->list[i]);
+   if (commit && !parse_commit(commit))
+   revs.commits = commit_list_insert(commit, 
&revs.commits);
+   }
+
+   if (prepare_revision_walk(&revs))
+   die(_("revision walk setup failed"));
+
+   while ((commit = get_revision(&revs)) != NULL) {
+   ALLOC_GROW(oids->list, oids->num + 1, oids->size);
+   oids->list[oids->num] = &(commit->object.oid);
+   (oids->num)++;
+   }
+}
+
 struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, 
int nr_packs)
 {
// Find a list of oids, adding the pointer to a list.
@@ -698,6 +722,8 @@ struct object_id *construct_graph(const char *pack_dir, 
char **pack_indexes, int
} else {
for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
}
+
+   close_reachable(&oids);
QSORT(oids.list, oids.num, commit_compare);
 
count_distinct = 1;
diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh
index 969150cd21..8bf5a0c993 100755
--- a/t/t5319-graph.sh
+++ b/t/t5319-graph.sh
@@ -212,6 +212,20 @@ test_expect_success 'clear graph' \
 _graph_git_behavior commits/20 merge/1
 _graph_git_behavior commits/20 merge/2
 
+test_expect_success 'build graph from latest pack with closure' \
+'graph5=$(cat new-idx | git graph --write --update-head --stdin-packs) &&
+ test_path_is_file ${packdir}/graph-${graph5}.graph &&
+ test_path_is_file ${packdir}/graph-${graph1}.graph &&
+ test_path_is_file ${packdir}/graph-head &&
+ echo ${graph5} >expect &&
+ cmp -n 40 expect ${packdir}/graph-head &&
+ git graph --read --graph-id=${graph5} >output &&
+ _graph_read_expect "21" "${packdir}" &&
+ cmp expect output'
+
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'setup bare repo' \
 'cd .. &&
  git clone --bare full bare &&
-- 
2.16.0



[PATCH 11/14] commit: integrate packed graph with commit parsing

2018-01-25 Thread Derrick Stolee
Teach Git to inspect a packed graph to supply the contents of a
struct commit when calling parse_commit_gently(). This implementation
satisfies all post-conditions on the struct commit, including loading
parents, the root tree, and the commit date. The only loosely-expected
condition is that the commit buffer is loaded into the cache. This
was checked in log-tree.c:show_log(), but the "return;" on failure
produced unexpected results (i.e. the message line was never terminated).
The new behavior of loading the buffer when needed prevents the
unexpected behavior.

If core.graph is false, then do not load the graph and behave as usual.

In test script t5319-graph.sh, add output-matching conditions on read-
only graph operations.

By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commits walks. Here are some performance
results for a copy of the Linux repository where 'master' has 704,766
reachable commits and is behind 'origin/master' by 19,610 commits.

| Command  | Before | After  | Rel % |
|--|||---|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv   |  0.42s |  0.27s | -35%  |
| rev-list --all   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects | 32.6s  | 27.6s  | -15%  |

Signed-off-by: Derrick Stolee 
---
 alloc.c  |   1 +
 commit.c |  20 -
 commit.h |   2 +
 log-tree.c   |   3 +-
 packed-graph.c   | 242 +++
 packed-graph.h   |  18 +
 t/t5319-graph.sh | 114 --
 7 files changed, 387 insertions(+), 13 deletions(-)

diff --git a/alloc.c b/alloc.c
index 12afadfacd..4a4dcfa2b7 100644
--- a/alloc.c
+++ b/alloc.c
@@ -93,6 +93,7 @@ void *alloc_commit_node(void)
struct commit *c = alloc_node(&commit_state, sizeof(struct commit));
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
+   c->graphId = 0x;
return c;
 }
 
diff --git a/commit.c b/commit.c
index cab8d4455b..253c102808 100644
--- a/commit.c
+++ b/commit.c
@@ -12,6 +12,7 @@
 #include "prio-queue.h"
 #include "sha1-lookup.h"
 #include "wt-status.h"
+#include "packed-graph.h"
 
 static struct commit_extra_header *read_commit_extra_header_lines(const char 
*buf, size_t len, const char **);
 
@@ -374,7 +375,7 @@ int parse_commit_buffer(struct commit *item, const void 
*buffer, unsigned long s
return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
check_packed)
 {
enum object_type type;
void *buffer;
@@ -383,19 +384,27 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
 
if (!item)
return -1;
+
+   // If we already parsed, but got it from the graph, then keep going!
if (item->object.parsed)
return 0;
+
+   if (check_packed && parse_packed_commit(item))
+   return 0;
+
buffer = read_sha1_file(item->object.oid.hash, &type, &size);
if (!buffer)
return quiet_on_missing ? -1 :
error("Could not read %s",
-oid_to_hex(&item->object.oid));
+   oid_to_hex(&item->object.oid));
if (type != OBJ_COMMIT) {
free(buffer);
return error("Object %s not a commit",
-oid_to_hex(&item->object.oid));
+   oid_to_hex(&item->object.oid));
}
+
ret = parse_commit_buffer(item, buffer, size);
+
if (save_commit_buffer && !ret) {
set_commit_buffer(item, buffer, size);
return 0;
@@ -404,6 +413,11 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+   return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
if (parse_commit(item))
diff --git a/commit.h b/commit.h
index 8c68ca1a5a..02f5f2a182 100644
--- a/commit.h
+++ b/commit.h
@@ -21,6 +21,7 @@ struct commit {
timestamp_t date;
struct commit_list *parents;
struct tree *tree;
+   uint32_t graphId;
 };
 
 extern int save_commit_buffer;
@@ -60,6 +61,7 @@ struct commit *lookup_commit_reference_by_name(const char 
*name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char 
*ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size);
+extern int parse_commit_internal(struct commit *item, int quiet_on_mis

Re: [PATCH 00/14] Serialized Commit Graph

2018-01-25 Thread Derrick Stolee

On 1/25/2018 10:46 AM, Ævar Arnfjörð Bjarmason wrote:

On Thu, Jan 25 2018, Derrick Stolee jotted:


* 'git log --topo-order -1000' walks all reachable commits to avoid
   incorrect topological orders, but only needs the commit message for
   the top 1000 commits.

* 'git merge-base  ' may walk many commits to find the correct
   boundary between the commits reachable from A and those reachable
   from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
   compared to their upstream remote branches. This is essentially as
   hard as computing merge bases for each.

This is great, spotted / questions so far:

* git graph --blah says you need to enable the config, should say
   "unknown option --blah ". I.e. overzelous config guard.


This is a good point.


* On a big repo (git show-ref -s | ~/g/git/git-graph --write
   --update-head) is as of writing this still hanging for me, but strace
   shows it's brk()-ing. Presumably just still busy, a progress bar would
   be very nice.


Oops! This is my mistake. The correct command should be:

    git show-ref -s | git graph --write --update-head --stdin-commits

Without "--stdin-commits" the command will walk all packed objects
to look for commits and then build the graph. That's why it's taking
so long. That method takes several minutes on the Linux repo, but with
--stdin-commits it should take as long as "git log >/dev/null".


* Shouldn't there be a pack.useGraph option so this gets auto-updated on
   repack? I understand this series is a WIP, so that's more a "is that
   the UI" than "it needs now".


This will definitely be part of a follow-up patch.

Thanks,
-Stolee


Re: [PATCH 02/14] packed-graph: add core.graph setting

2018-01-25 Thread Derrick Stolee

On 1/25/2018 3:17 PM, Stefan Beller wrote:

On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee  wrote:

The packed graph feature is controlled by the new core.graph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.graph is that a user can always stop checking
for or parsing packed graph files if core.graph=0.

@@ -825,6 +825,7 @@ extern char *git_replace_ref_base;
  extern int fsync_object_files;
  extern int core_preload_index;
  extern int core_apply_sparse_checkout;
+extern int core_graph;

Putting it here instead of say the_repository makes sense as you'd want
to use this feature globally. However you can still have the config
different per repository  (e.g. version number of the graph setting,
as one might be optimized for speed and the other for file size of
the .graph file or such).

So not sure if we'd rather want to put this into the repository struct.
But then again the other core settings aren't there either and this
feature sounds like it is repository specific only in the experimental
phase; later it is expected to be on everywhere?


I do think that more things should go in the repository struct. 
Unfortunately, that is not the world we live in.


However, to make things clearer I'm following the pattern currently in 
master. You'll see the same with the global 'packed_graph' pointer, 
similar to 'packed_git'. I think these should be paired together until 
the repository absorbs them.


If other 'core_*' variables move to the repository, I'm happy to move 
core_graph.

If 'packed_git' moves to the repository, I'm happy to move 'packed_git'.

However, if there is significant interest in moving all new state to the 
repository, then I'll move these values there. Let's have that 
discussion here instead of spread around the rest of the patch.


Thanks,
-Stolee



Re: [PATCH 00/14] Serialized Commit Graph

2018-01-26 Thread Derrick Stolee

On 1/25/2018 6:06 PM, Ævar Arnfjörð Bjarmason wrote:

On Thu, Jan 25 2018, Derrick Stolee jotted:

Oops! This is my mistake. The correct command should be:

  git show-ref -s | git graph --write --update-head --stdin-commits

Without "--stdin-commits" the command will walk all packed objects
to look for commits and then build the graph. That's why it's taking
so long. That method takes several minutes on the Linux repo, but with
--stdin-commits it should take as long as "git log >/dev/null".

Thanks, it took around 15m to finish with the command I initially ran on
my test repo.

Then the `merge-base --is-ancestor` performance problem I was
complaining about in
https://public-inbox.org/git/87608bawoa@evledraar.gmail.com/ takes
around 1s with your series, 5s without it. Nice.


Thanks for testing this! May I ask how many commits are in your repo? 
One way to find out is to run 'git graph --read' and it will tell you 
how many commits are in the serialized graph.


Thanks,
-Stolee


Re: [PATCH 01/14] graph: add packed graph design document

2018-01-26 Thread Derrick Stolee

On 1/25/2018 3:04 PM, Stefan Beller wrote:

On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee  wrote:

Add Documentation/technical/packed-graph.txt with details of the planned
packed graph feature, including future plans.

Signed-off-by: Derrick Stolee 
---
  Documentation/technical/packed-graph.txt | 185 +++
  1 file changed, 185 insertions(+)
  create mode 100644 Documentation/technical/packed-graph.txt

diff --git a/Documentation/technical/packed-graph.txt 
b/Documentation/technical/packed-graph.txt
new file mode 100644
index 00..fcc0c83874
--- /dev/null
+++ b/Documentation/technical/packed-graph.txt
@@ -0,0 +1,185 @@
+Git Packed Graph Design Notes
+=
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows above 100K.

How did you come up with that specific number? (Is it platform dependent?)
I'd avoid a specific number to not derail the reader here into wondering
how this got measured.


Using a specific number was a mistake. Git can walk ~100K commits per 
second by parsing commits, in my tests on my machine. I'll instead say 
"commit count grows."



+The merge base calculation shows up in many user-facing commands, such
+as 'status' and 'fetch' and can take minutes to compute depending on
+data shape. There are two main costs here:

status needs the walk for the ahead/behind computation which is (1)?
(I forget how status would need to compute a merge-base)


'status' computes the ahead/behind counts using paint_down_to_common(). 
This is a more robust method than simply computing merge bases, but the 
possible merge bases are found as a result.



fetch is a networked command, which traditionally in Git is understood as
"can be slow" because you might be in Australia, or the connection is slow
otherwise. So giving this as an example it is not obvious that the DAG walking
is the bottleneck. Maybe git-merge or "git show --remerge-diff" [1] are better
examples for walk-intensive commands?

[1] https://public-inbox.org/git/cover.1409860234.git...@thomasrast.ch/
 never landed, so maybe that is a bad example. But for me that command
 is more obviously dependent on cheap walking the DAG compared to fetch.
 So, take my comments with a grain of salt!


Actually, a 'fetch' command does the same ahead/behind calculation as 
'status', and in GVFS repos we have seen that walk take 30s per branch 
when comparing local and remote copies a fast-moving branch. Yes, there 
are other (usually) more expensive things in 'fetch' so I'll drop that 
reference..



+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The packed graph is a file that stores the commit graph structure along
+with some extra metadata to speed up graph walks. This format allows a
+consumer to load the following info for a commit:
+
+1. The commit OID.
+2. The list of parents.
+3. The commit date.
+4. The root tree OID.
+5. An integer ID for fast lookups in the graph.
+6. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().


This new format is specifically removing the cost of decompression and parsing
(1) completely, whereas (2) we still have to walk the entire graph for now as
the generation numbers are not fully used as of yet, but provided.


A major goal of this work is to provide a place to store computed 
generation numbers so we can not walk the entire graph. I mention this 
here because 'git log -' is O(n) (due to commit-date heuristics that 
prevent walking the entire graph) while 'git log --topo-order -' is 
O(T) where T is the total number of reachable commits.



+By providing an integer ID we can avoid lookups in the graph as we walk
+commits. Specifically, we need to provide the integer ID of the parent
+commits so we navigate directly to their information on request.

Does this mean we decrease the pressure on fast lookups in
packfiles/loose objects?


Yes, we do. In fact, when profiling 'git log --topo-order -1000', I 
noticed that 30-50% of the time (after this patch) is spent in 
lookup_tree(). If we can prevent checking the ODB for the existence of 
these trees until they are needed, we can get additional speedups. It is 
a bit wasteful that we are loading these trees even when we will never 
use them (such as computing merge bases).





+Define the "generation number" of a commit recursively as follows:
+ * A commit with no parents (a root commit) has generation number 1.
+ * A commit with at least one parent has generation number 1 more than
+   the largest generation number among its parents.
+Equivalently, the generation number is one more than the length of a

Re: [PATCH 01/14] graph: add packed graph design document

2018-01-26 Thread Derrick Stolee

On 1/25/2018 4:14 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


Add Documentation/technical/packed-graph.txt with details of the planned
packed graph feature, including future plans.

Signed-off-by: Derrick Stolee 
---
  Documentation/technical/packed-graph.txt | 185 +++
  1 file changed, 185 insertions(+)
  create mode 100644 Documentation/technical/packed-graph.txt

I really wanted to like having this patch at the beginning, but
unfortunatelly I didn't see the actual file format description,
which was a bit disappointing.  An example of the things that I was
curious about was how the "integer ID" is used to access into the
file.  If we could somehow use "integer ID" as an index into an
array of fixed size elements, it would be ideal to gain "fast
lookups", but because of the "list of parents" thing, it needs some
trickery to do so, and that was among the things that I wanted to
see how much thought went into the design, for example.


There is definitely a chicken-or-the-egg situation here. I'm happy to 
start with the format before the design document.


I can try to expand this "integer ID" concept, but you can see how I use 
it in the following method from patch 11/14:


+int parse_packed_commit(struct commit *item)
+{
+    if (!core_graph)
+    return 0;
+    if (item->object.parsed)
+    return 1;
+
+    prepare_packed_graph();
+    if (packed_graph) {
+    uint32_t pos;
+    int found;
+    if (item->graphId != 0x) {
+    pos = item->graphId;
+    found = 1;
+    } else {
+    found = bsearch_graph(packed_graph, 
&(item->object.oid), &pos);

+    }
+
+    if (found)
+    return fill_packed_commit(item, packed_graph, pos);
+    }
+
+    return 0;
+}

Note that if item->graphId has a "real" value (not 0x which in 
hindsight should be a macro) then we navigate directly to that position 
in the graph. Otherwise, we use binary search to query the graph's 
commit list to find the position (if the commit is packed).



diff --git a/Documentation/technical/packed-graph.txt 
b/Documentation/technical/packed-graph.txt
new file mode 100644
index 00..fcc0c83874
--- /dev/null
+++ b/Documentation/technical/packed-graph.txt
@@ -0,0 +1,185 @@
+Git Packed Graph Design Notes
+=
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows above 100K.
+The merge base calculation shows up in many user-facing commands, such
+as 'status' and 'fetch' and can take minutes to compute depending on
+data shape. There are two main costs here:

s/data shape/history shape/ may make it even clearer.


+1. The commit OID.
+2. The list of parents.
+3. The commit date.
+4. The root tree OID.
+5. An integer ID for fast lookups in the graph.
+6. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+By providing an integer ID we can avoid lookups in the graph as we walk
+commits. Specifically, we need to provide the integer ID of the parent
+commits so we navigate directly to their information on request.

Commits created after a packed graph file is built may of course not
appear in a packed graph file, but that is OK because they never need
to be listed as parents of commits in the file.  So "list of parents"
can always refer to the parents using the "integer ID for fast lookup".


One thing I need to test locally is what happens with boundary commits 
of a shallow clone. The commit's parents are not in the repo, so they 
will not be in the graph. I think that parse_commit_buffer() drops the 
parents, so the graph will treat them as root commits.



Makes sense.  Item 2. might want to say "The list of parents, using
the fast lookup integer ID (see 5.) as reference instead of OID",
though.


That will be more specific, thanks.


+Define the "generation number" of a commit recursively as follows:
+ * A commit with no parents (a root commit) has generation number 1.
+ * A commit with at least one parent has generation number 1 more than
+   the largest generation number among its parents.
+Equivalently, the generation number is one more than the length of a
+longest path from the commit to a root commit.

When a commit A can reach roots X and Y, and Y is further than X,
the distance between Y and A becomes A's generation number.  "One
more than the length of the path from the commit to the furthest
root commit it can reach", in other words.


My "Equivalently,..." sentence is mang

Re: [PATCH 02/14] packed-graph: add core.graph setting

2018-01-26 Thread Derrick Stolee

On 1/25/2018 4:43 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


The packed graph feature is controlled by the new core.graph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.graph is that a user can always stop checking
for or parsing packed graph files if core.graph=0.

Signed-off-by: Derrick Stolee 
---
  Documentation/config.txt | 3 +++
  cache.h  | 1 +
  config.c | 5 +
  environment.c| 1 +
  4 files changed, 10 insertions(+)

Before you get too married to the name "graph", is it reasonable to
assume that the commit ancestry graph is the primary "graph" that
should come to users' minds when a simple word "graph" is used in
the context of discussing Git?  I suspect not.

Let's not just call this "core.graph" and "packed-graph", and in
addition give some adjective before "graph".


I was too focused that I wanted the word "graph" but "graph.c" already 
existed in source root that I came up with "packed-graph.c" just to have 
a separate filename. Clearly, "commit-graph" should be used instead. In 
v2, I'll use "/commit-graph.c" and "/builtin/commit-graph.c".


Thanks,
-Stolee


Re: [PATCH 03/14] packed-graph: create git-graph builtin

2018-01-26 Thread Derrick Stolee

On 1/25/2018 4:45 PM, Stefan Beller wrote:

On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee  wrote:

Teach Git the 'graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for a check that the core.graph setting is enabled
and a '--pack-dir' option.

I wonder if this builtin should not respect the boolean core graph,
as this new builtin commands' whole existence
is to deal with these new files?

As you assume this builtin as a plumbing command, I would
expect it to pay less attention to config rather than more.


My thought was to alert the caller "This graph isn't going to be good 
for anything!" and fail quickly before doing work. You do have a good 
point, and I think we can remove that condition here. When we integrate 
with other commands ('repack', 'fetch', 'clone') we will want a 
different setting that signals automatically writing the graph and we 
don't want those to fail because they are not aware of a second config 
setting.





@@ -408,6 +408,7 @@ static struct cmd_struct commands[] = {
 { "fsck-objects", cmd_fsck, RUN_SETUP },
 { "gc", cmd_gc, RUN_SETUP },
 { "get-tar-commit-id", cmd_get_tar_commit_id },
+   { "graph", cmd_graph, RUN_SETUP_GENTLY },

Why gently, though?

 From reading the docs (and assumptions on further patches)
we'd want to abort if there is no .git dir to be found?

Or is a future patch having manual logic? (e.g. if pack-dir is
given, the command may be invoked from outside a git dir)


You are right. I inherited this from my MIDX patch which can operate on 
a list of IDX files without a .git folder. The commit graph operations 
need an ODB.


Thanks,
-Stolee



Re: [PATCH 03/14] packed-graph: create git-graph builtin

2018-01-26 Thread Derrick Stolee

On 1/25/2018 6:01 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


Teach Git the 'graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for a check that the core.graph setting is enabled
and a '--pack-dir' option.

Just to set my expectation straight.

Is it fair to say that in the ideal endgame state, this will be like
"git pack-objects" in that end users won't have to know about it,
but would serve as a crucial building block that is invoked by other
front-end commands that are more familiar to end users (just like
pack-objects are used behind the scenes by repack, push, etc.)?


That is my hope. Leaving that integration for later, after this feature 
has proven itself.


Re: [PATCH 04/14] packed-graph: add format document

2018-01-26 Thread Derrick Stolee

On 1/25/2018 5:06 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


Add document specifying the binary format for packed graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

Signed-off-by: Derrick Stolee 
---
  Documentation/technical/graph-format.txt | 88 
  1 file changed, 88 insertions(+)
  create mode 100644 Documentation/technical/graph-format.txt

diff --git a/Documentation/technical/graph-format.txt 
b/Documentation/technical/graph-format.txt
new file mode 100644
index 00..a15e1036d7
--- /dev/null
+++ b/Documentation/technical/graph-format.txt
@@ -0,0 +1,88 @@
+Git commit graph format
+===

Good that this is not saying "graph format" but is explicit that it
is about "commit".  Do the same for the previous steps.  Especially,
builtin/graph.c that does not have much to do with graph.c is not a
good way forward ;-)


:+1:


I do like the fact that later parents of octopus merges are moved
out of way to make the majority of records fixed length, but I am
not sure if the "up to two parents are recorded in line" is truly
the best arrangement.  Aren't majority of commits single-parent,
thereby wasting 4 bytes almost always?

Will 32-bit stay to be enough for everybody?  Wouldn't it make sense
to at least define them to be indices into arrays (i.e. scaled to
element size), not "offsets", to recover a few lost bits?


I incorrectly used the word "offset" when I mean "array position" for 
the edge values.



What's the point of storing object id length?  If you do not
understand the object ID scheme, knowing only the length would not
do you much good anyway, no?  And if you know the hashing scheme
specified by Object ID version, you already know the length, no?


I'll go read the OID transition document to learn more, but I didn't 
know if there were plans for things like "Use SHA3 but with different 
hash lengths depending on user requirements". One side benefit is that 
we immediately know the width of our commit and tree references within 
the commit graph file without needing to consult a table of hash 
definitions.



On 1/25/2018 5:18 PM, Stefan Beller wrote:

git.git has ~37k non-merge commits and ~12k merge commits,
(35 of them have 3 or more parents).

So 75% would waste the 4 bytes of the second parent.

However the first parent is still there, so any operation that only needs
the first parent (git bisect --first-parent?) would still be fast.
Not sure if that is common.


The current API boundary does not support this, as parse_commit_gently() 
is not aware of the --first-parent option. The benefits of injecting 
that information are probably not worth the complication.


On 1/25/2018 5:29 PM, Junio C Hamano wrote:

Stefan Beller  writes:


The downside of just having one parent or pointer into the edge list
would be to penalize 25% of the commit lookups with an indirection
compared to ~0% (the 35 octopus'). I'd rather want to optimize for
speed than disk size? (4 bytes for 37k is 145kB for git.git, which I
find is not a lot).

My comment is not about disk size but is about the size of working
set (or "size of array element").
I do want to optimize for speed over space, at least for two-parent 
commits. Hopefully my clarification about offset/array position 
clarifies Junio's concerns here.


Thanks,
-Stolee



Re: [PATCH 04/14] packed-graph: add format document

2018-01-26 Thread Derrick Stolee

On 1/25/2018 5:07 PM, Stefan Beller wrote:

On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee  wrote:

Add document specifying the binary format for packed graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

Signed-off-by: Derrick Stolee 
---
  Documentation/technical/graph-format.txt | 88 

So this is different from Documentation/technical/packed-graph.txt,
which gives high level design and this gives the details on how
to set bits.


  1 file changed, 88 insertions(+)
  create mode 100644 Documentation/technical/graph-format.txt

diff --git a/Documentation/technical/graph-format.txt 
b/Documentation/technical/graph-format.txt
new file mode 100644
index 00..a15e1036d7
--- /dev/null
+++ b/Documentation/technical/graph-format.txt
@@ -0,0 +1,88 @@
+Git commit graph format
+===
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of its parents. We
+  reserve zero as special, and can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+   4-byte signature:
+   The signature is: {'C', 'G', 'P', 'H'}
+
+   1-byte version number:
+   Currently, the only valid version is 1.
+
+   1-byte Object Id Version (1 = SHA-1)
+
+   1-byte Object Id Length (H)

   This is 20 or 40 for sha1 ? (binary or text representation?)


20 for binary.




+   1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+   (C + 1) * 12 bytes listing the table of contents for the chunks:
+   First 4 bytes describe chunk id. Value 0 is a terminating label.
+   Other 8 bytes provide offset in current file for chunk to start.

... offset [in bytes/words/4k blocks?] in ...


bytes.




+   (Chunks are ordered contiguously in the file, so you can infer
+   the length using the next chunk position if necessary.)
+
+   The remaining data in the body is described one chunk at a time, and
+   these chunks may be given in any order. Chunks are required unless
+   otherwise specified.
+
+CHUNK DATA:
+
+   OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+   The ith entry, F[i], stores the number of OIDs with first
+   byte at most i. Thus F[255] stores the total
+   number of commits (N).

So F[0] > 0 for git.git for example.

Or another way: To lookup a 01xxx, I need to look at
entry(F[00] + 1 )...entry(F[01]).

Makes sense.


+
+   OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+   The OIDs for all commits in the graph.

... sorted ascending.



+   Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+   * The first H bytes are for the OID of the root tree.
+   * The next 8 bytes are for the int-ids of the first two parents of
+ the ith commit. Stores value 0x if no parent in that 
position.
+ If there are more than two parents, the second value has its most-
+ significant bit on and the other bits store an offset into the 
Large
+ Edge List chunk.

s/an offset into/position in/ ? (otherwise offset in bytes?)


+   * The next 8 bytes store the generation number of the commit and the
+ commit time in seconds since EPOCH. The generation number uses the
+ higher 30 bits of the first 4 bytes, while the commit time uses 
the
+ 32 bits of the second 4 bytes, along with the lowest 2 bits of the
+ lowest byte, storing the 33rd and 34th bit of the commit time.

This allows for a maximum generation number of
1.073.741.823 (2^30 -1) = 1 billion,
and a max time stamp of later than 2100.

Do you allow negative time stamps?



+
+   [Optional] Large Edge List (ID: 

<    9   10   11   12   13   14   15   16   17   18   >