Re: [PATCH v7 04/14] graph: add commit graph design document

2018-04-08 Thread Jakub Narebski
Derrick Stolee  writes:

> From: Derrick Stolee 
>
> Add Documentation/technical/commit-graph.txt with details of the planned
> commit graph feature, including future plans.

That's in my opinion a very good idea.  It would help anyone trying to
add to and extend this feature.

> Signed-off-by: Derrick Stolee 
> ---
>  Documentation/technical/commit-graph.txt | 163 +++
>  1 file changed, 163 insertions(+)
>  create mode 100644 Documentation/technical/commit-graph.txt
>
> diff --git a/Documentation/technical/commit-graph.txt 
> b/Documentation/technical/commit-graph.txt
> new file mode 100644
> index 00..0550c6d0dc
> --- /dev/null
> +++ b/Documentation/technical/commit-graph.txt
> @@ -0,0 +1,163 @@
> +Git Commit Graph Design Notes
> +=
> +
> +Git walks the commit graph for many reasons, including:
> +
> +1. Listing and filtering commit history.
> +2. Computing merge bases.
> +
> +These operations can become slow as the commit count grows. The merge
> +base calculation shows up in many user-facing commands, such as 'merge-base'
> +or 'status' and can take minutes to compute depending on history shape.
> +
> +There are two main costs here:
> +
> +1. Decompressing and parsing commits.
> +2. Walking the entire graph to satisfy topological order constraints.
> +
> +The commit graph file is a supplemental data structure that accelerates
> +commit graph walks. If a user downgrades or disables the 'core.commitGraph'
> +config setting, then the existing ODB is sufficient.

This is a good explanation of why we want to have this, and why in this
form.  I really like the "Related links" with summary.

>   The file is stored
> +as "commit-graph" either in the .git/objects/info directory or in the info
> +directory of an alternate.

That is a good thing.

Do I understand it correctly that Git would use first "commit-graph"
file that it would encounter, or would it merge information from all
alternates (including perhaps self)?  What if repository is a
subtree-merge of different repositories, with each listed separately as
alternate?

> +
> +The commit graph file stores the commit graph structure along with some
> +extra metadata to speed up graph walks. By listing commit OIDs in lexi-
> +cographic order, we can identify an integer position for each commit and
> +refer to the parents of a commit using those integer positions. We use
> +binary search to find initial commits and then use the integer positions
> +for fast lookups during the walk.

It might be worth emphasizing here that fast access using integer
positions for commits in the chunk is possible only if chunk used
fixed-width format (each commit taking the same amount of space -- which
for example is not true for packfile).

> +
> +A consumer may load the following info for a commit from the graph:
> +
> +1. The commit OID.
> +2. The list of parents, along with their integer position.
> +3. The commit date.
> +4. The root tree OID.
> +5. The generation number (see definition below).
> +
> +Values 1-4 satisfy the requirements of parse_commit_gently().

Good to have it here.  It is nice to know why 1-4 are needed to be in
the "commit-graph" structure.

Do I understand it correctly that this feature is what makes porting Git
to start using "commit-graph" information easy, because it is single
point of entry, isn't it?

> +
> +Define the "generation number" of a commit recursively as follows:
> +
> + * A commit with no parents (a root commit) has generation number one.
> +
> + * A commit with at least one parent has generation number one more than
> +   the largest generation number among its parents.
> +
> +Equivalently, the generation number of a commit A is one more than the
> +length of a longest path from A to a root commit. The recursive definition
> +is easier to use for computation and observing the following property:
> +
> +If A and B are commits with generation numbers N and M, respectively,
> +and N <= M, then A cannot reach B. That is, we know without searching
> +that B is not an ancestor of A because it is further from a root commit
> +than A.

Because generation numbers from the "commit-graph" may not cover all
commits (recent commits can have no generation number information), I
think it would be good idea to state what happens then.

Because of how generation numbers are defined, if commit A has
generation number provided, then all ancestors also have generation
number information.  Thus:

   If A is a commit with generation number N, and B is a commit
   without generation number information, then A cannot reach B.
   That is, we know without searching that B is not an ancestor of A
   because if it was, it would have generation number information.

Additionally (but that might be not worth adding here):

   If A is a commit without generation number information, and B is
   a commit with generation nu

[PATCH v7 04/14] graph: add commit graph design document

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee 

Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/commit-graph.txt | 163 +++
 1 file changed, 163 insertions(+)
 create mode 100644 Documentation/technical/commit-graph.txt

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
new file mode 100644
index 00..0550c6d0dc
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,163 @@
+Git Commit Graph Design Notes
+=
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status' and can take minutes to compute depending on history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to satisfy topological order constraints.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitGraph'
+config setting, then the existing ODB is sufficient. The file is stored
+as "commit-graph" either in the .git/objects/info directory or in the info
+directory of an alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in lexi-
+cographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+   the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and observing the following property:
+
+If A and B are commits with generation numbers N and M, respectively,
+and N <= M, then A cannot reach B. That is, we know without searching
+that B is not an ancestor of A because it is further from a root commit
+than A.
+
+Conversely, when checking if A is an ancestor of B, then we only need
+to walk commits until all commits on the walk boundary have generation
+number at most N. If we walk commits using a priority queue seeded by
+generation numbers, then we always expand the boundary commit with highest
+generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+If A and B are commits with commit time X and Y, respectively, and
+X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation is allowed to
+violate topological relationships due to clock skew (such as "git log"
+with default order), but is not used when the topological order is
+required (such as merge base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--
+
+- The commit graph file is stored in a file named 'commit-graph' in the
+  .git/objects/info directory. This could be stored in the info directory
+  of an alternate.
+
+- The core.commitGraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object ID hash function,
+  so a future change of hash algorithm does not require a change in format.
+
+Future Work
+---
+
+- The commit graph feature currently does not honor commit grafts. This can
+  be remedied by duplicating or refactoring the current graft logic.
+
+- The 'commit-graph' subcommand does not have a "verify" mode that is
+  necessary for integration with fsck.
+
+- The file format includes room for precomputed generation numbers. These
+  are not currently computed, so all generation numbers will be marked as
+  0 (