Re: [PATCH v7 04/14] graph: add commit graph design document
Derrick Stolee writes: > From: Derrick Stolee > > Add Documentation/technical/commit-graph.txt with details of the planned > commit graph feature, including future plans. That's in my opinion a very good idea. It would help anyone trying to add to and extend this feature. > Signed-off-by: Derrick Stolee > --- > Documentation/technical/commit-graph.txt | 163 +++ > 1 file changed, 163 insertions(+) > create mode 100644 Documentation/technical/commit-graph.txt > > diff --git a/Documentation/technical/commit-graph.txt > b/Documentation/technical/commit-graph.txt > new file mode 100644 > index 00..0550c6d0dc > --- /dev/null > +++ b/Documentation/technical/commit-graph.txt > @@ -0,0 +1,163 @@ > +Git Commit Graph Design Notes > += > + > +Git walks the commit graph for many reasons, including: > + > +1. Listing and filtering commit history. > +2. Computing merge bases. > + > +These operations can become slow as the commit count grows. The merge > +base calculation shows up in many user-facing commands, such as 'merge-base' > +or 'status' and can take minutes to compute depending on history shape. > + > +There are two main costs here: > + > +1. Decompressing and parsing commits. > +2. Walking the entire graph to satisfy topological order constraints. > + > +The commit graph file is a supplemental data structure that accelerates > +commit graph walks. If a user downgrades or disables the 'core.commitGraph' > +config setting, then the existing ODB is sufficient. This is a good explanation of why we want to have this, and why in this form. I really like the "Related links" with summary. > The file is stored > +as "commit-graph" either in the .git/objects/info directory or in the info > +directory of an alternate. That is a good thing. Do I understand it correctly that Git would use first "commit-graph" file that it would encounter, or would it merge information from all alternates (including perhaps self)? What if repository is a subtree-merge of different repositories, with each listed separately as alternate? > + > +The commit graph file stores the commit graph structure along with some > +extra metadata to speed up graph walks. By listing commit OIDs in lexi- > +cographic order, we can identify an integer position for each commit and > +refer to the parents of a commit using those integer positions. We use > +binary search to find initial commits and then use the integer positions > +for fast lookups during the walk. It might be worth emphasizing here that fast access using integer positions for commits in the chunk is possible only if chunk used fixed-width format (each commit taking the same amount of space -- which for example is not true for packfile). > + > +A consumer may load the following info for a commit from the graph: > + > +1. The commit OID. > +2. The list of parents, along with their integer position. > +3. The commit date. > +4. The root tree OID. > +5. The generation number (see definition below). > + > +Values 1-4 satisfy the requirements of parse_commit_gently(). Good to have it here. It is nice to know why 1-4 are needed to be in the "commit-graph" structure. Do I understand it correctly that this feature is what makes porting Git to start using "commit-graph" information easy, because it is single point of entry, isn't it? > + > +Define the "generation number" of a commit recursively as follows: > + > + * A commit with no parents (a root commit) has generation number one. > + > + * A commit with at least one parent has generation number one more than > + the largest generation number among its parents. > + > +Equivalently, the generation number of a commit A is one more than the > +length of a longest path from A to a root commit. The recursive definition > +is easier to use for computation and observing the following property: > + > +If A and B are commits with generation numbers N and M, respectively, > +and N <= M, then A cannot reach B. That is, we know without searching > +that B is not an ancestor of A because it is further from a root commit > +than A. Because generation numbers from the "commit-graph" may not cover all commits (recent commits can have no generation number information), I think it would be good idea to state what happens then. Because of how generation numbers are defined, if commit A has generation number provided, then all ancestors also have generation number information. Thus: If A is a commit with generation number N, and B is a commit without generation number information, then A cannot reach B. That is, we know without searching that B is not an ancestor of A because if it was, it would have generation number information. Additionally (but that might be not worth adding here): If A is a commit without generation number information, and B is a commit with generation nu
[PATCH v7 04/14] graph: add commit graph design document
From: Derrick Stolee Add Documentation/technical/commit-graph.txt with details of the planned commit graph feature, including future plans. Signed-off-by: Derrick Stolee --- Documentation/technical/commit-graph.txt | 163 +++ 1 file changed, 163 insertions(+) create mode 100644 Documentation/technical/commit-graph.txt diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt new file mode 100644 index 00..0550c6d0dc --- /dev/null +++ b/Documentation/technical/commit-graph.txt @@ -0,0 +1,163 @@ +Git Commit Graph Design Notes += + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows. The merge +base calculation shows up in many user-facing commands, such as 'merge-base' +or 'status' and can take minutes to compute depending on history shape. + +There are two main costs here: + +1. Decompressing and parsing commits. +2. Walking the entire graph to satisfy topological order constraints. + +The commit graph file is a supplemental data structure that accelerates +commit graph walks. If a user downgrades or disables the 'core.commitGraph' +config setting, then the existing ODB is sufficient. The file is stored +as "commit-graph" either in the .git/objects/info directory or in the info +directory of an alternate. + +The commit graph file stores the commit graph structure along with some +extra metadata to speed up graph walks. By listing commit OIDs in lexi- +cographic order, we can identify an integer position for each commit and +refer to the parents of a commit using those integer positions. We use +binary search to find initial commits and then use the integer positions +for fast lookups during the walk. + +A consumer may load the following info for a commit from the graph: + +1. The commit OID. +2. The list of parents, along with their integer position. +3. The commit date. +4. The root tree OID. +5. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +Define the "generation number" of a commit recursively as follows: + + * A commit with no parents (a root commit) has generation number one. + + * A commit with at least one parent has generation number one more than + the largest generation number among its parents. + +Equivalently, the generation number of a commit A is one more than the +length of a longest path from A to a root commit. The recursive definition +is easier to use for computation and observing the following property: + +If A and B are commits with generation numbers N and M, respectively, +and N <= M, then A cannot reach B. That is, we know without searching +that B is not an ancestor of A because it is further from a root commit +than A. + +Conversely, when checking if A is an ancestor of B, then we only need +to walk commits until all commits on the walk boundary have generation +number at most N. If we walk commits using a priority queue seeded by +generation numbers, then we always expand the boundary commit with highest +generation number and can easily detect the stopping condition. + +This property can be used to significantly reduce the time it takes to +walk commits and determine topological relationships. Without generation +numbers, the general heuristic is the following: + +If A and B are commits with commit time X and Y, respectively, and +X < Y, then A _probably_ cannot reach B. + +This heuristic is currently used whenever the computation is allowed to +violate topological relationships due to clock skew (such as "git log" +with default order), but is not used when the topological order is +required (such as merge base calculations, "git log --graph"). + +In practice, we expect some commits to be created recently and not stored +in the commit graph. We can treat these commits as having "infinite" +generation number and walk until reaching commits with known generation +number. + +Design Details +-- + +- The commit graph file is stored in a file named 'commit-graph' in the + .git/objects/info directory. This could be stored in the info directory + of an alternate. + +- The core.commitGraph config setting must be on to consume graph files. + +- The file format includes parameters for the object ID hash function, + so a future change of hash algorithm does not require a change in format. + +Future Work +--- + +- The commit graph feature currently does not honor commit grafts. This can + be remedied by duplicating or refactoring the current graft logic. + +- The 'commit-graph' subcommand does not have a "verify" mode that is + necessary for integration with fsck. + +- The file format includes room for precomputed generation numbers. These + are not currently computed, so all generation numbers will be marked as + 0 (