Re: [PATCH v7 21/22] gc: automatically write commit-graph files

2018-08-03 Thread SZEDER Gábor


Hi Derrick,

> The commit-graph file is a very helpful feature for speeding up git
> operations. In order to make it more useful, make it possible to
> write the commit-graph file during standard garbage collection
> operations.
> 
> Add a 'gc.commitGraph' config setting that triggers writing a
> commit-graph file after any non-trivial 'git gc' command. Defaults to
> false while the commit-graph feature matures. We specifically do not
> want to have this on by default until the commit-graph feature is fully
> integrated with history-modifying features like shallow clones.
> 
> Helped-by: Ævar Arnfjörð Bjarmason 
> Signed-off-by: Derrick Stolee 
> ---
>  Documentation/config.txt |  9 ++---
>  Documentation/git-gc.txt |  4 
>  builtin/gc.c |  6 ++
>  t/t5318-commit-graph.sh  | 14 ++
>  4 files changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 1cc18a828c..978deecfee 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -904,9 +904,12 @@ core.notesRef::
>  This setting defaults to "refs/notes/commits", and it can be overridden by
>  the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
>  
> -core.commitGraph::
> - Enable git commit graph feature. Allows reading from the
> - commit-graph file.
> +gc.commitGraph::
> + If true, then gc will rewrite the commit-graph file when
> + linkgit:git-gc[1] is run. When using linkgit:git-gc[1]
> + '--auto' the commit-graph will be updated if housekeeping is
> + required. Default is false. See linkgit:git-commit-graph[1]
> + for details.

Sorry, I've lost track of this patch series, unfortunately, but after
it hit master yesterday a merge conflict with one of my (unrelated)
WIP patches drew my attention to this hunk above, which got me
confused:

 - Why remove the description of 'core.commitGraph'?  'git grep -i
   core.commitGraph' shows that it's still used supported.

 - Why insert the description of 'gc.commitGraph' here instead of next
   to the other 'gc.*' config variables, breaking lexicographical
   order of confvars?

 - This hunk, the other documentation hunk right below, and indeed the
   commit message all speak about 'gc.commitGraph', but the new code
   and test added in this patch use 'gc.writeCommitGraph' (following
   my suggestion from a couple of weeks ago, I think?).

What's going on?

>  
>  core.sparseCheckout::
>   Enable "sparse checkout" feature. See section "Sparse checkout" in
> diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
> index 24b2dd44fe..f5bc98ccb3 100644
> --- a/Documentation/git-gc.txt
> +++ b/Documentation/git-gc.txt
> @@ -136,6 +136,10 @@ The optional configuration variable `gc.packRefs` 
> determines if
>  it within all non-bare repos or it can be set to a boolean value.
>  This defaults to true.
>  
> +The optional configuration variable `gc.commitGraph` determines if
> +'git gc' should run 'git commit-graph write'. This can be set to a
> +boolean value. This defaults to false.
> +
>  The optional configuration variable `gc.aggressiveWindow` controls how
>  much time is spent optimizing the delta compression of the objects in
>  the repository when the --aggressive option is specified.  The larger
> diff --git a/builtin/gc.c b/builtin/gc.c
> index ccfb1ceaeb..b7dd206f41 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -20,6 +20,7 @@
>  #include "sigchain.h"
>  #include "argv-array.h"
>  #include "commit.h"
> +#include "commit-graph.h"
>  #include "packfile.h"
>  #include "object-store.h"
>  #include "pack.h"
> @@ -40,6 +41,7 @@ static int aggressive_depth = 50;
>  static int aggressive_window = 250;
>  static int gc_auto_threshold = 6700;
>  static int gc_auto_pack_limit = 50;
> +static int gc_write_commit_graph = 0;
>  static int detach_auto = 1;
>  static timestamp_t gc_log_expire_time;
>  static const char *gc_log_expire = "1.day.ago";
> @@ -129,6 +131,7 @@ static void gc_config(void)
>   git_config_get_int("gc.aggressivedepth", &aggressive_depth);
>   git_config_get_int("gc.auto", &gc_auto_threshold);
>   git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
> + git_config_get_bool("gc.writecommitgraph", &gc_write_commit_graph);
>   git_config_get_bool("gc.autodetach", &detach_auto);
>   git_config_get_expiry("gc.pruneexpire", &prune_expire);
>   git_config_get_expiry("gc.worktreepruneexpire", 
> &prune_worktrees_expire);
> @@ -641,6 +644,9 @@ int cmd_gc(int argc, const char **argv, const char 
> *prefix)
>   if (pack_garbage.nr > 0)
>   clean_pack_garbage();
>  
> + if (gc_write_commit_graph)
> + write_commit_graph_reachable(get_object_directory(), 0);
> +
>   if (auto_gc && too_many_loose_objects())
>   warning(_("There are too many unreachable loose objects; "
>   "run 'git prune' to remove them."));
> diff --git a/t/t5318-comm

Re: [PATCH v7 21/22] gc: automatically write commit-graph files

2018-06-27 Thread Derrick Stolee

On 6/27/2018 2:09 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


@@ -40,6 +41,7 @@ static int aggressive_depth = 50;
  static int aggressive_window = 250;
  static int gc_auto_threshold = 6700;
  static int gc_auto_pack_limit = 50;
+static int gc_write_commit_graph = 0;

Please avoid unnecessary (and undesirable) explicit initialization
to 0.  Instead, let BSS to handle it by leaving " = 0" out.


+test_expect_success 'check that gc computes commit-graph' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git commit --allow-empty -m "blank" &&
+   git commit-graph write --reachable &&
+   cp $objdir/info/commit-graph commit-graph-before-gc &&
+   git reset --hard HEAD~1 &&
+   git config gc.writeCommitGraph true &&
+   git gc &&
+   cp $objdir/info/commit-graph commit-graph-after-gc &&
+   ! test_cmp commit-graph-before-gc commit-graph-after-gc &&

The set of commits in the commit graph will chanbe by discarding the
(old) tip commit, so the fact that the contents of the file changed
across gc proves that "commit-graph write" was triggered during gc.

Come to think of it, do we promise to end-users (in docs etc.) that
commit-graph covers *ONLY* commits reachable from refs and HEAD?  I
am wondering what should happen if "git gc" here does not prune the
reflog for HEAD---wouldn't we want to reuse the properties of the
commit we are discarding when it comes back (e.g. you push, then
reset back, and the next pull makes it reappear in your repository)?


Today I learned that 'gc' keeps some of the reflog around. That makes 
sense, but I wouldn't optimize the commit-graph file for this scenario.



I guess what I am really questioning is if it is sensible to define
"--reachable" as "starting at all refs", unlike the usual connectivity
rules "gc" uses, especially when this is run from inside "gc".


It is sensible to me, especially because we only lose performance if we 
visit those other commits that are still in the object database. By 
writing the commit-graph on 'gc' and not during 'fetch', we are already 
assuming the commit-graph will usually be behind the set of commits that 
the user cares about, by design.


An alternate view on the decision will need help answering from others 
who know more than me: In fetch negotiation, does the client report 
commits in the reflog as 'have's or do they get re-downloaded on a 
resulting 'git pull'?





+   git commit-graph write --reachable &&
+   test_cmp commit-graph-after-gc $objdir/info/commit-graph

This says that running "commit-graph write" twice without changing
the topology MUST yield byte-for-byte identical commit-graph file.

Is that a reasonable requirement on the future implementation?  I am
wondering if there will arise a situation where you need to store
records in "some" fixed order but two records compare "the same" and
tie-breaking them to give stable sort is expensive, or something
like that, which would benefit if you leave an escape hatch to allow
two logically identical graphs expressed bitwise differently.


Since the file format allows flexibility in the order of the chunks, it 
is possible to have bitwise-different files that represent the same set 
of data. However, I would not want git to provide inconsistent output 
given the same set of commits covered by the file.


Thanks,
-Stolee


Re: [PATCH v7 21/22] gc: automatically write commit-graph files

2018-06-27 Thread Junio C Hamano
Derrick Stolee  writes:

> @@ -40,6 +41,7 @@ static int aggressive_depth = 50;
>  static int aggressive_window = 250;
>  static int gc_auto_threshold = 6700;
>  static int gc_auto_pack_limit = 50;
> +static int gc_write_commit_graph = 0;

Please avoid unnecessary (and undesirable) explicit initialization
to 0.  Instead, let BSS to handle it by leaving " = 0" out.

> +test_expect_success 'check that gc computes commit-graph' '
> + cd "$TRASH_DIRECTORY/full" &&
> + git commit --allow-empty -m "blank" &&
> + git commit-graph write --reachable &&
> + cp $objdir/info/commit-graph commit-graph-before-gc &&
> + git reset --hard HEAD~1 &&
> + git config gc.writeCommitGraph true &&
> + git gc &&
> + cp $objdir/info/commit-graph commit-graph-after-gc &&
> + ! test_cmp commit-graph-before-gc commit-graph-after-gc &&

The set of commits in the commit graph will chanbe by discarding the
(old) tip commit, so the fact that the contents of the file changed
across gc proves that "commit-graph write" was triggered during gc.

Come to think of it, do we promise to end-users (in docs etc.) that
commit-graph covers *ONLY* commits reachable from refs and HEAD?  I
am wondering what should happen if "git gc" here does not prune the
reflog for HEAD---wouldn't we want to reuse the properties of the
commit we are discarding when it comes back (e.g. you push, then
reset back, and the next pull makes it reappear in your repository)?

I guess what I am really questioning is if it is sensible to define
"--reachable" as "starting at all refs", unlike the usual connectivity
rules "gc" uses, especially when this is run from inside "gc".

> + git commit-graph write --reachable &&
> + test_cmp commit-graph-after-gc $objdir/info/commit-graph

This says that running "commit-graph write" twice without changing
the topology MUST yield byte-for-byte identical commit-graph file.

Is that a reasonable requirement on the future implementation?  I am
wondering if there will arise a situation where you need to store
records in "some" fixed order but two records compare "the same" and
tie-breaking them to give stable sort is expensive, or something
like that, which would benefit if you leave an escape hatch to allow
two logically identical graphs expressed bitwise differently.

> +'
> +
>  # the verify tests below expect the commit-graph to contain
>  # exactly the commits reachable from the commits/8 branch.
>  # If the file changes the set of commits in the list, then the


[PATCH v7 21/22] gc: automatically write commit-graph files

2018-06-27 Thread Derrick Stolee
The commit-graph file is a very helpful feature for speeding up git
operations. In order to make it more useful, make it possible to
write the commit-graph file during standard garbage collection
operations.

Add a 'gc.commitGraph' config setting that triggers writing a
commit-graph file after any non-trivial 'git gc' command. Defaults to
false while the commit-graph feature matures. We specifically do not
want to have this on by default until the commit-graph feature is fully
integrated with history-modifying features like shallow clones.

Helped-by: Ævar Arnfjörð Bjarmason 
Signed-off-by: Derrick Stolee 
---
 Documentation/config.txt |  9 ++---
 Documentation/git-gc.txt |  4 
 builtin/gc.c |  6 ++
 t/t5318-commit-graph.sh  | 14 ++
 4 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1cc18a828c..978deecfee 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -904,9 +904,12 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
-core.commitGraph::
-   Enable git commit graph feature. Allows reading from the
-   commit-graph file.
+gc.commitGraph::
+   If true, then gc will rewrite the commit-graph file when
+   linkgit:git-gc[1] is run. When using linkgit:git-gc[1]
+   '--auto' the commit-graph will be updated if housekeeping is
+   required. Default is false. See linkgit:git-commit-graph[1]
+   for details.
 
 core.sparseCheckout::
Enable "sparse checkout" feature. See section "Sparse checkout" in
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 24b2dd44fe..f5bc98ccb3 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -136,6 +136,10 @@ The optional configuration variable `gc.packRefs` 
determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable `gc.commitGraph` determines if
+'git gc' should run 'git commit-graph write'. This can be set to a
+boolean value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified.  The larger
diff --git a/builtin/gc.c b/builtin/gc.c
index ccfb1ceaeb..b7dd206f41 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -20,6 +20,7 @@
 #include "sigchain.h"
 #include "argv-array.h"
 #include "commit.h"
+#include "commit-graph.h"
 #include "packfile.h"
 #include "object-store.h"
 #include "pack.h"
@@ -40,6 +41,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_write_commit_graph = 0;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -129,6 +131,7 @@ static void gc_config(void)
git_config_get_int("gc.aggressivedepth", &aggressive_depth);
git_config_get_int("gc.auto", &gc_auto_threshold);
git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
+   git_config_get_bool("gc.writecommitgraph", &gc_write_commit_graph);
git_config_get_bool("gc.autodetach", &detach_auto);
git_config_get_expiry("gc.pruneexpire", &prune_expire);
git_config_get_expiry("gc.worktreepruneexpire", 
&prune_worktrees_expire);
@@ -641,6 +644,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
if (pack_garbage.nr > 0)
clean_pack_garbage();
 
+   if (gc_write_commit_graph)
+   write_commit_graph_reachable(get_object_directory(), 0);
+
if (auto_gc && too_many_loose_objects())
warning(_("There are too many unreachable loose objects; "
"run 'git prune' to remove them."));
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2208c266b5..5947de3d24 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -245,6 +245,20 @@ test_expect_success 'perform fast-forward merge in full 
repo' '
test_cmp expect output
 '
 
+test_expect_success 'check that gc computes commit-graph' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git commit --allow-empty -m "blank" &&
+   git commit-graph write --reachable &&
+   cp $objdir/info/commit-graph commit-graph-before-gc &&
+   git reset --hard HEAD~1 &&
+   git config gc.writeCommitGraph true &&
+   git gc &&
+   cp $objdir/info/commit-graph commit-graph-after-gc &&
+   ! test_cmp commit-graph-before-gc commit-graph-after-gc &&
+   git commit-graph write --reachable &&
+   test_cmp commit-graph-after-gc $objdir/info/commit-graph
+'
+
 # the verify tests below expect the commit-graph to contain
 # exactly the commits reachable