Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-10 Thread Ævar Arnfjörð Bjarmason


On Wed, Oct 10 2018, SZEDER Gábor wrote:

> On Thu, Oct 04, 2018 at 11:09:58PM -0700, Junio C Hamano wrote:
>> SZEDER Gábor  writes:
>>
>> >> git-gc - Cleanup unnecessary files and optimize the local repository
>> >>
>> >> Creating these indexes like the commit-graph falls under "optimize the
>> >> local repository",
>> >
>> > But it doesn't fall under "cleanup unnecessary files", which the
>> > commit-graph file is, since, strictly speaking, it's purely
>> > optimization.
>>
>> I won't be actively engaged in this discussion soon, but I must say
>> that "git gc" doing "garbage collection" is merely an implementation
>> detail of optimizing the repository for further use.  And from that
>> point of view, what needs to be updated is the synopsis of the
>> git-gc doc.  It states "X and Y" above, but it actually is "Y by
>> doing X and other things".
>
> Well, then perhaps the name of the command should be updated, too, to
> better reflect what it actually does...

I don't disagree, but between "git gc" being a longstanding thing
called not just by us, but by third parties expecting it to be a
general swiss army knife for "make repo better" (so for a new tool
they'd need to update their code), and general name bikeshedding, I
think it's best if we just proceed for the sake of argument with the
assumption that none of us find the name confusing in this context.

At least that's the discussion I'm interested in. I.e. whether it makes
conceptual / structural sense for this stuff to live in the same place /
function in the code. We can always argue about the name of the function
as a separate topic.

>> I understand your "by definition there is no garbage immediately
>> after clone" position, and also I would understand if you find it
>> (perhaps philosophically) disturbing that "git clone" may give users
>> a suboptimal repository that immediately needs optimizing [*1*].
>>
>> But that bridge was crossed long time ago ever since pack transfer
>> was invented.  The data source sends only the pack data stream, and
>> the receiving end is responsible for spending cycles to build .idx
>> file.  Theoretically, .pack should be all that is needed---you
>> should be able to locate any necessary object by parsing the .pack
>> file every time you open it, and .idx is mere optimization.  You can
>> think of the .midx and graph files the same way.
>
> I don't think this is a valid comparison, because, practically, Git
> just didn't work after I deleted all pack index files.  So while they
> can be easily (re)generated, they are essential to make pack files
> usable.  The commit-graph and .midx files, however, can be safely
> deleted, and everything keeps working as before.

For "things that would run in 20ms now run in 30 seconds" (actual
numbers on a repo I have) values of "keeps working".

So I think this line gets blurred somewhat. In practice, if you're
expecting the graph to be there to run the sort of commands that most
benefit from it, it's essential that it be generated, not some nice
optional extra.

> OTOH, this is an excellent comparison, and I do think of the .midx and
> graph files the same way as the pack index files.  During a clone, the
> pack index file isn't generated by running a separate 'git gc
> (--auto)', but by clone (or fetch-pack?) running 'git index-pack'.
> The way I see it that should be the case for these other files as well.
>
> And it is much simpler, shorter, and cleaner to either run 'git
> commit-graph ...' or even to call write_commit_graph_reachable()
> directly from cmd_clone(), than to bolt on another option and config
> variable on 'git gc' [1] to coax it into some kind of an "after clone"
> mode, that it shouldn't be doing in the first place.  At least for
> now, so when we eventually get as far ...
>
>> I would not be surprised by a future in which the initial index-pack
>> that is responsible for receiving the incoming pack stream and
>> storing that in .pack file(s) while creating corresponding .idx
>> file(s) becomes also responsible for building .midx and graph files
>> in the same pass, or at least smaller number of passes.  Once we
>> gain experience and confidence with these new auxiliary files, that
>> ought to happen naturally.  And at that point, we won't be having
>> this discussion---we'd all happily run index-pack to receive the
>> pack data, because that is pretty much the fundamental requirement
>> to make use of the data.
>
> ... as what you wrote here becoming a reality (and I fully agree that
> this is what we should ultimately aim for), we won't have that
> option and config variable still lying around and requiring
> maintenance because of backwards compatibility.

We'll still have the use-case of wanting to turn on
gc.writeCommitGraph=true or equivalent and wanting previously-cloned
repositories to "catch up" and get a commit graph ASAP (but not do a
full repack).
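
Concretely, something like this on an existing clone (a sketch of the
intended workflow, assuming the config knob and unified codepath from
my patch):

    git config gc.writeCommitGraph true
    git gc --auto  # notices the missing commit-graph and writes one,
                   # without otherwise doing a full repack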

This is why my patch tries to unify those two codepaths, i.e. so I can
turn this on 

Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-10 Thread SZEDER Gábor
On Thu, Oct 04, 2018 at 11:09:58PM -0700, Junio C Hamano wrote:
> SZEDER Gábor  writes:
> 
> >> git-gc - Cleanup unnecessary files and optimize the local repository
> >> 
> >> Creating these indexes like the commit-graph falls under "optimize the
> >> local repository",
> >
> > But it doesn't fall under "cleanup unnecessary files", which the
> > commit-graph file is, since, strictly speaking, it's purely
> > optimization.
> 
> I won't be actively engaged in this discussion soon, but I must say
> that "git gc" doing "garbage collection" is merely an implementation
> detail of optimizing the repository for further use.  And from that
> point of view, what needs to be updated is the synopsis of the
> git-gc doc.  It states "X and Y" above, but it actually is "Y by
> doing X and other things".

Well, then perhaps the name of the command should be updated, too, to
better reflect what it actually does...

> I understand your "by definition there is no garbage immediately
> after clone" position, and also I would understand if you find it
> (perhaps philosophically) disturbing that "git clone" may give users
> a suboptimal repository that immediately needs optimizing [*1*].
> 
> But that bridge was crossed long time ago ever since pack transfer
> was invented.  The data source sends only the pack data stream, and
> the receiving end is responsible for spending cycles to build .idx
> file.  Theoretically, .pack should be all that is needed---you
> should be able to locate any necessary object by parsing the .pack
> file every time you open it, and .idx is mere optimization.  You can
> think of the .midx and graph files the same way.

I don't think this is a valid comparison, because, practically, Git
just didn't work after I deleted all pack index files.  So while they
can be easily (re)generated, they are essential to make pack files
usable.  The commit-graph and .midx files, however, can be safely
deleted, and everything keeps working as before.

OTOH, this is an excellent comparison, and I do think of the .midx and
graph files the same way as the pack index files.  During a clone, the
pack index file isn't generated by running a separate 'git gc
(--auto)', but by clone (or fetch-pack?) running 'git index-pack'.
The way I see it that should be the case for these other files as well.
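>
> (For illustration, a deleted index can be regenerated the same way by
> hand:
>
>   git index-pack .git/objects/pack/pack-<hash>.pack
>
> which recreates the corresponding .idx next to the pack.)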

And it is much simpler, shorter, and cleaner to either run 'git
commit-graph ...' or even to call write_commit_graph_reachable()
directly from cmd_clone(), than to bolt on another option and config
variable on 'git gc' [1] to coax it into some kind of an "after clone"
mode, that it shouldn't be doing in the first place.  At least for
now, so when we eventually get as far ...

> I would not be surprised by a future in which the initial index-pack
> that is responsible for receiving the incoming pack stream and
> storing that in .pack file(s) while creating corresponding .idx
> file(s) becomes also responsible for building .midx and graph files
> in the same pass, or at least smaller number of passes.  Once we
> gain experience and confidence with these new auxiliary files, that
> ought to happen naturally.  And at that point, we won't be having
> this discussion---we'd all happily run index-pack to receive the
> pack data, because that is pretty much the fundamental requirement
> to make use of the data.

... as what you wrote here becoming a reality (and I fully agree that
this is what we should ultimately aim for), we won't have that
option and config variable still lying around and requiring
maintenance because of backwards compatibility.

1 - https://public-inbox.org/git/87in2hgzin@evledraar.gmail.com/

> [Footnote]
> 
> *1* Even without considering these recent invention of auxiliary
> files, cloning from a sloppily packed server whose primary focus
> is to avoid spending cycles by not computing better deltas will
> give the cloner a suboptimal repository.  If we truly want to
> have an optimized repository ready to be used after cloning, we
> should run an equivalent of "repack -a -d -f" immediately after
> "git clone".

I noticed a few times that I got surprisingly large packs from GitHub,
e.g. there is over 70% size difference between --single-branch cloning
v2.19.0 from GitHub and from my local clone or from kernel.org (~95MB
vs. ~55MB vs ~52MB).  After running 'git repack -a -d -f' they all end
up at ~65MB, which is a nice size reduction for the clone from GitHub,
but the others just gained 10-13 more MBs.



Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-09 Thread SZEDER Gábor
On Mon, Oct 08, 2018 at 11:08:03PM -0400, Jeff King wrote:
> I'd have done it as one fixed-size filter per commit. Then you should be
> able to hash the path keys once, and apply the result as a bitwise query
> to each individual commit (I'm assuming that it's constant-time to
> access the filter for each, as an index into an mmap'd array, with the
> offset coming from a commit-graph entry we'd be able to look up anyway).

I used one big Bloom filter for the whole history, because that was
the simplest way to get going, and because I was primarily interested
in the potential benefits instead of the cost of generating and
maintaining it.

Using an 8MB filter for git.git results in a false positive rate
between 0.21% and 0.53%.  Splitting that up among ~53k commits we get
~160 bytes for each.  At first sight that seems rather small, but
gathering a bit of statistics shows that 99% of our commits don't
change more than 10 files.
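
(Spelling out that arithmetic: 8 * 2^20 bytes / ~53k commits ≈ 158
bytes, i.e. ~1260 bits per commit.)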

One advantage of the "one Bloom filter for each commit" is that if a
commit doesn't have a corresponding Bloom filter, then, well, we can't
query the non-existing filter.  OTOH, with one big Bloom filter we
have to be careful to only ever query it with commits whose changes
have already been added, otherwise we can get false negatives.

> I think it would also be easier to deal with maintenance, since each
> filter is independent (IIRC, you cannot delete from a bloom filter
> without re-adding all of the other keys).

Accumulating entries related to unreachable commits will eventually
increase the false positive rate, but otherwise it won't cause false
negatives, and won't increase the size of the Bloom filter or the time
necessary to query it.  So not deleting those entries right away is
not an issue, and I think it could be postponed until bigger gc runs.

[...]

> But there's also a related question: how do we match pathspec patterns?
> For a changed path like "foo/bar/baz", I imagine a bloom filter would
> mark all of "foo", "foo/bar", and "foo/bar/baz".

Indeed, that's what I did.

> But what about "*.c"? I
> don't think a bloom filter can answer that.

Surely not, but it could easily return "maybe", and thus simply fall
back to looking at the diff.

However, I've looked through the output of

  grep '^git log[^|]*[\[*?]' ~/.bash_history

and haven't found a single case where I used Git's globbing.  When I
did use globbing, I always used the shell's.  Yeah, it's just one data
point, and others surely use it differently, etc...  but I think we
should consider whether it's common enough to worry about and to
increase complexity for.

[...]

> So let's imagine we'd store such a cache external to the regular object
> data (i.e., as a commit-graph entry). The "log --raw" diff of linux.git
> has 1.7M entries. The paths should easily compress to a single 32-bit
> integer (e.g., as an index into a big path list). The oids are 20 bytes.
> Add a few bytes for modes. That's about 80MB. Big, but not impossibly
> so. Maybe pushing it for true gigantic repos, though.

In my experiments with the Linux repo a 256MB Bloom filter has ~0.3%
false positive rate, while a 128MB filter had 3-4%.  That's even
bigger, though compared to the size of a checkout of the full kernel
tree, arguably not that much.

> Those numbers are ignoring merges, too. The meaning of "did this commit
> touch that path" is a lot trickier for a merge commit, and I think may
> depend on context. I'm not sure how even a bloom filter solution would
> handle that (I was assuming we'd mostly punt and let merges fall back to
> opening up the trees).

During revision walking rev_compare_tree() checks whether the given
paths changed between a commit and _one_ of its parents, and in case
of merge commits it's invoked multiple times with the same commit but
with different parent parameters.  By storing (changed-path,
parent-oid, commit-oid) tuples in the Bloom filter it can deal with
merges, too.
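
Composing such a key could be as simple as this (untested sketch;
strbuf and oid_to_hex() are existing APIs, bloom_add() stands in for
the PoC filter's insert function):

#include "cache.h"

/* hypothetical insert function of the one big PoC filter */
void bloom_add(const void *data, size_t len);

static void record_change(const char *path,
                          const struct object_id *parent,
                          const struct object_id *commit)
{
        struct strbuf key = STRBUF_INIT;

        /* one key per (changed-path, parent-oid, commit-oid) tuple */
        strbuf_addf(&key, "%s:%s:%s", oid_to_hex(commit),
                    oid_to_hex(parent), path);
        bloom_add(key.buf, key.len);
        strbuf_release(&key);
}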



Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)

2018-10-09 Thread Jeff King
On Tue, Oct 09, 2018 at 03:03:08PM -0400, Derrick Stolee wrote:

> > I wonder if Roaring does better here.
> 
> In these sparse cases, usually Roaring will organize the data as "array
> chunks" which are simply lists of the values. The thing that makes this
> still compressible is that we store two bytes per entry, as the entries are
> grouped by a common most-significant two bytes. Since you say ~120k unique
> paths, the Roaring bitmap would have two or three chunks per bitmap (and
> those chunks could be empty). The overhead to store the chunk positions,
> types, and lengths does come at a cost, but it's more like 32 bytes _per
> commit_.

Hmph. It really sounds like we could do better with a custom RLE
solution. But that makes me feel like I'm missing something, because
surely I can't invent something better than the state of the art in a
simple thought experiment, right?

I know what I'm proposing would be quite bad for random access, but my
impression is that EWAH is the same. For the scale of bitmaps we're
talking about, I think linear/streaming access through the bitmap would
be OK.

> > So at any rate, I do think it would not be out of the question to store
> > bitmaps like this. I'm much more worried about the maintenance cost of
> > adding new entries incrementally. I think it's only feasible if we give
> > up sorting, and then I wonder what other problems that might cause.
> The patch below gives me a starting point to try the Bloom filter approach
> and see what the numbers are like. You did all the "git" stuff like
> computing the changed paths, so thanks!

Great, I hope it can be useful. I almost wrote it as perl consuming the
output of "log --format=%h --name-only", but realized I didn't have a
perl ewah implementation handy.

You'll probably want to tweak this part:

> > +   prepare_revision_walk(&revs);
> > +   while ((commit = get_revision(&revs))) {
> > +   data.commit = commit;
> > +   diff_tree_combined_merge(commit, 0, &revs);
> > +   }

...to handle merges in a particular way. This will actually ignore
merges totally. You could add "-m" to the revision arguments to get a
per-parent diff, but of course you'd see those in your callback
individually. If you want to do _just_ the first parent diff, I think
you'll have to pick it apart manually, like:

  while ((commit = get_revision(&revs))) {
        const struct object_id *parent_oid;

        /* ignore non-first parents, but handle root commits like --root */
        if (commit->parents)
                parent_oid = &commit->parents->item->object.oid;
        else
                parent_oid = the_hash_algo->empty_tree;

        diff_tree_oid(parent_oid, &commit->object.oid, ...);
  }

-Peff


Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)

2018-10-09 Thread Derrick Stolee

On 10/9/2018 2:46 PM, Jeff King wrote:

On Tue, Oct 09, 2018 at 09:48:20AM -0400, Derrick Stolee wrote:


[I snipped all of the parts about bloom filters that seemed entirely
  reasonable to me ;) ]

Imagine we have that list. Is a bloom filter still the best data
structure for each commit? At the point that we have the complete
universe of paths, we could give each commit a bitmap of changed paths.
That lets us ask "did this commit touch these paths" (collect the bits
from the list of paths, then check for 1's), as well as "what paths did
we touch" (collect the 1 bits, and then index the path list).  Those
bitmaps should compress very well via EWAH or similar (most of them
would be huge stretches of 0's punctuated by short runs of 1's).

I'm not convinced we would frequently have runs of 1's, and the bitmap would
not compress much better than simply listing the positions. For example, a
path "foo/bar" that resolves to a tree would only start a run if the next
changes are the initial section of entries in that tree (sorted
lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen into a tree,
then we will break the run of 1's unless we changed every path deeper than
that tree.

Yeah, I doubt we'd really have runs of 1's (by short, I just mean 1 or
2). I agree that listing the positions could work, though I sort of
assumed that was more or less what a decent compressed bitmap would
turn into. E.g., if bit N is set, we should be able to say "N-1
zeroes, 1 one" in about the same size as we could say "position N".

EWAH seems pretty awful in that regard. Or at least its serialized
format is (or maybe it's our implementation that is bad).

The patch below generates a bitmap for each commit in a repository (it
doesn't output the total list of paths; I've left that as an exercise
for the reader). On linux.git, the result is 57MB. But when I look at
the individual bitmap sizes (via GIT_TRACE), they're quite silly.
Storing a single set bit takes 28 bytes in serialized form!

There are only around 120k unique paths (including prefix trees).
Naively using run-length encoding and varints, our worst case should be
something like 18-20 bits to say "120k zeroes, then a 1, then all
zeroes".  And the average case should be better (you don't even need to
say "120k", but some smaller number).

I wonder if Roaring does better here.


In these sparse cases, usually Roaring will organize the data as "array 
chunks" which are simply lists of the values. The thing that makes this 
still compressible is that we store two bytes per entry, as the entries 
are grouped by a common most-significant two bytes. Since you say ~120k 
unique paths, the Roaring bitmap would have two or three chunks per 
bitmap (and those chunks could be empty). The overhead to store the 
chunk positions, types, and lengths does come at a cost, but it's more 
like 32 bytes _per commit_.



Gzipping the resulting bitmaps drops the total size to about 7.5MB.
That's not a particularly important number, but I think it shows that
the built-in ewah compression is far from ideal.

Just listing the positions with a series of varints would generally be
fine, since we expect sparse bitmaps. I just hoped that a good
RLE scheme would degrade to roughly that for the sparse case, but also
perform well for more dense cases.

So at any rate, I do think it would not be out of the question to store
bitmaps like this. I'm much more worried about the maintenance cost of
adding new entries incrementally. I think it's only feasible if we give
up sorting, and then I wonder what other problems that might cause.
The patch below gives me a starting point to try the Bloom filter 
approach and see what the numbers are like. You did all the "git" stuff 
like computing the changed paths, so thanks!


-- >8 --
diff --git a/Makefile b/Makefile
index 13e1c52478..f6e823f2d6 100644
--- a/Makefile
+++ b/Makefile
@@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
  TEST_PROGRAMS_NEED_X += test-pkt-line
  TEST_PROGRAMS_NEED_X += test-svn-fe
  TEST_PROGRAMS_NEED_X += test-tool
+TEST_PROGRAMS_NEED_X += test-tree-bitmap
  
  TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X))
  
diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c

new file mode 100644
index 0000000000..bc5cf0e514
--- /dev/null
+++ b/t/helper/test-tree-bitmap.c
@@ -0,0 +1,167 @@
+#include "cache.h"
+#include "revision.h"
+#include "diffcore.h"
+#include "argv-array.h"
+#include "ewah/ewok.h"
+
+/* map of pathnames to bit positions */
+struct pathmap_entry {
+   struct hashmap_entry ent;
+   unsigned pos;
+   char path[FLEX_ARRAY];
+};
+
+static int pathmap_entry_hashcmp(const void *unused_cmp_data,
+                                const void *entry,
+                                const void *entry_or_key,
+                                const void *keydata)
+{
+   const struct pathmap_entry *a = entry;
+   const struct pathmap_entry *b = entry_or_key;
+  

Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)

2018-10-09 Thread Jeff King
On Tue, Oct 09, 2018 at 09:48:20AM -0400, Derrick Stolee wrote:

> [I snipped all of the parts about bloom filters that seemed entirely
>  reasonable to me ;) ]

> > Imagine we have that list. Is a bloom filter still the best data
> > structure for each commit? At the point that we have the complete
> > universe of paths, we could give each commit a bitmap of changed paths.
> > That lets us ask "did this commit touch these paths" (collect the bits
> > from the list of paths, then check for 1's), as well as "what paths did
> > we touch" (collect the 1 bits, and then index the path list).  Those
> > bitmaps should compress very well via EWAH or similar (most of them
> > would be huge stretches of 0's punctuated by short runs of 1's).
> 
> I'm not convinced we would frequently have runs of 1's, and the bitmap would
> not compress much better than simply listing the positions. For example, a
> path "foo/bar" that resolves to a tree would only start a run if the next
> changes are the initial section of entries in that tree (sorted
> lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen into a tree,
> then we will break the run of 1's unless we changed every path deeper than
> that tree.

Yeah, I doubt we'd really have runs of 1's (by short, I just mean 1 or
2). I agree that listing the positions could work, though I sort of
assumed that was more or less what a decent compressed bitmap would
turn into. E.g., if bit N is set, we should be able to say "N-1
zeroes, 1 one" in about the same size as we could say "position N".

EWAH seems pretty awful in that regard. Or at least its serialized
format is (or maybe it's our implementation that is bad).

The patch below generates a bitmap for each commit in a repository (it
doesn't output the total list of paths; I've left that as an exercise
for the reader). On linux.git, the result is 57MB. But when I look at
the individual bitmap sizes (via GIT_TRACE), they're quite silly.
Storing a single set bit takes 28 bytes in serialized form!

There are only around 120k unique paths (including prefix trees).
Naively using run-length encoding and varints, our worst case should be
something like 18-20 bits to say "120k zeroes, then a 1, then all
zeroes".  And the average case should be better (you don't even need to
say "120k", but some smaller number).

I wonder if Roaring does better here.

Gzipping the resulting bitmaps drops the total size to about 7.5MB.
That's not a particularly important number, but I think it shows that
the built-in ewah compression is far from ideal.

Just listing the positions with a series of varints would generally be
fine, since we expect sparse bitmaps. I just hoped that a good
RLE scheme would degrade to roughly that for the sparse case, but also
perform well for more dense cases.
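
For example, encoding the gaps between sorted set-bit positions with
git's existing varint helpers would look something like this (sketch):

#include "git-compat-util.h"
#include "varint.h"

static size_t encode_bit_positions(const uint32_t *pos, size_t nr,
                                   unsigned char *out)
{
        size_t len = 0, i;
        uint32_t prev = 0;

        /* store gaps, not absolute positions; sparse bitmaps stay tiny */
        for (i = 0; i < nr; i++) {
                len += encode_varint(pos[i] - prev, out + len);
                prev = pos[i];
        }
        return len;
}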

So at any rate, I do think it would not be out of the question to store
bitmaps like this. I'm much more worried about the maintenance cost of
adding new entries incrementally. I think it's only feasible if we give
up sorting, and then I wonder what other problems that might cause.

-Peff

-- >8 --
diff --git a/Makefile b/Makefile
index 13e1c52478..f6e823f2d6 100644
--- a/Makefile
+++ b/Makefile
@@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
 TEST_PROGRAMS_NEED_X += test-pkt-line
 TEST_PROGRAMS_NEED_X += test-svn-fe
 TEST_PROGRAMS_NEED_X += test-tool
+TEST_PROGRAMS_NEED_X += test-tree-bitmap
 
 TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X))
 
diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c
new file mode 100644
index 0000000000..bc5cf0e514
--- /dev/null
+++ b/t/helper/test-tree-bitmap.c
@@ -0,0 +1,167 @@
+#include "cache.h"
+#include "revision.h"
+#include "diffcore.h"
+#include "argv-array.h"
+#include "ewah/ewok.h"
+
+/* map of pathnames to bit positions */
+struct pathmap_entry {
+   struct hashmap_entry ent;
+   unsigned pos;
+   char path[FLEX_ARRAY];
+};
+
+static int pathmap_entry_hashcmp(const void *unused_cmp_data,
+                                const void *entry,
+                                const void *entry_or_key,
+                                const void *keydata)
+{
+   const struct pathmap_entry *a = entry;
+   const struct pathmap_entry *b = entry_or_key;
+   const char *key = keydata;
+
+   return strcmp(a->path, key ? key : b->path);
+}
+
+static int pathmap_entry_strcmp(const void *va, const void *vb)
+{
+   struct pathmap_entry *a = *(struct pathmap_entry **)va;
+   struct pathmap_entry *b = *(struct pathmap_entry **)vb;
+   return strcmp(a->path, b->path);
+}
+
+struct walk_paths_data {
+   struct hashmap *paths;
+   struct commit *commit;
+};
+
+static void walk_paths(diff_format_fn_t fn, struct hashmap *paths)
+{
+   struct argv_array argv = ARGV_ARRAY_INIT;
+   struct rev_info revs;
+   struct walk_paths_data data;
+   struct commit *commit;
+
+   argv_array_pushl(&argv, "rev-list",
+

Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)

2018-10-09 Thread Ævar Arnfjörð Bjarmason


On Tue, Oct 09 2018, Derrick Stolee wrote:

> The filter needs to store every path that would be considered "not
> TREESAME". It can't store wildcards, so you would need to evaluate the
> wildcard and test all of those paths individually (not a good idea).

If full paths are stored, yes. But have you considered, instead of
storing paths, storing all trigrams that can be extracted from the
changed paths at that commit?

I.e. instead of a change to "t/t0000-basic.sh" storing
"t/t0000-basic.sh" we'd store ["t/t", "/t0", "t00", "000", "00-" ...]
etc.

That sort of approach would mean that e.g. "t*000*", "*asi*.sh"
etc. could all be indexed, and as long as we could find three
consecutive bytes of fixed string we'd have a chance to short-circuit,
but we'd need to degrade to a full tree unpack for e.g. "t*". We could
also special-case certain sub-three-char indexes, or add "bi-grams",
e.g. to be able to index '*.c' or 't*' (first char at the beginning of
a string only).
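
Extracting the keys themselves would be trivial, something like
(sketch; bloom_add() is a hypothetical insert function):

#include "git-compat-util.h"

void bloom_add(const void *data, size_t len); /* hypothetical */

static void add_path_trigrams(const char *path)
{
        size_t len = strlen(path);
        size_t i;

        /* every 3-byte window of the changed path becomes a key */
        for (i = 0; i + 3 <= len; i++)
                bloom_add(path + i, 3);
}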

It would mean having to check more things in the bloom filter for each
commit, but that's going to be hot in cache at that point so it'll
probably beat unpacking trees by far, and we could short-circuit exit at
the first one that returned false.


Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)

2018-10-09 Thread Derrick Stolee

(Changing title to reflect the new topic.)

On 10/8/2018 11:08 PM, Jeff King wrote:

On Mon, Oct 08, 2018 at 02:29:47PM -0400, Derrick Stolee wrote:

There are two questions that I was hoping to answer by looking at 
your code:

1. How do you store your Bloom filter? Is it connected to the commit-graph
and split on a commit-by-commit basis (storing "$path" as a key), or is it
one huge Bloom filter (storing "$commitid:$path" as key)?

I guess you've probably thought all of this through for your
implementation, but let me pontificate.

I'd have done it as one fixed-size filter per commit. Then you should be
able to hash the path keys once, and apply the result as a bitwise query
to each individual commit (I'm assuming that it's constant-time to
access the filter for each, as an index into an mmap'd array, with the
offset coming from a commit-graph entry we'd be able to look up anyway).


You're right that we want to hash the path a constant number of times. 
Add in that we want to re-use information already serialized when 
updating the file, and we see that having a commit-by-commit Bloom 
filter is a good idea. Using (commit, path) pairs requires lots of 
re-hashing, repeated work when extending the filter, and poor locality 
when evaluating membership of a single key.


The nice thing is that you can generate k 32-bit hash values based on 
two 32-bit hash values that are "independent enough" (see [1]). We used 
Murmur3 with two different seed values to generate hashes a & b, then 
used the arithmetic progression a, a + b, a + 2b, ..., a + (k-1)b as our 
k hash values. These can be computed up front and then dropped into any 
size filter using a simple modulo operation. This allows flexible sizes 
in our filters.
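
In code that's just (sketch; murmur3_32() assumed available, the seeds
are arbitrary placeholders, not anything we've settled on):

#include "git-compat-util.h"

uint32_t murmur3_32(const void *key, size_t len, uint32_t seed); /* assumed */

static void bloom_key_hashes(const void *key, size_t len, int k,
                             uint32_t nbits, uint32_t *out)
{
        uint32_t a = murmur3_32(key, len, 1); /* seed 1 (placeholder) */
        uint32_t b = murmur3_32(key, len, 2); /* seed 2 (placeholder) */
        int i;

        /* a, a + b, a + 2b, ..., a + (k-1)b, reduced by the filter size */
        for (i = 0; i < k; i++)
                out[i] = (a + (uint32_t)i * b) % nbits;
}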


I don't think fixed size filters are a good idea, and instead would opt 
for flex-sized filters with a maximum size. The typical parameters use 7 
hash functions and a filter size of (at least) 10 bits per entry. For 
most commits (say 60-70%), 256 bits (32 bytes) is enough. Using a 
maximum of 512 bytes covers 99% of commits. We will want these bounds to 
be configurable via config. If we had a fixed size, then we either make 
it too small (and don't have sufficient coverage of commits) or too 
large (and waste a lot of space on the commits that change very little).
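
(For reference, the textbook false-positive estimate with k hash
functions and r bits per entry is p ≈ (1 - e^(-k/r))^k; k=7 and r=10
gives (1 - e^(-0.7))^7 ≈ 0.8%.)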


We can store these flex-sized filters in the commit-graph using two 
columns of data (two new optional chunks):


* Bloom filter data: stores the binary data of each commit's Bloom 
filter, concatenated together in the same order as the commits appear in 
the commit-graph.


* Bloom filter positions: The ith position of this column stores the 
start of the (i+1)th Bloom filter (The 0th filter starts at byte 0). A 
Bloom filter of size 0 is intended to mean "we didn't store this filter 
because it would be too large". We can compute the lengths of the filter 
by inspecting adjacent values.
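
Looking up commit i's filter from those two chunks would then be
something like (sketch; get_be32() is the existing helper, the struct
is purely illustrative):

struct bloom_chunks {
        const unsigned char *data;      /* concatenated filter bytes */
        const unsigned char *positions; /* one 32-bit end offset per commit */
};

static const unsigned char *commit_bloom_filter(const struct bloom_chunks *b,
                                                uint32_t i, uint32_t *len)
{
        uint32_t start = i ? get_be32(b->positions + 4 * (i - 1)) : 0;
        uint32_t end = get_be32(b->positions + 4 * i);

        *len = end - start;     /* 0 means "too large, not stored" */
        return *len ? b->data + start : NULL;
}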


In order to be flexible, we will want to encode some basic information 
into the Bloom filter data chunk, such as a tuple of (hash version, num 
hash bits, num bits per entry). This allows us to change the parameters 
in config but still be able to read a serialized filter. Here I assume 
that all filters share the same parameters. The "hash version" here is 
different than the_hash_algo, because we don't care about cryptographic 
security, only a uniform distribution (hence, Murmur3 is a good, fast 
option).


[1] 
https://web.archive.org/web/20090131053735/http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf




I think it would also be easier to deal with maintenance, since each
filter is independent (IIRC, you cannot delete from a bloom filter
without re-adding all of the other keys).

The obvious downside is that it's O(commits) storage instead of O(1).
It would always be O(changes), as the Bloom filter needs to grow in size 
as the number of entries grows.

Now let me ponder a bit further afield. A bloom filter lets us answer
the question "did this commit (probably) touch these paths?". But it
does not let us answer "which paths did this commit touch?".

That second one is less useful than you might think, because we almost
always care about not just the names of the paths, but their actual
object ids. Think about a --raw diff, or even traversing for
reachability (where if we knew the tree-diff cheaply, we could avoid
asking "have we seen this yet?" on most of the tree entries). The names
alone can make that a bit faster, but in the worst case you still have
to walk the whole tree to find their entries.

But there's also a related question: how do we match pathspec patterns?
For a changed path like "foo/bar/baz", I imagine a bloom filter would
mark all of "foo", "foo/bar", and "foo/bar/baz". But what about "*.c"? I
don't think a bloom filter can answer that.


The filter needs to store every path that would be considered "not 
TREESAME". It can't store wildcards, so you would need to evaluate the 
wildcard and test all of those paths 

Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-08 Thread Jeff King
On Mon, Oct 08, 2018 at 02:29:47PM -0400, Derrick Stolee wrote:

> > > > But I'm afraid it will take a while until I get around to turn it into
> > > > something presentable...
> > > Do you have the code pushed somewhere public where one could take a
> > > look? I could provide some early feedback.
> > Nah, definitely not...  I know full well how embarrassingly broken this
> > implementation is, I don't need others to tell me that ;)
> There are two questions that I was hoping to answer by looking at your code:
> 
> 1. How do you store your Bloom filter? Is it connected to the commit-graph
> and split on a commit-by-commit basis (storing "$path" as a key), or is it
> one huge Bloom filter (storing "$commitid:$path" as key)?

I guess you've probably thought all of this through for your
implementation, but let me pontificate.

I'd have done it as one fixed-size filter per commit. Then you should be
able to hash the path keys once, and apply the result as a bitwise query
to each individual commit (I'm assuming that it's constant-time to
access the filter for each, as an index into an mmap'd array, with the
offset coming from a commit-graph entry we'd be able to look up anyway).

I think it would also be easier to deal with maintenance, since each
filter is independent (IIRC, you cannot delete from a bloom filter
without re-adding all of the other keys).

The obvious downside is that it's O(commits) storage instead of O(1).

Now let me ponder a bit further afield. A bloom filter lets us answer
the question "did this commit (probably) touch these paths?". But it
does not let us answer "which paths did this commit touch?".

That second one is less useful than you might think, because we almost
always care about not just the names of the paths, but their actual
object ids. Think about a --raw diff, or even traversing for
reachability (where if we knew the tree-diff cheaply, we could avoid
asking "have we seen this yet?" on most of the tree entries). The names
alone can make that a bit faster, but in the worst case you still have
to walk the whole tree to find their entries.

But there's also a related question: how do we match pathspec patterns?
For a changed path like "foo/bar/baz", I imagine a bloom filter would
mark all of "foo", "foo/bar", and "foo/bar/baz". But what about "*.c"? I
don't think a bloom filter can answer that.

At least not by itself. If we imagine that the commit-graph also had an
alphabetized list of every path in every tree, then it's easy: apply the
glob to that list once to get a set of concrete paths, and then query
the bloom filters for those. And that list actually isn't too big. The
complete set of paths in linux.git is only about 300k gzipped (I think
that's the most relevant measure, since it's an obvious win to avoid
repeating shared prefixes of long paths).
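
I.e. something like this (sketch; wildmatch() exists, the global path
list and the per-commit filter query are hypothetical):

#include "cache.h"
#include "commit.h"
#include "wildmatch.h"

int bloom_maybe_contains(struct commit *c, const char *path); /* hypothetical */

static int commit_maybe_matches_glob(struct commit *c,
                                     const char **paths, size_t nr,
                                     const char *pattern)
{
        size_t i;

        for (i = 0; i < nr; i++) {
                if (wildmatch(pattern, paths[i], 0) != WM_MATCH)
                        continue;
                /* probe the filter only for concrete glob matches */
                if (bloom_maybe_contains(c, paths[i]))
                        return 1; /* maybe touched */
        }
        return 0; /* definitely not */
}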

Imagine we have that list. Is a bloom filter still the best data
structure for each commit? At the point that we have the complete
universe of paths, we could give each commit a bitmap of changed paths.
That lets us ask "did this commit touch these paths" (collect the bits
from the list of paths, then check for 1's), as well as "what paths did
we touch" (collect the 1 bits, and then index the path list).  Those
bitmaps should compress very well via EWAH or similar (most of them
would be huge stretches of 0's punctuated by short runs of 1's).

So that seems promising to me (or at least not an obvious dead-end). I
do think maintenance gets to be a headache, though. Adding new paths
potentially means reordering the bitmaps, which means O(commits) work to
"incrementally" update the structure. (Unless you always add the new
paths at the end, but then you lose fast lookups in the list; that might
be an acceptable tradeoff).

And finally, there's one more radical option: could we actually store a
real per-commit tree-diff cache? I.e., imagine that each commit had the
equivalent of a --raw diff easily accessible, including object ids. That
would allow:

  - fast pathspec matches, including globs

  - fast --raw output (and faster -p output, since we can skip the tree
entirely)

  - fast reachability traversals (we only need to bother to look at the
objects for changed entries)

where "fast" is basically O(size of commit's changes), rather than
O(size of whole tree). This was one of the big ideas of packv4 that
never materialized. You can _almost_ do it with packv2, since after all,
we end up storing many trees as deltas. But those deltas are byte-wise
so it's hard for a reader to convert them directly into a pure-tree
diff (they also don't mention the "deleted" data, so it's really only
half a diff).

So let's imagine we'd store such a cache external to the regular object
data (i.e., as a commit-graph entry). The "log --raw" diff of linux.git
has 1.7M entries. The paths should easily compress to a single 32-bit
integer (e.g., as an index 

Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-08 Thread Junio C Hamano
SZEDER Gábor  writes:

> There is certainly potential there.  With a (very) rough PoC
> experiment, an 8MB bloom filter, and a carefully chosen path I can
> achieve a nice, almost 25x speedup:
>
>   $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
>   6
>
>   real    0m1.563s
>   user    0m1.519s
>   sys     0m0.045s
>
>   $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- 
> t/valgrind/valgrind.sh
>   6
>
>   real    0m0.063s
>   user    0m0.043s
>   sys     0m0.020s

Even though I somehow sense a sign of exaggeration in [v] in the
pathname, it still is quite respectable.

>   bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false 
> positives: 64 fp ratio: 0.003934
>
> But I'm afraid it will take a while until I get around to turn it into
> something presentable...

That's OK.  This is an encouraging result.

Just from curiosity, how are you keying the filter?  Tree object
name of the top-level and full path concatenated or something like
that?


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-08 Thread Derrick Stolee

On 10/8/2018 2:10 PM, SZEDER Gábor wrote:

On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote:

Nice! These numbers make sense to me, in terms of how many TREESAME queries
we actually need to perform for such a query.

Yeah...  because you didn't notice that I deliberately cheated :)

As it turned out, it's not just about the number of diff queries that
we can spare, but, for the speedup _ratio_, it's more about how
expensive those diff queries are.

git.git has a rather flat hierarchy, and 't/' is the 372nd entry in
the current root tree object, while 'valgrind/' is the 923rd entry,
and the diff machinery spends considerable time wading through the
previous entries.  Notice the "carefully chosen path" remark in my
previous email; I think this particular path has the highest number of
preceding tree entries, and, in addition, 't/' changes rather
frequently, so the diff machinery often has to scan two relatively big
tree objects.  Had I chosen 'Documentation/RelNotes/1.5.0.1.txt'
instead, i.e. another path two directories deep, but whose leading
path components are both near the beginning of the tree objects, the
speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only"
~5.7x instead of ~24.8x.


This is expected. The performance ratio is better when the path is any 
of the following:


1. A very deep path (need to walk multiple trees to answer TREESAME)

2. An entry is late in a very wide tree (need to spend extra time 
parsing tree object)


3. The path doesn't change very often (need to inspect many TREESAME 
pairs before finding enough interesting commits)


4. Some sub-path changes often (so the TREESAME comparison needs to 
parse beyond that sub-path often)


Our standard examples (Git and Linux repos) don't have many paths that 
have these properties. But: they do exist. In other projects, this is 
actually typical. Think about Java projects that frequently have ~5 
levels of folders before actually touching a code file.


When I was implementing the Bloom filter feature for Azure Repos, I ran 
performance tests on the Linux repo using a random sampling of paths. 
The typical speedup was 5x while some outliers were in the 25x range.





But I'm afraid it will take a while until I get around to turn it into
something presentable...

Do you have the code pushed somewhere public where one could take a
look? I could provide some early feedback.

Nah, definitely not...  I know full well how embarrassingly broken this
implementation is, I don't need others to tell me that ;)

There are two questions that I was hoping to answer by looking at your code:

1. How do you store your Bloom filter? Is it connected to the 
commit-graph and split on a commit-by-commit basis (storing "$path" as a 
key), or is it one huge Bloom filter (storing "$commitid:$path" as key)?


2. Where does your Bloom filter check plug into the TREESAME logic? I 
haven't investigated this part, but hopefully it isn't too complicated.


Thanks,
-Stolee


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-08 Thread SZEDER Gábor
On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote:
> On 10/8/2018 12:41 PM, SZEDER Gábor wrote:
> >On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:
> >>I'm still excited about the prospect of a bloom filter for paths which
> >>each commit touches. I think that's the next big frontier in getting
> >>things like "git log -- path" to a reasonable run-time.
> >There is certainly potential there.  With a (very) rough PoC
> >experiment, a 8MB bloom filter, and a carefully choosen path I can
> >achieve a nice, almost 25x speedup:
> >
> >   $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
> >   6
> >
> >   real    0m1.563s
> >   user    0m1.519s
> >   sys     0m0.045s
> >
> >   $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- 
> > t/valgrind/valgrind.sh
> >   6
> >
> >   real    0m0.063s
> >   user    0m0.043s
> >   sys     0m0.020s
> >
> >   bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false 
> > positives: 64 fp ratio: 0.003934

> Nice! These numbers make sense to me, in terms of how many TREESAME queries
> we actually need to perform for such a query.

Yeah...  because you didn't notice that I deliberately cheated :)

As it turned out, it's not just about the number of diff queries that
we can spare, but, for the speedup _ratio_, it's more about how
expensive those diff queries are.

git.git has a rather flat hierarchy, and 't/' is the 372nd entry in
the current root tree object, while 'valgrind/' is the 923rd entry,
and the diff machinery spends considerable time wading through the
previous entries.  Notice the "carefully chosen path" remark in my
previous email; I think this particular path has the highest number of
preceding tree entries, and, in addition, 't/' changes rather
frequently, so the diff machinery often has to scan two relatively big
tree objects.  Had I chosen 'Documentation/RelNotes/1.5.0.1.txt'
instead, i.e. another path two directories deep, but whose leading
path components are both near the beginning of the tree objects, the
speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only"
~5.7x instead of ~24.8x.
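
(For the curious, an entry's position in the root tree is easy to
count yourself, e.g.:

  $ git ls-tree HEAD | awk '$NF == "t" { print NR }'
  372
)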

> >But I'm afraid it will take a while until I get around to turn it into
> >something presentable...
> Do you have the code pushed somewhere public where one could take a
> look? I could provide some early feedback.

Nah, definitely not...  I know full well how embarrassingly broken this
implementation is, I don't need others to tell me that ;)



Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-08 Thread Derrick Stolee

On 10/8/2018 12:41 PM, SZEDER Gábor wrote:

On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:

I'm still excited about the prospect of a bloom filter for paths which
each commit touches. I think that's the next big frontier in getting
things like "git log -- path" to a reasonable run-time.

There is certainly potential there.  With a (very) rough PoC
experiment, an 8MB bloom filter, and a carefully chosen path I can
achieve a nice, almost 25x speedup:

   $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
   6

   real    0m1.563s
   user    0m1.519s
   sys     0m0.045s

   $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- 
t/valgrind/valgrind.sh
   6

   real    0m0.063s
   user    0m0.043s
   sys     0m0.020s

   bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false 
positives: 64 fp ratio: 0.003934
Nice! These numbers make sense to me, in terms of how many TREESAME 
queries we actually need to perform for such a query.

But I'm afraid it will take a while until I get around to turn it into
something presentable...
Do you have the code pushed somewhere public where one could take a 
look? I could provide some early feedback.


Thanks,
-Stolee


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-08 Thread SZEDER Gábor
On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:
> I'm still excited about the prospect of a bloom filter for paths which
> each commit touches. I think that's the next big frontier in getting
> things like "git log -- path" to a reasonable run-time.

There is certainly potential there.  With a (very) rough PoC
experiment, an 8MB bloom filter, and a carefully chosen path I can
achieve a nice, almost 25x speedup:

  $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
  6

  real    0m1.563s
  user    0m1.519s
  sys     0m0.045s

  $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- 
t/valgrind/valgrind.sh
  6

  real    0m0.063s
  user    0m0.043s
  sys     0m0.020s

  bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false 
positives: 64 fp ratio: 0.003934

But I'm afraid it will take a while until I get around to turn it into
something presentable...



Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Jeff King
On Fri, Oct 05, 2018 at 10:01:31PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > There's unfortunately not a fast way of doing that. One option would be
> > to keep a counter of "ungraphed commit objects", and have callers update
> > it. Anybody admitting a pack via index-pack or unpack-objects can easily
> > get this information. Commands like fast-import can do likewise, and
> > "git commit" obviously increments it by one.
> >
> > I'm not excited about adding a new global on-disk data structure (and
> > the accompanying lock).
> 
> You don't really need a new global datastructure to solve this
> problem. It would be sufficient to have git-gc itself write out a 4-line
> text file after it runs saying how many tags, commits, trees and blobs
> it found on its last run.
>
> You can then fuzzily compare object counts v.s. commit counts for the
> purposes of deciding whether something like the commit-graph needs to be
> updated, while assuming that whatever new data you have has similar
> enough ratios of those as your existing data.

I think this is basically the same thing as Stolee's suggestion to keep
the total object count in the commit-graph file. The only difference is
here is that we know the actual ratio of commit to blobs for this
particular repository. But I don't think we need to know that. As you
said, this is fuzzy anyway, so a single number for "update the graph
when there are N new objects" is likely enough.

If you had a repository with an unusually large tree, you'd end up
rebuilding the graph more often. But I think it would probably be OK, as
we're primarily trying not to waste time doing a graph rebuild when
we've only done a small amount of other work. But if we just shoved a
ton of objects through index-pack then we did a lot of work, whether
those were commit objects or not.

-Peff


Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Jeff King
On Fri, Oct 05, 2018 at 04:00:12PM -0400, Derrick Stolee wrote:

> On 10/5/2018 3:47 PM, Jeff King wrote:
> > On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
> > 
> > > > So can we really just take (total_objects - commit_graph_objects) and
> > > > compare it to some threshold?
> > > The commit-graph only stores the number of _commits_, not total objects.
> > Oh, right, of course. That does throw a monkey wrench in that line of
> > thought. ;)
> > 
> > There's unfortunately not a fast way of doing that. One option would be
> > to keep a counter of "ungraphed commit objects", and have callers update
> > it. Anybody admitting a pack via index-pack or unpack-objects can easily
> > get this information. Commands like fast-import can do likewise, and
> > "git commit" obviously increments it by one.
> > 
> > I'm not excited about adding a new global on-disk data structure (and
> > the accompanying lock).
> 
> If we want, then we can add an optional chunk to the commit-graph file that
> stores the object count.

Yeah, that's probably a saner route, since we have to do the write then
anyway.

-Peff


Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Ævar Arnfjörð Bjarmason


On Fri, Oct 05 2018, Jeff King wrote:

> On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
>
>> > So can we really just take (total_objects - commit_graph_objects) and
>> > compare it to some threshold?
>>
>> The commit-graph only stores the number of _commits_, not total objects.
>
> Oh, right, of course. That does throw a monkey wrench in that line of
> thought. ;)
>
> There's unfortunately not a fast way of doing that. One option would be
> to keep a counter of "ungraphed commit objects", and have callers update
> it. Anybody admitting a pack via index-pack or unpack-objects can easily
> get this information. Commands like fast-import can do likewise, and
> "git commit" obviously increments it by one.
>
> I'm not excited about adding a new global on-disk data structure (and
> the accompanying lock).

You don't really need a new global datastructure to solve this
problem. It would be sufficient to have git-gc itself write out a 4-line
text file after it runs saying how many tags, commits, trees and blobs
it found on its last run.
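
I.e. something as dumb as this in $GIT_DIR (name and format entirely
hypothetical):

    # gc-object-counts
    tags <count>
    commits <count>
    trees <count>
    blobs <count>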

You can then fuzzily compare object counts v.s. commit counts for the
purposes of deciding whether something like the commit-graph needs to be
updated, while assuming that whatever new data you have has similar
enough ratios of those as your existing data.

That's an assumption that'll hold well enough for big repos where this
matters the most, and who tend to grow in fairly uniform ways as far as
their object type ratios go.

Databases like MySQL, PostgreSQL etc. pull similar tricks with their
fuzzy table statistics.


Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Derrick Stolee

On 10/5/2018 3:47 PM, Jeff King wrote:

On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:


So can we really just take (total_objects - commit_graph_objects) and
compare it to some threshold?

The commit-graph only stores the number of _commits_, not total objects.

Oh, right, of course. That does throw a monkey wrench in that line of
thought. ;)

There's unfortunately not a fast way of doing that. One option would be
to keep a counter of "ungraphed commit objects", and have callers update
it. Anybody admitting a pack via index-pack or unpack-objects can easily
get this information. Commands like fast-import can do likewise, and
"git commit" obviously increments it by one.

I'm not excited about adding a new global on-disk data structure (and
the accompanying lock).


If we want, then we can add an optional chunk to the commit-graph file 
that stores the object count.




Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Jeff King
On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:

> > So can we really just take (total_objects - commit_graph_objects) and
> > compare it to some threshold?
> 
> The commit-graph only stores the number of _commits_, not total objects.

Oh, right, of course. That does throw a monkey wrench in that line of
thought. ;)

There's unfortunately not a fast way of doing that. One option would be
to keep a counter of "ungraphed commit objects", and have callers update
it. Anybody admitting a pack via index-pack or unpack-objects can easily
get this information. Commands like fast-import can do likewise, and
"git commit" obviously increments it by one.

I'm not excited about adding a new global on-disk data structure (and
the accompanying lock).

-Peff


Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Derrick Stolee

On 10/5/2018 3:21 PM, Jeff King wrote:

On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote:


My misunderstanding was that your proposed change to gc computes the
commit-graph in either of these two cases:

(1) The auto-GC threshold is met.

(2) There is no commit-graph file.

And what I hope to have instead of (2) is (3):

(3) The commit-graph file is "sufficiently behind" the tip refs.

This condition is intentionally vague at the moment. It could be that we
hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a
pack, and it probably contains a lot of new commits") or we could create
some more complicated condition based on counting reachable commits with
infinite generation number (the number of commits not in the commit-graph
file).

I like that you are moving forward to make the commit-graph be written more
frequently, but I'm trying to push us in a direction of writing it even more
often than your proposed strategy. We should avoid creating too many
orthogonal conditions that trigger the commit-graph write, which is why I'm
pushing on your design here.

Anyone else have thoughts on this direction?

Yes, I think measuring "sufficiently behind" is the right thing.
Everything else is a proxy or heuristic, and will run into corner cases.
E.g., I have some small number of objects and then do a huge fetch, and
now my commit-graph only covers 5% of what's available.

We know how many objects are in the graph already. And it's not too
expensive to get the number of objects in the repository. We can do the
same sampling for loose objects that "gc --auto" does, and counting
packed objects just involves opening up the .idx files (that can be slow
if you have a ton of packs, but you'd want to either repack or use a
.midx in that case anyway, either of which would help here).

So can we really just take (total_objects - commit_graph_objects) and
compare it to some threshold?


The commit-graph only stores the number of _commits_, not total objects.

Azure Repos' commit-graph does store the total number of objects, and 
that is how we trigger updating the graph, so it is not unreasonable to 
use that as a heuristic.


Thanks,

-Stolee



Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Jeff King
On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote:

> My misunderstanding was that your proposed change to gc computes the
> commit-graph in either of these two cases:
> 
> (1) The auto-GC threshold is met.
> 
> (2) There is no commit-graph file.
> 
> And what I hope to have instead of (2) is (3):
> 
> (3) The commit-graph file is "sufficiently behind" the tip refs.
> 
> This condition is intentionally vague at the moment. It could be that we
> hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a
> pack, and it probably contains a lot of new commits") or we could create
> some more complicated condition based on counting reachable commits with
> infinite generation number (the number of commits not in the commit-graph
> file).
> 
> I like that you are moving forward to make the commit-graph be written more
> frequently, but I'm trying to push us in a direction of writing it even more
> often than your proposed strategy. We should avoid creating too many
> orthogonal conditions that trigger the commit-graph write, which is why I'm
> pushing on your design here.
> 
> Anyone else have thoughts on this direction?

Yes, I think measuring "sufficiently behind" is the right thing.
Everything else is a proxy or heuristic, and will run into corner cases.
E.g., I have some small number of objects and then do a huge fetch, and
now my commit-graph only covers 5% of what's available.

We know how many objects are in the graph already. And it's not too
expensive to get the number of objects in the repository. We can do the
same sampling for loose objects that "gc --auto" does, and counting
packed objects just involves opening up the .idx files (that can be slow
if you have a ton of packs, but you'd want to either repack or use a
.midx in that case anyway, either of which would help here).

So can we really just take (total_objects - commit_graph_objects) and
compare it to some threshold?
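Very roughly, with both helpers standing in for the counting described
above (the names and the threshold knob are invented):

static int commit_graph_too_stale(void)
{
	/* sample loose objects, sum the .idx fanout counts */
	unsigned long total = estimate_repo_object_count();
	/* number of commits the graph covers, from its header */
	unsigned long graphed = commit_graph_commit_count();
	unsigned long limit = 65536; /* e.g. some gc.* config knob */

	/* caveat: "total" is all objects, the graph stores only commits */
	return total > graphed && total - graphed > limit;
}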

-Peff


Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Ævar Arnfjörð Bjarmason


On Fri, Oct 05 2018, Derrick Stolee wrote:

> On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Oct 05 2018, Derrick Stolee wrote:
>>
>>> On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>>>> I don't have time to polish this up for submission now, but here's a WIP
>>>> patch that implements this, highlights:
>>>>
>>>>   * There's a gc.clone.autoDetach=false default setting which overrides
>>>>     gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>>>>     --cloning option to indicate this).
>>> I'll repeat that it could make sense to do the same thing on clone
>>> _and_ fetch. Perhaps a "--post-fetch" flag would be good here to
>>> communicate that we just downloaded a pack from a remote.
>> I don't think that makes sense, but let's talk about why, because maybe
>> I've missed something, you're certainly more familiar with the
>> commit-graph than I am.
>>
>> The reason to do it on clone as a special-case or when the file is
>> missing, is because we know the file is desired (via the GC config), and
>> presumably is expected to help performance, and we have 0% of it. So by
>> going from 0% to 100% on clone we'll get fast --contains and other
>> goodies the graph helps with.
>>
>> But when we're doing a fetch, or really anything else that runs "git gc
>> --auto" we can safely assume that we have a recent enough graph, because
>> it will have been run whenever auto-gc kicked in.
>>
>> I.e.:
>>
>>  # Slow, if we assume background forked commit-graph generation
>>  # (which I'm avoiding)
>>  git clone x && cd x && git tag --contains
>>  # Fast enough, since we have an existing commit-graph
>>  cd x && git fetch && git tag --contains
>>
>> I *do* think it might make sense to in general split off parts of "gc
>> --auto" that we'd like to be more aggressive about, simply because the
>> ratio of how long it takes to do, and how much it helps with performance
>> makes more sense than a full repack, which is what the current heuristic
>> is based on.
>>
>> And maybe when we run in that mode we should run in the foreground, but
>> I don't see why git-fetch should be a special case there, and in this
>> regard, the gc.clone.autoDetach=false setting I've made doesn't make
>> much sense. I.e. maybe we should also skip forking to the background in
>> such a mode when we trigger such a "mini gc" via git-commit or whatever.
>
> My misunderstanding was that your proposed change to gc computes the
> commit-graph in either of these two cases:
>
> (1) The auto-GC threshold is met.
>
> (2) There is no commit-graph file.
>
> And what I hope to have instead of (2) is (3):
>
> (3) The commit-graph file is "sufficiently behind" the tip refs.
>
> This condition is intentionally vague at the moment. It could be that
> we hint that (3) holds by saying "--post-fetch" (i.e. "We just
> downloaded a pack, and it probably contains a lot of new commits") or
> we could create some more complicated condition based on counting
> reachable commits with infinite generation number (the number of
> commits not in the commit-graph file).
>
> I like that you are moving forward to make the commit-graph be written
> more frequently, but I'm trying to push us in a direction of writing
> it even more often than your proposed strategy. We should avoid
> creating too many orthogonal conditions that trigger the commit-graph
> write, which is why I'm pushing on your design here.
>
> Anyone else have thoughts on this direction?

Ah. I see. I think #3 makes perfect sense, but probably makes sense to
do as a follow-up, or maybe you'd like to stick a patch on top of the
series I have when I send it. I don't know how to write the "I'm not
quite happy about the commit graph" code :)

What I will do is refactor gc.c a bit and leave it in a state where it's
going to be really easy to change the existing "we have no commit graph,
and thus should do the optimization step" check into some more complex
condition, i.e. your "we just grabbed a lot of data".
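I.e. something with this shape (the function and variable names here
are mine, not from the WIP patch):

static int need_to_optimize(const char *graph_file)
{
	if (!gc_write_commit_graph)
		return 0;
	/*
	 * Today: optimize only if the graph is missing entirely.
	 * Later: swap this line for a "sufficiently behind" test.
	 */
	return !commit_graph_exists(graph_file);
}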

Also, I'll drop the gc.clone.autoDetach=false setting and name it
something more general, maybe gc.AutoDetachOnBigOptimization=false?
Anyway something more generic so that "clone" will always pass in some
option saying "expect a large % commit graph update" (100% in its case),
and then in "fetch" we could have some detection of how big what we just
got from the server is, and do the same.

This seems to me to be the most general thing that would make sense, and
could also be extended e.g. to "git commit" and other users of gc
--auto. E.g. if I started with a README file in an empty repo and then
made a commit adding 1 million files in one go, we'd (depending on that
setting) also block in the foreground and generate the commit-graph.
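As a sketch of that knob (again, all names invented; clone would pass
100, fetch some estimate derived from the pack it just received):

static int should_detach(int optimize_hint_pct)
{
	if (!gc_auto_detach)
		return 0;
	/*
	 * Stay in the foreground for big optimization-only updates;
	 * the 50 is an arbitrary cut-off for illustration.
	 */
	if (optimize_hint_pct >= 50 && !auto_detach_on_big_optimization)
		return 0;
	return 1;
}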


Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Derrick Stolee

On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote:

On Fri, Oct 05 2018, Derrick Stolee wrote:


On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:

I don't have time to polish this up for submission now, but here's a WIP
patch that implements this, highlights:

   * There's a gc.clone.autoDetach=false default setting which overrides
 gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
 --cloning option to indicate this).

I'll repeat that it could make sense to do the same thing on clone
_and_ fetch. Perhaps a "--post-fetch" flag would be good here to
communicate that we just downloaded a pack from a remote.

I don't think that makes sense, but let's talk about why, because maybe
I've missed something, you're certainly more familiar with the
commit-graph than I am.

The reason to do it on clone as a special-case or when the file is
missing, is because we know the file is desired (via the GC config), and
presumably is expected to help performance, and we have 0% of it. So by
going from 0% to 100% on clone we'll get fast --contains and other
goodies the graph helps with.

But when we're doing a fetch, or really anything else that runs "git gc
--auto" we can safely assume that we have a recent enough graph, because
it will have been run whenever auto-gc kicked in.

I.e.:

 # Slow, if we assume background forked commit-graph generation
 # (which I'm avoiding)
 git clone x && cd x && git tag --contains
 # Fast enough, since we have an existing commit-graph
 cd x && git fetch && git tag --contains

I *do* think it might make sense to in general split off parts of "gc
--auto" that we'd like to be more aggressive about, simply because the
ratio of how long it takes to do, and how much it helps with performance
makes more sense than a full repack, which is what the current heuristic
is based on.

And maybe when we run in that mode we should run in the foreground, but
I don't see why git-fetch should be a special case there, and in this
regard, the gc.clone.autoDetach=false setting I've made doesn't make
much sense. I.e. maybe we should also skip forking to the background in
such a mode when we trigger such a "mini gc" via git-commit or whatever.


My misunderstanding was that your proposed change to gc computes the 
commit-graph in either of these two cases:


(1) The auto-GC threshold is met.

(2) There is no commit-graph file.

And what I hope to have instead of (2) is (3):

(3) The commit-graph file is "sufficiently behind" the tip refs.

This condition is intentionally vague at the moment. It could be that we 
hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a 
pack, and it probably contains a lot of new commits") or we could create 
some more complicated condition based on counting reachable commits with 
infinite generation number (the number of commits not in the 
commit-graph file).


I like that you are moving forward to make the commit-graph be written 
more frequently, but I'm trying to push us in a direction of writing it 
even more often than your proposed strategy. We should avoid creating 
too many orthogonal conditions that trigger the commit-graph write, 
which is why I'm pushing on your design here.


Anyone else have thoughts on this direction?

Thanks,

-Stolee



Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Ævar Arnfjörð Bjarmason


On Fri, Oct 05 2018, Derrick Stolee wrote:

> On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>> I don't have time to polish this up for submission now, but here's a WIP
>> patch that implements this, highlights:
>>
>>   * There's a gc.clone.autoDetach=false default setting which overrides
>> gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>> --cloning option to indicate this).
>
> I'll repeat that it could make sense to do the same thing on clone
> _and_ fetch. Perhaps a "--post-fetch" flag would be good here to
> communicate that we just downloaded a pack from a remote.

I don't think that makes sense, but let's talk about why, because maybe
I've missed something, you're certainly more familiar with the
commit-graph than I am.

The reason to do it on clone as a special-case or when the file is
missing, is because we know the file is desired (via the GC config), and
presumably is expected to help performance, and we have 0% of it. So by
going from 0% to 100% on clone we'll get fast --contains and other
goodies the graph helps with.

But when we're doing a fetch, or really anything else that runs "git gc
--auto" we can safely assume that we have a recent enough graph, because
it will have been run whenever auto-gc kicked in.

I.e.:

# Slow, if we assume background forked commit-graph generation
# (which I'm avoiding)
git clone x && cd x && git tag --contains
# Fast enough, since we have an existing commit-graph
cd x && git fetch && git tag --contains

I *do* think it might make sense to in general split off parts of "gc
--auto" that we'd like to be more aggressive about, simply because the
ratio of how long it takes to do, and how much it helps with performance
makes more sense than a full repack, which is what the current heuristic
is based on.

And maybe when we run in that mode we should run in the foreground, but
I don't see why git-fetch should be a special case there, and in this
regard, the gc.clone.autoDetach=false setting I've made doesn't make
much sense. I.e. maybe we should also skip forking to the background in
such a mode when we trigger such a "mini gc" via git-commit or whatever.

>>   * A clone of say git.git with gc.writeCommitGraph=true looks like:
>>
>> [...]
>> Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
>> Resolving deltas: 100% (188947/188947), done.
>> Computing commit graph generation numbers: 100% (55210/55210), done.
>
> This looks like good UX. Thanks for the progress here!
>
>>   * The 'git gc --auto' command also knows to (only) run the commit-graph
>> (and space is left for future optimization steps) if general GC isn't
>> needed, but we need "optimization":
>>
>> $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
>> Annotating commits in commit graph: 341229, done.
>> Computing commit graph generation numbers: 100% (165969/165969), done.
>> $
>
> Will this also trigger a full commit-graph rewrite on every 'git
> commit' command?

Nope, because "git commit" can safely be assumed to have some
commit-graph anyway, and I'm just special casing the case where it
doesn't exist.

But if it doesn't exist and you do a "git commit" then "gc --auto" will
be run, and we'll fork to the background and generate it...

>  Or is there some way we can compute the staleness of
> the commit-graph in order to only update if we get too far ahead?
> Previously, this was solved by relying on the auto-GC threshold.

So re the "I don't think that makes sense..." at the start of my E-Mail,
isn't it fine to rely on the default thresholds here, or should we be
more aggressive?

>>   * The patch to gc.c looks less scary with -w, most of it is indenting
>> the existing pack-refs etc. with a "!auto_gc || should_gc" condition.
>>
>>   * I added a commit_graph_exists() exists function and only care if I
>> get ENOENT for the purposes of this gc mode. This would need to be
>> tweaked for the incremental mode Derrick talks about, but if we just
>> set "should_optimize" that'll also work as far as gc --auto is
>> concerned (e.g. on fetch, am etc.)
>
> The incremental mode would operate the same as split-index, which
> means we will still look for .git/objects/info/commit-graph. That file
> may point us to more files.

Ah!
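So the reader presumably chases a chain from the one well-known path; a
guess at the shape, since the incremental format doesn't exist yet:

struct graph_layer {
	char *path;               /* .git/objects/info/commit-graph first */
	struct graph_layer *base; /* file(s) it points to; NULL at bottom */
};

Which would mean an existence check like mine still only needs to stat
the tip file.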

>> +int commit_graph_exists(const char *graph_file)
>> +{
>> +	struct stat st;
>> +	if (stat(graph_file, &st)) {
>> +		if (errno == ENOENT)
>> +			return 0;
>> +		else
>> +			return -1;
>> +	}
>> +	return 1;
>> +}
>> +
>
> This method serves a very similar purpose to
> generation_numbers_enabled(), except your method only cares about the
> file existing. It ignores information like `core.commitGraph`, which
> should keep us from doing anything with the commit-graph file if
> false.
>
> Nothing about your 

Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Derrick Stolee

On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:

I don't have time to polish this up for submission now, but here's a WIP
patch that implements this, highlights:

  * There's a gc.clone.autoDetach=false default setting which overrides
gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
--cloning option to indicate this).


I'll repeat that it could make sense to do the same thing on clone _and_ 
fetch. Perhaps a "--post-fetch" flag would be good here to communicate 
that we just downloaded a pack from a remote.



  * A clone of say git.git with gc.writeCommitGraph=true looks like:

[...]
Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
Resolving deltas: 100% (188947/188947), done.
Computing commit graph generation numbers: 100% (55210/55210), done.


This looks like good UX. Thanks for the progress here!


  * The 'git gc --auto' command also knows to (only) run the commit-graph
(and space is left for future optimization steps) if general GC isn't
needed, but we need "optimization":

$ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
Annotating commits in commit graph: 341229, done.
Computing commit graph generation numbers: 100% (165969/165969), done.
$


Will this also trigger a full commit-graph rewrite on every 'git commit' 
command? Or is there some way we can compute the staleness of the 
commit-graph in order to only update if we get too far ahead? 
Previously, this was solved by relying on the auto-GC threshold.



  * The patch to gc.c looks less scary with -w, most of it is indenting
the existing pack-refs etc. with a "!auto_gc || should_gc" condition.

  * I added a commit_graph_exists() exists function and only care if I
get ENOENT for the purposes of this gc mode. This would need to be
tweaked for the incremental mode Derrick talks about, but if we just
set "should_optimize" that'll also work as far as gc --auto is
concerned (e.g. on fetch, am etc.)


The incremental mode would operate the same as split-index, which means 
we will still look for .git/objects/info/commit-graph. That file may 
point us to more files.



+int commit_graph_exists(const char *graph_file)
+{
+	struct stat st;
+	if (stat(graph_file, &st)) {
+		if (errno == ENOENT)
+			return 0;
+		else
+			return -1;
+	}
+	return 1;
+}
+


This method serves a very similar purpose to 
generation_numbers_enabled(), except your method only cares about the 
file existing. It ignores information like `core.commitGraph`, which 
should keep us from doing anything with the commit-graph file if false.


Nothing about your method is specific to the commit-graph file, since 
you provide a filename as a parameter. It could easily be "int 
file_exists(const char *filename)".
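I.e., as a sketch (dir.c in fact already has a file_exists() helper
along these lines):

#include <sys/stat.h>
#include <errno.h>

int file_exists(const char *filename)
{
	struct stat st;
	if (stat(filename, &st)) {
		if (errno == ENOENT)
			return 0;
		return -1;
	}
	return 1;
}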


Thanks,

-Stolee




Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-05 Thread Junio C Hamano
SZEDER Gábor  writes:

>> git-gc - Cleanup unnecessary files and optimize the local repository
>> 
>> Creating these indexes like the commit-graph falls under "optimize the
>> local repository",
>
> But it doesn't fall under "cleanup unnecessary files", which the
> commit-graph file is, since, strictly speaking, it's purely
> optimization.

I won't be actively engaged in this discussion soon, but I must say
that "git gc" doing "garbage collection" is merely an implementation
detail of optimizing the repository for further use.  And from that
point of view, what needs to be updated is the synopsis of the
git-gc doc.  It states "X and Y" above, but it actually is "Y by
doing X and other things".

I understand your "by definition there is no garbage immediately
after clone" position, and also I would understand if you find it
(perhaps philosophically) disturbing that "git clone" may give users
a suboptimal repository that immediately needs optimizing [*1*].

But that bridge was crossed long time ago ever since pack transfer
was invented.  The data source sends only the pack data stream, and
the receiving end is responsible for spending cycles to build .idx
file.  Theoretically, .pack should be all that is needed---you
should be able to locate any necessary object by parsing the .pack
file every time you open it, and .idx is mere optimization.  You can
think of the .midx and graph files the same way.

I'd consider it a growing pain that these two recent inventions were
and are still built as totally optional and separate features,
requiring completely separate full enumeration of objects in the
repository that needs to happen anyway when we build .idx out of the
received .pack.

I would not be surprised by a future in which the initial index-pack
that is responsible for receiving the incoming pack stream and
storing that in .pack file(s) while creating corresponding .idx
file(s) becomes also responsible for building .midx and graph files
in the same pass, or at least smaller number of passes.  Once we
gain experience and confidence with these new auxiliary files, that
ought to happen naturally.  And at that point, we won't be having
this discussion---we'd all happily run index-pack to receive the
pack data, because that is pretty much the fundamental requirement
to make use of the data.

[Footnote]

*1* Even without considering these recent inventions of auxiliary
files, cloning from a sloppily packed server whose primary focus
is to avoid spending cycles by not computing better deltas will
give the cloner a suboptimal repository.  If we truly want to
have an optimized repository ready to be used after cloning, we
should run an equivalent of "repack -a -d -f" immediately after
"git clone".



[RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-04 Thread Ævar Arnfjörð Bjarmason


On Wed, Oct 03 2018, Ævar Arnfjörð Bjarmason wrote:

> Don't have time to patch this now, but thought I'd send a note / RFC
> about this.
>
> Now that we have the commit graph it's nice to be able to set
> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
> /etc/gitconfig to apply them to all repos.
>
> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
> until whenever my first "gc" kicks in, which may be quite some time if
> I'm just using it passively.
>
> So we should make "git gc --auto" be run on clone, and change the
> need_to_gc() / cmd_gc() behavior so that we detect that the
> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
> then just generate that without doing a full repack.
>
> As an aside such more granular "gc" would be nice for e.g. pack-refs
> too. It's possible for us to just have one pack, but to have 100k loose
> refs.
>
> It might also be good to have some gc.autoDetachOnClone option and have
> it false by default, so we don't have a race condition where "clone
> linux && git -C linux tag --contains" is slow because the graph hasn't
> been generated yet, and generating the graph initially doesn't take that
> long compared to the time to clone a large repo (and on a small one it
> won't matter either way).
>
> I was going to say "also for midx", but of course after clone we have
> just one pack, so I can't imagine us needing this. But I can see us
> having other such optional side-indexes in the future generated by gc,
> and they'd also benefit from this.

I don't have time to polish this up for submission now, but here's a WIP
patch that implements this, highlights:

 * There's a gc.clone.autoDetach=false default setting which overrides
   gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
   --cloning option to indicate this).

 * A clone of say git.git with gc.writeCommitGraph=true looks like:

   [...]
   Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
   Resolving deltas: 100% (188947/188947), done.
   Computing commit graph generation numbers: 100% (55210/55210), done.

 * The 'git gc --auto' command also knows to (only) run the commit-graph
   (and space is left for future optimization steps) if general GC isn't
   needed, but we need "optimization":

   $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
   Annotating commits in commit graph: 341229, done.
   Computing commit graph generation numbers: 100% (165969/165969), done.
   $

 * The patch to gc.c looks less scary with -w, most of it is indenting
   the existing pack-refs etc. with a "!auto_gc || should_gc" condition.

 * I added a commit_graph_exists() exists function and only care if I
   get ENOENT for the purposes of this gc mode. This would need to be
   tweaked for the incremental mode Derrick talks about, but if we just
   set "should_optimize" that'll also work as far as gc --auto is
   concerned (e.g. on fetch, am etc.)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1546833213..5759fbb067 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1621,7 +1621,19 @@ gc.autoPackLimit::

 gc.autoDetach::
Make `git gc --auto` return immediately and run in background
-   if the system supports it. Default is true.
+   if the system supports it. Default is true. Overridden by
+   `gc.clone.autoDetach` when running linkgit:git-clone[1].
+
+gc.clone.autoDetach::
+   Make `git gc --auto` return immediately and run in background
+   if the system supports it when run via
+   linkgit:git-clone[1]. Default is false.
++
+The reason this defaults to false is that the only time we'll have
+work to do after a 'git clone' is if something like
+`gc.writeCommitGraph` is true; in that case we'd like to compute the
+optimized file before returning, so that commands that benefit from
+the commit graph aren't slow until it's generated in the background.

 gc.bigPackThreshold::
If non-zero, all packs larger than this limit are kept when
diff --git a/builtin/clone.c b/builtin/clone.c
index 15b142d646..824c130ba5 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -897,6 +897,8 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
struct remote *remote;
int err = 0, complete_refs_before_fetch = 1;
int submodule_progress;
+   const char *argv_gc_auto[]   = {"gc", "--auto", "--cloning", NULL};
+   const char *argv_gc_auto_quiet[] = {"gc", "--auto", "--cloning", "--quiet", NULL};

struct refspec rs = REFSPEC_INIT_FETCH;
struct argv_array ref_prefixes = ARGV_ARRAY_INIT;
@@ -1245,5 +1247,11 @@ int cmd_clone(int argc, const char **argv, const char *prefix)

refspec_clear(&rs);
argv_array_clear(&ref_prefixes);
+
+   if (0 <= option_verbosity)
+   

Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Ævar Arnfjörð Bjarmason


On Wed, Oct 03 2018, Jeff King wrote:

> On Wed, Oct 03, 2018 at 12:08:15PM -0700, Stefan Beller wrote:
>
>> I share these concerns in a slightly more abstract way, as
>> I would bucket the actions into two separate bins:
>>
>> One bin that throws away information.
>> this would include removing expired reflog entries (which
>> I do not think are garbage, or collection thereof), but their
>> usefulness is questionable.
>>
>> The other bin would be actions that optimize but
>> do not throw away any information, repacking (without
>> dropping files) would be part of it, or the new
>> "write additional files".
>>
>> Maybe we can move all actions of the second bin into a new
>> "git optimize" command, and git gc would do first the "throw away
>> things" and then the optimize action, whereas clone would only
>> go for the second optimizing part?
>
> One problem with that world-view is that some of the operations do
> _both_, for efficiency. E.g., repacking will drop unreachable objects in
> too-old packs. We could actually be more aggressive in combining things
> here. For instance, a full object graph walk in linux.git takes 30-60
> seconds, depending on your CPU. But we do it at least twice during a gc:
> once to repack, and then again to determine reachability for pruning.
>
> If you generate bitmaps during the repack step, you can use them during
> the prune step. But by itself, the cost of generating the bitmaps
> generally outweighs the extra walk. So it's not worth generating them
> _just_ for this (but is an obvious optimization for a server which would
> be generating them anyway).

I don't mean to fan the flames of this obviously controversial "git gc
does optimization" topic (which I didn't suspect there would be a debate
about...), but a related thing I was wondering about the other day is
whether we could have a gc.fsck option, and in the background do fsck
while we were at it, and report this back via some facility like
gc.log[1].

That would also fall into this category of more work we could do while
we're doing a full walk anyway, but as with what you're suggesting would
require some refactoring.
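Just to sketch what I mean (the gc_fsck_enabled flag is a hypothetical
gc.fsck config, and a real version would append to gc.log rather than
warn):

static const char *argv_fsck[] = {"fsck", NULL};

static void maybe_fsck_after_gc(int gc_fsck_enabled)
{
	if (!gc_fsck_enabled)
		return;
	/* we'd already be detached in the background at this point */
	if (run_command_v_opt(argv_fsck, RUN_GIT_CMD))
		warning("fsck reported problems; see gc.log");
}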

1. Well, one that doesn't suck, see
   https://public-inbox.org/git/87inc89j38@evledraar.gmail.com/ /
   https://public-inbox.org/git/87d0vmck55@evledraar.gmail.com/ etc.


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Jeff King
On Wed, Oct 03, 2018 at 12:08:15PM -0700, Stefan Beller wrote:

> I share these concerns in a slightly more abstract way, as
> I would bucket the actions into two separate bins:
> 
> One bin that throws away information.
> this would include removing expired reflog entries (which
> I do not think are garbage, or collection thereof), but their
> usefulness is questionable.
> 
> The other bin would be actions that optimize but
> do not throw away any information, repacking (without
> dropping files) would be part of it, or the new
> "write additional files".
> 
> Maybe we can move all actions of the second bin into a new
> "git optimize" command, and git gc would do first the "throw away
> things" and then the optimize action, whereas clone would only
> go for the second optimizing part?

One problem with that world-view is that some of the operations do
_both_, for efficiency. E.g., repacking will drop unreachable objects in
too-old packs. We could actually be more aggressive in combining things
here. For instance, a full object graph walk in linux.git takes 30-60
seconds, depending on your CPU. But we do it at least twice during a gc:
once to repack, and then again to determine reachability for pruning.

If you generate bitmaps during the repack step, you can use them during
the prune step. But by itself, the cost of generating the bitmaps
generally outweighs the extra walk. So it's not worth generating them
_just_ for this (but is an obvious optimization for a server which would
be generating them anyway).

-Peff


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Jeff King
On Wed, Oct 03, 2018 at 02:59:34PM -0400, Derrick Stolee wrote:

> > They don't help yet, and there's no good reason to enable bitmaps for
> > clients. I have a few patches that use bitmaps for things like
> > ahead/behind and --contains checks, but the utility of those may be
> > lessened quite a bit by Stolee's commit-graph work.  And if it isn't,
> > I'm mildly in favor of replacing the existing .bitmap format with
> > something better integrated with commit-graphs (which would give us an
> > opportunity to clean up some of the rough edges).
> 
> If the commit-graph doesn't improve enough on those applications, then we
> could consider adding a commit-to-commit reachability bitmap inside the
> commit-graph. ;)

That unfortunately wouldn't be enough for us to ditch the existing
.bitmap files, since we need full object reachability for some cases
(including packing). And commit-to-commit reachability is a trivial
subset of that. I'm not sure if it would be better to just leave
.bitmaps in place as a server-side thing, and grow a new thing for
commit-to-commit reachability (since it would presumably be easier).

I'm still excited about the prospect of a bloom filter for paths which
each commit touches. I think that's the next big frontier in getting
things like "git log -- path" to a reasonable run-time.

-Peff


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Stefan Beller
>
> But you thought right, I do have an objection against that.  'git gc'
> should, well, collect garbage.  Any non-gc stuff is already violating
> separation of concerns.

I share these concerns in a slightly more abstract way, as
I would bucket the actions into two separate bins:

One bin that throws away information.
this would include removing expired reflog entries (which
I do not think are garbage, or collection thereof), but their
usefulness is questionable.

The other bin would be actions that optimize but
do not throw away any information, repacking (without
dropping files) would be part of it, or the new
"write additional files".

Maybe we can move all actions of the second bin into a new
"git optimize" command, and git gc would do first the "throw away
things" and then the optimize action, whereas clone would only
go for the second optimizing part?


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Derrick Stolee

On 10/3/2018 2:51 PM, Jeff King wrote:

On Wed, Oct 03, 2018 at 08:47:11PM +0200, Ævar Arnfjörð Bjarmason wrote:


On Wed, Oct 03 2018, Stefan Beller wrote:


So we wouldn't be spending 5 minutes repacking linux.git right after
cloning it, just ~10s generating the commit graph, and the same would
happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
which would kick off "gc --auto" in the background and do the same thing.

Or generating local bitmaps or pack idx files as well?

I'm less familiar with this area, but when I clone I get a pack *.idx
file, why does it need to be regenerated?

But yeah, in principle this would be a sensible addition, but I'm not
aware of cases where clients get significant benefits from bitmaps (see
https://githubengineering.com/counting-objects/), and I don't turn it on
for clients, but maybe I've missed something.

They don't help yet, and there's no good reason to enable bitmaps for
clients. I have a few patches that use bitmaps for things like
ahead/behind and --contains checks, but the utility of those may be
lessened quite a bit by Stolee's commit-graph work.  And if it isn't,
I'm mildly in favor of replacing the existing .bitmap format with
something better integrated with commit-graphs (which would give us an
opportunity to clean up some of the rough edges).


If the commit-graph doesn't improve enough on those applications, then 
we could consider adding a commit-to-commit reachability bitmap inside 
the commit-graph. ;)


-Stolee


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Jeff King
On Wed, Oct 03, 2018 at 08:47:11PM +0200, Ævar Arnfjörð Bjarmason wrote:

> 
> On Wed, Oct 03 2018, Stefan Beller wrote:
> 
> >> So we wouldn't be spending 5 minutes repacking linux.git right after
> >> cloning it, just ~10s generating the commit graph, and the same would
> >> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
> >> which would kick off "gc --auto" in the background and do the same thing.
> >
> > Or generating local bitmaps or pack idx files as well?
> 
> I'm less familiar with this area, but when I clone I get a pack *.idx
> file, why does it need to be regenerated?
> 
> But yeah, in principle this would be a sensible addition, but I'm not
> aware of cases where clients get significant benefits from bitmaps (see
> https://githubengineering.com/counting-objects/), and I don't turn it on
> for clients, but maybe I've missed something.

They don't help yet, and there's no good reason to enable bitmaps for
clients. I have a few patches that use bitmaps for things like
ahead/behind and --contains checks, but the utility of those may be
lessened quite a bit by Stolee's commit-graph work.  And if it isn't,
I'm mildly in favor of replacing the existing .bitmap format with
something better integrated with commit-graphs (which would give us an
opportunity to clean up some of the rough edges).

-Peff


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Ævar Arnfjörð Bjarmason


On Wed, Oct 03 2018, Stefan Beller wrote:

>> So we wouldn't be spending 5 minutes repacking linux.git right after
>> cloning it, just ~10s generating the commit graph, and the same would
>> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
>> which would kick off "gc --auto" in the background and do the same thing.
>
> Or generating local bitmaps or pack idx files as well?

I'm less familiar with this area, but when I clone I get a pack *.idx
file, why does it need to be regenerated?

But yeah, in principle this would be a sensible addition, but I'm not
aware of cases where clients get significant benefits from bitmaps (see
https://githubengineering.com/counting-objects/), and I don't turn it on
for clients, but maybe I've missed something.


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Stefan Beller
> So we wouldn't be spending 5 minutes repacking linux.git right after
> cloning it, just ~10s generating the commit graph, and the same would
> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
> which would kick off "gc --auto" in the background and do the same thing.

Or generating local bitmaps or pack idx files as well?


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread SZEDER Gábor
On Wed, Oct 03, 2018 at 05:19:41PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> >> >> So we should make "git gc --auto" be run on clone,
> >> >> >
> >> >> > There is no garbage after 'git clone'...
> >> >>
> >> >> "git gc" is really "git gc-or-create-indexes" these days.
> >> >
> >> > Because it happens to be convenient to create those indexes at
> >> > gc-time.  But that should not be an excuse to run gc when by
> >> > definition no gc is needed.
> >>
> >> Ah, I thought you just had an objection to the "gc" name being used for
> >> non-gc stuff,
> >
> > But you thought right, I do have an objection against that.  'git gc'
> > should, well, collect garbage.  Any non-gc stuff is already violating
> > separation of concerns.
> 
> Ever since git-gc was added back in 30f610b7b0 ("Create 'git gc' to
> perform common maintenance operations.", 2006-12-27) it has been
> described as:
> 
> git-gc - Cleanup unnecessary files and optimize the local repository
> 
> Creating these indexes like the commit-graph falls under "optimize the
> local repository",

But it doesn't fall under "cleanup unnecessary files", which the
commit-graph file is, since, strictly speaking, it's purely
optimization.

That description came about, because cleaning up unnecessary files,
notably combining lots of loose refs into a single packed-refs file
and combining lots of loose objects and pack files into a single pack
file, could not only make the repository smaller (barring too many
exploding unreachable objects), but, as it turned out, could also make
Git operations in that repository faster.

To me, the main goal of the command is cleanup.  Optimization, however
beneficial, is its side effect, and I assume the "optimize" part was
added to the description mainly to inform and "encourage" users.
After all, the command is called 'git gc', not 'git optimize-repo'.

> and 3rd party tools (e.g. the repo tool doing this
> came up on list recently) have been calling "gc --auto" with this
> assumption.
> 
> >>  but if you mean we shouldn't do a giant repack right after
> >> clone I agree.
> >
> > And, I also mean that since 'git clone' knows that there can't
> > possibly be any garbage in the first place, then it shouldn't call 'gc
> > --auto' at all.  However, since it also knows that there is a lot of
> > new stuff, then it should create a commit-graph if enabled.
> 
> Is this something you think just because the tool isn't called
> git-gc-and-optimize, or do you think this regardless of what it's
> called?

Well, that still has 'gc' in it...

> I don't see how splitting up the entry points for "detect if we need to
> cleanup or optimize the repo" leaves us with a better codebase for the
> reasons noted in
> https://public-inbox.org/git/87pnwrgll2@evledraar.gmail.com/

Such a separation would be valuable for those having gc.auto = 0 in
their config.  Or, in general, to have a clearly marked entry point to
update all the enabled "purely-optimization" files without 'gc'
exploding a bunch of "just-became-unreachable" objects from deleted
reflog entries and packfiles, or without performing a comparatively
expensive repacking.  Note the "clearly marked"; I don't think
teaching 'gc [--auto]' various tricks to only create/update these
files without doing what it is fundamentally supposed to do qualifies
for that.




Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Duy Nguyen
On Wed, Oct 3, 2018 at 3:23 PM Ævar Arnfjörð Bjarmason  wrote:
>
> Don't have time to patch this now, but thought I'd send a note / RFC
> about this.
>
> Now that we have the commit graph it's nice to be able to set
> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
> /etc/gitconfig to apply them to all repos.
>
> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
> until whenever my first "gc" kicks in, which may be quite some time if
> I'm just using it passively.

Since you have core.hooksPath, you can already force gc (even in
detached mode) in post-checkout hook. I'm adding a new
"post-init-repo" hook to allow more customization right after clone or
init-db which may also be useful for other things than gc.
-- 
Duy


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Ævar Arnfjörð Bjarmason


On Wed, Oct 03 2018, SZEDER Gábor wrote:

> On Wed, Oct 03, 2018 at 04:22:12PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, Oct 03 2018, SZEDER Gábor wrote:
>>
>> > On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >>
>> >> On Wed, Oct 03 2018, SZEDER Gábor wrote:
>> >>
>> >> > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >> >> Don't have time to patch this now, but thought I'd send a note / RFC
>> >> >> about this.
>> >> >>
>> >> >> Now that we have the commit graph it's nice to be able to set
>> >> >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig 
>> >> >> or
>> >> >> /etc/gitconfig to apply them to all repos.
>> >> >>
>> >> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be 
>> >> >> slow
>> >> >> until whenever my first "gc" kicks in, which may be quite some time if
>> >> >> I'm just using it passively.
>> >> >>
>> >> >> So we should make "git gc --auto" be run on clone,
>> >> >
>> >> > There is no garbage after 'git clone'...
>> >>
>> >> "git gc" is really "git gc-or-create-indexes" these days.
>> >
>> > Because it happens to be convenient to create those indexes at
>> > gc-time.  But that should not be an excuse to run gc when by
>> > definition no gc is needed.
>>
>> Ah, I thought you just had an objection to the "gc" name being used for
>> non-gc stuff,
>
> But you thought right, I do have an objection against that.  'git gc'
> should, well, collect garbage.  Any non-gc stuff is already violating
> separation of concerns.

Ever since git-gc was added back in 30f610b7b0 ("Create 'git gc' to
perform common maintenance operations.", 2006-12-27) it has been
described as:

git-gc - Cleanup unnecessary files and optimize the local repository

Creating these indexes like the commit-graph falls under "optimize the
local repository", and 3rd party tools (e.g. the repo tool doing this
came up on list recently) have been calling "gc --auto" with this
assumption.

>>  but if you mean we shouldn't do a giant repack right after
>> clone I agree.
>
> And, I also mean that since 'git clone' knows that there can't
> possibly be any garbage in the first place, then it shouldn't call 'gc
> --auto' at all.  However, since it also knows that there is a lot of
> new stuff, then it should create a commit-graph if enabled.

Is this something you think just because the tool isn't called
git-gc-and-optimize, or do you think this regardless of what it's
called?

I don't see how splitting up the entry points for "detect if we need to
cleanup or optimize the repo" leaves us with a better codebase for the
reasons noted in
https://public-inbox.org/git/87pnwrgll2@evledraar.gmail.com/


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread SZEDER Gábor
On Wed, Oct 03, 2018 at 04:22:12PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Oct 03 2018, SZEDER Gábor wrote:
> 
> > On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> On Wed, Oct 03 2018, SZEDER Gábor wrote:
> >>
> >> > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> >> Don't have time to patch this now, but thought I'd send a note / RFC
> >> >> about this.
> >> >>
> >> >> Now that we have the commit graph it's nice to be able to set
> >> >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
> >> >> /etc/gitconfig to apply them to all repos.
> >> >>
> >> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
> >> >> until whenever my first "gc" kicks in, which may be quite some time if
> >> >> I'm just using it passively.
> >> >>
> >> >> So we should make "git gc --auto" be run on clone,
> >> >
> >> > There is no garbage after 'git clone'...
> >>
> >> "git gc" is really "git gc-or-create-indexes" these days.
> >
> > Because it happens to be convenient to create those indexes at
> > gc-time.  But that should not be an excuse to run gc when by
> > definition no gc is needed.
> 
> Ah, I thought you just had an objection to the "gc" name being used for
> non-gc stuff,

But you thought right, I do have an objection against that.  'git gc'
should, well, collect garbage.  Any non-gc stuff is already violating
separation of concerns.

>  but if you mean we shouldn't do a giant repack right after
> clone I agree.

And, I also mean that since 'git clone' knows that there can't
possibly be any garbage in the first place, then it shouldn't call 'gc
--auto' at all.  However, since it also knows that there is a lot of
new stuff, then it should create a commit-graph if enabled.

> I meant that "gc --auto" would learn to do a subset of
> its work, instead of the current "I have work to do, let's do all of
> pack-refs/repack/commit-graph etc.".
> 
> So we wouldn't be spending 5 minutes repacking linux.git right after
> cloning it, just ~10s generating the commit graph, and the same would
> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
> which would kick off "gc --auto" in the background and do the same thing.


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Duy Nguyen
On Wed, Oct 3, 2018 at 4:01 PM Ævar Arnfjörð Bjarmason  wrote:
> >> and change the
> >> need_to_gc() / cmd_gc() behavior so that we detect that the
> >> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
> >> then just generate that without doing a full repack.
> >
> > Or just teach 'git clone' to run 'git commit-graph write ...'
>
> Then when adding something like the commit graph we'd need to patch both
> git-clone and git-gc, it's much more straightforward to make
> need_to_gc() more granular.

It is straightforward and misleading. If we organize the code well,
patching both would not take much more effort and it reduces wtf?
moments reading the code.
-- 
Duy


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Ævar Arnfjörð Bjarmason


On Wed, Oct 03 2018, SZEDER Gábor wrote:

> On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, Oct 03 2018, SZEDER Gábor wrote:
>>
>> > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >> Don't have time to patch this now, but thought I'd send a note / RFC
>> >> about this.
>> >>
>> >> Now that we have the commit graph it's nice to be able to set
>> >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
>> >> /etc/gitconfig to apply them to all repos.
>> >>
>> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
>> >> until whenever my first "gc" kicks in, which may be quite some time if
>> >> I'm just using it passively.
>> >>
>> >> So we should make "git gc --auto" be run on clone,
>> >
>> > There is no garbage after 'git clone'...
>>
>> "git gc" is really "git gc-or-create-indexes" these days.
>
> Because it happens to be convenient to create those indexes at
> gc-time.  But that should not be an excuse to run gc when by
> definition no gc is needed.

Ah, I thought you just had an objection to the "gc" name being used for
non-gc stuff, but if you mean we shouldn't do a giant repack right after
clone I agree. I meant that "gc --auto" would learn to do a subset of
its work, instead of the current "I have work to do, let's do all of
pack-refs/repack/commit-graph etc.".

So we wouldn't be spending 5 minutes repacking linux.git right after
cloning it, just ~10s generating the commit graph, and the same would
happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
which would kick off "gc --auto" in the background and do the same thing.


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Ævar Arnfjörð Bjarmason


On Wed, Oct 03 2018, Derrick Stolee wrote:

> On 10/3/2018 9:36 AM, SZEDER Gábor wrote:
>> On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>> Don't have time to patch this now, but thought I'd send a note / RFC
>>> about this.
>>>
>>> Now that we have the commit graph it's nice to be able to set
>>> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
>>> /etc/gitconfig to apply them to all repos.
>>>
>>> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
>>> until whenever my first "gc" kicks in, which may be quite some time if
>>> I'm just using it passively.
>>>
>>> So we should make "git gc --auto" be run on clone,
>> There is no garbage after 'git clone'...
>
> And since there is no garbage, the gc will not write the commit-graph.

I should probably have replied to this instead of SZEDER's in
https://public-inbox.org/git/87r2h7gmd7@evledraar.gmail.com/ anyway
my 0.02 on that there.

>>
>>> and change the
>>> need_to_gc() / cmd_gc() behavior so that we detect that the
>>> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
>>> then just generate that without doing a full repack.
>> Or just teach 'git clone' to run 'git commit-graph write ...'
>
> I plan to add a 'fetch.writeCommitGraph' config setting. I was waiting
> until the file is incremental (on my to-do list soon), so the write is
> fast when only adding a few commits at a time. This would cover the
> clone case, too.

It's re-arranging deck chairs on the Titanic at this point, but this
approach seems like the wrong way to go in this whole "do we have crap
to do?" git-gc state-machine.

In my mind we should have only one entry point into that, and there
shouldn't be magic like "here's the gc-ish stuff we do on
fetch". Because if we care about a bunch of new commits being added on
"fetch", that can also happen on "commit", "am", "merge", all of which
run "gc --auto" now.

Which is why I'm suggesting that we could add a sub-mode in need_to_gc()
that detects if a file we want to generate is entirely missing, which is
extendable to future formats, and the only caveat at that point is if
we'd like that subset of "gc" to block and run in the foreground in the
"clone" (or "fetch", ...) case.

And then if we have a desire to incrementally add recently added commits
to such formats, "gc --auto" could learn to consume reflogs or some
other general inventory of "stuff added since last gc", and then we
wouldn't have to instrument "fetch" specifically, the same would work
for "commit", "am", "merge" etc.


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread SZEDER Gábor
On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Oct 03 2018, SZEDER Gábor wrote:
> 
> > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> Don't have time to patch this now, but thought I'd send a note / RFC
> >> about this.
> >>
> >> Now that we have the commit graph it's nice to be able to set
> >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
> >> /etc/gitconfig to apply them to all repos.
> >>
> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
> >> until whenever my first "gc" kicks in, which may be quite some time if
> >> I'm just using it passively.
> >>
> >> So we should make "git gc --auto" be run on clone,
> >
> > There is no garbage after 'git clone'...
> 
> "git gc" is really "git gc-or-create-indexes" these days.

Because it happens to be convenient to create those indexes at
gc-time.  But that should not be an excuse to run gc when by
definition no gc is needed.

> >> and change the
> >> need_to_gc() / cmd_gc() behavior so that we detect that the
> >> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
> >> then just generate that without doing a full repack.
> >
> > Or just teach 'git clone' to run 'git commit-graph write ...'
> 
> Then when adding something like the commit graph we'd need to patch both
> git-clone and git-gc, it's much more straightforward to make
> need_to_gc() more granular.
> 
> >> As an aside such more granular "gc" would be nice for e.g. pack-refs
> >> too. It's possible for us to just have one pack, but to have 100k loose
> >> refs.
> >>
> >> It might also be good to have some gc.autoDetachOnClone option and have
> >> it false by default, so we don't have a race condition where "clone
> >> linux && git -C linux tag --contains" is slow because the graph hasn't
> >> been generated yet, and generating the graph initially doesn't take that
> >> long compared to the time to clone a large repo (and on a small one it
> >> won't matter either way).
> >>
> >> I was going to say "also for midx", but of course after clone we have
> >> just one pack, so I can't imagine us needing this. But I can see us
> >> having other such optional side-indexes in the future generated by gc,
> >> and they'd also benefit from this.
> >>
> >> #leftoverbits


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Ævar Arnfjörð Bjarmason


On Wed, Oct 03 2018, SZEDER Gábor wrote:

> On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> Don't have time to patch this now, but thought I'd send a note / RFC
>> about this.
>>
>> Now that we have the commit graph it's nice to be able to set
>> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
>> /etc/gitconfig to apply them to all repos.
>>
>> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
>> until whenever my first "gc" kicks in, which may be quite some time if
>> I'm just using it passively.
>>
>> So we should make "git gc --auto" be run on clone,
>
> There is no garbage after 'git clone'...

"git gc" is really "git gc-or-create-indexes" these days.

>> and change the
>> need_to_gc() / cmd_gc() behavior so that we detect that the
>> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
>> then just generate that without doing a full repack.
>
> Or just teach 'git clone' to run 'git commit-graph write ...'

Then when adding something like the commit graph we'd need to patch both
git-clone and git-gc, it's much more straightforward to make
need_to_gc() more granular.

>> As an aside such more granular "gc" would be nice for e.g. pack-refs
>> too. It's possible for us to just have one pack, but to have 100k loose
>> refs.
>>
>> It might also be good to have some gc.autoDetachOnClone option and have
>> it false by default, so we don't have a race condition where "clone
>> linux && git -C linux tag --contains" is slow because the graph hasn't
>> been generated yet, and generating the graph initially doesn't take that
>> long compared to the time to clone a large repo (and on a small one it
>> won't matter either way).
>>
>> I was going to say "also for midx", but of course after clone we have
>> just one pack, so I can't imagine us needing this. But I can see us
>> having other such optional side-indexes in the future generated by gc,
>> and they'd also benefit from this.
>>
>> #leftoverbits


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Derrick Stolee

On 10/3/2018 9:36 AM, SZEDER Gábor wrote:

On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:

Don't have time to patch this now, but thought I'd send a note / RFC
about this.

Now that we have the commit graph it's nice to be able to set
e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
/etc/gitconfig to apply them to all repos.

But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
until whenever my first "gc" kicks in, which may be quite some time if
I'm just using it passively.

So we should make "git gc --auto" be run on clone,

There is no garbage after 'git clone'...


And since there is no garbage, the gc will not write the commit-graph.




>> and change the
>> need_to_gc() / cmd_gc() behavior so that we detect that the
>> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
>> then just generate that without doing a full repack.
>
> Or just teach 'git clone' to run 'git commit-graph write ...'


I plan to add a 'fetch.writeCommitGraph' config setting. I was waiting
until the file can be written incrementally (on my to-do list soon), so
the write is fast when only adding a few commits at a time. This would
cover the clone case, too.
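
Until such a setting exists, the same effect can be approximated by hand
after a fetch; note that this rewrites the whole file each time, which
is exactly why the incremental write matters:

    $ git fetch origin
    $ git commit-graph write --reachable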


Thanks,
-Stolee


Re: We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread SZEDER Gábor
On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
> Don't have time to patch this now, but thought I'd send a note / RFC
> about this.
> 
> Now that we have the commit graph, it's nice to be able to set
> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
> /etc/gitconfig to apply them to all repos.
> 
> But when I clone e.g. linux.git, stuff like 'tag --contains' will be slow
> until whenever my first "gc" kicks in, which may be quite some time if
> I'm just using it passively.
> 
> So we should make "git gc --auto" be run on clone,

There is no garbage after 'git clone'...

> and change the
> need_to_gc() / cmd_gc() behavior so that we detect that the
> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
> then just generate that without doing a full repack.

Or just teach 'git clone' to run 'git commit-graph write ...'
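
Which, from the command line, would be just (URL illustrative):

    $ git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    $ git -C linux commit-graph write --reachable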

> As an aside, such a more granular "gc" would be nice for e.g. pack-refs
> too. It's possible for us to just have one pack, but to have 100k loose
> refs.
> 
> It might also be good to have some gc.autoDetachOnClone option and have
> it false by default, so we don't have a race condition where "clone
> linux && git -C linux tag --contains" is slow because the graph hasn't
> been generated yet, and generating the graph initially doesn't take that
> long compared to the time to clone a large repo (and on a small one it
> won't matter either way).
> 
> I was going to say "also for midx", but of course after clone we have
> just one pack, so I can't imagine us needing this. But I can see us
> having other such optional side-indexes in the future generated by gc,
> and they'd also benefit from this.
> 
> #leftoverbits


We should add a "git gc --auto" after "git clone" due to commit graph

2018-10-03 Thread Ævar Arnfjörð Bjarmason
Don't have time to patch this now, but thought I'd send a note / RFC
about this.

Now that we have the commit graph, it's nice to be able to set
e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
/etc/gitconfig to apply them to all repos.
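
I.e., assuming Git 2.18 or later, something like:

    $ git config --global core.commitGraph true
    $ git config --global gc.writeCommitGraph true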

But when I clone e.g. linux.git, stuff like 'tag --contains' will be slow
until whenever my first "gc" kicks in, which may be quite some time if
I'm just using it passively.
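
The gap is dramatic; these numbers are only illustrative of the order of
magnitude (tag name assumed), not measurements:

    $ time git tag --contains v4.9 >/dev/null    # no graph: tens of seconds
    $ git commit-graph write --reachable
    $ time git -c core.commitGraph=true tag --contains v4.9 >/dev/null
                                                 # with graph: well under a second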

So we should make "git gc --auto" be run on clone, and change the
need_to_gc() / cmd_gc() behavior so that we detect that the
gc.writeCommitGraph=true setting is on, but we have no commit graph, and
then just generate that without doing a full repack.

As an aside, such a more granular "gc" would be nice for e.g. pack-refs
too. It's possible for us to just have one pack, but to have 100k loose
refs.
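
(The loose-ref case can already be handled by hand; counts illustrative:)

    $ find .git/refs -type f | wc -l    # loose refs
    100000
    $ git pack-refs --all
    $ find .git/refs -type f | wc -l
    0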

It might also be good to have some gc.autoDetachOnClone option and have
it false by default, so we don't have a race condition where "clone
linux && git -C linux tag --contains" is slow because the graph hasn't
been generated yet, and generating the graph initially doesn't take that
long compared to the time to clone a large repo (and on a small one it
won't matter either way).
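
With today's knobs the closest approximation is to force a foreground gc
right after the clone, at the cost of a full repack (transcript
illustrative; assumes core.commitGraph=true as above):

    $ git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    $ git -C linux -c gc.writeCommitGraph=true gc   # runs in the foreground
    $ git -C linux tag --contains v4.9              # no longer racy or slow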

I was going to say "also for midx", but of course after clone we have
just one pack, so I can't imagine us needing this. But I can see us
having other such optional side-indexes in the future generated by gc,
and they'd also benefit from this.

#leftoverbits