Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 10 2018, SZEDER Gábor wrote:

> On Thu, Oct 04, 2018 at 11:09:58PM -0700, Junio C Hamano wrote:
>> SZEDER Gábor writes:
>>
>>>>> git-gc - Cleanup unnecessary files and optimize the local repository
>>>>
>>>> Creating these indexes like the commit-graph falls under "optimize
>>>> the local repository",
>>>
>>> But it doesn't fall under "cleanup unnecessary files", which the
>>> commit-graph file is, since, strictly speaking, it's purely
>>> optimization.
>>
>> I won't be actively engaged in this discussion soon, but I must say
>> that "git gc" doing "garbage collection" is merely an implementation
>> detail of optimizing the repository for further use. And from that
>> point of view, what needs to be updated is the synopsis of the
>> git-gc doc. It states "X and Y" above, but it actually is "Y by
>> doing X and other things".
>
> Well, then perhaps the name of the command should be updated, too, to
> better reflect what it actually does...

I don't disagree, but between "git gc" being a longstanding thing called
not just by us, but by third parties expecting it to be a general swiss
army knife for "make repo better" (so for a new tool they'd need to
update their code), and general name bikeshedding, I think it's best if
we just proceed for the sake of argument with the assumption that none
of us find the name confusing in this context.

At least that's the discussion I'm interested in. I.e. whether it makes
conceptual / structural sense for this stuff to live in the same place /
function in the code. We can always argue about the name of the function
as a separate topic.

>> I understand your "by definition there is no garbage immediately
>> after clone" position, and also I would understand if you find it
>> (perhaps philosophically) disturbing that "git clone" may give users
>> a suboptimal repository that immediately needs optimizing [*1*].
>>
>> But that bridge was crossed long time ago ever since pack transfer
>> was invented. The data source sends only the pack data stream, and
>> the receiving end is responsible for spending cycles to build .idx
>> file. Theoretically, .pack should be all that is needed---you
>> should be able to locate any necessary object by parsing the .pack
>> file every time you open it, and .idx is mere optimization. You can
>> think of the .midx and graph files the same way.
>
> I don't think this is a valid comparison, because, practically, Git
> just didn't work after I deleted all pack index files. So while they
> can be easily (re)generated, they are essential to make pack files
> usable. The commit-graph and .midx files, however, can be safely
> deleted, and everything keeps working as before.

For "things that would run in 20ms now run in 30 seconds" (actual
numbers on a repo I have) values of "keeps working". So I think this
line gets blurred somewhat. In practice, if you're expecting the graph
to be there to run the sort of commands that most benefit from it, it's
essential that it be generated, not some nice optional extra.

> OTOH, this is an excellent comparison, and I do think of the .midx and
> graph files the same way as the pack index files. During a clone, the
> pack index file isn't generated by running a separate 'git gc
> (--auto)', but by clone (or fetch-pack?) running 'git index-pack'.
> The way I see it that should be the case for these other files as well.
>
> And it is much simpler, shorter, and cleaner to either run 'git
> commit-graph ...' or even to call write_commit_graph_reachable()
> directly from cmd_clone(), than to bolt on another option and config
> variable on 'git gc' [1] to coax it into some kind of an "after clone"
> mode, that it shouldn't be doing in the first place. At least for
> now, so when we'll eventually get as far ...
>
>> I would not be surprised by a future in which the initial index-pack
>> that is responsible for receiving the incoming pack stream and
>> storing that in .pack file(s) while creating corresponding .idx
>> file(s) becomes also responsible for building .midx and graph files
>> in the same pass, or at least smaller number of passes. Once we
>> gain experience and confidence with these new auxiliary files, that
>> ought to happen naturally. And at that point, we won't be having
>> this discussion---we'd all happily run index-pack to receive the
>> pack data, because that is pretty much the fundamental requirement
>> to make use of the data.
>
> ... that what you wrote here becomes a reality (and I fully agree that
> this is what we should ultimately aim for), then we won't have that
> option and config variable still lying around and requiring
> maintenance because of backwards compatibility.

We'll still have the use-case of wanting to turn on
gc.writeCommitGraph=true or equivalent and wanting previously-cloned
repositories to "catch up" and get a commit graph ASAP (but not do a
full repack). This is why my patch tries to unify those two codepaths,
i.e. so I can turn this on
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Thu, Oct 04, 2018 at 11:09:58PM -0700, Junio C Hamano wrote:
> SZEDER Gábor writes:
>
>>>> git-gc - Cleanup unnecessary files and optimize the local repository
>>>
>>> Creating these indexes like the commit-graph falls under "optimize the
>>> local repository",
>>
>> But it doesn't fall under "cleanup unnecessary files", which the
>> commit-graph file is, since, strictly speaking, it's purely
>> optimization.
>
> I won't be actively engaged in this discussion soon, but I must say
> that "git gc" doing "garbage collection" is merely an implementation
> detail of optimizing the repository for further use. And from that
> point of view, what needs to be updated is the synopsis of the
> git-gc doc. It states "X and Y" above, but it actually is "Y by
> doing X and other things".

Well, then perhaps the name of the command should be updated, too, to
better reflect what it actually does...

> I understand your "by definition there is no garbage immediately
> after clone" position, and also I would understand if you find it
> (perhaps philosophically) disturbing that "git clone" may give users
> a suboptimal repository that immediately needs optimizing [*1*].
>
> But that bridge was crossed long time ago ever since pack transfer
> was invented. The data source sends only the pack data stream, and
> the receiving end is responsible for spending cycles to build .idx
> file. Theoretically, .pack should be all that is needed---you
> should be able to locate any necessary object by parsing the .pack
> file every time you open it, and .idx is mere optimization. You can
> think of the .midx and graph files the same way.

I don't think this is a valid comparison, because, practically, Git
just didn't work after I deleted all pack index files. So while they
can be easily (re)generated, they are essential to make pack files
usable. The commit-graph and .midx files, however, can be safely
deleted, and everything keeps working as before.

OTOH, this is an excellent comparison, and I do think of the .midx and
graph files the same way as the pack index files. During a clone, the
pack index file isn't generated by running a separate 'git gc
(--auto)', but by clone (or fetch-pack?) running 'git index-pack'.
The way I see it that should be the case for these other files as well.

And it is much simpler, shorter, and cleaner to either run 'git
commit-graph ...' or even to call write_commit_graph_reachable()
directly from cmd_clone(), than to bolt on another option and config
variable on 'git gc' [1] to coax it into some kind of an "after clone"
mode, that it shouldn't be doing in the first place. At least for
now, so when we'll eventually get as far ...

> I would not be surprised by a future in which the initial index-pack
> that is responsible for receiving the incoming pack stream and
> storing that in .pack file(s) while creating corresponding .idx
> file(s) becomes also responsible for building .midx and graph files
> in the same pass, or at least smaller number of passes. Once we
> gain experience and confidence with these new auxiliary files, that
> ought to happen naturally. And at that point, we won't be having
> this discussion---we'd all happily run index-pack to receive the
> pack data, because that is pretty much the fundamental requirement
> to make use of the data.

... that what you wrote here becomes a reality (and I fully agree that
this is what we should ultimately aim for), then we won't have that
option and config variable still lying around and requiring
maintenance because of backwards compatibility.

1 - https://public-inbox.org/git/87in2hgzin@evledraar.gmail.com/

> [Footnote]
>
> *1* Even without considering these recent invention of auxiliary
>     files, cloning from a sloppily packed server whose primary focus
>     is to avoid spending cycles by not computing better deltas will
>     give the cloner a suboptimal repository. If we truly want to
>     have an optimized repository ready to be used after cloning, we
>     should run an equivalent of "repack -a -d -f" immediately after
>     "git clone".

I noticed a few times that I got surprisingly large packs from GitHub,
e.g. there is over 70% size difference between --single-branch cloning
v2.19.0 from GitHub and from my local clone or from kernel.org (~95MB
vs. ~55MB vs ~52MB). After running 'git repack -a -d -f' they all end
up at ~65MB, which is a nice size reduction for the clone from GitHub,
but the others just gained 10-13 more MBs.
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Mon, Oct 08, 2018 at 11:08:03PM -0400, Jeff King wrote:
> I'd have done it as one fixed-size filter per commit. Then you should be
> able to hash the path keys once, and apply the result as a bitwise query
> to each individual commit (I'm assuming that it's constant-time to
> access the filter for each, as an index into an mmap'd array, with the
> offset coming from a commit-graph entry we'd be able to look up anyway).

I used one big Bloom filter for the whole history, because that was the
simplest way to get going, and because I was primarily interested in
the potential benefits instead of the cost of generating and
maintaining it.

Using an 8MB filter for git.git results in a false positive rate
between 0.21% - 0.53%. Splitting that up for ~53k commits we get ~160
bytes for each. At first sight that seems rather small, but running a
bit of statistics shows that 99% of our commits don't change more than
10 files.

One advantage of the "one Bloom filter for each commit" approach is
that if a commit doesn't have a corresponding Bloom filter, then, well,
we can't query the non-existing filter. OTOH, with one big Bloom filter
we have to be careful to only ever query it with commits whose changes
have already been added, otherwise we can get false negatives.

> I think it would also be easier to deal with maintenance, since each
> filter is independent (IIRC, you cannot delete from a bloom filter
> without re-adding all of the other keys).

Accumulating entries related to unreachable commits will eventually
increase the false positive rate, but otherwise it won't cause false
negatives, and won't increase the size of the Bloom filter or the time
necessary to query it. So not deleting those entries right away is not
an issue, and I think it could be postponed until bigger gc runs.

[...]

> But there's also a related question: how do we match pathspec patterns?
> For a changed path like "foo/bar/baz", I imagine a bloom filter would
> mark all of "foo", "foo/bar", and "foo/bar/baz".

Indeed, that's what I did.

> But what about "*.c"? I don't think a bloom filter can answer that.

Surely not, but it could easily return "maybe", and thus simply fall
back to looking at the diff.

However, I've looked through the output of

  grep '^git log[^|]*[\[*?]' ~/.bash_history

and haven't found a single case where I used Git's globbing. When I
did use globbing, then I always used the shell's. Yeah, just one data
point, and others surely use it differently, etc... but I think we
should consider whether it's common enough to worry about and to
increase complexity because of it.

[...]

> So let's imagine we'd store such a cache external to the regular object
> data (i.e., as a commit-graph entry). The "log --raw" diff of linux.git
> has 1.7M entries. The paths should easily compress to a single 32-bit
> integer (e.g., as an index into a big path list). The oids are 20 bytes.
> Add a few bytes for modes. That's about 80MB. Big, but not impossibly
> so. Maybe pushing it for true gigantic repos, though.

In my experiments with the Linux repo a 256MB Bloom filter has ~0.3%
false positive rate, while a 128MB filter had 3-4%. Even bigger, though
compared to the size of the checkout of the full kernel tree, arguably
not that much.

> Those numbers are ignoring merges, too. The meaning of "did this commit
> touch that path" is a lot trickier for a merge commit, and I think may
> depend on context. I'm not sure how even a bloom filter solution would
> handle that (I was assuming we'd mostly punt and let merges fall back to
> opening up the trees).

During revision walking rev_compare_tree() checks whether the given
paths changed between a commit and _one_ of its parents, and in case of
merge commits it's invoked multiple times with the same commit but with
different parent parameters. By storing (changed-path, parent-oid,
commit-oid) tuples in the Bloom filter it can deal with merges, too.
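[The (changed-path, parent-oid, commit-oid) keying described above can be sketched as follows. This is an illustrative toy, not SZEDER's implementation: the filter size, the number of hashes, the FNV-1a-style hash (standing in for whatever the real series used), and the "parent:commit:path" key format are all assumptions for the example.]

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* one big Bloom filter for the whole history; sizes are made up */
#define FILTER_BITS (1 << 20)
#define NUM_HASHES 7

static uint8_t filter[FILTER_BITS / 8];

/* FNV-1a with a caller-chosen starting value, so we get two "independent
 * enough" base hashes from two different seeds */
static uint32_t fnv1a(const char *s, uint32_t h)
{
	while (*s) {
		h ^= (uint8_t)*s++;
		h *= 16777619u;
	}
	return h;
}

/* key a merge-aware entry as "parent:commit:path" */
static void key_for(char *buf, size_t len, const char *parent,
		    const char *commit, const char *path)
{
	snprintf(buf, len, "%s:%s:%s", parent, commit, path);
}

static void bloom_add(const char *parent, const char *commit, const char *path)
{
	char key[1024];
	uint32_t a, b, i;

	key_for(key, sizeof(key), parent, commit, path);
	a = fnv1a(key, 2166136261u);
	b = fnv1a(key, 0x9747b28cu);
	for (i = 0; i < NUM_HASHES; i++) {
		uint32_t bit = (a + i * b) % FILTER_BITS;
		filter[bit / 8] |= 1 << (bit % 8);
	}
}

static int bloom_maybe_contains(const char *parent, const char *commit,
				const char *path)
{
	char key[1024];
	uint32_t a, b, i;

	key_for(key, sizeof(key), parent, commit, path);
	a = fnv1a(key, 2166136261u);
	b = fnv1a(key, 0x9747b28cu);
	for (i = 0; i < NUM_HASHES; i++) {
		uint32_t bit = (a + i * b) % FILTER_BITS;
		if (!(filter[bit / 8] & (1 << (bit % 8))))
			return 0; /* definitely not present */
	}
	return 1; /* probably present */
}
```

Because each parent of a merge gets its own key, a query for (commit, parent2) never sees entries added for (commit, parent1), which is what makes rev_compare_tree()-style per-parent queries work.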
Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
On Tue, Oct 09, 2018 at 03:03:08PM -0400, Derrick Stolee wrote:
> > I wonder if Roaring does better here.
>
> In these sparse cases, usually Roaring will organize the data as "array
> chunks" which are simply lists of the values. The thing that makes this
> still compressible is that we store two bytes per entry, as the entries
> are grouped by a common most-significant two bytes. Since you say ~120k
> unique paths, the Roaring bitmap would have two or three chunks per
> bitmap (and those chunks could be empty). The overhead to store the
> chunk positions, types, and lengths does come at a cost, but it's more
> like 32 bytes _per commit_.

Hmph. It really sounds like we could do better with a custom RLE
solution. But that makes me feel like I'm missing something, because
surely I can't invent something better than the state of the art in a
simple thought experiment, right?

I know what I'm proposing would be quite bad for random access, but my
impression is that EWAH is the same. For the scale of bitmaps we're
talking about, I think linear/streaming access through the bitmap would
be OK.

> > So at any rate, I do think it would not be out of the question to
> > store bitmaps like this. I'm much more worried about the maintenance
> > cost of adding new entries incrementally. I think it's only feasible
> > if we give up sorting, and then I wonder what other problems that
> > might cause.
>
> The patch below gives me a starting point to try the Bloom filter
> approach and see what the numbers are like. You did all the "git" stuff
> like computing the changed paths, so thanks!

Great, I hope it can be useful. I almost wrote it as perl consuming the
output of "log --format=%h --name-only", but realized I didn't have a
perl ewah implementation handy.

You'll probably want to tweak this part:

> > +	prepare_revision_walk(&revs);
> > +	while ((commit = get_revision(&revs))) {
> > +		data.commit = commit;
> > +		diff_tree_combined_merge(commit, 0, &revs);
> > +	}

...to handle merges in a particular way. This will actually ignore
merges totally. You could add "-m" to the revision arguments to get a
per-parent diff, but of course you'd see those in your callback
individually. If you want to do _just_ the first parent diff, I think
you'll have to pick it apart manually, like:

  while ((commit = get_revision(&revs))) {
	  struct object_id *parent_oid;

	  /* ignore non-first parents, but handle root commits like --root */
	  if (commit->parents)
		  parent_oid = &commit->parents->item->object.oid;
	  else
		  parent_oid = the_hash_algo->empty_tree;

	  diff_tree_oid(parent_oid, &commit->object.oid, ...);
  }

-Peff
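[Stolee's description of Roaring's "array chunks" above can be turned into a back-of-the-envelope size estimate. This sketch is mine, not from the thread, and it only models the array-container case: members are grouped by their high 16 bits, each group stores two bytes per member plus a small per-container header. The 8-byte header figure is a made-up stand-in; real Roaring also has bitmap and run containers.]

```c
#include <stdint.h>
#include <stddef.h>

/* assumed per-container overhead: hi-key + type + cardinality */
#define CONTAINER_HEADER_BYTES 8

/*
 * Approximate serialized size of a sparse set stored as Roaring-style
 * array containers.  `values` must be sorted ascending.
 */
static size_t roaring_array_size(const uint32_t *values, size_t n)
{
	size_t bytes = 0;
	size_t i = 0;

	while (i < n) {
		uint16_t hi = values[i] >> 16;
		size_t start = i;

		/* all members sharing the high 16 bits go in one container */
		while (i < n && (values[i] >> 16) == hi)
			i++;

		bytes += CONTAINER_HEADER_BYTES;
		bytes += 2 * (i - start); /* low 16 bits per member */
	}
	return bytes;
}
```

With ~120k unique paths every position fits in two containers (high keys 0 and 1), so a commit touching d paths costs roughly 2d bytes plus a couple of headers, which is where the "more like 32 bytes per commit" overhead figure comes from.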
Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
On 10/9/2018 2:46 PM, Jeff King wrote:
> On Tue, Oct 09, 2018 at 09:48:20AM -0400, Derrick Stolee wrote:
>> [I snipped all of the parts about bloom filters that seemed entirely
>> reasonable to me ;) ]
>>
>>> Imagine we have that list. Is a bloom filter still the best data
>>> structure for each commit? At the point that we have the complete
>>> universe of paths, we could give each commit a bitmap of changed
>>> paths. That lets us ask "did this commit touch these paths" (collect
>>> the bits from the list of paths, then check for 1's), as well as
>>> "what paths did we touch" (collect the 1 bits, and then index the
>>> path list). Those bitmaps should compress very well via EWAH or
>>> similar (most of them would be huge stretches of 0's punctuated by
>>> short runs of 1's).
>>
>> I'm not convinced we would frequently have runs of 1's, and the bitmap
>> would not compress much better than simply listing the positions. For
>> example, a path "foo/bar" that resolves to a tree would only start a
>> run if the next changes are the initial section of entries in that
>> tree (sorted lexicographically) such as "foo/bar/a, foo/bar/b". If we
>> deepen into a tree, then we will break the run of 1's unless we
>> changed every path deeper than that tree.
>
> Yeah, I doubt we'd really have runs of 1's (by short, I just mean 1 or
> 2). I agree that listing the positions could work, though I sort of
> assumed that was more or less what a decent compressed bitmap would
> turn into. E.g., if bit N is set, we should be able to say "N-1 zeroes,
> 1 one" in about the same size as we could say "position N".
>
> EWAH seems pretty awful in that regard. Or at least its serialized
> format is (or maybe it's our implementation that is bad). The patch
> below generates a bitmap for each commit in a repository (it doesn't
> output the total list of paths; I've left that as an exercise for the
> reader). On linux.git, the result is 57MB. But when I look at the
> individual bitmap sizes (via GIT_TRACE), they're quite silly. Storing a
> single set bit takes 28 bytes in serialized form!
>
> There are only around 120k unique paths (including prefix trees).
> Naively using run-length encoding and varints, our worst case should be
> something like 18-20 bits to say "120k zeroes, then a 1, then all
> zeroes". And the average case should be better (you don't even need to
> say "120k", but some smaller number).
>
> I wonder if Roaring does better here.

In these sparse cases, usually Roaring will organize the data as "array
chunks" which are simply lists of the values. The thing that makes this
still compressible is that we store two bytes per entry, as the entries
are grouped by a common most-significant two bytes. Since you say ~120k
unique paths, the Roaring bitmap would have two or three chunks per
bitmap (and those chunks could be empty). The overhead to store the
chunk positions, types, and lengths does come at a cost, but it's more
like 32 bytes _per commit_.

> Gzipping the resulting bitmaps drops the total size to about 7.5MB.
> That's not a particularly important number, but I think it shows that
> the built-in ewah compression is far from ideal. Just listing the
> positions with a series of varints would generally be fine, since we
> expect sparse bitmaps. I just hoped that a good RLE scheme would
> degrade to roughly that for the sparse case, but also perform well for
> more dense cases.
>
> So at any rate, I do think it would not be out of the question to store
> bitmaps like this. I'm much more worried about the maintenance cost of
> adding new entries incrementally. I think it's only feasible if we give
> up sorting, and then I wonder what other problems that might cause.

The patch below gives me a starting point to try the Bloom filter
approach and see what the numbers are like. You did all the "git" stuff
like computing the changed paths, so thanks!

> -- >8 --
> diff --git a/Makefile b/Makefile
> index 13e1c52478..f6e823f2d6 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
>  TEST_PROGRAMS_NEED_X += test-pkt-line
>  TEST_PROGRAMS_NEED_X += test-svn-fe
>  TEST_PROGRAMS_NEED_X += test-tool
> +TEST_PROGRAMS_NEED_X += test-tree-bitmap
>
>  TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X))
>
> diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c
> new file mode 100644
> index 00..bc5cf0e514
> --- /dev/null
> +++ b/t/helper/test-tree-bitmap.c
> @@ -0,0 +1,167 @@
> +#include "cache.h"
> +#include "revision.h"
> +#include "diffcore.h"
> +#include "argv-array.h"
> +#include "ewah/ewok.h"
> +
> +/* map of pathnames to bit positions */
> +struct pathmap_entry {
> +	struct hashmap_entry ent;
> +	unsigned pos;
> +	char path[FLEX_ARRAY];
> +};
> +
> +static int pathmap_entry_hashcmp(const void *unused_cmp_data,
> +				 const void *entry,
> +				 const void *entry_or_key,
> +				 const void *keydata)
> +{
> +	const struct pathmap_entry *a = entry;
> +	const struct pathmap_entry *b = entry_or_key;
> +
Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
On Tue, Oct 09, 2018 at 09:48:20AM -0400, Derrick Stolee wrote:
> [I snipped all of the parts about bloom filters that seemed entirely
> reasonable to me ;) ]
>
> > Imagine we have that list. Is a bloom filter still the best data
> > structure for each commit? At the point that we have the complete
> > universe of paths, we could give each commit a bitmap of changed
> > paths. That lets us ask "did this commit touch these paths" (collect
> > the bits from the list of paths, then check for 1's), as well as
> > "what paths did we touch" (collect the 1 bits, and then index the
> > path list). Those bitmaps should compress very well via EWAH or
> > similar (most of them would be huge stretches of 0's punctuated by
> > short runs of 1's).
>
> I'm not convinced we would frequently have runs of 1's, and the bitmap
> would not compress much better than simply listing the positions. For
> example, a path "foo/bar" that resolves to a tree would only start a
> run if the next changes are the initial section of entries in that tree
> (sorted lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen
> into a tree, then we will break the run of 1's unless we changed every
> path deeper than that tree.

Yeah, I doubt we'd really have runs of 1's (by short, I just mean 1 or
2). I agree that listing the positions could work, though I sort of
assumed that was more or less what a decent compressed bitmap would turn
into. E.g., if bit N is set, we should be able to say "N-1 zeroes, 1
one" in about the same size as we could say "position N".

EWAH seems pretty awful in that regard. Or at least its serialized
format is (or maybe it's our implementation that is bad). The patch
below generates a bitmap for each commit in a repository (it doesn't
output the total list of paths; I've left that as an exercise for the
reader). On linux.git, the result is 57MB. But when I look at the
individual bitmap sizes (via GIT_TRACE), they're quite silly.

Storing a single set bit takes 28 bytes in serialized form! There are
only around 120k unique paths (including prefix trees). Naively using
run-length encoding and varints, our worst case should be something
like 18-20 bits to say "120k zeroes, then a 1, then all zeroes". And
the average case should be better (you don't even need to say "120k",
but some smaller number).

I wonder if Roaring does better here.

Gzipping the resulting bitmaps drops the total size to about 7.5MB.
That's not a particularly important number, but I think it shows that
the built-in ewah compression is far from ideal. Just listing the
positions with a series of varints would generally be fine, since we
expect sparse bitmaps. I just hoped that a good RLE scheme would degrade
to roughly that for the sparse case, but also perform well for more
dense cases.

So at any rate, I do think it would not be out of the question to store
bitmaps like this. I'm much more worried about the maintenance cost of
adding new entries incrementally. I think it's only feasible if we give
up sorting, and then I wonder what other problems that might cause.
-Peff

-- >8 --
diff --git a/Makefile b/Makefile
index 13e1c52478..f6e823f2d6 100644
--- a/Makefile
+++ b/Makefile
@@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
 TEST_PROGRAMS_NEED_X += test-pkt-line
 TEST_PROGRAMS_NEED_X += test-svn-fe
 TEST_PROGRAMS_NEED_X += test-tool
+TEST_PROGRAMS_NEED_X += test-tree-bitmap

 TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X))

diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c
new file mode 100644
index 00..bc5cf0e514
--- /dev/null
+++ b/t/helper/test-tree-bitmap.c
@@ -0,0 +1,167 @@
+#include "cache.h"
+#include "revision.h"
+#include "diffcore.h"
+#include "argv-array.h"
+#include "ewah/ewok.h"
+
+/* map of pathnames to bit positions */
+struct pathmap_entry {
+	struct hashmap_entry ent;
+	unsigned pos;
+	char path[FLEX_ARRAY];
+};
+
+static int pathmap_entry_hashcmp(const void *unused_cmp_data,
+				 const void *entry,
+				 const void *entry_or_key,
+				 const void *keydata)
+{
+	const struct pathmap_entry *a = entry;
+	const struct pathmap_entry *b = entry_or_key;
+	const char *key = keydata;
+
+	return strcmp(a->path, key ? key : b->path);
+}
+
+static int pathmap_entry_strcmp(const void *va, const void *vb)
+{
+	struct pathmap_entry *a = *(struct pathmap_entry **)va;
+	struct pathmap_entry *b = *(struct pathmap_entry **)vb;
+	return strcmp(a->path, b->path);
+}
+
+struct walk_paths_data {
+	struct hashmap *paths;
+	struct commit *commit;
+};
+
+static void walk_paths(diff_format_fn_t fn, struct hashmap *paths)
+{
+	struct argv_array argv = ARGV_ARRAY_INIT;
+	struct rev_info revs;
+	struct walk_paths_data data;
+	struct commit *commit;
+
+	argv_array_pushl(&argv, "rev-list",
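[The "series of varints" alternative Peff mentions can be sketched concretely. This is an illustrative toy, not git's on-disk format: the sorted positions of set bits are delta-encoded as base-128 (LEB128-style) varints, so a lone bit near position 120k costs 3 bytes rather than EWAH's 28.]

```c
#include <stdint.h>
#include <stddef.h>

/* write `value` as a base-128 varint; returns number of bytes written */
static size_t varint_encode(uint32_t value, uint8_t *out)
{
	size_t n = 0;

	do {
		uint8_t byte = value & 0x7f;

		value >>= 7;
		if (value)
			byte |= 0x80; /* continuation bit */
		out[n++] = byte;
	} while (value);
	return n;
}

/*
 * Encode the sorted positions of set bits as deltas from the previous
 * position; returns the total encoded length in bytes.
 */
static size_t encode_positions(const uint32_t *pos, size_t n, uint8_t *out)
{
	size_t len = 0;
	uint32_t prev = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		len += varint_encode(pos[i] - prev, out + len);
		prev = pos[i];
	}
	return len;
}

/* convenience wrapper for measuring encoded size only (toy-sized buffer) */
static size_t encoded_length(const uint32_t *pos, size_t n)
{
	uint8_t buf[64];

	return encode_positions(pos, n, buf);
}
```

Delta encoding is what keeps the average case small: a commit touching a handful of adjacent paths pays one "seek" varint plus one-byte deltas, which lines up with the 18-20 bit worst-case estimate above.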
Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
On Tue, Oct 09 2018, Derrick Stolee wrote:

> The filter needs to store every path that would be considered "not
> TREESAME". It can't store wildcards, so you would need to evaluate the
> wildcard and test all of those paths individually (not a good idea).

If full paths are stored, yes. But have you considered, instead of
storing paths, storing all trigrams that can be extracted from the
changed paths at that commit? I.e. instead of a change to
"t/t0000-basic.sh" storing "t/t0000-basic.sh", we'd store ["t/t",
"/t0", "t00", "000", "00-", ...] etc.

That sort of approach would mean that e.g. "t*000*", "*asi*.sh" etc.
could all be indexed, and as long as we could find three consecutive
bytes of fixed string we'd have a chance to short-circuit, but we would
need to degrade to a full tree unpack for e.g. "t*".

We could also special-case certain sub-three-char indexes, or add
"bi-grams", e.g. to be able to index '*.c' or 't*' (first char at the
beginning of a string only).

It would mean having to check more things in the bloom filter for each
commit, but those will be hot in cache at that point, so it'll probably
beat unpacking trees by far, and we could short-circuit at the first
one that returned false.
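[The trigram idea above can be sketched as follows. Illustrative only; nothing like this exists in git, and the helper name and fixed-size output array are assumptions for the example. Every 3-byte substring of a changed path becomes a key; a query like "*asi*.sh" would probe the filter for the trigrams of its fixed substrings and fall back to unpacking trees only on "maybe".]

```c
#include <stddef.h>
#include <string.h>

/*
 * Extract every 3-byte substring of `path` into `out` (each entry
 * NUL-terminated), up to `max` entries; returns the number extracted.
 */
static size_t extract_trigrams(const char *path, char out[][4], size_t max)
{
	size_t len = strlen(path);
	size_t i, n = 0;

	for (i = 0; i + 3 <= len && n < max; i++) {
		memcpy(out[n], path + i, 3);
		out[n][3] = '\0';
		n++;
	}
	return n;
}
```

A path of length L yields L-2 trigrams, so indexing cost grows only linearly with the path, while any pattern containing at least one fixed 3-byte run gets a chance to short-circuit.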
Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
(Changing title to reflect the new topic.)

On 10/8/2018 11:08 PM, Jeff King wrote:
> On Mon, Oct 08, 2018 at 02:29:47PM -0400, Derrick Stolee wrote:
>> There are two questions that I was hoping to answer by looking at your
>> code:
>>
>> 1. How do you store your Bloom filter? Is it connected to the
>>    commit-graph and split on a commit-by-commit basis (storing "$path"
>>    as a key), or is it one huge Bloom filter (storing
>>    "$commitid:$path" as key)?
>
> I guess you've probably thought all of this through for your
> implementation, but let me pontificate.
>
> I'd have done it as one fixed-size filter per commit. Then you should
> be able to hash the path keys once, and apply the result as a bitwise
> query to each individual commit (I'm assuming that it's constant-time
> to access the filter for each, as an index into an mmap'd array, with
> the offset coming from a commit-graph entry we'd be able to look up
> anyway).

You're right that we want to hash the path a constant number of times.
Add in that we want to re-use information already serialized when
updating the file, and we see that having a commit-by-commit Bloom
filter is a good idea. Using (commit, path) pairs requires lots of
re-hashing, repeated work when extending the filter, and poor locality
when evaluating membership of a single key.

The nice thing is that you can generate k 32-bit hash values based on
two 32-bit hash values that are "independent enough" (see [1]). We used
Murmur3 with two different seed values to generate hashes a & b, then
used the arithmetic progression a, a + b, a + 2b, ..., a + (k-1)b as
our k hash values. These can be computed up front and then dropped into
any size filter using a simple modulo operation. This allows flexible
sizes in our filters.

I don't think fixed size filters are a good idea, and instead would opt
for flex-sized filters with a maximum size. The typical parameters use
7 hash functions and a filter size of (at least) 10 bits per entry. For
most commits (say 60-70%), 256 bits (32 bytes) is enough. Using a
maximum of 512 bytes covers 99% of commits. We will want these bounds
to be configurable via config. If we had a fixed size, then we either
make it too small (and don't have sufficient coverage of commits) or
too large (and waste a lot of space on the commits that change very
little).

We can store these flex-sized filters in the commit-graph using two
columns of data (two new optional chunks):

* Bloom filter data: stores the binary data of each commit's Bloom
  filter, concatenated together in the same order as the commits appear
  in the commit-graph.

* Bloom filter positions: The ith position of this column stores the
  start of the (i+1)th Bloom filter (the 0th filter starts at byte 0).
  A Bloom filter of size 0 is intended to mean "we didn't store this
  filter because it would be too large". We can compute the lengths of
  the filters by inspecting adjacent values.

In order to be flexible, we will want to encode some basic information
into the Bloom filter data chunk, such as a tuple of (hash version, num
hash bits, num bits per entry). This allows us to change the parameters
in config but still be able to read a serialized filter. Here I assume
that all filters share the same parameters. The "hash version" here is
different than the_hash_algo, because we don't care about cryptographic
security, only a uniform distribution (hence, Murmur3 is a good, fast
option).

[1] https://web.archive.org/web/20090131053735/http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf

> I think it would also be easier to deal with maintenance, since each
> filter is independent (IIRC, you cannot delete from a bloom filter
> without re-adding all of the other keys). The obvious downside is that
> it's O(commits) storage instead of O(1).

It would always be O(changes), as the Bloom filter needs to grow in
size as the number of entries grows.

> Now let me ponder a bit further afield. A bloom filter lets us answer
> the question "did this commit (probably) touch these paths?". But it
> does not let us answer "which paths did this commit touch?".
>
> That second one is less useful than you might think, because we almost
> always care about not just the names of the paths, but their actual
> object ids. Think about a --raw diff, or even traversing for
> reachability (where if we knew the tree-diff cheaply, we could avoid
> asking "have we seen this yet?" on most of the tree entries). The
> names alone can make that a bit faster, but in the worst case you
> still have to walk the whole tree to find their entries.
>
> But there's also a related question: how do we match pathspec
> patterns? For a changed path like "foo/bar/baz", I imagine a bloom
> filter would mark all of "foo", "foo/bar", and "foo/bar/baz". But what
> about "*.c"? I don't think a bloom filter can answer that.

The filter needs to store every path that would be considered "not
TREESAME". It can't store wildcards, so you would need to evaluate the
wildcard and test all of those paths individually (not a good idea).
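[The a, a + b, a + 2b, ... construction Stolee describes (from the Kirsch-Mitzenmacher paper linked as [1] above) can be sketched like this. The seeded hash here is an FNV-1a-style stand-in for the Murmur3 pair used in the actual implementation, and the seed values are arbitrary.]

```c
#include <stddef.h>
#include <stdint.h>

/* stand-in for Murmur3-with-seed: FNV-1a mixed with a caller seed */
static uint32_t hash_seeded(const char *s, uint32_t seed)
{
	uint32_t h = seed ^ 2166136261u;

	while (*s) {
		h ^= (uint8_t)*s++;
		h *= 16777619u;
	}
	return h;
}

/*
 * Fill probes[] with k bit positions for `path` in a filter of `nbits`
 * bits.  The two base hashes a and b are computed once per key; because
 * only the final modulo depends on the filter size, the same (a, b) pair
 * can be dropped into a filter of any size.
 */
static void bloom_probes(const char *path, uint32_t nbits,
			 uint32_t *probes, size_t k)
{
	uint32_t a = hash_seeded(path, 0x293ae76fu); /* seeds are arbitrary */
	uint32_t b = hash_seeded(path, 0x7e646e2au);
	size_t i;

	for (i = 0; i < k; i++)
		probes[i] = (a + (uint32_t)i * b) % nbits;
}
```

This is what makes flex-sized per-commit filters cheap to query: hash the path once, then reduce the precomputed progression modulo each commit's own filter size.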
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Mon, Oct 08, 2018 at 02:29:47PM -0400, Derrick Stolee wrote:

> > > > But I'm afraid it will take a while until I get around to turn it
> > > > into something presentable...
> > > Do you have the code pushed somewhere public where one could take a
> > > look? I could provide some early feedback.
> > Nah, definitely not... I know full well how embarrassingly broken this
> > implementation is, I don't need others to tell me that ;)
> There are two questions that I was hoping to answer by looking at your code:
>
> 1. How do you store your Bloom filter? Is it connected to the commit-graph
> and split on a commit-by-commit basis (storing "$path" as a key), or is it
> one huge Bloom filter (storing "$commitid:$path" as key)?

I guess you've probably thought all of this through for your implementation, but let me pontificate. I'd have done it as one fixed-size filter per commit. Then you should be able to hash the path keys once, and apply the result as a bitwise query to each individual commit (I'm assuming that it's constant-time to access the filter for each, as an index into an mmap'd array, with the offset coming from a commit-graph entry we'd be able to look up anyway).

I think it would also be easier to deal with maintenance, since each filter is independent (IIRC, you cannot delete from a bloom filter without re-adding all of the other keys). The obvious downside is that it's O(commits) storage instead of O(1).

Now let me ponder a bit further afield. A bloom filter lets us answer the question "did this commit (probably) touch these paths?". But it does not let us answer "which paths did this commit touch?". That second one is less useful than you might think, because we almost always care about not just the names of the paths, but their actual object ids.
Think about a --raw diff, or even traversing for reachability (where if we knew the tree-diff cheaply, we could avoid asking "have we seen this yet?" on most of the tree entries). The names alone can make that a bit faster, but in the worst case you still have to walk the whole tree to find their entries. But there's also a related question: how do we match pathspec patterns? For a changed path like "foo/bar/baz", I imagine a bloom filter would mark all of "foo", "foo/bar", and "foo/bar/baz". But what about "*.c"? I don't think a bloom filter can answer that. At least not by itself. If we imagine that the commit-graph also had an alphabetized list of every path in every tree, then it's easy: apply the glob to that list once to get a set of concrete paths, and then query the bloom filters for those. And that list actually isn't too big. The complete set of paths in linux.git is only about 300k gzipped (I think that's the most relevant measure, since it's an obvious win to avoid repeating shared prefixes of long paths). Imagine we have that list. Is a bloom filter still the best data structure for each commit? At the point that we have the complete universe of paths, we could give each commit a bitmap of changed paths. That lets us ask "did this commit touch these paths" (collect the bits from the list of paths, then check for 1's), as well as "what paths did we touch" (collect the 1 bits, and then index the path list). Those bitmaps should compress very well via EWAH or similar (most of them would be huge stretches of 0's punctuated by short runs of 1's). So that seems promising to me (or at least not an obvious dead-end). I do think maintenance gets to be a headache, though. Adding new paths potentially means reordering the bitmaps, which means O(commits) work to "incrementally" update the structure. (Unless you always add the new paths at the end, but then you lose fast lookups in the list; that might be an acceptable tradeoff). 
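The bitmap idea can be sketched as follows. Plain Python integers stand in for the EWAH-compressed bitmaps, and `fnmatch` is only a rough stand-in for git's pathspec matching; the path list and all names are illustrative.

```python
import fnmatch
from bisect import bisect_left

# Hypothetical global alphabetized list of every path in every tree.
paths = sorted(["Makefile", "foo/bar/baz.c", "foo/bar/qux.h"])

def path_bit(path):
    # Each path's bit position is its index in the global list.
    i = bisect_left(paths, path)
    assert i < len(paths) and paths[i] == path
    return 1 << i

# One bitmap per commit: this commit touched two paths.
commit_bitmap = path_bit("foo/bar/baz.c") | path_bit("Makefile")

def touched(bitmap, pattern):
    # "Did this commit touch these paths?": resolve the glob against
    # the global list once, then it's a single bitwise AND per commit.
    mask = 0
    for p in fnmatch.filter(paths, pattern):
        mask |= path_bit(p)
    return bitmap & mask != 0

def touched_paths(bitmap):
    # "Which paths did this commit touch?": collect the 1 bits and
    # index the path list.
    return [p for i, p in enumerate(paths) if bitmap >> i & 1]

assert touched(commit_bitmap, "*.c")
assert not touched(commit_bitmap, "*.h")
assert touched_paths(commit_bitmap) == ["Makefile", "foo/bar/baz.c"]
```

Unlike a Bloom filter, this answers both directions of the query exactly, at the cost of depending on a stable global path ordering.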
And finally, there's one more radical option: could we actually store a real per-commit tree-diff cache? I.e., imagine that each commit had the equivalent of a --raw diff easily accessible, including object ids. That would allow: - fast pathspec matches, including globs - fast --raw output (and faster -p output, since we can skip the tree entirely) - fast reachability traversals (we only need to bother to look at the objects for changed entries) where "fast" is basically O(size of commit's changes), rather than O(size of whole tree). This was one of the big ideas of packv4 that never materialized. You can _almost_ do it with packv2, since after all, we end up storing many trees as deltas. But those deltas are byte-wise so it's hard for a reader to convert them directly into a pure-tree diff (they also don't mention the "deleted" data, so it's really only half a diff). So let's imagine we'd store such a cache external to the regular object data (i.e., as a commit-graph entry). The "log --raw" diff of linux.git has 1.7M entries. The paths should easily compress to a single 32-bit integer (e.g., as an index
Re: We should add a "git gc --auto" after "git clone" due to commit graph
SZEDER Gábor writes:

> There is certainly potential there. With a (very) rough PoC
> experiment, an 8MB bloom filter, and a carefully chosen path I can
> achieve a nice, almost 25x speedup:
>
>   $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
>   6
>
>   real    0m1.563s
>   user    0m1.519s
>   sys     0m0.045s
>
>   $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
>   6
>
>   real    0m0.063s
>   user    0m0.043s
>   sys     0m0.020s

Even though I somehow sense a sign of exaggeration in [v] in the pathname, it still is quite respectable.

> bloom filter total queries: 16269 definitely not: 16195 maybe: 74
> false positives: 64 fp ratio: 0.003934
>
> But I'm afraid it will take a while until I get around to turn it into
> something presentable...

That's OK. This is an encouraging result. Just from curiosity, how are you keying the filter? Tree object name of the top-level and full path concatenated, or something like that?
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On 10/8/2018 2:10 PM, SZEDER Gábor wrote:

On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote: Nice! These numbers make sense to me, in terms of how many TREESAME queries we actually need to perform for such a query.

Yeah... because you didn't notice that I deliberately cheated :) As it turned out, it's not just about the number of diff queries that we can spare, but, for the speedup _ratio_, it's more about how expensive those diff queries are. git.git has a rather flat hierarchy, and 't/' is the 372nd entry in the current root tree object, while 'valgrind/' is the 923rd entry, and the diff machinery spends considerable time wading through the previous entries. Notice the "carefully chosen path" remark in my previous email; I think this particular path has the highest number of preceding tree entries, and, in addition, 't/' changes rather frequently, so the diff machinery often has to scan two relatively big tree objects. Had I chosen 'Documentation/RelNotes/1.5.0.1.txt' instead, i.e. another path two directories deep, but whose leading path components are both near the beginning of the tree objects, the speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only" ~5.7x instead of ~24.8x.

This is expected. The performance ratio is better when the path is any of the following:

1. A very deep path (need to walk multiple trees to answer TREESAME)
2. An entry late in a very wide tree (need to spend extra time parsing the tree object)
3. A path that doesn't change very often (need to inspect many TREESAME pairs before finding enough interesting commits)
4. A path where some sub-path changes often (so the TREESAME comparison needs to parse beyond that sub-path often)

Our standard examples (Git and Linux repos) don't have many paths that have these properties. But: they do exist. In other projects, this is actually typical. Think about Java projects that frequently have ~5 levels of folders before actually touching a code file.
When I was implementing the Bloom filter feature for Azure Repos, I ran performance tests on the Linux repo using a random sampling of paths. The typical speedup was 5x while some outliers were in the 25x range.

> > > But I'm afraid it will take a while until I get around to turn it
> > > into something presentable...
> > Do you have the code pushed somewhere public where one could take a
> > look? I could provide some early feedback.
> Nah, definitely not... I know full well how embarrassingly broken this
> implementation is, I don't need others to tell me that ;)

There are two questions that I was hoping to answer by looking at your code:

1. How do you store your Bloom filter? Is it connected to the commit-graph and split on a commit-by-commit basis (storing "$path" as a key), or is it one huge Bloom filter (storing "$commitid:$path" as key)?

2. Where does your Bloom filter check plug into the TREESAME logic? I haven't investigated this part, but hopefully it isn't too complicated.

Thanks, -Stolee
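For question 2, one plausible shape is to consult the filter before running the tree diff: a "definitely not" answer makes the commit/parent pair TREESAME for that path with no tree parsing at all, while a "maybe" falls through to the real diff. A sketch with hypothetical helpers (nothing here is git's actual API):

```python
def treesame(commit, parent, path, bloom_maybe, tree_diff_touches):
    """Hypothetical TREESAME check with a Bloom short-circuit.

    bloom_maybe() is the probabilistic filter query; a False answer is
    definitive, so the expensive tree diff can be skipped entirely.
    """
    if not bloom_maybe(commit, path):
        return True                          # definitely unchanged
    return not tree_diff_touches(commit, parent, path)

# Count how often the expensive diff actually runs.
calls = []
def diff(commit, parent, path):
    calls.append(path)
    return False                             # unchanged (a false positive)

assert treesame("c1", "p1", "t/x", lambda c, p: False, diff)
assert calls == []                           # "definitely not": diff skipped
assert treesame("c2", "p2", "t/y", lambda c, p: True, diff)
assert calls == ["t/y"]                      # "maybe": diff still needed
```

This mirrors the "definitely not" vs. "maybe" counts in SZEDER's PoC statistics: only the "maybe" answers pay for a real diff.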
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote:

> On 10/8/2018 12:41 PM, SZEDER Gábor wrote:
> > On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:
> > > I'm still excited about the prospect of a bloom filter for paths which
> > > each commit touches. I think that's the next big frontier in getting
> > > things like "git log -- path" to a reasonable run-time.
> > There is certainly potential there. With a (very) rough PoC
> > experiment, an 8MB bloom filter, and a carefully chosen path I can
> > achieve a nice, almost 25x speedup:
> >
> >   $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
> >   6
> >
> >   real    0m1.563s
> >   user    0m1.519s
> >   sys     0m0.045s
> >
> >   $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
> >   6
> >
> >   real    0m0.063s
> >   user    0m0.043s
> >   sys     0m0.020s
> >
> >   bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false positives: 64 fp ratio: 0.003934
> Nice! These numbers make sense to me, in terms of how many TREESAME
> queries we actually need to perform for such a query.

Yeah... because you didn't notice that I deliberately cheated :)

As it turned out, it's not just about the number of diff queries that we can spare, but, for the speedup _ratio_, it's more about how expensive those diff queries are. git.git has a rather flat hierarchy, and 't/' is the 372nd entry in the current root tree object, while 'valgrind/' is the 923rd entry, and the diff machinery spends considerable time wading through the previous entries. Notice the "carefully chosen path" remark in my previous email; I think this particular path has the highest number of preceding tree entries, and, in addition, 't/' changes rather frequently, so the diff machinery often has to scan two relatively big tree objects. Had I chosen 'Documentation/RelNotes/1.5.0.1.txt' instead, i.e.
another path two directories deep, but whose leading path components are both near the beginning of the tree objects, the speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only" ~5.7x instead of ~24.8x.

> > But I'm afraid it will take a while until I get around to turn it into
> > something presentable...
> Do you have the code pushed somewhere public where one could take a
> look? I could provide some early feedback.

Nah, definitely not... I know full well how embarrassingly broken this implementation is, I don't need others to tell me that ;)
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On 10/8/2018 12:41 PM, SZEDER Gábor wrote:

> On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:
> > I'm still excited about the prospect of a bloom filter for paths which
> > each commit touches. I think that's the next big frontier in getting
> > things like "git log -- path" to a reasonable run-time.
> There is certainly potential there. With a (very) rough PoC
> experiment, an 8MB bloom filter, and a carefully chosen path I can
> achieve a nice, almost 25x speedup:
>
>   $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
>   6
>
>   real    0m1.563s
>   user    0m1.519s
>   sys     0m0.045s
>
>   $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
>   6
>
>   real    0m0.063s
>   user    0m0.043s
>   sys     0m0.020s
>
>   bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false positives: 64 fp ratio: 0.003934

Nice! These numbers make sense to me, in terms of how many TREESAME queries we actually need to perform for such a query.

> But I'm afraid it will take a while until I get around to turn it into
> something presentable...

Do you have the code pushed somewhere public where one could take a look? I could provide some early feedback.

Thanks, -Stolee
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:

> I'm still excited about the prospect of a bloom filter for paths which
> each commit touches. I think that's the next big frontier in getting
> things like "git log -- path" to a reasonable run-time.

There is certainly potential there. With a (very) rough PoC experiment, an 8MB bloom filter, and a carefully chosen path I can achieve a nice, almost 25x speedup:

  $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
  6

  real    0m1.563s
  user    0m1.519s
  sys     0m0.045s

  $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
  6

  real    0m0.063s
  user    0m0.043s
  sys     0m0.020s

  bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false positives: 64 fp ratio: 0.003934

But I'm afraid it will take a while until I get around to turn it into something presentable...
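For context, the measured fp ratio above (~0.4%) is in the range the standard Bloom filter estimate predicts for a reasonably provisioned filter. The formulas below are the textbook ones; the m/n values are only an illustration, since the PoC's actual parameters aren't given:

```python
import math

def bloom_fp_rate(m_bits, n_keys, k_hashes):
    """Standard false-positive estimate for a Bloom filter with
    m bits, n inserted keys and k hash functions:
    p ~ (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_keys / m_bits)) ** k_hashes

def optimal_k(m_bits, n_keys):
    # The fp-minimizing hash count is (m/n) * ln 2.
    return max(1, round(m_bits / n_keys * math.log(2)))

# At ~10 bits per key the optimal k is 7 and the estimate is just
# under 1%; more bits per key push it toward SZEDER's 0.4%.
m, n = 10_000_000, 1_000_000
k = optimal_k(m, n)
assert k == 7
assert 0.005 < bloom_fp_rate(m, n, k) < 0.012
```

These formulas are also what the config knobs discussed elsewhere in the thread (num hash bits, num bits per entry) would ultimately be tuning.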
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On Fri, Oct 05, 2018 at 10:01:31PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > There's unfortunately not a fast way of doing that. One option would be
> > to keep a counter of "ungraphed commit objects", and have callers update
> > it. Anybody admitting a pack via index-pack or unpack-objects can easily
> > get this information. Commands like fast-import can do likewise, and
> > "git commit" obviously increments it by one.
> >
> > I'm not excited about adding a new global on-disk data structure (and
> > the accompanying lock).
>
> You don't really need a new global data structure to solve this
> problem. It would be sufficient to have git-gc itself write out a 4-line
> text file after it runs saying how many tags, commits, trees and blobs
> it found on its last run.
>
> You can then fuzzily compare object counts vs. commit counts for the
> purposes of deciding whether something like the commit-graph needs to be
> updated, while assuming that whatever new data you have has similar
> enough ratios of those as your existing data.

I think this is basically the same thing as Stolee's suggestion to keep the total object count in the commit-graph file. The only difference here is that we know the actual ratio of commits to blobs for this particular repository.

But I don't think we need to know that. As you said, this is fuzzy anyway, so a single number for "update the graph when there are N new objects" is likely enough. If you had a repository with an unusually large tree, you'd end up rebuilding the graph more often. But I think it would probably be OK, as we're primarily trying not to waste time doing a graph rebuild when we've only done a small amount of other work. But if we just shoved a ton of objects through index-pack then we did a lot of work, whether those were commit objects or not.

-Peff
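The single-number heuristic described in this message can be sketched in a few lines; the threshold value and both names are purely illustrative, not anything git implements:

```python
# Rewrite the commit-graph once the repository has gained N objects
# since the last graph write, regardless of object type.
def should_rewrite_graph(total_objects, objects_at_last_write,
                         threshold=50_000):
    return total_objects - objects_at_last_write >= threshold

assert not should_rewrite_graph(100_500, 100_000)   # small delta: skip
assert should_rewrite_graph(160_000, 100_000)       # big fetch: rewrite
```

The point of the fuzziness is exactly what the message says: a repository with an unusually large tree just rebuilds a bit more often, which is acceptable.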
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On Fri, Oct 05, 2018 at 04:00:12PM -0400, Derrick Stolee wrote: > On 10/5/2018 3:47 PM, Jeff King wrote: > > On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote: > > > > > > So can we really just take (total_objects - commit_graph_objects) and > > > > compare it to some threshold? > > > The commit-graph only stores the number of _commits_, not total objects. > > Oh, right, of course. That does throw a monkey wrench in that line of > > thought. ;) > > > > There's unfortunately not a fast way of doing that. One option would be > > to keep a counter of "ungraphed commit objects", and have callers update > > it. Anybody admitting a pack via index-pack or unpack-objects can easily > > get this information. Commands like fast-import can do likewise, and > > "git commit" obviously increments it by one. > > > > I'm not excited about adding a new global on-disk data structure (and > > the accompanying lock). > > If we want, then we can add an optional chunk to the commit-graph file that > stores the object count. Yeah, that's probably a saner route, since we have to do the write then anyway. -Peff
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On Fri, Oct 05 2018, Jeff King wrote:

> On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
>
> > > So can we really just take (total_objects - commit_graph_objects) and
> > > compare it to some threshold?
> >
> > The commit-graph only stores the number of _commits_, not total objects.
>
> Oh, right, of course. That does throw a monkey wrench in that line of
> thought. ;)
>
> There's unfortunately not a fast way of doing that. One option would be
> to keep a counter of "ungraphed commit objects", and have callers update
> it. Anybody admitting a pack via index-pack or unpack-objects can easily
> get this information. Commands like fast-import can do likewise, and
> "git commit" obviously increments it by one.
>
> I'm not excited about adding a new global on-disk data structure (and
> the accompanying lock).

You don't really need a new global data structure to solve this problem. It would be sufficient to have git-gc itself write out a 4-line text file after it runs saying how many tags, commits, trees and blobs it found on its last run.

You can then fuzzily compare object counts vs. commit counts for the purposes of deciding whether something like the commit-graph needs to be updated, while assuming that whatever new data you have has similar enough ratios of those as your existing data. That's an assumption that'll hold well enough for big repos where this matters the most, and which tend to grow in fairly uniform ways as far as their object type ratios go.

Databases like MySQL, PostgreSQL etc. pull similar tricks with their fuzzy table statistics.
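The fuzzy comparison described above could look something like this sketch: scale the growth in total objects by the commit ratio seen at the last gc to guess how many new commits there are. The stats-file contents and all names here are hypothetical:

```python
def estimate_new_commits(last_counts, new_total):
    """last_counts: per-type object counts written at the last gc run.
    new_total: current total object count.  Assumes the new objects
    have roughly the same type mix as the existing ones."""
    old_total = sum(last_counts.values())
    commit_ratio = last_counts["commits"] / old_total
    return round((new_total - old_total) * commit_ratio)

# 400k objects at the last gc, 15% of them commits; 40k new objects
# are then assumed to contain roughly 6k new commits.
last = {"tags": 500, "commits": 60_000, "trees": 240_000, "blobs": 99_500}
assert estimate_new_commits(last, 440_000) == 6_000
```

That estimate can then be compared against a commit-count threshold to decide whether the commit-graph is worth rewriting.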
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On 10/5/2018 3:47 PM, Jeff King wrote: On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote: So can we really just take (total_objects - commit_graph_objects) and compare it to some threshold? The commit-graph only stores the number of _commits_, not total objects. Oh, right, of course. That does throw a monkey wrench in that line of thought. ;) There's unfortunately not a fast way of doing that. One option would be to keep a counter of "ungraphed commit objects", and have callers update it. Anybody admitting a pack via index-pack or unpack-objects can easily get this information. Commands like fast-import can do likewise, and "git commit" obviously increments it by one. I'm not excited about adding a new global on-disk data structure (and the accompanying lock). If we want, then we can add an optional chunk to the commit-graph file that stores the object count.
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote: > > So can we really just take (total_objects - commit_graph_objects) and > > compare it to some threshold? > > The commit-graph only stores the number of _commits_, not total objects. Oh, right, of course. That does throw a monkey wrench in that line of thought. ;) There's unfortunately not a fast way of doing that. One option would be to keep a counter of "ungraphed commit objects", and have callers update it. Anybody admitting a pack via index-pack or unpack-objects can easily get this information. Commands like fast-import can do likewise, and "git commit" obviously increments it by one. I'm not excited about adding a new global on-disk data structure (and the accompanying lock). -Peff
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On 10/5/2018 3:21 PM, Jeff King wrote: On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote: My misunderstanding was that your proposed change to gc computes the commit-graph in either of these two cases: (1) The auto-GC threshold is met. (2) There is no commit-graph file. And what I hope to have instead of (2) is (3): (3) The commit-graph file is "sufficiently behind" the tip refs. This condition is intentionally vague at the moment. It could be that we hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a pack, and it probably contains a lot of new commits") or we could create some more complicated condition based on counting reachable commits with infinite generation number (the number of commits not in the commit-graph file). I like that you are moving forward to make the commit-graph be written more frequently, but I'm trying to push us in a direction of writing it even more often than your proposed strategy. We should avoid creating too many orthogonal conditions that trigger the commit-graph write, which is why I'm pushing on your design here. Anyone else have thoughts on this direction? Yes, I think measuring "sufficiently behind" is the right thing. Everything else is a proxy or heuristic, and will run into corner cases. E.g., I have some small number of objects and then do a huge fetch, and now my commit-graph only covers 5% of what's available. We know how many objects are in the graph already. And it's not too expensive to get the number of objects in the repository. We can do the same sampling for loose objects that "gc --auto" does, and counting packed objects just involves opening up the .idx files (that can be slow if you have a ton of packs, but you'd want to either repack or use a .midx in that case anyway, either of which would help here). So can we really just take (total_objects - commit_graph_objects) and compare it to some threshold? The commit-graph only stores the number of _commits_, not total objects. 
Azure Repos' commit-graph does store the total number of objects, and that is how we trigger updating the graph, so it is not unreasonable to use that as a heuristic. Thanks, -Stolee
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote: > My misunderstanding was that your proposed change to gc computes the > commit-graph in either of these two cases: > > (1) The auto-GC threshold is met. > > (2) There is no commit-graph file. > > And what I hope to have instead of (2) is (3): > > (3) The commit-graph file is "sufficiently behind" the tip refs. > > This condition is intentionally vague at the moment. It could be that we > hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a > pack, and it probably contains a lot of new commits") or we could create > some more complicated condition based on counting reachable commits with > infinite generation number (the number of commits not in the commit-graph > file). > > I like that you are moving forward to make the commit-graph be written more > frequently, but I'm trying to push us in a direction of writing it even more > often than your proposed strategy. We should avoid creating too many > orthogonal conditions that trigger the commit-graph write, which is why I'm > pushing on your design here. > > Anyone else have thoughts on this direction? Yes, I think measuring "sufficiently behind" is the right thing. Everything else is a proxy or heuristic, and will run into corner cases. E.g., I have some small number of objects and then do a huge fetch, and now my commit-graph only covers 5% of what's available. We know how many objects are in the graph already. And it's not too expensive to get the number of objects in the repository. We can do the same sampling for loose objects that "gc --auto" does, and counting packed objects just involves opening up the .idx files (that can be slow if you have a ton of packs, but you'd want to either repack or use a .midx in that case anyway, either of which would help here). So can we really just take (total_objects - commit_graph_objects) and compare it to some threshold? -Peff
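The object-count estimate described above can be sketched as follows. It assumes the v2 pack .idx layout (the last fan-out table entry, at byte offset 8 + 255*4, holds the pack's object count) and the same objects/17 sampling trick "gc --auto" uses for loose objects; the function name and the synthetic test layout are illustrative:

```python
import os
import struct
import tempfile

def estimate_object_count(objects_dir):
    # Loose objects: count one fan-out bucket and scale by 256.
    loose = 0
    sample = os.path.join(objects_dir, "17")
    if os.path.isdir(sample):
        loose = 256 * len(os.listdir(sample))
    # Packed objects: read each v2 .idx's last fan-out entry.
    packed = 0
    pack_dir = os.path.join(objects_dir, "pack")
    if os.path.isdir(pack_dir):
        for name in os.listdir(pack_dir):
            if name.endswith(".idx"):
                with open(os.path.join(pack_dir, name), "rb") as f:
                    f.seek(8 + 255 * 4)          # last fan-out slot
                    packed += struct.unpack(">I", f.read(4))[0]
    return loose + packed

# Synthetic layout mimicking .git/objects: one loose object in bucket
# 17, and one pack index claiming 1234 objects.
d = tempfile.mkdtemp()
os.makedirs(os.path.join(d, "pack"))
os.makedirs(os.path.join(d, "17"))
open(os.path.join(d, "17", "deadbeef"), "wb").close()
fanout = [0] * 255 + [1234]
with open(os.path.join(d, "pack", "pack-test.idx"), "wb") as f:
    f.write(b"\xfftOc" + struct.pack(">I", 2))   # v2 magic + version
    f.write(struct.pack(">256I", *fanout))
assert estimate_object_count(d) == 256 + 1234
```

The estimate only needs to be roughly right, which is why sampling a single loose-object bucket is acceptable here, just as it is for the existing "gc --auto" threshold check.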
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On Fri, Oct 05 2018, Derrick Stolee wrote: > On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote: >> On Fri, Oct 05 2018, Derrick Stolee wrote: >> >>> On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote: I don't have time to polish this up for submission now, but here's a WIP patch that implements this, highlights: * There's a gc.clone.autoDetach=false default setting which overrides gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a --cloning option to indicate this). >>> I'll repeat that it could make sense to do the same thing on clone >>> _and_ fetch. Perhaps a "--post-fetch" flag would be good here to >>> communicate that we just downloaded a pack from a remote. >> I don't think that makes sense, but let's talk about why, because maybe >> I've missed something, you're certainly more familiar with the >> commit-graph than I am. >> >> The reason to do it on clone as a special-case or when the file is >> missing, is because we know the file is desired (via the GC config), and >> presumably is expected to help performance, and we have 0% of it. So by >> going from 0% to 100% on clone we'll get fast --contains and other >> goodies the graph helps with. >> >> But when we're doing a fetch, or really anything else that runs "git gc >> --auto" we can safely assume that we have a recent enough graph, because >> it will have been run whenever auto-gc kicked in. >> >> I.e.: >> >> # Slow, if we assume background forked commit-graph generation >> # (which I'm avoiding) >> git clone x && cd x && git tag --contains >> # Fast enough, since we have an existing commit-graph >> cd x && git fetch && git tag --contains >> >> I *do* think it might make sense to in general split off parts of "gc >> --auto" that we'd like to be more aggressive about, simply because the >> ratio of how long it takes to do, and how much it helps with performance >> makes more sense than a full repack, which is what the current heuristic >> is based on. 
>> And maybe when we run in that mode we should run in the foreground, but
>> I don't see why git-fetch should be a special case there, and in this
>> regard, the gc.clone.autoDetach=false setting I've made doesn't make
>> much sense. I.e. maybe we should also skip forking to the background in
>> such a mode when we trigger such a "mini gc" via git-commit or whatever.
>
> My misunderstanding was that your proposed change to gc computes the
> commit-graph in either of these two cases:
>
> (1) The auto-GC threshold is met.
>
> (2) There is no commit-graph file.
>
> And what I hope to have instead of (2) is (3):
>
> (3) The commit-graph file is "sufficiently behind" the tip refs.
>
> This condition is intentionally vague at the moment. It could be that
> we hint that (3) holds by saying "--post-fetch" (i.e. "We just
> downloaded a pack, and it probably contains a lot of new commits") or
> we could create some more complicated condition based on counting
> reachable commits with infinite generation number (the number of
> commits not in the commit-graph file).
>
> I like that you are moving forward to make the commit-graph be written
> more frequently, but I'm trying to push us in a direction of writing
> it even more often than your proposed strategy. We should avoid
> creating too many orthogonal conditions that trigger the commit-graph
> write, which is why I'm pushing on your design here.
>
> Anyone else have thoughts on this direction?

Ah. I see. I think #3 makes perfect sense, but probably makes sense to do as a follow-up, or maybe you'd like to stick a patch on top of the series I have when I send it. I don't know how to write the "I'm not quite happy about the commit graph" code :)

What I will do is refactor gc.c a bit and leave it in a state where it's going to be really easy to change the existing "we have no commit graph, and thus should do the optimization step" to have some more complex condition instead of "we have no commit graph", i.e.
your "we just grabbed a lot of data". Also, I'll drop the gc.clone.autoDetach=false setting and name it something more general, maybe gc.AutoDetachOnBigOptimization=false? Anyway, something more generic, so that "clone" will always pass in some option saying "expect a large % commit graph update" (100% in its case), and then in "fetch" we could have some detection of how big what we just got from the server is, and do the same.

This seems to me to be the most general thing that would make sense, and could also be extended e.g. to "git commit" and other users of gc --auto. If I started with a README file in an empty repo and then made a commit adding 1 million files all in one commit, we'd (depending on that setting) also block in the foreground and generate the commit-graph.
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote: On Fri, Oct 05 2018, Derrick Stolee wrote: On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote: I don't have time to polish this up for submission now, but here's a WIP patch that implements this, highlights: * There's a gc.clone.autoDetach=false default setting which overrides gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a --cloning option to indicate this). I'll repeat that it could make sense to do the same thing on clone _and_ fetch. Perhaps a "--post-fetch" flag would be good here to communicate that we just downloaded a pack from a remote. I don't think that makes sense, but let's talk about why, because maybe I've missed something, you're certainly more familiar with the commit-graph than I am. The reason to do it on clone as a special-case or when the file is missing, is because we know the file is desired (via the GC config), and presumably is expected to help performance, and we have 0% of it. So by going from 0% to 100% on clone we'll get fast --contains and other goodies the graph helps with. But when we're doing a fetch, or really anything else that runs "git gc --auto" we can safely assume that we have a recent enough graph, because it will have been run whenever auto-gc kicked in. I.e.: # Slow, if we assume background forked commit-graph generation # (which I'm avoiding) git clone x && cd x && git tag --contains # Fast enough, since we have an existing commit-graph cd x && git fetch && git tag --contains I *do* think it might make sense to in general split off parts of "gc --auto" that we'd like to be more aggressive about, simply because the ratio of how long it takes to do, and how much it helps with performance makes more sense than a full repack, which is what the current heuristic is based on. 
And maybe when we run in that mode we should run in the foreground, but I don't see why git-fetch should be a special case there, and in this regard, the gc.clone.autoDetach=false setting I've made doesn't make much sense. I.e. maybe we should also skip forking to the background in such a mode when we trigger such a "mini gc" via git-commit or whatever.

My misunderstanding was that your proposed change to gc computes the commit-graph in either of these two cases:

(1) The auto-GC threshold is met.

(2) There is no commit-graph file.

And what I hope to have instead of (2) is (3):

(3) The commit-graph file is "sufficiently behind" the tip refs.

This condition is intentionally vague at the moment. It could be that we hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a pack, and it probably contains a lot of new commits") or we could create some more complicated condition based on counting reachable commits with infinite generation number (the number of commits not in the commit-graph file).

I like that you are moving forward to make the commit-graph be written more frequently, but I'm trying to push us in a direction of writing it even more often than your proposed strategy. We should avoid creating too many orthogonal conditions that trigger the commit-graph write, which is why I'm pushing on your design here.

Anyone else have thoughts on this direction?

Thanks, -Stolee
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On Fri, Oct 05 2018, Derrick Stolee wrote: > On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote: >> I don't have time to polish this up for submission now, but here's a WIP >> patch that implements this, highlights: >> >> * There's a gc.clone.autoDetach=false default setting which overrides >> gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a >> --cloning option to indicate this). > > I'll repeat that it could make sense to do the same thing on clone > _and_ fetch. Perhaps a "--post-fetch" flag would be good here to > communicate that we just downloaded a pack from a remote. I don't think that makes sense, but let's talk about why, because maybe I've missed something, you're certainly more familiar with the commit-graph than I am. The reason to do it on clone as a special-case or when the file is missing, is because we know the file is desired (via the GC config), and presumably is expected to help performance, and we have 0% of it. So by going from 0% to 100% on clone we'll get fast --contains and other goodies the graph helps with. But when we're doing a fetch, or really anything else that runs "git gc --auto" we can safely assume that we have a recent enough graph, because it will have been run whenever auto-gc kicked in. I.e.: # Slow, if we assume background forked commit-graph generation # (which I'm avoiding) git clone x && cd x && git tag --contains # Fast enough, since we have an existing commit-graph cd x && git fetch && git tag --contains I *do* think it might make sense to in general split off parts of "gc --auto" that we'd like to be more aggressive about, simply because the ratio of how long it takes to do, and how much it helps with performance makes more sense than a full repack, which is what the current heuristic is based on. 
And maybe when we run in that mode we should run in the foreground,
but I don't see why git-fetch should be a special case there, and in
this regard, the gc.clone.autoDetach=false setting I've made doesn't
make much sense. I.e. maybe we should also skip forking to the
background in such a mode when we trigger such a "mini gc" via
git-commit or whatever.

>> * A clone of say git.git with gc.writeCommitGraph=true looks like:
>>
>> [...]
>> Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
>> Resolving deltas: 100% (188947/188947), done.
>> Computing commit graph generation numbers: 100% (55210/55210), done.
>
> This looks like good UX. Thanks for the progress here!
>
>> * The 'git gc --auto' command also knows to (only) run the commit-graph
>> (and space is left for future optimization steps) if general GC isn't
>> needed, but we need "optimization":
>>
>> $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c
>> gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
>> Annotating commits in commit graph: 341229, done.
>> Computing commit graph generation numbers: 100% (165969/165969), done.
>> $
>
> Will this also trigger a full commit-graph rewrite on every 'git
> commit' command?

Nope, because "git commit" can safely be assumed to have some
commit-graph anyway, and I'm just special casing the case where it
doesn't exist. But if it doesn't exist and you do a "git commit" then
"gc --auto" will be run, and we'll fork to the background and generate
it...

> Or is there some way we can compute the staleness of
> the commit-graph in order to only update if we get too far ahead?
> Previously, this was solved by relying on the auto-GC threshold.

So re the "I don't think that makes sense..." at the start of my
E-Mail, isn't it fine to rely on the default thresholds here, or
should we be more aggressive?

>> * The patch to gc.c looks less scary with -w, most of it is indenting
>> the existing pack-refs etc.
>> with a "!auto_gc || should_gc" condition.
>>
>> * I added a commit_graph_exists() function and only care if I
>> get ENOENT for the purposes of this gc mode. This would need to be
>> tweaked for the incremental mode Derrick talks about, but if we just
>> set "should_optimize" that'll also work as far as gc --auto is
>> concerned (e.g. on fetch, am etc.)
>
> The incremental mode would operate the same as split-index, which
> means we will still look for .git/objects/info/commit-graph. That file
> may point us to more files.

Ah!

>> +int commit_graph_exists(const char *graph_file)
>> +{
>> +	struct stat st;
>> +	if (stat(graph_file, &st)) {
>> +		if (errno == ENOENT)
>> +			return 0;
>> +		else
>> +			return -1;
>> +	}
>> +	return 1;
>> +}
>> +
>
> This method serves a very similar purpose to
> generation_numbers_enabled(), except your method only cares about the
> file existing. It ignores information like `core.commitGraph`, which
> should keep us from doing anything with the commit-graph file if
> false.
>
> Nothing about your
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
> I don't have time to polish this up for submission now, but here's a
> WIP patch that implements this, highlights:
>
> * There's a gc.clone.autoDetach=false default setting which overrides
>   gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass
>   a --cloning option to indicate this).

I'll repeat that it could make sense to do the same thing on clone
_and_ fetch. Perhaps a "--post-fetch" flag would be good here to
communicate that we just downloaded a pack from a remote.

> * A clone of say git.git with gc.writeCommitGraph=true looks like:
>
> [...]
> Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
> Resolving deltas: 100% (188947/188947), done.
> Computing commit graph generation numbers: 100% (55210/55210), done.

This looks like good UX. Thanks for the progress here!

> * The 'git gc --auto' command also knows to (only) run the commit-graph
>   (and space is left for future optimization steps) if general GC isn't
>   needed, but we need "optimization":
>
>   $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c
>   gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
>   Annotating commits in commit graph: 341229, done.
>   Computing commit graph generation numbers: 100% (165969/165969), done.
>   $

Will this also trigger a full commit-graph rewrite on every 'git
commit' command? Or is there some way we can compute the staleness of
the commit-graph in order to only update if we get too far ahead?
Previously, this was solved by relying on the auto-GC threshold.

> * The patch to gc.c looks less scary with -w, most of it is indenting
>   the existing pack-refs etc. with a "!auto_gc || should_gc" condition.
>
> * I added a commit_graph_exists() function and only care if I get
>   ENOENT for the purposes of this gc mode. This would need to be
>   tweaked for the incremental mode Derrick talks about, but if we just
>   set "should_optimize" that'll also work as far as gc --auto is
>   concerned (e.g. on fetch, am etc.)
The incremental mode would operate the same as split-index, which
means we will still look for .git/objects/info/commit-graph. That file
may point us to more files.

> +int commit_graph_exists(const char *graph_file)
> +{
> +	struct stat st;
> +	if (stat(graph_file, &st)) {
> +		if (errno == ENOENT)
> +			return 0;
> +		else
> +			return -1;
> +	}
> +	return 1;
> +}
> +

This method serves a very similar purpose to
generation_numbers_enabled(), except your method only cares about the
file existing. It ignores information like `core.commitGraph`, which
should keep us from doing anything with the commit-graph file if
false.

Nothing about your method is specific to the commit-graph file, since
you provide a filename as a parameter. It could easily be "int
file_exists(const char *filename)".

Thanks,
-Stolee
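The generalized helper Stolee suggests might look like this; a sketch only, with the name taken from his suggestion and the three-way return kept from the quoted patch:

```c
#include <errno.h>
#include <sys/stat.h>

/*
 * Generic version of the quoted commit_graph_exists(): nothing here
 * is specific to the commit-graph file.  Returns 1 if the path
 * exists, 0 if stat() reports ENOENT, and -1 on any other error
 * (e.g. EACCES on an unreadable parent directory).
 */
static int file_exists(const char *filename)
{
	struct stat st;
	if (stat(filename, &st)) {
		if (errno == ENOENT)
			return 0;
		return -1;
	}
	return 1;
}
```

Callers that only care about "should I build this file now?" can treat 0 as "yes" and anything else as "no".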
Re: We should add a "git gc --auto" after "git clone" due to commit graph
SZEDER Gábor writes:

>> git-gc - Cleanup unnecessary files and optimize the local repository
>>
>> Creating these indexes like the commit-graph falls under "optimize the
>> local repository",
>
> But it doesn't fall under "cleanup unnecessary files", which the
> commit-graph file is, since, strictly speaking, it's purely
> optimization.

I won't be actively engaged in this discussion soon, but I must say
that "git gc" doing "garbage collection" is merely an implementation
detail of optimizing the repository for further use. And from that
point of view, what needs to be updated is the synopsis of the git-gc
doc. It states "X and Y" above, but it actually is "Y by doing X and
other things".

I understand your "by definition there is no garbage immediately
after clone" position, and also I would understand if you find it
(perhaps philosophically) disturbing that "git clone" may give users
a suboptimal repository that immediately needs optimizing [*1*].

But that bridge was crossed a long time ago ever since pack transfer
was invented. The data source sends only the pack data stream, and
the receiving end is responsible for spending cycles to build the
.idx file. Theoretically, .pack should be all that is needed---you
should be able to locate any necessary object by parsing the .pack
file every time you open it, and .idx is mere optimization. You can
think of the .midx and graph files the same way.

I'd consider it a growing pain that these two recent inventions were
and are still built as totally optional and separate features,
requiring completely separate full enumeration of objects in the
repository that needs to happen anyway when we build .idx out of the
received .pack.
I would not be surprised by a future in which the initial index-pack
that is responsible for receiving the incoming pack stream and storing
that in .pack file(s) while creating corresponding .idx file(s)
becomes also responsible for building .midx and graph files in the
same pass, or at least a smaller number of passes. Once we gain
experience and confidence with these new auxiliary files, that ought
to happen naturally. And at that point, we won't be having this
discussion---we'd all happily run index-pack to receive the pack
data, because that is pretty much the fundamental requirement to make
use of the data.

[Footnote]

*1* Even without considering these recent inventions of auxiliary
files, cloning from a sloppily packed server whose primary focus is
to avoid spending cycles by not computing better deltas will give the
cloner a suboptimal repository. If we truly want to have an optimized
repository ready to be used after cloning, we should run an
equivalent of "repack -a -d -f" immediately after "git clone".
[RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03 2018, Ævar Arnfjörð Bjarmason wrote: > Don't have time to patch this now, but thought I'd send a note / RFC > about this. > > Now that we have the commit graph it's nice to be able to set > e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or > /etc/gitconfig to apply them to all repos. > > But when I clone e.g. linux.git stuff like 'tag --contains' will be slow > until whenever my first "gc" kicks in, which may be quite some time if > I'm just using it passively. > > So we should make "git gc --auto" be run on clone, and change the > need_to_gc() / cmd_gc() behavior so that we detect that the > gc.writeCommitGraph=true setting is on, but we have no commit graph, and > then just generate that without doing a full repack. > > As an aside such more granular "gc" would be nice for e.g. pack-refs > too. It's possible for us to just have one pack, but to have 100k loose > refs. > > It might also be good to have some gc.autoDetachOnClone option and have > it false by default, so we don't have a race condition where "clone > linux && git -C linux tag --contains" is slow because the graph hasn't > been generated yet, and generating the graph initially doesn't take that > long compared to the time to clone a large repo (and on a small one it > won't matter either way). > > I was going to say "also for midx", but of course after clone we have > just one pack, so I can't imagine us needing this. But I can see us > having other such optional side-indexes in the future generated by gc, > and they'd also benefit from this. I don't have time to polish this up for submission now, but here's a WIP patch that implements this, highlights: * There's a gc.clone.autoDetach=false default setting which overrides gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a --cloning option to indicate this). * A clone of say git.git with gc.writeCommitGraph=true looks like: [...] 
Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
Resolving deltas: 100% (188947/188947), done.
Computing commit graph generation numbers: 100% (55210/55210), done.

* The 'git gc --auto' command also knows to (only) run the commit-graph
  (and space is left for future optimization steps) if general GC isn't
  needed, but we need "optimization":

  $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c
  gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
  Annotating commits in commit graph: 341229, done.
  Computing commit graph generation numbers: 100% (165969/165969), done.
  $

* The patch to gc.c looks less scary with -w, most of it is indenting
  the existing pack-refs etc. with a "!auto_gc || should_gc" condition.

* I added a commit_graph_exists() function and only care if I get
  ENOENT for the purposes of this gc mode. This would need to be
  tweaked for the incremental mode Derrick talks about, but if we just
  set "should_optimize" that'll also work as far as gc --auto is
  concerned (e.g. on fetch, am etc.)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1546833213..5759fbb067 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1621,7 +1621,19 @@ gc.autoPackLimit::
 gc.autoDetach::
 	Make `git gc --auto` return immediately and run in background
-	if the system supports it. Default is true.
+	if the system supports it. Default is true. Overridden by
+	`gc.clone.autoDetach` when running linkgit:git-clone[1].
+
+gc.clone.autoDetach::
+	Make `git gc --auto` return immediately and run in background
+	if the system supports it when run via
+	linkgit:git-clone[1]. Default is false.
++
+The reason this defaults to false is because the only time we'll have
+work to do after a 'git clone' is if something like
+`gc.writeCommitGraph` is true, in that case we'd like to compute the
+optimized file before returning, so that say commands that benefit
+from commit graph aren't slow until it's generated in the background.

 gc.bigPackThreshold::
 	If non-zero, all packs larger than this limit are kept when

diff --git a/builtin/clone.c b/builtin/clone.c
index 15b142d646..824c130ba5 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -897,6 +897,8 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	struct remote *remote;
 	int err = 0, complete_refs_before_fetch = 1;
 	int submodule_progress;
+	const char *argv_gc_auto[] = {"gc", "--auto", "--cloning", NULL};
+	const char *argv_gc_auto_quiet[] = {"gc", "--auto", "--cloning", "--quiet", NULL};
 	struct refspec rs = REFSPEC_INIT_FETCH;
 	struct argv_array ref_prefixes = ARGV_ARRAY_INIT;
@@ -1245,5 +1247,11 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	refspec_clear(&rs);
 	argv_array_clear(&ref_prefixes);
+
+	if (0 <= option_verbosity)
+
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03 2018, Jeff King wrote: > On Wed, Oct 03, 2018 at 12:08:15PM -0700, Stefan Beller wrote: > >> I share these concerns in a slightly more abstract way, as >> I would bucket the actions into two separate bins: >> >> One bin that throws away information. >> this would include removing expired reflog entries (which >> I do not think are garbage, or collection thereof), but their >> usefulness is questionable. >> >> The other bin would be actions that optimize but >> do not throw away any information, repacking (without >> dropping files) would be part of it, or the new >> "write additional files". >> >> Maybe we can move all actions of the second bin into a new >> "git optimize" command, and git gc would do first the "throw away >> things" and then the optimize action, whereas clone would only >> go for the second optimizing part? > > One problem with that world-view is that some of the operations do > _both_, for efficiency. E.g., repacking will drop unreachable objects in > too-old packs. We could actually be more aggressive in combining things > here. For instance, a full object graph walk in linux.git takes 30-60 > seconds, depending on your CPU. But we do it at least twice during a gc: > once to repack, and then again to determine reachability for pruning. > > If you generate bitmaps during the repack step, you can use them during > the prune step. But by itself, the cost of generating the bitmaps > generally outweighs the extra walk. So it's not worth generating them > _just_ for this (but is an obvious optimization for a server which would > be generating them anyway). I don't mean to fan the flames of this obviously controversial "git gc does optimization" topic (which I didn't suspect there would be a debate about...), but a related thing I was wondering about the other day is whether we could have a gc.fsck option, and in the background do fsck while we were at it, and report this back via some facility like gc.log[1]. 
That would also fall into this category of more work we could do while we're doing a full walk anyway, but as with what you're suggesting would require some refactoring. 1. Well, one that doesn't suck, see https://public-inbox.org/git/87inc89j38@evledraar.gmail.com/ / https://public-inbox.org/git/87d0vmck55@evledraar.gmail.com/ etc.
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03, 2018 at 12:08:15PM -0700, Stefan Beller wrote: > I share these concerns in a slightly more abstract way, as > I would bucket the actions into two separate bins: > > One bin that throws away information. > this would include removing expired reflog entries (which > I do not think are garbage, or collection thereof), but their > usefulness is questionable. > > The other bin would be actions that optimize but > do not throw away any information, repacking (without > dropping files) would be part of it, or the new > "write additional files". > > Maybe we can move all actions of the second bin into a new > "git optimize" command, and git gc would do first the "throw away > things" and then the optimize action, whereas clone would only > go for the second optimizing part? One problem with that world-view is that some of the operations do _both_, for efficiency. E.g., repacking will drop unreachable objects in too-old packs. We could actually be more aggressive in combining things here. For instance, a full object graph walk in linux.git takes 30-60 seconds, depending on your CPU. But we do it at least twice during a gc: once to repack, and then again to determine reachability for pruning. If you generate bitmaps during the repack step, you can use them during the prune step. But by itself, the cost of generating the bitmaps generally outweighs the extra walk. So it's not worth generating them _just_ for this (but is an obvious optimization for a server which would be generating them anyway). -Peff
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03, 2018 at 02:59:34PM -0400, Derrick Stolee wrote: > > They don't help yet, and there's no good reason to enable bitmaps for > > clients. I have a few patches that use bitmaps for things like > > ahead/behind and --contains checks, but the utility of those may be > > lessened quite a bit by Stolee's commit-graph work. And if it isn't, > > I'm mildly in favor of replacing the existing .bitmap format with > > something better integrated with commit-graphs (which would give us an > > opportunity to clean up some of the rough edges). > > If the commit-graph doesn't improve enough on those applications, then we > could consider adding a commit-to-commit reachability bitmap inside the > commit-graph. ;) That unfortunately wouldn't be enough for us to ditch the existing .bitmap files, since we need full object reachability for some cases (including packing). And commit-to-commit reachability is a trivial subset of that. I'm not sure if it would be better to just leave .bitmaps in place as a server-side thing, and grow a new thing for commit-to-commit reachability (since it would presumably be easier). I'm still excited about the prospect of a bloom filter for paths which each commit touches. I think that's the next big frontier in getting things like "git log -- path" to a reasonable run-time. -Peff
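For context on that last idea: a per-commit Bloom filter of touched paths lets "git log -- path" skip commits that definitely did not touch the path, at the cost of occasional false positives that still need a real tree diff. A toy sketch of the data structure; the size, hash function, and probe count are arbitrary choices for illustration, not a proposed on-disk format:

```c
#include <assert.h>
#include <stdint.h>

#define BLOOM_BITS 256

/* One tiny Bloom filter, e.g. one per commit for its changed paths. */
struct bloom {
	uint64_t bits[BLOOM_BITS / 64];
};

/* FNV-1a with a caller-supplied seed, so we can derive two hashes. */
static uint64_t hash_path(const char *s, uint64_t h)
{
	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 1099511628211ULL;
	}
	return h;
}

static void bloom_add(struct bloom *b, const char *path)
{
	uint64_t h1 = hash_path(path, 14695981039346656037ULL);
	uint64_t h2 = hash_path(path, 0x9e3779b97f4a7c15ULL);
	int i;
	for (i = 0; i < 4; i++) {
		/* double hashing: derive 4 probe positions from 2 hashes */
		unsigned idx = (unsigned)((h1 + (uint64_t)i * h2) % BLOOM_BITS);
		b->bits[idx / 64] |= 1ULL << (idx % 64);
	}
}

/* 1 = "maybe touched" (must verify), 0 = "definitely not touched". */
static int bloom_maybe_contains(const struct bloom *b, const char *path)
{
	uint64_t h1 = hash_path(path, 14695981039346656037ULL);
	uint64_t h2 = hash_path(path, 0x9e3779b97f4a7c15ULL);
	int i;
	for (i = 0; i < 4; i++) {
		unsigned idx = (unsigned)((h1 + (uint64_t)i * h2) % BLOOM_BITS);
		if (!(b->bits[idx / 64] & (1ULL << (idx % 64))))
			return 0;
	}
	return 1;
}
```

A real implementation would of course have to bound the filter size per commit and fall back to tree diffing when a commit touches too many paths for the filter to stay useful.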
Re: We should add a "git gc --auto" after "git clone" due to commit graph
> > But you thought right, I do have an objection against that. 'git gc' > should, well, collect garbage. Any non-gc stuff is already violating > separation of concerns. I share these concerns in a slightly more abstract way, as I would bucket the actions into two separate bins: One bin that throws away information. this would include removing expired reflog entries (which I do not think are garbage, or collection thereof), but their usefulness is questionable. The other bin would be actions that optimize but do not throw away any information, repacking (without dropping files) would be part of it, or the new "write additional files". Maybe we can move all actions of the second bin into a new "git optimize" command, and git gc would do first the "throw away things" and then the optimize action, whereas clone would only go for the second optimizing part?
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On 10/3/2018 2:51 PM, Jeff King wrote:
> On Wed, Oct 03, 2018 at 08:47:11PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Wed, Oct 03 2018, Stefan Beller wrote:
>>>> So we wouldn't be spending 5 minutes repacking linux.git right after
>>>> cloning it, just ~10s generating the commit graph, and the same would
>>>> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
>>>> which would kick off "gc --auto" in the background and do the same thing.
>>>
>>> Or generating local bitmaps or pack idx files as well?
>>
>> I'm less familiar with this area, but when I clone I get a pack *.idx
>> file, why does it need to be regenerated?
>>
>> But yeah, in principle this would be a sensible addition, but I'm not
>> aware of cases where clients get significant benefits from bitmaps (see
>> https://githubengineering.com/counting-objects/), and I don't turn it on
>> for clients, but maybe I've missed something.
>
> They don't help yet, and there's no good reason to enable bitmaps for
> clients. I have a few patches that use bitmaps for things like
> ahead/behind and --contains checks, but the utility of those may be
> lessened quite a bit by Stolee's commit-graph work. And if it isn't,
> I'm mildly in favor of replacing the existing .bitmap format with
> something better integrated with commit-graphs (which would give us an
> opportunity to clean up some of the rough edges).

If the commit-graph doesn't improve enough on those applications, then
we could consider adding a commit-to-commit reachability bitmap inside
the commit-graph. ;)

-Stolee
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03, 2018 at 08:47:11PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, Oct 03 2018, Stefan Beller wrote:
>
>>> So we wouldn't be spending 5 minutes repacking linux.git right after
>>> cloning it, just ~10s generating the commit graph, and the same would
>>> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
>>> which would kick off "gc --auto" in the background and do the same thing.
>>
>> Or generating local bitmaps or pack idx files as well?
>
> I'm less familiar with this area, but when I clone I get a pack *.idx
> file, why does it need to be regenerated?
>
> But yeah, in principle this would be a sensible addition, but I'm not
> aware of cases where clients get significant benefits from bitmaps (see
> https://githubengineering.com/counting-objects/), and I don't turn it on
> for clients, but maybe I've missed something.

They don't help yet, and there's no good reason to enable bitmaps for
clients. I have a few patches that use bitmaps for things like
ahead/behind and --contains checks, but the utility of those may be
lessened quite a bit by Stolee's commit-graph work. And if it isn't,
I'm mildly in favor of replacing the existing .bitmap format with
something better integrated with commit-graphs (which would give us an
opportunity to clean up some of the rough edges).

-Peff
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03 2018, Stefan Beller wrote:

>> So we wouldn't be spending 5 minutes repacking linux.git right after
>> cloning it, just ~10s generating the commit graph, and the same would
>> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
>> which would kick off "gc --auto" in the background and do the same thing.
>
> Or generating local bitmaps or pack idx files as well?

I'm less familiar with this area, but when I clone I get a pack *.idx
file, why does it need to be regenerated?

But yeah, in principle this would be a sensible addition, but I'm not
aware of cases where clients get significant benefits from bitmaps (see
https://githubengineering.com/counting-objects/), and I don't turn it on
for clients, but maybe I've missed something.
Re: We should add a "git gc --auto" after "git clone" due to commit graph
> So we wouldn't be spending 5 minutes repacking linux.git right after
> cloning it, just ~10s generating the commit graph, and the same would
> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
> which would kick off "gc --auto" in the background and do the same thing.

Or generating local bitmaps or pack idx files as well?
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03, 2018 at 05:19:41PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> >> >> So we should make "git gc --auto" be run on clone, > >> >> > > >> >> > There is no garbage after 'git clone'... > >> >> > >> >> "git gc" is really "git gc-or-create-indexes" these days. > >> > > >> > Because it happens to be convenient to create those indexes at > >> > gc-time. But that should not be an excuse to run gc when by > >> > definition no gc is needed. > >> > >> Ah, I thought you just had an objection to the "gc" name being used for > >> non-gc stuff, > > > > But you thought right, I do have an objection against that. 'git gc' > > should, well, collect garbage. Any non-gc stuff is already violating > > separation of concerns. > > Ever since git-gc was added back in 30f610b7b0 ("Create 'git gc' to > perform common maintenance operations.", 2006-12-27) it has been > described as: > > git-gc - Cleanup unnecessary files and optimize the local repository > > Creating these indexes like the commit-graph falls under "optimize the > local repository", But it doesn't fall under "cleanup unnecessary files", which the commit-graph file is, since, strictly speaking, it's purely optimization. That description came about, because cleaning up unnecessary files, notably combining lots of loose refs into a single packed-refs file and combining lots of loose objects and pack files into a single pack file, could not only make the repository smaller (barring too many exploding unreachable objects), but, as it turned out, could also make Git operations in that repository faster. To me, the main goal of the command is cleanup. Optimization, however beneficial, is its side effect, and I assume the "optimize" part was added to the description mainly to inform and "encourage" users. After all, the command is called 'git gc', not 'git optimize-repo'. > and 3rd party tools (e.g. the repo tool doing this > came up on list recently) have been calling "gc --auto" with this > assumption. 
>
> >> but if you mean we shouldn't do a giant repack right after
> >> clone I agree.
> >
> > And, I also mean that since 'git clone' knows that there can't
> > possibly be any garbage in the first place, then it shouldn't call 'gc
> > --auto' at all. However, since it also knows that there is a lot of
> > new stuff, then it should create a commit-graph if enabled.
>
> Is this something you think just because the tool isn't called
> git-gc-and-optimize, or do you think this regardless of what it's
> called?

Well, that still has 'gc' in it...

> I don't see how splitting up the entry points for "detect if we need to
> cleanup or optimize the repo" leaves us with a better codebase for the
> reasons noted in
> https://public-inbox.org/git/87pnwrgll2@evledraar.gmail.com/

Such a separation would be valuable for those having gc.auto = 0 in
their config. Or, in general, to have a clearly marked entry point to
update all the enabled "purely-optimization" files without 'gc'
exploding a bunch of "just-became-unreachable" objects from deleted
reflog entries and packfiles, or without performing a comparatively
expensive repacking.

Note the "clearly marked"; I don't think teaching 'gc [--auto]'
various tricks to only create/update these files without doing what
it is fundamentally supposed to do qualifies for that.
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 3, 2018 at 3:23 PM Ævar Arnfjörð Bjarmason wrote: > > Don't have time to patch this now, but thought I'd send a note / RFC > about this. > > Now that we have the commit graph it's nice to be able to set > e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or > /etc/gitconfig to apply them to all repos. > > But when I clone e.g. linux.git stuff like 'tag --contains' will be slow > until whenever my first "gc" kicks in, which may be quite some time if > I'm just using it passively. Since you have core.hooksPath, you can already force gc (even in detached mode) in post-checkout hook. I'm adding a new "post-init-repo" hook to allow more customization right after clone or init-db which may also be useful for other things than gc. -- Duy
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03 2018, SZEDER Gábor wrote: > On Wed, Oct 03, 2018 at 04:22:12PM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> On Wed, Oct 03 2018, SZEDER Gábor wrote: >> >> > On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> >> >> On Wed, Oct 03 2018, SZEDER Gábor wrote: >> >> >> >> > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> >> Don't have time to patch this now, but thought I'd send a note / RFC >> >> >> about this. >> >> >> >> >> >> Now that we have the commit graph it's nice to be able to set >> >> >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig >> >> >> or >> >> >> /etc/gitconfig to apply them to all repos. >> >> >> >> >> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be >> >> >> slow >> >> >> until whenever my first "gc" kicks in, which may be quite some time if >> >> >> I'm just using it passively. >> >> >> >> >> >> So we should make "git gc --auto" be run on clone, >> >> > >> >> > There is no garbage after 'git clone'... >> >> >> >> "git gc" is really "git gc-or-create-indexes" these days. >> > >> > Because it happens to be convenient to create those indexes at >> > gc-time. But that should not be an excuse to run gc when by >> > definition no gc is needed. >> >> Ah, I thought you just had an objection to the "gc" name being used for >> non-gc stuff, > > But you thought right, I do have an objection against that. 'git gc' > should, well, collect garbage. Any non-gc stuff is already violating > separation of concerns. Ever since git-gc was added back in 30f610b7b0 ("Create 'git gc' to perform common maintenance operations.", 2006-12-27) it has been described as: git-gc - Cleanup unnecessary files and optimize the local repository Creating these indexes like the commit-graph falls under "optimize the local repository", and 3rd party tools (e.g. the repo tool doing this came up on list recently) have been calling "gc --auto" with this assumption. 
>> but if you mean we shouldn't do a giant repack right after >> clone I agree. > > And, I also mean that since 'git clone' knows that there can't > possibly be any garbage in the first place, then it shouldn't call 'gc > --auto' at all. However, since it also knows that there is a lot of > new stuff, then it should create a commit-graph if enabled. Is this something you think just because the tool isn't called git-gc-and-optimize, or do you think this regardless of what it's called? I don't see how splitting up the entry points for "detect if we need to cleanup or optimize the repo" leaves us with a better codebase for the reasons noted in https://public-inbox.org/git/87pnwrgll2@evledraar.gmail.com/
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03, 2018 at 04:22:12PM +0200, Ævar Arnfjörð Bjarmason wrote: > > On Wed, Oct 03 2018, SZEDER Gábor wrote: > > > On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> > >> On Wed, Oct 03 2018, SZEDER Gábor wrote: > >> > >> > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> >> Don't have time to patch this now, but thought I'd send a note / RFC > >> >> about this. > >> >> > >> >> Now that we have the commit graph it's nice to be able to set > >> >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or > >> >> /etc/gitconfig to apply them to all repos. > >> >> > >> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow > >> >> until whenever my first "gc" kicks in, which may be quite some time if > >> >> I'm just using it passively. > >> >> > >> >> So we should make "git gc --auto" be run on clone, > >> > > >> > There is no garbage after 'git clone'... > >> > >> "git gc" is really "git gc-or-create-indexes" these days. > > > > Because it happens to be convenient to create those indexes at > > gc-time. But that should not be an excuse to run gc when by > > definition no gc is needed. > > Ah, I thought you just had an objection to the "gc" name being used for > non-gc stuff, But you thought right, I do have an objection against that. 'git gc' should, well, collect garbage. Any non-gc stuff is already violating separation of concerns. > but if you mean we shouldn't do a giant repack right after > clone I agree. And, I also mean that since 'git clone' knows that there can't possibly be any garbage in the first place, then it shouldn't call 'gc --auto' at all. However, since it also knows that there is a lot of new stuff, then it should create a commit-graph if enabled. > I meant that "gc --auto" would learn to do a subset of > its work, instead of the current "I have work to do, let's do all of > pack-refs/repack/commit-graph etc.". 
> > So we wouldn't be spending 5 minutes repacking linux.git right after > cloning it, just ~10s generating the commit graph, and the same would > happen if you rm'd .git/objects/info/commit-graph and ran "git commit", > which would kick off "gc --auto" in the background and do the same thing.
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 3, 2018 at 4:01 PM Ævar Arnfjörð Bjarmason wrote: > >> and change the > >> need_to_gc() / cmd_gc() behavior so that we detect that the > >> gc.writeCommitGraph=true setting is on, but we have no commit graph, and > >> then just generate that without doing a full repack. > > > > Or just teach 'git clone' to run 'git commit-graph write ...' > > Then when adding something like the commit graph we'd need to patch both > git-clone and git-gc, it's much more straightforward to make > need_to_gc() more granular. It is straightforward and misleading. If we organize the code well, patching both would not take much more effort and it reduces wtf? moments reading the code. -- Duy
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03 2018, SZEDER Gábor wrote: > On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> On Wed, Oct 03 2018, SZEDER Gábor wrote: >> >> > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> Don't have time to patch this now, but thought I'd send a note / RFC >> >> about this. >> >> >> >> Now that we have the commit graph it's nice to be able to set >> >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or >> >> /etc/gitconfig to apply them to all repos. >> >> >> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow >> >> until whenever my first "gc" kicks in, which may be quite some time if >> >> I'm just using it passively. >> >> >> >> So we should make "git gc --auto" be run on clone, >> > >> > There is no garbage after 'git clone'... >> >> "git gc" is really "git gc-or-create-indexes" these days. > > Because it happens to be convenient to create those indexes at > gc-time. But that should not be an excuse to run gc when by > definition no gc is needed. Ah, I thought you just had an objection to the "gc" name being used for non-gc stuff, but if you mean we shouldn't do a giant repack right after clone I agree. I meant that "gc --auto" would learn to do a subset of its work, instead of the current "I have work to do, let's do all of pack-refs/repack/commit-graph etc.". So we wouldn't be spending 5 minutes repacking linux.git right after cloning it, just ~10s generating the commit graph, and the same would happen if you rm'd .git/objects/info/commit-graph and ran "git commit", which would kick off "gc --auto" in the background and do the same thing.
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03 2018, Derrick Stolee wrote: > On 10/3/2018 9:36 AM, SZEDER Gábor wrote: >> On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote: >>> Don't have time to patch this now, but thought I'd send a note / RFC >>> about this. >>> >>> Now that we have the commit graph it's nice to be able to set >>> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or >>> /etc/gitconfig to apply them to all repos. >>> >>> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow >>> until whenever my first "gc" kicks in, which may be quite some time if >>> I'm just using it passively. >>> >>> So we should make "git gc --auto" be run on clone, >> There is no garbage after 'git clone'... > > And since there is no garbage, the gc will not write the commit-graph. I should probably have replied to this instead of SZEDER's in https://public-inbox.org/git/87r2h7gmd7@evledraar.gmail.com/ but anyway, my 0.02 on that is there. >> >>> and change the >>> need_to_gc() / cmd_gc() behavior so that we detect that the >>> gc.writeCommitGraph=true setting is on, but we have no commit graph, and >>> then just generate that without doing a full repack. >> Or just teach 'git clone' to run 'git commit-graph write ...' > > I plan to add a 'fetch.writeCommitGraph' config setting. I was waiting > until the file is incremental (on my to-do list soon), so the write is > fast when only adding a few commits at a time. This would cover the > clone case, too. It's re-arranging deck chairs on the Titanic at this point, but this approach seems like the wrong way to go in this whole "do we have crap to do?" git-gc state-machine. In my mind we should have only one entry point into that, and there shouldn't be magic like "here's the gc-ish stuff we do on fetch". Because if we care about a bunch of new commits being added on "fetch", that can also happen on "commit", "am", "merge", all of which run "gc --auto" now. 
Which is why I'm suggesting that we could add a sub-mode in need_to_gc() that detects if a file we want to generate is entirely missing, which is extendable to future formats, and the only caveat at that point is if we'd like that subset of "gc" to block and run in the foreground in the "clone" (or "fetch", ...) case. And then if we have a desire to incrementally add recently added commits to such formats, "gc --auto" could learn to consume reflogs or some other general inventory of "stuff added since last gc", and then we wouldn't have to instrument "fetch" specifically, the same would work for "commit", "am", "merge" etc.
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote: > > On Wed, Oct 03 2018, SZEDER Gábor wrote: > > > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> Don't have time to patch this now, but thought I'd send a note / RFC > >> about this. > >> > >> Now that we have the commit graph it's nice to be able to set > >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or > >> /etc/gitconfig to apply them to all repos. > >> > >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow > >> until whenever my first "gc" kicks in, which may be quite some time if > >> I'm just using it passively. > >> > >> So we should make "git gc --auto" be run on clone, > > > > There is no garbage after 'git clone'... > > "git gc" is really "git gc-or-create-indexes" these days. Because it happens to be convenient to create those indexes at gc-time. But that should not be an excuse to run gc when by definition no gc is needed. > >> and change the > >> need_to_gc() / cmd_gc() behavior so that we detect that the > >> gc.writeCommitGraph=true setting is on, but we have no commit graph, and > >> then just generate that without doing a full repack. > > > > Or just teach 'git clone' to run 'git commit-graph write ...' > > Then when adding something like the commit graph we'd need to patch both > git-clone and git-gc, it's much more straightforward to make > need_to_gc() more granular. > > >> As an aside such more granular "gc" would be nice for e.g. pack-refs > >> too. It's possible for us to just have one pack, but to have 100k loose > >> refs. 
> >> > >> It might also be good to have some gc.autoDetachOnClone option and have > >> it false by default, so we don't have a race condition where "clone > >> linux && git -C linux tag --contains" is slow because the graph hasn't > >> been generated yet, and generating the graph initially doesn't take that > >> long compared to the time to clone a large repo (and on a small one it > >> won't matter either way). > >> > >> I was going to say "also for midx", but of course after clone we have > >> just one pack, so I can't imagine us needing this. But I can see us > >> having other such optional side-indexes in the future generated by gc, > >> and they'd also benefit from this. > >> > >> #leftoverbits
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03 2018, SZEDER Gábor wrote: > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote: >> Don't have time to patch this now, but thought I'd send a note / RFC >> about this. >> >> Now that we have the commit graph it's nice to be able to set >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or >> /etc/gitconfig to apply them to all repos. >> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow >> until whenever my first "gc" kicks in, which may be quite some time if >> I'm just using it passively. >> >> So we should make "git gc --auto" be run on clone, > > There is no garbage after 'git clone'... "git gc" is really "git gc-or-create-indexes" these days. >> and change the >> need_to_gc() / cmd_gc() behavior so that we detect that the >> gc.writeCommitGraph=true setting is on, but we have no commit graph, and >> then just generate that without doing a full repack. > > Or just teach 'git clone' to run 'git commit-graph write ...' Then when adding something like the commit graph we'd need to patch both git-clone and git-gc, it's much more straightforward to make need_to_gc() more granular. >> As an aside such more granular "gc" would be nice for e.g. pack-refs >> too. It's possible for us to just have one pack, but to have 100k loose >> refs. >> >> It might also be good to have some gc.autoDetachOnClone option and have >> it false by default, so we don't have a race condition where "clone >> linux && git -C linux tag --contains" is slow because the graph hasn't >> been generated yet, and generating the graph initially doesn't take that >> long compared to the time to clone a large repo (and on a small one it >> won't matter either way). >> >> I was going to say "also for midx", but of course after clone we have >> just one pack, so I can't imagine us needing this. But I can see us >> having other such optional side-indexes in the future generated by gc, >> and they'd also benefit from this. 
>> >> #leftoverbits
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On 10/3/2018 9:36 AM, SZEDER Gábor wrote: > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote: >> Don't have time to patch this now, but thought I'd send a note / RFC >> about this. >> >> Now that we have the commit graph it's nice to be able to set >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or >> /etc/gitconfig to apply them to all repos. >> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow >> until whenever my first "gc" kicks in, which may be quite some time if >> I'm just using it passively. >> >> So we should make "git gc --auto" be run on clone, > There is no garbage after 'git clone'... And since there is no garbage, the gc will not write the commit-graph. >> and change the >> need_to_gc() / cmd_gc() behavior so that we detect that the >> gc.writeCommitGraph=true setting is on, but we have no commit graph, and >> then just generate that without doing a full repack. > Or just teach 'git clone' to run 'git commit-graph write ...' I plan to add a 'fetch.writeCommitGraph' config setting. I was waiting until the file is incremental (on my to-do list soon), so the write is fast when only adding a few commits at a time. This would cover the clone case, too. Thanks, -Stolee

Re: We should add a "git gc --auto" after "git clone" due to commit graph
On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote: > Don't have time to patch this now, but thought I'd send a note / RFC > about this. > > Now that we have the commit graph it's nice to be able to set > e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or > /etc/gitconfig to apply them to all repos. > > But when I clone e.g. linux.git stuff like 'tag --contains' will be slow > until whenever my first "gc" kicks in, which may be quite some time if > I'm just using it passively. > > So we should make "git gc --auto" be run on clone, There is no garbage after 'git clone'... > and change the > need_to_gc() / cmd_gc() behavior so that we detect that the > gc.writeCommitGraph=true setting is on, but we have no commit graph, and > then just generate that without doing a full repack. Or just teach 'git clone' to run 'git commit-graph write ...' > As an aside such more granular "gc" would be nice for e.g. pack-refs > too. It's possible for us to just have one pack, but to have 100k loose > refs. > > It might also be good to have some gc.autoDetachOnClone option and have > it false by default, so we don't have a race condition where "clone > linux && git -C linux tag --contains" is slow because the graph hasn't > been generated yet, and generating the graph initially doesn't take that > long compared to the time to clone a large repo (and on a small one it > won't matter either way). > > I was going to say "also for midx", but of course after clone we have > just one pack, so I can't imagine us needing this. But I can see us > having other such optional side-indexes in the future generated by gc, > and they'd also benefit from this. > > #leftoverbits
We should add a "git gc --auto" after "git clone" due to commit graph
Don't have time to patch this now, but thought I'd send a note / RFC about this. Now that we have the commit graph it's nice to be able to set e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or /etc/gitconfig to apply them to all repos. But when I clone e.g. linux.git stuff like 'tag --contains' will be slow until whenever my first "gc" kicks in, which may be quite some time if I'm just using it passively. So we should make "git gc --auto" be run on clone, and change the need_to_gc() / cmd_gc() behavior so that we detect that the gc.writeCommitGraph=true setting is on, but we have no commit graph, and then just generate that without doing a full repack. As an aside such more granular "gc" would be nice for e.g. pack-refs too. It's possible for us to just have one pack, but to have 100k loose refs. It might also be good to have some gc.autoDetachOnClone option and have it false by default, so we don't have a race condition where "clone linux && git -C linux tag --contains" is slow because the graph hasn't been generated yet, and generating the graph initially doesn't take that long compared to the time to clone a large repo (and on a small one it won't matter either way). I was going to say "also for midx", but of course after clone we have just one pack, so I can't imagine us needing this. But I can see us having other such optional side-indexes in the future generated by gc, and they'd also benefit from this. #leftoverbits
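For reference, the two settings named above in ~/.gitconfig form (both are real settings; whether they go in ~/.gitconfig or /etc/gitconfig is up to the user):

```ini
[core]
	commitGraph = true
[gc]
	writeCommitGraph = true
```

Until some "gc --auto" actually fires, `git -C linux commit-graph write --reachable` (git 2.19 or later) is the manual stand-in for the step this proposal would run automatically after clone.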