Re: [PATCH 0/7] speeding up cat-file by reordering object access
On Thu, Aug 16, 2018 at 11:52:13AM -0700, Junio C Hamano wrote:

> Jeff King writes:
>
> > I think that makes sense. We already see duplicates from
> > for_each_packed_object() when they're in multiple packs, and callers
> > just need to be ready to deal with it (and depending on what you're
> > doing, you may actually _want_ the duplicates).
>
> You of course would also see dups between loose and packed until
> prune-packed is run.

Yes, although there are two separate calls to look at those two
sources, so it's a bit more obvious that the caller has to de-dup.

I'm not opposed to a flag to ask the function to de-dup for us, but it
really only makes sense if we do a combined for_each_object() call.
De-duping is potentially expensive, and there's little point in
de-duping the packs if we then have to de-dup the result against the
loose objects.

One benefit of having the iterator function de-dup is that it _could_
be done more efficiently. Right now (even before my recent patches),
the cat-file de-dup happens by creating a secondary list, sorting it,
and then skipping the duplicates. But if a function knew _all_ of the
sources we were going to look at (any loose directories, including
alternates, plus all available packs), and it could walk each
individual source in sorted order (easy for packs via the .idx, not
too bad for loose objects by sorting the readdir() results), then we
could do it as an online list-merge, with a single walk through the
results.

In practice I don't know if it's worth the effort. If you're accessing
the object contents, sorted order really is bad. And if you're just
collecting the hashes, then you're likely generating some data
structure for future lookups, and you can just de-dup as part of that
process.
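For illustration, the online list-merge could look something like this
toy sketch (hypothetical `struct source` and `for_each_unique_oid`,
not git's actual API; each source is assumed to yield oids in sorted
order, as a pack .idx or a sorted readdir() would):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: each "source" yields object ids in sorted
 * order. Here a source is just a pre-sorted array of hex oid
 * strings with a cursor. */
struct source {
	const char **oids;
	size_t nr, pos;
};

static const char *peek(struct source *s)
{
	return s->pos < s->nr ? s->oids[s->pos] : NULL;
}

/* Walk all sources as an online list-merge, calling cb exactly once
 * per distinct oid; duplicates across sources are skipped as we go,
 * with no secondary list and no separate sort pass. */
static void for_each_unique_oid(struct source *srcs, size_t n,
				void (*cb)(const char *))
{
	const char *last = NULL;

	for (;;) {
		struct source *best = NULL;
		for (size_t i = 0; i < n; i++) {
			const char *o = peek(&srcs[i]);
			if (o && (!best || strcmp(o, peek(best)) < 0))
				best = &srcs[i];
		}
		if (!best)
			break; /* every source is exhausted */
		if (!last || strcmp(peek(best), last))
			cb(peek(best));
		last = peek(best);
		best->pos++;
	}
}
```

With k sources this is the usual k-way merge; an oid present both
loose and in two packs is still emitted only once.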
> I also was thinking about the same thing after Derrick's response,
> but unless you are a very specialized caller, it does not allow you
> to do very much to learn that object X exists as a loose object
> locally, also as a loose object in our alternate, and also in pack
> A, but not in other packs. You need a way to say "Read the contents
> of object X from that place, not from any other place", "Remove
> that copy of object X at that place, but not at any other place",
> etc. to make effective use of that kind of information.

We do have those functions, and use them. E.g., fsck uses
read_loose_object() to actually check that particular copy of the
object. That's obviously a specialized caller, as you note, but then
anybody iterating over all of the objects is pretty specialized
already.

> The codepath that implements runtime access has an "I found a copy
> here, but it is unusable, so let's go on to look for another usable
> copy" fallback. This is a tangent, but it is something we should
> not lose in the midx-enabled world.

Yeah, good catch. That's orthogonal to most of this discussion, I
think, but it's a possible downside of the midx, because it has
literally discarded those duplicate index entries (or at least that's
my understanding).

If we kept the .idx for each pack as a fallback for older Gits, then
it's easy to solve: in the unlikely case that the .midx-referenced
copy fails, you look in each individual pack for another copy. But I
think eventually you'd want to stop generating those .idx files, too,
if you know that you'll only use a modern version of Git.

I wonder if the .midx format should have an extension for "here are
all the duplicates I found". They could even be part of the main
index (it's easy enough for a binary-search lookup to always point to
the start of a run of duplicates), but if you had a ton of duplicates
they would slow your binary searches a little. I'll leave all that to
Stolee to ponder. :)

-Peff
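The "point to the start of a run of duplicates" idea above amounts to
a lower-bound binary search over a sorted list that may contain
repeats. A rough sketch (hypothetical helper, not the actual .midx
lookup code):

```c
#include <stddef.h>
#include <string.h>

/* Lower-bound binary search over a sorted (possibly duplicated)
 * array of hex oid strings. Returns the index of the first entry
 * >= key, i.e. the start of the run of duplicates when the key is
 * present; the caller can then walk forward while entries still
 * compare equal. */
static size_t oid_lower_bound(const char **oids, size_t nr,
			      const char *key)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (strcmp(oids[mid], key) < 0)
			lo = mid + 1;
		else
			hi = mid; /* mid may still be the first match */
	}
	return lo;
}
```

The cost of the duplicates is just the extra entries lengthening the
array, which is what would slow the binary searches a little.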
Re: [PATCH 0/7] speeding up cat-file by reordering object access
Jeff King writes:

> I think that makes sense. We already see duplicates from
> for_each_packed_object() when they're in multiple packs, and callers
> just need to be ready to deal with it (and depending on what you're
> doing, you may actually _want_ the duplicates).

You of course would also see dups between loose and packed until
prune-packed is run.

I also was thinking about the same thing after Derrick's response,
but unless you are a very specialized caller, it does not allow you
to do very much to learn that object X exists as a loose object
locally, also as a loose object in our alternate, and also in pack A,
but not in other packs. You need a way to say "Read the contents of
object X from that place, not from any other place", "Remove that
copy of object X at that place, but not at any other place", etc. to
make effective use of that kind of information.

The codepath that implements runtime access has an "I found a copy
here, but it is unusable, so let's go on to look for another usable
copy" fallback. This is a tangent, but it is something we should not
lose in the midx-enabled world.
Re: [PATCH 0/7] speeding up cat-file by reordering object access
On Wed, Aug 15, 2018 at 10:05:04AM -0400, Derrick Stolee wrote:

> One thing that I realized while reading it is that the
> multi-pack-index is not integrated into the for_each_packed_object
> method. I was already going to work on some cleanups in that area
> [1][2].
>
> When using the new flag with the multi-pack-index, I expect that we
> will want to load the pack-files that are covered by the
> multi-pack-index (simply, the 'packs' array) and use the same
> mechanism to traverse them in order. The only "strange" thing about
> this is that we would see duplicate objects when traversing the
> pack-files directly but not when traversing the multi-pack-index
> (since it de-duplicates when indexing).

I think that makes sense. We already see duplicates from
for_each_packed_object() when they're in multiple packs, and callers
just need to be ready to deal with it (and depending on what you're
doing, you may actually _want_ the duplicates).

Thanks for thinking through the implications for other topics. I
hadn't even considered how this would interact with the midx.

-Peff
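For callers that do need uniqueness, the usual pattern (the one
cat-file's --batch-all-objects uses, per the discussion above) is to
collect everything, sort, and skip adjacent repeats. A toy sketch
with hypothetical helpers, using hex strings in place of real object
ids:

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of oid strings */
static int cmp_oid(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the collected oids in place, then compact out adjacent
 * duplicates; returns the de-duplicated count. After this, the
 * first `return value` entries are unique and sorted. */
static size_t sort_and_dedup(const char **oids, size_t nr)
{
	size_t out = 0;

	qsort(oids, nr, sizeof(*oids), cmp_oid);
	for (size_t i = 0; i < nr; i++)
		if (!out || strcmp(oids[i], oids[out - 1]))
			oids[out++] = oids[i];
	return out;
}
```

This is O(n log n) and needs the secondary list, which is exactly the
cost the online list-merge idea elsewhere in the thread would avoid.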
Re: [PATCH 0/7] speeding up cat-file by reordering object access
On 8/10/2018 7:07 PM, Jeff King wrote:

> The general idea is that accessing objects in packfile order is way
> kinder to the delta base cache, and thus way more efficient. See
> patches 4 and 7 in particular for discussion and numbers.
>
> I'm primarily interested in cat-file, so this series is focused
> there. But there may be other callers of for_each_packed_object()
> who could benefit. Most of the existing ones just care about getting
> the oid, so they're better off as-is. It's possible the call in
> is_promisor_object() could benefit, since it calls parse_object() on
> each entry it visits. I didn't experiment with it.

I like this series, and the follow-up. I could not find any problems
with it.

One thing that I realized while reading it is that the
multi-pack-index is not integrated into the for_each_packed_object
method. I was already going to work on some cleanups in that area
[1][2].

When using the new flag with the multi-pack-index, I expect that we
will want to load the pack-files that are covered by the
multi-pack-index (simply, the 'packs' array) and use the same
mechanism to traverse them in order. The only "strange" thing about
this is that we would see duplicate objects when traversing the
pack-files directly but not when traversing the multi-pack-index
(since it de-duplicates when indexing).

I hope to have a series working on top of this series by end-of-week.

Thanks,
-Stolee

[1] https://public-inbox.org/git/CAPig+cTU--KrGcv4C_CwBZEuec4dgm_tJqL=cfwkt6vxxr0...@mail.gmail.com/
    Re: [PATCH v4 04/23] multi-pack-index: add 'write' verb
    (Recommends more user-friendly usage reporting in
    'git multi-pack-index')

[2] https://public-inbox.org/git/20180814222846.gg142...@aiede.svl.corp.google.com/
    [PATCH] partial-clone: render design doc using asciidoc
    (The commit-graph and multi-pack-index docs are not in the
    Makefile, either.)
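To see why packfile order is kind to the delta base cache, consider a
deliberately simplified model (not git's actual cache): object i is a
delta against object i-1, and we cache only the single most recently
reconstructed object. Counting how many objects get reconstructed for
a given access order shows pack order touching each object once,
while other orders keep re-walking delta chains:

```c
#include <stddef.h>

/* Toy model: object i is stored as a delta against object i-1
 * (object 0 is a full object), and we keep a one-entry cache of the
 * last object we reconstructed. Returns the total number of object
 * reconstructions needed to access every entry of `order`. */
static size_t count_reconstructions(const size_t *order, size_t n)
{
	size_t cost = 0;
	long cached = -1; /* index of the one cached object */

	for (size_t i = 0; i < n; i++) {
		long obj = (long)order[i];
		long base = obj;

		/* walk down the delta chain until we can start from
		 * the cached base, or hit the full object at 0 */
		while (base > 0 && base - 1 != cached)
			base--;
		cost += (size_t)(obj - base + 1);
		cached = obj;
	}
	return cost;
}
```

For a 4-object chain, pack order {0,1,2,3} costs 4 reconstructions,
while reverse order {3,2,1,0} costs 10; the gap grows quadratically
with chain length, which is the effect the series exploits.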
Re: [PATCH 0/7] speeding up cat-file by reordering object access
> [1/7]: for_each_*_object: store flag definitions in a single location
> [2/7]: for_each_*_object: take flag arguments as enum
> [3/7]: for_each_*_object: give more comprehensive docstrings
> [4/7]: for_each_packed_object: support iterating in pack-order
> [5/7]: t1006: test cat-file --batch-all-objects with duplicates
> [6/7]: cat-file: rename batch_{loose,packed}_object callbacks
> [7/7]: cat-file: support "unordered" output for --batch-all-objects

Thanks for laying all the patches out so cleanly! All of them are:

Reviewed-by: Jonathan Tan

Normally I would re-explain the patches to demonstrate that I
understand them, but in this case I think they are simple enough:
patches 1, 2, 3, and 6 are refactorings that I agree with, patch 5
just makes a test more comprehensive, and patches 4 and 7 do what
their commit messages say.

Stefan brought up the concern that cache.h is increasing in size, but
I agree with the patch as written that it's probably best that we
centralize all the flags somewhere; we can deal with the location in
a future patch.