Re: [PATCH 0/7] speeding up cat-file by reordering object access

2018-08-16 Thread Jeff King
On Thu, Aug 16, 2018 at 11:52:13AM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > I think that makes sense. We already see duplicates from
> > for_each_packed_object() when they're in multiple packs, and callers
> > just need to be ready to deal with it (and depending on what you're
> > doing, you may actually _want_ the duplicates).
> 
> You of course would also see dups between loose and packed until
> prune-packed is run.

Yes, although there are two separate calls to look at those two sources,
so it's a bit more obvious that the caller has to de-dup. I'm not
opposed to a flag to ask the function to de-dup for us, but it really
only makes sense if we do a combined for_each_object() call.
De-duping is potentially expensive, and there's little point in
de-duping the packs if we have to then de-dup the result with the loose
objects.
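
A caller that wants each object exactly once has to do something like
this today (untested sketch, not the actual cat-file code; the oidset
absorbs duplicates both between packs and between loose and packed):

  #include "cache.h"
  #include "oidset.h"

  static int collect_loose(const struct object_id *oid, const char *path,
                           void *data)
  {
          oidset_insert(data, oid); /* no-op if we already saw this oid */
          return 0;
  }

  static int collect_packed(const struct object_id *oid,
                            struct packed_git *pack, uint32_t pos,
                            void *data)
  {
          oidset_insert(data, oid);
          return 0;
  }

  static void collect_all_objects(struct oidset *seen)
  {
          for_each_loose_object(collect_loose, seen, 0);
          for_each_packed_object(collect_packed, seen, 0);
  }

(The caller would initialize the oidset with OIDSET_INIT and then
iterate or query it afterwards.)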

One benefit of having the iterator function de-dup is that it _could_ be
done more efficiently. Right now, even before my recent patches, the
cat-file de-dup happens by creating a secondary list, sorting it, and
then skipping the duplicates.

But if a function knew _all_ of the sources we were going to look at
(any loose directories, including alternates, and all available packs),
and it could walk each individual source in sorted order (easy for packs
by .idx, not too bad for loose by sorting the readdir() results), then
we could do it as an online list-merge, with a single walk through the
results.
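
Roughly this shape (untested, and glossing over where the two sorted
streams actually come from; one would be .idx order, the other the
sorted loose results):

  static void merge_dedup(const struct object_id *a, size_t na,
                          const struct object_id *b, size_t nb,
                          void (*emit)(const struct object_id *))
  {
          size_t i = 0, j = 0;

          while (i < na || j < nb) {
                  const struct object_id *next;

                  /* take the smaller head, or whatever remains */
                  if (j == nb || (i < na && oidcmp(&a[i], &b[j]) <= 0))
                          next = &a[i++];
                  else
                          next = &b[j++];

                  /* skip further copies of the oid we just picked */
                  while (i < na && !oidcmp(&a[i], next))
                          i++;
                  while (j < nb && !oidcmp(&b[j], next))
                          j++;

                  emit(next);
          }
  }

A real version would take N sources rather than two, but the point is
the same: one pass, no secondary list to build and sort.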

In practice I don't know if it's worth the effort. If you're accessing
the object contents, sorted order really is bad. And if you're just
collecting the hashes, then you're likely generating some data structure
for future lookups, and you can just de-dup as part of that process.

> I also was thinking about the same thing after Derrick's response,
> but unless you are a very specialized caller, merely learning that
> object X exists as a loose object locally, also as a loose object in
> our alternate, and also in pack A, but not in other packs, does not
> let you do very much.  You need a way to say "Read the contents of
> object X from that place, not from any other place", "Remove that
> copy of object X at that place, but not at any other place", etc. to
> make effective use of that kind of information.

We do have those functions, and use them. E.g., fsck uses
read_loose_object() to actually check that particular copy of the
object. That's obviously a specialized caller as you note, but then
anybody iterating over all of the objects is pretty specialized already.

> The codepath that implements runtime access has "I found a copy
> here, but it is unusable, so let's go on to look for another usable
> copy" fallback.  This is a tangent but it is something we should not
> lose in the midx-enabled world.

Yeah, good catch. That's orthogonal to most of this discussion, I think,
but it's a possible downside of the midx because it has literally
discarded those duplicate index entries (or at least that's my
understanding). If we kept the .idx for each pack as a fallback for
older Gits, then it's easy to solve: in the unlikely case the
.midx-referenced copy fails, you look in each individual pack for
another copy.
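
I.e., something like this, hand-waving with made-up helper names
(midx_read_object() and pack_read_object() don't exist; they stand in
for "read via the .midx" and "read via this one pack's .idx"):

  static int read_object_with_fallback(struct multi_pack_index *midx,
                                       struct packed_git *packs,
                                       const struct object_id *oid,
                                       void **contents,
                                       unsigned long *size)
  {
          struct packed_git *p;

          if (!midx_read_object(midx, oid, contents, size))
                  return 0; /* the .midx-referenced copy was usable */

          /* otherwise try each pack's own .idx for another copy */
          for (p = packs; p; p = p->next)
                  if (!pack_read_object(p, oid, contents, size))
                          return 0;

          return -1; /* no usable copy anywhere */
  }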

But I think eventually you'd want to stop generating those .idx files,
too, if you know that you'll only use a modern version of Git. I wonder
if the .midx format should have an extension for "here are all the
duplicates I found". They could even be part of the main index (it's
easy enough for a binary-search lookup to always point to the start of a
run of duplicates), but if you had a ton of duplicates they would slow
your binary searches a little.
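
The lookup side of that is just a lower-bound-style binary search, so a
hit always lands on the first entry of its run (toy sketch, untested):

  static uint32_t first_match(const struct object_id *table, uint32_t nr,
                              const struct object_id *want)
  {
          uint32_t lo = 0, hi = nr;

          while (lo < hi) {
                  uint32_t mid = lo + (hi - lo) / 2;

                  if (oidcmp(&table[mid], want) < 0)
                          lo = mid + 1;
                  else
                          hi = mid;
          }
          return lo; /* first entry >= want; caller checks for equality */
  }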

I'll leave all that to Stolee to ponder. :)

-Peff


Re: [PATCH 0/7] speeding up cat-file by reordering object access

2018-08-16 Thread Junio C Hamano
Jeff King  writes:

> I think that makes sense. We already see duplicates from
> for_each_packed_object() when they're in multiple packs, and callers
> just need to be ready to deal with it (and depending on what you're
> doing, you may actually _want_ the duplicates).

You of course would also see dups between loose and packed until
prune-packed is run.  

I also was thinking about the same thing after Derrick's response,
but unless you are a very specialized caller, merely learning that
object X exists as a loose object locally, also as a loose object in
our alternate, and also in pack A, but not in other packs, does not
let you do very much.  You need a way to say "Read the contents of
object X from that place, not from any other place", "Remove that
copy of object X at that place, but not at any other place", etc. to
make effective use of that kind of information.

The codepath that implements runtime access has "I found a copy
here, but it is unusable, so let's go on to look for another usable
copy" fallback.  This is a tangent but it is something we should not
lose in the midx-enabled world.


Re: [PATCH 0/7] speeding up cat-file by reordering object access

2018-08-16 Thread Jeff King
On Wed, Aug 15, 2018 at 10:05:04AM -0400, Derrick Stolee wrote:

> One thing that I realized while reading it is that the multi-pack-index is
> not integrated into the for_each_packed_object method. I was already going
> to work on some cleanups in that area [1][2].
> 
> When using the new flag with the multi-pack-index, I expect that we will
> want to load the pack-files that are covered by the multi-pack-index
> (simply, the 'packs' array) and use the same mechanism to traverse them in
> order. The only "strange" thing about this is that we would see duplicate
> objects when traversing the pack-files directly but not when traversing the
> multi-pack-index (since it de-duplicates when indexing).

I think that makes sense. We already see duplicates from
for_each_packed_object() when they're in multiple packs, and callers
just need to be ready to deal with it (and depending on what you're
doing, you may actually _want_ the duplicates).

Thanks for thinking through the implications for other topics. I hadn't
even considered how this would interact with midx.

-Peff


Re: [PATCH 0/7] speeding up cat-file by reordering object access

2018-08-15 Thread Derrick Stolee

On 8/10/2018 7:07 PM, Jeff King wrote:

> The general idea is that accessing objects in packfile order is way
> kinder to the delta base cache, and thus way more efficient. See patches
> 4 and 7 in particular for discussion and numbers.
>
> I'm primarily interested in cat-file, so this series is focused there.
> But there may be other callers of for_each_packed_object() who could
> benefit. Most of the existing ones just care about getting the oid, so
> they're better off as-is. It's possible the call in is_promisor_object()
> could benefit, since it calls parse_object() on each entry it visits. I
> didn't experiment with it.


I like this series, and the follow-up. I could not find any problems 
with it.


One thing that I realized while reading it is that the multi-pack-index 
is not integrated into the for_each_packed_object method. I was already 
going to work on some cleanups in that area [1][2].


When using the new flag with the multi-pack-index, I expect that we will 
want to load the pack-files that are covered by the multi-pack-index 
(simply, the 'packs' array) and use the same mechanism to traverse them 
in order. The only "strange" thing about this is that we would see 
duplicate objects when traversing the pack-files directly but not when 
traversing the multi-pack-index (since it de-duplicates when indexing).
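
Roughly what I have in mind, assuming a per-pack iterator along the
lines of for_each_object_in_pack(); midx_load_pack() and the struct
field names below are stand-ins for whatever the real helpers end up
being (untested):

  static int for_each_midx_covered_object(struct multi_pack_index *m,
                                          each_packed_object_fn cb,
                                          void *data,
                                          enum for_each_object_flags flags)
  {
          uint32_t i;

          for (i = 0; i < m->num_packs; i++) {
                  struct packed_git *p = midx_load_pack(m, i); /* stand-in */
                  int r = for_each_object_in_pack(p, cb, data, flags);

                  if (r)
                          return r;
          }
          return 0;
  }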


I hope to have a series working on top of this series by end-of-week.

Thanks,

-Stolee

[1] 
https://public-inbox.org/git/CAPig+cTU--KrGcv4C_CwBZEuec4dgm_tJqL=cfwkt6vxxr0...@mail.gmail.com/


    Re: [PATCH v4 04/23] multi-pack-index: add 'write' verb

    (Recommends more user-friendly usage reporting in 'git multi-pack-index')


[2] 
https://public-inbox.org/git/20180814222846.gg142...@aiede.svl.corp.google.com/


    [PATCH] partial-clone: render design doc using asciidoc

    (The commit-graph and multi-pack-index docs are not in the Makefile, either.)




Re: [PATCH 0/7] speeding up cat-file by reordering object access

2018-08-13 Thread Jonathan Tan
>   [1/7]: for_each_*_object: store flag definitions in a single location
>   [2/7]: for_each_*_object: take flag arguments as enum
>   [3/7]: for_each_*_object: give more comprehensive docstrings
>   [4/7]: for_each_packed_object: support iterating in pack-order
>   [5/7]: t1006: test cat-file --batch-all-objects with duplicates
>   [6/7]: cat-file: rename batch_{loose,packed}_object callbacks
>   [7/7]: cat-file: support "unordered" output for --batch-all-objects

Thanks for laying all the patches out so cleanly! All of them are:
Reviewed-by: Jonathan Tan 

Normally I would re-explain the patches to demonstrate that I understand
them, but in this case, I think they are simple enough - patches 1, 2,
3, and 6 are refactorings that I agree with, patch 5 just makes a test
more comprehensive, and patches 4 and 7 do what their commit messages
say.

Stefan brought up the concern that cache.h is increasing in size, but I
agree with the patch as written that it's probably best that we
centralize all the flags somewhere, and we can deal with the location in
a future patch.
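
For reference, the centralized definitions end up looking roughly like
this (paraphrased, so take the exact bits and comments with a grain of
salt):

  enum for_each_object_flags {
          /* Iterate only over local objects, not alternates. */
          FOR_EACH_OBJECT_LOCAL_ONLY = (1 << 0),

          /* Only iterate over packs obtained from the promisor remote. */
          FOR_EACH_OBJECT_PROMISOR_ONLY = (1 << 1),

          /* Visit objects within a pack in pack order, not .idx order. */
          FOR_EACH_OBJECT_PACK_ORDER = (1 << 2),
  };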