On Fri, Oct 26 2018, Junio C Hamano wrote:

> "Jansen, Geert" <gera...@amazon.com> writes:
>
>> The index-pack command determines if a sha1 collision test is needed by
>> checking the existence of a loose object with the given hash. In my tests, I
>> can improve performance of “git clone” on Amazon EFS by 8x when used with a
>> non-default mount option (lookupcache=pos) that's required for a Gitlab HA
>> setup. My assumption is that this check is unnecessary when cloning into a 
>> new
>> repository because the repository will be empty.
>
> My knee-jerk reaction is that your insight that we can skip the "dup
> check" when starting from emptiness is probably correct, but your
> use of .cloning flag as an approximation of "are we starting from
> emptiness?" is probably wrong, at least for two reasons.
>
>  - "git clone --reference=..." does not strictly start from
>    emptiness, and would still have to make sure that incoming pack
>    does not try to inject an object with different contents but with
>    the same name.
>
>  - "git init && git fetch ..." starts from emptiness and would want
>    to benefit from the same optimization as you are implementing
>    here.
>
> As to the implementation, I think the patch adjusts the right "if()"
> condition to skip the collision test here.
>
>> -    if (startup_info->have_repository) {
>> +    if (startup_info->have_repository && !cloning) {
>>              read_lock();
>>              collision_test_needed =
>>                      has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
>
> I just do not think !cloning is quite correct.

Geert: Thanks for working on this. A GitLab instance I'm involved in
managing at work has suffered from this issue, e.g. with "fork" being a
"clone" under the hood on GitLab, and taking ages on the NetApp NFS
filer due to this issue, so I'm very interested in this moving forward.

But as Junio notes the devil's in the details, another one I thought of
is:

    GIT_OBJECT_DIRECTORY=/some/other/repository git clone ...

It seems to me that it's better to stick this into
setup_git_directory_gently() in setup.c and check various edge cases
there. I.e. make this a new "have_object_already" member of the
startup_info struct.

That would be set depending on whether we find objects/packs in the
objects dir, and would know about GIT_OBJECT_DIRECTORY (either just
punting, or looking at those too). It would also need to know about
read_info_alternates(), depending on when that's checked it would handle
git clone --reference too since it just sets it up via
add_to_alternates_file().

Then you'd change sha1_object() to just check
startup_info->have_objects_already instead of
startup_info->have_repository, and since we'd checked "did we have
objects already?" it would work for the init && fetch case Junio
mentioned.

It would also be very nice to have a test for this, even if it's
something OS-specific that only works on Linux after we probe for
strace(1).

Testing (without your patch, because git-am barfs on it , seems to
dislake the base64 encoding):

    rm -rf /tmp/df; strace -f git clone --bare g...@github.com:git/git.git 
/tmp/df 2>&1 | grep '".*ENOENT' 2>&1|perl -pe 's/^.*?"([^"]+)".*/$1/'|sort|uniq 
-c|sort -nr|less

I see we also check if packed-refs exists ~2800 times, and check for
each ref we find on the remote. Those are obviously less of a
performance issue when cloning in the common case, but perhaps another
place where we can insert a "don't check, we don't have anything
already" condition.

Reply via email to