On Wed, Oct 3, 2012 at 8:03 PM, Jeff King <p...@peff.net> wrote:
> On Wed, Oct 03, 2012 at 02:36:00PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> I'm creating a system where a lot of remotes constantly fetch from a
>> central repository for deployment purposes, but I've noticed that even
>> with a remote.$name.fetch configuration to only get certain refs a
>> "git fetch" will still call git-upload-pack, which will provide a list
>> of all references.
>> This is being done against a repository with tens of thousands of refs
>> (it has a tag for each deployment), so it ends up burning a lot of CPU
>> time on the uploader/receiver side.
> Where is the CPU being burned? Are your refs packed (that's a huge
> savings)? What are the refs like? Are they .have refs from an alternates
> repository, or real refs? Are they pointing to commits or tag objects?
> What version of git are you using?  In the past year or so, I've made
> several tweaks to speed up large numbers of refs, including:
>   - cff38a5 (receive-pack: eliminate duplicate .have refs, v1.7.6); note
>     that this only helps if they are being pulled in by an alternates
>     repo. And even then, it only helps if they are mostly duplicates;
>     distinct ones are still O(n^2).
>   - 7db8d53 (fetch-pack: avoid quadratic behavior in remove_duplicates)
>     a0de288 (fetch-pack: avoid quadratic loop in filter_refs)
>     Both in v1.7.11. I think there is still a potential quadratic loop
>     in mark_complete()
>   - 90108a2 (upload-pack: avoid parsing tag destinations)
>     926f1dd (upload-pack: avoid parsing objects during ref advertisement)
>     Both in v1.7.10. Note that tag objects are more expensive to
>     advertise than commits, because we have to load and peel them.
> Even with those patches, though, I found that it was something like ~2s
> to advertise 100,000 refs.

I can't provide all the details right now (I don't have access to that
machine at the moment), but briefly:

 * The git client/server version is 1.7.8

 * The repository has around 50k refs; they're "real" refs, almost all
   of them (say all but 0.5k-1k) are annotated tags, and the rest are
   branches.
 * >99% of them are packed; there's a weekly cronjob that packs them
   all up, so only a few newly pushed branches and tags were outside
   the packed-refs file.
 * I tried "echo -n | git upload-pack <repo>" on both that 50k-ref
   repository and a repository with <100 refs; the former took ~1-2s
   to run on a 24-core box and the latter ~500ms.

 * When I ran git-upload-pack under GNU parallel I managed around 20
   advertisements/s on the 24-core box against the 50k-ref repository,
   and around 40/s against the 100-ref one.
 * A co-worker who was working on this today tried it on 1.7.12 and
   claimed that it had the same performance characteristics.

 * I tried to profile it by compiling with gcc -pg and running
   echo -n | ./git-upload-pack <repo>, but that doesn't produce a
   gmon.out profile, presumably because the process exits
   unsuccessfully.

   Maybe someone here knows offhand what mock data I could feed
   git-upload-pack to make it happy to just list the refs, or better
   yet do a bit more work which it would do if it were actually doing
   the fetch (I suppose I could just do a fetch, but I wanted to do
   this from a locally compiled checkout).
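
For what it's worth, the framing upload-pack expects on stdin is git's
pkt-line format; if I'm reading the protocol right, a lone flush-pkt
after the advertisement (i.e. printf '0000' rather than echo -n, which
just hangs up) tells it the client wants nothing and should let it exit
0, so gprof can write its gmon.out. A minimal sketch of the framing in
Python:

```python
# Sketch of git's pkt-line framing (protocol v0), for hand-feeding
# git-upload-pack. Assumption: a lone flush-pkt means "I want
# nothing", after which upload-pack should exit cleanly.

def pkt_line(payload):
    """Encode one pkt-line: 4 hex digits of total length, then payload."""
    data = payload.encode()
    return b"%04x" % (len(data) + 4) + data

FLUSH_PKT = b"0000"  # special packet: no payload, just "done talking"

# A real fetch would send something like:
#   pkt_line("want <40-hex-sha>\n") + FLUSH_PKT
# while a refs-only profiling run needs just the flush-pkt, e.g.:
#   printf '0000' | ./git-upload-pack <repo>
```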

>> Has there been any work on extending the protocol so that the client
>> tells the server what refs it's interested in?
> I don't think so. It would be hard to do in a backwards-compatible way,
> because the advertisement is the first thing the server says, before it
> has negotiated any capabilities with the client at all.

I suppose at least for the ssh protocol we could just do:

    ssh server "(git upload-pack <repo> --refs=* || git upload-pack <repo>)"

And something similar with an HTTP header, but that of course still
leaves the git:// protocol without a solution.
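
A sketch of that fallback (to be clear, --refs= is the hypothetical new
option being proposed here; no released upload-pack understands it,
which is exactly what the || branch is for):

```shell
#!/bin/sh
# Hypothetical fallback: try the proposed --refs= filter first; an
# older upload-pack rejects the unknown option and exits non-zero,
# so we retry with a plain (full-advertisement) invocation.
fetch_advertisement() {
    repo=$1
    pattern='refs/tags/*'    # whatever refs this client actually wants
    git upload-pack --refs="$pattern" "$repo" 2>/dev/null \
        || git upload-pack "$repo"
}
```

Today both branches behave the same (the first always fails over to
the second); the point is that a client written this way would keep
working against servers on either side of the protocol change.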