Re: Git Scaling: What factors most affect Git performance for a large repo?

Martin Fick Thu, 19 Feb 2015 22:59:06 -0800

On Feb 19, 2015 5:42 PM, David Turner <[email protected]> wrote:
>
> On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote: 
> > >    * 'git push'? 
> > 
> > This one is not affected by how deep your repo's history is, or how 
> > wide your tree is, so should be quick.. 
> > 
> > Ah the number of refs may affect both git-push and git-pull. I think 
> > Stefan knows better than I in this area. 
>
> I can tell you that this is a bit of a problem for us at Twitter.  We 
> have over 100k refs, which adds ~20MiB of downstream traffic to every 
> push. 
>
> I added a hack to improve this locally inside Twitter: The client sends 
> a bloom filter of shas that it believes that the server knows about; the 
> server sends only the sha of master and any refs that are not in the 
> bloom filter.  The client  uses its local version of the servers' refs 
> as if they had just been sent.  This means that some packs will be 
> suboptimal, due to false positives in the bloom filter leading some new 
> refs to not be sent.  Also, if there were a repack between the pull and 
> the push, some refs might have been deleted on the server; we repack 
> rarely enough and pull frequently enough that this is hopefully not an 
> issue. 
>
> We're still testing to see if this works.  But due to the number of 
> assumptions it makes, it's probably not that great an idea for general 
> use.

Good to hear that others are starting to experiment with solutions to this
problem! I hope to hear more updates on this.

I have a prototype of a simpler and, I believe, more robust solution, though
it is aimed at a narrower use case. On connecting, the client sends a refspec
together with a single sha computed over all of its local refs/shas that match
that refspec, i.e. the refs for which it believes the server might hold the
same values. The server then computes the same sha over its own refs/shas
matching that refspec. If the "verification" sha matches, the server omits
those refs and sends only a confirmation that they matched (along with any
refs outside of the refspec). On a match, the client can inject its local
values for the refs covered by the refspec and be guaranteed that they match
the server's values.
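
To make the exchange above concrete, here is a rough Python sketch. It is not
the prototype itself: the function names are made up for illustration, refs
are modelled as a plain name -> sha dict, the refspec is approximated with a
shell-style glob (where "*" also matches "/"), and the choice of SHA-1 over
sorted "name NUL sha" records is an assumption.

```python
import fnmatch
import hashlib


def refs_digest(refs, pattern):
    """Hash the sorted (name, sha) pairs whose name matches the pattern."""
    h = hashlib.sha1()
    for name in sorted(refs):
        if fnmatch.fnmatchcase(name, pattern):
            h.update(name.encode() + b"\0" + refs[name].encode() + b"\n")
    return h.hexdigest()


def server_advertise(server_refs, pattern, client_digest):
    """Server side: return (matched, refs_to_send) for one connection."""
    if refs_digest(server_refs, pattern) == client_digest:
        # Digest matched: only refs outside the pattern still need sending.
        rest = {n: s for n, s in server_refs.items()
                if not fnmatch.fnmatchcase(n, pattern)}
        return True, rest
    # No match: fall back to advertising every ref.
    return False, dict(server_refs)


def client_view(local_refs, pattern, matched, sent):
    """Client side: rebuild the server's ref list from the reply."""
    view = dict(sent)
    if matched:
        # Inject the local values for the refs covered by the pattern.
        view.update({n: s for n, s in local_refs.items()
                     if fnmatch.fnmatchcase(n, pattern)})
    return view
```

When every ref matches, the reply shrinks to the confirmation plus whatever
falls outside the pattern (HEAD, for a refs/* pattern), no matter how many
refs the repository has.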

This optimization is aimed at the worst case scenario (and thus potentially
the best case for "compression"): the client and server match for all refs (a
refs/* refspec). This happens often on Gerrit server startup, when it verifies
that its mirrors are up-to-date. One reason I chose this as a starting
optimization is that I think it is one use case which will not actually
benefit from "fixing" the git protocol to send only relevant refs, since all
the refs are in fact relevant here! So something like this will likely be
needed in any future git protocol in order for it to be efficient for this use
case. And I believe this use case is likely to stick around.

With a minor tweak, this optimization should also work when replicating
actual expected updates: exclude the refs that are expected to update from the
verification, so that the server always sends their values, since they will
likely not match and would otherwise wreck the optimization. For this use
case, however, it is not clear whether it is even worth caring about the
non-updating refs. In theory, knowledge of the non-updating refs can reduce
the amount of data transmitted, but I suspect that this has diminishing
returns as the ref count increases, and mostly ends up chewing up CPU and
memory in a vain attempt to reduce network traffic.
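
A rough Python sketch of that tweak, under the same kind of assumptions as
before (hypothetical function names, refs as a name -> sha dict, SHA-1 over
sorted records): both sides leave the expected-to-update refs out of the
digest, and the server always sends those refs explicitly.

```python
import hashlib


def digest_excluding(refs, expected_updates):
    """Hash all (name, sha) pairs except the refs expected to update."""
    h = hashlib.sha1()
    for name in sorted(refs):
        if name not in expected_updates:
            h.update(name.encode() + b"\0" + refs[name].encode() + b"\n")
    return h.hexdigest()


def server_reply(server_refs, expected_updates, client_digest):
    """Return (matched, refs_to_send); expected updates are always sent."""
    if digest_excluding(server_refs, expected_updates) == client_digest:
        always = {n: s for n, s in server_refs.items()
                  if n in expected_updates}
        return True, always
    # Something else differs too: fall back to advertising every ref.
    return False, dict(server_refs)
```

Without the exclusion, a single ref that is about to be updated would change
the digest and force the full ref list to be sent.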

Please do keep us up to date on your results,

-Martin

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a
Linux Foundation Collaborative
Project
