> Pegging CPU for a few seconds doesn't sound out-of-place for
> pack-objects serving a fetch or clone on a large repository. And I can
> certainly believe "minutes", especially if it was not serving a fetch,
> but doing repository maintenance on a large repository.
> Talk to GitHub Enterprise support folks about what kind of process
> monitoring and accounting is available. Recent versions of GHE can
> easily tell things like which repositories and processes are using the
> most CPU, RAM, I/O, and network, which ones are seeing a lot of
> parallelism, etc.

We have an open support thread with the GHE support folks, but the
only feedback we've gotten so far on this subject is that they believe
our CPU load is being driven by the quantity of fetch operations (an
average of 16 fetch requests per second during normal business hours,
so 4 requests per second per subscriber box). Our main repository
sees about 4,000 fetch requests per day.

> None of those operations is triggered by client fetches. You'll see
> "rev-list" for a variety of operations, so that's hard to pinpoint. But
> I'm surprised that "prune" is a common one for you. It is run as part of
> the post-push, but I happen to know that the version that ships on GHE
> is optimized to use bitmaps, and to avoid doing any work when there are
> no loose objects that need pruning in the first place.

Would regular force-pushing trigger prune operations? Our engineering
body loves to force-push.

>> I might be misunderstanding this, but if the subscriber is already "up
>> to date" modulo a single updated ref tip, then this problem shouldn't
>> occur, right? Concretely: if ref A is built off of ref B, and the
>> subscriber already has B when it requests A, that shouldn't cause this
>> behavior, but it would cause this behavior if the subscriber didn't
>> have B when it requested A.
> Correct. So this shouldn't be a thing you are running into now, but it's
> something that might be made worse if you switch to fetching only single
> refs.

But in theory, if we were always up to date (since we'd always fetch
any updated ref), we wouldn't run into this problem? We could have a
cron job run a full git fetch every once in a while, but I'd hope
that if this were written properly we'd almost always have the most
recent commit for any dependency ref.
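
To make the idea concrete, here is a rough sketch of the single-ref
fetch with a periodic full-fetch fallback. Everything in it is an
assumption for illustration rather than our actual setup: the function
names, the six-hour interval, and the premise that each subscriber is a
bare mirror where a ref can be force-updated in place.

```python
import subprocess
import time

# Hypothetical interval between full fetches; tune to taste.
FULL_FETCH_INTERVAL = 6 * 60 * 60  # seconds


def fetch_command(remote, updated_ref, last_full_fetch, now,
                  interval=FULL_FETCH_INTERVAL):
    """Build the git argv for one sync.

    Normally fetch only the ref that changed; fall back to a full
    fetch when no ref is named or the interval has elapsed.
    """
    if updated_ref is None or now - last_full_fetch >= interval:
        # Full fetch: uses the refspecs configured for the remote.
        return ["git", "fetch", remote]
    # Single-ref fetch: the leading "+" allows non-fast-forward
    # updates, which matters when people force-push.
    return ["git", "fetch", remote, f"+{updated_ref}:{updated_ref}"]


def sync(repo_dir, remote, updated_ref, last_full_fetch):
    """Run one sync in repo_dir; return the time of the last full fetch."""
    now = time.time()
    cmd = fetch_command(remote, updated_ref, last_full_fetch, now)
    subprocess.run(cmd, cwd=repo_dir, check=True)
    # A three-element argv means no refspec was given, i.e. a full fetch.
    return now if len(cmd) == 3 else last_full_fetch
```

A subscriber would call sync() with the ref named in each update
notification, and the periodic full fetch would pick up anything a
missed notification left behind.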

> That really sounds like repository maintenance. Repacks of
> torvalds/linux (including all of its forks) on github.com take ~15
> minutes of CPU. There may be some optimization opportunities there (I
> have a few things I'd like to explore in the next few months), but most
> of it is pretty fundamental. It literally takes a few minutes just to
> walk the entire object graph for that repo (that's one of the more
> extreme cases, of course, but presumably you are hosting some large
> repositories).
> Maintenance like that should be a very occasional operation, but it's
> possible that you have a very busy repo.

Our primary repository is fairly busy. It has about 1/3 the commits of
Linux and about 1/3 the refs, but seems otherwise to be on the same
scale. And, of course, it both hasn't been around for as long as Linux
has and has been experiencing exponential growth, which means its
current activity is higher than it has ever been before -- that might
put it on a similar scale to Linux's current activity.

> OK, I double-checked, and your version should be coalescing identical
> fetches.
> Given that, and that a lot of the load you mentioned above is coming
> from non-fetch sources, it sounds like switching anything with your
> replica fetch strategy isn't likely to help much. And a multi-tiered
> architecture won't help if the load is being generated by requests that
> are serving the web-views directly on the box.
> I'd really encourage you to talk with GitHub Support about performance
> and clustering. It sounds like there may be some GitHub-specific things
> to tweak. And it may be that the load is just too much for a single
> machine, and would benefit from spreading the load across multiple git
> servers.

What surprised us is that we had been running this on an r3.4xlarge
(16 vCPU) on AWS for two years without too much issue. Then in a span
of months we started experiencing massive CPU load, which forced us to
upgrade the box to one with 32 vCPU (and better CPUs). We just don't
understand what the precise driver of load is here.

As noted above, we are talking with GitHub about performance -- we've
also been pushing them to start working on a clustering plan, but my
impression has been that they're reluctant to go down that path. I
suspect that we use GHE much more aggressively than the typical GHE
client, but I could be wrong about that.

 - V

