Philippe Vaucher wrote:
> >   git daemon --init-timeout=10 --timeout=28800 --max-connections=15 
> > --export-all --base-path=/srv/git --detach
> >
> > In this configuration git-daemon acts as the supervisor process
> > managing its children.  The limit values were learned from having
> > too many connections cause failures, and were tuned lower in order
> > to prevent that.
> >
> > When connections exceed the max-connection limit then they will queue
> > to the kernel limit /proc/sys/net/core/somaxconn (defaults to 128) and
> > be serviced as able.  For any connections past the 128 default the
> > client will get a connection failure.  The behavior at that point is
> > client dependent.  It might retry.
> 
> Interesting. So you're saying the timeouts I had when using git://
> meant that there was too much queue.

Likely.  Yes.

Because net.core.somaxconn defaults to 128 this means that up to 128
pending connections can queue waiting to be accepted.  The 129th
connection will be refused and the client should report it as a
failure to connect error.  Meanwhile git-daemon will have 15
concurrent processes draining the queue continuously.  We currently
have 8 cpu cores processing.
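
For reference, that kernel limit can be inspected and raised with
sysctl.  A minimal sketch; the 1024 is only an illustrative value, and
the effective queue is also capped by whatever backlog git-daemon
itself passes to listen():

    # current listen backlog limit
    sysctl net.core.somaxconn
    cat /proc/sys/net/core/somaxconn

    # raise it for the running kernel (add to /etc/sysctl.conf to persist)
    sysctl -w net.core.somaxconn=1024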

> > The nginx configuration is this:
> >
> >         location /git/ {
> >                 autoindex on;
> >                 root /srv;
> >                 location ~ ^/git(/.*/(info/refs|git-upload-pack)$) {
> >                         gzip off;
> >                         include fastcgi_params;
> >                         fastcgi_pass unix:/var/run/fcgiwrap.socket;
> >                         fastcgi_param SCRIPT_FILENAME 
> > /usr/local/sbin/git-http-backend;
> >                         fastcgi_param PATH_INFO $1;
> >                         fastcgi_param GIT_HTTP_EXPORT_ALL true;
> >                         fastcgi_param GIT_PROJECT_ROOT /srv/git;
> >                         client_max_body_size 0;
> >                 }
> >         }
> >
> > Looking at this now I see there is no rate limit being applied to this
> > section.  Therefore what I mentioned previously applies to the cgit
> > and gitweb sections which have been more problematic.  With no rate
> > limits all clients will be attempted.  Hmm...  I think that may have
> > been a mistake.  It is possible that adding a rate limit will smooth
> > the resource use and actually improve the situation.  The cgit and
> > gitweb sections use a "limit_req zone=one burst=15;" limit.  cgit in
> > particular is resource intensive for various reasons.  I'll need to do
> > some testing.
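
For example, applying the same limit_req approach to the /git/
location would look roughly like the sketch below.  The zone
definition shown here (name, size, rate) is an assumption on my part,
since the real "zone=one" definition lives elsewhere in the nginx
http block and isn't shown in this thread:

    # in the http block
    limit_req_zone $binary_remote_addr zone=one:10m rate=5r/s;

    # inside the /git/ location
    limit_req zone=one burst=15;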
> 
> So, where did my 502/504 errors come from? Each job was retried 3
> times, with a 5 seconds delay. I'd understand some of them failing but
> not all of them.

The http/https side comes under abusive DDoS traffic much more often.
Probably because script kiddies target it more often.  And at the
times when it is being hammered the result is that it is pretty much
driven off the network.

If I am looking at it when this is happening I can sometimes mitigate
the attack with fail2ban rules and other things.  But sometimes it
just happens for a few hours and stops.  Sometimes it hits in the 4am
timeframe and I don't see it until 10am or so the next day, by which
time things have reset back to normal.

And then on top of that is the sum total of everything else that is
running there.

> > When you are seeing proxy gateway failures I think it most likely that
> > the system is under resource stress and is unable to launch a
> > git-http-backend process within the timeouts.  This resource stress
> > can occur as a sum total of everything that is happening on the server
> > at the same time.  It includes git://, http(s)://, and also svn and
> > bzr and hg.  (Notably all of the CVS operations are on a different VM,
> > though likely on the same host server.)  All of those are running on
> > this system and when all of them coincidentally spike use at the same
> > time then they will compete with each other for resources.  The system
> > will run very slowly.  I/O is shared.  Memory is shared.
> 
> Ah, ignore my question above then :-) Interesting!

Plus a few of the services seem very heavy.  For example cgit really
seems to grind on the system.  And some features of cgit don't help.
One can generate a tar.gz file of any tag, branch, or commit.  So when
an anti-social web crawler crawls every link it can find (regardless
of robots.txt) it has at times tried to download a tar.gz of every
historical version of every project that is hosted.  It would seem to
me that doing this would need a lot of disk space on the receiving
end.  But that has been one of the repeating themes.

> > Among other things the current VM has Linux memory overcommit
> > enabled.  Which means that the OOM Out of Memory Killer is triggered
> > at times.  And when that happens there is no longer any guarantee that
> > the machine is in a happy state.  Pretty much it requires a reboot to
> > ensure that everything is happy after the OOM Killer is invoked.  The
> > new system has more resources and I will be disabling overcommit which
> > avoids the OOM Killer.  I strongly feel the OOM killer is
> > inappropriate for enterprise level production servers.  (I would have
> > sworn it was already disabled.  But looking a bit ago I saw that it
> > was enabled.  Did someone else enable it?  Maybe.  That's the problem
> > of cooking in a shared kitchen.  Things move around and it could have
> > been any of the cooks.)
> 
> Good lead, maybe the parallel git clones consume too much memory and
> basically each one of them gets killed eventually.

It's possible, though I don't think it is that likely.  I hadn't been
seeing a lot of oom-killer activity, and then just recently, in the
last couple of weeks, I noticed a few events.  That got me looking
further.  But the older logs had already rolled off so I don't know
what had happened before.
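
For what it is worth, disabling overcommit looks something like the
following sketch.  The ratio value is only an illustration and would
need to be sized against the machine's actual RAM and swap:

    # 2 = strict accounting, no overcommit
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=80

    # persist across reboots
    echo 'vm.overcommit_memory = 2' >> /etc/sysctl.conf
    echo 'vm.overcommit_ratio = 80' >> /etc/sysctl.conf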

> > > It's what I actually used back in the day: the Dockerfile didn't
> > > clone the repository but copied the already checked-out repository
> > > into the image. That has all the advantages you cited, but cloning
> > > straight from your repository makes my images more trustworthy because
> > > the user sees that nothing fishy is going on.
> >
> > Since git commits are hash ids there should be no difference in the
> > end.  A commit with a given hash id will be the same regardless of
> > how it arrived there.  I don't see how anyone can say anything fishy
> > is happening.  I might liken it to newsgroups.  It doesn't matter how
> > an article arrived; it may have come from any of a number of routes.
> > It will be the same article regardless.  With git the hash id ensures
> > that the object content is identical.
> >
> > > Also he can just take my Dockerfile and build it directly without
> > > having to clone something locally first.
> >
> > I didn't quite follow the why of this being different.  Generally I
> > would like to see cpu effort distributed so that it is amortized
> > across all participants as much as possible.  As opposed to having it
> > lumped.  However if something can be done once instead of done
> > repeatedly then of course that is better for that reason.  Since I
> > didn't quite follow the detail here I can only comment with a vague
> > hand waving response that is without deep meaning.
> 
> I take it you are not really familiar with docker and dockerfiles and
> that's why you don't really understand why I'm making a point about
> having the clone as "clean" as possible.

I am not really that much of a fan of Docker and so haven't invested
the time to learn all of the details of it.  But I've worked with
other container systems quite a bit.

> In the docker world you have images, which is the entire OS plus
> usually one program. Then you can run these images and have complete
> reproducibility no matter where you run this image as all the
> dependencies are bundled together.
> 
> To build these images you use a dockerfile, which contains the
> instructions to build this image. Thus when you download one of these
> images, it's common to go have a look at how it is built. If you see
> one dockerfile where it simply clones a repository, versus another
> where it copies a local directory that you are told is a clone from
> the repository, you tend to trust the first one more.

And if I were to learn that the builder had faked out the naming of
the clone source by overriding the names in the /etc/hosts file?  Then
is the trust still there or not?  :-)

Meanwhile...  If it were me I would look at the commit hash and verify
that hash against the upstream.  If they matched then I would know it
was good.
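
Something along these lines, where the URL and branch name are only
placeholders:

    # hash of the commit that was actually built
    git rev-parse HEAD

    # hash of the branch tip at the upstream, without cloning anything
    git ls-remote https://git.example.org/project.git refs/heads/master

If the two hashes match then the content is bit for bit identical.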

I understand however that keeping up appearances can be important for
marketing and public relations.

However, if I saw that someone git cloned from a local mirror and
then "topped off" from the main upstream, that would look pretty good
to me.
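
Roughly like this, with a placeholder mirror path and upstream URL:

    # clone from the nearby mirror...
    git clone /srv/mirror/project.git project
    cd project

    # ...then top off with whatever the main upstream has gained since
    git fetch https://git.example.org/project.git master
    git merge --ff-only FETCH_HEAD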

> > Another time honored technique is to wrap "the action" with a retry
> > loop.  Try.  If failed then sleep for a bit and retry.  As long as the
> > entire retry loop succeeds then report that as a success not a
> > failure.  Too bad git clone does not include such functionality by
> > default.  But it shouldn't be too hard to apply.
> 
> Already done, didn't change anything. My guess is that the parallel
> git clone triggers OOM and once one fails all the others fail too.
> Because it is retried in parallel too, the number of concurrent git
> clones is still too high and that fails. The only thing I can do here
> is limit the amount of clones drastically or use a local repository as
> you mentioned.

My guess is that the sum of everything triggers resource problems.
And when that happens it tends to go on for an hour or two solid.  But
then eventually everything resets.
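
For what it is worth, the retry wrapper mentioned above is only a few
lines of shell.  A sketch; the URL, retry count, and sleep interval
are all arbitrary placeholders:

    # retry the clone a few times before reporting failure
    n=0
    until git clone https://git.example.org/project.git project; do
        n=$((n+1))
        if [ $n -ge 5 ]; then
            echo "clone failed after $n attempts" >&2
            exit 1
        fi
        sleep 30
    done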

Bob
