Hi!
So I've resolved this issue, and I'll update future readers about what
happened and what they can look into.
When I posted this originally, I didn't realize how many factors were
working against me; together they ultimately overwhelmed our GoCD server.
Here is what I experienced, with the fixes I applied:
- Jobs intermittently failing and agents stuck in the Cancelling state. This
  error appeared exclusively during material updates on the Go agents.
  - This was correlated with a set of fan-out pipelines whose jobs
    effectively DDoS'd our Git Stash server by cloning/updating a problematic
    repo (problematic in its contents, not GoCD related). This was fixed by
    two immediate changes plus a longer-term follow-up:
    1. Using shallowClone on the git material.
    2. Reducing the raw number of clones by doing the clone in an initial
       stage and passing a trimmed-down set of files to subsequent stages as
       artifacts (most notably omitting the .git directory). This also
       requires the following stages to not check out materials and to rely
       on fetch artifact tasks instead (see the config sketch at the end of
       this post):
       https://docs.gocd.org/17.8.0/configuration/configuration_reference.html#stage
       and
       https://docs.gocd.org/17.8.0/configuration/configuration_reference.html#git
    3. Finally, filing a follow-up to move the problematic files to Git LFS
       (large file storage) or to change their storage mechanism entirely.
- Jobs intermittently failing, with agents recovering gracefully. Ultimately
  the NoHttpResponseException was a *canary* in this case: its frequency
  increased steadily, finally resulting in an OOM exception on the GoCD
  server. This problem had several contributing factors:
  - Increased Go server statistic monitoring. The most notable offender
    was a monitor for job runtime averages.
    - This was addressed by decreasing the frequency and shrinking the
      time window of this statistic.
  - Increased total Jobs executed/required due to more concurrent product
    development. This resulted in more fanning out in our dependency graph.
    - The solution is the same as for the following point; a word of
      warning, though: closely monitor the GoCD server's CPU and memory
      usage!
  - Increased Go agent power. While this gave us ~30% wins in end-to-end
    time, the additional load from agent polling was enough to finally push
    our GoCD server over the edge.
    - The brute-force solution was to give our Go server more resources;
      the following log messages forced my hand in requesting this (I had
      not seen them initially; see the Jetty thread-pool sketch at the end
      of this post):
      2017-11-14 09:22:45,306 WARN [Scheduler-1678207403] LowResourceMonitor:292 - Low Resources: Low on threads: qtp1793329556{STARTED,20<=300<=300,i=0,q=86}, Low on threads: qtp1793329556{STARTED,20<=300<=300,i=0,q=86}
      ...
      2017-11-14 10:05:51,556 WARN [qtp1793329556-11768] QueuedThreadPool:617 - Unexpected thread death: org.eclipse.jetty.util.thread.QueuedThreadPool$3@24550918 in qtp1793329556{STARTED,20<=300<=300,i=0,q=469}
- Here are some of the resources I used:
  - https://docs.gocd.org/17.8.0/installation/performance_tuning.html
    - GC logging is currently enabled; I'll monitor that until I'm satisfied.
  - https://github.com/gocd/gocd/blob/17.8.0/base/src/com/thoughtworks/go/util/SystemEnvironment.java#L136
  - http://www.eclipse.org/jetty/documentation/current/optimizing.html#tuning-examples
  - https://github.com/gocd/gocd/blob/17.8.0/server/config/jetty.xml
  - http://www.eclipse.org/jetty/documentation/current/high-load.html#_operating_system_tuning
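
To make the first point a bit more concrete, here is a rough sketch of the
kind of cruise-config.xml layout I'm describing. The pipeline/stage/job
names, URL, paths, and commands are made up for illustration; check the
configuration reference links above for the exact attributes and element
ordering:

    <pipeline name="example-product">
      <materials>
        <!-- shallowClone keeps agents from pulling the repo's full history -->
        <git url="ssh://git@stash.example.com/proj/problematic-repo.git"
             shallowClone="true" />
      </materials>
      <!-- Only this first stage touches the Git server; it publishes a
           trimmed tree (no .git directory) for later stages to consume. -->
      <stage name="checkout-and-trim">
        <jobs>
          <job name="package-sources">
            <tasks>
              <exec command="tar">
                <arg>-czf</arg>
                <arg>trimmed-src.tgz</arg>
                <arg>--exclude=.git</arg>
                <arg>.</arg>
              </exec>
            </tasks>
            <artifacts>
              <artifact src="trimmed-src.tgz" dest="src" />
            </artifacts>
          </job>
        </jobs>
      </stage>
      <!-- Subsequent stages skip the material checkout entirely and fetch
           the trimmed sources that were published above. -->
      <stage name="build" fetchMaterials="false">
        <jobs>
          <job name="compile">
            <tasks>
              <fetchartifact stage="checkout-and-trim" job="package-sources"
                             srcfile="src/trimmed-src.tgz" dest="fetched" />
              <exec command="tar">
                <arg>-xzf</arg>
                <arg>fetched/trimmed-src.tgz</arg>
              </exec>
              <exec command="./build.sh" />
            </tasks>
          </job>
        </jobs>
      </stage>
    </pipeline>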
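
And for the thread-pool side of the server overload: I won't paste our exact
settings, but a generic Jetty QueuedThreadPool stanza looks roughly like the
following. Whether GoCD picks these values up from jetty.xml or from a system
property is exactly what the SystemEnvironment.java and jetty.xml links above
are for, so treat this as the shape of the change rather than a drop-in, and
the numbers as illustrative rather than a recommendation:

    <Configure id="Server" class="org.eclipse.jetty.server.Server">
      <!-- The LowResourceMonitor warnings above fired when the request pool
           (20<=300<=300) was saturated; raising maxThreads is one lever,
           giving the server enough CPU/memory to run those threads is the
           other. -->
      <Arg name="threadpool">
        <New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
          <Set name="minThreads">20</Set>
          <Set name="maxThreads">500</Set>
          <Set name="idleTimeout">60000</Set>
        </New>
      </Arg>
    </Configure>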
Cheers,
On Monday, 6 November 2017 11:34:26 UTC-5, Aravind SV wrote:
>
> Hello Carl,
>
> GoAgentServerHttpClient
> <https://github.com/gocd/gocd/blob/fedb5c57b5c62fe5f16cb9fa3029e1f6a9195302/base/src/com/thoughtworks/go/agent/common/ssl/GoAgentServerHttpClient.java>
>
> and GoAgentServerHttpClientBuilder
> <https://github.com/gocd/gocd/blob/fedb5c57b5c62fe5f16cb9fa3029e1f6a9195302/base/src/com/thoughtworks/go/agent/common/ssl/GoAgentServerHttpClientBuilder.java>
>
> might interest you. I find it strange that the server is not busy, but says
> NoHttpResponseException. I would have assumed it's somehow so busy that it
> cannot service this request.
>
> Of course, since you say it never recovers, you might be right that it's a
> problem with the connection somehow. If you find out anything, by adding a
> retry, let us know.
>
> Cheers,
> Aravind
>
>
> On Mon, Oct 30, 2017 at 1:57 PM, Carl D <[email protected]>
> wrote:
>
>>
>> We are currently using GoCD 17.8.0 and occasionally we see our agents
>> getting stuck in Cancelling state, I managed to capture the logs from the
>> latest instance and I believe applying some configuration to tinker with
>> connection pool settings should mitigate or resolve the issue (this is
>> based on
>> https://stackoverflow.com/questions/10558791/apache-httpclient-interim-error-nohttpresponseexception
>>
>> and the source documentation from http client).
>>
>> As an aside, we've had another similar issue overwhelming our repository
>> server which causes Jobs to fail intermittently (This has spawned a
>> separate discussion internally). We can separate these problems into two
>> broad categories: one is recoverable and rerunning the Job is sufficient,
>> the other (this case) the agent does not recover and one must restart the
>> agent service.
>>
>> My question is, has anyone else encountered this issue where agents
>> become stuck in Cancelling state? How did you overcome it?
>>
>> Could someone enlighten me about where in the GoCD configuration I can
>> look to follow up on connection pool settings or injecting a retry handler?
>>
>