Hi!
So I've resolved this issue, and I'll update future readers about what
happened and what they can look into.
When I posted this originally, I didn't realize how many factors were
working against me; together they ultimately overwhelmed our GoCD server.
Here is what I experienced, with the fixes I applied:
- Jobs intermittently failing and agents stuck in the Cancelling state. This
  error appeared exclusively during material updates on the Go agents.
  - This was correlated with a set of fan-out pipelines whose jobs
    effectively DDoS'd our Git Stash server by cloning/updating a problematic
    repo (problematic in its contents, not GoCD related). This was fixed by
    two immediate changes plus a longer-term follow-up:
    1. Using shallowClone on the git material.
    2. Reducing the raw number of clones by doing the clone in an initial
       stage and passing a trimmed-down set of files to subsequent stages as
       artifacts (most notably omitting the .git directory). This also
       requires the following stages to not check out materials and to rely
       on fetch artifact tasks instead (see the config sketch at the end of
       this post):
       https://docs.gocd.org/17.8.0/configuration/configuration_reference.html#stage
       and
       https://docs.gocd.org/17.8.0/configuration/configuration_reference.html#git
    3. Finally, filing a follow-up to move the problematic files to Git LFS
       (large file storage) or to change their storage mechanism entirely.
- Jobs intermittently failing, with agents recovering gracefully. Ultimately
  the NoHttpResponseException was a *canary* in this case: its frequency
  increased steadily, finally resulting in an OOM exception on the GoCD
  server. This problem had several contributing factors:
  - Increased Go server statistic monitoring. The most notable offender
    was a monitor for job runtime averages.
    - This was addressed by decreasing the frequency and shrinking the
      time window of this statistic.
  - Increased total Jobs executed/required due to more concurrent product
    development. This resulted in more fanning out in our dependency graph.
    - The solution is the same as for the following point; a word of
      warning, though: closely monitor the GoCD server's CPU and memory
      usage!
  - Increased Go agent power. While this gave us ~30% wins in end-to-end
    time, the additional load from agent polling was enough to finally push
    our GoCD server over the edge.
    - The brute-force solution was to give our Go server more resources;
      the following log messages forced my hand in requesting this (I had
      not seen them initially; see the Jetty thread-pool sketch at the end
      of this post):
      2017-11-14 09:22:45,306 WARN [Scheduler-1678207403] LowResourceMonitor:292 - Low Resources: Low on threads: qtp1793329556{STARTED,20<=300<=300,i=0,q=86}, Low on threads: qtp1793329556{STARTED,20<=300<=300,i=0,q=86}
      ...
      2017-11-14 10:05:51,556 WARN [qtp1793329556-11768] QueuedThreadPool:617 - Unexpected thread death: org.eclipse.jetty.util.thread.QueuedThreadPool$3@24550918 in qtp1793329556{STARTED,20<=300<=300,i=0,q=469}
- Here are some of the resources I used:
  - https://docs.gocd.org/17.8.0/installation/performance_tuning.html
    - GC logging is currently enabled; I'll monitor that until I'm satisfied.
  - https://github.com/gocd/gocd/blob/17.8.0/base/src/com/thoughtworks/go/util/SystemEnvironment.java#L136
  - http://www.eclipse.org/jetty/documentation/current/optimizing.html#tuning-examples
  - https://github.com/gocd/gocd/blob/17.8.0/server/config/jetty.xml
  - http://www.eclipse.org/jetty/documentation/current/high-load.html#_operating_system_tuning
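
To make the first point a bit more concrete, here is a rough sketch of the
kind of cruise-config.xml layout I'm describing. The pipeline/stage/job
names, URL, paths, and commands are made up for illustration; check the
configuration reference links above for the exact attributes and element
ordering:

    <pipeline name="example-product">
      <materials>
        <!-- shallowClone keeps agents from pulling the repo's full history -->
        <git url="ssh://git@stash.example.com/proj/problematic-repo.git"
             shallowClone="true" />
      </materials>
      <!-- Only this first stage touches the Git server; it publishes a
           trimmed tree (no .git directory) for later stages to consume. -->
      <stage name="checkout-and-trim">
        <jobs>
          <job name="package-sources">
            <tasks>
              <exec command="tar">
                <arg>-czf</arg>
                <arg>trimmed-src.tgz</arg>
                <arg>--exclude=.git</arg>
                <arg>.</arg>
              </exec>
            </tasks>
            <artifacts>
              <artifact src="trimmed-src.tgz" dest="src" />
            </artifacts>
          </job>
        </jobs>
      </stage>
      <!-- Subsequent stages skip the material checkout entirely and fetch
           the trimmed sources that were published above. -->
      <stage name="build" fetchMaterials="false">
        <jobs>
          <job name="compile">
            <tasks>
              <fetchartifact stage="checkout-and-trim" job="package-sources"
                             srcfile="src/trimmed-src.tgz" dest="fetched" />
              <exec command="tar">
                <arg>-xzf</arg>
                <arg>fetched/trimmed-src.tgz</arg>
              </exec>
              <exec command="./build.sh" />
            </tasks>
          </job>
        </jobs>
      </stage>
    </pipeline>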
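
And for the thread-pool side of the server overload: I won't paste our exact
settings, but a generic Jetty QueuedThreadPool stanza looks roughly like the
following. Whether GoCD picks these values up from jetty.xml or from a system
property is exactly what the SystemEnvironment.java and jetty.xml links above
are for, so treat this as the shape of the change rather than a drop-in, and
the numbers as illustrative rather than a recommendation:

    <Configure id="Server" class="org.eclipse.jetty.server.Server">
      <!-- The LowResourceMonitor warnings above fired when the request pool
           (20<=300<=300) was saturated; raising maxThreads is one lever,
           giving the server enough CPU/memory to run those threads is the
           other. -->
      <Arg name="threadpool">
        <New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
          <Set name="minThreads">20</Set>
          <Set name="maxThreads">500</Set>
          <Set name="idleTimeout">60000</Set>
        </New>
      </Arg>
    </Configure>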
Cheers,
On Monday, 6 November 2017 11:34:26 UTC-5, Aravind SV wrote:
>
> Hello Carl,
>
> GoAgentServerHttpClient
> <https://github.com/gocd/gocd/blob/fedb5c57b5c62fe5f16cb9fa3029e1f6a9195302/base/src/com/thoughtworks/go/agent/common/ssl/GoAgentServerHttpClient.java>
>
> and GoAgentServerHttpClientBuilder
> <https://github.com/gocd/gocd/blob/fedb5c57b5c62fe5f16cb9fa3029e1f6a9195302/base/src/com/thoughtworks/go/agent/common/ssl/GoAgentServerHttpClientBuilder.java>
>
> might interest you. I find it strange that the server is not busy, but says
> NoHttpResponseException. I would have assumed it's somehow so busy that it
> cannot service this request.
>
> Of course, since you say it never recovers, you might be right that it's a
> problem with the connection somehow. If you find out anything, by adding a
> retry, let us know.
>
> Cheers,
> Aravind
>
>
> On Mon, Oct 30, 2017 at 1:57 PM, Carl D <[email protected]>
> wrote:
>
>>
>> We are currently using GoCD 17.8.0 and occasionally we see our agents
>> getting stuck in Cancelling state, I managed to capture the logs from the
>> latest instance and I believe applying some configuration to tinker with
>> connection pool settings should mitigate or resolve the issue (this is
>> based on
>> https://stackoverflow.com/questions/10558791/apache-httpclient-interim-error-nohttpresponseexception
>>
>> and the source documentation from http client).
>>
>> As an aside, we've had another similar issue overwhelming our repository
>> server which causes Jobs to fail intermittently (This has spawned a
>> separate discussion internally). We can separate these problems into two
>> broad categories: one is recoverable and rerunning the Job is sufficient,
>> the other (this case) the agent does not recover and one must restart the
>> agent service.
>>
>> My question is, has anyone else encountered this issue where agents
>> become stuck in Cancelling state? How did you overcome it?
>>
>> Could someone enlighten me about where in the GoCD configuration I can
>> look to follow up on connection pool settings or injecting a retry handler?
>>
>