Yeah, that's useful. I was asking about getting stats at the Jenkins job level, e.g., are our PostCommits taking up all the time, or our Precommits?
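One way to pull those job-level numbers would be a quick script against the Jenkins JSON API. A rough sketch follows; the base URL, the PreCommit/PostCommit job-naming convention, and how far back the kept build history goes are all assumptions here, not verified:

    # Rough sketch: sum recent build durations per job category via the
    # Jenkins JSON API. The base URL and the PreCommit/PostCommit naming
    # convention are assumptions; adjust for the real instance.
    from collections import defaultdict

    import requests

    JENKINS_URL = "https://builds.apache.org"  # assumed Jenkins base URL


    def classify(job_name):
        # Beam jobs are assumed to embed "PreCommit"/"PostCommit" in
        # their names (e.g. beam_PreCommit_Java_Commit).
        if "PreCommit" in job_name:
            return "PreCommit"
        if "PostCommit" in job_name:
            return "PostCommit"
        return "Other"


    def build_minutes_by_category():
        # tree=jobs[name,builds[duration]] limits the response to job
        # names plus the duration (milliseconds) of each kept build.
        resp = requests.get(
            JENKINS_URL + "/api/json",
            params={"tree": "jobs[name,builds[duration]]"},
            timeout=60,
        )
        resp.raise_for_status()
        totals = defaultdict(float)
        for job in resp.json().get("jobs", []):
            for build in job.get("builds") or []:
                minutes = build.get("duration", 0) / 60000.0
                totals[classify(job["name"])] += minutes
        return totals


    if __name__ == "__main__":
        for category, minutes in sorted(build_minutes_by_category().items()):
            print(f"{category}: {minutes:,.0f} build-minutes")

Comparing the PreCommit and PostCommit totals over a fixed window should show which family is eating the executor time.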
On Tue, Sep 24, 2019 at 1:23 PM Lukasz Cwik <lc...@google.com> wrote:
>
> We can get the per-Gradle-task profile with the --profile flag:
> https://jakewharton.com/static/files/trace/profile.html
> This information also appears within the build scans that are sent to
> Gradle.
>
> Integrating with either of these sources of information would allow us
> to figure out whether it's new tasks or old tasks taking longer.
>
> On Tue, Sep 24, 2019 at 12:23 PM Robert Bradshaw <rober...@google.com> wrote:
>>
>> Does anyone know how to gather stats on where the time is being spent?
>> Several times the idea has come up of consolidating many of the
>> (expensive) ValidatesRunner integration tests into a single pipeline,
>> and then running things individually only if that fails. I think
>> that'd be a big win if this is indeed where our time is being spent.
>>
>> On Tue, Sep 24, 2019 at 12:13 PM Daniel Oliveira <danolive...@google.com> wrote:
>> >
>> > Those ideas all sound good. I especially agree with trying to reduce
>> > tests first; if we've done all we can there and latency is still too
>> > high, it means we need more workers. In addition to reducing the
>> > number of tests, there's also running less important tests less
>> > frequently, particularly the postcommits, since many of those are
>> > resource-intensive. That would require people with good context on
>> > what our many postcommits are used for.
>> >
>> > Another idea I thought of is scheduling automated tests outside of
>> > peak coding times. Ideally, during the times when we get the
>> > greatest number of PRs (and therefore precommits), we shouldn't have
>> > any postcommits running. If we have both pre- and postcommits going
>> > at the same time during peak hours, our queue times will shoot up
>> > even if the total amount of work doesn't change much.
>> >
>> > By the way, you mentioned that this was a problem last year. Do you
>> > have any links to discussions about that? They could be useful.
>> >
>> > On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin <mig...@google.com> wrote:
>> >>
>> >> Hi Daniel,
>> >>
>> >> Generally this looks plausible, since jobs wait for a worker to
>> >> become available before they start.
>> >>
>> >> Over time we have added more tests and not deprecated enough of
>> >> them, which increases the load on the workers. I wonder if we can
>> >> add something like the total runtime of all running jobs? This
>> >> would be a safeguard metric showing the amount of time we actually
>> >> spend running jobs. If it increases while the number of workers
>> >> stays the same, that would show we are overloading them (the
>> >> inverse is not necessarily true).
>> >>
>> >> On addressing this, we can review the approaches we took last year
>> >> and see if any of them apply. Brainstorming, the following ideas
>> >> come to mind: add more workers, reduce the number of tests, do a
>> >> better job of filtering out irrelevant tests, and cancel irrelevant
>> >> jobs (e.g., cancel tests if the linter fails) and/or add an option
>> >> for cancelling irrelevant jobs. One more big item could be a
>> >> deflaking effort, but we seem to be doing decently in this area.
>> >>
>> >> Regards,
>> >> Mikhail.
>> >>
>> >> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira <danolive...@google.com> wrote:
>> >>>
>> >>> Hi everyone,
>> >>>
>> >>> A little while ago I was taking a look at the Precommit Latency
>> >>> metrics on Grafana (link) and saw that the monthly 90th percentile
>> >>> has risen sharply over the past few months, from around 10 minutes
>> >>> to currently around 30 minutes.
>> >>>
>> >>> After some light digging I was shown this page (beam load
>> >>> statistics), which seems to imply that queue times shoot up when
>> >>> all the test executors are occupied, and that this has been
>> >>> happening for longer stretches and more often recently. I also
>> >>> took a look at the commit history for our Jenkins tests, and new
>> >>> tests have steadily been added.
>> >>>
>> >>> I wanted to bring this up with dev@ to ask:
>> >>>
>> >>> 1. Is this accurate? Can anyone provide insight into the metrics?
>> >>> Does anyone know how to double-check my assumptions with more
>> >>> concrete metrics?
>> >>>
>> >>> 2. Does anyone have ideas on how to address this?
>> >>>
>> >>> Thanks,
>> >>> Daniel Oliveira
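For anyone who wants to poke at Mikhail's proposed safeguard metric (total runtime of all currently running jobs) or double-check the latency numbers with something concrete, here is a minimal sketch against the same Jenkins JSON API. The base URL and the example job name are assumptions, and the percentile below is over run time only; measuring actual queue time would need something like the Jenkins Metrics plugin:

    # Sketch of two cross-check metrics against the Jenkins JSON API:
    #   1. total elapsed minutes of builds currently occupying executors
    #      (the proposed safeguard metric), and
    #   2. the 90th percentile duration of one job's recent builds.
    # The base URL and example job name are assumptions.
    import time

    import requests

    JENKINS_URL = "https://builds.apache.org"  # assumed Jenkins base URL


    def running_job_minutes():
        # Each executor's currentExecutable is the build it is running
        # (if any); its timestamp is the start time in epoch millis.
        resp = requests.get(
            JENKINS_URL + "/computer/api/json",
            params={"tree": "computer[executors[currentExecutable[timestamp]]]"},
            timeout=60,
        )
        resp.raise_for_status()
        now_ms = time.time() * 1000
        total_ms = 0.0
        for node in resp.json().get("computer", []):
            for executor in node.get("executors") or []:
                build = executor.get("currentExecutable")
                if build and build.get("timestamp"):
                    total_ms += now_ms - build["timestamp"]
        return total_ms / 60000.0


    def p90_build_minutes(job_name):
        # Nearest-rank 90th percentile of completed-build durations for
        # one job. This is run time only; queue time would come from the
        # Jenkins Metrics plugin, which attaches queuingDurationMillis
        # to each build.
        resp = requests.get(
            f"{JENKINS_URL}/job/{job_name}/api/json",
            params={"tree": "builds[duration,building]"},
            timeout=60,
        )
        resp.raise_for_status()
        durations = sorted(
            b["duration"] for b in resp.json().get("builds", [])
            if not b.get("building") and b.get("duration")
        )
        if not durations:
            return 0.0
        return durations[int(0.9 * (len(durations) - 1))] / 60000.0


    if __name__ == "__main__":
        print(f"Running jobs: {running_job_minutes():.0f} elapsed minutes total")
        # "beam_PreCommit_Java_Commit" is a hypothetical example job name.
        p90 = p90_build_minutes("beam_PreCommit_Java_Commit")
        print(f"p90 duration: {p90:.1f} min")

Sampling running_job_minutes() on a schedule and plotting it next to the worker count would show whether total load is growing while capacity stays flat, which is exactly the overload signal Mikhail describes.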