Does anyone know how to gather stats on where the time is being spent?
Several times the idea of consolidating many of the (expensive)
ValidatesRunner integration tests into a single pipeline, and only
running them individually if that consolidated run fails, has come up.
I think that'd be a big win if this is indeed where our time is being
spent.
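
One low-tech way to gather those stats is to pull per-build durations
straight from the Jenkins JSON API. A minimal sketch in Python, assuming
the ASF Jenkins at builds.apache.org and a purely illustrative job name:

    import json
    import urllib.request

    JENKINS = "https://builds.apache.org"
    JOB = "beam_PreCommit_Java_Commit"  # hypothetical job name

    url = (JENKINS + "/job/" + JOB
           + "/api/json?tree=builds[number,result,duration]{0,50}")
    with urllib.request.urlopen(url) as resp:
        builds = json.load(resp)["builds"]

    # Jenkins reports durations in milliseconds (0 while still running).
    for b in builds:
        if b["duration"]:
            print(b["number"], b["result"],
                  round(b["duration"] / 60000, 1), "min")

Summing these per job over a week or so would show which suites
dominate total executor time.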

On Tue, Sep 24, 2019 at 12:13 PM Daniel Oliveira <danolive...@google.com> wrote:
>
> Those ideas all sound good. I especially agree with trying to reduce
> tests first; if we've done all we can there and latency is still too
> high, then we need more workers. In addition to reducing the number of
> tests, we could also run less important tests less frequently,
> particularly postcommits, since many of those are resource intensive.
> That would require people with good context on what our many
> postcommits are used for.
>
> Another idea I thought of is scheduling automated test runs to avoid
> peak coding times. Ideally, during the hours when we get the greatest
> number of PRs (and therefore precommits) we shouldn't have any
> postcommits running. If we have both pre- and postcommits going at the
> same time during peak hours, our queue times will shoot up even if the
> total amount of work doesn't change much.
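>
> One way to find those peak hours would be to bucket build start
> timestamps by hour of day. A rough sketch in Python against the
> Jenkins JSON API (the job name is just an example):
>
>     import collections
>     import datetime
>     import json
>     import urllib.request
>
>     JENKINS = "https://builds.apache.org"
>     JOB = "beam_PreCommit_Java_Commit"  # hypothetical job name
>
>     url = (JENKINS + "/job/" + JOB
>            + "/api/json?tree=builds[timestamp]{0,500}")
>     with urllib.request.urlopen(url) as resp:
>         builds = json.load(resp)["builds"]
>
>     # Count build starts per UTC hour to see when the load peaks.
>     by_hour = collections.Counter(
>         datetime.datetime.utcfromtimestamp(b["timestamp"] / 1000).hour
>         for b in builds)
>     for hour in sorted(by_hour):
>         print("%02d:00 UTC %s" % (hour, "#" * by_hour[hour]))
>
> The cron triggers for postcommits could then be moved into whatever
> quiet hours that shows.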
>
> Btw, you mentioned that this was a problem last year. Do you have any links 
> to discussions about that? It seems like it could be useful.
>
> On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin <mig...@google.com> wrote:
>>
>> Hi Daniel,
>>
>> Generally this sounds plausible, since jobs wait for a worker to
>> become available before starting.
>>
>> Over time we have added more tests and haven't deprecated enough of
>> them, which increases the load on the workers. I wonder if we can add
>> a metric like the total runtime of all currently running jobs? That
>> would be a safeguard metric showing how much time we actually spend
>> running jobs. If it increases while the number of workers stays the
>> same, that would prove we are overloading them (the inverse is not
>> necessarily true).
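>>
>> For what it's worth, a minimal sketch of that safeguard metric,
>> summing the elapsed time of every currently running build via the
>> Jenkins JSON API (assuming the ASF instance at builds.apache.org):
>>
>>     import json
>>     import time
>>     import urllib.request
>>
>>     JENKINS = "https://builds.apache.org"
>>
>>     url = (JENKINS
>>            + "/api/json?tree=jobs[name,lastBuild[building,timestamp]]")
>>     with urllib.request.urlopen(url) as resp:
>>         jobs = json.load(resp)["jobs"]
>>
>>     now_ms = time.time() * 1000
>>     running = [j for j in jobs
>>                if j.get("lastBuild") and j["lastBuild"].get("building")]
>>     # Sum the elapsed wall time of all currently running builds.
>>     total_min = sum(now_ms - j["lastBuild"]["timestamp"]
>>                     for j in running) / 60000
>>     print(len(running), "running jobs,",
>>           round(total_min), "minutes elapsed total")
>>
>> Sampling that every few minutes and plotting it next to the worker
>> count would show whether we are saturating the pool.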
>>
>> To address this, we can review the approaches we took last year and
>> see if any of them apply. Brainstorming, the following ideas come to
>> mind: add more workers, reduce the number of tests, do a better job
>> of filtering out irrelevant tests, and cancel irrelevant jobs (e.g.,
>> cancel remaining tests when the linter fails), or at least add an
>> option for cancelling them. One more big lever could be a deflaking
>> effort, but we seem to be decent in that area.
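>>
>> On cancelling irrelevant jobs: a hypothetical sketch using the
>> Jenkins queue API (POST /queue/cancelItem), cancelling a PR's
>> still-queued builds once its linter fails. The PR number and the
>> matching logic are purely illustrative, and the call needs
>> authentication in practice:
>>
>>     import json
>>     import urllib.request
>>
>>     JENKINS = "https://builds.apache.org"
>>     PR = "9999"  # hypothetical PR number whose linter just failed
>>
>>     with urllib.request.urlopen(JENKINS + "/queue/api/json") as resp:
>>         items = json.load(resp)["items"]
>>
>>     for item in items:
>>         # Queued items carry the triggering PR in their parameters and
>>         # causes; matching on the raw JSON is crude but fine here.
>>         if PR in json.dumps(item.get("actions", [])):
>>             req = urllib.request.Request(
>>                 JENKINS + "/queue/cancelItem?id=%d" % item["id"],
>>                 method="POST")
>>             urllib.request.urlopen(req)  # needs API token + crumb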
>>
>> Regards,
>> Mikhail.
>>
>>
>> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira <danolive...@google.com> 
>> wrote:
>>>
>>> Hi everyone,
>>>
>>> A little while ago I was taking a look at the Precommit Latency
>>> metrics on Grafana (link) and saw that the monthly 90th percentile
>>> has been increasing sharply over the past few months, from around
>>> 10 minutes to currently around 30 minutes.
>>>
>>> After doing some light digging I was shown this page (beam load
>>> statistics), which seems to imply that queue times shoot up when all
>>> the test executors are occupied, and that this has been happening
>>> more often, and for longer stretches, recently. I also took a look
>>> at the commit history for our Jenkins tests and saw that new tests
>>> have steadily been added.
>>>
>>> I wanted to bring this up with the dev@ to ask:
>>>
>>> 1. Is this accurate? Can anyone provide insight into the metrics?
>>> Does anyone know how to double-check my assumptions with more
>>> concrete metrics?
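>>>
>>> For example, if our Jenkins has the Metrics plugin installed (it
>>> attaches a queuingDurationMillis action to each build), I imagine
>>> something like this sketch could compute the p90 queue time per job
>>> directly, with an illustrative job name:
>>>
>>>     import json
>>>     import statistics
>>>     import urllib.request
>>>
>>>     JENKINS = "https://builds.apache.org"
>>>     JOB = "beam_PreCommit_Java_Commit"  # hypothetical job name
>>>
>>>     url = (JENKINS + "/job/" + JOB + "/api/json"
>>>            + "?tree=builds[actions[queuingDurationMillis]]{0,100}")
>>>     with urllib.request.urlopen(url) as resp:
>>>         builds = json.load(resp)["builds"]
>>>
>>>     queue_min = [a["queuingDurationMillis"] / 60000
>>>                  for b in builds for a in b["actions"]
>>>                  if a and "queuingDurationMillis" in a]
>>>     # quantiles(n=10) yields 9 cut points; the last one is the p90.
>>>     print("p90 queue time (min):",
>>>           statistics.quantiles(queue_min, n=10)[-1])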
>>>
>>> 2. Does anyone have ideas on how to address this?
>>>
>>> Thanks,
>>> Daniel Oliveira
