Does anyone know how to gather stats on where the time is being spent? The idea of consolidating many of the (expensive) ValidatesRunner integration tests into a single pipeline, and only running them individually if that consolidated run fails, has come up several times. I think that would be a big win if this is indeed where our time is going.
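One low-effort way to get those stats might be to pull recent build durations straight from the Jenkins JSON API and see which jobs dominate executor time. A minimal sketch (Python + requests); the Jenkins URL and the job names are placeholders I made up for illustration, not our real configuration:

    import requests

    JENKINS_URL = "https://builds.apache.org"  # assumed; point at the actual master

    def recent_durations(job_name, count=50):
        """Durations (in minutes) of the last `count` completed builds of a job."""
        url = (f"{JENKINS_URL}/job/{job_name}/api/json"
               f"?tree=builds[duration,result]{{0,{count}}}")
        builds = requests.get(url, timeout=30).json()["builds"]
        # Jenkins reports duration in milliseconds; result is null while running.
        return [b["duration"] / 60000.0 for b in builds if b["result"] is not None]

    if __name__ == "__main__":
        # Hypothetical job names, for illustration only.
        for job in ["beam_PreCommit_Java_Commit", "beam_PreCommit_Python_Commit"]:
            durations = recent_durations(job)
            if durations:
                total = sum(durations)
                print(f"{job}: {len(durations)} builds, "
                      f"avg {total / len(durations):.1f} min, "
                      f"total {total:.1f} min")

Summing that per job over a day or a week should show fairly quickly whether the ValidatesRunner suites are where the executor time is actually going.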
On Tue, Sep 24, 2019 at 12:13 PM Daniel Oliveira <danolive...@google.com> wrote:
>
> Those ideas all sound good. I especially agree with trying to reduce tests first; if we've done all we can there and latency is still too high, it means we need more workers. In addition to reducing the number of tests, we could also run less important tests less frequently, particularly postcommits, since many of those are resource intensive. That would require people with good context on what our many postcommits are used for.
>
> Another idea I thought of is moving automated test runs outside of peak coding times. Ideally, during the times when we get the greatest number of PRs (and therefore precommits), we shouldn't have any postcommits running. If we have both pre- and postcommits going at the same time during peak hours, our queue times will shoot up even if the total amount of work doesn't change much.
>
> Btw, you mentioned that this was a problem last year. Do you have any links to discussions about that? They could be useful.
>
> On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin <mig...@google.com> wrote:
>>
>> Hi Daniel,
>>
>> Generally this looks feasible, since jobs wait for a worker to become available before starting.
>>
>> Over time we have added more tests and not deprecated enough of them, which increases the load on the workers. I wonder if we can add something like the total runtime of all running jobs? This would be a safeguard metric showing the amount of time we actually spend running jobs. If it increases while the number of workers stays the same, that would prove we are overloading them (the inverse is not necessarily true).
>>
>> On addressing this, we can review the approaches we took last year and see if any of them apply. Brainstorming, the following ideas come to mind: add more workers, reduce the number of tests, do a better job of filtering out irrelevant tests, cancel irrelevant jobs (e.g. cancel tests if the linter fails), and/or add an option for cancelling irrelevant jobs. One more big item could be an effort on deflaking, but we seem to be doing decently in that area.
>>
>> Regards,
>> Mikhail.
>>
>>
>> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira <danolive...@google.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> A little while ago I was looking at the Precommit Latency metrics on Grafana (link) and saw that the monthly 90th percentile has been increasing steadily over the past few months, from around 10 minutes to currently around 30 minutes.
>>>
>>> After doing some light digging I was shown this page (beam load statistics), which seems to imply that queue times shoot up when all the test executors are occupied, and that this has been happening for longer and more often recently. I also took a look at the commit history for our Jenkins tests and see that new tests have steadily been added.
>>>
>>> I wanted to bring this up with dev@ to ask:
>>>
>>> 1. Is this accurate? Can anyone provide insight into the metrics? Does anyone know how to double-check my assumptions with more concrete metrics?
>>>
>>> 2. Does anyone have ideas on how to address this?
>>>
>>> Thanks,
>>> Daniel Oliveira
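Mikhail's "total runtime of all running jobs" safeguard metric above could probably be approximated from the same Jenkins JSON API without any new instrumentation, by summing completed-build durations per day across a set of jobs. A rough sketch, again with the Jenkins URL and job names as made-up placeholders:

    from collections import defaultdict
    from datetime import datetime, timezone

    import requests

    JENKINS_URL = "https://builds.apache.org"  # assumed; point at the actual master

    def daily_job_minutes(job_names, builds_per_job=100):
        """Return {date: total minutes of completed build time} over job_names."""
        totals = defaultdict(float)
        for job in job_names:
            url = (f"{JENKINS_URL}/job/{job}/api/json"
                   f"?tree=builds[timestamp,duration,result]{{0,{builds_per_job}}}")
            for build in requests.get(url, timeout=30).json()["builds"]:
                if build["result"] is None:  # still running, duration not final
                    continue
                # Jenkins timestamps and durations are both in milliseconds.
                day = datetime.fromtimestamp(build["timestamp"] / 1000,
                                             tz=timezone.utc).date()
                totals[day] += build["duration"] / 60000.0
        return dict(totals)

    if __name__ == "__main__":
        # Hypothetical job names, for illustration only.
        for day, minutes in sorted(daily_job_minutes(
                ["beam_PostCommit_Java", "beam_PostCommit_Python2"]).items()):
            print(f"{day}: {minutes:.0f} minutes of build time")

If that daily total keeps climbing while the worker count stays flat, it would back up the theory that we are simply overloading the executors.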