Yeah, that's useful. I was asking about getting stats at the Jenkins job level, e.g., are our PostCommits taking up all the time, or our Precommits?
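One way to pull those job-level numbers would be a quick script against the Jenkins JSON API. A rough sketch follows; the base URL, the PreCommit/PostCommit job-naming convention, and how far back the kept build history goes are all assumptions here, not verified:

    # Rough sketch: sum recent build durations per job category via the
    # Jenkins JSON API. The base URL and the PreCommit/PostCommit naming
    # convention are assumptions; adjust for the real instance.
    from collections import defaultdict

    import requests

    JENKINS_URL = "https://builds.apache.org"  # assumed Jenkins base URL


    def classify(job_name):
        # Beam jobs are assumed to embed "PreCommit"/"PostCommit" in
        # their names (e.g. beam_PreCommit_Java_Commit).
        if "PreCommit" in job_name:
            return "PreCommit"
        if "PostCommit" in job_name:
            return "PostCommit"
        return "Other"


    def build_minutes_by_category():
        # tree=jobs[name,builds[duration]] limits the response to job
        # names plus the duration (milliseconds) of each kept build.
        resp = requests.get(
            JENKINS_URL + "/api/json",
            params={"tree": "jobs[name,builds[duration]]"},
            timeout=60,
        )
        resp.raise_for_status()
        totals = defaultdict(float)
        for job in resp.json().get("jobs", []):
            for build in job.get("builds") or []:
                minutes = build.get("duration", 0) / 60000.0
                totals[classify(job["name"])] += minutes
        return totals


    if __name__ == "__main__":
        for category, minutes in sorted(build_minutes_by_category().items()):
            print(f"{category}: {minutes:,.0f} build-minutes")

Comparing the PreCommit and PostCommit totals over a fixed window should show which family is eating the executor time.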
On Tue, Sep 24, 2019 at 1:23 PM Lukasz Cwik <lc...@google.com> wrote:
>
> We can get the per-Gradle-task profile with the --profile flag:
> https://jakewharton.com/static/files/trace/profile.html
> This information also appears within the build scans that are sent to
> Gradle.
>
> Integrating with either of these sources of information would allow us
> to figure out whether it's new tasks or old tasks taking longer.
>
> On Tue, Sep 24, 2019 at 12:23 PM Robert Bradshaw <rober...@google.com> wrote:
>>
>> Does anyone know how to gather stats on where the time is being spent?
>> Several times the idea has come up of consolidating many of the
>> (expensive) ValidatesRunner integration tests into a single pipeline,
>> and then running things individually only if that fails. I think
>> that'd be a big win if this is indeed where our time is being spent.
>>
>> On Tue, Sep 24, 2019 at 12:13 PM Daniel Oliveira <danolive...@google.com> wrote:
>> >
>> > Those ideas all sound good. I especially agree with trying to reduce
>> > tests first; if we've done all we can there and latency is still too
>> > high, it means we need more workers. In addition to reducing the
>> > number of tests, there's also running less important tests less
>> > frequently, particularly the postcommits, since many of those are
>> > resource-intensive. That would require people with good context on
>> > what our many postcommits are used for.
>> >
>> > Another idea I thought of is scheduling automated tests outside of
>> > peak coding times. Ideally, during the times when we get the
>> > greatest number of PRs (and therefore precommits), we shouldn't have
>> > any postcommits running. If we have both pre- and postcommits going
>> > at the same time during peak hours, our queue times will shoot up
>> > even if the total amount of work doesn't change much.
>> >
>> > By the way, you mentioned that this was a problem last year. Do you
>> > have any links to discussions about that? They could be useful.
>> >
>> > On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin <mig...@google.com> wrote:
>> >>
>> >> Hi Daniel,
>> >>
>> >> Generally this looks plausible, since jobs wait for a worker to
>> >> become available before they start.
>> >>
>> >> Over time we have added more tests and not deprecated enough of
>> >> them, which increases the load on the workers. I wonder if we can
>> >> add something like the total runtime of all running jobs? This
>> >> would be a safeguard metric showing the amount of time we actually
>> >> spend running jobs. If it increases while the number of workers
>> >> stays the same, that would show we are overloading them (the
>> >> inverse is not necessarily true).
>> >>
>> >> On addressing this, we can review the approaches we took last year
>> >> and see if any of them apply. Brainstorming, the following ideas
>> >> come to mind: add more workers, reduce the number of tests, do a
>> >> better job of filtering out irrelevant tests, and cancel irrelevant
>> >> jobs (e.g., cancel tests if the linter fails) and/or add an option
>> >> for cancelling irrelevant jobs. One more big item could be a
>> >> deflaking effort, but we seem to be doing decently in this area.
>> >>
>> >> Regards,
>> >> Mikhail.
>> >>
>> >> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira <danolive...@google.com> wrote:
>> >>>
>> >>> Hi everyone,
>> >>>
>> >>> A little while ago I was taking a look at the Precommit Latency
>> >>> metrics on Grafana (link) and saw that the monthly 90th percentile
>> >>> has risen sharply over the past few months, from around 10 minutes
>> >>> to currently around 30 minutes.
>> >>>
>> >>> After some light digging I was shown this page (beam load
>> >>> statistics), which seems to imply that queue times shoot up when
>> >>> all the test executors are occupied, and that this has been
>> >>> happening for longer stretches and more often recently. I also
>> >>> took a look at the commit history for our Jenkins tests, and new
>> >>> tests have steadily been added.
>> >>>
>> >>> I wanted to bring this up with dev@ to ask:
>> >>>
>> >>> 1. Is this accurate? Can anyone provide insight into the metrics?
>> >>> Does anyone know how to double-check my assumptions with more
>> >>> concrete metrics?
>> >>>
>> >>> 2. Does anyone have ideas on how to address this?
>> >>>
>> >>> Thanks,
>> >>> Daniel Oliveira
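For anyone who wants to poke at Mikhail's proposed safeguard metric (total runtime of all currently running jobs) or double-check the latency numbers with something concrete, here is a minimal sketch against the same Jenkins JSON API. The base URL and the example job name are assumptions, and the percentile below is over run time only; measuring actual queue time would need something like the Jenkins Metrics plugin:

    # Sketch of two cross-check metrics against the Jenkins JSON API:
    #   1. total elapsed minutes of builds currently occupying executors
    #      (the proposed safeguard metric), and
    #   2. the 90th percentile duration of one job's recent builds.
    # The base URL and example job name are assumptions.
    import time

    import requests

    JENKINS_URL = "https://builds.apache.org"  # assumed Jenkins base URL


    def running_job_minutes():
        # Each executor's currentExecutable is the build it is running
        # (if any); its timestamp is the start time in epoch millis.
        resp = requests.get(
            JENKINS_URL + "/computer/api/json",
            params={"tree": "computer[executors[currentExecutable[timestamp]]]"},
            timeout=60,
        )
        resp.raise_for_status()
        now_ms = time.time() * 1000
        total_ms = 0.0
        for node in resp.json().get("computer", []):
            for executor in node.get("executors") or []:
                build = executor.get("currentExecutable")
                if build and build.get("timestamp"):
                    total_ms += now_ms - build["timestamp"]
        return total_ms / 60000.0


    def p90_build_minutes(job_name):
        # Nearest-rank 90th percentile of completed-build durations for
        # one job. This is run time only; queue time would come from the
        # Jenkins Metrics plugin, which attaches queuingDurationMillis
        # to each build.
        resp = requests.get(
            f"{JENKINS_URL}/job/{job_name}/api/json",
            params={"tree": "builds[duration,building]"},
            timeout=60,
        )
        resp.raise_for_status()
        durations = sorted(
            b["duration"] for b in resp.json().get("builds", [])
            if not b.get("building") and b.get("duration")
        )
        if not durations:
            return 0.0
        return durations[int(0.9 * (len(durations) - 1))] / 60000.0


    if __name__ == "__main__":
        print(f"Running jobs: {running_job_minutes():.0f} elapsed minutes total")
        # "beam_PreCommit_Java_Commit" is a hypothetical example job name.
        p90 = p90_build_minutes("beam_PreCommit_Java_Commit")
        print(f"p90 duration: {p90:.1f} min")

Sampling running_job_minutes() on a schedule and plotting it next to the worker count would show whether total load is growing while capacity stays flat, which is exactly the overload signal Mikhail describes.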