I've disabled concurrency for the auto-triggered post-commit jobs. That should reduce job scheduling considerably.
I believe this change should resolve the quota issue we saw this time. I'll monitor whether the problem reappears.

--Mikhail

Have feedback <http://go/migryz-feedback>?

On Wed, Aug 1, 2018 at 9:40 AM Pablo Estrada <pabl...@google.com> wrote:

> It feels to me like a peak of 60 jobs per minute is pretty high. If I
> understand correctly, we run up to 20 dataflow jobs in parallel per test
> suite? Or what's the number here?
>
> It is also true that most of our tests are simple NeedsRunner tests that
> test a couple of elements, so the whole pipeline overhead is on startup. This
> may be improved by lumping tests together (though might we lose
> debuggability?). Our average number of jobs is, I hope, muuuch smaller
> than 60 per minute...
>
> With all these considerations, I would lean more towards having a retry
> policy as the immediate solution.
> -P.
>
> On Wed, Aug 1, 2018 at 9:07 AM Andrew Pilloud <apill...@google.com> wrote:
>
>> I like 1 and 2. How do credentials get into Jenkins? Could we create a
>> user per Jenkins host?
>>
>> On Tue, Jul 31, 2018 at 4:33 PM Reuven Lax <re...@google.com> wrote:
>>
>>> There was also a proposal to lump multiple tests into a single Dataflow
>>> job instead of spinning up a separate Dataflow job for each test.
>>>
>>> On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mig...@google.com> wrote:
>>>
>>>> I synced with Rafael. Below is a summary of the discussion.
>>>>
>>>> This quota is CreateRequestsPerMinutePerUser and it has 60 requests per
>>>> user by default.
>>>>
>>>> I've created Jira [BEAM-5053](https://issues.apache.org/jira/browse/BEAM-5053) for this.
>>>>
>>>> I see the following options we can utilize:
>>>> 1. Add retry logic. Although this limits us to 1 dataflow job start per
>>>> second for the whole of Jenkins. In the long run this can also block one
>>>> test job if other jobs take all the slots.
>>>> 2. Utilize different users to spin up Dataflow jobs.
>>>> 3. Find a way to raise the quota limit on Dataflow.
>>>> By default the field limits the value to 60 requests per minute.
>>>> 4. Long run generic suggestion: limit the number of Dataflow jobs we spin
>>>> up and move tests to the form of unit or component tests.
>>>>
>>>> Please, fill in any insights or ideas you have on this.
>>>>
>>>> Regards,
>>>> --Mikhail
>>>>
>>>> Have feedback <http://go/migryz-feedback>?
>>>>
>>>>
>>>> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mig...@google.com> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> It seems that we hit the quota issue again:
>>>>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>>>>
>>>>> Can someone share information on how this was triaged last time or
>>>>> guide me on possible follow-up actions?
>>>>>
>>>>> Regards,
>>>>> --Mikhail
>>>>>
>>>>> Have feedback <http://go/migryz-feedback>?
>>>>>
>>>>>
>>>>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>
>>>>>> Summary for all folks following this story -- and many thanks for
>>>>>> explaining configs to me and pointing me to files and such.
>>>>>>
>>>>>> - Scott made changes to the config and we can now run 3
>>>>>> ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
>>>>>> - With the latest quota changes, we peaked at ~70% capacity in
>>>>>> concurrent Dataflow jobs when running those
>>>>>> - I've been keeping an eye on quota peaks for all resources today and
>>>>>> have not seen any worrisome limits overall.
>>>>>> - Also note there are improvements planned to the
>>>>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>>>>> itself runs faster -- I believe it's on Alan's radar
>>>>>>
>>>>>> Cheers,
>>>>>> r
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>>
>>>>>>> Done!
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1].
>>>>>>>> Can you take a look? I've filed [BEAM-4722]:
>>>>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>>>>
>>>>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>>>>> Quotas should not be a problem; if they are, please file a JIRA under
>>>>>>>>> gcp-quota.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> r
>>>>>>>>>
>>>>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <k...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> One thing that is nice when you do this is to be able to share
>>>>>>>>>> your results. Though if all you are sharing is "they passed" then I
>>>>>>>>>> guess we don't have to insist on evidence.
>>>>>>>>>>
>>>>>>>>>> Kenn
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> A few thoughts:
>>>>>>>>>>>
>>>>>>>>>>> * The Jenkins job getting backed up is
>>>>>>>>>>> beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly
>>>>>>>>>>> requested via "Run Dataflow ValidatesRunner", and only has 8 total
>>>>>>>>>>> runs. So this job is idle more often than backlogged.
>>>>>>>>>>>
>>>>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have
>>>>>>>>>>> different parallelism configurations. If we have budget, we could
>>>>>>>>>>> enable concurrent execution of this job and increase our quota enough
>>>>>>>>>>> to give some breathing room. If we do this, I recommend limiting the
>>>>>>>>>>> max concurrency via throttleConcurrentBuilds [2] to some reasonable limit.
>>>>>>>>>>>
>>>>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>>>>> validation of Dataflow runner, and tests a lot of different aspects
>>>>>>>>>>> of a runner. It would be more efficient to run locally only the tests
>>>>>>>>>>> affected by your change. Note that this requires having access to a
>>>>>>>>>>> GCP project with billing, but most Dataflow developers probably have
>>>>>>>>>>> access to this already. The command for this is:
>>>>>>>>>>>
>>>>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot --tests "org.apache.beam.MyTestClass"
>>>>>>>>>>>
>>>>>>>>>>> [1] https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>>>>> [2] https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>>>>>>> currently set to be "unlimited":
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>>>>
>>>>>>>>>>>> Each test fork is run on a different gradle worker, so the
>>>>>>>>>>>> number of parallel test runs is limited to the max number of workers
>>>>>>>>>>>> configured which is controlled here:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>>>>
>>>>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
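The 48 figure in Lukasz's message is consistent with the "3 * number of CPU cores" setting if the Jenkins host has 16 CPU cores, which is the machine size Rafael mentions elsewhere in this thread. A small sketch of that arithmetic follows; the function name is hypothetical, and the assumption that each Gradle worker launches at most one Dataflow job at a time is mine, not something the thread states outright.

```python
def max_parallel_dataflow_jobs(cpu_cores, workers_per_core=3):
    """Upper bound on concurrent Dataflow jobs from one Jenkins host.

    Assumes one test fork per Gradle worker and at most one Dataflow
    job in flight per fork (an assumption, not confirmed in the thread).
    """
    return workers_per_core * cpu_cores


# 16 cores is the Jenkins machine size quoted in the thread.
print(max_parallel_dataflow_jobs(16))  # 48
```

With unlimited test forking, the worker cap is the only brake, so a single ValidatesRunner run can by itself come close to a 60-requests-per-minute creation quota.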
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>>>>> - Where are those settings?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner
>>>>>>>>>>>>>> tests. We currently allow only one of these to run at a time, to
>>>>>>>>>>>>>> control usage of Dataflow and of GCE quota. Other types of tests do
>>>>>>>>>>>>>> not suffer from this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to see if it's possible to increase Dataflow
>>>>>>>>>>>>>> quota so we can run more of these in parallel. It took me 8 hours
>>>>>>>>>>>>>> end to end to run these tests (about 6 hours for the run to be
>>>>>>>>>>>>>> scheduled). If there was a failure, I would have had to repeat the
>>>>>>>>>>>>>> whole process. In the worst case, this process could have taken me
>>>>>>>>>>>>>> days. While this is not as pressing as some other issues (as most
>>>>>>>>>>>>>> people don't need to run the Dataflow tests on every PR), fixing it
>>>>>>>>>>>>>> would make such changes much easier to manage.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he
>>>>>>>>>>>>>>> was waiting for some test to be scheduled and run, and it took 6
>>>>>>>>>>>>>>> hours or so. I would like to help reduce these wait times by
>>>>>>>>>>>>>>> increasing parallelism. I need help understanding the continuous
>>>>>>>>>>>>>>> minimum of what we use.
>>>>>>>>>>>>>>> It seems the following is true:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - There seems to always be 16 jenkins machines on (16 CPUs each)
>>>>>>>>>>>>>>> - There seems to be three GKE machines always on (1 CPU each)
>>>>>>>>>>>>>>> - Most (if not all) unit tests run on 1 machine, and seem to run
>>>>>>>>>>>>>>> one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>>>>>>> tests, we still have room for 80 other concurrent dataflow jobs to
>>>>>>>>>>>>>>> execute, with 75% of CPU capacity.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> r
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
> Got feedback? go/pabloem-feedback <https://goto.google.com/pabloem-feedback>
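For reference, option 1 from the thread (retry logic against the CreateRequestsPerMinutePerUser quota of 60 requests per minute) could look roughly like the sketch below. Everything named here is hypothetical: `QuotaExceededError` stands in for whatever error the job-creation call actually raises, `launch` is any zero-argument callable that starts a job, and the backoff parameters are illustrative rather than tuned. It is not taken from Beam or from the Dataflow client.

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical stand-in for a 'quota exceeded' job-creation error."""


def launch_with_retry(launch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call launch() and retry with exponential backoff on quota errors.

    launch is any zero-argument callable that starts a job (hypothetical).
    sleep is injectable so the backoff can be observed without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return launch()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the quota error
            # Wait base_delay * 2^attempt plus jitter, so parallel test
            # workers do not all retry in the same second.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, base_delay))
```

The jitter matters because many test forks hit the quota at the same moment; without it they would retry in lockstep and collide again. As noted in the thread, retries alone still cap Jenkins as a whole at about one job start per second, so one suite can starve another.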