I like 1 and 2. How do credentials get into Jenkins? Could we create a user per Jenkins host?
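
To make option 1 a bit more concrete, here is a rough Python sketch of what I have in mind. It is only an illustration, not how our Jenkins jobs or the Dataflow client actually behave: QuotaExceededError, seconds_to_start and submit_with_retry are hypothetical names, and the per-user math assumes the CreateRequestsPerMinutePerUser quota is counted independently for each user, which I have not verified.

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical stand-in for the quota error returned on job creation."""


def seconds_to_start(jobs, users=1, requests_per_minute_per_user=60):
    """Lower bound on the time needed to issue `jobs` create requests.

    Assumes the quota is counted independently per user (unverified).
    """
    return jobs * 60.0 / (requests_per_minute_per_user * users)


def submit_with_retry(submit, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call submit(), retrying with jittered exponential backoff.

    Only QuotaExceededError is retried; the last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return submit()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: wait between 50% and 100% of the computed delay so
            # that parallel test forks do not all retry at the same moment.
            sleep(delay * (0.5 + rng() / 2))
```

With the default of 60 requests per minute, seconds_to_start(48) gives 48.0, so the 48 parallel jobs Lukasz mentions need at least 48 seconds just to be created under one user. Under the per-user assumption, seconds_to_start(48, users=4) gives 12.0, which is why I like combining 1 and 2.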
On Tue, Jul 31, 2018 at 4:33 PM Reuven Lax <re...@google.com> wrote:

> There was also a proposal to lump multiple tests into a single Dataflow
> job instead of spinning up a separate Dataflow job for each test.
>
> On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mig...@google.com>
> wrote:
>
>> I synced with Rafael. Below is a summary of the discussion.
>>
>> This quota is CreateRequestsPerMinutePerUser, and it defaults to 60
>> requests per user.
>>
>> I've created Jira [BEAM-5053](
>> https://issues.apache.org/jira/browse/BEAM-5053) for this.
>>
>> I see the following options:
>> 1. Add retry logic. This still limits us to 1 Dataflow job start per
>> second for the whole of Jenkins. In the long run it can also block one
>> test job if other jobs take all the slots.
>> 2. Use different users to spin up Dataflow jobs.
>> 3. Find a way to raise the quota limit on Dataflow. By default the
>> field limits the value to 60 requests per minute.
>> 4. Long-term, general suggestion: limit the number of Dataflow jobs we
>> spin up and move tests to the form of unit or component tests.
>>
>> Please add any insights or ideas you have on this.
>>
>> Regards,
>> --Mikhail
>>
>> Have feedback <http://go/migryz-feedback>?
>>
>>
>> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mig...@google.com>
>> wrote:
>>
>>> Hi Everyone,
>>>
>>> It seems that we hit a quota issue again:
>>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>>
>>> Can someone share how this was triaged last time, or guide me on
>>> possible follow-up actions?
>>>
>>> Regards,
>>> --Mikhail
>>>
>>> Have feedback <http://go/migryz-feedback>?
>>>
>>>
>>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rfern...@google.com>
>>> wrote:
>>>
>>>> Summary for all folks following this story -- and many thanks for
>>>> explaining configs to me and pointing me to files and such.
>>>>
>>>> - Scott made changes to the config and we can now run 3
>>>> ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
>>>> - With the latest quota changes, we peaked at ~70% capacity in
>>>> concurrent Dataflow jobs when running those
>>>> - I've been keeping an eye on quota peaks for all resources today and
>>>> have not seen any worrisome limits overall.
>>>> - Also note there are improvements planned to the
>>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>>> itself runs faster -- I believe it's on Alan's radar
>>>>
>>>> Cheers,
>>>> r
>>>>
>>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rfern...@google.com>
>>>> wrote:
>>>>
>>>>> Done!
>>>>>
>>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>>>>>
>>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1].
>>>>>> Can you take a look? I've filed [BEAM-4722]:
>>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>>
>>>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rfern...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>>> Quotas should not be a problem; if they are, please file a JIRA under
>>>>>>> gcp-quota.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> r
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <k...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> One thing that is nice when you do this is to be able to share your
>>>>>>>> results. Though if all you are sharing is "they passed" then I guess we
>>>>>>>> don't have to insist on evidence.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> A few thoughts:
>>>>>>>>>
>>>>>>>>> * The Jenkins job getting backed up is
>>>>>>>>> beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1].
>>>>>>>>> Since Mikhail refactored Jenkins jobs, this only runs when
>>>>>>>>> explicitly requested via "Run Dataflow ValidatesRunner", and only
>>>>>>>>> has 8 total runs. So this job is idle more often than backlogged.
>>>>>>>>>
>>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have
>>>>>>>>> different parallelism configurations. If we have budget, we could
>>>>>>>>> enable concurrent execution of this job and increase our quota
>>>>>>>>> enough to give some breathing room. If we do this, I recommend
>>>>>>>>> limiting the max concurrency via throttleConcurrentBuilds [2] to
>>>>>>>>> some reasonable limit.
>>>>>>>>>
>>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>>> validation of Dataflow runner, and tests a lot of different aspects
>>>>>>>>> of a runner. It would be more efficient to run locally only the
>>>>>>>>> tests affected by your change. Note that this requires having
>>>>>>>>> access to a GCP project with billing, but most Dataflow developers
>>>>>>>>> probably have access to this already.
>>>>>>>>> The command for this is:
>>>>>>>>>
>>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>>> [2]
>>>>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>>>>> currently set to be "unlimited":
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>>
>>>>>>>>>> Each test fork is run on a different gradle worker, so the number
>>>>>>>>>> of parallel test runs is limited to the max number of workers
>>>>>>>>>> configured, which is controlled here:
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>>
>>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>>> - Where are those settings?
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests.
>>>>>>>>>>>> We currently allow only one of these to run at a time, to
>>>>>>>>>>>> control usage of Dataflow and of GCE quota. Other types of
>>>>>>>>>>>> tests do not suffer from this issue.
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to see if it's possible to increase Dataflow quota
>>>>>>>>>>>> so we can run more of these in parallel. It took me 8 hours end
>>>>>>>>>>>> to end to run these tests (about 6 hours for the run to be
>>>>>>>>>>>> scheduled). If there was a failure, I would have had to repeat
>>>>>>>>>>>> the whole process. In the worst case, this process could have
>>>>>>>>>>>> taken me days. While this is not as pressing as some other
>>>>>>>>>>>> issues (as most people don't need to run the Dataflow tests on
>>>>>>>>>>>> every PR), fixing it would make such changes much easier to
>>>>>>>>>>>> manage.
>>>>>>>>>>>>
>>>>>>>>>>>> Reuven
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>>>>>>>> waiting for some test to be scheduled and run, and it took 6
>>>>>>>>>>>>> hours or so. I would like to help reduce these wait times by
>>>>>>>>>>>>> increasing parallelism. I need help understanding the
>>>>>>>>>>>>> continuous minimum of what we use. It seems the following is
>>>>>>>>>>>>> true:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> - There seem to always be 16 Jenkins machines on (16 CPUs
>>>>>>>>>>>>> each)
>>>>>>>>>>>>> - There seem to be three GKE machines always on (1 CPU
>>>>>>>>>>>>> each)
>>>>>>>>>>>>> - Most (if not all) unit tests run on 1 machine, and seem
>>>>>>>>>>>>> to run one-at-a-time <-- I think we can safely parallelize
>>>>>>>>>>>>> this to 20.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>>>>> tests, we still have room for 80 other concurrent dataflow jobs
>>>>>>>>>>>>> to execute, with 75% of CPU capacity.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> r
>>>>>>>>>>>>>