There was also a proposal to lump multiple tests into a single Dataflow job
instead of spinning up a separate Dataflow job for each test.
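
For reference, here is a rough sketch of what the retry option (option 1 in the quoted message below) might look like. This is only an illustration in Python, not what Beam's Jenkins jobs or the Dataflow client actually do: `QuotaExceededError` and `submit_with_retry` are made-up names, and `submit` stands for whatever call starts a Dataflow job.

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical stand-in for the error raised when the create quota is hit."""


def submit_with_retry(submit, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call submit() until it succeeds or max_attempts is reached.

    Between attempts, wait base_delay * 2**attempt seconds (capped at
    max_delay), scaled by a jitter factor in [0.5, 1.0) so that parallel
    Jenkins jobs do not all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return submit()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the quota error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * (0.5 + rng() / 2))
```

Retries like this only smooth out bursts. They do not raise the ceiling of 60 job creations per minute per user, so batching several tests into one job or raising the quota would still be needed if demand stays above that rate.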

On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mig...@google.com> wrote:

> I synced with Rafael. Below is a summary of the discussion.
>
> The quota in question is CreateRequestsPerMinutePerUser, which defaults to
> 60 requests per minute per user.
>
> I've created Jira [BEAM-5053] for this:
> https://issues.apache.org/jira/browse/BEAM-5053
>
> I see the following options we can use:
> 1. Add retry logic. This still limits us to one Dataflow job start per
> second across all of Jenkins. In the long run it can also block one test
> job if other jobs take all the slots.
> 2. Use different users to spin up Dataflow jobs.
> 3. Find a way to raise the quota limit on Dataflow. By default the field
> caps the value at 60 requests per minute.
> 4. Long-term, general suggestion: limit the number of Dataflow jobs we spin
> up and convert tests into unit or component tests.
>
> Please share any insights or ideas you have on this.
>
> Regards,
> --Mikhail
>
> Have feedback <http://go/migryz-feedback>?
>
>
> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mig...@google.com>
> wrote:
>
>> Hi Everyone,
>>
>> It seems that we hit a quota issue again:
>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>
>> Can someone share information on how this was triaged last time, or guide
>> me on possible follow-up actions?
>>
>> Regards,
>> --Mikhail
>>
>> Have feedback <http://go/migryz-feedback>?
>>
>>
>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rfern...@google.com>
>> wrote:
>>
>>> Summary for all folks following this story -- and many thanks for
>>> explaining configs to me and pointing me to files and such.
>>>
>>> - Scott made changes to the config and we can now run 3
>>> ValidatesRunner.Dataflow runs in parallel (each run is about 2 hours)
>>> - With the latest quota changes, we peaked at ~70% capacity in
>>> concurrent Dataflow jobs when running those
>>> - I've been keeping an eye on quota peaks for all resources today and
>>> have not seen any worrisome limits overall.
>>> - Also note there are improvements planned to the
>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>> itself runs faster -- I believe it's on Alan's radar
>>>
>>> Cheers,
>>> r
>>>
>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rfern...@google.com>
>>> wrote:
>>>
>>>> Done!
>>>>
>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>>>>
>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1].
>>>>> Can you take a look? I've filed [BEAM-4722]:
>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>
>>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>
>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rfern...@google.com>
>>>>> wrote:
>>>>>
>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>> Quotas should not be a problem; if they are, please file a JIRA under
>>>>>> gcp-quota.
>>>>>>
>>>>>> Cheers,
>>>>>> r
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <k...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> One thing that is nice when you do this is to be able to share your
>>>>>>> results. Though if all you are sharing is "they passed" then I guess we
>>>>>>> don't have to insist on evidence.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> A few thoughts:
>>>>>>>>
>>>>>>>> * The Jenkins job getting backed up is
>>>>>>>> beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly
>>>>>>>> requested via "Run Dataflow ValidatesRunner", and only has 8 total
>>>>>>>> runs. So this job is idle more often than backlogged.
>>>>>>>>
>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have
>>>>>>>> different parallelism configurations. If we have budget, we could
>>>>>>>> enable concurrent execution of this job and increase our quota enough
>>>>>>>> to give some breathing room. If we do this, I recommend limiting the
>>>>>>>> max concurrency via throttleConcurrentBuilds [2] to some reasonable
>>>>>>>> limit.
>>>>>>>>
>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>> validation of Dataflow runner, and tests a lot of different aspects
>>>>>>>> of a runner. It would be more efficient to run locally only the tests
>>>>>>>> affected by your change. Note that this requires having access to a
>>>>>>>> GCP project with billing, but most Dataflow developers probably have
>>>>>>>> access to this already. The command for this is:
>>>>>>>>
>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>> [2]
>>>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>>>> currently set to be "unlimited":
>>>>>>>>>
>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>
>>>>>>>>> Each test fork is run on a different gradle worker, so the number
>>>>>>>>> of parallel test runs is limited to the max number of workers
>>>>>>>>> configured, which is controlled here:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>
>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>> - Where are those settings?
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests.
>>>>>>>>>>> We currently allow only one of these to run at a time, to control
>>>>>>>>>>> usage of Dataflow and of GCE quota. Other types of tests do not
>>>>>>>>>>> suffer from this issue.
>>>>>>>>>>>
>>>>>>>>>>> I would like to see if it's possible to increase Dataflow quota
>>>>>>>>>>> so we can run more of these in parallel. It took me 8 hours end to
>>>>>>>>>>> end to run these tests (about 6 hours for the run to be scheduled).
>>>>>>>>>>> If there had been a failure, I would have had to repeat the whole
>>>>>>>>>>> process. In the worst case, this process could have taken me days.
>>>>>>>>>>> While this is not as pressing as some other issues (as most people
>>>>>>>>>>> don't need to run the Dataflow tests on every PR), fixing it would
>>>>>>>>>>> make such changes much easier to manage.
>>>>>>>>>>>
>>>>>>>>>>> Reuven
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours
>>>>>>>>>>>> or so. I would like to help reduce these wait times by increasing
>>>>>>>>>>>> parallelism. I need help understanding the continuous minimum of
>>>>>>>>>>>> what we use. It seems the following is true:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    - There seem to always be 16 Jenkins machines on (16 CPUs
>>>>>>>>>>>>    each)
>>>>>>>>>>>>    - There seem to be three GKE machines always on (1 CPU
>>>>>>>>>>>>    each)
>>>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and seem
>>>>>>>>>>>>    to run one-at-a-time <-- I think we can safely parallelize
>>>>>>>>>>>>    this to 20.
>>>>>>>>>>>>
>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>>>> tests, we still have room for 80 other concurrent Dataflow jobs to
>>>>>>>>>>>> execute, with 75% of CPU capacity.
>>>>>>>>>>>>
>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> r
>>>>>>>>>>>>
>>>>>>>>>>>
