I like 1 and 2. How do credentials get into Jenkins? Could we create a user
per Jenkins host?

On Tue, Jul 31, 2018 at 4:33 PM Reuven Lax <re...@google.com> wrote:

> There was also a proposal to lump multiple tests into a single Dataflow
> job instead of spinning up a separate Dataflow job for each test.
>
> On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mig...@google.com>
> wrote:
>
>> I synced with Rafael. Below is a summary of the discussion.
>>
>> The quota is CreateRequestsPerMinutePerUser, which defaults to 60 requests
>> per minute per user.
>>
>> I've created Jira BEAM-5053 for this:
>> https://issues.apache.org/jira/browse/BEAM-5053
>>
>> I see the following options:
>> 1. Add retry logic. This limits us to 1 Dataflow job start per second
>> across all of Jenkins, though. At scale it can also block one test job if
>> other jobs take all the slots.
>> 2. Use different users to spin up Dataflow jobs.
>> 3. Find a way to raise the quota limit on Dataflow. By default the field
>> limits the value to 60 requests per minute.
>> 4. Generic long-run suggestion: limit the number of Dataflow jobs we spin up
>> and move tests to the form of unit or component tests.
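
A minimal sketch of option 1 (retry logic), written in Python and not taken from the Beam codebase. It assumes a hypothetical `QuotaExceededError` that the job launcher raises when a create request is rejected for quota, and a hypothetical `launch` callable that starts one Dataflow job. It retries with exponential backoff plus jitter so that concurrent Jenkins jobs do not all retry at the same instant:

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical error: a job-creation request was rejected for quota."""


def launch_with_retry(launch, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call launch() until it succeeds or max_attempts is reached.

    The wait before retry k is base_delay * 2**k seconds, capped at
    max_delay and scaled by a random factor between 0.5 and 1.0.
    """
    for attempt in range(max_attempts):
        try:
            return launch()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the quota error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * (0.5 + 0.5 * rng()))
```

The `sleep` and `rng` parameters exist only so the behaviour can be exercised without real waiting. As noted in option 1, retries alone do not raise the ceiling: the jobs still share the same 60 requests per minute.
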
>>
>> Please share any insights or ideas you have on this.
>>
>> Regards,
>> --Mikhail
>>
>> Have feedback <http://go/migryz-feedback>?
>>
>>
>> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mig...@google.com>
>> wrote:
>>
>>> Hi Everyone,
>>>
>>> It seems that we hit a quota issue again:
>>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>>
>>> Can someone share information on how this was triaged last time or guide
>>> me on possible follow-up actions?
>>>
>>> Regards,
>>> --Mikhail
>>>
>>> Have feedback <http://go/migryz-feedback>?
>>>
>>>
>>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rfern...@google.com>
>>> wrote:
>>>
>>>> Summary for all folks following this story -- and many thanks for
>>>> explaining configs to me and pointing me to files and such.
>>>>
>>>> - Scott made changes to the config and we can now run 3
>>>> ValidatesRunner.Dataflow runs in parallel (each run is about 2 hours)
>>>> - With the latest quota changes, we peaked at ~70% capacity in
>>>> concurrent Dataflow jobs when running those
>>>> - I've been keeping an eye on quota peaks for all resources today and
>>>> have not seen any worrisome limits overall.
>>>> - Also note there are improvements planned to the
>>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>>> itself runs faster -- I believe it's on Alan's radar
>>>>
>>>> Cheers,
>>>> r
>>>>
>>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rfern...@google.com>
>>>> wrote:
>>>>
>>>>> Done!
>>>>>
>>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>>>>>
>>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1].
>>>>>> Can you take a look? I've filed [BEAM-4722]:
>>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>>
>>>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rfern...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>>> Quotas should not be a problem; if they are, please file a JIRA under
>>>>>>> gcp-quota.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> r
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <k...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> One thing that is nice when you do this is to be able to share your
>>>>>>>> results. Though if all you are sharing is "they passed" then I guess we
>>>>>>>> don't have to insist on evidence.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> A few thoughts:
>>>>>>>>>
>>>>>>>>> * The Jenkins job getting backed up is
>>>>>>>>> beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly
>>>>>>>>> requested via "Run Dataflow ValidatesRunner", and only has 8 total
>>>>>>>>> runs. So this job is idle more often than backlogged.
>>>>>>>>>
>>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have
>>>>>>>>> different parallelism configurations. If we have budget, we could
>>>>>>>>> enable concurrent execution of this job and increase our quota enough
>>>>>>>>> to give some breathing room. If we do this, I recommend limiting the
>>>>>>>>> max concurrency via throttleConcurrentBuilds [2] to some reasonable
>>>>>>>>> limit.
>>>>>>>>>
>>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>>> validation of the Dataflow runner, and tests a lot of different
>>>>>>>>> aspects of a runner. It would be more efficient to run locally only
>>>>>>>>> the tests affected by your change. Note that this requires having
>>>>>>>>> access to a GCP project with billing, but most Dataflow developers
>>>>>>>>> probably have access to this already. The command for this is:
>>>>>>>>>
>>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>>> [2]
>>>>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>>>>> currently set to be "unlimited":
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>>
>>>>>>>>>> Each test fork is run on a different gradle worker, so the number
>>>>>>>>>> of parallel test runs is limited to the max number of workers
>>>>>>>>>> configured, which is controlled here:
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>>>
>>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>>
>>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
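
As a quick check of that figure, here is the arithmetic in Python. The 16-core machine size is an assumption taken from the Jenkins machine description elsewhere in this thread, and the function name is made up for illustration:

```python
def max_parallel_test_forks(cpu_cores, workers_per_core=3):
    """Upper bound on parallel test forks when max workers = 3 * CPU cores."""
    return workers_per_core * cpu_cores


# Assuming one 16-CPU Jenkins machine:
print(max_parallel_test_forks(16))  # 48
```
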
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>>> - Where are those settings?
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests.
>>>>>>>>>>>> We currently allow only one of these to run at a time, to control
>>>>>>>>>>>> usage of Dataflow and of GCE quota. Other types of tests do not
>>>>>>>>>>>> suffer from this issue.
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to see if it's possible to increase Dataflow quota
>>>>>>>>>>>> so we can run more of these in parallel. It took me 8 hours end to
>>>>>>>>>>>> end to run these tests (about 6 hours for the run to be
>>>>>>>>>>>> scheduled). If there was a failure, I would have had to repeat the
>>>>>>>>>>>> whole process. In the worst case, this process could have taken me
>>>>>>>>>>>> days. While this is not as pressing as some other issues (as most
>>>>>>>>>>>> people don't need to run the Dataflow tests on every PR), fixing
>>>>>>>>>>>> it would make such changes much easier to manage.
>>>>>>>>>>>>
>>>>>>>>>>>> Reuven
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>>>>>>>> waiting for some test to be scheduled and run, and it took 6
>>>>>>>>>>>>> hours or so. I would like to help reduce these wait times by
>>>>>>>>>>>>> increasing parallelism. I need help understanding the continuous
>>>>>>>>>>>>> minimum of what we use. It seems the following is true:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - There seem to always be 16 Jenkins machines on (16 CPUs
>>>>>>>>>>>>>    each)
>>>>>>>>>>>>>    - There seem to be three GKE machines always on (1 CPU
>>>>>>>>>>>>>    each)
>>>>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and seem
>>>>>>>>>>>>>    to run one-at-a-time <-- I think we can safely parallelize
>>>>>>>>>>>>>    this to 20.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>>>>> tests, we still have room for 80 other concurrent Dataflow jobs
>>>>>>>>>>>>> to execute, with 75% of CPU capacity.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> r
>>>>>>>>>>>>>
>>>>>>>>>>>>
