I've disabled concurrency for the auto-triggered post-commit jobs. That should
reduce job scheduling considerably.

I believe this change should resolve the quota issue we saw this time. I'll
monitor whether the problem reappears.

--Mikhail

Have feedback <http://go/migryz-feedback>?


On Wed, Aug 1, 2018 at 9:40 AM Pablo Estrada <pabl...@google.com> wrote:

> It feels to me like a peak of 60 jobs per minute is pretty high. If I
> understand correctly, we run up to 20 dataflow jobs in parallel per test
> suite? Or what's the number here?
>
> It is also true that most of our tests are simple NeedsRunner tests that
> test a couple of elements, so the whole pipeline overhead is on startup. This
> may be improved by lumping tests together (though we might lose
> debuggability?). Our average number of jobs is, I hope, muuuch smaller
> than 60 per minute...
>
> With all these considerations, I would lean more towards having a retry
> policy as the immediate solution.
> -P.
>
> On Wed, Aug 1, 2018 at 9:07 AM Andrew Pilloud <apill...@google.com> wrote:
>
>> I like 1 and 2. How do credentials get into Jenkins? Could we create a
>> user per Jenkins host?
>>
>> On Tue, Jul 31, 2018 at 4:33 PM Reuven Lax <re...@google.com> wrote:
>>
>>> There was also a proposal to lump multiple tests into a single Dataflow
>>> job instead of spinning up a separate Dataflow job for each test.
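
A rough sketch of why lumping helps with a per-minute create-request quota: the number of job-creation requests drops by the batch size. The function name and the test counts below are illustrative assumptions, not figures from the Beam test suites.

```python
import math


def create_requests(num_tests, tests_per_job=1):
    """Number of Dataflow job-creation requests needed for num_tests tests.

    tests_per_job is a hypothetical batch size: how many tests share one job.
    """
    return math.ceil(num_tests / tests_per_job)


# Illustrative numbers only: 200 tests, one job each vs. 10 tests per job.
print(create_requests(200))      # one request per test
print(create_requests(200, 10))  # one request per batch of 10
```

The trade-off raised elsewhere in the thread still applies: a failure in a lumped job is harder to attribute to a single test.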
>>>
>>> On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mig...@google.com>
>>> wrote:
>>>
>>>> I synced with Rafael. Below is a summary of the discussion.
>>>>
>>>> The quota in question is CreateRequestsPerMinutePerUser, which defaults
>>>> to 60 requests per minute per user.
>>>>
>>>> I've created Jira [BEAM-5053](
>>>> https://issues.apache.org/jira/browse/BEAM-5053) for this.
>>>>
>>>> I see the following options:
>>>> 1. Add retry logic. However, this limits us to 1 Dataflow job start per
>>>> second for the whole of Jenkins. In the long run this can also block one
>>>> test job if other jobs take all the slots.
>>>> 2. Use different users to start Dataflow jobs.
>>>> 3. Find a way to raise the quota limit on Dataflow. By default the field
>>>> limits the value to 60 requests per minute.
>>>> 4. Generic long-term suggestion: limit the number of Dataflow jobs we spin
>>>> up and move tests to the form of unit or component tests.
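
Option 1 could look roughly like the sketch below: retry the job-creation call with exponential backoff and jitter when the quota is exhausted. This is a minimal illustration, not Beam or Dataflow client code; `QuotaExceededError`, `create_with_retry`, and the `create_job` callable are hypothetical names, and the delay values are assumptions.

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical error raised when the create-request quota is exhausted."""


def create_with_retry(create_job, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep, rng=random.random):
    """Call create_job(), retrying with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return create_job()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the quota error
            # Double the cap on each attempt, then wait a random fraction of
            # it so that parallel test forks do not all retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(cap * rng())
```

The `sleep` and `rng` parameters are there only so the behaviour can be exercised without real waiting. Retrying smooths out bursts but does not raise the ceiling: with a limit of 60 requests per minute, sustained throughput still averages at most one job start per second.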
>>>>
>>>> Please share any insights or ideas you have on this.
>>>>
>>>> Regards,
>>>> --Mikhail
>>>>
>>>> Have feedback <http://go/migryz-feedback>?
>>>>
>>>>
>>>> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mig...@google.com>
>>>> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> It seems that we hit the quota issue again:
>>>>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>>>>
>>>>> Can someone share information on how this was triaged last time or
>>>>> guide me on possible follow-up actions?
>>>>>
>>>>> Regards,
>>>>> --Mikhail
>>>>>
>>>>> Have feedback <http://go/migryz-feedback>?
>>>>>
>>>>>
>>>>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rfern...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Summary for all folks following this story -- and many thanks for
>>>>>> explaining configs to me and pointing me to files and such.
>>>>>>
>>>>>> - Scott made changes to the config and we can now run 3
>>>>>> ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
>>>>>> - With the latest quota changes, we peaked at ~70% capacity in
>>>>>> concurrent Dataflow jobs when running those
>>>>>> - I've been keeping an eye on quota peaks for all resources today and
>>>>>> have not seen any worrisome limits overall.
>>>>>> - Also note there are improvements planned to the
>>>>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>>>>> itself runs faster -- I believe it's on Alan's radar
>>>>>>
>>>>>> Cheers,
>>>>>> r
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rfern...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Done!
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1].
>>>>>>>> Can you take a look? I've filed [BEAM-4722]:
>>>>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <
>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>
>>>>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>>>>> Quotas should not be a problem; if they are, please file a JIRA under
>>>>>>>>> gcp-quota.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> r
>>>>>>>>>
>>>>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <k...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> One thing that is nice when you do this is to be able to share
>>>>>>>>>> your results. Though if all you are sharing is "they passed" then I
>>>>>>>>>> guess we don't have to insist on evidence.
>>>>>>>>>>
>>>>>>>>>> Kenn
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> A few thoughts:
>>>>>>>>>>>
>>>>>>>>>>> * The Jenkins job getting backed up is
>>>>>>>>>>> beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly
>>>>>>>>>>> requested via "Run Dataflow ValidatesRunner", and only has 8 total
>>>>>>>>>>> runs. So this job is idle more often than backlogged.
>>>>>>>>>>>
>>>>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have
>>>>>>>>>>> different parallelism configurations. If we have budget, we could
>>>>>>>>>>> enable concurrent execution of this job and increase our quota
>>>>>>>>>>> enough to give some breathing room. If we do this, I recommend
>>>>>>>>>>> limiting the max concurrency via throttleConcurrentBuilds [2] to
>>>>>>>>>>> some reasonable limit.
>>>>>>>>>>>
>>>>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>>>>> validation of Dataflow runner, and tests a lot of different aspects
>>>>>>>>>>> of a runner. It would be more efficient to run locally only the
>>>>>>>>>>> tests affected by your change. Note that this requires having
>>>>>>>>>>> access to a GCP project with billing, but most Dataflow developers
>>>>>>>>>>> probably have access to this already. The command for this is:
>>>>>>>>>>>
>>>>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>>>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>>>>> [2]
>>>>>>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>>>>>>> currently set to be "unlimited":
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>>>>
>>>>>>>>>>>> Each test fork is run on a different gradle worker, so the
>>>>>>>>>>>> number of parallel test runs is limited to the max number of
>>>>>>>>>>>> workers configured, which is controlled here:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>>>>
>>>>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
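
The figure of 48 follows from the configuration described above, assuming 16 CPU cores per Jenkins machine (the count reported elsewhere in this thread). A small check, with a hypothetical helper name:

```python
def max_parallel_forks(cpu_cores, workers_per_core=3):
    """Upper bound on parallel test forks: one fork per Gradle worker.

    workers_per_core=3 mirrors the "3 * number of CPU cores" setting
    described above; cpu_cores is an assumption about the Jenkins host.
    """
    return workers_per_core * cpu_cores


print(max_parallel_forks(16))  # 48
```

Because each ValidatesRunner test fork launches its own Dataflow job, this bound is also the peak number of concurrent Dataflow jobs from a single suite run.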
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>>>>> - Where are those settings?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner
>>>>>>>>>>>>>> tests. We currently allow only one of these to run at a time, to
>>>>>>>>>>>>>> control usage of Dataflow and of GCE quota. Other types of tests do
>>>>>>>>>>>>>> not suffer from this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to see if it's possible to increase Dataflow
>>>>>>>>>>>>>> quota so we can run more of these in parallel. It took me 8 hours
>>>>>>>>>>>>>> end to end to run these tests (about 6 hours for the run to be
>>>>>>>>>>>>>> scheduled). If there was a failure, I would have had to repeat the
>>>>>>>>>>>>>> whole process. In the worst case, this process could have taken me
>>>>>>>>>>>>>> days. While this is not as pressing as some other issues (as most
>>>>>>>>>>>>>> people don't need to run the Dataflow tests on every PR), fixing it
>>>>>>>>>>>>>> would make such changes much easier to manage.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he
>>>>>>>>>>>>>>> was waiting for some test to be scheduled and run, and it took 6
>>>>>>>>>>>>>>> hours or so. I would like to help reduce these wait times by
>>>>>>>>>>>>>>> increasing parallelism. I need help understanding the continuous
>>>>>>>>>>>>>>> minimum of what we use. It seems the following is true:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - There seems to always be 16 jenkins machines on (16
>>>>>>>>>>>>>>>    CPUs each)
>>>>>>>>>>>>>>>    - There seems to be three GKE machines always on (1 CPU
>>>>>>>>>>>>>>>    each)
>>>>>>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and
>>>>>>>>>>>>>>>    seem to run one-at-a-time <-- I think we can safely
>>>>>>>>>>>>>>>    parallelize this to 20.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>>>>>>> tests, we still have room for 80 other concurrent dataflow jobs
>>>>>>>>>>>>>>> to execute, with 75% of CPU capacity.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> r
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
> Got feedback? go/pabloem-feedback
> <https://goto.google.com/pabloem-feedback>
>
