[ https://issues.apache.org/jira/browse/BEAM-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Myrvold reassigned BEAM-6081:
----------------------------------
Assignee: Alan Myrvold
> Create "Dataflow Reaper" infrastructure to periodically clean up stuck Dataflow jobs
> ------------------------------------------------------------------------------------
>
> Key: BEAM-6081
> URL: https://issues.apache.org/jira/browse/BEAM-6081
> Project: Beam
> Issue Type: New Feature
> Components: build-system, testing
> Reporter: Scott Wegner
> Assignee: Alan Myrvold
> Priority: Minor
>
> Our Jenkins infrastructure continuously runs many Dataflow jobs as part of
> pre- and post-commit tests. These are scheduled against our shared
> {{apache-beam-testing}} project, which has a limited GCP quota for these
> jobs.
> Some bugs can cause Dataflow jobs to get stuck and hang indefinitely. Stuck
> jobs pile up and consume our GCP quota, after which all subsequent jobs fail
> with quota errors. For an example, see
> [BEAM-6080|https://issues.apache.org/jira/browse/BEAM-6080].
> We should harden the Dataflow runner and test framework to prevent Dataflow
> jobs from getting stuck indefinitely, but in practice bugs will still happen.
> We should add a "reaper" process that periodically queries for long-running
> jobs in our Dataflow project and cancels them. This should be fairly
> straightforward to build using the [Dataflow REST
> API|https://cloud.google.com/dataflow/docs/reference/rest/] and to schedule on
> Jenkins.
> If we build such a mechanism, we should also document the policy it enforces
> (i.e. the threshold for "long-running jobs") and perhaps provide a mechanism for
> opting out. For example, performance benchmarking jobs might be long-running
> by design.
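
The selection policy described in the issue could be sketched roughly as follows. This is only an illustration, not the implementation the issue settled on: the six-hour threshold, the `reaper-opt-out` label name, and both function names are hypothetical. The job dictionaries are assumed to carry the `id`, `currentState`, `createTime` and `labels` fields of the Job resource in the Dataflow REST API. Listing the jobs and requesting cancellation through that API are left out.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy values; the issue does not specify either of them.
MAX_AGE = timedelta(hours=6)
OPT_OUT_LABEL = "reaper-opt-out"  # hypothetical label for jobs that run long by design


def parse_create_time(ts):
    """Parse a createTime such as '2018-11-15T18:00:00.123Z'.

    Sub-second precision is dropped and the timestamp is assumed to be UTC.
    """
    return datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def select_jobs_to_cancel(jobs, now, max_age=MAX_AGE):
    """Return the ids of running jobs older than max_age that have not opted out."""
    stale = []
    for job in jobs:
        # Only running jobs can be stuck; finished or cancelled jobs are skipped.
        if job.get("currentState") != "JOB_STATE_RUNNING":
            continue
        # Jobs carrying the opt-out label are exempt from the policy.
        if OPT_OUT_LABEL in job.get("labels", {}):
            continue
        if now - parse_create_time(job["createTime"]) > max_age:
            stale.append(job["id"])
    return stale
```

Keeping the selection step as a pure function makes the threshold and the opt-out rule easy to unit test and to document, independently of whatever Jenkins job ends up calling the API.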