Scott Wegner created BEAM-6081:
----------------------------------
Summary: Create "Dataflow Reaper" infrastructure to periodically
clean up stuck Dataflow jobs
Key: BEAM-6081
URL: https://issues.apache.org/jira/browse/BEAM-6081
Project: Beam
Issue Type: New Feature
Components: build-system, testing
Reporter: Scott Wegner
Assignee: Alan Myrvold
Our Jenkins infrastructure continuously runs many Dataflow jobs as part of pre-
and post-commit tests. These are scheduled against our shared
{{apache-beam-testing}} project, which has some amount of GCP quota for these
jobs.
Some bugs can cause Dataflow jobs to get stuck and hang indefinitely. Stuck
jobs accumulate and consume our GCP quota, after which all subsequent jobs fail
with quota errors. For an example, see
[[BEAM-6080]|https://issues.apache.org/jira/browse/BEAM-6080].
We should harden the Dataflow runner and test framework to prevent Dataflow
jobs from getting stuck indefinitely, but in reality, bugs happen.
We should add a "reaper" process that periodically queries for long-running jobs
in our Dataflow project and cancels them. This would be fairly straightforward
to implement using the [Dataflow REST
API|https://cloud.google.com/dataflow/docs/reference/rest/], and it could be
scheduled on Jenkins.
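To make the idea concrete, here is a minimal sketch of the selection step only, not a proposed implementation. It assumes the job list has already been fetched from the REST API as dictionaries carrying the Job resource's {{currentState}} and {{createTime}} fields. The 4-hour threshold and the names {{MAX_JOB_AGE}}, {{parse_create_time}}, and {{find_stale_jobs}} are hypothetical. Listing jobs and cancelling them (presumably by requesting the {{JOB_STATE_CANCELLED}} state on each stale job) are left out.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold; the issue does not specify one.
MAX_JOB_AGE = timedelta(hours=4)


def parse_create_time(value):
    """Parse an RFC 3339 UTC timestamp such as '2018-11-15T18:30:00.123456Z'."""
    # Drop the trailing 'Z' and any fractional seconds so strptime can parse it.
    base = value.rstrip("Z").split(".")[0]
    return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def find_stale_jobs(jobs, now, max_age=MAX_JOB_AGE):
    """Return the jobs that are still running and older than max_age."""
    stale = []
    for job in jobs:
        # Only running jobs are candidates; finished jobs hold no quota.
        if job.get("currentState") != "JOB_STATE_RUNNING":
            continue
        age = now - parse_create_time(job["createTime"])
        if age > max_age:
            stale.append(job)
    return stale
```

Keeping the selection logic separate from the API calls means the policy can be unit-tested without touching the shared project.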
If we build such a mechanism, we should also document the imposed policy (i.e.
the threshold for "long-running jobs") and perhaps provide some mechanism for
opting out. For example, performance benchmarking jobs might be long-running by
design.
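One possible shape for an opt-out, sketched under assumptions: a job is exempt if it carries a particular label or if its name starts with an agreed prefix. The label key {{reaper-opt-out}}, the {{perf-}} prefix, and the function name {{is_exempt}} are all hypothetical; the only thing taken from the Dataflow API is that a job has {{name}} and {{labels}} fields.

```python
# Hypothetical policy values; neither is defined by this issue.
OPT_OUT_LABEL = "reaper-opt-out"
EXEMPT_NAME_PREFIXES = ("perf-",)


def is_exempt(job, label=OPT_OUT_LABEL, prefixes=EXEMPT_NAME_PREFIXES):
    """Return True if the job has opted out of being reaped."""
    # The labels field may be absent or empty on a job.
    labels = job.get("labels") or {}
    if labels.get(label) == "true":
        return True
    # str.startswith accepts a tuple, so any listed prefix matches.
    return job.get("name", "").startswith(prefixes)
```

A label-based opt-out keeps the decision with the job's owner, while a prefix list keeps it in the reaper's configuration; either would need to be documented alongside the threshold.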