Scott Wegner created BEAM-6081:
----------------------------------

             Summary: Create "Dataflow Reaper" infrastructure to periodically 
clean up stuck Dataflow jobs
                 Key: BEAM-6081
                 URL: https://issues.apache.org/jira/browse/BEAM-6081
             Project: Beam
          Issue Type: New Feature
          Components: build-system, testing
            Reporter: Scott Wegner
            Assignee: Alan Myrvold


Our Jenkins infrastructure continuously runs many Dataflow jobs as part of pre- 
and post-commit tests. These are scheduled against our shared 
{{apache-beam-testing}} project, which has some amount of GCP quota for these 
jobs.

Some bugs can cause Dataflow jobs to get stuck and hang indefinitely. Stuck 
test jobs then pile up and consume our GCP quota, after which all subsequent 
jobs fail with quota errors. For an example, see 
[[BEAM-6080]|https://issues.apache.org/jira/browse/BEAM-6080].

We should harden the Dataflow runner and test framework to prevent Dataflow 
jobs from getting stuck indefinitely, but in reality, bugs happen.

We should add a "reaper" process that periodically queries for long-running 
jobs in our Dataflow project and cancels them. This would be fairly 
straightforward using the [Dataflow REST 
API|https://cloud.google.com/dataflow/docs/reference/rest/], and it could be 
scheduled on Jenkins.
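
As a rough illustration of the selection step, here is a minimal sketch in 
Python. It assumes the job list has already been fetched (for example via the 
REST API's projects.jobs.list method) as dictionaries with the {{id}}, 
{{currentState}} and {{createTime}} fields of the Dataflow Job resource. The 
function names and the 4-hour threshold are hypothetical, not an agreed policy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the real value would be a documented policy decision.
DEFAULT_MAX_AGE = timedelta(hours=4)


def parse_create_time(value):
    """Parse a timestamp such as '2018-11-15T18:23:45.123Z' as UTC.

    Assumes the API returns UTC timestamps with a trailing 'Z'; fractional
    seconds are dropped because they do not matter at this granularity.
    """
    base = value.rstrip("Z").split(".")[0]
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def find_long_running_jobs(jobs, now, max_age=DEFAULT_MAX_AGE):
    """Return the ids of running jobs that were created more than max_age ago."""
    stale_ids = []
    for job in jobs:
        # Only jobs that are still running are candidates for cancellation.
        if job.get("currentState") != "JOB_STATE_RUNNING":
            continue
        if now - parse_create_time(job["createTime"]) > max_age:
            stale_ids.append(job["id"])
    return stale_ids
```

Each returned id could then be cancelled by updating the job with a 
{{requestedState}} of {{JOB_STATE_CANCELLED}}. That call is omitted here so the 
sketch stays free of credentials and network access.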

If we build such a mechanism, we should also document the policy it imposes 
(i.e. the threshold for "long-running jobs"), and perhaps provide some 
mechanism for opting out. For example, performance benchmarking jobs might be 
long-running by design.
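
One possible shape for such a policy, sketched in Python: the age threshold and 
the opt-out rule live together in a single object, so the documented policy and 
the enforced policy cannot drift apart. The class name, the 4-hour default and 
the job-name-prefix opt-out convention are all assumptions made for 
illustration. Nothing here reflects an existing Beam convention.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ReaperPolicy:
    # Hypothetical threshold for what counts as a "long-running" job.
    max_age: timedelta = timedelta(hours=4)
    # Hypothetical opt-out: jobs whose names start with one of these prefixes
    # are never cancelled (e.g. performance benchmarks).
    exempt_name_prefixes: tuple = ("perf-",)

    def should_cancel(self, job_name, age):
        """Return True if a job with this name and age should be cancelled."""
        # str.startswith accepts a tuple and matches any of its entries.
        if job_name.startswith(self.exempt_name_prefixes):
            return False
        return age > self.max_age
```

An opt-out keyed on the job name is only one option. A job label or an explicit 
allow-list would work the same way behind {{should_cancel}}.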



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
