Kuai Yu created GOBBLIN-661:
-------------------------------

             Summary: Prevent jobs resubmission after manager failure
                 Key: GOBBLIN-661
                 URL: https://issues.apache.org/jira/browse/GOBBLIN-661
             Project: Apache Gobblin
          Issue Type: Improvement
            Reporter: Kuai Yu
            Assignee: Kuai Yu


In a Gobblin cluster, if the manager fails and is relaunched, all the jobs persisted in 
the job catalog are resubmitted. This causes two issues:

1) Scalability: the unfinished jobs were originally submitted at 
different points in time. If all of them are resubmitted at the same time, it 
can cause a performance problem.

2) Wasted effort: because each unfinished job now has to be deleted, we have to 
kill the existing running job and resubmit it.

 

In this change, we address both 1) and 2):

1) In taskdriver mode, we delete the job spec once the job is submitted to Helix. 
We assume Helix is durable and that submitted jobs won't be lost, 
so the job spec can be deleted safely. After the next restart, the manager won't see the 
deleted job specs, so nothing is resubmitted.

2) In taskdriver mode, we clean up the jobs still running in Helix. A planning 
job is not deleted; instead we let it run to the end.
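
The two rules above can be sketched with a toy model. This is only an illustration under stated assumptions, not Gobblin or Helix code: the job catalog and Helix are stood in for by plain dictionaries, the function names `submit_all` and `cleanup_on_startup` are hypothetical, and running the cleanup when the relaunched manager starts is an assumption about timing that the description does not spell out.

```python
# Toy model only: `catalog` and `helix` are plain dicts mapping a job name to
# a flag that says whether the job is a planning job. None of these names
# come from the Gobblin or Helix APIs.

def submit_all(catalog, helix, task_driver_mode):
    """Submit every spec in the catalog to Helix; return the submitted names."""
    submitted = []
    for name in sorted(catalog):  # sorted() copies the keys, so deleting is safe
        helix[name] = catalog[name]
        submitted.append(name)
        if task_driver_mode:
            # Rule 1: Helix is assumed durable, so the spec is no longer needed.
            del catalog[name]
    return submitted


def cleanup_on_startup(helix):
    """Delete leftover running jobs, but let planning jobs run to the end."""
    removed = [name for name, is_planning in sorted(helix.items())
               if not is_planning]
    for name in removed:
        # Rule 2: only non-planning jobs are cleaned up.
        del helix[name]
    return removed
```

In this model, a second call to `submit_all` after a simulated restart finds an empty catalog in taskdriver mode and submits nothing, while `cleanup_on_startup` leaves any planning job in place.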



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
