[jira] [Work logged] (GOBBLIN-1721) Give option to cancel helix workflow through Delete API to avoid job hanging

ASF GitHub Bot (Jira) Mon, 17 Oct 2022 13:47:10 -0700


     [ 
https://issues.apache.org/jira/browse/GOBBLIN-1721?focusedWorklogId=817817&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-817817
 ]


ASF GitHub Bot logged work on GOBBLIN-1721:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 17/Oct/22 20:46
            Start Date: 17/Oct/22 20:46
    Worklog Time Spent: 10m 
      Work Description: homatthew commented on code in PR #3580:
URL: https://github.com/apache/gobblin/pull/3580#discussion_r997494999


##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java:
##########
@@ -431,8 +431,15 @@ private void cancelJobIfRequired(DeleteJobConfigArrivalEvent deleteJobArrival) t
       if (jobNameToWorkflowIdMap.containsKey(deleteJobArrival.getJobName())) {
         String workflowId = jobNameToWorkflowIdMap.get(deleteJobArrival.getJobName());
         TaskDriver taskDriver = new TaskDriver(this.jobHelixManager);
-        taskDriver.waitToStop(workflowId, this.helixJobStopTimeoutMillis);
-        LOGGER.info("Stopped workflow: {}", deleteJobArrival.getJobName());
+        // Cancel the job by calling either Delete or Stop Helix API
+        if (PropertiesUtils.getPropAsBoolean(jobConfig, GobblinClusterConfigurationKeys.CANCEL_HELIX_JOB_BY_DELETE,
+            GobblinClusterConfigurationKeys.DEFAULT_CANCEL_HELIX_JOB_BY_DELETE)) {
+          taskDriver.delete(workflowId);
+          LOGGER.info("Canceling Helix workflow: {} through delete API", deleteJobArrival.getJobName());
+        } else {

Review Comment:
   A few open questions for us to discuss:
   1. We use waitToStop in other places as well. Should we consider replacing 
all of those calls with the delete API? We use the task runner code basically 
everywhere, and that is the root cause of the long stopping times.
   2. We should plan on cleaning up this config if we see no issues after 
further testing. Can we create a JIRA for this and add a comment for it?





Issue Time Tracking
-------------------

    Worklog Id:     (was: 817817)
    Time Spent: 20m  (was: 10m)

> Give option to cancel helix workflow through Delete API to avoid job hanging
> ----------------------------------------------------------------------------
>
>                 Key: GOBBLIN-1721
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1721
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: gobblin-cluster
>            Reporter: Hanghang Liu
>            Assignee: Hung Tran
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, when we receive a job restart (handleUpdateJobConfigArrival), 
> GobblinHelixJobLauncher first calls helixTaskDriver.waitToStop to stop 
> the workflow and then starts the new one. We have observed Helix taking 
> exceptionally long to stop the workflow, leaving the job stuck in the 
> STOPPING state. As a result, waitToStop times out and throws an exception 
> every time, so the new workflow can never launch.
> Because our job is stateless for Helix, we can use the Delete API in this 
> case to avoid the job hanging.
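
The choice described above, between waiting for the workflow to stop and
deleting it outright, can be sketched roughly as follows. This is a minimal
illustration under stated assumptions, not the code from the PR: the
WorkflowDriver interface is a hypothetical stand-in for the two Helix
TaskDriver calls shown in the diff, the property key string is made up (the
real key is GobblinClusterConfigurationKeys.CANCEL_HELIX_JOB_BY_DELETE, whose
value is not shown here), and the default of false is assumed.

```java
import java.util.Properties;

public class CancelWorkflowSketch {

  /** Hypothetical stand-in for the two Helix TaskDriver calls used in the diff. */
  interface WorkflowDriver {
    void waitToStop(String workflowId, long timeoutMillis);
    void delete(String workflowId);
  }

  // Hypothetical key name, for illustration only; the real key and its
  // default live in GobblinClusterConfigurationKeys and are not shown here.
  static final String CANCEL_BY_DELETE_KEY = "example.cancel.helix.job.by.delete";

  /**
   * Cancels a workflow either by deleting it or by waiting for it to stop,
   * depending on a boolean job property. Returns the path taken.
   */
  static String cancelWorkflow(WorkflowDriver driver, Properties jobConfig,
      String workflowId, long stopTimeoutMillis) {
    // Assumed default: keep the existing stop behavior unless opted in.
    boolean cancelByDelete =
        Boolean.parseBoolean(jobConfig.getProperty(CANCEL_BY_DELETE_KEY, "false"));
    if (cancelByDelete) {
      // Delete does not wait for the workflow to leave the STOPPING state.
      driver.delete(workflowId);
      return "delete";
    }
    // Stop path: may block up to stopTimeoutMillis.
    driver.waitToStop(workflowId, stopTimeoutMillis);
    return "stop";
  }
}
```

Keeping the flag per job, as the diff does by reading it from jobConfig, would
let the delete path be enabled for a few jobs first and removed later if no
issues show up.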



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
