[
https://issues.apache.org/jira/browse/GOBBLIN-1721?focusedWorklogId=817817&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-817817
]
ASF GitHub Bot logged work on GOBBLIN-1721:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 17/Oct/22 20:46
Start Date: 17/Oct/22 20:46
Worklog Time Spent: 10m
Work Description: homatthew commented on code in PR #3580:
URL: https://github.com/apache/gobblin/pull/3580#discussion_r997494999
##########
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobScheduler.java:
##########
@@ -431,8 +431,15 @@ private void cancelJobIfRequired(DeleteJobConfigArrivalEvent deleteJobArrival) t
     if (jobNameToWorkflowIdMap.containsKey(deleteJobArrival.getJobName())) {
       String workflowId = jobNameToWorkflowIdMap.get(deleteJobArrival.getJobName());
       TaskDriver taskDriver = new TaskDriver(this.jobHelixManager);
-      taskDriver.waitToStop(workflowId, this.helixJobStopTimeoutMillis);
-      LOGGER.info("Stopped workflow: {}", deleteJobArrival.getJobName());
+      // Cancel the job by calling either Delete or Stop Helix API
+      if (PropertiesUtils.getPropAsBoolean(jobConfig, GobblinClusterConfigurationKeys.CANCEL_HELIX_JOB_BY_DELETE,
+          GobblinClusterConfigurationKeys.DEFAULT_CANCEL_HELIX_JOB_BY_DELETE)) {
+        taskDriver.delete(workflowId);
+        LOGGER.info("Canceling Helix workflow: {} through delete API", deleteJobArrival.getJobName());
+      } else {
Review Comment:
A few open questions for us to discuss:
1. We use waitToStop in other places. Should we consider using the
delete API as a replacement for all of those calls? We use the taskrunner code basically
everywhere, and that is the root cause of the long stopping times.
2. We should plan on cleaning up this config if we see no issues after
further testing. Can we create a JIRA for the cleanup and add a comment referencing it?
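To make question 1 concrete, here is a minimal sketch of how the delete-or-stop decision could be pulled into one helper that every call site shares. It is an illustration, not the PR's code: `WorkflowDriver` is a hypothetical stand-in for the two Helix `TaskDriver` calls in the diff (`waitToStop` and `delete`), and the key string and the `false` default are assumptions, since the real values live in `GobblinClusterConfigurationKeys` and are not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class CancelSketch {
  // Hypothetical stand-in for the two Helix TaskDriver calls used in the diff.
  interface WorkflowDriver {
    void waitToStop(String workflowId, long timeoutMillis) throws InterruptedException;
    void delete(String workflowId);
  }

  // Assumed key name and default; the real constants are defined in
  // GobblinClusterConfigurationKeys and may differ.
  static final String CANCEL_BY_DELETE_KEY = "example.cancel.helix.job.by.delete";
  static final boolean DEFAULT_CANCEL_BY_DELETE = false;

  // Cancels a workflow through delete or stop, depending on the job config.
  // Returns the name of the path taken so that callers can log it.
  static String cancel(WorkflowDriver driver, Properties jobConfig,
      String workflowId, long stopTimeoutMillis) {
    boolean byDelete = Boolean.parseBoolean(jobConfig.getProperty(
        CANCEL_BY_DELETE_KEY, Boolean.toString(DEFAULT_CANCEL_BY_DELETE)));
    if (byDelete) {
      driver.delete(workflowId);
      return "delete";
    }
    try {
      driver.waitToStop(workflowId, stopTimeoutMillis);
    } catch (InterruptedException e) {
      // Restore the interrupt flag and surface the failure to the caller.
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while stopping " + workflowId, e);
    }
    return "stop";
  }

  public static void main(String[] args) {
    List<String> calls = new ArrayList<>();
    // Recording driver so the demo runs without a Helix cluster.
    WorkflowDriver recording = new WorkflowDriver() {
      @Override
      public void waitToStop(String workflowId, long timeoutMillis) {
        calls.add("waitToStop:" + workflowId);
      }

      @Override
      public void delete(String workflowId) {
        calls.add("delete:" + workflowId);
      }
    };

    Properties jobConfig = new Properties();
    cancel(recording, jobConfig, "workflowA", 1000L); // key unset: stop path
    jobConfig.setProperty(CANCEL_BY_DELETE_KEY, "true");
    cancel(recording, jobConfig, "workflowB", 1000L); // key set: delete path
    System.out.println(calls);
  }
}
```

If every cancel site went through a helper like this, removing the config later (question 2) would be a one-place change.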
Issue Time Tracking
-------------------
Worklog Id: (was: 817817)
Time Spent: 20m (was: 10m)
> Give option to cancel helix workflow through Delete API to avoid job hanging
> ----------------------------------------------------------------------------
>
> Key: GOBBLIN-1721
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1721
> Project: Apache Gobblin
> Issue Type: Bug
> Components: gobblin-cluster
> Reporter: Hanghang Liu
> Assignee: Hung Tran
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Currently, when we receive a job restart (handleUpdateJobConfigArrival),
> GobblinHelixJobLauncher first calls helixTaskDriver.waitToStop to stop
> the workflow, then initiates the new one. We have observed Helix
> taking exceptionally long to stop the workflow, leaving the job state stuck
> in STOPPING status. This makes waitToStop time out and throw an exception
> every time, so the new flow can never launch.
> Since our job is stateless for Helix, we can use the Delete API in this case
> to avoid the job hanging.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)