[
https://issues.apache.org/jira/browse/GOBBLIN-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601542#comment-17601542
]
Hanghang Liu commented on GOBBLIN-1692:
---------------------------------------
[https://github.com/apache/gobblin/pull/3546]
To summarize, the PR is trying to address the following:
when an update job event is received, the GobblinHelixJobScheduler tries to stop
the old job and then launch the new one. To stop the old job, we used to make a
synchronous waitToStop call through Helix.
[HelixUtils.waitJobCompletion|https://github.com/apache/gobblin/blob/8c9c8a84ed23c0215c4d80125ac532e97085d76f/gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java#L278]
then detects that the job state has changed to stopping and immediately deletes
the job, which causes waitToStop to always throw an exception. Changing
waitToStop to an asynchronous call avoids the exception, and we will know the
job has completed by checking the jobRunningMap, which should be updated in the
JobLauncher.
The incorrect deletion timing in
[HelixUtils.waitJobCompletion|https://github.com/apache/gobblin/blob/8c9c8a84ed23c0215c4d80125ac532e97085d76f/gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java#L278]
will be addressed in a separate PR.
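A rough sketch of the idea, not the actual change in the PR: the stop request is handed off without blocking, and a new launch is refused until the running map shows that the old job is done. Everything here is hypothetical and for illustration only: the class AsyncStopSketch, the methods markRunning, stopAsync and tryLaunch, and the Runnable that stands in for the Helix stop call. Only the name jobRunningMap and the log text come from this issue. In the sketch the stop callback updates the map, whereas in Gobblin that update is expected to happen in the JobLauncher.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the scheduler; this is not the Gobblin API. */
final class AsyncStopSketch {
  // Illustrative analogue of jobRunningMap: job name -> is an instance still running?
  private final Map<String, Boolean> jobRunningMap = new ConcurrentHashMap<>();

  /** Marks a job as running, as a launcher would when the job starts. */
  void markRunning(String jobName) {
    jobRunningMap.put(jobName, true);
  }

  /**
   * Requests a stop without blocking the caller. stopAction stands in for the
   * Helix stop call; the map is updated only after it finishes successfully.
   */
  CompletableFuture<Void> stopAsync(String jobName, Runnable stopAction) {
    return CompletableFuture.runAsync(stopAction)
        .thenRun(() -> jobRunningMap.put(jobName, false));
  }

  /** Launches only if the map shows no running instance of the job. */
  boolean tryLaunch(String jobName) {
    boolean[] launched = {false};
    // compute() makes the check and the update a single atomic step per key.
    jobRunningMap.compute(jobName, (name, running) -> {
      if (!Boolean.TRUE.equals(running)) {
        launched[0] = true;
      }
      return true; // either still running, or running again after this launch
    });
    if (!launched[0]) {
      System.out.println("Job " + jobName
          + " will not be executed because other jobs are still running");
    }
    return launched[0];
  }

  public static void main(String[] args) throws Exception {
    AsyncStopSketch scheduler = new AsyncStopSketch();
    scheduler.markRunning("jobA");

    // The latch simulates a stop that takes a while to finish.
    CountDownLatch stopFinished = new CountDownLatch(1);
    CompletableFuture<Void> stop = scheduler.stopAsync("jobA", () -> {
      try {
        stopFinished.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    // The stop is still pending, so the launch is refused.
    System.out.println("launched: " + scheduler.tryLaunch("jobA"));

    stopFinished.countDown();
    stop.get(5, TimeUnit.SECONDS);

    // The map now shows the old job as stopped, so the launch goes through.
    System.out.println("launched: " + scheduler.tryLaunch("jobA"));
  }
}
```

The caller of stopAsync never waits on the stop itself, so a slow stop cannot hit a timeout there. The trade-off is that the launch path has to tolerate the map being stale for a while, which is why the launch is refused rather than forced.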
> Make GobblinHelixJobScheduler stop Helix workflow asynchronously
> ----------------------------------------------------------------
>
> Key: GOBBLIN-1692
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1692
> Project: Apache Gobblin
> Issue Type: Improvement
> Components: gobblin-cluster
> Reporter: Hanghang Liu
> Assignee: Hung Tran
> Priority: Major
>
> When a new job config gets posted and handleUpdateJobConfigArrival is called,
> GobblinHelixJobScheduler will first stop and delete the old job, and then try
> to spin up the updated Helix workflow.
> The job scheduler will try to do the stop synchronously with a default
> 10-second timeout. However, this stop consistently runs longer than the
> timeout in Helix, so the job state is not correctly updated to stopped. Thus,
> when we construct the GobblinHelixJobLauncher, the previous job is still in
> the wrong state because jobRunningMap has not been updated yet, and the new
> job will not be launched. So we always see this log: {{Job {} will not be
> executed because other jobs are still running}}.
> We can make the job deletion asynchronous and let the waitForJobCompletion
> method ensure that the job status eventually gets updated correctly.