[
https://issues.apache.org/jira/browse/GOBBLIN-318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689308#comment-17689308
]
Matthew Ho commented on GOBBLIN-318:
------------------------------------
[https://github.com/apache/gobblin/pull/3580]
The issue is addressed by deleting the Helix workflow instead of stopping it
and waiting. For the Gobblin use case, delete works much better and has been
confirmed to be safer, because the job is dropped immediately and task runners
can no longer commit data.
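For reference, a minimal sketch of the difference on the Helix {{TaskDriver}}
(the actual call sites in the PR may differ, and the workflow name below is
just a placeholder):
{code:java}
import org.apache.helix.HelixManager;
import org.apache.helix.task.TaskDriver;

public class WorkflowCancellationSketch {
  // Sketch only: contrasts stop-and-wait with an outright delete.
  static void cancel(HelixManager helixManager, String workflowName) {
    TaskDriver taskDriver = new TaskDriver(helixManager);

    // Previous approach: ask Helix to stop the workflow and wait for the
    // STOPPED state. Task runners can keep finishing (and committing) work
    // while the stop propagates.
    //   taskDriver.stop(workflowName);
    //   taskDriver.waitToStop(workflowName, 10000L);

    // Approach used by the fix: delete the workflow outright. The job is
    // dropped immediately and task runners can no longer commit data for it.
    taskDriver.delete(workflowName);
  }
}
{code}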
> Gobblin Helix Jobs Hang Indefinitely
> -------------------------------------
>
> Key: GOBBLIN-318
> URL: https://issues.apache.org/jira/browse/GOBBLIN-318
> Project: Apache Gobblin
> Issue Type: Bug
> Reporter: Joel Baranick
> Priority: Critical
>
> In some cases, gobblin helix jobs can hang indefinitely. When coupled with
> job locks, this can result in a job becoming stuck and not progressing. The
> only solution currently is to restart the master node.
> Assume the following is for {{job_myjob_1510884004834}}, which hung at
> 2017-11-17 02:09:00 UTC and was still hung at 2017-11-17 09:12:00 UTC.
> {{GobblinHelixJobLauncher.waitForJobCompletion()}} never detects the job as
> completed. This results in the {{TaskStateCollectorService}} indefinitely
> searching for more task states, even though it has processed all the task
> states that are ever going to be produced. There is no reference to the hung
> job in Zookeeper at {{/mycluster/CONFIGS/RESOURCE}}. In the Helix Web Admin,
> the hung job doesn't exist at {{/clusters/mycluster/jobQueues/jobname}}.
> There is no record of the job in Zookeeper at
> {{/mycluster/PROPERTYSTORE/TaskRebalancer/jobname/Context}}. As a result, the
> {{GobblinHelixJobLauncher.waitForJobCompletion()}} loop below never returns.
> {code:java}
> private void waitForJobCompletion() throws InterruptedException {
>   while (true) {
>     // Read the workflow context that Helix stores in Zookeeper.
>     WorkflowContext workflowContext =
>         TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
>     if (workflowContext != null) {
>       org.apache.helix.task.TaskState helixJobState =
>           workflowContext.getJobState(this.jobResourceName);
>       // Only a terminal Helix state exits the loop; a missing job state never does.
>       if (helixJobState == org.apache.helix.task.TaskState.COMPLETED ||
>           helixJobState == org.apache.helix.task.TaskState.FAILED ||
>           helixJobState == org.apache.helix.task.TaskState.STOPPED) {
>         return;
>       }
>     }
>     Thread.sleep(1000);
>   }
> }
> {code}
> The code gets the job state from Zookeeper:
> {code:javascript}
> {
>   "id": "WorkflowContext",
>   "simpleFields": {
>     "START_TIME": "1505159715449",
>     "STATE": "IN_PROGRESS"
>   },
>   "listFields": {},
>   "mapFields": {
>     "JOB_STATES": {
>       "jobname_job_jobname_1507415700001": "COMPLETED",
>       "jobname_job_jobname_1507756800000": "COMPLETED",
>       "jobname_job_jobname_1507959300001": "COMPLETED",
>       "jobname_job_jobname_1509857102910": "COMPLETED",
>       "jobname_job_jobname_1510253708033": "COMPLETED",
>       "jobname_job_jobname_1510271102898": "COMPLETED",
>       "jobname_job_jobname_1510852210668": "COMPLETED",
>       "jobname_job_jobname_1510853133675": "COMPLETED"
>     }
>   }
> }
> {code}
> But the {{WorkflowContext}} contains no job state entry for the hung job.
> It is also strange that the job states in that JSON blob are so old; the
> oldest is from 2017-10-07 10:35:00 PM UTC, more than a month earlier.
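> To make that concrete, here is a sketch of the same terminal-state check
> extracted into a helper (not actual Gobblin code): because the hung job has
> no {{JOB_STATES}} entry, {{getJobState()}} returns null and none of the
> branches ever match.
> {code:java}
> // Sketch only: mirrors the completion check in waitForJobCompletion().
> private boolean isJobTerminal(org.apache.helix.task.WorkflowContext workflowContext) {
>   org.apache.helix.task.TaskState helixJobState =
>       workflowContext.getJobState(this.jobResourceName); // null for the hung job
>   // With a null state, every comparison below is false, so the caller keeps polling.
>   return helixJobState == org.apache.helix.task.TaskState.COMPLETED
>       || helixJobState == org.apache.helix.task.TaskState.FAILED
>       || helixJobState == org.apache.helix.task.TaskState.STOPPED;
> }
> {code}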
> I'm not sure how the system got in this state, but this isn't the first time
> we have seen this. While it would be good to prevent this from happening, it
> would also be good to allow the system to recover if this state is entered.
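> As a rough sketch of what such recovery could look like (the timeout
> parameter and exception handling here are hypothetical, not existing Gobblin
> behavior), the wait could be bounded so the launcher can clean up instead of
> hanging the master:
> {code:java}
> // Hypothetical variant of waitForJobCompletion() that gives up after a bound.
> private void waitForJobCompletion(long timeoutMs)
>     throws InterruptedException, java.util.concurrent.TimeoutException {
>   long deadline = System.currentTimeMillis() + timeoutMs;
>   while (System.currentTimeMillis() < deadline) {
>     WorkflowContext workflowContext =
>         TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
>     if (workflowContext != null) {
>       org.apache.helix.task.TaskState helixJobState =
>           workflowContext.getJobState(this.jobResourceName);
>       if (helixJobState == org.apache.helix.task.TaskState.COMPLETED ||
>           helixJobState == org.apache.helix.task.TaskState.FAILED ||
>           helixJobState == org.apache.helix.task.TaskState.STOPPED) {
>         return;
>       }
>     }
>     Thread.sleep(1000);
>   }
>   // Surfacing a timeout lets the caller clean up (e.g. release the job lock)
>   // rather than waiting forever for a job Helix no longer knows about.
>   throw new java.util.concurrent.TimeoutException(
>       "Helix job did not reach a terminal state within " + timeoutMs + " ms");
> }
> {code}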