[ https://issues.apache.org/jira/browse/GOBBLIN-318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286433#comment-16286433 ]

Joel Baranick edited comment on GOBBLIN-318 at 12/11/17 7:36 PM:
-----------------------------------------------------------------

[~abti] Job timeouts will help.  That said, the underlying issue still needs to 
be fixed.

A couple more pieces of info that might help figure out what is going on:
# We write the task state to EFS, so it isn't an S3 eventual consistency issue.
# The {{TaskStateCollectorService}} recognizes that the tasks are done. I know 
that because the counts from the "Collected task state of %d completed tasks" 
log messages sum to the job's total task count (see the sketch below).



was (Author: jbaranick):
[~abti] Job timeouts will help.  That said, the underlying issue of the 
{{TaskStateCollectorService}} missing task states should be resolved.

One other piece of info.  We write the task state to EFS, so it isn't an S3 
eventual consistency issue.

> Gobblin Helix Jobs Hang Indefinitely 
> -------------------------------------
>
>                 Key: GOBBLIN-318
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-318
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Joel Baranick
>            Priority: Critical
>
> In some cases, Gobblin Helix jobs can hang indefinitely.  When coupled with 
> job locks, this can result in a job becoming stuck and not progressing.  The 
> only solution currently is to restart the master node.
> Assume the following is for {{job_myjob_1510884004834}}, which hung at 
> 2017-11-17 02:09:00 UTC and was still hung at 2017-11-17 09:12:00 UTC. 
> {{GobblinHelixJobLauncher.waitForJobCompletion()}} never detects the job as 
> completed.  This results in the {{TaskStateCollectorService}} indefinitely 
> searching for more task states, even though it has processed all the task 
> states that are ever going to be produced.  There is no reference to the hung 
> job in Zookeeper at {{/mycluster/CONFIGS/RESOURCE}}.  In the Helix Web Admin, 
> the hung job doesn't exist at {{/clusters/mycluster/jobQueues/jobname}}.  
> There is no record of the job in Zookeeper at 
> {{/mycluster/PROPERTYSTORE/TaskRebalancer/jobname/Context}}.  This means the 
> {{GobblinHelixJobLauncher.waitForJobCompletion()}} loop below never exits:
> {code:java}
> private void waitForJobCompletion() throws InterruptedException {
>   while (true) {
>     WorkflowContext workflowContext =
>         TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
>     if (workflowContext != null) {
>       org.apache.helix.task.TaskState helixJobState =
>           workflowContext.getJobState(this.jobResourceName);
>       if (helixJobState == org.apache.helix.task.TaskState.COMPLETED ||
>           helixJobState == org.apache.helix.task.TaskState.FAILED ||
>           helixJobState == org.apache.helix.task.TaskState.STOPPED) {
>         return;
>       }
>     }
>     Thread.sleep(1000);
>   }
> }
> {code}
> The code gets the job state from Zookeeper:
> {code:javascript}
> {
>   "id": "WorkflowContext",
>   "simpleFields": {
>     "START_TIME": "1505159715449",
>     "STATE": "IN_PROGRESS"
>   },
>   "listFields": {},
>   "mapFields": {
>     "JOB_STATES": {
>       "jobname_job_jobname_1507415700001": "COMPLETED",
>       "jobname_job_jobname_1507756800000": "COMPLETED",
>       "jobname_job_jobname_1507959300001": "COMPLETED",
>       "jobname_job_jobname_1509857102910": "COMPLETED",
>       "jobname_job_jobname_1510253708033": "COMPLETED",
>       "jobname_job_jobname_1510271102898": "COMPLETED",
>       "jobname_job_jobname_1510852210668": "COMPLETED",
>       "jobname_job_jobname_1510853133675": "COMPLETED"
>     }
>   }
> }
> {code}
> But the {{JOB_STATES}} map contains no entry for the hung job.
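> Since {{JOB_STATES}} has no entry for the hung job, 
> {{WorkflowContext.getJobState()}} presumably returns {{null}}, which matches 
> none of the terminal states checked above, so the loop spins forever.  A 
> small diagnostic along these lines (the queue and job names here are 
> illustrative) would confirm that:
> {code:java}
> // Illustrative check: a job missing from JOB_STATES yields a null state,
> // which matches none of the terminal states in waitForJobCompletion().
> WorkflowContext workflowContext =
>     TaskDriver.getWorkflowContext(helixManager, "jobname");
> org.apache.helix.task.TaskState state =
>     workflowContext.getJobState("jobname_job_jobname_1510884004834");
> System.out.println("Helix job state: " + state);  // prints null for the hung job
> {code}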
> Also, it is really strange that the job states contained in that JSON blob 
> are so old.  The oldest one is from 2017-10-07 10:35:00 PM UTC, more than a 
> month before the hang.
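> (For reference, the numeric suffix of each job name is epoch millis; a quick 
> check:)
> {code:java}
> // Decode the epoch-millis suffix of the oldest JOB_STATES entry.
> import java.time.Instant;
> 
> public class EpochCheck {
>   public static void main(String[] args) {
>     // Prints 2017-10-07T22:35:00.001Z, over a month before the hang.
>     System.out.println(Instant.ofEpochMilli(1507415700001L));
>   }
> }
> {code}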
> I'm not sure how the system got into this state, but this isn't the first 
> time we have seen it.  While it would be good to prevent this from happening, 
> it would also be good to allow the system to recover if it does.
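> One possible recovery, sketched below under the assumption of a new 
> {{jobTimeoutMs}} setting (not an existing Gobblin config), would be to bound 
> the wait so the launcher fails the job instead of spinning forever:
> {code:java}
> import java.util.concurrent.TimeoutException;
> 
> // Sketch only: the same polling loop as waitForJobCompletion(), but with a
> // deadline so a job that never reaches a terminal Helix state (or never
> // appears in JOB_STATES at all) cannot hang the launcher indefinitely.
> private void waitForJobCompletion(long jobTimeoutMs)
>     throws InterruptedException, TimeoutException {
>   long deadline = System.currentTimeMillis() + jobTimeoutMs;
>   while (System.currentTimeMillis() < deadline) {
>     WorkflowContext workflowContext =
>         TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
>     if (workflowContext != null) {
>       org.apache.helix.task.TaskState helixJobState =
>           workflowContext.getJobState(this.jobResourceName);
>       if (helixJobState == org.apache.helix.task.TaskState.COMPLETED ||
>           helixJobState == org.apache.helix.task.TaskState.FAILED ||
>           helixJobState == org.apache.helix.task.TaskState.STOPPED) {
>         return;
>       }
>     }
>     Thread.sleep(1000);
>   }
>   throw new TimeoutException("Job " + this.jobResourceName
>       + " did not reach a terminal state within " + jobTimeoutMs + " ms");
> }
> {code}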


