[ https://issues.apache.org/jira/browse/GOBBLIN-318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271461#comment-16271461 ]
Abhishek Tiwari commented on GOBBLIN-318:
-----------------------------------------

[~jbaranick]: [~arjun4084346] is working on adding a timeout to the jobs, so that should help with recoverability (a rough sketch of such a timeout guard is appended below). [~arjun4084346], please check that the Zookeeper state is also being cleaned up when the stuck job is killed.
[~jbaranick] It's interesting that ZK is not cleaned up for phantom jobs; we should investigate that.

> Gobblin Helix Jobs Hang Indefinitely
> -------------------------------------
>
>                 Key: GOBBLIN-318
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-318
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Joel Baranick
>            Priority: Critical
>
> In some cases, Gobblin Helix jobs can hang indefinitely. When coupled with
> job locks, this can result in a job becoming stuck and not progressing. The
> only solution currently is to restart the master node.
> Assume the following is for {{job_myjob_1510884004834}}, which hung at
> 2017-11-17 02:09:00 UTC and was still hung at 2017-11-17 09:12:00 UTC.
> {{GobblinHelixJobLauncher.waitForJobCompletion()}} never detects the job
> as completed. This results in the {{TaskStateCollectorService}} indefinitely
> searching for more task states, even though it has processed all the task
> states that are ever going to be produced. There is no reference to the hung
> job in Zookeeper at {{/mycluster/CONFIGS/RESOURCE}}. In the Helix Web Admin,
> the hung job doesn't exist at {{/clusters/mycluster/jobQueues/jobname}}.
> There is no record of the job in Zookeeper at
> {{/mycluster/PROPERTYSTORE/TaskRebalancer/jobname/Context}}. This means that
> the {{GobblinHelixJobLauncher.waitForJobCompletion()}} loop below never exits.
> {code:java}
> private void waitForJobCompletion() throws InterruptedException {
>   while (true) {
>     WorkflowContext workflowContext =
>         TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
>     if (workflowContext != null) {
>       org.apache.helix.task.TaskState helixJobState =
>           workflowContext.getJobState(this.jobResourceName);
>       if (helixJobState == org.apache.helix.task.TaskState.COMPLETED ||
>           helixJobState == org.apache.helix.task.TaskState.FAILED ||
>           helixJobState == org.apache.helix.task.TaskState.STOPPED) {
>         return;
>       }
>     }
>     Thread.sleep(1000);
>   }
> }
> {code}
> The code gets the job state from Zookeeper:
> {code:javascript}
> {
>   "id": "WorkflowContext",
>   "simpleFields": {
>     "START_TIME": "1505159715449",
>     "STATE": "IN_PROGRESS"
>   },
>   "listFields": {},
>   "mapFields": {
>     "JOB_STATES": {
>       "jobname_job_jobname_1507415700001": "COMPLETED",
>       "jobname_job_jobname_1507756800000": "COMPLETED",
>       "jobname_job_jobname_1507959300001": "COMPLETED",
>       "jobname_job_jobname_1509857102910": "COMPLETED",
>       "jobname_job_jobname_1510253708033": "COMPLETED",
>       "jobname_job_jobname_1510271102898": "COMPLETED",
>       "jobname_job_jobname_1510852210668": "COMPLETED",
>       "jobname_job_jobname_1510853133675": "COMPLETED"
>     }
>   }
> }
> {code}
> But the {{JOB_STATES}} map contains no entry for the hung job. Also, it is
> really strange that the job states contained in that JSON blob are so old;
> the oldest one is from 2017-10-07 10:35:00 PM UTC, more than a month ago.
> I'm not sure how the system got into this state, but this isn't the first
> time we have seen this. While it would be good to prevent this from
> happening, it would also be good to allow the system to recover if this
> state is entered.
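
A minimal sketch of the timeout guard discussed above, written against the {{waitForJobCompletion()}} loop quoted in the description. This is not the actual patch [~arjun4084346] is working on; the {{timeoutInMillis}} parameter, the use of {{java.util.concurrent.TimeoutException}}, and the one-second poll interval are illustrative assumptions.

{code:java}
// Sketch only: bound the wait so a phantom job cannot block the launcher
// (and hold the job lock) forever.
private void waitForJobCompletion(long timeoutInMillis)
    throws InterruptedException, TimeoutException {
  long deadline = System.currentTimeMillis() + timeoutInMillis;
  while (System.currentTimeMillis() < deadline) {
    WorkflowContext workflowContext =
        TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
    if (workflowContext != null) {
      org.apache.helix.task.TaskState helixJobState =
          workflowContext.getJobState(this.jobResourceName);
      if (helixJobState == org.apache.helix.task.TaskState.COMPLETED ||
          helixJobState == org.apache.helix.task.TaskState.FAILED ||
          helixJobState == org.apache.helix.task.TaskState.STOPPED) {
        return;
      }
    }
    Thread.sleep(1000);
  }
  // Surfacing a timeout lets the caller release the job lock and trigger
  // cleanup instead of spinning indefinitely on a job Helix no longer tracks.
  throw new TimeoutException("Helix job " + this.jobResourceName
      + " did not reach a terminal state within " + timeoutInMillis + " ms");
}
{code}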
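
If the stuck job is killed on timeout, its Helix/Zookeeper state should be removed as well, otherwise the phantom entries described above remain. Below is a rough sketch of such a cleanup using the Helix {{TaskDriver}} API; the helper name, the stop/resume around the delete, and the un-namespaced {{jobName}} argument (the part of {{this.jobResourceName}} after the queue prefix) are assumptions, and whether the queue must be stopped before a job can be deleted depends on the Helix version in use.

{code:java}
// Sketch only: drop the phantom job from the Helix job queue so its state
// does not linger in Zookeeper.
private void deletePhantomJob(String jobName) {
  TaskDriver taskDriver = new TaskDriver(this.helixManager);
  taskDriver.stop(this.helixQueueName);                // pause the queue (assumed to be required before delete)
  taskDriver.deleteJob(this.helixQueueName, jobName);  // remove the stuck job from the queue
  taskDriver.resume(this.helixQueueName);              // let the remaining jobs continue
}
{code}

Combined with the timeout above, this could give the master a recovery path that does not require a restart, which is what the reporter asks for.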