Joel Baranick created GOBBLIN-318:
-------------------------------------

             Summary: Gobblin Helix Jobs Hang Indefinitely 
                 Key: GOBBLIN-318
                 URL: https://issues.apache.org/jira/browse/GOBBLIN-318
             Project: Apache Gobblin
          Issue Type: Bug
            Reporter: Joel Baranick
            Priority: Critical


In some cases, Gobblin Helix jobs can hang indefinitely.  When coupled with job 
locks, this can leave a job stuck and unable to progress.  The only 
remedy currently is to restart the master node.

Assume the following is for {{job_myjob_1510884004834}}, which hung at 
2017-11-17 02:09:00 UTC and was still hung at 2017-11-17 09:12:00 UTC. 
{{GobblinHelixJobLauncher.waitForJobCompletion()}} never detects the job 
as completed. This causes the {{TaskStateCollectorService}} to search 
indefinitely for more task states, even though it has already processed every 
task state that will ever be produced.  There is no reference to the hung 
job in Zookeeper at {{/mycluster/CONFIGS/RESOURCE}}.  In the Helix Web Admin, 
the hung job doesn't exist at {{/clusters/mycluster/jobQueues/jobname}}. There 
is no record of the job in Zookeeper at 
{{/mycluster/PROPERTYSTORE/TaskRebalancer/jobname/Context}}.  As a result, 
the {{GobblinHelixJobLauncher.waitForJobCompletion()}} loop never sees a 
terminal state and spins forever.
{code:java}
private void waitForJobCompletion() throws InterruptedException {
    while (true) {
      WorkflowContext workflowContext = TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
      if (workflowContext != null) {
        org.apache.helix.task.TaskState helixJobState = workflowContext.getJobState(this.jobResourceName);
        if (helixJobState == org.apache.helix.task.TaskState.COMPLETED ||
            helixJobState == org.apache.helix.task.TaskState.FAILED ||
            helixJobState == org.apache.helix.task.TaskState.STOPPED) {
          return;
        }
      }

      Thread.sleep(1000);
    }
  }
{code}

The code gets the job state from Zookeeper:
{code:javascript}
{
  "id": "WorkflowContext",
  "simpleFields": {
    "START_TIME": "1505159715449",
    "STATE": "IN_PROGRESS"
  },
  "listFields": {},
  "mapFields": {
    "JOB_STATES": {
      "jobname_job_jobname_1507415700001": "COMPLETED",
      "jobname_job_jobname_1507756800000": "COMPLETED",
      "jobname_job_jobname_1507959300001": "COMPLETED",
      "jobname_job_jobname_1509857102910": "COMPLETED",
      "jobname_job_jobname_1510253708033": "COMPLETED",
      "jobname_job_jobname_1510271102898": "COMPLETED",
      "jobname_job_jobname_1510852210668": "COMPLETED",
      "jobname_job_jobname_1510853133675": "COMPLETED"
    }
  }
}
{code}

But the {{JOB_STATES}} map contains no entry at all for the hung job.

It is also strange that the job states contained in that JSON blob are 
so old.  The oldest one is from 2017-10-07 22:35:00 UTC, more than a month 
before the hang.
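
For reference, the age of each entry can be confirmed by decoding the 
epoch-millisecond suffix in the job name.  A standalone sketch (the class and 
method names here are illustrative, not part of Gobblin):
{code:java}
import java.time.Instant;

public class JobNameAge {
  // Extracts the trailing epoch-millisecond suffix from a Helix job name
  // such as "jobname_job_jobname_1507415700001" and converts it to a UTC instant.
  static Instant launchTime(String jobName) {
    String suffix = jobName.substring(jobName.lastIndexOf('_') + 1);
    return Instant.ofEpochMilli(Long.parseLong(suffix));
  }

  public static void main(String[] args) {
    // The oldest entry in the JOB_STATES map above:
    System.out.println(launchTime("jobname_job_jobname_1507415700001"));
    // → 2017-10-07T22:35:00.001Z
  }
}
{code}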

I'm not sure how the system got into this state, but this isn't the first time 
we have seen it.  While it would be good to prevent this from happening, the 
system should also be able to recover if this state is ever entered.
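
One possible recovery path would be to bound the wait loop with a deadline 
instead of spinning forever, so the launcher can fail the job and release the 
job lock.  A minimal, self-contained sketch; the class name, signature, and 
timeout values are illustrative, not existing Gobblin or Helix API:
{code:java}
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class BoundedWait {
  // Polls the supplied completion check once per poll interval until it
  // returns true, or throws once the deadline passes.  Sketches how
  // waitForJobCompletion() could give up on a job whose state never appears.
  static void waitForCompletion(Supplier<Boolean> isComplete,
                                long timeoutMillis,
                                long pollMillis)
      throws InterruptedException, TimeoutException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (!isComplete.get()) {
      if (System.currentTimeMillis() >= deadline) {
        throw new TimeoutException(
            "Job did not complete within " + timeoutMillis + " ms");
      }
      Thread.sleep(pollMillis);
    }
  }
}
{code}
On timeout, the launcher could mark the job failed and clean up, rather than 
holding the job lock until the master node is restarted.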


