[
https://issues.apache.org/jira/browse/GOBBLIN-318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286602#comment-16286602
]
Joel Baranick edited comment on GOBBLIN-318 at 12/11/17 9:34 PM:
-----------------------------------------------------------------
Another piece of info: all tasks are marked as completed in the Gobblin DB,
but when I look at
https://zookeeper/node?path=/ROOT/CLUSTER/PROPERTYSTORE/TaskRebalancer/JOB_NAME_job_JOB_NAME_1512924480001/Context
there are multiple tasks still marked as RUNNING:
{code:javascript}
{
"id":"TaskContext"
,"simpleFields":{
"START_TIME":"1512924491039"
}
,"listFields":{
}
,"mapFields":{
"0":{
"ASSIGNED_PARTICIPANT":"worker-1"
,"FINISH_TIME":"1512924700877"
,"INFO":"completed tasks: 1"
,"NUM_ATTEMPTS":"1"
,"START_TIME":"1512924491044"
,"STATE":"COMPLETED"
,"TASK_ID":"124a2e88-90e3-40e8-add6-94b59ee30133"
}
,"1":{
"ASSIGNED_PARTICIPANT":"worker-2"
,"FINISH_TIME":"1512924701120"
,"INFO":"completed tasks: 1"
,"NUM_ATTEMPTS":"1"
,"START_TIME":"1512924491044"
,"STATE":"COMPLETED"
,"TASK_ID":"9d7c2369-d6d9-4c2f-8bf3-1bcea0a47fdf"
}
,"2":{
"ASSIGNED_PARTICIPANT":"worker-3"
,"FINISH_TIME":"1512924695451"
,"INFO":"completed tasks: 1"
,"NUM_ATTEMPTS":"1"
,"START_TIME":"1512924491044"
,"STATE":"COMPLETED"
,"TASK_ID":"19545764-e2bf-48b6-9942-361c834790cf"
}
,"3":{
"ASSIGNED_PARTICIPANT":"worker-4"
,"FINISH_TIME":"1512924776614"
,"INFO":"completed tasks: 1"
,"NUM_ATTEMPTS":"1"
,"START_TIME":"1512924491044"
,"STATE":"COMPLETED"
,"TASK_ID":"3f59431f-2415-477a-8008-26a3eb258129"
}
,"4":{
"ASSIGNED_PARTICIPANT":"worker-5"
,"FINISH_TIME":"1512924731962"
,"INFO":"completed tasks: 1"
,"NUM_ATTEMPTS":"1"
,"START_TIME":"1512924491044"
,"STATE":"COMPLETED"
,"TASK_ID":"19863633-6ed3-49d4-a07f-2130eec15dd3"
}
,"5":{
"ASSIGNED_PARTICIPANT":"worker-6"
,"INFO":""
,"START_TIME":"1512924491044"
,"STATE":"RUNNING"
,"TASK_ID":"433c0107-0919-428a-b7c5-6e8925df7dac"
}
,"6":{
"ASSIGNED_PARTICIPANT":"worker-7"
,"INFO":""
,"START_TIME":"1512924491044"
,"STATE":"RUNNING"
,"TASK_ID":"89a63cfd-efb4-44ce-a08b-68678d792e25"
}
,"7":{
"ASSIGNED_PARTICIPANT":"worker-8"
,"FINISH_TIME":"1512924524111"
,"INFO":"completed tasks: 1"
,"NUM_ATTEMPTS":"1"
,"START_TIME":"1512924491044"
,"STATE":"COMPLETED"
,"TASK_ID":"a133db13-3f28-49af-8e3d-1d6fa81f6247"
}
,"8":{
"ASSIGNED_PARTICIPANT":"worker-9"
,"INFO":""
,"START_TIME":"1512924491044"
,"STATE":"RUNNING"
,"TASK_ID":"7bbda2ef-68da-4f11-b217-89c3cd7d7a2e"
}
,"9":{
"ASSIGNED_PARTICIPANT":"worker-10"
,"INFO":""
,"START_TIME":"1512924491044"
,"STATE":"RUNNING"
,"TASK_ID":"8407cb27-4b26-4786-91f2-ad920b1e2343"
}
}
}
{code}
Compare this to a job that worked and you will see all tasks marked as
COMPLETED.
> Gobblin Helix Jobs Hang Indefinitely
> -------------------------------------
>
> Key: GOBBLIN-318
> URL: https://issues.apache.org/jira/browse/GOBBLIN-318
> Project: Apache Gobblin
> Issue Type: Bug
> Reporter: Joel Baranick
> Priority: Critical
>
> In some cases, Gobblin Helix jobs can hang indefinitely. When coupled with
> job locks, this can result in a job becoming stuck and not progressing. The
> only solution currently is to restart the master node.
> Assume the following is for {{job_myjob_1510884004834}}, which hung at
> 2017-11-17 02:09:00 UTC and was still hung at 2017-11-17 09:12:00 UTC.
> {{GobblinHelixJobLauncher.waitForJobCompletion()}} never detects the job as
> completed. This results in the {{TaskStateCollectorService}} searching
> indefinitely for more task states, even though it has processed all the task
> states that will ever be produced. There is no reference to the hung job in
> Zookeeper at {{/mycluster/CONFIGS/RESOURCE}}. In the Helix Web Admin, the
> hung job doesn't exist at {{/clusters/mycluster/jobQueues/jobname}}. There is
> no record of the job in Zookeeper at
> {{/mycluster/PROPERTYSTORE/TaskRebalancer/jobname/Context}}. This means the
> {{GobblinHelixJobLauncher.waitForJobCompletion()}} loop below never exits:
> {code:java}
> private void waitForJobCompletion() throws InterruptedException {
>   while (true) {
>     WorkflowContext workflowContext =
>         TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
>     if (workflowContext != null) {
>       org.apache.helix.task.TaskState helixJobState =
>           workflowContext.getJobState(this.jobResourceName);
>       if (helixJobState == org.apache.helix.task.TaskState.COMPLETED ||
>           helixJobState == org.apache.helix.task.TaskState.FAILED ||
>           helixJobState == org.apache.helix.task.TaskState.STOPPED) {
>         return;
>       }
>     }
>     Thread.sleep(1000);
>   }
> }
> {code}
> The code gets the job state from Zookeeper:
> {code:javascript}
> {
> "id": "WorkflowContext",
> "simpleFields": {
> "START_TIME": "1505159715449",
> "STATE": "IN_PROGRESS"
> },
> "listFields": {},
> "mapFields": {
> "JOB_STATES": {
> "jobname_job_jobname_1507415700001": "COMPLETED",
> "jobname_job_jobname_1507756800000": "COMPLETED",
> "jobname_job_jobname_1507959300001": "COMPLETED",
> "jobname_job_jobname_1509857102910": "COMPLETED",
> "jobname_job_jobname_1510253708033": "COMPLETED",
> "jobname_job_jobname_1510271102898": "COMPLETED",
> "jobname_job_jobname_1510852210668": "COMPLETED",
> "jobname_job_jobname_1510853133675": "COMPLETED"
> }
> }
> }
> {code}
> But the {{JOB_STATES}} map contains no entry for the hung job, so
> {{workflowContext.getJobState(this.jobResourceName)}} returns {{null}} and
> the loop above never sees a terminal state. Also, it is really strange that
> the job states contained in that JSON blob are so old. The oldest one is
> from 2017-10-07 10:35:00 PM UTC, more than a month ago.
> I'm not sure how the system got into this state, but this isn't the first
> time we have seen this. While it would be good to prevent this from
> happening, it would also be good to allow the system to recover if this
> state is entered.