[
https://issues.apache.org/jira/browse/HIVE-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796697#comment-15796697
]
Siddharth Seth commented on HIVE-15255:
---------------------------------------
[~sershe] - these are multiple attempts of the same task being rescheduled. The
delay can be controlled via
"hive.llap.task.scheduler.node.reenable.min.timeout.ms", if I'm not mistaken.
See NodeBlacklistConf in LlapTaskScheduler.
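For illustration only, here is a minimal sketch of the kind of capped backoff that a minimum re-enable timeout suggests: each consecutive SERVICE_BUSY rejection from a node lengthens the wait before that node is tried again. The class name, the method, the maximum timeout, and the growth factor are all hypothetical and are not taken from LlapTaskScheduler; only the property name above comes from this thread.

```java
// Hypothetical sketch, not the LlapTaskScheduler implementation.
final class NodeReenableBackoffSketch {
    private NodeReenableBackoffSketch() {}

    /**
     * Returns how long to wait before retrying a node that rejected a task.
     *
     * @param minTimeoutMs delay after the first rejection (assumed to play the
     *                     role of the "reenable.min.timeout.ms" setting)
     * @param maxTimeoutMs upper bound on the delay (assumed parameter)
     * @param factor       growth factor per consecutive rejection (assumed parameter)
     * @param failures     consecutive rejections so far; 0 means no delay
     */
    static long delayMs(long minTimeoutMs, long maxTimeoutMs, double factor, int failures) {
        if (failures <= 0) {
            return 0L;
        }
        // Grow geometrically from the minimum, then clamp to the maximum.
        double delay = minTimeoutMs * Math.pow(factor, failures - 1);
        return (long) Math.min(delay, (double) maxTimeoutMs);
    }

    public static void main(String[] args) {
        // Example values only; they are not Hive defaults.
        for (int failures = 1; failures <= 8; failures++) {
            System.out.println(failures + " rejection(s): wait "
                + delayMs(200L, 10_000L, 2.0, failures) + " ms");
        }
    }
}
```

With a scheme like this, the attempts in the log below would be spaced progressively further apart instead of landing a few milliseconds after each other.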
> LLAP: service_busy error should not be retried so fast
> ------------------------------------------------------
>
> Key: HIVE-15255
> URL: https://issues.apache.org/jira/browse/HIVE-15255
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
>
> {noformat}
> 2016-11-18 20:28:20,605 FINISHED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1328, timeTaken=5,
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy,
> nodeHttpAddress=(node3), counters=Counters: 1,
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,612 STARTED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329,
> containerId=container_222212222_2622_01_012504, nodeId=(node3):15001
> 2016-11-18 20:28:20,628 FINISHED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, timeTaken=16,
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy,
> nodeHttpAddress=(node3), counters=Counters: 1,
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,634 STARTED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330,
> containerId=container_222212222_2622_01_012511, nodeId=(node3):15001
> 2016-11-18 20:28:20,751 FINISHED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, timeTaken=117,
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy,
> nodeHttpAddress=(node3), counters=Counters: 1,
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,757 STARTED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331,
> containerId=container_222212222_2622_01_012522, nodeId=(node3):15001
> 2016-11-18 20:28:20,771 FINISHED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, timeTaken=14,
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy,
> nodeHttpAddress=(node3), counters=Counters: 1,
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,777 STARTED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332,
> containerId=container_222212222_2622_01_012529, nodeId=(node3):15001
> 2016-11-18 20:28:20,783 FINISHED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, timeTaken=6,
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy,
> nodeHttpAddress=(node3), counters=Counters: 1,
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> {noformat}
> As you can see from the attempt number, this has been going on for a while. In
> fact, I think other tasks could have been scheduled in that time (not sure),
> but the thread just kept retrying this one task until it was finally
> scheduled.
> There should be some fallback after initial failures; we should also make
> sure such retries do not take over all scheduling (not sure if they do, need
> to check).
> LLAP on the node was alive, just busy with other tasks. The task did
> eventually get scheduled.