[
https://issues.apache.org/jira/browse/HIVE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siddharth Seth resolved HIVE-10649.
-----------------------------------
Resolution: Duplicate
> LLAP: AM gets stuck completely if one node is dead
> --------------------------------------------------
>
> Key: HIVE-10649
> URL: https://issues.apache.org/jira/browse/HIVE-10649
> Project: Hive
> Issue Type: Sub-task
> Reporter: Sergey Shelukhin
> Assignee: Siddharth Seth
>
> See HIVE-10648.
> When the AM cannot connect to a node, the AM appears to stall. In the example
> log below there are no other interleaved log lines, even though this happens
> in the middle of Map 1 on TPCH q1, i.e. while plenty of tasks are scheduled.
> From the "Assigning" messages I can also see that tasks are scheduled to all
> the nodes before and after the pause, not just to the problematic node.
> The LLAP daemons show corresponding gaps: between two fragments, nothing is
> run for a long time on any daemon.
> {noformat}
> 2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl:
> task_1429683757595_0784_1_00_000276 Task Transitioned from SCHEDULED to
> RUNNING due to event T_ATTEMPT_LAUNCHED
> 2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 10 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler]
> impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 11 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 12 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 13 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 14 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 15 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 16 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 17 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 18 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 19 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 20 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler]
> impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 21 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 22 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:13:59,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 23 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:00,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 24 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:01,820 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 25 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:02,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 26 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:03,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 27 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:04,822 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 28 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:05,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 29 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:06,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 30 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:06,984 INFO [LlapSchedulerNodeEnabler]
> impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:14:07,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 31 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:08,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 32 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:09,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 33 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:10,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 34 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:11,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 35 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:12,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 36 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:13,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 37 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:14,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 38 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:15,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 39 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:16,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 40 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:16,996 INFO [LlapSchedulerNodeEnabler]
> impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:14:17,829 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 41 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:18,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 42 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:19,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 43 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:20,831 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 44 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:21,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 45 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:22,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 46 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:23,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 47 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:24,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 48 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:25,834 INFO [TaskCommunicator # 3] ipc.Client: Retrying
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 49 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
> MILLISECONDS)
> 2015-05-07 12:14:25,836 INFO [TaskCommunicator # 3]
> tezplugins.LlapTaskCommunicator: Unable to run task:
> attempt_1429683757595_0784_1_00_000017_0 on containerId:
> container_222212222_0784_01_000018, Communication Error
> 2015-05-07 12:14:25,841 INFO [Dispatcher thread: Central]
> history.HistoryEventHandler:
> [HISTORY][DAG:dag_1429683757595_0784_1][Event:TASK_ATTEMPT_FINISHED]:
> vertexName=Map 1, taskAttemptId=attempt_1429683757595_0784_1_00_000017_0,
> startTime=1431026014322, finishTime=1431026065838, timeTaken=51516,
> status=KILLED, errorEnum=COMMUNICATION_ERROR, diagnostics=Communication
> Error, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter,
> DATA_LOCAL_TASKS=1
> {noformat}
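
A rough check of the stall length, as a sketch and not anything taken from the Tez or Hadoop code: the retry policy printed in the log allows 50 retries with a fixed 1000 ms sleep, so one unreachable node can keep a TaskCommunicator thread retrying for roughly 50 seconds, which is in line with the timeTaken=51516 reported for the killed attempt. The helper names below are hypothetical; the numbers are copied from the log above.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the log lines above, e.g. "2015-05-07 12:13:46,811".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def sleep_budget_ms(max_retries, sleep_ms):
    """Total time spent sleeping by a fixed-sleep retry policy.

    Only the sleeps are counted; the time each connect attempt takes
    before failing is not included, so the real stall is somewhat longer.
    """
    return max_retries * sleep_ms


def elapsed_ms(start, end):
    """Whole milliseconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_TS_FORMAT) - datetime.strptime(start, LOG_TS_FORMAT)
    return delta // timedelta(milliseconds=1)


# maxRetries=50, sleepTime=1000 MILLISECONDS, as printed in the log.
budget = sleep_budget_ms(50, 1000)

# "Already tried 10 time(s)" to "Already tried 49 time(s)": 39 retry intervals.
observed = elapsed_ms("2015-05-07 12:13:46,811", "2015-05-07 12:14:25,834")

# finishTime - startTime from the TASK_ATTEMPT_FINISHED history event.
attempt_ms = 1431026065838 - 1431026014322

print(budget, observed, attempt_ms)  # 50000 39023 51516
```

The 39 logged retry intervals span 39023 ms, about one second each, and the attempt's total of 51516 ms sits just above the 50000 ms sleep budget.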