[ 
https://issues.apache.org/jira/browse/HIVE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-10649:
----------------------------------
    Assignee:     (was: Siddharth Seth)

> LLAP: AM gets stuck completely if one node is dead
> --------------------------------------------------
>
>                 Key: HIVE-10649
>                 URL: https://issues.apache.org/jira/browse/HIVE-10649
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergey Shelukhin
>
> See HIVE-10648.
> When the AM cannot connect to a node, it appears to stall. In the example 
> log below there are no other interleaved log lines, even though this happens in 
> the middle of Map 1 on TPCH q1, i.e. there are plenty of tasks scheduled.
> From the "Assigning" messages I can also see that tasks are scheduled to all 
> the nodes before and after the pause, not just to the problematic node. 
> LLAP daemons have corresponding gaps where, between two fragments, nothing is 
> run for a long time on any daemon.
> {noformat}
> 2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: 
> task_1429683757595_0784_1_00_000276 Task Transitioned from SCHEDULED to 
> RUNNING due to event T_ATTEMPT_LAUNCHED
> 2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 10 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] 
> impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 11 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 12 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 13 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 14 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 15 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 16 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 17 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 18 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 19 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 20 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] 
> impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 21 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 22 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:13:59,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 23 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:00,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 24 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:01,820 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 25 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:02,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 26 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:03,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 27 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:04,822 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 28 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:05,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 29 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:06,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 30 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:06,984 INFO [LlapSchedulerNodeEnabler] 
> impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:14:07,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 31 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:08,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 32 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:09,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 33 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:10,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 34 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:11,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 35 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:12,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 36 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:13,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 37 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:14,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 38 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:15,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 39 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:16,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 40 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:16,996 INFO [LlapSchedulerNodeEnabler] 
> impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:14:17,829 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 41 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:18,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 42 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:19,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 43 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:20,831 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 44 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:21,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 45 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:22,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 46 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:23,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 47 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:24,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 48 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:25,834 INFO [TaskCommunicator # 3] ipc.Client: Retrying 
> connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 49 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
> MILLISECONDS)
> 2015-05-07 12:14:25,836 INFO [TaskCommunicator # 3] 
> tezplugins.LlapTaskCommunicator: Unable to run task: 
> attempt_1429683757595_0784_1_00_000017_0 on containerId: 
> container_222212222_0784_01_000018, Communication Error
> 2015-05-07 12:14:25,841 INFO [Dispatcher thread: Central] 
> history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1429683757595_0784_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1429683757595_0784_1_00_000017_0, 
> startTime=1431026014322, finishTime=1431026065838, timeTaken=51516, 
> status=KILLED, errorEnum=COMMUNICATION_ERROR, diagnostics=Communication 
> Error, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, 
> DATA_LOCAL_TASKS=1
> {noformat}
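
As a rough cross-check of the log above, the length of the pause can be compared with the retry policy it prints. The sketch below is only arithmetic on values copied from the log (maxRetries=50, sleepTime=1000 MILLISECONDS, startTime, finishTime). The class and method names are hypothetical and are not part of Hive, Tez, or Hadoop, and the sketch assumes every retry sleeps for the full sleepTime and ignores the time spent in the connect attempts themselves.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: arithmetic on values copied from the log excerpt. */
public class RetryBudget {

    /** Upper bound on time spent sleeping between attempts under a fixed-sleep policy. */
    public static long worstCaseSleepMillis(int maxRetries, long sleepTime, TimeUnit unit) {
        // Assumption: each retry sleeps for the full sleepTime.
        return Math.multiplyExact((long) maxRetries, unit.toMillis(sleepTime));
    }

    /** Wall-clock duration of the task attempt, from the history event timestamps. */
    public static long elapsedMillis(long startTime, long finishTime) {
        return Math.subtractExact(finishTime, startTime);
    }

    public static void main(String[] args) {
        // Values from RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS).
        long budget = worstCaseSleepMillis(50, 1000, TimeUnit.MILLISECONDS);
        // Values from the TASK_ATTEMPT_FINISHED history event.
        long taken = elapsedMillis(1431026014322L, 1431026065838L);

        System.out.println("retry sleep budget (ms): " + budget); // 50000
        System.out.println("attempt timeTaken (ms):  " + taken);  // 51516
    }
}
```

The two numbers are close (50000 ms against the reported timeTaken=51516), which suggests the killed attempt spent nearly all of its lifetime inside the connection retry loop. This does not by itself explain why unrelated tasks were not scheduled during the same window.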



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
