[ 
https://issues.apache.org/jira/browse/HIVE-23409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103382#comment-17103382
 ] 

Naresh P R commented on HIVE-23409:
-----------------------------------

[~ashutoshc] Thanks for looking into this.

If a Tez session AM is released after the DAG wait timeout, the next call on 
that Tez session tries to launch a new AM, which fails after 2 retries here:

 
{code:java}
Dag submit failed due to java.lang.RuntimeException: Failed to connect to 
timeline server. Connection retries limit exceeded. The posted timeline event 
may be missing stack trace: 
[org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:403)
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:363)
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:282)
org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:77)
org.apache.tez.client.TezClient.start(TezClient.java:402)
org.apache.hadoop.hive.ql.exec.tez.TezSessionState.startSessionAndContainers(TezSessionState.java:516)
org.apache.hadoop.hive.ql.exec.tez.TezSessionState.openInternal(TezSessionState.java:451)
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolSession.openInternal(TezSessionPoolSession.java:124)
org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:379)
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.reopenInternal(TezSessionPoolManager.java:498)
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.reopen(TezSessionPoolManager.java:487)
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolSession.reopen(TezSessionPoolSession.java:228)
org.apache.hadoop.hive.ql.exec.tez.TezTask.getNewTezSessionOnError(TezTask.java:531)
org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:547)
org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:221)
{code}
If it fails twice, we destroy the session, which is part of 
TezSessionPool.

 

[HiveServer2-Background-Pool: Thread-12345]: tez.TezSessionPoolManager (:()) - 
We are closing a default session because of retry failure.

All new queries then wait for a session from TezSessionPool:
{code:java}
"HiveServer2-Background-Pool: Thread-21342" #21342
   java.lang.Thread.State: TIMED_WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  <0x00000005c4567e10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
 at org.apache.hadoop.hive.ql.exec.tez.TezSessionPool.getSession(TezSessionPool.java:193)
 at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.getSession(TezSessionPoolManager.java:295)
 at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.getSession(TezSessionPoolManager.java:474)
 at org.apache.hadoop.hive.ql.exec.tez.WorkloadManagerFederation.getUnmanagedSession(WorkloadManagerFederation.java:66)
 at org.apache.hadoop.hive.ql.exec.tez.WorkloadManagerFederation.getSession(WorkloadManagerFederation.java:38)
 at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:189)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
 at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:103)
{code}
Only an HS2 restart resolves this issue.
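
To make the failure mode concrete, here is a small toy model of a fixed-size pool. It is only a sketch and not Hive code: SessionPoolSketch, Outcome, runQueryWhileReopenFails and the destroyOnReopenFailure flag are hypothetical names, the reopen step is simply assumed to fail every time (as when the timeline server is down), and the wait is bounded so the demo terminates, whereas the waiters in the thread dump keep waiting. Handing the session back to the pool is one possible alternative shown for contrast; it is not necessarily what the attached patch does.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy model only: none of these names come from Hive or Tez.
class SessionPoolSketch {
    enum Outcome { FAILED_FAST, NO_SESSION_AVAILABLE }

    private final BlockingQueue<Integer> pool = new LinkedBlockingQueue<>();
    private final boolean destroyOnReopenFailure;

    SessionPoolSketch(int size, boolean destroyOnReopenFailure) {
        this.destroyOnReopenFailure = destroyOnReopenFailure;
        for (int id = 0; id < size; id++) {
            pool.add(id); // each integer stands in for one default session
        }
    }

    // Simulates one query arriving while every session reopen fails.
    Outcome runQueryWhileReopenFails(long waitMillis) {
        Integer session;
        try {
            // Bounded wait so the demo ends; an unbounded wait would hang here.
            session = pool.poll(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.NO_SESSION_AVAILABLE;
        }
        if (session == null) {
            return Outcome.NO_SESSION_AVAILABLE; // pool drained, nothing to hand out
        }
        // Assumption: the reopen attempts fail at this point.
        if (!destroyOnReopenFailure) {
            pool.add(session); // hand the session back so later queries can retry
        }
        // Otherwise the session is dropped and the pool shrinks by one.
        return Outcome.FAILED_FAST;
    }

    int available() {
        return pool.size();
    }

    public static void main(String[] args) {
        for (boolean destroy : new boolean[] {true, false}) {
            SessionPoolSketch sketch = new SessionPoolSketch(2, destroy);
            System.out.println("destroyOnReopenFailure=" + destroy);
            for (int q = 1; q <= 3; q++) {
                System.out.println("  query " + q + ": " + sketch.runQueryWhileReopenFails(50));
            }
            System.out.println("  sessions left: " + sketch.available());
        }
    }
}
```

With two sessions and destroyOnReopenFailure=true, the first two queries fail and each removes a session, so the third finds the pool empty. With the flag off, every query fails quickly and the pool keeps both sessions for use once the dependency recovers.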

> If TezSession application reopen fails for Timeline service down, default 
> TezSession from SessionPool is closed after a retry
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23409
>                 URL: https://issues.apache.org/jira/browse/HIVE-23409
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Naresh P R
>            Assignee: Naresh P R
>            Priority: Major
>         Attachments: HIVE-23409.patch
>
>
> We are closing a default session from TezSessionPool here:
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java#L589]
> If all the sessions in a pool are destroyed, queries wait indefinitely at 
> TezSessionPool.getSession until HS2 is restarted after the other services recover.
> [HiveServer2-Background-Pool: Thread-12345]: tez.TezSessionPoolManager (:()) 
> - We are closing a default session because of retry failure.
> It would be better to allow the retry and fail than to hang.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
