[jira] [Created] (YARN-10851) Tez session close does not interrupt yarn's async thread

2021-07-07 Thread Qihong Wu (Jira)
Qihong Wu created YARN-10851:


 Summary: Tez session close does not interrupt yarn's async thread
 Key: YARN-10851
 URL: https://issues.apache.org/jira/browse/YARN-10851
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.10.1, 2.8.5
 Environment: On an HA cluster, where RM1 is not the active RM
Yarn of version 2.8.5 and is configured with Tez
Reporter: Qihong Wu
 Attachments: hive.log

Hi, I would like to ask for expert advice on how YARN behaves when 
handling `InterruptedIOException`. 

The issue occurs on an HA cluster where RM1 is NOT the active RM. If a YARN 
request made to RM1 fails, failover to the other RM should therefore happen. 
However, if an interrupted exception is thrown while connecting to RM1, the 
thread should try to [bail 
out|https://dzone.com/articles/how-to-handle-the-interruptedexception] as soon 
as possible to [respect the interrupt 
request|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#shutdownNow--],
 rather than moving on to another RM.

But I found that my application (Hive), after an `InterruptedIOException` was 
thrown while trying to connect to RM1, continued on to RM2. I want to know how 
YARN handles `InterruptedIOException`: shouldn't the async thread be interrupted 
and shut down when Tez's close() triggers the interrupt request?
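
To make the expected behavior concrete, here is a minimal sketch of a failover loop that stops as soon as it sees an interrupt. It is not Hadoop's `RetryInvocationHandler`; the class `InterruptAwareFailover`, the interface `RmCall`, and the method `callWithFailover` are hypothetical names used only for illustration. One detail that may matter here: `java.net.SocketTimeoutException` is a subclass of `InterruptedIOException`, so the sketch assumes a socket timeout should still be allowed to fail over while a genuine interrupt should not.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this is not Hadoop's retry logic. */
final class InterruptAwareFailover {

    /** Hypothetical stand-in for one RPC made against a given ResourceManager. */
    interface RmCall<T> {
        T invoke(String rmId) throws IOException;
    }

    /** Tries each RM in order, but gives up immediately once the thread is interrupted. */
    static <T> T callWithFailover(List<String> rmIds, RmCall<T> call) throws IOException {
        IOException last = null;
        for (String rmId : rmIds) {
            // Do not start a new attempt if an interrupt is already pending.
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedIOException("interrupted before contacting " + rmId);
            }
            try {
                return call.invoke(rmId);
            } catch (InterruptedIOException e) {
                if (e instanceof SocketTimeoutException) {
                    // Assumption: a timeout is an ordinary failure, so fail over.
                    last = e;
                    continue;
                }
                // A genuine interrupt: restore the flag and bail out, no failover.
                Thread.currentThread().interrupt();
                throw e;
            } catch (IOException e) {
                // Any other I/O failure: remember it and try the next RM.
                last = e;
            }
        }
        throw last != null ? last : new IOException("no ResourceManager configured");
    }

    public static void main(String[] args) throws IOException {
        String answer = callWithFailover(Arrays.asList("rm1", "rm2"), rmId -> {
            if (rmId.equals("rm1")) {
                throw new ConnectException("rm1 is standby");
            }
            return "served by " + rmId;
        });
        System.out.println(answer);
    }
}
```

With this shape, an ordinary connection failure on rm1 moves on to rm2, whereas an interrupt raised during the rm1 attempt propagates to the caller without rm2 being contacted.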



*Reproduction steps:*
 1. Use an HA cluster that runs YARN 2.8.5 and is configured with Tez.
 2. Make sure RM1 is not the active RM by checking `yarn rmadmin 
-getAllServiceState`. If it is, manually [transition RM2 to the active 
RM|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html#Admin_commands].
 3. Apply the following failover-retry properties to yarn-site.xml:
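{quote}
<property>
  <name>yarn.client.failover-retries</name>
  <value>4</value>
</property>
<property>
  <name>yarn.client.failover-retries-on-socket-timeouts</name>
  <value>4</value>
</property>
<property>
  <name>yarn.client.failover-max-attempts</name>
  <value>4</value>
</property>
{quote}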
 4. Run a simple application through the YARN client (for example, a simple Hive 
DDL command):
{quote}hive --hiveconf hive.root.logger=TRACE,console -e "create table tez_test 
(id int, name string);"
{quote}
 5. In the application's log (for example, hive.log), you will find that 
`RetryInvocationHandler` caught the `InterruptedIOException` while the request 
was talking to rm1, yet the thread did not bail out immediately and instead 
continued on to rm2.



*More information:*
The interrupted exception is triggered via 
[TezSessionState#close|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java#L689]
 and 
[Future#cancel|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Future.html#cancel-boolean-].
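
As a rough illustration of that trigger path, the sketch below shows how `Future#cancel(true)` delivers an interrupt to a worker thread blocked inside a task. It uses only `java.util.concurrent`; the class `CancelInterruptDemo` and the method `workerSeesInterrupt` are hypothetical names, and `Thread.sleep` is merely a stand-in for the blocking RM call, not what Hive or Tez actually runs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical demo of how Future#cancel(true) interrupts a running task. */
final class CancelInterruptDemo {

    /** Returns true if the worker observed the interrupt caused by cancel(true). */
    static boolean workerSeesInterrupt() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);
        try {
            Future<?> future = pool.submit(() -> {
                started.countDown();
                try {
                    // Stand-in for a blocking call such as an RPC to the RM.
                    Thread.sleep(60_000);
                } catch (InterruptedException e) {
                    interrupted.set(true);
                    // Restore the flag so code further up the stack can see it.
                    Thread.currentThread().interrupt();
                } finally {
                    finished.countDown();
                }
            });
            started.await();      // make sure the task is actually running
            future.cancel(true);  // true = interrupt the thread running the task
            finished.await(5, TimeUnit.SECONDS);
            return interrupted.get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("worker saw interrupt: " + workerSeesInterrupt());
    }
}
```

The point of the sketch is that the interrupt only reaches the task; whether the work stops depends on the code running in that thread choosing to honor it, which is the behavior this issue asks about.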



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (ZEPPELIN-5149) NPE in RemoteInterpreterServer#interpret

2020-12-01 Thread Qihong Wu (Jira)
Qihong Wu created ZEPPELIN-5149:
---

 Summary: NPE in RemoteInterpreterServer#interpret
 Key: ZEPPELIN-5149
 URL: https://issues.apache.org/jira/browse/ZEPPELIN-5149
 Project: Zeppelin
  Issue Type: Bug
Reporter: Qihong Wu


This is the same bug as https://issues.apache.org/jira/browse/ZEPPELIN-4829#.

To reproduce, run any query with the %livy.sql interpreter in Zeppelin, for 
example:
 
%livy.sql 
show databases 


The bug is caused by two NPEs, and the fix in ZEPPELIN-4829 resolved only the 
first one. With v0.9.0-preview2, I can still reproduce the same error, with the 
following log:
{quote}
ERROR [2020-12-01 07:28:58,404] ({pool-2-thread-1} ProcessFunction.java[process]:47) - Internal error processing interpret
java.lang.NullPointerException
 at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.interpret(RemoteInterpreterServer.java:624)
 at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$interpret.getResult(RemoteInterpreterService.java:1646)
 at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$interpret.getResult(RemoteInterpreterService.java:1626)
 at org.apache.zeppelin.shaded.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
 at org.apache.zeppelin.shaded.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
 at org.apache.zeppelin.shaded.org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:313)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
{quote}

The unresolved NPE is thrown at 
[https://github.com/apache/zeppelin/blob/v0.9.0-preview2/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterServer.java#L624]

I don't see the error in v0.8.0. 


