[
https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832582#comment-15832582
]
Xuefu Zhang commented on HIVE-15671:
------------------------------------
Patch #1 followed what [~vanzin] suggested. With it, I observed the following
behavior:
1. Increasing *server.connect.timeout* makes Hive wait longer for the driver to
connect back, which solves the busy-cluster problem.
2. Killing the driver while the job is running immediately fails the query on
the Hive side with the following error:
{code}
2017-01-20 22:01:08,235 Stage-2_0: 7(+3)/685 Stage-3_0: 0/1
2017-01-20 22:01:09,237 Stage-2_0: 16(+6)/685 Stage-3_0: 0/1
Failed to monitor Job[ 1] with exception 'java.lang.IllegalStateException(RPC
channel is closed.)'
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.spark.SparkTask
{code}
This meets my expectation.
However, I didn't test the case where the driver dies before connecting back to
Hive. (It's also hard to construct such a test case.) In that case, I assume
Hive will wait for *server.connect.timeout* before declaring a failure. I guess
there isn't much we can do about that case, and I don't think the change here
has any bearing on it.
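For illustration, below is a minimal, self-contained sketch of the waiting
behavior described above. It is not Hive code: the class, method, and timeout
names/values are made up. A future stands in for the promise the RPC server
completes once the remote driver connects back; the caller bounds the wait with
a configurable timeout, while a future that fails eagerly (e.g. because the
driver was killed) surfaces the error immediately instead of waiting out the
timeout.
{code}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only; not taken from Hive.
public class ConnectBackTimeoutSketch {

  static void awaitDriver(CompletableFuture<String> driverConnection, long timeoutMs) {
    try {
      // A larger timeout means we wait longer for a driver that is slow to
      // start on a busy cluster.
      String channel = driverConnection.get(timeoutMs, TimeUnit.MILLISECONDS);
      System.out.println("Driver connected via " + channel);
    } catch (TimeoutException e) {
      // The driver never connected back within the timeout.
      System.err.println("Timed out after " + timeoutMs + " ms waiting for the driver");
    } catch (Exception e) {
      // A future that was failed outright (e.g. the driver process was killed)
      // surfaces immediately instead of waiting out the full timeout.
      System.err.println("Driver connection failed: " + e.getCause());
    }
  }

  public static void main(String[] args) {
    // Case 1: the driver never connects back; we give up after the timeout.
    awaitDriver(new CompletableFuture<>(), 2_000L);

    // Case 2: the connection fails outright; no waiting is involved.
    CompletableFuture<String> killed = new CompletableFuture<>();
    killed.completeExceptionally(new IllegalStateException("RPC channel is closed."));
    awaitDriver(killed, 2_000L);
  }
}
{code}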
> RPCServer.registerClient() erroneously uses server/client handshake timeout
> for connection timeout
> --------------------------------------------------------------------------------------------------
>
> Key: HIVE-15671
> URL: https://issues.apache.org/jira/browse/HIVE-15671
> Project: Hive
> Issue Type: Bug
> Components: Spark
> Affects Versions: 1.1.0
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Attachments: HIVE-15671.1.patch, HIVE-15671.patch
>
>
> {code}
> /**
>  * Tells the RPC server to expect a connection from a new client.
>  * ...
>  */
> public Future<Rpc> registerClient(final String clientId, String secret,
>     RpcDispatcher serverDispatcher) {
>   return registerClient(clientId, secret, serverDispatcher,
>       config.getServerConnectTimeoutMs());
> }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of
> *hive.spark.client.server.connect.timeout*, which is meant to be the timeout
> for the handshake between the Hive client and the remote Spark driver.
> Instead, the timeout should be *hive.spark.client.connect.timeout*, which is
> the timeout for the remote Spark driver to connect back to the Hive client.
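> For illustration, a minimal sketch of the substitution this description
> suggests, assuming {{RpcConfiguration}} exposes a {{getConnectTimeoutMs()}}
> getter for *hive.spark.client.connect.timeout*; the attached patches may
> resolve the issue differently:
> {code}
> public Future<Rpc> registerClient(final String clientId, String secret,
>     RpcDispatcher serverDispatcher) {
>   // Sketch: delegate with the client connect timeout instead of the
>   // server/client handshake timeout.
>   return registerClient(clientId, secret, serverDispatcher,
>       config.getConnectTimeoutMs());
> }
> {code}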
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)