[
https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15831178#comment-15831178
]
Xuefu Zhang commented on HIVE-15671:
------------------------------------
Actually my understanding is a little different. Checking the code, I see:
1. On the server side (the RpcServer constructor), saslHandler is given a timeout
of {{getServerConnectTimeoutMs()}}.
2. On the client side, in {{Rpc.createClient()}}, saslHandler is also given a
timeout of {{getServerConnectTimeoutMs()}}.
These two are consistent, so I don't see any issue there.
On the other hand,
3. On the server side, in {{RpcServer.registerClient()}}, ClientInfo stores
{{getServerConnectTimeoutMs()}}. And when that timeout fires, the exception is
TimeoutException("Timed out waiting for client connection.").
4. On the client side, in {{Rpc.createClient()}}, the channel is initialized with
{{getConnectTimeoutMs()}}.
To me, it seems there is a mismatch between 3 and 4. In 3, the timeout message
implies a "connection timeout", while the value used is the one that is supposed
to govern the saslHandler handshake. This is why I think 3 should use
{{getConnectTimeoutMs()}} instead.
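To illustrate the proposed change, here is a minimal, self-contained sketch (the {{RpcConfig}} holder and the {{timeoutBefore}}/{{timeoutAfter}} methods are simplified stand-ins for the real Rpc/RpcServer code; only the two accessor names {{getConnectTimeoutMs()}} and {{getServerConnectTimeoutMs()}} come from the actual config class):

```java
// Simplified stand-in for Hive's RpcConfiguration, keeping only the two
// timeout accessors relevant to this issue.
class RpcConfig {
    private final long connectTimeoutMs;        // hive.spark.client.connect.timeout
    private final long serverConnectTimeoutMs;  // hive.spark.client.server.connect.timeout

    RpcConfig(long connectTimeoutMs, long serverConnectTimeoutMs) {
        this.connectTimeoutMs = connectTimeoutMs;
        this.serverConnectTimeoutMs = serverConnectTimeoutMs;
    }

    long getConnectTimeoutMs() { return connectTimeoutMs; }
    long getServerConnectTimeoutMs() { return serverConnectTimeoutMs; }
}

public class RegisterClientSketch {
    // Before the patch: registerClient() waits for the incoming client
    // connection using the handshake timeout.
    static long timeoutBefore(RpcConfig config) {
        return config.getServerConnectTimeoutMs();
    }

    // After the patch: the wait for the incoming connection uses the
    // connection timeout instead, so the two can be tuned independently.
    static long timeoutAfter(RpcConfig config) {
        return config.getConnectTimeoutMs();
    }

    public static void main(String[] args) {
        // Example: connect timeout raised to 5 minutes for a busy cluster,
        // handshake timeout left at 90 seconds.
        RpcConfig config = new RpcConfig(300_000L, 90_000L);
        System.out.println("wait for connection, before patch: " + timeoutBefore(config) + " ms");
        System.out.println("wait for connection, after patch:  " + timeoutAfter(config) + " ms");
    }
}
```

The point of the sketch is that before the patch, raising the connection wait also requires raising the handshake timeout; after it, the two knobs are independent.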
Could you take another look?
I actually ran into issues with this. Our cluster is constantly busy, and it
takes minutes for Hive's Spark session to get a container to launch the remote
driver. In that case, the query fails because the Spark session cannot be
created. For such a scenario, I supposed we should increase
*client.connect.timeout*. However, that's not effective. On the other hand, if
I increase *server.connect.timeout*, Hive waits longer for the driver to come
up, which is good. However, doing that has the bad consequence that Hive will
wait just as long to declare a failure if for any reason the remote driver
dies.
With the patch in place, the problem is solved in both cases: I only need to
increase *client.connect.timeout* and can keep *server.connect.timeout*
unchanged.
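Concretely, the tuning described above would look like this in hive-site.xml (values are illustrative, not recommendations; the two property names are the real Hive settings backing {{getConnectTimeoutMs()}} and {{getServerConnectTimeoutMs()}}):

```xml
<!-- Raise only the client connect timeout so Hive waits longer for the
     remote driver to come up; the handshake timeout stays short, so a
     dead driver is still detected quickly (with the patch applied). -->
<property>
  <name>hive.spark.client.connect.timeout</name>
  <value>300000ms</value>
</property>
<property>
  <name>hive.spark.client.server.connect.timeout</name>
  <value>90000ms</value>
</property>
```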
> RPCServer.registerClient() erroneously uses server/client handshake timeout
> for connection timeout
> --------------------------------------------------------------------------------------------------
>
> Key: HIVE-15671
> URL: https://issues.apache.org/jira/browse/HIVE-15671
> Project: Hive
> Issue Type: Bug
> Components: Spark
> Affects Versions: 1.1.0
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Attachments: HIVE-15671.patch
>
>
> {code}
>   /**
>    * Tells the RPC server to expect a connection from a new client.
>    * ...
>    */
>   public Future<Rpc> registerClient(final String clientId, String secret,
>       RpcDispatcher serverDispatcher) {
>     return registerClient(clientId, secret, serverDispatcher,
>         config.getServerConnectTimeoutMs());
>   }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of
> *hive.spark.client.server.connect.timeout*, which is meant as the timeout for
> the handshake between the Hive client and the remote Spark driver. Instead,
> the timeout should be *hive.spark.client.connect.timeout*, which is the
> timeout for the remote Spark driver to connect back to the Hive client.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)