[ 
https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15831178#comment-15831178
 ] 

Xuefu Zhang commented on HIVE-15671:
------------------------------------

Actually my understanding is a little different. Checking the code, I see:
1. On the server side (the RpcServer constructor), saslHandler is given a timeout of 
{{getServerConnectTimeoutMs()}}.
2. On the client side, in {{Rpc.createClient()}}, saslHandler is also given a timeout 
of {{getServerConnectTimeoutMs()}}.
These two are consistent, so I don't see any issue there (sketched below).
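
For reference, here is a minimal, self-contained sketch of the symmetry I am 
describing. {{RpcConfig}} and {{SaslHandler}} are simplified stand-ins, not the 
actual spark-client classes, and the timeout values are examples rather than 
defaults; the only point is that both sides arm the SASL handshake with the same 
server/client handshake timeout.
{code}
// Illustrative only -- simplified stand-ins, not the real RpcServer / Rpc code.
public class HandshakeTimeoutSketch {

  static class RpcConfig {
    long getServerConnectTimeoutMs() { return 90_000L; } // hive.spark.client.server.connect.timeout (example value)
    long getConnectTimeoutMs()       { return 1_000L;  } // hive.spark.client.connect.timeout (example value)
  }

  static class SaslHandler {
    void armTimeout(long timeoutMs) {
      System.out.println("SASL handshake must complete within " + timeoutMs + " ms");
    }
  }

  public static void main(String[] args) {
    RpcConfig config = new RpcConfig();

    // 1. Server side (RpcServer constructor): handshake timeout on saslHandler.
    new SaslHandler().armTimeout(config.getServerConnectTimeoutMs());

    // 2. Client side (Rpc.createClient()): the same handshake timeout -- consistent.
    new SaslHandler().armTimeout(config.getServerConnectTimeoutMs());
  }
}
{code}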

On the other hand, 
3. On the server side, in {{Rpc.registerClient()}}, ClientInfo stores 
{{getServerConnectTimeoutMs()}}, and when that timeout fires, the exception is 
TimeoutException("Timed out waiting for client connection.").
4. On the client side, in {{Rpc.createClient()}}, the channel is initialized with 
{{getConnectTimeoutMs()}}.

To me, there is a mismatch between 3 and 4. In 3, the timeout message implies a 
connection timeout, yet the value used is the one meant for the saslHandler 
handshake. This is why I think 3 should use {{getConnectTimeoutMs()}} instead, as 
sketched below.
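
Concretely, the change I have in mind for {{registerClient()}} is along these lines 
(a sketch only; it may not match the attached patch line for line):
{code}
  public Future<Rpc> registerClient(final String clientId, String secret,
      RpcDispatcher serverDispatcher) {
    // Wait for the remote driver to connect back using the client connection
    // timeout; the server/client handshake timeout stays with saslHandler,
    // as described in points 1 and 2 above.
    return registerClient(clientId, secret, serverDispatcher,
        config.getConnectTimeoutMs());
  }
{code}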

Could you take another look?

I actually ran into issues with this. Our cluster is constantly busy, and it can 
take minutes for Hive's Spark session to get a container to launch the remote 
driver. In that case, the query fails because the Spark session cannot be created. 
For such a scenario, I assumed we should increase *client.connect.timeout*; 
however, that is not effective. On the other hand, if I increase 
*server.connect.timeout*, Hive waits longer for the driver to come up, which is 
good. However, that has the bad consequence that Hive will also wait just as long 
to declare a failure if the remote driver dies for any reason.

With the patch in place, the problem is solved in both cases. I only need to 
increase *client.connect.timeout* and keep *server.connect.timeout* unchanged.
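
For concreteness, this is the kind of tuning I have in mind once the patch is in; 
the values are purely illustrative, not recommendations:
{code}
-- Busy cluster: give the remote driver more time to launch and connect back.
set hive.spark.client.connect.timeout=300000ms;
-- Leave the handshake timeout alone, so a dead driver is still detected promptly.
-- (hive.spark.client.server.connect.timeout stays at its existing value.)
{code}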

> RPCServer.registerClient() erroneously uses server/client handshake timeout 
> for connection timeout
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-15671
>                 URL: https://issues.apache.org/jira/browse/HIVE-15671
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15671.patch
>
>
> {code}
>   /**
>    * Tells the RPC server to expect a connection from a new client.
>    * ...
>    */
>   public Future<Rpc> registerClient(final String clientId, String secret,
>       RpcDispatcher serverDispatcher) {
>     return registerClient(clientId, secret, serverDispatcher,
>         config.getServerConnectTimeoutMs());
>   }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of 
> *hive.spark.client.server.connect.timeout*, which is meant as the timeout for the 
> handshake between the Hive client and the remote Spark driver. Instead, the 
> timeout used here should be *hive.spark.client.connect.timeout*, which is the 
> timeout for the remote Spark driver to connect back to the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
