[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201029#comment-15201029
 ] 

Rui Li commented on HIVE-12650:
-------------------------------

Here are my findings so far (for yarn-client mode).

# If the cluster has no resources available, the {{RemoteDriver}} blocks while 
creating the SparkContext. The RPC connection has already been set up at this 
point, so {{SparkClientImpl}} believes the driver is up.
# There are two points where Hive can time out. If container pre-warm is enabled, 
we time out at {{RemoteHiveSparkClient#createRemoteClient#getExecutorCount}}; if 
pre-warm is off, we time out at {{RemoteSparkJobMonitor#startMonitor}}. Both happen 
because the SparkContext hasn't been created and the RemoteDriver can't respond to 
requests. For the latter, the job status remains {{SENT}} until timeout, while 
ideally it should be {{QUEUED}} instead. My understanding is that the RPC handler 
on the RemoteDriver side is blocked handling the {{ADD JAR/FILE}} calls we 
submitted before the real job (see the sketch after this list).
# Currently YARN doesn't time out a starving application, which means the app 
will eventually get served. Refer to YARN-3813 and YARN-2266.
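
Here's the sketch mentioned in the second point above. This is not the actual 
Hive code (the class and the {{pollRemoteState}} helper are invented for 
illustration), but it shows why the status stays {{SENT}} until we hit the 
timeout while the driver is still blocked in {{new SparkContext(...)}}:

{code:java}
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified illustration of the behavior described in point 2.
// The real logic lives in RemoteSparkJobMonitor#startMonitor; the names below
// are invented for this sketch.
public class MonitorSketch {

  enum JobState { SENT, QUEUED, STARTED, SUCCEEDED, FAILED }

  // Stand-in for the RPC call to the RemoteDriver. While the driver is blocked
  // creating the SparkContext (waiting for YARN to grant resources), it cannot
  // serve this call, so the observed state never moves past SENT.
  static JobState pollRemoteState() {
    return JobState.SENT;
  }

  public static void main(String[] args) throws InterruptedException {
    // Roughly what the monitor's configurable timeout controls in Hive.
    long timeoutMs = TimeUnit.SECONDS.toMillis(60);
    long start = System.currentTimeMillis();
    JobState state = JobState.SENT;

    while (state == JobState.SENT || state == JobState.QUEUED) {
      if (System.currentTimeMillis() - start > timeoutMs) {
        // This is the timeout the user sees when pre-warm is off.
        throw new RuntimeException("Job hasn't been submitted after " + timeoutMs + " ms.");
      }
      Thread.sleep(1000);
      state = pollRemoteState();
    }
  }
}
{code}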

Based on these findings, I think we have to decide whether Hive should time out 
in this situation. Waiting is reasonable on a busy cluster, but on the other hand 
it seems difficult to tell whether we're blocked purely for lack of resources, 
and I'm not sure Spark offers a facility for that. One option, sketched below, is 
to ask YARN directly. For yarn-cluster mode this may be even more difficult, 
because the RemoteDriver is not running in that case and we'll have less 
information.
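
Just a sketch of the YARN route, and it assumes the Hive side can somehow obtain 
the application's YARN {{ApplicationId}} (the {{YarnClient}} calls are standard 
Hadoop API, but this is an idea, not existing Hive functionality). An application 
sitting in {{ACCEPTED}} is waiting for resources rather than hung:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Sketch only: assumes we have the ApplicationId of the Hive-on-Spark app.
public class StarvationCheck {

  /** Returns true if the RM hasn't granted the application's AM container yet. */
  static boolean waitingForResources(Configuration conf, ApplicationId appId)
      throws IOException, YarnException {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    try {
      ApplicationReport report = yarnClient.getApplicationReport(appId);
      YarnApplicationState state = report.getYarnApplicationState();
      // NEW/NEW_SAVING/SUBMITTED/ACCEPTED all mean the AM hasn't started yet;
      // ACCEPTED in particular is where a starving app sits on a busy cluster.
      return state == YarnApplicationState.NEW
          || state == YarnApplicationState.NEW_SAVING
          || state == YarnApplicationState.SUBMITTED
          || state == YarnApplicationState.ACCEPTED;
    } finally {
      yarnClient.stop();
    }
  }
}
{code}

With something like this, Hive could choose to keep waiting (and log why) instead 
of failing the query.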
What do you guys think?

> Spark-submit is killed when Hive times out. Killing spark-submit doesn't 
> cancel AM request. When AM is finally launched, it tries to connect back to 
> Hive and gets refused.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-12650
>                 URL: https://issues.apache.org/jira/browse/HIVE-12650
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.1.1, 1.2.1
>            Reporter: JoneZhang
>            Assignee: Xuefu Zhang
>
> I think hive.spark.client.server.connect.timeout should be set greater than 
> spark.yarn.am.waitTime. The default value for 
> spark.yarn.am.waitTime is 100s, and the default value for 
> hive.spark.client.server.connect.timeout is 90s, which is not good. We can 
> increase it to a larger value such as 120s.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
