[
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201029#comment-15201029
]
Rui Li commented on HIVE-12650:
-------------------------------
Here are my findings so far (for yarn-client mode).
# If the cluster has no resources available, the {{RemoteDriver}} is blocked
creating the SparkContext. The RPC channel has already been set up at this
point, so {{SparkClientImpl}} believes the driver is up.
# There are 2 points where Hive can time out. If container pre-warm is enabled,
we time out at {{RemoteHiveSparkClient#createRemoteClient#getExecutorCount}}. If
pre-warm is off, we time out at {{RemoteSparkJobMonitor#startMonitor}}. Both
happen because the SparkContext is not created and the RemoteDriver can't
respond to requests. In the latter case, the job status remains {{SENT}} until
the timeout, when ideally it should be {{QUEUE}} instead. My understanding is
that the RPC handler in the RemoteDriver is blocked on the {{ADD JAR/FILE}}
calls, which we submit before the real job (a toy illustration follows this
list).
# Currently YARN doesn't time out a starving application, which means the app
will eventually get served. Refer to YARN-3813 and YARN-2266.
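To make point 2 concrete, here's a minimal, self-contained toy model (plain
Java, no Hive/Spark dependencies; the class and variable names are made-up
stand-ins, not the real {{RemoteDriver}}/{{RemoteSparkJobMonitor}} code). A
single-threaded handler blocks on the first call while "waiting for the
SparkContext", so the status query queued behind it is never answered, and the
client-side wait times out with the job still looking {{SENT}}:
{code:java}
import java.util.concurrent.*;

public class BlockedDriverSketch {
  enum JobState { SENT, QUEUED, STARTED }

  public static void main(String[] args) throws Exception {
    // Models the driver-side RPC handler: calls are processed one at a time.
    ExecutorService driverRpcHandler = Executors.newSingleThreadExecutor();
    // Models SparkContext creation; never completes while the cluster is starved.
    CountDownLatch sparkContextReady = new CountDownLatch(1);

    // 1) The ADD JAR/FILE call submitted before the real job; it blocks
    //    until the "SparkContext" is available.
    driverRpcHandler.submit(() -> {
      sparkContextReady.await();
      return null;
    });

    // 2) The job-status query is queued behind the blocked call; if it ever
    //    ran it would report QUEUED.
    Future<JobState> statusQuery = driverRpcHandler.submit(() -> JobState.QUEUED);

    // 3) The client-side monitor waits a bounded time and then gives up.
    try {
      System.out.println("Observed state: " + statusQuery.get(2, TimeUnit.SECONDS));
    } catch (TimeoutException e) {
      // The query never ran, so from the client's view the job is still SENT.
      System.out.println("Timed out; job state still " + JobState.SENT);
    } finally {
      driverRpcHandler.shutdownNow();
    }
  }
}
{code}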
Based on these findings, I think we have to decide whether Hive should time out
in this situation. Waiting is reasonable on a busy cluster, but on the other
hand it seems difficult to tell whether we're blocked simply for lack of
resources, and I'm not sure whether Spark provides a facility for that. For
yarn-cluster mode this may be even more difficult, because the RemoteDriver is
not running in that case and we'll have less information.
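On the "are we just waiting for resources" question, one conceivable check (not
something Hive does today, and only a sketch that assumes we somehow already
know the YARN application id, which is itself part of the difficulty) is to ask
the ResourceManager for the application state: an app stuck in
{{SUBMITTED}}/{{ACCEPTED}} hasn't been granted its AM container yet, while
{{RUNNING}} means the AM is actually up.
{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class StarvationCheckSketch {
  /**
   * Rough sketch only: returns true if YARN still reports the app as
   * submitted/accepted, i.e. it has not been given its AM container yet.
   */
  public static boolean waitingForResources(ApplicationId appId) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();
    try {
      ApplicationReport report = yarn.getApplicationReport(appId);
      YarnApplicationState state = report.getYarnApplicationState();
      return state == YarnApplicationState.SUBMITTED
          || state == YarnApplicationState.ACCEPTED;
    } finally {
      yarn.stop();
    }
  }
}
{code}
Even with something like this, we'd still need a way to learn the application
id before the SparkContext is up, so it's only half an answer.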
What do you guys think?
> Spark-submit is killed when Hive times out. Killing spark-submit doesn't
> cancel AM request. When AM is finally launched, it tries to connect back to
> Hive and gets refused.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-12650
> URL: https://issues.apache.org/jira/browse/HIVE-12650
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.1.1, 1.2.1
> Reporter: JoneZhang
> Assignee: Xuefu Zhang
>
> I think hive.spark.client.server.connect.timeout should be set greater than
> spark.yarn.am.waitTime. The default value for spark.yarn.am.waitTime is 100s,
> while the default value for hive.spark.client.server.connect.timeout is 90s,
> so by default Hive gives up before the AM wait time has even elapsed. We can
> increase it to a larger value such as 120s.