[
https://issues.apache.org/jira/browse/HIVE-16984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069173#comment-16069173
]
Xuefu Zhang commented on HIVE-16984:
------------------------------------
The patch looks good. However, I'm thinking we'd better keep the previous
logic (before HIVE-15171) intact. For that, we can change the current API,
SparkJobStatus.getAppId(), into a pure getter and add
SparkJobStatus.fetchAppId(), which invokes the remote RPC call and caches the
app id in the SparkJobStatus class. This way, getAppId() can be called at any
time without any network trips. Existing calls to getAppId() would then be
changed to fetchAppId().
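Roughly what I have in mind (a minimal sketch; the names and the RPC call
below are illustrative, not the actual spark-client classes):
{code:java}
// Illustrative sketch only; the field names and the RPC call are assumptions,
// not the actual Hive spark-client code.
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class RemoteSparkJobStatusSketch {

  /** Minimal stand-in for the RPC handle to the remote driver. */
  interface DriverRpc {
    Future<String> getAppId();
  }

  private final DriverRpc driverRpc;
  private final long futureTimeoutMs;
  private volatile String appId;   // cached application id

  public RemoteSparkJobStatusSketch(DriverRpc driverRpc, long futureTimeoutMs) {
    this.driverRpc = driverRpc;
    this.futureTimeoutMs = futureTimeoutMs;
  }

  /** Pure getter: returns whatever is cached (possibly null), no network trip. */
  public String getAppId() {
    return appId;
  }

  /** Invokes the remote RPC once, caches the result, and returns it. */
  public String fetchAppId() throws Exception {
    if (appId == null) {
      appId = driverRpc.getAppId().get(futureTimeoutMs, TimeUnit.MILLISECONDS);
    }
    return appId;
  }
}
{code}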
> HoS: avoid waiting for RemoteSparkJobStatus::getAppID() when remote driver
> died
> -------------------------------------------------------------------------------
>
> Key: HIVE-16984
> URL: https://issues.apache.org/jira/browse/HIVE-16984
> Project: Hive
> Issue Type: Bug
> Components: Spark
> Reporter: Chao Sun
> Assignee: Chao Sun
> Attachments: HIVE-16984.1.patch
>
>
> In HoS, after a RemoteDriver is launched, it may fail to initialize a Spark
> context, in which case the ApplicationMaster will eventually die. When that
> happens, there are two issues related to RemoteSparkJobStatus::getAppID():
> 1. Currently we call {{getAppID()}} before starting the monitoring job. The
> former waits for {{hive.spark.client.future.timeout}}, and the latter waits
> for {{hive.spark.job.monitor.timeout}}. The error message for the latter
> treats {{hive.spark.job.monitor.timeout}} as the total time spent waiting for
> job submission. However, this is inaccurate since it doesn't include
> {{hive.spark.client.future.timeout}}.
> 2. If the RemoteDriver dies suddenly, we may still wait fruitlessly for the
> full timeouts. This could be avoided if we know that the channel between the
> client and the remote driver has closed, as sketched below.
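> A rough sketch of that idea, assuming the client exposes some liveness check
> for the connection to the remote driver (names here are hypothetical, not the
> actual spark-client API):
> {code:java}
> // Illustrative only: wait for the app id, but give up early if the channel
> // to the remote driver is known to be closed, instead of blocking for the
> // full hive.spark.client.future.timeout.
> import java.util.concurrent.Future;
> import java.util.concurrent.TimeUnit;
>
> public class AppIdWaitSketch {
>
>   /** Minimal stand-in for the real client; isActive() is a hypothetical liveness check. */
>   interface DriverRpc {
>     Future<String> getAppId();
>     boolean isActive();
>   }
>
>   public static String waitForAppId(DriverRpc rpc, long timeoutMs) throws Exception {
>     Future<String> appIdFuture = rpc.getAppId();
>     long deadline = System.currentTimeMillis() + timeoutMs;
>     while (System.currentTimeMillis() < deadline) {
>       if (appIdFuture.isDone()) {
>         return appIdFuture.get();
>       }
>       if (!rpc.isActive()) {   // channel closed: the driver died, so stop waiting
>         throw new IllegalStateException("Remote Spark driver died; aborting getAppId()");
>       }
>       TimeUnit.MILLISECONDS.sleep(100);   // short poll instead of one long blocking wait
>     }
>     throw new IllegalStateException("Timed out waiting for the Spark application id");
>   }
> }
> {code}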
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)