[
https://issues.apache.org/jira/browse/HIVE-16984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069173#comment-16069173
]
Xuefu Zhang commented on HIVE-16984:
------------------------------------
The patch looks good. However, I'm thinking we'd better keep the previous
logic (before HIVE-15171) intact. For that, we can change the current API,
SparkJobStatus.getAppId(), into a pure getter and add
SparkJobStatus.fetchAppId(), which invokes the remote RPC call and caches the
app id in the SparkJobStatus class. This way, getAppId() can be called at any
time without any network trips. Existing calls to getAppId() would then be
changed to fetchAppId().
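Roughly what I have in mind (a minimal sketch; the names and the RPC call
below are illustrative, not the actual spark-client classes):
{code:java}
// Illustrative sketch only; the field names and the RPC call are assumptions,
// not the actual Hive spark-client code.
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class RemoteSparkJobStatusSketch {

  /** Minimal stand-in for the RPC handle to the remote driver. */
  interface DriverRpc {
    Future<String> getAppId();
  }

  private final DriverRpc driverRpc;
  private final long futureTimeoutMs;
  private volatile String appId;   // cached application id

  public RemoteSparkJobStatusSketch(DriverRpc driverRpc, long futureTimeoutMs) {
    this.driverRpc = driverRpc;
    this.futureTimeoutMs = futureTimeoutMs;
  }

  /** Pure getter: returns whatever is cached (possibly null), no network trip. */
  public String getAppId() {
    return appId;
  }

  /** Invokes the remote RPC once, caches the result, and returns it. */
  public String fetchAppId() throws Exception {
    if (appId == null) {
      appId = driverRpc.getAppId().get(futureTimeoutMs, TimeUnit.MILLISECONDS);
    }
    return appId;
  }
}
{code}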
> HoS: avoid waiting for RemoteSparkJobStatus::getAppID() when remote driver
> died
> -------------------------------------------------------------------------------
>
> Key: HIVE-16984
> URL: https://issues.apache.org/jira/browse/HIVE-16984
> Project: Hive
> Issue Type: Bug
> Components: Spark
> Reporter: Chao Sun
> Assignee: Chao Sun
> Attachments: HIVE-16984.1.patch
>
>
> In HoS, after a RemoteDriver is launched, it may fail to initialize a Spark
> context, in which case the ApplicationMaster will eventually die. When that
> happens, there are two issues related to RemoteSparkJobStatus::getAppID():
> 1. Currently we call {{getAppID()}} before starting the monitoring job. The
> former waits for {{hive.spark.client.future.timeout}}, and the latter waits
> for {{hive.spark.job.monitor.timeout}}. The error message for the latter
> treats {{hive.spark.job.monitor.timeout}} as the total time spent waiting for
> job submission. However, this is inaccurate since it doesn't include
> {{hive.spark.client.future.timeout}}.
> 2. If the RemoteDriver dies suddenly, we may still wait fruitlessly for the
> full timeouts. This could be avoided if we know that the channel between the
> client and the remote driver has closed, as sketched below.
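> A rough sketch of that idea, assuming the client exposes some liveness check
> for the connection to the remote driver (names here are hypothetical, not the
> actual spark-client API):
> {code:java}
> // Illustrative only: wait for the app id, but give up early if the channel
> // to the remote driver is known to be closed, instead of blocking for the
> // full hive.spark.client.future.timeout.
> import java.util.concurrent.Future;
> import java.util.concurrent.TimeUnit;
>
> public class AppIdWaitSketch {
>
>   /** Minimal stand-in for the real client; isActive() is a hypothetical liveness check. */
>   interface DriverRpc {
>     Future<String> getAppId();
>     boolean isActive();
>   }
>
>   public static String waitForAppId(DriverRpc rpc, long timeoutMs) throws Exception {
>     Future<String> appIdFuture = rpc.getAppId();
>     long deadline = System.currentTimeMillis() + timeoutMs;
>     while (System.currentTimeMillis() < deadline) {
>       if (appIdFuture.isDone()) {
>         return appIdFuture.get();
>       }
>       if (!rpc.isActive()) {   // channel closed: the driver died, so stop waiting
>         throw new IllegalStateException("Remote Spark driver died; aborting getAppId()");
>       }
>       TimeUnit.MILLISECONDS.sleep(100);   // short poll instead of one long blocking wait
>     }
>     throw new IllegalStateException("Timed out waiting for the Spark application id");
>   }
> }
> {code}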
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)