[ 
https://issues.apache.org/jira/browse/HIVE-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-15860:
--------------------------
    Attachment: HIVE-15860.1.patch

Actually, the Rpc listener can detect that the remote end has become inactive and 
mark the SparkClient as dead accordingly. However, since the Hive side hasn't 
received the JobId, RemoteSparkJobStatus won't contact the remote driver for job 
info and simply returns null to the monitor, which therefore never finds out the 
Rpc channel has already closed.
The patch adds a check each time the monitor runs to make sure the remote 
context is still alive. The fix applies to remote mode only, because I don't 
think the issue exists in local mode.
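The idea behind the patch can be sketched as follows. This is an illustrative simulation only, not Hive's actual code: `RemoteContext`, `isActive()`, and the return strings are all hypothetical stand-ins for the monitor loop and the Rpc channel state.

```java
// Hypothetical sketch of the liveness check described above.
// None of these names are Hive's real API; they only illustrate the idea.
public class MonitorSketch {
    // Stand-in for the remote Spark context / Rpc channel.
    static class RemoteContext {
        private volatile boolean active = true;
        boolean isActive() { return active; }
        void close() { active = false; }
    }

    // Without the liveness check, a monitor that never receives job info
    // would loop forever. With the check, it fails fast when the remote
    // end dies.
    static String monitor(RemoteContext ctx, int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            if (!ctx.isActive()) {
                return "FAILED"; // remote driver died; abort monitoring
            }
            Object jobInfo = null; // JobId never received, so info stays null
            if (jobInfo != null) {
                return "DONE";
            }
        }
        return "TIMEOUT";
    }

    public static void main(String[] args) {
        RemoteContext ctx = new RemoteContext();
        ctx.close(); // simulate RemoteDriver killed with kill -9
        System.out.println(monitor(ctx, 100)); // prints FAILED
    }
}
```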

> RemoteSparkJobMonitor may hang when RemoteDriver exits abnormally
> -----------------------------------------------------------------
>
>                 Key: HIVE-15860
>                 URL: https://issues.apache.org/jira/browse/HIVE-15860
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-15860.1.patch
>
>
> It happens when RemoteDriver crashes between {{JobStarted}} and 
> {{JobSubmitted}}, e.g. when killed with {{kill -9}}. RemoteSparkJobMonitor 
> will consider the job started, but it can't get the job info because it 
> hasn't received the JobId. The monitor then loops forever.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)