[ https://issues.apache.org/jira/browse/HIVE-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rui Li updated HIVE-15860:
--------------------------
    Attachment: HIVE-15860.1.patch

Actually, the Rpc listener can detect when the remote end becomes inactive and marks SparkClient as dead accordingly. However, since we haven't received the JobId on the Hive side, RemoteSparkJobStatus won't contact the remote driver for job info and simply returns null to the monitor, so the monitor never finds out that the Rpc channel has already closed. The patch adds a check each time the monitor runs to make sure the remote context is still alive. The fix is for remote mode only, because I don't think the issue exists in local mode.

> RemoteSparkJobMonitor may hang when RemoteDriver exits abnormally
> -----------------------------------------------------------------
>
>                 Key: HIVE-15860
>                 URL: https://issues.apache.org/jira/browse/HIVE-15860
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-15860.1.patch
>
>
> It happens when RemoteDriver crashes between {{JobStarted}} and
> {{JobSubmitted}}, e.g. killed by {{kill -9}}. RemoteSparkJobMonitor will
> consider the job has started, however it can't get the job info because it
> hasn't received the JobId. Then the monitor will loop forever.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
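The hang and the proposed fix can be sketched as below. This is a minimal illustration only, not the actual patch: the class and method names (RemoteContext, isActive, getJobInfo, monitor) are hypothetical stand-ins for the real Hive SparkClient/RemoteSparkJobStatus machinery.

```java
// Hypothetical sketch of the RemoteSparkJobMonitor hang described in
// HIVE-15860, with made-up names -- not the real Hive classes.
public class MonitorSketch {

    // Stands in for the Rpc channel to the RemoteDriver. The Rpc
    // listener would call markDead() when the remote end goes away.
    static class RemoteContext {
        private volatile boolean active = true;
        boolean isActive() { return active; }
        void markDead() { active = false; }
    }

    // Mirrors RemoteSparkJobStatus returning null before JobSubmitted:
    // no JobId was ever received, so there is no job info to fetch.
    static Object getJobInfo() { return null; }

    /**
     * Polls for job info. Without the isActive() check, a null job info
     * would make this loop spin forever after the driver is killed.
     * Returns false if the monitor bails out because the driver is gone.
     */
    static boolean monitor(RemoteContext ctx, int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            if (!ctx.isActive()) {
                return false; // the fix: fail fast when the remote context died
            }
            Object info = getJobInfo();
            if (info != null) {
                return true;  // job info arrived; proceed normally
            }
        }
        return true; // bounded here only so the sketch terminates
    }

    public static void main(String[] args) {
        RemoteContext ctx = new RemoteContext();
        ctx.markDead(); // simulate RemoteDriver killed with kill -9
        System.out.println(monitor(ctx, 1000)); // exits instead of hanging
    }
}
```

With the check in place, the monitor detects the dead channel on its next iteration and aborts, instead of polling a job status that can never arrive.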