[ https://issues.apache.org/jira/browse/HIVE-22687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ashutosh Chauhan updated HIVE-22687:
------------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
Status: Resolved (was: Patch Available)
Pushed to master. Thanks, Himanshu!
> Query hangs indefinitely if LLAP daemon registers after the query is submitted
> ------------------------------------------------------------------------------
>
> Key: HIVE-22687
> URL: https://issues.apache.org/jira/browse/HIVE-22687
> Project: Hive
> Issue Type: Bug
> Components: llap
> Affects Versions: 3.1.0
> Reporter: Himanshu Mishra
> Assignee: Himanshu Mishra
> Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-22687.01.patch, HIVE-22687.02.patch
>
>
> If a query is submitted while no LLAP daemon is running, it waits for 1 minute
> and then times out with the error {{SERVICE_UNAVAILABLE}}.
> If a new LLAP daemon starts during that wait, the timeout is cancelled, but
> the tasks still do not get scheduled. As a result, the query hangs
> indefinitely.
> This is caused by a race condition: the LLAP daemon first registers the LLAP
> instance at {{.../workers/worker-0000}} and only afterwards registers
> {{.../workers/slot-0000}}. In the gap between the two, the Tez AM is notified
> of the worker ZooKeeper node and, while processing it, checks whether the slot
> ZooKeeper node is present; if it is not, it rejects the LLAP daemon. The error
> in the Tez AM is:
> {code:java}
> [INFO] [LlapScheduler] |impl.LlapZookeeperRegistryImpl|: Unknown slot for
> 8ebfdc45-0382-4757-9416-52898885af90{code}
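
The ordering problem described above can be reproduced in a small, single-threaded sketch. Everything in it is hypothetical and exists only for illustration: the `Registry` and `Scheduler` classes, the shortened node names `worker-0000` and `slot-0000`, and the `tolerant` flag are not Hive or ZooKeeper APIs. The "tolerant" variant shows one possible way to survive the gap (remember the worker and accept it once its slot node appears); it is an assumption, not a description of what the attached patches do.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SlotRaceSketch {

    /** Toy stand-in for the ZooKeeper namespace: just a set of node names. */
    static final class Registry {
        final Set<String> nodes = new HashSet<>();
    }

    /** Toy stand-in for the scheduler side running in the Tez AM. */
    static final class Scheduler {
        final Registry registry;
        final boolean tolerant;                       // hypothetical switch
        final List<String> accepted = new ArrayList<>();
        final Set<String> pending = new HashSet<>();  // used only when tolerant

        Scheduler(Registry registry, boolean tolerant) {
            this.registry = registry;
            this.tolerant = tolerant;
        }

        /** Called when the worker node becomes visible. */
        void onWorkerAdded(String id) {
            if (registry.nodes.contains("slot-" + id)) {
                accepted.add(id);
            } else if (tolerant) {
                pending.add(id);   // wait for the slot node instead of rejecting
            }
            // Strict variant: the worker is dropped here ("Unknown slot for <id>").
        }

        /** Called when the slot node becomes visible. */
        void onSlotAdded(String id) {
            if (pending.remove(id)) {
                accepted.add(id);
            }
        }
    }

    /** Replays the order from the report: worker node first, slot node second. */
    static List<String> simulate(boolean tolerant) {
        Registry registry = new Registry();
        Scheduler scheduler = new Scheduler(registry, tolerant);

        registry.nodes.add("worker-0000");
        scheduler.onWorkerAdded("0000");   // slot node does not exist yet

        registry.nodes.add("slot-0000");
        scheduler.onSlotAdded("0000");

        return scheduler.accepted;
    }

    public static void main(String[] args) {
        System.out.println("strict:   " + simulate(false)); // prints "strict:   []"
        System.out.println("tolerant: " + simulate(true));  // prints "tolerant: [0000]"
    }
}
```

In the strict run the daemon is never accepted even though both nodes exist by the end, which matches the reported hang: the timeout has already been cancelled, yet the scheduler has no daemon to assign tasks to.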