[
https://issues.apache.org/jira/browse/HIVE-22687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007080#comment-17007080
]
Himanshu Mishra commented on HIVE-22687:
----------------------------------------
Thanks [~gopalv] for looking into this.
The patch reorders registration of the {{slot-000}} and {{worker-000}} ZooKeeper
nodes, ensuring that the slot zk node is already present when the worker zk
node is registered.
On the client side (Tez AM), we listen only on the worker zk node and then
schedule any queued tasks in {{LlapTaskSchedulerService}}. To run a task we get
the list of all running LLAP daemon instances via
{{LlapZookeeperRegistryImpl#getAllInstancesOrdered}}, which returns only those
worker instances whose slot zk node is already present. With the patch, the
associated slot zk node always exists by the time a new worker zk node is
registered, so this worker (LLAP instance) is always returned by this method.
This client-side filtering caused the issue in the first place, because we were
reaching this method before the slot zk node had been registered.
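
The ordering argument above can be illustrated with a small, self-contained sketch. It is a toy in-memory model, not Hive or ZooKeeper code: the class {{RegistrationOrderSketch}}, its methods, and the convention of pairing {{worker-N}} with {{slot-N}} by suffix are all assumptions made for illustration. It only shows why a listener that fires on the worker node and filters by slot presence sees nothing when the worker node is created first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Consumer;

/** Toy in-memory stand-in for the registry; hypothetical, not Hive code. */
class RegistrationOrderSketch {
    private final Set<String> nodes = new TreeSet<>();
    private final List<Consumer<String>> workerListeners = new ArrayList<>();

    /** Registers a callback that fires only when a worker node is created. */
    void onWorkerCreated(Consumer<String> listener) {
        workerListeners.add(listener);
    }

    /** Creates a node and synchronously notifies listeners for worker nodes. */
    void create(String path) {
        nodes.add(path);
        if (path.startsWith("worker-")) {
            for (Consumer<String> listener : workerListeners) {
                listener.accept(path);
            }
        }
    }

    /** Mimics the client-side filter: only workers whose slot node exists.
     *  Pairing worker-N with slot-N by suffix is an assumption of this sketch. */
    List<String> instancesWithSlot() {
        List<String> result = new ArrayList<>();
        for (String node : nodes) {
            if (node.startsWith("worker-")
                    && nodes.contains("slot-" + node.substring("worker-".length()))) {
                result.add(node);
            }
        }
        return result;
    }

    /** Returns the instances the listener could see at notification time. */
    static List<String> simulate(boolean slotFirst) {
        RegistrationOrderSketch registry = new RegistrationOrderSketch();
        List<String> seen = new ArrayList<>();
        registry.onWorkerCreated(path -> seen.addAll(registry.instancesWithSlot()));
        if (slotFirst) {
            registry.create("slot-0");    // patched order: slot, then worker
            registry.create("worker-0");
        } else {
            registry.create("worker-0");  // old order: worker, then slot
            registry.create("slot-0");
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("worker first: " + simulate(false));
        System.out.println("slot first:   " + simulate(true));
    }
}
```

In this model the worker-first order leaves the listener with an empty list, because no later event re-triggers scheduling, while the slot-first order returns the new worker on the first notification.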
> Query hangs indefinitely if LLAP daemon registers after the query is submitted
> ------------------------------------------------------------------------------
>
> Key: HIVE-22687
> URL: https://issues.apache.org/jira/browse/HIVE-22687
> Project: Hive
> Issue Type: Bug
> Components: llap
> Affects Versions: 3.1.0
> Reporter: Himanshu Mishra
> Assignee: Himanshu Mishra
> Priority: Major
> Attachments: HIVE-22687.01.patch, HIVE-22687.02.patch
>
>
> If a query is submitted while no LLAP daemon is running, it waits for 1 minute
> and then times out with the error {{SERVICE_UNAVAILABLE}}.
> If a new LLAP daemon starts during this wait, the timeout is cancelled
> but the tasks still do not get scheduled. As a result, the query hangs
> indefinitely.
> This is due to a race condition: the LLAP daemon first registers the LLAP
> instance at {{.../workers/worker-0000}} and only afterwards registers
> {{.../workers/slot-0000}}. In the gap between the two, the Tez AM is notified
> of the worker zk node and, while processing it, checks whether the slot zk
> node is present; if not, it rejects the LLAP daemon. The error in the Tez AM is:
> {code:java}
> [INFO] [LlapScheduler] |impl.LlapZookeeperRegistryImpl|: Unknown slot for
> 8ebfdc45-0382-4757-9416-52898885af90{code}