[
https://issues.apache.org/jira/browse/TEZ-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Prasanth Jayachandran updated TEZ-4179:
---------------------------------------
Attachment: TEZ-4179.1.patch
> [Kubernetes] Extend NodeId in tez to support unique worker identity
> -------------------------------------------------------------------
>
> Key: TEZ-4179
> URL: https://issues.apache.org/jira/browse/TEZ-4179
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Prasanth Jayachandran
> Assignee: Prasanth Jayachandran
> Priority: Major
> Attachments: TEZ-4179.1.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In a Kubernetes environment, where pods can have the same host name and port,
> node trackers can end up retaining an old instance of a pod in their cache.
> In the case of Hive LLAP, the LLAP Tez task scheduler maintains node
> membership based on ZooKeeper registry events, so a NODE_ADDED event followed
> by a NODE_REMOVED event can end up removing the node/host from the node
> trackers because the hostname and service port are stable. The NODE_REMOVED
> event in this case is a stale event for the already dead pod, which ZK sends
> only after the session timeout (in the case of a non-graceful shutdown). If
> this sequence of events happens, the node/host is completely lost from the
> scheduler's perspective.
> To support this scenario, Tez can extend YARN's NodeId to include a
> uniqueIdentifier. The LLAP task scheduler can then construct the container
> object with this new NodeId, which includes the uniqueIdentifier, so that
> stale events like the one above remove only the host/node that matches the
> old uniqueIdentifier.
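
The idea described above can be sketched roughly as follows. This is only an illustration and not the contents of TEZ-4179.1.patch: UniqueNodeId and NodeTracker are hypothetical stand-ins (the sketch does not use YARN's NodeId class at all), and the host name, port, and identifier values in the usage note are made up. It assumes the tracker is keyed by host:port and that a removal event carries the identifier of the pod it refers to.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a NodeId that also carries a unique worker identity.
final class UniqueNodeId {
    private final String host;
    private final int port;
    private final String uniqueIdentifier; // assumed to differ per pod instance

    UniqueNodeId(String host, int port, String uniqueIdentifier) {
        this.host = Objects.requireNonNull(host);
        this.port = port;
        this.uniqueIdentifier = Objects.requireNonNull(uniqueIdentifier);
    }

    // The stable part of the identity, shared by old and new pod instances.
    String hostPort() {
        return host + ":" + port;
    }

    String getUniqueIdentifier() {
        return uniqueIdentifier;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof UniqueNodeId)) {
            return false;
        }
        UniqueNodeId other = (UniqueNodeId) o;
        return port == other.port
            && host.equals(other.host)
            && uniqueIdentifier.equals(other.uniqueIdentifier);
    }

    @Override
    public int hashCode() {
        return Objects.hash(host, port, uniqueIdentifier);
    }
}

// Hypothetical node tracker keyed by host:port that ignores stale removals.
final class NodeTracker {
    private final Map<String, UniqueNodeId> nodes = new ConcurrentHashMap<>();

    // NODE_ADDED: the newest instance replaces any older one on the same host:port.
    void nodeAdded(UniqueNodeId id) {
        nodes.put(id.hostPort(), id);
    }

    // NODE_REMOVED: remove only if the tracked instance has the same
    // uniqueIdentifier; a stale event for a dead pod leaves the new pod alone.
    boolean nodeRemoved(UniqueNodeId id) {
        return nodes.remove(id.hostPort(), id);
    }

    boolean contains(String host, int port) {
        return nodes.containsKey(host + ":" + port);
    }
}
```

With this sketch, adding new UniqueNodeId("llap-0", 15001, "uid-new") and then processing a late NODE_REMOVED for new UniqueNodeId("llap-0", 15001, "uid-old") leaves llap-0:15001 in the tracker, because the two-argument Map.remove only deletes the entry when the stored value equals the one passed in.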