[ 
https://issues.apache.org/jira/browse/TEZ-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107697#comment-17107697
 ] 

Prasanth Jayachandran commented on TEZ-4179:
--------------------------------------------

[~jeagles]  But this is an additional, optional field that will not be used by 
the default Tez container mode. Currently it is only planned for use by 
Hive on LLAP in Kubernetes environments, where the assumption behind host + port 
based caching no longer holds. Pods keep stable hostnames and ports across 
restarts, so pod instances can only be differentiated by a unique identifier 
(the pod UID) that changes on every restart. 

> [Kubernetes] Extend NodeId in tez to support unique worker identity
> -------------------------------------------------------------------
>
>                 Key: TEZ-4179
>                 URL: https://issues.apache.org/jira/browse/TEZ-4179
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Major
>         Attachments: TEZ-4179.1.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In a Kubernetes environment, where pods can have the same hostname and port, 
> node trackers can end up retaining an old instance of a pod in their cache. 
> In the case of Hive LLAP, where the LLAP Tez task scheduler maintains node 
> membership based on ZooKeeper registry events, a NODE_ADDED event followed 
> by a NODE_REMOVED event can end up removing the node/host from the node 
> trackers because the hostname and service port are stable. The NODE_REMOVED 
> event in this case is a stale event for the already dead pod, which ZK 
> sends only after the session timeout (in the case of a non-graceful 
> shutdown). If this sequence of events happens, a node/host is completely 
> lost from the scheduler's perspective. 
> To support this scenario, Tez can extend YARN's NodeId to include a 
> uniqueIdentifier. The LLAP task scheduler can construct the container object 
> with this new NodeId, which includes the uniqueIdentifier, so that stale 
> events like the one above only remove the host/node that matches the old 
> uniqueIdentifier. 
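
The scenario in the description can be sketched roughly as follows. This is 
only a minimal Java illustration, not the attached patch: WorkerNodeId, 
NodeTracker, and the host, port and UID values are hypothetical names made up 
for the example, and the sketch does not use or extend YARN's real NodeId 
class. It assumes the tracker is keyed by host:port and compares the unique 
identifier before honouring a removal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical node identity: host + port plus a per-instance unique id (e.g. pod UID). */
final class WorkerNodeId {
    final String host;
    final int port;
    final String uniqueIdentifier;

    WorkerNodeId(String host, int port, String uniqueIdentifier) {
        this.host = host;
        this.port = port;
        this.uniqueIdentifier = uniqueIdentifier;
    }

    String hostPort() {
        return host + ":" + port;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof WorkerNodeId)) return false;
        WorkerNodeId other = (WorkerNodeId) o;
        return port == other.port
            && host.equals(other.host)
            && uniqueIdentifier.equals(other.uniqueIdentifier);
    }

    @Override
    public int hashCode() {
        return Objects.hash(host, port, uniqueIdentifier);
    }
}

/** Hypothetical tracker: one live instance per host:port. */
final class NodeTracker {
    private final Map<String, WorkerNodeId> live = new HashMap<>();

    void nodeAdded(WorkerNodeId id) {
        // A restarted pod reuses host:port, so it replaces the old entry.
        live.put(id.hostPort(), id);
    }

    /** Removes the entry only if the event names the instance currently tracked. */
    boolean nodeRemoved(WorkerNodeId id) {
        // Map.remove(key, value) removes only when the stored value equals id.
        return live.remove(id.hostPort(), id);
    }

    boolean isTracked(String host, int port) {
        return live.containsKey(host + ":" + port);
    }
}

final class StaleEventDemo {
    public static void main(String[] args) {
        NodeTracker tracker = new NodeTracker();
        // Made-up hostname, port and UIDs, for illustration only.
        WorkerNodeId oldPod = new WorkerNodeId("llap-0.example", 15001, "uid-old");
        WorkerNodeId newPod = new WorkerNodeId("llap-0.example", 15001, "uid-new");

        tracker.nodeAdded(oldPod);
        tracker.nodeAdded(newPod);                       // NODE_ADDED for the restarted pod
        boolean removed = tracker.nodeRemoved(oldPod);   // stale NODE_REMOVED arrives late

        System.out.println("stale removal applied: " + removed);   // false
        System.out.println("node still tracked: "
            + tracker.isTracked("llap-0.example", 15001));           // true
    }
}
```

With a key of host + port alone, the late NODE_REMOVED would have deleted the 
entry for the new pod. Comparing the unique identifier makes the stale event a 
no-op, while a removal that names the current instance still takes effect.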



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
