[ 
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703959#comment-17703959
 ] 

ASF GitHub Bot commented on HDFS-16918:
---------------------------------------

virajjasani commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1480735197

   @ayushtkn to give you an update: the issue for which I was considering 
this optional behavior has already been fixed by 
https://github.com/apache/hadoop/commit/26fba8701c97928bb2ed2e6b456ab5ba9513e0fe
   We no longer see any transient connection failures after this commit. We are 
also working on harmonizing the socket connection timeouts for all daemons to 
keep them in sync with the OS-level settings.
   
   Hence we no longer need the functionality of this PR, but we did end up 
building some resilience into the k8s operators to deal with transient 
failures in the future, i.e. bounce the DN pod if it does not stay consistently 
connected to the active NN pod. We had to make some connection ports 
accessible, etc. :)
   
   Thanks




> Optionally shut down datanode if it does not stay connected to active namenode
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-16918
>                 URL: https://issues.apache.org/jira/browse/HDFS-16918
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>
> When deploying HDFS on an Envoy proxy setup, network connection issues or 
> packet loss can be observed, depending on the socket timeout configured at 
> Envoy. The Envoys essentially form a transparent communication mesh in 
> which each application sends and receives packets to and from localhost and 
> is unaware of the network topology.
> The primary purpose of Envoy is to make the network transparent to 
> applications so that network issues can be identified reliably. However, 
> such a proxy-based setup can sometimes result in socket connection issues 
> between the datanode and the namenode.
> Many deployment frameworks provide auto-start functionality when any of the 
> Hadoop daemons is stopped. A given datanode that does not stay connected to 
> the active namenode in the cluster, i.e. does not receive a heartbeat 
> response in time from the active namenode (even though the active namenode 
> has not terminated), is not of much use. We should provide configurable 
> behavior such that if a given datanode cannot receive a heartbeat response 
> from the active namenode within a configurable duration, it terminates 
> itself to avoid impacting the availability SLA. This is specifically helpful 
> when the underlying deployment or orchestration framework (e.g. K8s) can 
> start the datanode automatically upon its shutdown (unless it is being 
> restarted as part of a rolling upgrade) and help the newly brought-up 
> datanode (in the case of k8s, a new pod on dynamically changing nodes) 
> establish new socket connections to the active and standby namenodes. This 
> should be opt-in behavior, not the default.
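The opt-in behavior proposed in the description could be sketched roughly as
below. All names here (the class, the timeout parameter) are hypothetical
illustrations, not the actual Hadoop implementation or its configuration keys:

```java
// Hypothetical sketch of the proposed opt-in behavior: track the time of the
// last heartbeat response from the active NameNode, and signal that the
// DataNode should terminate itself once the silence exceeds a configurable
// threshold, letting the orchestrator (e.g. k8s) restart it.
// Not the actual Hadoop API; names are illustrative only.
public class HeartbeatLivenessMonitor {
    private final long maxSilenceMs;               // configurable threshold
    private volatile long lastHeartbeatResponseMs; // updated on each response

    public HeartbeatLivenessMonitor(long maxSilenceMs) {
        this.maxSilenceMs = maxSilenceMs;
        this.lastHeartbeatResponseMs = System.currentTimeMillis();
    }

    /** Called whenever a heartbeat response arrives from the active NameNode. */
    public void onHeartbeatResponse() {
        lastHeartbeatResponseMs = System.currentTimeMillis();
    }

    /** True once the configured silence window has been exceeded. */
    public boolean shouldTerminate(long nowMs) {
        return nowMs - lastHeartbeatResponseMs > maxSilenceMs;
    }

    public static void main(String[] args) {
        HeartbeatLivenessMonitor m = new HeartbeatLivenessMonitor(5000);
        m.onHeartbeatResponse();
        // Just heard back: should not terminate.
        System.out.println(m.shouldTerminate(System.currentTimeMillis()));
        // Simulate 10 s of silence against a 5 s threshold: should terminate.
        System.out.println(m.shouldTerminate(System.currentTimeMillis() + 10_000));
    }
}
```

In a real deployment the termination path would also need to be suppressed
during rolling upgrades, as the description notes, so that a planned restart
is not mistaken for a lost connection.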



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
