[ 
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689989#comment-17689989
 ] 

ASF GitHub Bot commented on HDFS-16918:
---------------------------------------

ayushtkn commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433706087

   I think I have lost the flow now 😅
   
   But I think using getDataNodeStats is a cool thing to explore; it runs 
under a read lock, so it is not costly either, and it might be easier to 
process as well...
   
   Usually with metrics, if something can be derived from the ones already 
exposed, we don't coin new ones; generally there are tools that can do that 
logic, combine metrics, do the maths on them, and show you fancy graphs and 
all...
   
   The DN case seems like a corner case; it won't be very common, and we need 
to be careful not to slip past a split-brain scenario. There are a bunch of 
checks around, though, but they only verify that a false active claim doesn't 
get acknowledged...
   
   But just thinking about this case, it can be figured out with simple logic 
or scripts: if two NameNodes claim to be active, the one with the smaller time 
since its last response can be used for those decisions.
   
   Something like
   ```
   Variables to store: activeNnId, lastActiveResponseTime = MAX
   Fetch metrics from the DN
   Iterate over all NameNodes:
       if NN claims active and nnLastResponseTime < lastActiveResponseTime:
           store this nnId and its lastResponseTime
       else:
           move forward

   if lastActiveResponseTime > configurable threshold:
       conclude the active is dead and
       do <Whatever>
   ```
   
   Maybe some if/else or an equality got inverted; this is just for the idea's 
sake...
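
   The sketch above could look roughly like the following standalone script. 
This is only a minimal illustration of the "pick the active NN with the most 
recent response, then check staleness" idea; the metric record structure and 
field names (`nnId`, `state`, `msSinceLastResponse`) are hypothetical and do 
not match any real DN JMX bean.

   ```python
   def pick_active_nn(nn_metrics, stale_threshold_ms):
       """Select the active NameNode with the most recent response.

       nn_metrics: list of dicts with hypothetical keys 'nnId', 'state',
       and 'msSinceLastResponse' (time since the DN last heard from that NN).
       Returns (nnId, is_live): is_live is False when no NN claims active or
       when even the freshest active response is older than the threshold.
       """
       active_nn_id = None
       last_active_response_ms = float("inf")  # LastActiveResponseTime = MAX

       for nn in nn_metrics:
           if nn["state"] != "active":
               continue  # not claiming active: move forward
           if nn["msSinceLastResponse"] < last_active_response_ms:
               # this active NN responded more recently; remember it
               active_nn_id = nn["nnId"]
               last_active_response_ms = nn["msSinceLastResponse"]

       if active_nn_id is None:
           return None, False
       # if the freshest active response is older than the configured
       # threshold, conclude the active is dead and do <Whatever>
       return active_nn_id, last_active_response_ms < stale_threshold_ms
   ```

   With two NNs both claiming active, the one with the smaller 
`msSinceLastResponse` wins, matching the split-brain tiebreak described above.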




> Optionally shut down datanode if it does not stay connected to active namenode
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-16918
>                 URL: https://issues.apache.org/jira/browse/HDFS-16918
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>
> While deploying HDFS on an Envoy proxy setup, network connection issues or 
> packet loss can be observed, depending on the socket timeout configured at 
> Envoy. The Envoy proxies form a transparent communication mesh in which each 
> application sends and receives packets to and from localhost and is unaware 
> of the network topology.
> The primary purpose of Envoy is to make the network transparent to 
> applications so that network issues can be identified reliably. However, 
> such a proxy-based setup can sometimes result in socket connection issues 
> between the datanode and the namenode.
> Many deployment frameworks provide auto-start functionality when any of the 
> hadoop daemons are stopped. If a given datanode does not stay connected to 
> the active namenode in the cluster, i.e. does not receive a heartbeat 
> response in time from the active namenode (even though the active namenode 
> is not terminated), it is not of much use. We should be able to provide 
> configurable behavior such that if a given datanode cannot receive a 
> heartbeat response from the active namenode within a configurable time 
> duration, it terminates itself to avoid impacting the availability SLA. This 
> is specifically helpful when the underlying deployment or observability 
> framework (e.g. K8S) can start the datanode automatically upon its shutdown 
> (unless it is being restarted as part of a rolling upgrade) and help the 
> newly brought up datanode (in the case of k8s, a new pod with dynamically 
> changing nodes) establish new socket connections to the active and standby 
> namenodes. This should be an opt-in behavior and not the default.
>  
> In a distributed system, it is essential to have robust fail-fast mechanisms 
> in place to prevent issues related to network partitioning. The system must 
> be designed to prevent further degradation of availability and consistency 
> in the event of a network partition. Several distributed systems offer 
> fail-safe approaches, and for some, partition tolerance is critical to the 
> extent that even a few seconds of heartbeat loss can trigger the removal of 
> an application server instance from the cluster. For instance, a majority of 
> ZooKeeper clients utilize ephemeral nodes for this purpose to make the 
> system reliable, fault-tolerant, and strongly consistent in the event of a 
> network partition.
> From the HDFS architecture viewpoint, it is crucial to understand the 
> critical role that the active and observer namenodes play in file system 
> operations. In a large-scale cluster, if the datanodes holding the same 
> block (primary and replicas) lose connection to both the active and observer 
> namenodes for a significant amount of time, delaying the process of shutting 
> down such datanodes and restarting them to re-establish the connection with 
> the namenodes (assuming the active namenode is alive; this assumption is 
> important in the event of a network partition, in order to re-establish the 
> connection) will further deteriorate the availability of the service. This 
> scenario underscores the importance of resolving network partitioning.
> This is a real use case for HDFS, and it is not prudent to assume that every 
> deployment or cluster management application must be able to restart 
> datanodes based on JMX metrics, as this would introduce another application 
> to resolve the network partition impact on HDFS. Besides, popular cluster 
> management applications are not typically used in all cloud-native 
> environments. Even when such cluster management applications are deployed, 
> certain security constraints may restrict their access to JMX metrics and 
> prevent them from interfering with HDFS operations. Applications that can 
> only trigger alerts for users based on set parameters (for instance, missing 
> blocks > 0) are the ones allowed to access JMX metrics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
