[jira] [Commented] (HADOOP-18396) Issues running in dynamic / managed environments

Nick Dimiduk (Jira) Thu, 20 Oct 2022 03:25:43 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-18396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620989#comment-17620989
 ]


Nick Dimiduk commented on HADOOP-18396:
---------------------------------------

We had a cluster trip on the old IP address issue, after a maintenance 
incident, the datanode by its old IP is in the dead servers list.

> Issues running in dynamic / managed environments
> ------------------------------------------------
>
>                 Key: HADOOP-18396
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18396
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.4.0, 3.3.5, 3.3.4
>         Environment: Running an HA configuration in Kubernetes, using Java 11.
>            Reporter: Steve Vaughan
>            Assignee: Steve Vaughan
>            Priority: Major
>
> Running in dynamic or managed environments is a challenge because we can't 
> assume that all services will have DNS entries, will be started in a specific 
> order, will maintain constant IP addresses, etc.  I'm using the following 
> assumptions to guide the changes necessary to operate in this kind of 
> environment:
>  # The configuration files are an expression of desired state
>  # If a referenced service instance is not resolvable or reachable at a 
> moment in time, it will be eventually and should be able to participate in 
> the future, as if it had been there originally, without requiring manual 
> intervention
>  # IP address changes should be handled in a way that no only allows 
> distributed calls to continue to function, but avoids having to re-resolve 
> the address over and over
>  # Code that requires resolved names (Kerberos and DataNode registration) 
> should fall back to DNS reverse lookups to work around temporary issues 
> caused by caching.  Example: The DataNode registration is only performed at 
> startup, and yet the extra check that allows it to succeed in registering 
> with the NameNode isn’t performed
>  # If an HA system is supposed to only require a quorum, then we shouldn’t 
> require the full set, allowing the called service to bring the remaining 
> instances into compliance
>  # Managing a service should be independent of other services.  Example: You 
> should be able to perform a rolling restart of JournalNodes without worrying 
> about causing an issue with NameNodes as long as a quorum is present.
> A proof of these concepts would be the ability to:
>  * Start with less that the full replica count of a service, while still 
> providing the required quorum or minimal count, should still allow a cluster 
> to start and function.  Example: 2 out of 3 configured JournalNodes should 
> still allow the NameNode to format, function, rollover to the standby, etc.
>  * Introduce missing instances should join the existing cluster without 
> manual intervention.  Example: Starting the 3rd JournalNode should 
> automatically be formatted and brought up to date
>  * Perform rolling restarts of individual services without negatively 
> impacting other services (causing failures, restarts, etc.).  Example: 
> Rolling restarts of JournalNodes shouldn't cause problems in NameNodes; 
> Rolling restarts of NameNodes shouldn't cause problems with DataNodes
>  * Logs should only report updated IP addresses once (per dependent), 
> avoiding costly re-resolution



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HADOOP-18396) Issues running in dynamic / managed environments

Reply via email to