[
https://issues.apache.org/jira/browse/HADOOP-18396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620989#comment-17620989
]
Nick Dimiduk commented on HADOOP-18396:
---------------------------------------
We had a cluster trip on the old IP address issue: after a maintenance
incident, the DataNode is still listed by its old IP in the dead servers list.
> Issues running in dynamic / managed environments
> ------------------------------------------------
>
> Key: HADOOP-18396
> URL: https://issues.apache.org/jira/browse/HADOOP-18396
> Project: Hadoop Common
> Issue Type: Improvement
> Affects Versions: 3.4.0, 3.3.5, 3.3.4
> Environment: Running an HA configuration in Kubernetes, using Java 11.
> Reporter: Steve Vaughan
> Assignee: Steve Vaughan
> Priority: Major
>
> Running in dynamic or managed environments is a challenge because we can't
> assume that all services will have DNS entries, will be started in a specific
> order, will maintain constant IP addresses, etc. I'm using the following
> assumptions to guide the changes necessary to operate in this kind of
> environment:
> # The configuration files are an expression of desired state
> # If a referenced service instance is not resolvable or reachable at a
> given moment, it eventually will be, and it should then be able to
> participate as if it had been there originally, without requiring manual
> intervention
> # IP address changes should be handled in a way that not only allows
> distributed calls to continue to function, but avoids having to re-resolve
> the address over and over
> # Code that requires resolved names (Kerberos and DataNode registration)
> should fall back to DNS reverse lookups to work around temporary issues
> caused by caching. Example: The DataNode registration is only performed at
> startup, and yet the extra check that allows it to succeed in registering
> with the NameNode isn’t performed
> # If an HA system is supposed to only require a quorum, then we shouldn’t
> require the full set, allowing the called service to bring the remaining
> instances into compliance
> # Managing a service should be independent of other services. Example: You
> should be able to perform a rolling restart of JournalNodes without worrying
> about causing an issue with NameNodes as long as a quorum is present.
> A proof of these concepts would be the ability to:
> * Start with fewer than the full replica count of a service: as long as the
> required quorum or minimal count is present, the cluster should still start
> and function. Example: 2 out of 3 configured JournalNodes should
> still allow the NameNode to format, function, roll over to the standby, etc.
> * Introduce missing instances, which should join the existing cluster
> without manual intervention. Example: a 3rd JournalNode, once started,
> should automatically be formatted and brought up to date
> * Perform rolling restarts of individual services without negatively
> impacting other services (causing failures, restarts, etc.). Example:
> Rolling restarts of JournalNodes shouldn't cause problems in NameNodes;
> Rolling restarts of NameNodes shouldn't cause problems with DataNodes
> * Logs should only report updated IP addresses once (per dependent),
> avoiding costly re-resolution
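
To make the IP address points above concrete, here is a minimal sketch of the
"cache the address, re-resolve only when asked, log a change once" idea. It is
not the patch attached to this issue and not existing Hadoop code: the
CachedAddress class, its Resolver interface, and the in-memory log list are all
hypothetical names chosen for illustration. The assumption is that a caller
invokes refresh() after a connection failure rather than on every call.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Hadoop): caches one host's resolved address. */
class CachedAddress {
    /** Hypothetical resolver hook so DNS can be replaced by a stub. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    private final String host;
    private final Resolver resolver;
    private InetAddress current;                 // null until the first successful resolution
    final List<String> log = new ArrayList<>();  // stands in for a real logger

    CachedAddress(String host, Resolver resolver) {
        this.host = host;
        this.resolver = resolver;
    }

    /** Returns the cached address, resolving only on first use. */
    synchronized InetAddress get() throws UnknownHostException {
        if (current == null) {
            current = resolver.resolve(host);
        }
        return current;
    }

    /**
     * Re-resolves the host, intended to be called after a connection failure.
     * Logs once per actual change and returns true only if the IP changed.
     * If resolution fails, the exception propagates and the cached value is kept.
     */
    synchronized boolean refresh() throws UnknownHostException {
        InetAddress fresh = resolver.resolve(host);
        if (fresh.equals(current)) {
            return false;  // same IP: nothing to update, nothing to log
        }
        String old = (current == null) ? "unresolved" : current.getHostAddress();
        log.add(host + ": address changed from " + old + " to " + fresh.getHostAddress());
        current = fresh;
        return true;
    }
}
```

A production resolver could be as simple as InetAddress::getByName, keeping in
mind that the JVM has its own DNS cache (the networkaddress.cache.ttl security
property), which may also need tuning for a re-resolve to see a new address.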