Steve Vaughan created HADOOP-18396:
--------------------------------------

             Summary: Issues running in dynamic / managed environments
                 Key: HADOOP-18396
                 URL: https://issues.apache.org/jira/browse/HADOOP-18396
             Project: Hadoop Common
          Issue Type: Improvement
    Affects Versions: 3.4.0, 3.3.9, 3.3.4
         Environment: Running an HA configuration in Kubernetes, using Java 11.
            Reporter: Steve Vaughan
            Assignee: Steve Vaughan

Running in dynamic or managed environments is a challenge because we can't assume that all services will have DNS entries, will be started in a specific order, will maintain constant IP addresses, etc.

I'm using the following assumptions to guide the changes necessary to operate in this kind of environment:
# The configuration files are an expression of desired state.
# If a referenced service instance is not resolvable or reachable at a given moment, it eventually will be, and it should be able to participate in the future as if it had been there originally, without requiring manual intervention.
# IP address changes should be handled in a way that not only allows distributed calls to continue to function, but also avoids having to re-resolve the address over and over.
# Code that requires resolved names (Kerberos and DataNode registration) should fall back to DNS reverse lookups to work around temporary issues caused by caching. Example: DataNode registration is only performed at startup, and yet the extra check that would allow it to succeed in registering with the NameNode isn't performed.
# If an HA system is supposed to require only a quorum, then we shouldn't require the full set; the called service can bring the remaining instances into compliance.
# Managing a service should be independent of other services. Example: You should be able to perform a rolling restart of JournalNodes without worrying about causing an issue with NameNodes, as long as a quorum is present.

A proof of these concepts would be the ability to:
* Start a cluster with fewer than the full replica count of a service, as long as the required quorum or minimal count is present, and have it start and function. Example: 2 out of 3 configured JournalNodes should still allow the NameNode to format, function, roll over to the standby, etc.
* Introduce missing instances later and have them join the existing cluster without manual intervention. Example: A 3rd JournalNode started later should automatically be formatted and brought up to date.
* Perform rolling restarts of individual services without negatively impacting other services (causing failures, restarts, etc.). Example: Rolling restarts of JournalNodes shouldn't cause problems in NameNodes; rolling restarts of NameNodes shouldn't cause problems with DataNodes.
* Report updated IP addresses in the logs only once (per dependent), avoiding costly re-resolution.
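
To make the address-handling assumption and the last proof point concrete (tolerate IP changes without re-resolving on every call, and report a change only once), here is a minimal sketch. It is not taken from Hadoop or from any patch on this issue: the class AddressTracker, its method names, and the pluggable resolver function are all hypothetical, and a real change would have to fit into the existing IPC client code.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper for illustration only; not an existing Hadoop class.
final class AddressTracker {
    // Pluggable lookup (for example a DNS query); empty means "not resolvable yet".
    private final Function<String, Optional<String>> resolver;
    // Last successfully resolved address for each configured host name.
    private final Map<String, String> lastKnown = new ConcurrentHashMap<>();
    private int changesLogged = 0;

    AddressTracker(Function<String, Optional<String>> resolver) {
        this.resolver = resolver;
    }

    // Normal path: reuse the cached address and resolve only on first use.
    Optional<String> current(String host) {
        String cached = lastKnown.get(host);
        return cached != null ? Optional.of(cached) : refresh(host);
    }

    // Failure path: call after a connection error to re-resolve the host.
    synchronized Optional<String> refresh(String host) {
        Optional<String> resolved = resolver.apply(host);
        if (!resolved.isPresent()) {
            // Not resolvable right now: keep what we had and let the caller retry later.
            return Optional.ofNullable(lastKnown.get(host));
        }
        String previous = lastKnown.put(host, resolved.get());
        if (previous != null && !previous.equals(resolved.get())) {
            // Report the change once, at the moment it is detected.
            changesLogged++;
            System.out.println("Address for " + host + " changed: "
                + previous + " -> " + resolved.get());
        }
        return resolved;
    }

    synchronized int changesLogged() {
        return changesLogged;
    }
}
```

In this sketch a dependent service calls current(host) for ordinary calls and refresh(host) only after a connection failure. A host that cannot be resolved yet simply yields an empty result and stays configured, so it can join later without manual intervention.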