Steve Vaughan created HADOOP-18396:
--------------------------------------
Summary: Issues running in dynamic / managed environments
Key: HADOOP-18396
URL: https://issues.apache.org/jira/browse/HADOOP-18396
Project: Hadoop Common
Issue Type: Improvement
Affects Versions: 3.4.0, 3.3.9, 3.3.4
Environment: Running an HA configuration in Kubernetes, using Java 11.
Reporter: Steve Vaughan
Assignee: Steve Vaughan
Running in dynamic or managed environments is a challenge because we can't
assume that all services will have DNS entries, will be started in a specific
order, will maintain constant IP addresses, etc. I'm using the following
assumptions to guide the changes necessary to operate in this kind of
environment:
# The configuration files are an expression of desired state
# If a referenced service instance is not resolvable or reachable at a given
moment, it eventually will be, and it should then be able to participate as if
it had been there from the start, without requiring manual intervention
# IP address changes should be handled in a way that not only allows
distributed calls to continue to function, but also avoids having to re-resolve
the address over and over
# Code that requires resolved names (Kerberos and DataNode registration)
should fall back to DNS reverse lookups to work around temporary issues caused
by caching. Example: The DataNode registration is only performed at startup,
and yet the extra check that allows it to succeed in registering with the
NameNode isn’t performed
# If an HA system is supposed to only require a quorum, then we shouldn’t
require the full set, allowing the called service to bring the remaining
instances into compliance
# Managing a service should be independent of other services. Example: You
should be able to perform a rolling restart of JournalNodes without worrying
about causing an issue with NameNodes as long as a quorum is present.
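To make the quorum assumption concrete, here is a minimal sketch of a majority check. The class and method names (QuorumCheck, majority, hasQuorum) are hypothetical and not part of Hadoop. It assumes a simple majority quorum over the configured instance count.

```java
/** Hypothetical helper for illustration only; not an existing Hadoop class. */
final class QuorumCheck {
    private QuorumCheck() {}

    /** Smallest number of instances that forms a simple majority. */
    static int majority(int configured) {
        return configured / 2 + 1;
    }

    /**
     * True when enough instances are reachable to proceed, even if some
     * configured instances are not yet resolvable or started.
     */
    static boolean hasQuorum(int configured, int reachable) {
        if (configured <= 0) {
            return false; // nothing configured, so no quorum is possible
        }
        return reachable >= majority(configured);
    }
}
```

With 3 configured JournalNodes, hasQuorum(3, 2) is true and hasQuorum(3, 1) is false, which matches the "2 out of 3" expectation in this issue.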
A proof of these concepts would be the ability to:
* Start with fewer than the full replica count of a service, as long as the
required quorum or minimum count is present, and still have the cluster start
and function. Example: 2 out of 3 configured JournalNodes should still allow
the NameNode to format, function, roll over to the standby, etc.
* Introduce missing instances and have them join the existing cluster without
manual intervention. Example: A 3rd JournalNode started later should
automatically be formatted and brought up to date
* Perform rolling restarts of individual services without negatively impacting
other services (causing failures, restarts, etc.). Example: Rolling restarts
of JournalNodes shouldn't cause problems in NameNodes; Rolling restarts of
NameNodes shouldn't cause problems with DataNodes
* Report an updated IP address in the logs only once (per dependent), avoiding
costly re-resolution
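As a rough illustration of the "report once, avoid re-resolving" goal, the following sketch caches the last known address per host and only re-resolves when the caller asks for it (for example after a connection failure). All names here (AddressTracker, current, refresh, the in-memory log list) are hypothetical and not existing Hadoop APIs. The resolver is passed in as a function so the behaviour can be exercised without real DNS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch for illustration only; not an existing Hadoop class. */
final class AddressTracker {
    private final Function<String, String> resolver; // host -> IP, or null if unresolvable
    private final Map<String, String> lastKnown = new ConcurrentHashMap<>();
    final List<String> log = new ArrayList<>();      // stand-in for a real logger

    AddressTracker(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Returns the cached address, resolving only if nothing is cached yet. */
    String current(String host) {
        // If the resolver returns null, no mapping is stored and null is returned.
        return lastKnown.computeIfAbsent(host, resolver);
    }

    /**
     * Re-resolves the host, e.g. after a connection failure. Returns true and
     * logs a single line only when the address actually changed.
     */
    synchronized boolean refresh(String host) {
        String fresh = resolver.apply(host);
        if (fresh == null) {
            return false; // not resolvable right now; keep the old address
        }
        String old = lastKnown.put(host, fresh);
        if (old != null && !old.equals(fresh)) {
            log.add(host + " moved from " + old + " to " + fresh);
            return true;
        }
        return false;
    }
}
```

A caller would use current(host) on the normal path and call refresh(host) only when a connection attempt fails, so an address change is resolved and logged once instead of on every call.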