Steve Vaughan created HADOOP-18396:
--------------------------------------

             Summary: Issues running in dynamic / managed environments
                 Key: HADOOP-18396
                 URL: https://issues.apache.org/jira/browse/HADOOP-18396
             Project: Hadoop Common
          Issue Type: Improvement
    Affects Versions: 3.4.0, 3.3.9, 3.3.4
         Environment: Running an HA configuration in Kubernetes, using Java 11.
            Reporter: Steve Vaughan
            Assignee: Steve Vaughan


Running in dynamic or managed environments is a challenge because we can't 
assume that all services will have DNS entries, will be started in a specific 
order, will maintain constant IP addresses, etc.  I'm using the following 
assumptions to guide the changes necessary to operate in this kind of 
environment:
 # The configuration files are an expression of desired state
 # If a referenced service instance is not resolvable or reachable at a given 
moment, it eventually will be, and it should then be able to participate as if 
it had been there originally, without requiring manual intervention
 # IP address changes should be handled in a way that not only allows 
distributed calls to continue to function, but also avoids having to re-resolve 
the address over and over
 # Code that requires resolved names (Kerberos and DataNode registration) 
should fall back to DNS reverse lookups to work around temporary issues caused 
by caching.  Example: The DataNode registration is only performed at startup, 
and yet the extra check that allows it to succeed in registering with the 
NameNode isn’t performed
 # If an HA system is supposed to only require a quorum, then we shouldn’t 
require the full set, allowing the called service to bring the remaining 
instances into compliance
 # Managing a service should be independent of other services.  Example: You 
should be able to perform a rolling restart of JournalNodes without worrying 
about causing an issue with NameNodes as long as a quorum is present.
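
The quorum assumptions above can be illustrated with a minimal sketch. The class and method names (QuorumCheck, majority, hasQuorum) are hypothetical and are not part of Hadoop; the sketch only shows the arithmetic of treating the configured set as desired state while proceeding once a simple majority is reachable.

```java
// Hypothetical helper, not Hadoop code: decides whether enough of the
// configured instances are reachable to proceed.
final class QuorumCheck {
    private QuorumCheck() {}

    /** Smallest simple majority of the configured set, e.g. 2 of 3 or 3 of 5. */
    static int majority(int configured) {
        if (configured <= 0) {
            throw new IllegalArgumentException("configured must be positive");
        }
        return configured / 2 + 1;
    }

    /** True when the reachable instances form at least a simple majority. */
    static boolean hasQuorum(int configured, int reachable) {
        return reachable >= majority(configured);
    }

    public static void main(String[] args) {
        // 2 of 3 configured JournalNodes are up: enough to proceed.
        System.out.println(hasQuorum(3, 2)); // true
        // Only 1 of 3 is up: not enough.
        System.out.println(hasQuorum(3, 1)); // false
    }
}
```

Under this sketch, a caller would gate startup on hasQuorum rather than on every configured instance being present, and leave it to the service itself to bring late instances up to date.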

A proof of these concepts would be the ability to:
 * Start a cluster with fewer than the full replica count of a service, as long 
as the required quorum or minimal count is available, and have it start and 
function.  Example: 2 out of 3 configured JournalNodes should still 
allow the NameNode to format, function, roll over to the standby, etc.
 * Introduce missing instances later and have them join the existing cluster 
without manual intervention.  Example: A 3rd JournalNode that is started later 
should automatically be formatted and brought up to date
 * Perform rolling restarts of individual services without negatively impacting 
other services (causing failures, restarts, etc.).  Example: Rolling restarts 
of JournalNodes shouldn't cause problems in NameNodes; Rolling restarts of 
NameNodes shouldn't cause problems with DataNodes
 * Report an updated IP address in the logs only once (per dependent), avoiding 
costly re-resolution
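
As a rough illustration of the address-handling goal (re-resolve only when needed, and report a change once), here is a minimal sketch. CachedEndpoint, its Resolver interface, and the ip helper are hypothetical names and not Hadoop APIs; the host name and port in the usage note are illustrative. The resolver is injected so the behaviour can be exercised without real DNS.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.Optional;

// Hypothetical sketch, not Hadoop code: caches a resolved address and only
// re-resolves when asked to (for example after a connection failure).
final class CachedEndpoint {
    /** Pluggable lookup so tests can avoid real DNS. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    private final String host;
    private final int port;
    private final Resolver resolver;
    private InetAddress current; // null until the first successful resolution
    private int updates;         // number of times a new address was reported

    CachedEndpoint(String host, int port, Resolver resolver) {
        this.host = host;
        this.port = port;
        this.resolver = resolver;
    }

    /** Returns the cached address, resolving only if nothing is cached yet. */
    synchronized Optional<InetSocketAddress> address() {
        if (current == null) {
            refresh();
        }
        return current == null
            ? Optional.empty()
            : Optional.of(new InetSocketAddress(current, port));
    }

    /** Re-resolves the host; returns true (and logs once) only on a change. */
    synchronized boolean refresh() {
        InetAddress latest;
        try {
            latest = resolver.resolve(host);
        } catch (UnknownHostException e) {
            return false; // not resolvable right now; keep the cache, retry later
        }
        if (latest.equals(current)) {
            return false; // same IP: nothing to report
        }
        System.out.println("Address for " + host + ": " + current + " -> " + latest);
        current = latest;
        updates++;
        return true;
    }

    synchronized int updates() {
        return updates;
    }

    /** Builds an IPv4 address from literal octets without any DNS lookup. */
    static InetAddress ip(String host, int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(
                host, new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A caller might construct it as new CachedEndpoint("jn-0", 8485, InetAddress::getByName), use address() for normal calls, and invoke refresh() only after a connection failure, so an unchanged address is neither re-logged nor repeatedly looked up.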



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
