Marton Elek created HDDS-3586:
---------------------------------

             Summary: OM HA can be started with 3 isolated LEADER instead of 
one OM ring
                 Key: HDDS-3586
                 URL: https://issues.apache.org/jira/browse/HDDS-3586
             Project: Hadoop Distributed Data Store
          Issue Type: Improvement
            Reporter: Marton Elek


Steps to reproduce:

Imagine that I have 3 different OMs with the following DNS names:

{code}
ozone-om-0.ozone-om
ozone-om-1.ozone-om
ozone-om-2.ozone-om
{code}

I configured the three hosts as follows:

{code}
  OZONE-SITE.XML_ozone.om.nodes.omservice: om1,om2,om3
  OZONE-SITE.XML_ozone.om.address.omservice.om1: ozone-om-0
  OZONE-SITE.XML_ozone.om.address.omservice.om2: ozone-om-1
  OZONE-SITE.XML_ozone.om.address.omservice.om3: ozone-om-2
  OZONE-SITE.XML_ozone.om.ratis.enable: "true"
{code}

Unfortunately, DNS is not reliable in this environment: each host can resolve
only its own (LOCAL) hostname.

OMHANodeDetails.java silently ignores ALL configured addresses that cannot be
resolved:

{code}
        if (!addr.isUnresolved()) {
          if (!isPeer && OmUtils.isAddressLocal(addr)) {
            localRpcAddress = addr;
            localOMServiceId = serviceId;
            localOMNodeId = nodeId;
            localRatisPort = ratisPort;
            found++;
          } else {
            // This OMNode belongs to same OM service as the current OMNode.
            // Add it to peerNodes list.
            peerNodesList.add(getHAOMNodeDetails(conf, serviceId,
                nodeId, addr, ratisPort));
          }
        }
{code}

As a result, I have 3 running servers, but each one forms its own one-node
Ratis ring (peerNodesList is empty because only the local hostname can be
resolved).

The group ID is the same for all three, but each has a separate database and
works as a separate OM, which is VERY dangerous.
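
To illustrate the effect, here is a minimal, self-contained sketch of the same filtering logic. It is not Ozone code: the class UnresolvedPeerDemo and the method keepResolvedPeers are hypothetical names, the port number is arbitrary, and the unresolvable peers are simulated with InetSocketAddress.createUnresolved so that no real DNS lookup is needed.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical model of the filtering above; not Ozone code. */
final class UnresolvedPeerDemo {

  /** Returns the peers that survive an isUnresolved() filter. */
  static List<InetSocketAddress> keepResolvedPeers(
      List<InetSocketAddress> configured, InetSocketAddress local) {
    List<InetSocketAddress> peers = new ArrayList<>();
    for (InetSocketAddress addr : configured) {
      if (addr.isUnresolved()) {
        continue; // dropped without any warning or error
      }
      if (!addr.equals(local)) {
        peers.add(addr);
      }
    }
    return peers;
  }

  public static void main(String[] args) {
    int port = 9862; // arbitrary port chosen for the example
    // The local node resolves; the other two names are simulated as unresolvable.
    InetSocketAddress local =
        new InetSocketAddress(InetAddress.getLoopbackAddress(), port);
    List<InetSocketAddress> configured = Arrays.asList(
        local,
        InetSocketAddress.createUnresolved("ozone-om-1.ozone-om", port),
        InetSocketAddress.createUnresolved("ozone-om-2.ozone-om", port));

    List<InetSocketAddress> peers = keepResolvedPeers(configured, local);
    System.out.println("configured nodes: " + configured.size());
    System.out.println("peers kept: " + peers.size());
  }
}
```

With three configured nodes, the peer list ends up empty, which matches the one-node ring described above.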

1. Option one: we can accept unresolved addresses as peers and retry when the
connection is created if a peer cannot be reached.

2. Option two: at least the error handling should be fixed. When I configure 3
OMs, there are supposed to be 3 OMs.
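
A rough sketch of what option two could look like is shown below. It assumes the startup code knows the configured node IDs, whether the local node was found, and how many peers were added; OmNodeCountCheck and verifyAllNodesFound are hypothetical names for illustration, not existing Ozone APIs.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical fail-fast check; not part of Ozone. */
final class OmNodeCountCheck {

  /**
   * Throws if the local node plus the resolved peers do not account for
   * every configured OM node ID.
   */
  static void verifyAllNodesFound(List<String> configuredNodeIds,
      boolean localFound, int resolvedPeerCount) {
    int found = resolvedPeerCount + (localFound ? 1 : 0);
    if (found != configuredNodeIds.size()) {
      throw new IllegalStateException("Configured OM nodes "
          + configuredNodeIds + " but only " + found
          + " could be resolved; refusing to start a partial ring.");
    }
  }

  public static void main(String[] args) {
    List<String> nodeIds = Arrays.asList("om1", "om2", "om3");

    // Healthy case: local node plus two resolved peers.
    verifyAllNodesFound(nodeIds, true, 2);
    System.out.println("all nodes accounted for");

    // Case from this report: only the local hostname resolves.
    try {
      verifyAllNodesFound(nodeIds, true, 0);
    } catch (IllegalStateException e) {
      System.out.println("startup rejected: " + e.getMessage());
    }
  }
}
```

Failing at startup is the more conservative choice: it never tolerates a temporary DNS outage, but it also never lets three independent leaders write to separate databases.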


