Marton Elek created HDDS-3587:
---------------------------------

             Summary: OM HA can be started with stale OM server
                 Key: HDDS-3587
                 URL: https://issues.apache.org/jira/browse/HDDS-3587
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
            Reporter: Marton Elek


While starting OM HA, I found that it is possible to end up with a degraded 
cluster. I used the following configuration:

{code}
  CORE-SITE.XML_fs.defaultFS: o3fs://bucket1.vol1.ozone-om-0.ozone-om/
  CORE-SITE.xml_fs.AbstractFileSystem.o3fs.impl: org.apache.hadoop.fs.ozone.OzFs
  OZONE-SITE.XML_hdds.datanode.dir: /data/storage
  OZONE-SITE.XML_ozone.scm.datanode.id.dir: /data
  OZONE-SITE.XML_ozone.metadata.dirs: /data/metadata
  OZONE-SITE.XML_ozone.scm.block.client.address: ozone-scm-0.ozone-scm
  OZONE-SITE.XML_ozone.om.address: ozone-om-0.ozone-om
  OZONE-SITE.XML_ozone.scm.client.address: ozone-scm-0.ozone-scm
  OZONE-SITE.XML_ozone.scm.names: ozone-scm-0.ozone-scm
  OZONE-SITE.XML_ozone.enabled: "true"
  OZONE-SITE.XML_hdds.scm.safemode.min.datanode: "3"
  LOG4J.PROPERTIES_log4j.rootLogger: INFO, stdout
  LOG4J.PROPERTIES_log4j.logger.org.apache.ratis: DEBUG
  LOG4J.PROPERTIES_log4j.appender.stdout: org.apache.log4j.ConsoleAppender
  LOG4J.PROPERTIES_log4j.appender.stdout.layout: org.apache.log4j.PatternLayout
  LOG4J.PROPERTIES_log4j.appender.stdout.layout.ConversionPattern: '%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n'
  MAPRED-SITE.XML_mapreduce.application.classpath: /opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/ozonefs/hadoop-ozone-filesystem-lib-current-0.5.0-SNAPSHOT.jar
  OZONE-SITE.XML_ozone.om.service.ids: omservice
  OZONE-SITE.XML_ozone.om.nodes.omservice: om1,om2,om3
  OZONE-SITE.XML_ozone.om.address.omservice.om1: ozone-om-0.ozone-om.default.svc.cluster.local
  OZONE-SITE.XML_ozone.om.address.omservice.om2: ozone-om-1.ozone-om.default.svc.cluster.local
  OZONE-SITE.XML_ozone.om.address.omservice.om3: ozone-om-2.ozone-om.default.svc.cluster.local
  OZONE-SITE.XML_ozone.om.ratis.enable: "true"
{code}
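
As a side note on the HA-related keys above, here is a minimal sketch of a sanity check that every node listed under ozone.om.nodes.<service> has a matching ozone.om.address.<service>.<node> entry. It assumes the properties have already been loaded into a plain dict with the file-name prefix (OZONE-SITE.XML_) stripped; the helper name missing_om_addresses is hypothetical and not part of Ozone. It does not detect the problem described in this issue, it only rules out one kind of configuration mistake.

```python
def missing_om_addresses(props):
    """Return (service_id, node_id) pairs that have no ozone.om.address entry.

    `props` is assumed to be a dict of ozone-site properties (hypothetical
    input format; loading the file is out of scope for this sketch).
    """
    missing = []
    service_ids = props.get("ozone.om.service.ids", "")
    for service in (s.strip() for s in service_ids.split(",")):
        if not service:
            continue
        node_ids = props.get(f"ozone.om.nodes.{service}", "")
        for node in (n.strip() for n in node_ids.split(",")):
            if not node:
                continue
            # Each OM node needs its own address key under the service id.
            if not props.get(f"ozone.om.address.{service}.{node}"):
                missing.append((service, node))
    return missing


# The HA-related keys from the configuration in this report.
props = {
    "ozone.om.service.ids": "omservice",
    "ozone.om.nodes.omservice": "om1,om2,om3",
    "ozone.om.address.omservice.om1": "ozone-om-0.ozone-om.default.svc.cluster.local",
    "ozone.om.address.omservice.om2": "ozone-om-1.ozone-om.default.svc.cluster.local",
    "ozone.om.address.omservice.om3": "ozone-om-2.ozone-om.default.svc.cluster.local",
}

print(missing_om_addresses(props))  # prints []
```

For the configuration in this report the list is empty, so the degraded state is not caused by a missing address key.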

The first OM (ozone-om-0) started without any error and became the leader.

The ozone-om-1 and ozone-om-2 instances are running, but their Ratis servers 
are shut down because the leader rejected their vote requests and ordered them 
to shut down.

1. I think it would be better to make this error more visible (for example by 
stopping the OM that was rejected from the Ratis ring instead of leaving it 
waiting in an unusable state).

2. We need more logging (maybe in RatisServerImpl.shouldSendShutdown) to 
explain why the OM followers are rejected.


