Marton Elek created HDDS-3587:
---------------------------------
Summary: OM HA can be started with stale OM server
Key: HDDS-3587
URL: https://issues.apache.org/jira/browse/HDDS-3587
Project: Hadoop Distributed Data Store
Issue Type: Bug
Reporter: Marton Elek
When I started OM HA, I found that it's possible to end up with a degraded
cluster. I used the following configuration:
{code}
CORE-SITE.XML_fs.defaultFS: o3fs://bucket1.vol1.ozone-om-0.ozone-om/
CORE-SITE.xml_fs.AbstractFileSystem.o3fs.impl: org.apache.hadoop.fs.ozone.OzFs
OZONE-SITE.XML_hdds.datanode.dir: /data/storage
OZONE-SITE.XML_ozone.scm.datanode.id.dir: /data
OZONE-SITE.XML_ozone.metadata.dirs: /data/metadata
OZONE-SITE.XML_ozone.scm.block.client.address: ozone-scm-0.ozone-scm
OZONE-SITE.XML_ozone.om.address: ozone-om-0.ozone-om
OZONE-SITE.XML_ozone.scm.client.address: ozone-scm-0.ozone-scm
OZONE-SITE.XML_ozone.scm.names: ozone-scm-0.ozone-scm
OZONE-SITE.XML_ozone.enabled: "true"
OZONE-SITE.XML_hdds.scm.safemode.min.datanode: "3"
LOG4J.PROPERTIES_log4j.rootLogger: INFO, stdout
LOG4J.PROPERTIES_log4j.logger.org.apache.ratis: DEBUG
LOG4J.PROPERTIES_log4j.appender.stdout: org.apache.log4j.ConsoleAppender
LOG4J.PROPERTIES_log4j.appender.stdout.layout: org.apache.log4j.PatternLayout
LOG4J.PROPERTIES_log4j.appender.stdout.layout.ConversionPattern: '%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n'
MAPRED-SITE.XML_mapreduce.application.classpath: /opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/ozonefs/hadoop-ozone-filesystem-lib-current-0.5.0-SNAPSHOT.jar
OZONE-SITE.XML_ozone.om.service.ids: omservice
OZONE-SITE.XML_ozone.om.nodes.omservice: om1,om2,om3
OZONE-SITE.XML_ozone.om.address.omservice.om1: ozone-om-0.ozone-om.default.svc.cluster.local
OZONE-SITE.XML_ozone.om.address.omservice.om2: ozone-om-1.ozone-om.default.svc.cluster.local
OZONE-SITE.XML_ozone.om.address.omservice.om3: ozone-om-2.ozone-om.default.svc.cluster.local
OZONE-SITE.XML_ozone.om.ratis.enable: "true"
{code}
The first OM (ozone-om-0) starts without any error and becomes the leader.
The ozone-om-1 and ozone-om-2 instances are running, but their Ratis instances
are shut down because the leader rejected their votes and ordered the servers
to shut down.
1. I think it would be better to make this error more visible (for example, by
stopping the OM HA cluster if an OM is rejected from joining the Ratis ring,
instead of leaving it waiting in an unusable state).
2. We need some more logging (maybe in RatisServerImpl.shouldSendShutdown) to
explain why the OM followers are rejected.
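The two suggestions above could look roughly like the sketch below. It is only
an illustration, not Ozone or Ratis code: JoinRejectionSketch, VoteReply,
RejectReason, describe and shouldExit are hypothetical names, and the rejection
reasons are placeholders for whatever conditions the leader actually checks.

```java
/** Hypothetical sketch: none of these types exist in Ozone or Ratis. */
public class JoinRejectionSketch {

  /** Placeholder reasons; the real conditions are decided by the leader. */
  enum RejectReason { NOT_IN_CONFIGURATION, LOG_BEHIND_CONFIGURATION }

  /** Minimal stand-in for a vote reply that may carry a shutdown order. */
  static final class VoteReply {
    final boolean granted;
    final boolean shouldShutdown;
    final RejectReason reason;

    VoteReply(boolean granted, boolean shouldShutdown, RejectReason reason) {
      this.granted = granted;
      this.shouldShutdown = shouldShutdown;
      this.reason = reason;
    }
  }

  /** Suggestion 2: build a log line that says why the candidate was rejected. */
  static String describe(String candidateId, String leaderId, VoteReply reply) {
    return String.format(
        "Vote request from %s rejected by leader %s: reason=%s, shutdownOrdered=%b",
        candidateId, leaderId, reply.reason, reply.shouldShutdown);
  }

  /** Suggestion 1: decide whether the OM should stop instead of idling. */
  static boolean shouldExit(VoteReply reply) {
    return !reply.granted && reply.shouldShutdown;
  }

  public static void main(String[] args) {
    VoteReply reply =
        new VoteReply(false, true, RejectReason.NOT_IN_CONFIGURATION);
    System.err.println(describe("om2", "om1", reply));
    if (shouldExit(reply)) {
      // A real OM would terminate here rather than keep running unusable.
      System.err.println("OM cannot join the Ratis ring, stopping");
    }
  }
}
```

With something like this, a rejected OM would leave a clear message in the log
and stop, rather than keep running with its Ratis instance shut down.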