[
https://issues.apache.org/jira/browse/HDDS-3587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126230#comment-17126230
]
Arpit Agarwal commented on HDDS-3587:
-------------------------------------
Moving this out to 0.7.0. Since HDDS-3586 is fixed, hopefully this won't be a
blocker for 0.6.0.
> OM HA can be started with stale OM server
> -----------------------------------------
>
> Key: HDDS-3587
> URL: https://issues.apache.org/jira/browse/HDDS-3587
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: OM HA
> Reporter: Marton Elek
> Priority: Critical
> Labels: TriagePending
>
> When I started OM HA, I found that it's possible to get a degraded cluster.
> I used the following configuration:
> {code}
> CORE-SITE.XML_fs.defaultFS: o3fs://bucket1.vol1.ozone-om-0.ozone-om/
> CORE-SITE.xml_fs.AbstractFileSystem.o3fs.impl: org.apache.hadoop.fs.ozone.OzFs
> OZONE-SITE.XML_hdds.datanode.dir: /data/storage
> OZONE-SITE.XML_ozone.scm.datanode.id.dir: /data
> OZONE-SITE.XML_ozone.metadata.dirs: /data/metadata
> OZONE-SITE.XML_ozone.scm.block.client.address: ozone-scm-0.ozone-scm
> OZONE-SITE.XML_ozone.om.address: ozone-om-0.ozone-om
> OZONE-SITE.XML_ozone.scm.client.address: ozone-scm-0.ozone-scm
> OZONE-SITE.XML_ozone.scm.names: ozone-scm-0.ozone-scm
> OZONE-SITE.XML_ozone.enabled: "true"
> OZONE-SITE.XML_hdds.scm.safemode.min.datanode: "3"
> LOG4J.PROPERTIES_log4j.rootLogger: INFO, stdout
> LOG4J.PROPERTIES_log4j.logger.org.apache.ratis: DEBUG
> LOG4J.PROPERTIES_log4j.appender.stdout: org.apache.log4j.ConsoleAppender
> LOG4J.PROPERTIES_log4j.appender.stdout.layout: org.apache.log4j.PatternLayout
> LOG4J.PROPERTIES_log4j.appender.stdout.layout.ConversionPattern: '%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n'
> MAPRED-SITE.XML_mapreduce.application.classpath: /opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/ozonefs/hadoop-ozone-filesystem-lib-current-0.5.0-SNAPSHOT.jar
> OZONE-SITE.XML_ozone.om.service.ids: omservice
> OZONE-SITE.XML_ozone.om.nodes.omservice: om1,om2,om3
> OZONE-SITE.XML_ozone.om.address.omservice.om1: ozone-om-0.ozone-om.default.svc.cluster.local
> OZONE-SITE.XML_ozone.om.address.omservice.om2: ozone-om-1.ozone-om.default.svc.cluster.local
> OZONE-SITE.XML_ozone.om.address.omservice.om3: ozone-om-2.ozone-om.default.svc.cluster.local
> OZONE-SITE.XML_ozone.om.ratis.enable: "true"
> {code}
> The first OM (ozone-om-0) started without any error and became the leader.
> The ozone-om-0 and ozone-om-1 instances are running, but the Ratis instance is
> shut down because the leader rejected the vote and ordered the server to shut down.
> 1. I think it would be better to make this error more visible (for example,
> by stopping the OM HA cluster when an OM is rejected from joining the Ratis
> ring, instead of leaving it waiting in an unusable state).
> 2. We need some more logging (maybe in RatisServerImpl.shouldSendShutdown) to
> explain why the OM followers are rejected.
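
For reference, the HA keys in the configuration above nest in three levels: the service ids, the node ids of each service, and one address per node. Below is a minimal Python sketch of that lookup, assuming the properties have already been loaded into a plain dict with the file prefix (OZONE-SITE.XML_) stripped. The helper name om_node_addresses is hypothetical and is not part of Ozone.

```python
def om_node_addresses(props):
    """Map each OM service id to {node_id: address}.

    Follows the key layout used in the report:
      ozone.om.service.ids                   -> comma-separated service ids
      ozone.om.nodes.<serviceId>             -> comma-separated node ids
      ozone.om.address.<serviceId>.<nodeId>  -> address of one OM node
    This is an illustrative helper, not Ozone's own configuration code.
    """
    def split_csv(value):
        # Tolerate missing keys and stray whitespace around the commas.
        return [item.strip() for item in (value or "").split(",") if item.strip()]

    result = {}
    for service_id in split_csv(props.get("ozone.om.service.ids")):
        node_ids = split_csv(props.get(f"ozone.om.nodes.{service_id}"))
        result[service_id] = {
            # A missing address key shows up as None, which makes gaps easy to spot.
            node_id: props.get(f"ozone.om.address.{service_id}.{node_id}")
            for node_id in node_ids
        }
    return result


# Values copied from the configuration in the report.
props = {
    "ozone.om.service.ids": "omservice",
    "ozone.om.nodes.omservice": "om1,om2,om3",
    "ozone.om.address.omservice.om1": "ozone-om-0.ozone-om.default.svc.cluster.local",
    "ozone.om.address.omservice.om2": "ozone-om-1.ozone-om.default.svc.cluster.local",
    "ozone.om.address.omservice.om3": "ozone-om-2.ozone-om.default.svc.cluster.local",
}

addresses = om_node_addresses(props)
for node_id, address in addresses["omservice"].items():
    print(node_id, address)
```

A node id that is listed under ozone.om.nodes.omservice but has no matching address key maps to None here, so a quick scan of the result is enough to catch an incomplete HA setup before starting the OMs.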