[jira] [Updated] (HDDS-3587) OM HA can be started with stale OM server

Prashant Pogde (Jira) Fri, 29 Jan 2021 09:23:31 -0800


     [ https://issues.apache.org/jira/browse/HDDS-3587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Prashant Pogde updated HDDS-3587:
---------------------------------
    Target Version/s: 1.2.0

I am managing the 1.1.0 release and we currently have more than 600 issues 
targeted for 1.1.0. I am moving the target field to 1.2.0. 

If you are actively working on this Jira and believe it should be targeted to 
the 1.1.0 release, please change the target field back to 1.1.0 before Feb 05, 
2021. 

> OM HA can be started with stale OM server
> -----------------------------------------
>
>                 Key: HDDS-3587
>                 URL: https://issues.apache.org/jira/browse/HDDS-3587
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: OM HA
>            Reporter: Marton Elek
>            Priority: Critical
>
> When I started OM HA, I found that it's possible to end up with a degraded 
> cluster. I used the following configuration:
> {code}
>   CORE-SITE.XML_fs.defaultFS: o3fs://bucket1.vol1.ozone-om-0.ozone-om/
>   CORE-SITE.xml_fs.AbstractFileSystem.o3fs.impl: org.apache.hadoop.fs.ozone.OzFs
>   OZONE-SITE.XML_hdds.datanode.dir: /data/storage
>   OZONE-SITE.XML_ozone.scm.datanode.id.dir: /data
>   OZONE-SITE.XML_ozone.metadata.dirs: /data/metadata
>   OZONE-SITE.XML_ozone.scm.block.client.address: ozone-scm-0.ozone-scm
>   OZONE-SITE.XML_ozone.om.address: ozone-om-0.ozone-om
>   OZONE-SITE.XML_ozone.scm.client.address: ozone-scm-0.ozone-scm
>   OZONE-SITE.XML_ozone.scm.names: ozone-scm-0.ozone-scm
>   OZONE-SITE.XML_ozone.enabled: "true"
>   OZONE-SITE.XML_hdds.scm.safemode.min.datanode: "3"
>   LOG4J.PROPERTIES_log4j.rootLogger: INFO, stdout
>   LOG4J.PROPERTIES_log4j.logger.org.apache.ratis: DEBUG
>   LOG4J.PROPERTIES_log4j.appender.stdout: org.apache.log4j.ConsoleAppender
>   LOG4J.PROPERTIES_log4j.appender.stdout.layout: org.apache.log4j.PatternLayout
>   LOG4J.PROPERTIES_log4j.appender.stdout.layout.ConversionPattern: '%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n'
>   MAPRED-SITE.XML_mapreduce.application.classpath: /opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/ozonefs/hadoop-ozone-filesystem-lib-current-0.5.0-SNAPSHOT.jar
>   OZONE-SITE.XML_ozone.om.service.ids: omservice
>   OZONE-SITE.XML_ozone.om.nodes.omservice: om1,om2,om3
>   OZONE-SITE.XML_ozone.om.address.omservice.om1: ozone-om-0.ozone-om.default.svc.cluster.local
>   OZONE-SITE.XML_ozone.om.address.omservice.om2: ozone-om-1.ozone-om.default.svc.cluster.local
>   OZONE-SITE.XML_ozone.om.address.omservice.om3: ozone-om-2.ozone-om.default.svc.cluster.local
>   OZONE-SITE.XML_ozone.om.ratis.enable: "true"
> {code}
> The first OM (ozone-om-0) started without any error and became the leader.
> The ozone-om-1 and ozone-om-2 instances are running, but their Ratis instances 
> are shut down because the leader rejected the vote and ordered the servers to 
> shut down. 
> 1. I think it would be better to make this error more visible (for example, 
> by stopping the OM if it is rejected from joining the Ratis ring instead of 
> leaving it waiting in an unusable state).
> 2. We need more logging (maybe in RatisServerImpl.shouldSendShutdown) to 
> explain why the OM followers are rejected.
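
The configuration quoted above uses a "FILE_key: value" naming style, where the
part before the first underscore names the target config file and the rest is
the property key. As a reading aid, here is a minimal Python sketch that groups
such entries by file and lists the OM HA nodes declared under
ozone.om.service.ids. The helper names group_by_file and om_ha_nodes are
hypothetical and are not part of Ozone; the sketch only assumes the naming
convention visible in the quoted configuration, and it copies a subset of the
entries from it.

```python
from collections import defaultdict


def group_by_file(entries):
    """Split 'FILE_key' names into {file_name: {key: value}}.

    Assumption: the target file name is everything before the first
    underscore, as in the quoted configuration.
    """
    grouped = defaultdict(dict)
    for name, value in entries.items():
        file_name, _, key = name.partition("_")
        grouped[file_name.lower()][key] = value
    return dict(grouped)


def om_ha_nodes(ozone_site):
    """Return {service_id: {node_id: address}} from the ozone.om.* keys."""
    services = {}
    for service_id in ozone_site.get("ozone.om.service.ids", "").split(","):
        service_id = service_id.strip()
        if not service_id:
            continue
        node_ids = ozone_site.get("ozone.om.nodes." + service_id, "").split(",")
        services[service_id] = {
            node.strip(): ozone_site.get(
                "ozone.om.address." + service_id + "." + node.strip()
            )
            for node in node_ids
            if node.strip()
        }
    return services


# Subset of the entries from the quoted configuration.
entries = {
    "OZONE-SITE.XML_ozone.om.address": "ozone-om-0.ozone-om",
    "OZONE-SITE.XML_ozone.om.service.ids": "omservice",
    "OZONE-SITE.XML_ozone.om.nodes.omservice": "om1,om2,om3",
    "OZONE-SITE.XML_ozone.om.address.omservice.om1":
        "ozone-om-0.ozone-om.default.svc.cluster.local",
    "OZONE-SITE.XML_ozone.om.address.omservice.om2":
        "ozone-om-1.ozone-om.default.svc.cluster.local",
    "OZONE-SITE.XML_ozone.om.address.omservice.om3":
        "ozone-om-2.ozone-om.default.svc.cluster.local",
    "OZONE-SITE.XML_ozone.om.ratis.enable": "true",
}

grouped = group_by_file(entries)
ha_nodes = om_ha_nodes(grouped["ozone-site.xml"])
for node_id, address in ha_nodes["omservice"].items():
    print(node_id, address)
```

Listing the nodes this way makes it easy to see that the ring is expected to
have three members (om1, om2, om3), which is the set against which a rejected
follower would be compared.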



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

