[ https://issues.apache.org/jira/browse/HDDS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043715#comment-17043715 ]

Li Cheng commented on HDDS-3004:
--------------------------------

After adding some logs to the SCM starter (assuming SCM is only bounced after the leader OM is stopped):
1. If SCM is bounced after the former leader OM is restarted, meaning all OMs are up, SCM bootstraps correctly, but the pipeline report from the node that has no OM process on it will be missing (it is always that same node, though). This causes all pipelines to stay in the ALLOCATED state and the cluster to remain in safemode. At this point, if we restart that black-sheep datanode, it comes back, sends its pipeline report to SCM, and all pipelines move to the OPEN state (see the sketch after this list).
2. If SCM is bounced before the former leader OM is restarted, meaning not all OMs in the Ratis ring are up, SCM cannot be bootstrapped correctly and it reports "Pipeline not found".
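As referenced in point 1, here is a minimal sketch of the report-driven ALLOCATED -> OPEN transition. Names are simplified assumptions, not the actual SCM code; the real logic lives in SCM's pipeline report handling (e.g. PipelineReportHandler in hdds-server-scm):

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hedged sketch, not the actual SCM code: a pipeline restored at SCM
// startup begins in ALLOCATED and opens only after every member datanode
// has reported it back.
class PipelineReportSketch {
  enum State { ALLOCATED, OPEN }

  private final Set<UUID> members;                 // datanodes in the pipeline
  private final Set<UUID> reported = new HashSet<>();
  private State state = State.ALLOCATED;           // state right after SCM restart

  PipelineReportSketch(Set<UUID> members) {
    this.members = members;
  }

  // Invoked when a datanode heartbeat carries a report for this pipeline.
  // While the black-sheep node stays silent, the pipeline is stuck in
  // ALLOCATED (and SCM stays in safemode); once it reports, we open.
  void onPipelineReport(UUID datanode) {
    reported.add(datanode);
    if (state == State.ALLOCATED && reported.containsAll(members)) {
      state = State.OPEN;
    }
  }

  State getState() {
    return state;
  }
}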
Logs:
2020-02-25 01:42:59,638 [main] ERROR org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SCM start failed with exception
org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=8aac083b-aba4-4a77-886f-cf0ad20040bb not found
 at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:133)
 at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.addContainerToPipeline(PipelineStateMap.java:110)
 at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager.addContainerToPipeline(PipelineStateManager.java:59)
 at org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager.addContainerToPipeline(SCMPipelineManager.java:324)
 at org.apache.hadoop.hdds.scm.container.SCMContainerManager.loadExistingContainers(SCMContainerManager.java:121)
 at org.apache.hadoop.hdds.scm.container.SCMContainerManager.<init>(SCMContainerManager.java:107)
 at org.apache.hadoop.hdds.scm.server.StorageContainerManager.initializeSystemManagers(StorageContainerManager.java:412)
 at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:283)
 at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:215)
 at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:612)
 at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:142)
 at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:117)
 at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:66)
 at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:42)
 at picocli.CommandLine.execute(CommandLine.java:1173)
 at picocli.CommandLine.access$800(CommandLine.java:141)
 at picocli.CommandLine$RunLast.handle(CommandLine.java:1367)
 at picocli.CommandLine$RunLast.handle(CommandLine.java:1335)
 at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1243)
 at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526)
 at picocli.CommandLine.parseWithHandler(CommandLine.java:1465)
 at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:65)
 at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:56)
 at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:55)
2020-02-25 01:42:59,694 [shutdown-hook-0] INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
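Reading the trace bottom-up: loadExistingContainers() replays persisted container metadata at startup and re-links each container to its pipeline via PipelineStateMap.getPipeline(); when a referenced pipeline was not restored into the state map, the lookup throws and the whole SCM start aborts. A minimal sketch of that path (simplified from the frames above; the data structures here are assumptions, not the real SCM classes):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hedged reconstruction of the startup path in the stack trace above.
class ScmStartupSketch {
  static class PipelineNotFoundException extends IOException {
    PipelineNotFoundException(String msg) { super(msg); }
  }

  // In-memory map rebuilt from the pipeline DB at startup.
  private final Map<String, Object> pipelineMap = new HashMap<>();

  Object getPipeline(String pipelineId) throws PipelineNotFoundException {
    Object pipeline = pipelineMap.get(pipelineId);
    if (pipeline == null) {
      // The line that fires in the log: the container DB still references
      // a pipeline that was never restored into the state map.
      throw new PipelineNotFoundException(
          "PipelineID=" + pipelineId + " not found");
    }
    return pipeline;
  }

  // Mirrors loadExistingContainers(): every persisted container is
  // re-attached to its pipeline, so a single stale reference makes the
  // whole SCM start fail ("SCM start failed with exception").
  void loadExistingContainers(Map<String, String> containerToPipeline)
      throws PipelineNotFoundException {
    for (Map.Entry<String, String> entry : containerToPipeline.entrySet()) {
      getPipeline(entry.getValue());
    }
  }
}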

> OM HA stability issues
> ----------------------
>
>                 Key: HDDS-3004
>                 URL: https://issues.apache.org/jira/browse/HDDS-3004
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: om
>    Affects Versions: 0.4.0
>            Reporter: Li Cheng
>            Assignee: Bharat Viswanadham
>            Priority: Blocker
>
> To summarize, the major issues I have found are:
> 1. When I run a long-running s3g workload writing to a cluster with OM HA and stop the OM leader to force a re-election, writing stops and can never recover.
> -- update 2020-02-20:
> https://issues.apache.org/jira/browse/HDDS-3031 fixes this issue.
>  
> 2. If I force an OM re-election and restart SCM after that, the cluster cannot see any leader datanode and no datanode is able to send pipeline reports, which makes the cluster unavailable as well. I consider this a multi-failover case in which the leader OM and SCM sit on the same node and a short outage hits that node.
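> A minimal sketch of the safemode behavior implied here (simplified; the real check is a safemode exit rule in SCM, e.g. the healthy-pipeline rule, and the 10% threshold below is an assumption based on Ozone defaults):
>
> // Hedged sketch, not the actual SCM rule: SCM leaves safemode only after
> // a threshold fraction of known pipelines have been reported back.
> class SafeModeSketch {
>   private final int totalPipelines;
>   private final double threshold;       // assumed default: 0.10
>   private int reportedPipelines;
>
>   SafeModeSketch(int totalPipelines, double threshold) {
>     this.totalPipelines = totalPipelines;
>     this.threshold = threshold;
>   }
>
>   void onPipelineReported() {
>     reportedPipelines++;
>   }
>
>   // With no datanode able to send pipeline reports, this never becomes
>   // true and the cluster stays unavailable, matching the symptom above.
>   boolean canExitSafeMode() {
>     return totalPipelines == 0
>         || (double) reportedPipelines / totalPipelines >= threshold;
>   }
> }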
>  
> -- update 2020-02-20:
> When you swap in the jars for a new version of Ozone and enable OM HA while keeping the same ozone-site.xml as before, and some data has already been written to the previous Ozone cluster (so existing version files and metadata exist for OM and SCM), SCM cannot come up after the jar swap.
> Error logs: "PipelineID=aae4f728-82ef-4bbb-a0a5-7b3f2af030cc not found" appears in the SCM out logs when the SCM process fails to start, matching the PipelineNotFoundException seen in the comment above.
>  
> Original posting:
> Use the S3 gateway to keep writing data to a specific s3g endpoint. After the writer starts working, I kill the OM process on the OM leader host. After that, the S3 gateway never accepts writes again and keeps reporting InternalError for all newly written keys.
> Process Process-488:
> S3UploadFailedError: Failed to upload ./20191204/file1056.dat to ozone-test-reproduce-123/./20191204/file1056.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> Process Process-489:
> S3UploadFailedError: Failed to upload ./20191204/file9631.dat to ozone-test-reproduce-123/./20191204/file9631.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> Process Process-490:
> S3UploadFailedError: Failed to upload ./20191204/file7520.dat to ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> Process Process-491:
> S3UploadFailedError: Failed to upload ./20191204/file4220.dat to ozone-test-reproduce-123/./20191204/file4220.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> Process Process-492:
> S3UploadFailedError: Failed to upload ./20191204/file5523.dat to ozone-test-reproduce-123/./20191204/file5523.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> Process Process-493:
> S3UploadFailedError: Failed to upload ./20191204/file7520.dat to ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> That is a partial list; note that the keys are all different. I also tried restarting the OM process on the previous leader, but it does not help since the leader has changed (see the failover sketch after the logs below). Also attached are partial OM logs:
> 2020-02-12 14:57:11,128 [IPC Server handler 72 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 72 on 9862, call Call#4859 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
> at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
> at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> 2020-02-12 14:57:11,918 [IPC Server handler 159 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 159 on 9862, call Call#4864 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
> at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
> at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> 2020-02-12 14:57:15,395 [IPC Server handler 23 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 9862, call Call#4869 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
> at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
> at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
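> Note how each rejection carries a "Suggested leader is OM:om2" hint. An HA-aware client is expected to follow that hint and retry against the suggested OM rather than surfacing a 500 to s3g. A minimal sketch of that failover loop (names simplified and hypothetical; in Ozone the real logic lives in the client-side OM failover proxy provider):
>
> import java.util.Map;
>
> // Hedged sketch, not the actual Ozone client code.
> class OmFailoverSketch {
>   static class NotLeaderException extends Exception {
>     final String suggestedLeader;
>     NotLeaderException(String suggestedLeader) {
>       super("not the leader");
>       this.suggestedLeader = suggestedLeader;
>     }
>   }
>
>   interface OmNode {
>     String submitRequest(String request) throws NotLeaderException;
>   }
>
>   // Try the current OM; on NotLeaderException, follow the hint and retry.
>   static String submitWithFailover(Map<String, OmNode> oms, String currentOm,
>       String request, int maxRetries) throws Exception {
>     for (int attempt = 0; attempt <= maxRetries; attempt++) {
>       try {
>         return oms.get(currentOm).submitRequest(request);
>       } catch (NotLeaderException e) {
>         currentOm = e.suggestedLeader;  // e.g. om1 -> om2 in the logs above
>       }
>     }
>     throw new Exception("no leader found after " + maxRetries + " retries");
>   }
> }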
>  
>  
> Also attached is the ozone-site.xml config used to enable OM HA:
> <property>
>   <name>ozone.om.service.ids</name>
>   <value>OMHA</value>
> </property>
> <property>
>   <name>ozone.om.nodes.OMHA</name>
>   <value>om1,om2,om3</value>
> </property>
> <property>
>   <name>ozone.om.node.id</name>
>   <value>om1</value>
> </property>
> <property>
>   <name>ozone.om.address.OMHA.om1</name>
>   <value>9.134.50.210:9862</value>
> </property>
> <property>
>   <name>ozone.om.address.OMHA.om2</name>
>   <value>9.134.51.215:9862</value>
> </property>
> <property>
>   <name>ozone.om.address.OMHA.om3</name>
>   <value>9.134.51.25:9862</value>
> </property>
> <property>
>   <name>ozone.om.ratis.enable</name>
>   <value>true</value>
> </property>
> <property>
>   <name>ozone.enabled</name>
>   <value>true</value>
>   <tag>OZONE, REQUIRED</tag>
>   <description>
>     Status of the Ozone Object Storage service is enabled.
>     Set to true to enable Ozone.
>     Set to false to disable Ozone.
>     Unless this value is set to true, Ozone services will not be started in
>     the cluster.
>     Please note: By default ozone is disabled on a hadoop cluster.
>   </description>
> </property>


