[ https://issues.apache.org/jira/browse/HDDS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042978#comment-17042978 ]

Li Cheng edited comment on HDDS-3004 at 2/23/20 4:37 PM:
---------------------------------------------------------

After a while, both of the old pipelines have elected leader datanodes:

[root@VM_50_210_centos ~]# ./ozone-0.5.0-SNAPSHOT/bin/ozone scmcli datanode list
Datanode: 316c3dd3-470b-4ebc-a139-766e2f1b8593 (/default-rack/9.134.51.25/9.134.51.25/2 pipelines)
Related pipelines:
4461d34e-c509-4175-944f-83fbe8ae1095/THREE/RATIS/ALLOCATED/Leader
e2b8a389-bfd2-4b34-b43d-361cbc02c7f9/THREE/RATIS/ALLOCATED/Follower

Datanode: 5ab5305d-f733-44e9-9dcd-ace391b5a9dc (/default-rack/9.134.51.232/9.134.51.232/2 pipelines)
Related pipelines:
4461d34e-c509-4175-944f-83fbe8ae1095/THREE/RATIS/ALLOCATED/Follower
e2b8a389-bfd2-4b34-b43d-361cbc02c7f9/THREE/RATIS/ALLOCATED/Follower

Datanode: 6da6b84b-3d8e-4309-ab28-7cc72b4e7293 (/default-rack/9.134.51.215/ozone.s3/2 pipelines)
Related pipelines:
4461d34e-c509-4175-944f-83fbe8ae1095/THREE/RATIS/ALLOCATED/Follower
e2b8a389-bfd2-4b34-b43d-361cbc02c7f9/THREE/RATIS/ALLOCATED/Leader

 

Now SCM sees the same pipelines, but still cannot move them to the OPEN state.

[root@VM_50_210_centos ~]# ./ozone-0.5.0-SNAPSHOT/bin/ozone scmcli pipeline list
Pipeline[ Id: 4461d34e-c509-4175-944f-83fbe8ae1095, Nodes: 5ab5305d-f733-44e9-9dcd-ace391b5a9dc{ip: 9.134.51.232, host: 9.134.51.232, networkLocation: /default-rack, certSerialId: null}316c3dd3-470b-4ebc-a139-766e2f1b8593{ip: 9.134.51.25, host: 9.134.51.25, networkLocation: /default-rack, certSerialId: null}6da6b84b-3d8e-4309-ab28-7cc72b4e7293{ip: 9.134.51.215, host: ozone.s3, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:316c3dd3-470b-4ebc-a139-766e2f1b8593, CreationTimestamp2020-02-23T16:28:52.131Z]
Pipeline[ Id: e2b8a389-bfd2-4b34-b43d-361cbc02c7f9, Nodes: 316c3dd3-470b-4ebc-a139-766e2f1b8593{ip: 9.134.51.25, host: 9.134.51.25, networkLocation: /default-rack, certSerialId: null}5ab5305d-f733-44e9-9dcd-ace391b5a9dc{ip: 9.134.51.232, host: 9.134.51.232, networkLocation: /default-rack, certSerialId: null}6da6b84b-3d8e-4309-ab28-7cc72b4e7293{ip: 9.134.51.215, host: ozone.s3, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:6da6b84b-3d8e-4309-ab28-7cc72b4e7293, CreationTimestamp2020-02-23T16:28:52.130Z]
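
For context, SCM moves a pipeline from ALLOCATED to OPEN only after it has processed pipeline reports from the member datanodes. A minimal sketch of that idea (simplified; names are illustrative, not the actual SCM code):

{code:java}
// Simplified model of SCM's pipeline state handling: the pipeline
// stays ALLOCATED until every member datanode has reported it.
import java.util.HashSet;
import java.util.Set;

class PipelineStateSketch {
  enum State { ALLOCATED, OPEN }

  private final Set<String> members;                 // datanode UUIDs
  private final Set<String> reported = new HashSet<>();
  private State state = State.ALLOCATED;

  PipelineStateSketch(Set<String> members) {
    this.members = members;
  }

  // Invoked for each incoming pipeline report.
  void onPipelineReport(String datanodeUuid) {
    reported.add(datanodeUuid);
    if (state == State.ALLOCATED && reported.containsAll(members)) {
      state = State.OPEN;                            // all members reported in
    }
  }

  State getState() {
    return state;
  }
}
{code}

If the reports never reach SCM, or are rejected, the pipeline stays stuck in ALLOCATED exactly as shown above.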

 

SCM logs keep showing:

2020-02-24 00:36:18,028 [EventQueue-ContainerReportForContainerReportHandler] WARN org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Container #12 is in OPEN state, but the datanode 316c3dd3-470b-4ebc-a139-766e2f1b8593{ip: 9.134.51.25, host: 9.134.51.25, networkLocation: /default-rack, certSerialId: null} reports an QUASI_CLOSED replica.

 

And the datanode at 9.134.51.25 keeps logging:

2020-02-24 00:35:15,532 [Command processor thread] ERROR org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler: Can't close pipeline #id: "e2b8a389-bfd2-4b34-b43d-361cbc02c7f9"
java.io.IOException: 316c3dd3-470b-4ebc-a139-766e2f1b8593: Group group-361CBC02C7F9 not found.
 at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.removeGroup(XceiverServerRatis.java:640)
 at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler.handle(ClosePipelineCommandHandler.java:77)
 at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:99)
 at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$1(DatanodeStateMachine.java:450)
 at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.ratis.protocol.GroupMismatchException: 316c3dd3-470b-4ebc-a139-766e2f1b8593: Group group-361CBC02C7F9 not found.
 at org.apache.ratis.server.impl.RaftServerProxy.groupRemoveAsync(RaftServerProxy.java:404)
 at org.apache.ratis.server.impl.RaftServerProxy.groupManagementAsync(RaftServerProxy.java:367)
 at org.apache.ratis.server.impl.RaftServerProxy.groupManagement(RaftServerProxy.java:350)
 at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.removeGroup(XceiverServerRatis.java:638)
 ... 4 more
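
For context, ClosePipelineCommandHandler asks the local Ratis server to remove the pipeline's Raft group, and Ratis throws GroupMismatchException when that group does not exist on the datanode. A minimal sketch of the failure mode (simplified; not the actual XceiverServerRatis/Ratis code):

{code:java}
// Simplified model of the group removal that fails above: asking a
// datanode to remove a Raft group it never (fully) created yields
// "Group ... not found", so the close-pipeline command cannot complete.
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RaftGroupRegistrySketch {
  // Raft groups this datanode currently hosts, keyed by group id.
  private final Map<String, Object> groups = new ConcurrentHashMap<>();

  void removeGroup(String groupId) throws IOException {
    if (groups.remove(groupId) == null) {
      // Mirrors org.apache.ratis.protocol.GroupMismatchException.
      throw new IOException("Group " + groupId + " not found.");
    }
  }
}
{code}

Since the removal keeps failing, the close-pipeline command keeps being retried and the pipeline never makes progress.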



> OM HA stability issues
> ----------------------
>
>                 Key: HDDS-3004
>                 URL: https://issues.apache.org/jira/browse/HDDS-3004
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: om
>    Affects Versions: 0.4.0
>            Reporter: Li Cheng
>            Assignee: Bharat Viswanadham
>            Priority: Blocker
>
> To summarize, the major issues that I find are:
>  # When I run a long-running s3g write workload against a cluster with OM HA 
> and stop the OM leader to force a re-election, writing stops and never 
> recovers.
> --updates 2020-02-20:
> https://issues.apache.org/jira/browse/HDDS-3031 fixes this issue.
>  
> 2. If I force an OM re-election and restart SCM after that, the cluster 
> cannot see any leader datanode and no datanodes are able to send pipeline 
> reports, which makes the cluster unavailable as well. I consider this a 
> multi-failover case where the leader OM and SCM are on the same node and a 
> short outage happens on that node.
>  
> --updates 2020-02-20:
>  When you do a jar swap to a new version of Ozone and enable OM HA while 
> keeping the same ozone-site.xml as last time, and you have already written 
> some data into the previous Ozone cluster (so there are existing versions 
> and metadata for om and scm), SCM cannot come up after the jar swap.
> Error logs: "PipelineID=aae4f728-82ef-4bbb-a0a5-7b3f2af030cc not found" 
> appears in the SCM out logs when the SCM process fails to start.
>  
> Original posting:
> Use the S3 gateway to keep writing data to a specific s3 gateway endpoint. 
> After the writer starts to work, I kill the OM process on the OM leader host. 
> After that, the s3 gateway never accepts writes again and keeps reporting 
> InternalError for all newly incoming keys.
> Process Process-488:
>  S3UploadFailedError: Failed to upload ./20191204/file1056.dat to 
> ozone-test-reproduce-123/./20191204/file1056.dat: An error occurred (500) 
> when calling the PutObject operation (reached max retries: 4): Internal 
> Server Error
>  Process Process-489:
>  S3UploadFailedError: Failed to upload ./20191204/file9631.dat to 
> ozone-test-reproduce-123/./20191204/file9631.dat: An error occurred (500) 
> when calling the PutObject operation (reached max retries: 4): Internal 
> Server Error
>  Process Process-490:
>  S3UploadFailedError: Failed to upload ./20191204/file7520.dat to 
> ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) 
> when calling the PutObject operation (reached max retries: 4): Internal 
> Server Error
>  Process Process-491:
>  S3UploadFailedError: Failed to upload ./20191204/file4220.dat to 
> ozone-test-reproduce-123/./20191204/file4220.dat: An error occurred (500) 
> when calling the PutObject operation (reached max retries: 4): Internal 
> Server Error
>  Process Process-492:
>  S3UploadFailedError: Failed to upload ./20191204/file5523.dat to 
> ozone-test-reproduce-123/./20191204/file5523.dat: An error occurred (500) 
> when calling the PutObject operation (reached max retries: 4): Internal 
> Server Error
>  Process Process-493:
>  S3UploadFailedError: Failed to upload ./20191204/file7520.dat to 
> ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) 
> when calling the PutObject operation (reached max retries: 4): Internal 
> Server Error
> That is a partial list; note that all the keys are different. I also tried 
> restarting the OM process on the previous leader OM, but it doesn't help 
> since the leader has changed. Partial OM logs are attached below:
>  2020-02-12 14:57:11,128 [IPC Server handler 72 on 9862] INFO 
> org.apache.hadoop.ipc.Server: IPC Server handler 72 on 9862, call Call#4859 
> Retry#0 
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
> 9.134.50.210:36561
>  org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not 
> the leader. Suggested leader is OM:om2.
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
>  at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
>  at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  2020-02-12 14:57:11,918 [IPC Server handler 159 on 9862] INFO 
> org.apache.hadoop.ipc.Server: IPC Server handler 159 on 9862, call Call#4864 
> Retry#0 
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
> 9.134.50.210:36561
>  org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not 
> the leader. Suggested leader is OM:om2.
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
>  at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
>  at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  2020-02-12 14:57:15,395 [IPC Server handler 23 on 9862] INFO 
> org.apache.hadoop.ipc.Server: IPC Server handler 23 on 9862, call Call#4869 
> Retry#0 
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
> 9.134.50.210:36561
>  org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not 
> the leader. Suggested leader is OM:om2.
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
>  at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
>  at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
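>  
> For context, OMNotLeaderException is the expected reply from a follower OM; 
> the client-side failover logic is supposed to catch it and retry against the 
> suggested leader (the behavior HDDS-3031 later fixed). A minimal sketch of 
> that retry idea (illustrative names; not the actual Ozone client code):
> {code:java}
> // Hypothetical sketch of OM client failover: on a not-leader reply,
> // switch to the suggested leader (or the next node) and retry.
> import java.io.IOException;
> import java.util.List;
> import java.util.Map;
> 
> class OmFailoverSketch {
>   static class NotLeaderException extends IOException {
>     final String suggestedLeader;  // e.g. "om2"; may be null
>     NotLeaderException(String suggestedLeader) {
>       this.suggestedLeader = suggestedLeader;
>     }
>   }
> 
>   interface OmProxy {
>     String submitRequest(String request) throws IOException;
>   }
> 
>   static String submitWithFailover(Map<String, OmProxy> proxies,
>       List<String> nodeIds, String request, int maxRetries)
>       throws IOException {
>     String current = nodeIds.get(0);
>     for (int attempt = 0; attempt <= maxRetries; attempt++) {
>       try {
>         return proxies.get(current).submitRequest(request);
>       } catch (NotLeaderException e) {
>         // Fail over to the suggested leader; if none was suggested,
>         // round-robin to the next configured OM node.
>         current = e.suggestedLeader != null ? e.suggestedLeader
>             : nodeIds.get((nodeIds.indexOf(current) + 1) % nodeIds.size());
>       }
>     }
>     throw new IOException("No OM leader after " + maxRetries + " retries");
>   }
> }
> {code}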
>  
>  
> The ozone-site.xml config used to enable OM HA is attached below:
> <property>
>   <name>ozone.om.service.ids</name>
>   <value>OMHA</value>
> </property>
> <property>
>   <name>ozone.om.nodes.OMHA</name>
>   <value>om1,om2,om3</value>
> </property>
> <property>
>   <name>ozone.om.node.id</name>
>   <value>om1</value>
> </property>
> <property>
>   <name>ozone.om.address.OMHA.om1</name>
>   <value>9.134.50.210:9862</value>
> </property>
> <property>
>   <name>ozone.om.address.OMHA.om2</name>
>   <value>9.134.51.215:9862</value>
> </property>
> <property>
>   <name>ozone.om.address.OMHA.om3</name>
>   <value>9.134.51.25:9862</value>
> </property>
> <property>
>   <name>ozone.om.ratis.enable</name>
>   <value>true</value>
> </property>
> <property>
>   <name>ozone.enabled</name>
>   <value>true</value>
>   <tag>OZONE, REQUIRED</tag>
>   <description>
>     Status of the Ozone Object Storage service is enabled.
>     Set to true to enable Ozone.
>     Set to false to disable Ozone.
>     Unless this value is set to true, Ozone services will not be started in
>     the cluster.
>     Please note: By default ozone is disabled on a hadoop cluster.
>   </description>
> </property>
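>  
> For reference, a small sketch of how these HA keys resolve the service id to 
> node addresses (simplified; the class is hypothetical, not the actual Ozone 
> config code):
> {code:java}
> // Illustrative walk over the OM HA keys defined in the
> // ozone-site.xml above (parsing deliberately simplified).
> import java.util.HashMap;
> import java.util.Map;
> 
> class OmHaConfigSketch {
>   public static void main(String[] args) {
>     Map<String, String> conf = new HashMap<>();
>     conf.put("ozone.om.service.ids", "OMHA");
>     conf.put("ozone.om.nodes.OMHA", "om1,om2,om3");
>     conf.put("ozone.om.address.OMHA.om1", "9.134.50.210:9862");
>     conf.put("ozone.om.address.OMHA.om2", "9.134.51.215:9862");
>     conf.put("ozone.om.address.OMHA.om3", "9.134.51.25:9862");
> 
>     String serviceId = conf.get("ozone.om.service.ids");
>     for (String node : conf.get("ozone.om.nodes." + serviceId).split(",")) {
>       // A client tries these endpoints in turn until it finds the leader.
>       System.out.println(node + " -> "
>           + conf.get("ozone.om.address." + serviceId + "." + node));
>     }
>   }
> }
> {code}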


