[
https://issues.apache.org/jira/browse/HDDS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042971#comment-17042971
]
Li Cheng commented on HDDS-3004:
--------------------------------
On SCM, only one datanode is acting as a pipeline leader:
[root@VM_50_210_centos ~]# ./ozone-0.5.0-SNAPSHOT/bin/ozone scmcli datanode list
Datanode: 316c3dd3-470b-4ebc-a139-766e2f1b8593
(/default-rack/9.134.51.25/9.134.51.25/2 pipelines)
Related pipelines:
79d8bf21-fdc6-4fb7-ba70-7394195d13d4/THREE/RATIS/ALLOCATED/Follower
57276a4f-97b2-4474-a0c6-5308d349d2a2/THREE/RATIS/ALLOCATED/Follower
Datanode: 5ab5305d-f733-44e9-9dcd-ace391b5a9dc
(/default-rack/9.134.51.232/9.134.51.232/2 pipelines)
Related pipelines:
79d8bf21-fdc6-4fb7-ba70-7394195d13d4/THREE/RATIS/ALLOCATED/Follower
57276a4f-97b2-4474-a0c6-5308d349d2a2/THREE/RATIS/ALLOCATED/Follower
Datanode: 6da6b84b-3d8e-4309-ab28-7cc72b4e7293
(/default-rack/9.134.51.215/ozone.s3/2 pipelines)
Related pipelines:
79d8bf21-fdc6-4fb7-ba70-7394195d13d4/THREE/RATIS/ALLOCATED/Follower
57276a4f-97b2-4474-a0c6-5308d349d2a2/THREE/RATIS/ALLOCATED/Leader
Note that this cluster has multi-raft enabled, so there are multiple Factor
THREE Ratis pipelines. In this case there are two pipelines, and pipeline
57276a4f-97b2-4474-a0c6-5308d349d2a2 is the only one that has a leader.
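A quick way to sanity-check the output above is to count the Leader entries per pipeline. Below is a minimal, hypothetical helper (not part of Ozone) that assumes the exact `<pipelineId>/<factor>/<type>/<state>/<role>` line format printed by `ozone scmcli datanode list`:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts Leader entries per pipeline id in
// `ozone scmcli datanode list` output lines of the form shown above.
public class PipelineLeaderCount {
    public static Map<String, Integer> leadersPerPipeline(List<String> lines) {
        Map<String, Integer> leaders = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("/");
            // Expected form: <pipelineId>/<factor>/<type>/<state>/<role>
            if (parts.length == 5) {
                leaders.merge(parts[0], "Leader".equals(parts[4]) ? 1 : 0, Integer::sum);
            }
        }
        return leaders;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "79d8bf21-fdc6-4fb7-ba70-7394195d13d4/THREE/RATIS/ALLOCATED/Follower",
            "57276a4f-97b2-4474-a0c6-5308d349d2a2/THREE/RATIS/ALLOCATED/Follower",
            "79d8bf21-fdc6-4fb7-ba70-7394195d13d4/THREE/RATIS/ALLOCATED/Follower",
            "57276a4f-97b2-4474-a0c6-5308d349d2a2/THREE/RATIS/ALLOCATED/Follower",
            "79d8bf21-fdc6-4fb7-ba70-7394195d13d4/THREE/RATIS/ALLOCATED/Follower",
            "57276a4f-97b2-4474-a0c6-5308d349d2a2/THREE/RATIS/ALLOCATED/Leader");
        Map<String, Integer> counts = leadersPerPipeline(lines);
        // Pipeline 79d8bf21... never elected a leader; 57276a4f... has exactly one.
        System.out.println(counts.get("79d8bf21-fdc6-4fb7-ba70-7394195d13d4")); // 0
        System.out.println(counts.get("57276a4f-97b2-4474-a0c6-5308d349d2a2")); // 1
    }
}
```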
The datanode logs on the current leader (9.134.51.215) show the following for
pipeline 57276a4f-97b2-4474-a0c6-5308d349d2a2:
2020-02-24 00:24:16,077 [grpc-default-executor-7] WARN org.apache.ratis.server.impl.RaftServerProxy: 6da6b84b-3d8e-4309-ab28-7cc72b4e7293: Failed groupAdd GroupManagementRequest:client-7920D05E9944->6da6b84b-3d8e-4309-ab28-7cc72b4e7293@group-5308D349D2A2, cid=7, seq=0, RW, null, Add:group-5308D349D2A2:[5ab5305d-f733-44e9-9dcd-ace391b5a9dc:9.134.51.232:9858, 316c3dd3-470b-4ebc-a139-766e2f1b8593:9.134.51.25:9858, 6da6b84b-3d8e-4309-ab28-7cc72b4e7293:9.134.51.215:9858]
java.util.concurrent.CompletionException: org.apache.ratis.protocol.AlreadyExistsException: 6da6b84b-3d8e-4309-ab28-7cc72b4e7293: Failed to add group-5308D349D2A2:[5ab5305d-f733-44e9-9dcd-ace391b5a9dc:9.134.51.232:9858, 316c3dd3-470b-4ebc-a139-766e2f1b8593:9.134.51.25:9858, 6da6b84b-3d8e-4309-ab28-7cc72b4e7293:9.134.51.215:9858] since the group already exists in the map.
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:617)
at java.util.concurrent.CompletableFuture.thenApplyAsync(CompletableFuture.java:1993)
at org.apache.ratis.server.impl.RaftServerProxy.groupAddAsync(RaftServerProxy.java:379)
at org.apache.ratis.server.impl.RaftServerProxy.groupManagementAsync(RaftServerProxy.java:363)
at org.apache.ratis.grpc.server.GrpcAdminProtocolService.lambda$groupManagement$0(GrpcAdminProtocolService.java:42)
at org.apache.ratis.grpc.GrpcUtil.asyncCall(GrpcUtil.java:160)
at org.apache.ratis.grpc.server.GrpcAdminProtocolService.groupManagement(GrpcAdminProtocolService.java:42)
at org.apache.ratis.proto.grpc.AdminProtocolServiceGrpc$MethodHandlers.invoke(AdminProtocolServiceGrpc.java:358)
at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)
at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.ratis.protocol.AlreadyExistsException: 6da6b84b-3d8e-4309-ab28-7cc72b4e7293: Failed to add group-5308D349D2A2:[5ab5305d-f733-44e9-9dcd-ace391b5a9dc:9.134.51.232:9858, 316c3dd3-470b-4ebc-a139-766e2f1b8593:9.134.51.25:9858, 6da6b84b-3d8e-4309-ab28-7cc72b4e7293:9.134.51.215:9858] since the group already exists in the map.
at org.apache.ratis.server.impl.RaftServerProxy$ImplMap.addNew(RaftServerProxy.java:83)
at org.apache.ratis.server.impl.RaftServerProxy.groupAddAsync(RaftServerProxy.java:378)
... 13 more
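The trace above shows a retried groupAdd failing because the Raft group is already present on the datanode. The sketch below illustrates that failure mode and one possible client-side mitigation, an idempotent wrapper that treats "already exists" as success; the class and method names are illustrative, not the actual Ratis API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the failure mode above and an idempotent-add mitigation.
// Names here are illustrative, not the actual Ratis API.
public class GroupAddSketch {
    public static class AlreadyExistsException extends RuntimeException {
        public AlreadyExistsException(String m) { super(m); }
    }

    private final Map<String, Boolean> groups = new ConcurrentHashMap<>();

    // Mirrors the behavior of RaftServerProxy$ImplMap.addNew: throws if present.
    public void addNew(String groupId) {
        if (groups.putIfAbsent(groupId, Boolean.TRUE) != null) {
            throw new AlreadyExistsException("Failed to add " + groupId
                + " since the group already exists in the map");
        }
    }

    // Idempotent wrapper: a retried add for an existing group becomes a no-op.
    public boolean addIfAbsent(String groupId) {
        try {
            addNew(groupId);
            return true;   // newly created
        } catch (AlreadyExistsException e) {
            return false;  // already present; safe to ignore on a retry
        }
    }

    public static void main(String[] args) {
        GroupAddSketch s = new GroupAddSketch();
        System.out.println(s.addIfAbsent("group-5308D349D2A2")); // true
        System.out.println(s.addIfAbsent("group-5308D349D2A2")); // false, no exception
    }
}
```

With a wrapper like this, the retried GroupManagementRequest above would resolve quietly instead of completing exceptionally.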
> OM HA stability issues
> ----------------------
>
> Key: HDDS-3004
> URL: https://issues.apache.org/jira/browse/HDDS-3004
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: om
> Affects Versions: 0.4.0
> Reporter: Li Cheng
> Assignee: Bharat Viswanadham
> Priority: Blocker
>
> To summarize, the major issues that I found:
> 1. When I run a long-running s3g write workload against a cluster with OM HA
> and stop the OM leader to force a re-election, writing stops and can never
> recover.
> --updates 2020-02-20:
> https://issues.apache.org/jira/browse/HDDS-3031 fixes this issue.
>
> 2. If I force an OM re-election and restart SCM after that, the cluster
> cannot see any leader datanode and no datanodes are able to send pipeline
> reports, which makes the cluster unavailable as well. I consider this a
> multi-failover case where the leader OM and SCM are on the same node and a
> short outage happens on that node.
>
> --updates 2020-02-20:
> When you do a jar swap to a new version of Ozone and enable OM HA while
> keeping the same ozone-site.xml as before, and you have already written data
> into the previous Ozone cluster (so there are existing versions and metadata
> for OM and SCM), SCM cannot come up after the jar swap.
> Error logs (from the SCM out logs when the SCM process fails to start):
> PipelineID=aae4f728-82ef-4bbb-a0a5-7b3f2af030cc not found
>
> Original posting:
> Use the S3 gateway to keep writing data to a specific S3 gateway endpoint.
> After the writer starts working, I kill the OM process on the OM leader host.
> After that, the S3 gateway never accepts writes again and keeps reporting
> InternalError for every new key.
> Process Process-488:
> S3UploadFailedError: Failed to upload ./20191204/file1056.dat to ozone-test-reproduce-123/./20191204/file1056.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> Process Process-489:
> S3UploadFailedError: Failed to upload ./20191204/file9631.dat to ozone-test-reproduce-123/./20191204/file9631.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> Process Process-490:
> S3UploadFailedError: Failed to upload ./20191204/file7520.dat to ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> Process Process-491:
> S3UploadFailedError: Failed to upload ./20191204/file4220.dat to ozone-test-reproduce-123/./20191204/file4220.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> Process Process-492:
> S3UploadFailedError: Failed to upload ./20191204/file5523.dat to ozone-test-reproduce-123/./20191204/file5523.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> Process Process-493:
> S3UploadFailedError: Failed to upload ./20191204/file7520.dat to ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> That is a partial list; note that all keys are different. I also tried
> restarting the OM process on the previous leader OM, but it does not help
> since the leader has changed. Partial OM logs are also attached:
> 2020-02-12 14:57:11,128 [IPC Server handler 72 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 72 on 9862, call Call#4859 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
> at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
> at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> 2020-02-12 14:57:11,918 [IPC Server handler 159 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 159 on 9862, call Call#4864 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
> at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
> at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> 2020-02-12 14:57:15,395 [IPC Server handler 23 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 9862, call Call#4869 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
> at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
> at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
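The repeated OMNotLeaderException above carries a failover hint ("Suggested leader is OM:om2"), so a client that keeps hitting om1 is not following it. The sketch below shows the expected client behavior: parse the suggested leader from the message and retry there. The message format follows the log lines above; the submit function is a hypothetical stand-in for the real OM RPC call.

```java
import java.util.function.Function;

// Sketch of client-side failover on OMNotLeaderException.
// The exception and submit function are stand-ins, not the real OM client API.
public class OmFailoverSketch {
    public static class OMNotLeaderException extends RuntimeException {
        public OMNotLeaderException(String m) { super(m); }
    }

    // Extracts "om2" from "OM:om1 is not the leader. Suggested leader is OM:om2."
    public static String suggestedLeader(String message) {
        int i = message.lastIndexOf("OM:");
        return message.substring(i + 3).replace(".", "").trim();
    }

    public static String submitWithFailover(String node,
                                            Function<String, String> submit,
                                            int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return submit.apply(node);
            } catch (OMNotLeaderException e) {
                node = suggestedLeader(e.getMessage()); // follow the hint
            }
        }
        throw new RuntimeException("no leader reached after " + maxRetries + " retries");
    }

    public static void main(String[] args) {
        // om2 is the leader; any other node rejects with a hint, as in the logs.
        Function<String, String> submit = node -> {
            if (!"om2".equals(node)) {
                throw new OMNotLeaderException(
                    "OM:" + node + " is not the leader. Suggested leader is OM:om2.");
            }
            return "OK";
        };
        System.out.println(submitWithFailover("om1", submit, 4)); // OK
    }
}
```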
>
>
> The ozone-site.xml config used to enable OM HA is also attached:
> <property>
> <name>ozone.om.service.ids</name>
> <value>OMHA</value>
> </property>
> <property>
> <name>ozone.om.nodes.OMHA</name>
> <value>om1,om2,om3</value>
> </property>
> <property>
> <name>ozone.om.node.id</name>
> <value>om1</value>
> </property>
> <property>
> <name>ozone.om.address.OMHA.om1</name>
> <value>9.134.50.210:9862</value>
> </property>
> <property>
> <name>ozone.om.address.OMHA.om2</name>
> <value>9.134.51.215:9862</value>
> </property>
> <property>
> <name>ozone.om.address.OMHA.om3</name>
> <value>9.134.51.25:9862</value>
> </property>
> <property>
> <name>ozone.om.ratis.enable</name>
> <value>true</value>
> </property>
> <property>
> <name>ozone.enabled</name>
> <value>true</value>
> <tag>OZONE, REQUIRED</tag>
> <description>
> Status of the Ozone Object Storage service is enabled.
> Set to true to enable Ozone.
> Set to false to disable Ozone.
> Unless this value is set to true, Ozone services will not be started in
> the cluster.
> Please note: By default ozone is disabled on a hadoop cluster.
> </description>
> </property>
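For reference, the HA keys above compose as follows: ozone.om.service.ids names the service, ozone.om.nodes.&lt;serviceId&gt; lists the node ids, and each node id maps to an ozone.om.address.&lt;serviceId&gt;.&lt;nodeId&gt; key. The sketch below is illustrative only (the lookup is not Ozone code); property names follow the ozone-site.xml shown:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative-only resolution of the OM HA keys shown in ozone-site.xml above.
public class OmHaConfigSketch {
    public static Map<String, String> omAddresses(Map<String, String> conf) {
        String serviceId = conf.get("ozone.om.service.ids");            // "OMHA"
        Map<String, String> addresses = new LinkedHashMap<>();
        for (String nodeId : conf.get("ozone.om.nodes." + serviceId).split(",")) {
            addresses.put(nodeId,
                conf.get("ozone.om.address." + serviceId + "." + nodeId));
        }
        return addresses;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("ozone.om.service.ids", "OMHA");
        conf.put("ozone.om.nodes.OMHA", "om1,om2,om3");
        conf.put("ozone.om.address.OMHA.om1", "9.134.50.210:9862");
        conf.put("ozone.om.address.OMHA.om2", "9.134.51.215:9862");
        conf.put("ozone.om.address.OMHA.om3", "9.134.51.25:9862");
        System.out.println(omAddresses(conf).get("om2")); // 9.134.51.215:9862
    }
}
```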
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]