[
https://issues.apache.org/jira/browse/HDDS-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609032#comment-16609032
]
Shashikant Banerjee edited comment on HDDS-420 at 9/10/18 11:24 AM:
--------------------------------------------------------------------
From Datanode logs:
{code:java}
2018-09-09 10:32:00,636 INFO org.apache.ratis.server.storage.RaftLogWorker: new
bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9-RaftLogWorker for Storage Directory
/data/disk1/ozone/meta/ratis/group-7347726F7570
2018-09-09 10:32:02,696 INFO org.apache.ratis.server.impl.RaftServerImpl:
bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9 changes role from
org.apache.ratis.server.impl.RoleInfo@42b3e7c2 to FOLLOWER at term 0 for
startInitializing
2018-09-09 10:32:02,698 INFO org.apache.ratis.util.JmxRegister: Successfully
registered JMX Bean with object name
Ratis:service=RaftServer,group=group-7347726F7570,id=bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9{code}
Initial GroupId is *group-7347726F7570.*
Later, a reinitialization call from SCM reaches the Ratis server. This changes
the groupId of the leader to *group-2041ABBEE452*:
{code:java}
2018-09-09 10:49:40,209 INFO org.apache.ratis.server.impl.RaftServerProxy:
bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9: reinitializeAsync
ReinitializeRequest(client-DFE3ACF394F9->bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9)
in group-7347726F7570, cid=3, seq=0 RW, null,
group-2041ABBEE452:[bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9:172.27.12.96:9858,
faa888b7-92bb-4e35-a38c-711bd1c28948:172.27.80.23:9858,
ff544de8-96ea-4097-8cdc-460ac1c60db7:172.27.23.161:9858]{code}
The groupId changed on this node, but not on the other servers in the group.
The next request from the leader to a follower therefore fails with a
GroupMismatchException:
{code:java}
2018-09-09 10:49:41,382 INFO org.apache.ratis.conf.ConfUtils:
raft.server.rpc.request.timeout = 3000 ms (custom)
2018-09-09 10:49:41,522 WARN org.apache.ratis.server.impl.LogAppender:
bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9: Failed appendEntries to
ff544de8-96ea-4097-8cdc-460ac1c60db7:172.27.23.161:9858
org.apache.ratis.protocol.GroupMismatchException:
ff544de8-96ea-4097-8cdc-460ac1c60db7:
The group (group-2041ABBEE452) of bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9 does not
match the group (group-7347726F7570) of the server
ff544de8-96ea-4097-8cdc-460ac1c60db7
{code}
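The sequence seen in these logs can be modelled with a small sketch. This is only an illustration under stated assumptions: the Server class, its reinitialize and appendEntries methods, and the shortened server ids are hypothetical stand-ins, not the Ratis API. It shows how changing the group on one node alone makes the next append to an unchanged follower fail.

```java
public class GroupMismatchSketch {

    /** Hypothetical stand-in for a Raft server; this is not the Ratis API. */
    public static final class Server {
        private final String id;
        private String groupId;

        public Server(String id, String groupId) {
            this.id = id;
            this.groupId = groupId;
        }

        /** Switches only this server to the new group. */
        public void reinitialize(String newGroupId) {
            this.groupId = newGroupId;
        }

        /** Rejects an append from a leader whose group differs from this server's group. */
        public void appendEntries(Server leader) {
            if (!leader.groupId.equals(groupId)) {
                throw new IllegalStateException(
                    "The group (" + leader.groupId + ") of " + leader.id
                        + " does not match the group (" + groupId + ") of the server " + id);
            }
        }
    }

    /** Returns "OK" if the follower accepts the append, else the mismatch message. */
    public static String appendOutcome(Server leader, Server follower) {
        try {
            follower.appendEntries(leader);
            return "OK";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Both servers start in the initial group (ids shortened for the sketch).
        Server leader = new Server("bfe9c5f2", "group-7347726F7570");
        Server follower = new Server("ff544de8", "group-7347726F7570");
        System.out.println(appendOutcome(leader, follower));

        // Only the leader is reinitialized, so the follower now rejects it.
        leader.reinitialize("group-2041ABBEE452");
        System.out.println(appendOutcome(leader, follower));

        // Once the follower is moved to the same group, appends are accepted again.
        follower.reinitialize("group-2041ABBEE452");
        System.out.println(appendOutcome(leader, follower));
    }
}
```

In this model the append is accepted again only after the follower is moved to the new group as well, which is consistent with the followers never having received the new groupId.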
The issue appears to lie in Ratis reinitialization: only the leader seems to
have the updated groupId, not the followers.
cc [~msingh]
> putKey failing with KEY_ALLOCATION_ERROR
> ----------------------------------------
>
> Key: HDDS-420
> URL: https://issues.apache.org/jira/browse/HDDS-420
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Manager
> Reporter: Nilotpal Nandi
> Assignee: Shashikant Banerjee
> Priority: Blocker
> Fix For: 0.2.1
>
>
> Here are the commands that were run:
> {noformat}
> [root@ctr-e138-1518143905142-468367-01-000002 bin]# ./ozone oz -putKey
> /fs-volume/fs-bucket/nn1 -file /etc/passwd
> 2018-09-09 15:39:31,131 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> Create key failed, error:KEY_ALLOCATION_ERROR
> [root@ctr-e138-1518143905142-468367-01-000002 bin]#
> [root@ctr-e138-1518143905142-468367-01-000002 bin]# ./ozone fs -copyFromLocal
> /etc/passwd /
> 2018-09-09 15:40:16,879 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2018-09-09 15:40:23,632 [main] ERROR - Try to allocate more blocks for write
> failed, already allocated 0 blocks for this write.
> copyFromLocal: Message missing required fields: keyLocation
> [root@ctr-e138-1518143905142-468367-01-000002 bin]# ./ozone oz -putKey
> /fs-volume/fs-bucket/nn2 -file /etc/passwd
> 2018-09-09 15:44:55,912 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> Create key failed, error:KEY_ALLOCATION_ERROR{noformat}
>
> hadoop version :
> ---------------------------
> {noformat}
> [root@ctr-e138-1518143905142-468367-01-000002 bin]# ./hadoop version
> Hadoop 3.2.0-SNAPSHOT
> Source code repository git://git.apache.org/hadoop.git -r
> bf8a1750e99cfbfa76021ce51b6514c74c06f498
> Compiled by root on 2018-09-08T10:22Z
> Compiled with protoc 2.5.0
> From source with checksum c5bbb375aed8edabd89c377af83189d
> This command was run using
> /root/hadoop_trunk/ozone-0.3.0-SNAPSHOT/share/hadoop/common/hadoop-common-3.2.0-SNAPSHOT.jar{noformat}
>
> scm log :
> ---------------
> {noformat}
> 2018-09-09 15:45:00,907 INFO
> org.apache.hadoop.hdds.scm.pipelines.ratis.RatisManagerImpl: Allocating a new
> ratis pipeline of size: 3 id: pipelineId=f210716d-ba7b-4adf-91d6-da286e5fd010
> 2018-09-09 15:45:00,973 INFO org.apache.ratis.conf.ConfUtils: raft.rpc.type =
> GRPC (default)
> 2018-09-09 15:45:01,007 INFO org.apache.ratis.conf.ConfUtils:
> raft.grpc.message.size.max = 33554432 (custom)
> 2018-09-09 15:45:01,011 INFO org.apache.ratis.conf.ConfUtils:
> raft.client.rpc.retryInterval = 300 ms (default)
> 2018-09-09 15:45:01,012 INFO org.apache.ratis.conf.ConfUtils:
> raft.client.async.outstanding-requests.max = 100 (default)
> 2018-09-09 15:45:01,012 INFO org.apache.ratis.conf.ConfUtils:
> raft.client.async.scheduler-threads = 3 (default)
> 2018-09-09 15:45:01,020 INFO org.apache.ratis.conf.ConfUtils:
> raft.grpc.flow.control.window = 1MB (=1048576) (default)
> 2018-09-09 15:45:01,020 INFO org.apache.ratis.conf.ConfUtils:
> raft.grpc.message.size.max = 33554432 (custom)
> 2018-09-09 15:45:01,102 INFO org.apache.ratis.conf.ConfUtils:
> raft.client.rpc.request.timeout = 3000 ms (default)
> 2018-09-09 15:45:01,667 ERROR org.apache.hadoop.hdds.scm.XceiverClientRatis:
> Failed to reinitialize
> RaftPeer:bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9:172.27.12.96:9858 datanode:
> bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9{ip: 172.27.12.96, host:
> ctr-e138-1518143905142-468367-01-000007.hwx.site}
> org.apache.ratis.protocol.GroupMismatchException:
> bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9: The group (group-7347726F7570) of
> client-409D68EB500F does not match the group (group-2041ABBEE452) of the
> server bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at
> org.apache.ratis.util.ReflectionUtils.instantiateException(ReflectionUtils.java:222)
> at
> org.apache.ratis.grpc.RaftGrpcUtil.tryUnwrapException(RaftGrpcUtil.java:79)
> at org.apache.ratis.grpc.RaftGrpcUtil.unwrapException(RaftGrpcUtil.java:67)
> at
> org.apache.ratis.grpc.client.RaftClientProtocolClient.blockingCall(RaftClientProtocolClient.java:127)
> at
> org.apache.ratis.grpc.client.RaftClientProtocolClient.reinitialize(RaftClientProtocolClient.java:102)
> at
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:77)
> at
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:302)
> at
> org.apache.ratis.client.impl.RaftClientImpl.reinitialize(RaftClientImpl.java:216)
> at
> org.apache.hadoop.hdds.scm.XceiverClientRatis.reinitialize(XceiverClientRatis.java:163)
> at
> org.apache.hadoop.hdds.scm.XceiverClientRatis.reinitialize(XceiverClientRatis.java:133)
> at
> org.apache.hadoop.hdds.scm.XceiverClientRatis.createPipeline(XceiverClientRatis.java:97)
> at
> org.apache.hadoop.hdds.scm.pipelines.ratis.RatisManagerImpl.initializePipeline(RatisManagerImpl.java:105)
> at
> org.apache.hadoop.hdds.scm.pipelines.PipelineSelector.getReplicationPipeline(PipelineSelector.java:303)
> at
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.allocateContainer(ContainerStateManager.java:299)
> at
> org.apache.hadoop.hdds.scm.container.ContainerMapping.allocateContainer(ContainerMapping.java:289)
> at
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.preAllocateContainers(BlockManagerImpl.java:167)
> at
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:266)
> at
> org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143)
> at
> org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74)
> at
> org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6271)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: INTERNAL:
> bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9: The group (group-7347726F7570) of
> client-409D68EB500F does not match the group (group-2041ABBEE452) of the
> server bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9
> at
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
> at
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:203)
> at
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:132)
> at
> org.apache.ratis.shaded.proto.grpc.AdminProtocolServiceGrpc$AdminProtocolServiceBlockingStub.reinitialize(AdminProtocolServiceGrpc.java:220)
> at
> org.apache.ratis.grpc.client.RaftClientProtocolClient.lambda$reinitialize$1(RaftClientProtocolClient.java:104)
> at
> org.apache.ratis.grpc.client.RaftClientProtocolClient.blockingCall(RaftClientProtocolClient.java:125)
> ... 24 more{noformat}
>