[ 
https://issues.apache.org/jira/browse/HDDS-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609032#comment-16609032
 ] 

Shashikant Banerjee edited comment on HDDS-420 at 9/10/18 11:24 AM:
--------------------------------------------------------------------

From the Datanode logs:
{code:java}
2018-09-09 10:32:00,636 INFO org.apache.ratis.server.storage.RaftLogWorker: new 
bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9-RaftLogWorker for Storage Directory 
/data/disk1/ozone/meta/ratis/group-7347726F7570

2018-09-09 10:32:02,696 INFO org.apache.ratis.server.impl.RaftServerImpl: 
bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9 changes role from 
org.apache.ratis.server.impl.RoleInfo@42b3e7c2 to FOLLOWER at term 0 for 
startInitializing

2018-09-09 10:32:02,698 INFO org.apache.ratis.util.JmxRegister: Successfully 
registered JMX Bean with object name 
Ratis:service=RaftServer,group=group-7347726F7570,id=bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9{code}
The initial groupId is *group-7347726F7570*.

Later, a reinitialization call from SCM reaches the Ratis server. This changes 
the groupId on the leader to *group-2041ABBEE452*:
{code:java}
2018-09-09 10:49:40,209 INFO org.apache.ratis.server.impl.RaftServerProxy: 
bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9: reinitializeAsync 
ReinitializeRequest(client-DFE3ACF394F9->bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9) 
in group-7347726F7570, cid=3, seq=0 RW, null, 
group-2041ABBEE452:[bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9:172.27.12.96:9858, 
faa888b7-92bb-4e35-a38c-711bd1c28948:172.27.80.23:9858, 
ff544de8-96ea-4097-8cdc-460ac1c60db7:172.27.23.161:9858]{code}
The groupId did change on this node, but it did not change on the other servers 
in the group. As a result, the next request from the leader to a follower fails 
with a GroupMismatchException:
{code:java}
2018-09-09 10:49:41,382 INFO org.apache.ratis.conf.ConfUtils: 
raft.server.rpc.request.timeout = 3000 ms (custom)

2018-09-09 10:49:41,522 WARN org.apache.ratis.server.impl.LogAppender: 
bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9: Failed appendEntries to 
ff544de8-96ea-4097-8cdc-460ac1c60db7:172.27.23.161:9858

org.apache.ratis.protocol.GroupMismatchException: 
ff544de8-96ea-4097-8cdc-460ac1c60db7: 
The group (group-2041ABBEE452) of bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9 does not 
match the group (group-7347726F7570) of the server 
ff544de8-96ea-4097-8cdc-460ac1c60db7
{code}
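When comparing several of these failures, it can help to pull the two group IDs and the two peer IDs out of the exception text. The following is a minimal sketch in Python, not part of Ratis or Ozone: the regular expression is inferred only from the messages quoted in this thread, and the names `MISMATCH` and `parse_mismatch` are hypothetical.

```python
import re

# Pattern inferred from the exception messages quoted in this thread;
# it is an assumption, not taken from the Ratis source.
MISMATCH = re.compile(
    r"The group \((?P<caller_group>group-[0-9A-F]+)\) of (?P<caller>\S+) "
    r"does not match the group \((?P<server_group>group-[0-9A-F]+)\) of the "
    r"server (?P<server>\S+)"
)


def parse_mismatch(text):
    """Return caller/server ids and groups, or None if no mismatch is found."""
    # The log lines are wrapped in the mail, so collapse all whitespace first.
    flat = " ".join(text.split())
    match = MISMATCH.search(flat)
    return match.groupdict() if match else None


sample = """org.apache.ratis.protocol.GroupMismatchException:
ff544de8-96ea-4097-8cdc-460ac1c60db7:
The group (group-2041ABBEE452) of bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9 does not
match the group (group-7347726F7570) of the server
ff544de8-96ea-4097-8cdc-460ac1c60db7"""

info = parse_mismatch(sample)
print(info)
```

For the Datanode message above, `caller_group` is the new group on the reinitialized node and `server_group` is the old group still held by the follower.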
Clearly, the issue lies with reinitialization in Ratis, where it seems that 
only the leader has the updated groupId and not the followers.
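The suspected failure mode can be illustrated with a toy model. This is only a sketch under stated assumptions: `Peer`, `GroupMismatch`, and `rejected_by` are hypothetical names, the peer IDs are shortened forms of the UUIDs in the logs, and the group check is a simplification of whatever Ratis actually does. It shows that if only one peer switches to the new group, the other peers reject its requests, whereas switching all peers avoids the mismatch.

```python
class GroupMismatch(Exception):
    """Stand-in for the GroupMismatchException seen in the logs."""


class Peer:
    def __init__(self, peer_id, group_id):
        self.peer_id = peer_id
        self.group_id = group_id

    def reinitialize(self, new_group_id):
        # Only this peer's view of the group changes.
        self.group_id = new_group_id

    def append_entries(self, sender):
        # Simplified check: reject a sender that belongs to a different group.
        if sender.group_id != self.group_id:
            raise GroupMismatch(
                f"The group ({sender.group_id}) of {sender.peer_id} does not "
                f"match the group ({self.group_id}) of the server {self.peer_id}"
            )
        return True


OLD, NEW = "group-7347726F7570", "group-2041ABBEE452"
PEER_IDS = ["bfe9c5f2", "faa888b7", "ff544de8"]  # shortened ids from the logs


def rejected_by(reinitialized, leader="bfe9c5f2"):
    """Return the followers that reject the leader after reinitialization."""
    peers = {pid: Peer(pid, OLD) for pid in PEER_IDS}
    for pid in reinitialized:
        peers[pid].reinitialize(NEW)
    rejected = []
    for pid, peer in peers.items():
        if pid == leader:
            continue
        try:
            peer.append_entries(peers[leader])
        except GroupMismatch:
            rejected.append(pid)
    return rejected


print(rejected_by(["bfe9c5f2"]))  # only the leader was reinitialized
print(rejected_by(PEER_IDS))      # every peer was reinitialized
```

In this model both followers reject the leader in the first case; the logs above only confirm the rejection by ff544de8, so the behaviour of the third peer is an assumption of the sketch.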

cc [~msingh]

 



> putKey failing with KEY_ALLOCATION_ERROR
> ----------------------------------------
>
>                 Key: HDDS-420
>                 URL: https://issues.apache.org/jira/browse/HDDS-420
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Manager
>            Reporter: Nilotpal Nandi
>            Assignee: Shashikant Banerjee
>            Priority: Blocker
>             Fix For: 0.2.1
>
>
> Here are the commands that were run:
> {noformat}
> [root@ctr-e138-1518143905142-468367-01-000002 bin]# ./ozone oz -putKey 
> /fs-volume/fs-bucket/nn1 -file /etc/passwd
> 2018-09-09 15:39:31,131 WARN util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> Create key failed, error:KEY_ALLOCATION_ERROR
> [root@ctr-e138-1518143905142-468367-01-000002 bin]#
> [root@ctr-e138-1518143905142-468367-01-000002 bin]# ./ozone fs -copyFromLocal 
> /etc/passwd /
> 2018-09-09 15:40:16,879 WARN util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> 2018-09-09 15:40:23,632 [main] ERROR - Try to allocate more blocks for write 
> failed, already allocated 0 blocks for this write.
> copyFromLocal: Message missing required fields: keyLocation
> [root@ctr-e138-1518143905142-468367-01-000002 bin]# ./ozone oz -putKey 
> /fs-volume/fs-bucket/nn2 -file /etc/passwd
> 2018-09-09 15:44:55,912 WARN util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> Create key failed, error:KEY_ALLOCATION_ERROR{noformat}
>  
> hadoop version :
> ---------------------------
> {noformat}
> [root@ctr-e138-1518143905142-468367-01-000002 bin]# ./hadoop version
> Hadoop 3.2.0-SNAPSHOT
> Source code repository git://git.apache.org/hadoop.git -r 
> bf8a1750e99cfbfa76021ce51b6514c74c06f498
> Compiled by root on 2018-09-08T10:22Z
> Compiled with protoc 2.5.0
> From source with checksum c5bbb375aed8edabd89c377af83189d
> This command was run using 
> /root/hadoop_trunk/ozone-0.3.0-SNAPSHOT/share/hadoop/common/hadoop-common-3.2.0-SNAPSHOT.jar{noformat}
>  
> scm log :
> ---------------
> {noformat}
> 2018-09-09 15:45:00,907 INFO 
> org.apache.hadoop.hdds.scm.pipelines.ratis.RatisManagerImpl: Allocating a new 
> ratis pipeline of size: 3 id: pipelineId=f210716d-ba7b-4adf-91d6-da286e5fd010
> 2018-09-09 15:45:00,973 INFO org.apache.ratis.conf.ConfUtils: raft.rpc.type = 
> GRPC (default)
> 2018-09-09 15:45:01,007 INFO org.apache.ratis.conf.ConfUtils: 
> raft.grpc.message.size.max = 33554432 (custom)
> 2018-09-09 15:45:01,011 INFO org.apache.ratis.conf.ConfUtils: 
> raft.client.rpc.retryInterval = 300 ms (default)
> 2018-09-09 15:45:01,012 INFO org.apache.ratis.conf.ConfUtils: 
> raft.client.async.outstanding-requests.max = 100 (default)
> 2018-09-09 15:45:01,012 INFO org.apache.ratis.conf.ConfUtils: 
> raft.client.async.scheduler-threads = 3 (default)
> 2018-09-09 15:45:01,020 INFO org.apache.ratis.conf.ConfUtils: 
> raft.grpc.flow.control.window = 1MB (=1048576) (default)
> 2018-09-09 15:45:01,020 INFO org.apache.ratis.conf.ConfUtils: 
> raft.grpc.message.size.max = 33554432 (custom)
> 2018-09-09 15:45:01,102 INFO org.apache.ratis.conf.ConfUtils: 
> raft.client.rpc.request.timeout = 3000 ms (default)
> 2018-09-09 15:45:01,667 ERROR org.apache.hadoop.hdds.scm.XceiverClientRatis: 
> Failed to reinitialize 
> RaftPeer:bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9:172.27.12.96:9858 datanode: 
> bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9{ip: 172.27.12.96, host: 
> ctr-e138-1518143905142-468367-01-000007.hwx.site}
> org.apache.ratis.protocol.GroupMismatchException: 
> bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9: The group (group-7347726F7570) of 
> client-409D68EB500F does not match the group (group-2041ABBEE452) of the 
> server bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>  at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>  at 
> org.apache.ratis.util.ReflectionUtils.instantiateException(ReflectionUtils.java:222)
>  at 
> org.apache.ratis.grpc.RaftGrpcUtil.tryUnwrapException(RaftGrpcUtil.java:79)
>  at org.apache.ratis.grpc.RaftGrpcUtil.unwrapException(RaftGrpcUtil.java:67)
>  at 
> org.apache.ratis.grpc.client.RaftClientProtocolClient.blockingCall(RaftClientProtocolClient.java:127)
>  at 
> org.apache.ratis.grpc.client.RaftClientProtocolClient.reinitialize(RaftClientProtocolClient.java:102)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:77)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:302)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.reinitialize(RaftClientImpl.java:216)
>  at 
> org.apache.hadoop.hdds.scm.XceiverClientRatis.reinitialize(XceiverClientRatis.java:163)
>  at 
> org.apache.hadoop.hdds.scm.XceiverClientRatis.reinitialize(XceiverClientRatis.java:133)
>  at 
> org.apache.hadoop.hdds.scm.XceiverClientRatis.createPipeline(XceiverClientRatis.java:97)
>  at 
> org.apache.hadoop.hdds.scm.pipelines.ratis.RatisManagerImpl.initializePipeline(RatisManagerImpl.java:105)
>  at 
> org.apache.hadoop.hdds.scm.pipelines.PipelineSelector.getReplicationPipeline(PipelineSelector.java:303)
>  at 
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.allocateContainer(ContainerStateManager.java:299)
>  at 
> org.apache.hadoop.hdds.scm.container.ContainerMapping.allocateContainer(ContainerMapping.java:289)
>  at 
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.preAllocateContainers(BlockManagerImpl.java:167)
>  at 
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:266)
>  at 
> org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143)
>  at 
> org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74)
>  at 
> org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6271)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: INTERNAL: 
> bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9: The group (group-7347726F7570) of 
> client-409D68EB500F does not match the group (group-2041ABBEE452) of the 
> server bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9
>  at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
>  at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:203)
>  at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:132)
>  at 
> org.apache.ratis.shaded.proto.grpc.AdminProtocolServiceGrpc$AdminProtocolServiceBlockingStub.reinitialize(AdminProtocolServiceGrpc.java:220)
>  at 
> org.apache.ratis.grpc.client.RaftClientProtocolClient.lambda$reinitialize$1(RaftClientProtocolClient.java:104)
>  at 
> org.apache.ratis.grpc.client.RaftClientProtocolClient.blockingCall(RaftClientProtocolClient.java:125)
>  ... 24 more{noformat}
>  


