[
https://issues.apache.org/jira/browse/HDDS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784290#comment-16784290
]
Elek, Marton commented on HDDS-1102:
------------------------------------
Thanks [~nilotpalnandi] to report this issue.
I can reproduce it with local docker compose, with the following steps:
1. start ozone docker-compose cluster and scale datanodes up to 5 nodes
2. stop one datanode from the 1 active RATIS/THREE pipeline
3. Wait until the pipline is switched to CLOSING state (ozone scmcli
listPipelines)
4. restart datanode
The following exception is visible from datanode:
{code}
2019-03-05 09:58:46 INFO LeaderElection:230 -
394d0a0c-ee64-45ba-9dfa-e77cf6e4be8a got exception when requesting votes: {}
java.util.concurrent.ExecutionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL:
3be08b6f-deb1-429d-b627-feb993a44bed: group-2F66FA0CA189 not found.
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at
org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
at
org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
at
org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
INTERNAL: 3be08b6f-deb1-429d-b627-feb993a44bed: group-2F66FA0CA189 not found.
at
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
at
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
at
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
at
org.apache.ratis.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:265)
at
org.apache.ratis.grpc.server.GrpcServerProtocolClient.requestVote(GrpcServerProtocolClient.java:83)
at
org.apache.ratis.grpc.server.GrpcService.requestVote(GrpcService.java:187)
at
org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
2019-03-05 09:58:46 INFO LeaderElection:230 -
394d0a0c-ee64-45ba-9dfa-e77cf6e4be8a got exception when requesting votes: {}
java.util.concurrent.ExecutionException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL:
a4d4854c-1cbc-458a-9383-5a44e6383c14: group-2F66FA0CA189 not found.
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at
org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
at
org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
at
org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
INTERNAL: a4d4854c-1cbc-458a-9383-5a44e6383c14: group-2F66FA0CA189 not found.
at
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
at
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
at
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
at
org.apache.ratis.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:265)
at
org.apache.ratis.grpc.server.GrpcServerProtocolClient.requestVote(GrpcServerProtocolClient.java:83)
at
org.apache.ratis.grpc.server.GrpcService.requestVote(GrpcService.java:187)
at
org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
{code}
The test was continued:
A new pipeline is created successfully before the restart.
I killed other datanodes until I got a new pipline which included the restarted
datanode. I found that it worked very well: I started freon and checked the
data directory of the restarted datanode. Everything was right.
[~msingh]: Please confirm, but as far as I understood, it's normal. The
original pipline was removed, and the original raft ring can't be found by the
restarted datanode.
The only thing what we can do is adjusting the error message. It could be WARN
and we can add some more info that it could be caused by pipeline recreation.
> docker datanode stopped when new datanodes are added to the cluster
> --------------------------------------------------------------------
>
> Key: HDDS-1102
> URL: https://issues.apache.org/jira/browse/HDDS-1102
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Nilotpal Nandi
> Assignee: Elek, Marton
> Priority: Major
> Labels: test-badlands
> Attachments: allnode.log, datanode.log
>
>
> steps taken:
> --------------------
> # created 5 datanode cluster.
> # shutdown 2 datanodes
> # started the datanodes again.
> One of the datanodes was shut down.
> exception seen :
>
> {noformat}
> 2019-02-14 07:37:26 INFO LeaderElection:230 -
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 got exception when requesting votes: {}
> java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL:
> a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at
> org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
> at
> org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
> at org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
> Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> INTERNAL: a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
> at
> org.apache.ratis.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:265)
> at
> org.apache.ratis.grpc.server.GrpcServerProtocolClient.requestVote(GrpcServerProtocolClient.java:83)
> at org.apache.ratis.grpc.server.GrpcService.requestVote(GrpcService.java:187)
> at
> org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2019-02-14 07:37:26 INFO LeaderElection:46 -
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8: Election PASSED; received 1 response(s)
> [6a0522ba-019e-4b77-ac1f-a9322cd525b8<-61ad3bf3-e9b1-48e5-90e3-3b78c8b5bba5#0:OK-t7]
> and 1 exception(s); 6a0522ba-019e-4b77-ac1f-a9322cd525b8:t7, leader=null,
> voted=6a0522ba-019e-4b77-ac1f-a9322cd525b8,
> raftlog=6a0522ba-019e-4b77-ac1f-a9322cd525b8-SegmentedRaftLog:OPENED, conf=3:
> [61ad3bf3-e9b1-48e5-90e3-3b78c8b5bba5:172.20.0.8:9858,
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8:172.20.0.6:9858,
> 0f377918-aafa-4d8a-972a-6ead54048fba:172.20.0.3:9858], old=null
> 2019-02-14 07:37:26 INFO LeaderElection:52 - 0:
> java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL:
> a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
> 2019-02-14 07:37:26 INFO RoleInfo:130 - 6a0522ba-019e-4b77-ac1f-a9322cd525b8:
> shutdown LeaderElection
> 2019-02-14 07:37:26 INFO RaftServerImpl:161 -
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 changes role from CANDIDATE to LEADER at
> term 7 for changeToLeader
> 2019-02-14 07:37:26 INFO RaftServerImpl:258 -
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8: change Leader from null to
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 at term 7 for becomeLeader, leader
> elected after 1066ms
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 -
> raft.server.staging.catchup.gap = 1000 (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - raft.server.rpc.sleep.time
> = 25ms (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - raft.server.watch.timeout
> = 10s (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 -
> raft.server.watch.timeout.denomination = 1s (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 -
> raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 -
> raft.server.log.appender.buffer.byte-limit = 33554432 (custom)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 -
> raft.server.log.appender.buffer.element-limit = 1 (custom)
> 2019-02-14 07:37:26 INFO GrpcConfigKeys$Server:43 -
> raft.grpc.server.leader.outstanding.appends.max = 128 (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 -
> raft.server.rpc.request.timeout = 3000ms (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 -
> raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 -
> raft.server.log.appender.buffer.byte-limit = 33554432 (custom)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 -
> raft.server.log.appender.buffer.element-limit = 1 (custom)
> 2019-02-14 07:37:26 INFO GrpcConfigKeys$Server:43 -
> raft.grpc.server.leader.outstanding.appends.max = 128 (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 -
> raft.server.rpc.request.timeout = 3000ms (default)
> 2019-02-14 07:37:26 INFO RoleInfo:139 - 6a0522ba-019e-4b77-ac1f-a9322cd525b8:
> start LeaderState
> 2019-02-14 07:37:26 INFO RaftLogWorker:303 -
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8-RaftLogWorker: Rolling segment log-3_4
> to index:4
> 2019-02-14 07:37:26 INFO RaftLogWorker:403 -
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8-RaftLogWorker: Rolled log segment from
> /data/metadata/ratis/134f574e-c1b0-4556-a206-fd6fa533f1fb/current/log_inprogress_3
> to /data/metadata/ratis/134f574e-c1b0-4556-a206-fd6fa533f1fb/current/log_3-4
> 2019-02-14 07:37:26 INFO RaftServerImpl:354 -
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8: set configuration 5:
> [61ad3bf3-e9b1-48e5-90e3-3b78c8b5bba5:172.20.0.8:9858,
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8:172.20.0.6:9858,
> 0f377918-aafa-4d8a-972a-6ead54048fba:172.20.0.3:9858], old=null at 5
> /opt/starter.sh: line 162: 13 Killed $@
>
>
> {noformat}
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]