[
https://issues.apache.org/jira/browse/HDDS-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446550#comment-17446550
]
Attila Doroszlai commented on HDDS-3907:
----------------------------------------
HDDS-2679 added logic for datanodes to try to send addGroup to their peers.
But in the following case both dn1 and dn2 failed to call addGroup on dn3,
with a {{Network closed for unknown reason}} error, so pipeline creation
still fails intermittently.
{code:title=dn1}
datanode_1 | 2021-11-19 11:21:49,455 [Command processor thread] WARN
commandhandler.CreatePipelineCommandHandler: Add group failed for
a037aa30-dd3a-4a73-adfa-2538587f989d{ip: 172.18.0.2, host:
ozonesecure_datanode_3.ozonesecure_default, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default-rack, certSerialId: null, persistedOpState:
IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
datanode_1 | java.io.IOException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE:
Network closed for unknown reason
...
datanode_1 | 2021-11-19 11:21:50,322 [Command processor thread] INFO
commandhandler.CreatePipelineCommandHandler: Created Pipeline RATIS THREE
PipelineID=16b880f4-5397-4ba9-bbca-677dec1bf190.
{code}
{code:title=dn2}
datanode_2 | 2021-11-19 11:21:51,828 [Command processor thread] WARN
commandhandler.CreatePipelineCommandHandler: Add group failed for
a037aa30-dd3a-4a73-adfa-2538587f989d{ip: 172.18.0.2, host:
ozonesecure_datanode_3.ozonesecure_default, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default-rack, certSerialId: null, persistedOpState:
IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
datanode_2 | java.io.IOException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE:
Network closed for unknown reason
...
datanode_2 | 2021-11-19 11:21:52,105 [Command processor thread] INFO
commandhandler.CreatePipelineCommandHandler: Created Pipeline RATIS THREE
PipelineID=16b880f4-5397-4ba9-bbca-677dec1bf190.
{code}
{code:title=dn3}
datanode_3 | 2021-11-19 11:21:54,562 [grpc-default-executor-0] WARN
server.GrpcServerProtocolService: a037aa30-dd3a-4a73-adfa-2538587f989d: Failed
requestVote
74c8606f-7b25-4333-8b31-ee6fd2ae88e4->a037aa30-dd3a-4a73-adfa-2538587f989d#0
datanode_3 | org.apache.ratis.protocol.exceptions.GroupMismatchException:
a037aa30-dd3a-4a73-adfa-2538587f989d: group-677DEC1BF190 not found.
...
datanode_3 | 2021-11-19 11:22:17,359 [Command processor thread] INFO
server.RaftServer: a037aa30-dd3a-4a73-adfa-2538587f989d: addNew
group-677DEC1BF190:[a037aa30-dd3a-4a73-adfa-2538587f989d|rpc:172.18.0.2:9856|admin:172.18.0.2:9857|client:172.18.0.2:9858|priority:0,
74c8606f-7b25-4333-8b31-ee6fd2ae88e4|rpc:172.18.0.7:9856|admin:172.18.0.7:9857|client:172.18.0.7:9858|priority:1,
c52c6f5a-f9c6-495e-a9fa-5aa778f0f9cb|rpc:172.18.0.3:9856|admin:172.18.0.3:9857|client:172.18.0.3:9858|priority:0]
returns group-677DEC1BF190:java.util.concurrent.CompletableFuture@3b8c4205[Not
completed]
datanode_3 | 2021-11-19 11:22:17,361 [pool-23-thread-1] INFO
server.RaftServer$Division: a037aa30-dd3a-4a73-adfa-2538587f989d: new
RaftServerImpl for
group-677DEC1BF190:[a037aa30-dd3a-4a73-adfa-2538587f989d|rpc:172.18.0.2:9856|admin:172.18.0.2:9857|client:172.18.0.2:9858|priority:0,
74c8606f-7b25-4333-8b31-ee6fd2ae88e4|rpc:172.18.0.7:9856|admin:172.18.0.7:9857|client:172.18.0.7:9858|priority:1,
c52c6f5a-f9c6-495e-a9fa-5aa778f0f9cb|rpc:172.18.0.3:9856|admin:172.18.0.3:9857|client:172.18.0.3:9858|priority:0]
with ContainerStateMachine:uninitialized
{code}
(The logs above are from a [PR
run|https://github.com/apache/ozone/runs/4263414069], but it also [happens on
{{master}}|https://github.com/elek/ozone-build-results/tree/master/2021/11/18/11570/acceptance-secure].)
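To put the timing in the logs above into numbers, here is a minimal sketch that computes how long after each failure dn3 logged {{addNew}} for the group. The class and method names ({{AddGroupTimeline}}, {{millisBetween}}) are made up for illustration and are not part of Ozone or Ratis; the timestamps are copied from the log excerpts. It only measures the gaps and does not say anything about the root cause.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Ozone: measures gaps between log timestamps.
public class AddGroupTimeline {
    // Timestamp layout used in the datanode logs, e.g. "2021-11-19 11:21:49,455".
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    // Milliseconds elapsed from one log timestamp to another.
    static long millisBetween(String from, String to) {
        return Duration.between(
            LocalDateTime.parse(from, LOG_TS),
            LocalDateTime.parse(to, LOG_TS)).toMillis();
    }

    public static void main(String[] args) {
        // dn3: "addNew group-677DEC1BF190" logged by the command processor thread.
        String dn3AddNew = "2021-11-19 11:22:17,359";

        // dn1: "Add group failed for a037aa30-..." (dn3's ID)
        System.out.println("dn1 addGroup failure -> dn3 addNew: "
            + millisBetween("2021-11-19 11:21:49,455", dn3AddNew) + " ms"); // 27904 ms
        // dn2: "Add group failed for a037aa30-..." (dn3's ID)
        System.out.println("dn2 addGroup failure -> dn3 addNew: "
            + millisBetween("2021-11-19 11:21:51,828", dn3AddNew) + " ms"); // 25531 ms
        // dn3: "Failed requestVote ... group-677DEC1BF190 not found"
        System.out.println("dn3 requestVote failure -> dn3 addNew: "
            + millisBetween("2021-11-19 11:21:54,562", dn3AddNew) + " ms"); // 22797 ms
    }
}
```

So in this run dn3 only added the group roughly 25-28 seconds after dn1 and dn2 had tried to add it on dn3's behalf, and about 23 seconds after it rejected a vote request for that group.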
> Intermittent failure in writing data in acceptance test
> -------------------------------------------------------
>
> Key: HDDS-3907
> URL: https://issues.apache.org/jira/browse/HDDS-3907
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Marton Elek
> Priority: Blocker
>
> Examples:
> https://github.com/elek/ozone-build-results/tree/master/2020/06/30/1318/acceptance
> https://github.com/elek/ozone-build-results/tree/master/2020/06/30/1321/acceptance
> https://github.com/elek/ozone-build-results/tree/master/2020/06/30/1334/acceptance
> Some strange errors:
> {code}
> scm_1 | 2020-06-30 19:17:50,787 [RatisPipelineUtilsThread] ERROR
> pipeline.SCMPipelineManager: Failed to create pipeline of type RATIS and
> factor ONE. Exception: Cannot create pipeline of factor 1 using 0 nodes. Used
> 6 nodes. Healthy nodes 6
> scm_1 | 2020-06-30 19:17:50,788 [RatisPipelineUtilsThread] ERROR
> pipeline.SCMPipelineManager: Failed to create pipeline of type RATIS and
> factor THREE. Exception: Pipeline creation failed because nodes are engaged
> in other pipelines and every node can only be engaged in max 2 pipelines.
> Required 3. Found 0
> {code}