[ 
https://issues.apache.org/jira/browse/HDDS-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446550#comment-17446550
 ] 

Attila Doroszlai commented on HDDS-3907:
----------------------------------------

HDDS-2679 added logic for datanodes to try to send addGroup to their peers.
In the following case, however, both dn1 and dn2 failed to call addGroup on
dn3, each with a {{Network closed for unknown reason}} error.  So we still
have intermittent pipeline creation failure.

{code:title=dn1}
datanode_1  | 2021-11-19 11:21:49,455 [Command processor thread] WARN 
commandhandler.CreatePipelineCommandHandler: Add group failed for 
a037aa30-dd3a-4a73-adfa-2538587f989d{ip: 172.18.0.2, host: 
ozonesecure_datanode_3.ozonesecure_default, ports: [REPLICATION=9886, 
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
networkLocation: /default-rack, certSerialId: null, persistedOpState: 
IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
datanode_1  | java.io.IOException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: 
Network closed for unknown reason
...
datanode_1  | 2021-11-19 11:21:50,322 [Command processor thread] INFO 
commandhandler.CreatePipelineCommandHandler: Created Pipeline RATIS THREE 
PipelineID=16b880f4-5397-4ba9-bbca-677dec1bf190.
{code}

{code:title=dn2}
datanode_2  | 2021-11-19 11:21:51,828 [Command processor thread] WARN 
commandhandler.CreatePipelineCommandHandler: Add group failed for 
a037aa30-dd3a-4a73-adfa-2538587f989d{ip: 172.18.0.2, host: 
ozonesecure_datanode_3.ozonesecure_default, ports: [REPLICATION=9886, 
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
networkLocation: /default-rack, certSerialId: null, persistedOpState: 
IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
datanode_2  | java.io.IOException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: 
Network closed for unknown reason
...
datanode_2  | 2021-11-19 11:21:52,105 [Command processor thread] INFO 
commandhandler.CreatePipelineCommandHandler: Created Pipeline RATIS THREE 
PipelineID=16b880f4-5397-4ba9-bbca-677dec1bf190.
{code}

{code:title=dn3}
datanode_3  | 2021-11-19 11:21:54,562 [grpc-default-executor-0] WARN 
server.GrpcServerProtocolService: a037aa30-dd3a-4a73-adfa-2538587f989d: Failed 
requestVote 
74c8606f-7b25-4333-8b31-ee6fd2ae88e4->a037aa30-dd3a-4a73-adfa-2538587f989d#0
datanode_3  | org.apache.ratis.protocol.exceptions.GroupMismatchException: 
a037aa30-dd3a-4a73-adfa-2538587f989d: group-677DEC1BF190 not found.
...
datanode_3  | 2021-11-19 11:22:17,359 [Command processor thread] INFO 
server.RaftServer: a037aa30-dd3a-4a73-adfa-2538587f989d: addNew 
group-677DEC1BF190:[a037aa30-dd3a-4a73-adfa-2538587f989d|rpc:172.18.0.2:9856|admin:172.18.0.2:9857|client:172.18.0.2:9858|priority:0,
 
74c8606f-7b25-4333-8b31-ee6fd2ae88e4|rpc:172.18.0.7:9856|admin:172.18.0.7:9857|client:172.18.0.7:9858|priority:1,
 
c52c6f5a-f9c6-495e-a9fa-5aa778f0f9cb|rpc:172.18.0.3:9856|admin:172.18.0.3:9857|client:172.18.0.3:9858|priority:0]
 returns group-677DEC1BF190:java.util.concurrent.CompletableFuture@3b8c4205[Not 
completed]
datanode_3  | 2021-11-19 11:22:17,361 [pool-23-thread-1] INFO 
server.RaftServer$Division: a037aa30-dd3a-4a73-adfa-2538587f989d: new 
RaftServerImpl for 
group-677DEC1BF190:[a037aa30-dd3a-4a73-adfa-2538587f989d|rpc:172.18.0.2:9856|admin:172.18.0.2:9857|client:172.18.0.2:9858|priority:0,
 
74c8606f-7b25-4333-8b31-ee6fd2ae88e4|rpc:172.18.0.7:9856|admin:172.18.0.7:9857|client:172.18.0.7:9858|priority:1,
 
c52c6f5a-f9c6-495e-a9fa-5aa778f0f9cb|rpc:172.18.0.3:9856|admin:172.18.0.3:9857|client:172.18.0.3:9858|priority:0]
 with ContainerStateMachine:uninitialized
{code}

(The logs above are from a [PR 
run|https://github.com/apache/ozone/runs/4263414069], but it also [happens on 
{{master}}|https://github.com/elek/ozone-build-results/tree/master/2021/11/18/11570/acceptance-secure].)
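
To make the ordering of events easier to see, here is a minimal sketch (not part of Ozone; the {{EVENTS}} list, {{parse}} and {{timeline}} are hypothetical names used only for illustration) that sorts the timestamps copied from the three log excerpts above and prints each event's offset from the first one. The event labels are informal summaries of the log lines, not exact messages.

```python
from datetime import datetime

# Timestamps copied from the dn1/dn2/dn3 log excerpts in this comment.
# The third field is an informal summary, not the literal log message.
EVENTS = [
    ("dn1", "2021-11-19 11:21:49,455", "addGroup to dn3 failed"),
    ("dn1", "2021-11-19 11:21:50,322", "Created Pipeline"),
    ("dn2", "2021-11-19 11:21:51,828", "addGroup to dn3 failed"),
    ("dn2", "2021-11-19 11:21:52,105", "Created Pipeline"),
    ("dn3", "2021-11-19 11:21:54,562", "requestVote failed: group not found"),
    ("dn3", "2021-11-19 11:22:17,359", "addNew group"),
]


def parse(ts):
    """Parse a log4j-style timestamp such as '2021-11-19 11:21:49,455'."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


def timeline(events):
    """Return (node, seconds since first event, summary), sorted by time."""
    ordered = sorted(events, key=lambda e: parse(e[1]))
    start = parse(ordered[0][1])
    return [
        (node, round((parse(ts) - start).total_seconds(), 3), what)
        for node, ts, what in ordered
    ]


if __name__ == "__main__":
    for node, offset, what in timeline(EVENTS):
        print(f"+{offset:7.3f}s  {node}  {what}")
```

Read this way, dn3 only logs {{addNew}} for group-677DEC1BF190 roughly 28 seconds after dn1's first failed addGroup attempt, and after it has already rejected a requestVote for that group.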

> Intermittent failure in writing data in acceptance test
> -------------------------------------------------------
>
>                 Key: HDDS-3907
>                 URL: https://issues.apache.org/jira/browse/HDDS-3907
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Marton Elek
>            Priority: Blocker
>
> Examples:
> https://github.com/elek/ozone-build-results/tree/master/2020/06/30/1318/acceptance
> https://github.com/elek/ozone-build-results/tree/master/2020/06/30/1321/acceptance
> https://github.com/elek/ozone-build-results/tree/master/2020/06/30/1334/acceptance
> Some strange errors:
> {code}
> scm_1         | 2020-06-30 19:17:50,787 [RatisPipelineUtilsThread] ERROR 
> pipeline.SCMPipelineManager: Failed to create pipeline of type RATIS and 
> factor ONE. Exception: Cannot create pipeline of factor 1 using 0 nodes. Used 
> 6 nodes. Healthy nodes 6
> scm_1         | 2020-06-30 19:17:50,788 [RatisPipelineUtilsThread] ERROR 
> pipeline.SCMPipelineManager: Failed to create pipeline of type RATIS and 
> factor THREE. Exception: Pipeline creation failed because nodes are engaged 
> in other pipelines and every node can only be engaged in max 2 pipelines. 
> Required 3. Found 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
