[ 
https://issues.apache.org/jira/browse/HDDS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-9959:
------------------------------
    Description: 
In https://issues.apache.org/jira/browse/RATIS-1947, it was found that 
datanodes in the same pipeline can handle the close pipeline command hours 
apart. 

 
{code:java}
# dn1
2023-11-29 15:22:59,477 [Command processor thread] INFO 
org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler:
 Close Pipeline PipelineID=23e46782-6b48-4559-b3ac-0f95993cf0bc command on 
datanode 1669a7e6-fe3c-4f7e-8fcb-ec5d5027b0eb.


# dn5
2023-11-29 14:07:55,442 [Command processor thread] INFO 
org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler:
 Close Pipeline PipelineID=23e46782-6b48-4559-b3ac-0f95993cf0bc command on 
datanode bd1e72ab-cfd5-4cc1-8fbf-6ec9d9654c98.


# dn8
2023-11-29 16:57:53,894 [Command processor thread] INFO 
org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler:
 Close Pipeline PipelineID=23e46782-6b48-4559-b3ac-0f95993cf0bc command on 
datanode 4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc.{code}
 

This might happen when some datanodes have a lot of commands pending in their 
commandQueue, so that the same command is handled much earlier on one datanode 
than on the others.
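
A minimal sketch of this effect, assuming each datanode drains its queue strictly in FIFO order. The class and method names (CommandQueueSketch, queueWithBacklog, handledPosition) are hypothetical and are not part of Ozone:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a datanode command queue; not the Ozone implementation.
class CommandQueueSketch {
  // Builds a FIFO queue holding `backlog` unrelated commands followed by `last`.
  static Deque<String> queueWithBacklog(int backlog, String last) {
    Deque<String> queue = new ArrayDeque<>();
    for (int i = 0; i < backlog; i++) {
      queue.add("other-" + i);
    }
    queue.add(last);
    return queue;
  }

  // Drains the queue in order and returns how many commands were handled up to
  // and including `target`, or -1 if `target` was never queued.
  static int handledPosition(Deque<String> queue, String target) {
    int handled = 0;
    while (!queue.isEmpty()) {
      handled++;
      if (queue.poll().equals(target)) {
        return handled;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    // Assumed backlogs: one datanode is idle, the other has 500 pending commands.
    System.out.println(handledPosition(queueWithBacklog(0, "closePipeline"), "closePipeline"));
    System.out.println(handledPosition(queueWithBacklog(500, "closePipeline"), "closePipeline"));
  }
}
```

The larger the backlog in front of the close pipeline command, the later that datanode handles it, which is consistent with the differing timestamps in the logs above.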

Furthermore, the Ratis group remove operation is local to the Raft server that 
executes it and is not propagated to the other Raft peers in the same group. 
Therefore, datanodes that have not yet received the group remove operation 
keep operating (e.g. sending RequestVote / AppendEntries RPCs), although the 
pipeline (Raft group) is supposed to be closed.

Therefore, similar to CreatePipelineCommandHandler, the first datanode that 
receives the close pipeline command needs to propagate the group remove command 
to the other datanodes (Raft peers) in the same pipeline. This closes the 
pipeline on all the datanodes immediately. Close pipeline commands that arrive 
later for the same pipeline are silently ignored by the datanodes, because the 
pipeline has already been closed. 
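
The proposal above can be sketched roughly as follows. This is an in-memory model only: SketchDatanode, ClosePipelineSketch and their methods are hypothetical names, and the direct method call on a peer stands in for whatever RPC the real handler would use to send the group remove.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a datanode and its Raft server.
// None of these names come from the Ozone or Ratis APIs.
class SketchDatanode {
  final String id;
  // Pipelines (Raft groups) this datanode is still serving.
  final Set<String> activeGroups = new HashSet<>();

  SketchDatanode(String id) {
    this.id = id;
  }

  // Local group remove. Returns false when the group is already gone.
  boolean removeGroupLocally(String pipelineId) {
    return activeGroups.remove(pipelineId);
  }

  // Handles a close pipeline command. Returns true if this call closed the
  // pipeline, false if the command was ignored because it was already closed.
  boolean handleClosePipeline(String pipelineId, List<SketchDatanode> peers) {
    if (!removeGroupLocally(pipelineId)) {
      return false; // a peer already propagated the remove to this node
    }
    for (SketchDatanode peer : peers) {
      if (peer != this) {
        peer.removeGroupLocally(pipelineId); // propagate to the other peers
      }
    }
    return true;
  }
}

class ClosePipelineSketch {
  // Builds datanodes that all serve the given pipeline.
  static List<SketchDatanode> newPipeline(String pipelineId, String... ids) {
    List<SketchDatanode> datanodes = new ArrayList<>();
    for (String id : ids) {
      SketchDatanode dn = new SketchDatanode(id);
      dn.activeGroups.add(pipelineId);
      datanodes.add(dn);
    }
    return datanodes;
  }

  public static void main(String[] args) {
    String pipeline = "23e46782-6b48-4559-b3ac-0f95993cf0bc";
    List<SketchDatanode> dns = newPipeline(pipeline, "dn1", "dn5", "dn8");
    // dn5 is the first to handle the command, as in the logs above.
    System.out.println("dn5 closed: " + dns.get(1).handleClosePipeline(pipeline, dns));
    // dn1 handles its own copy of the command later and ignores it.
    System.out.println("dn1 closed: " + dns.get(0).handleClosePipeline(pipeline, dns));
  }
}
```

In this model the first handler removes the group everywhere, so no peer is left sending Raft RPCs for a closed pipeline, and a late command becomes a no-op instead of an error.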

  was:
In https://issues.apache.org/jira/browse/RATIS-1947, it was found that there 
might be cases where Datanodes in the same pipeline are closed hours apart. 
# dn1
2023-11-29 15:22:59,477 [Command processor thread] INFO 
org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler:
 Close Pipeline PipelineID=23e46782-6b48-4559-b3ac-0f95993cf0bc command on 
datanode 1669a7e6-fe3c-4f7e-8fcb-ec5d5027b0eb.

# dn5
2023-11-29 14:07:55,442 [Command processor thread] INFO 
org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler:
 Close Pipeline PipelineID=23e46782-6b48-4559-b3ac-0f95993cf0bc command on 
datanode bd1e72ab-cfd5-4cc1-8fbf-6ec9d9654c98.

# dn8 
2023-11-29 16:57:53,894 [Command processor thread] INFO 
org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler:
 Close Pipeline PipelineID=23e46782-6b48-4559-b3ac-0f95993cf0bc command on 
datanode 4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc. 
This might happen when there are a lot of commands queues in some of the 
Datanode's commandQueue, causing some command to be handled earlier than the 
other.

Furthermore, Ratis group remove operation is only local to the Raft server and 
not propagated to the other Raft peers in the same group.

Therefore, similar to CreatePipelineCommand, whenever a datanode receives a 
pipeline close command, it also needs to propagate the group remove command to 
the other datanodes (Raft peers) in the same pipeline.


> Propagate close pipelines to other datanodes in the pipeline
> ------------------------------------------------------------
>
>                 Key: HDDS-9959
>                 URL: https://issues.apache.org/jira/browse/HDDS-9959
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: DN, Ozone Datanode
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
