[ 
https://issues.apache.org/jira/browse/HDDS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-9959:
------------------------------
    Summary: Propagate group remove to other datanodes during pipeline close  
(was: Propagate group remove to other datanodes in the pipeline during pipeline 
close)

> Propagate group remove to other datanodes during pipeline close
> ---------------------------------------------------------------
>
>                 Key: HDDS-9959
>                 URL: https://issues.apache.org/jira/browse/HDDS-9959
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: DN, Ozone Datanode
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> In https://issues.apache.org/jira/browse/RATIS-1947, it was found that the 
> close pipeline command for the same pipeline can be handled hours apart on 
> different datanodes. 
> {code:java}
> # dn1
> 2023-11-29 15:22:59,477 [Command processor thread] INFO 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler:
>  Close Pipeline PipelineID=23e46782-6b48-4559-b3ac-0f95993cf0bc command on 
> datanode 1669a7e6-fe3c-4f7e-8fcb-ec5d5027b0eb.
> # dn5
> 2023-11-29 14:07:55,442 [Command processor thread] INFO 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler:
>  Close Pipeline PipelineID=23e46782-6b48-4559-b3ac-0f95993cf0bc command on 
> datanode bd1e72ab-cfd5-4cc1-8fbf-6ec9d9654c98.
> # dn8
> 2023-11-29 16:57:53,894 [Command processor thread] INFO 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler:
>  Close Pipeline PipelineID=23e46782-6b48-4559-b3ac-0f95993cf0bc command on 
> datanode 4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc.{code}
> This can happen when the commandQueue of some datanodes holds many pending 
> commands, so the same command is handled earlier on one datanode than on 
> the others.
> Furthermore, the Ratis group remove operation is local to the Raft server 
> and is not propagated to the other Raft peers in the same group. As a result, 
> datanodes that have not yet processed the group remove keep 
> operating (e.g. sending RequestVote / AppendEntries RPCs), even though the 
> pipeline (Raft group) is supposed to be closed.
> Therefore, similar to CreatePipelineCommandHandler, the first datanode that 
> receives the close pipeline command should propagate the group remove 
> to the other datanodes (Raft peers) in the same pipeline. This 
> closes the pipeline on all the datanodes immediately. Close pipeline 
> commands that reach the other datanodes later are silently ignored, because 
> the pipeline has already been closed. 
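
The proposed behaviour can be sketched roughly as below. This is a minimal, self-contained simulation and not the Ozone or Ratis implementation: the class ClosePipelinePropagationSketch, the Datanode type and all of its methods are hypothetical names introduced only for illustration, and the Raft group is modelled as a plain set entry instead of a real Ratis group.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation: none of these types exist in Ozone or Ratis.
public class ClosePipelinePropagationSketch {

  /** Stand-in for a datanode that hosts Raft groups (pipelines). */
  public static final class Datanode {
    private final String name;
    private final Set<String> groups = new HashSet<>();
    private final List<Datanode> peers = new ArrayList<>();
    private int ignoredCommands = 0;

    Datanode(String name) {
      this.name = name;
    }

    public String name() {
      return name;
    }

    public boolean hostsGroup(String pipelineId) {
      return groups.contains(pipelineId);
    }

    public int ignoredCommands() {
      return ignoredCommands;
    }

    /** Local group remove only; returns false if the group is already gone. */
    boolean removeGroup(String pipelineId) {
      return groups.remove(pipelineId);
    }

    /** Handles a close pipeline command taken from this datanode's queue. */
    public void handleClosePipeline(String pipelineId) {
      if (!removeGroup(pipelineId)) {
        // A peer already propagated the remove: ignore the late command.
        ignoredCommands++;
        return;
      }
      // First datanode to handle the command: propagate the group remove
      // so the peers stop operating in the closed group right away.
      for (Datanode peer : peers) {
        peer.removeGroup(pipelineId);
      }
    }
  }

  /** Builds datanodes that all host the given pipeline and know each other. */
  public static List<Datanode> newPipeline(String pipelineId, String... names) {
    List<Datanode> nodes = new ArrayList<>();
    for (String n : names) {
      Datanode dn = new Datanode(n);
      dn.groups.add(pipelineId);
      nodes.add(dn);
    }
    for (Datanode dn : nodes) {
      for (Datanode other : nodes) {
        if (other != dn) {
          dn.peers.add(other);
        }
      }
    }
    return nodes;
  }

  public static void main(String[] args) {
    List<Datanode> dns = newPipeline("pipeline-1", "dn1", "dn5", "dn8");
    dns.get(1).handleClosePipeline("pipeline-1"); // dn5 handles it first
    dns.get(0).handleClosePipeline("pipeline-1"); // dn1 handles it much later
    for (Datanode dn : dns) {
      System.out.println(dn.name() + " hostsGroup=" + dn.hostsGroup("pipeline-1")
          + " ignored=" + dn.ignoredCommands());
    }
  }
}
```

In this sketch the datanode that handles the command first (dn5, as in the logs above) removes the group everywhere, so the commands that dn1 and dn8 dequeue later find no group and are counted as ignored instead of failing.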



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
