[jira] [Updated] (HDDS-4511) ReplicationManager#isContainerUnderReplicated should consider OPEN container

Glen Geng (Jira) Wed, 25 Nov 2020 04:35:34 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Glen Geng updated HDDS-4511:
----------------------------
    Description: 
This improvement is inspired by fixing TestDeleteWithSlowFollower, which is 
broken in HDDS-2823.

 

In the test case TestDeleteWithSlowFollower, the following trace appears in 
the log:
{code:java}
2020-11-24 19:32:13,551 [EventQueue-StaleNodeForStaleNodeHandler] INFO  
node.StaleNodeHandler (StaleNodeHandler.java:onMessage(58)) - Datanode 
132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, 
networkLocation: /default-rack, certSerialId: null} moved to stale state. 
Finalizing its pipelines [PipelineID=6f0e173c-b5e2-4dc6-99e1-854aafdc8295, 
PipelineID=c78bc2fb-dca1-4e09-ba71-dd824e2d4e73]
2020-11-24 19:32:13,552 
[EventQueue-StaleNodeForStaleNodeHandler] INFO  pipeline.SCMPipelineManager 
(PipelineManagerV2Impl.java:closePipeline(389)) - Pipeline Pipeline[ Id: 
6f0e173c-b5e2-4dc6-99e1-854aafdc8295, Nodes: 
132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, 
networkLocation: /default-rack, certSerialId: 
null}46a77559-9d5c-4a1d-bad7-e7eb7b9c32da{ip: 10.73.23.64, host: 10.73.23.64, 
networkLocation: /default-rack, certSerialId: 
null}524fea63-ad85-4a3a-bcfb-ac40dfe3d5e7{ip: 10.73.23.64, host: 10.73.23.64, 
networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, 
State:OPEN, leaderId:46a77559-9d5c-4a1d-bad7-e7eb7b9c32da, 
CreationTimestamp2020-11-24T11:30:23.805Z] moved to CLOSED state

{code}
 

But by design of this test case, the stale node handler should not take effect:
{code:java}
// Make the stale, dead and server failure timeout higher so that a dead
// node is not detected at SCM as well as the pipeline close action
// never gets initiated early at Datanode in the test.{code}
 

This test case relies on ReplicationManager to close the OPEN container in SCM, 
so that SCM won't hold the delete blocks command. 

But the command disappears, because 
ReplicationManager#isContainerUnderReplicated does not consider OPEN 
containers; it only takes care of CLOSED and QUASI_CLOSED containers.

 

After talking with [~Sammi]: by design, we should avoid replicating containers 
in the DELETING or DELETED state. ReplicationManager#isContainerUnderReplicated 
should consider OPEN containers.
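
To make the proposed rule concrete, here is a minimal, self-contained sketch. 
It is not the actual ReplicationManager code: the State enum, the class name 
and the three-argument signature are hypothetical simplifications, and the 
replica count is passed in directly instead of being derived from replica 
reports and in-flight actions.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch, not the real Ozone class.
class UnderReplicationSketch {

  // Simplified stand-in for the container lifecycle states named in this issue.
  enum State { OPEN, QUASI_CLOSED, CLOSED, DELETING, DELETED }

  // States that should never be replicated, per the discussion above.
  private static final Set<State> NEVER_REPLICATE =
      EnumSet.of(State.DELETING, State.DELETED);

  // Returns true when the container has fewer replicas than required.
  // OPEN containers are considered; only DELETING and DELETED are skipped.
  static boolean isContainerUnderReplicated(
      State state, int requiredReplicas, int currentReplicas) {
    if (NEVER_REPLICATE.contains(state)) {
      return false;
    }
    return requiredReplicas > currentReplicas;
  }

  public static void main(String[] args) {
    // OPEN container with 2 of 3 replicas: under-replicated.
    System.out.println(isContainerUnderReplicated(State.OPEN, 3, 2));    // true
    // DELETED container is skipped regardless of its replica count.
    System.out.println(isContainerUnderReplicated(State.DELETED, 3, 1)); // false
    // CLOSED container with all 3 replicas: not under-replicated.
    System.out.println(isContainerUnderReplicated(State.CLOSED, 3, 3));  // false
  }
}
```

The sketch expresses the check as an exclusion list (DELETING, DELETED) rather 
than an inclusion list (CLOSED, QUASI_CLOSED), which is one way to make OPEN 
containers eligible.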

  was:
This improvement is inspired from the fixing of TestDeleteWithSlowFollower in 
the broken HDDS-2823.

In the test case TestDeleteWithSlowFollower, there is following trace appearing 
in the log

 
{code:java}
2020-11-24 19:32:13,551 [EventQueue-StaleNodeForStaleNodeHandler] INFO  
node.StaleNodeHandler (StaleNodeHandler.java:onMessage(58)) - Datanode 
132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, 
networkLocation: /default-rack, certSerialId: null} moved to stale state. 
Finalizing its pipelines [PipelineID=6f0e173c-b5e2-4dc6-99e1-854aafdc8295, 
PipelineID=c78bc2fb-dca1-4e09-ba71-dd824e2d4e73]2020-11-24 19:32:13,552 
[EventQueue-StaleNodeForStaleNodeHandler] INFO  pipeline.SCMPipelineManager 
(PipelineManagerV2Impl.java:closePipeline(389)) - Pipeline Pipeline[ Id: 
6f0e173c-b5e2-4dc6-99e1-854aafdc8295, Nodes: 
132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, 
networkLocation: /default-rack, certSerialId: 
null}46a77559-9d5c-4a1d-bad7-e7eb7b9c32da{ip: 10.73.23.64, host: 10.73.23.64, 
networkLocation: /default-rack, certSerialId: 
null}524fea63-ad85-4a3a-bcfb-ac40dfe3d5e7{ip: 10.73.23.64, host: 10.73.23.64, 
networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, 
State:OPEN, leaderId:46a77559-9d5c-4a1d-bad7-e7eb7b9c32da, 
CreationTimestamp2020-11-24T11:30:23.805Z] moved to CLOSED state

{code}
 

 

But by design of this case, 

 
{code:java}
// Make the stale, dead and server failure timeout higher so that a dead
// node is not detecte at SCM as well as the pipeline close action
// never gets initiated early at Datanode in the test.{code}
 

 

It relies on ReplicationManager to close the OPEN container in SCM, so that SCM 
won't hold the delete blocks command. 

But the command disappears, since ReplicationManager#isContainerUnderReplicated 
does not consider OPEN container, it only take care of CLOSED and QUASI_CLOSED 
container.

 

After talked with [~Sammi], By design, we should avoid replicating container in 
DELETING or DELETED state. ReplicationManager#isContainerUnderReplicated should 
consider OPEN container

 


> ReplicationManager#isContainerUnderReplicated should consider OPEN container
> ----------------------------------------------------------------------------
>
>                 Key: HDDS-4511
>                 URL: https://issues.apache.org/jira/browse/HDDS-4511
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: SCM
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Priority: Major
>
> This improvement is inspired by fixing TestDeleteWithSlowFollower, which is 
> broken in HDDS-2823.
>  
> In the test case TestDeleteWithSlowFollower, the following trace appears in 
> the log:
> {code:java}
> 2020-11-24 19:32:13,551 [EventQueue-StaleNodeForStaleNodeHandler] INFO  
> node.StaleNodeHandler (StaleNodeHandler.java:onMessage(58)) - Datanode 
> 132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, 
> networkLocation: /default-rack, certSerialId: null} moved to stale state. 
> Finalizing its pipelines [PipelineID=6f0e173c-b5e2-4dc6-99e1-854aafdc8295, 
> PipelineID=c78bc2fb-dca1-4e09-ba71-dd824e2d4e73]
> 2020-11-24 19:32:13,552 
> [EventQueue-StaleNodeForStaleNodeHandler] INFO  pipeline.SCMPipelineManager 
> (PipelineManagerV2Impl.java:closePipeline(389)) - Pipeline Pipeline[ Id: 
> 6f0e173c-b5e2-4dc6-99e1-854aafdc8295, Nodes: 
> 132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, 
> networkLocation: /default-rack, certSerialId: 
> null}46a77559-9d5c-4a1d-bad7-e7eb7b9c32da{ip: 10.73.23.64, host: 10.73.23.64, 
> networkLocation: /default-rack, certSerialId: 
> null}524fea63-ad85-4a3a-bcfb-ac40dfe3d5e7{ip: 10.73.23.64, host: 10.73.23.64, 
> networkLocation: /default-rack, certSerialId: null}, Type:RATIS, 
> Factor:THREE, State:OPEN, leaderId:46a77559-9d5c-4a1d-bad7-e7eb7b9c32da, 
> CreationTimestamp2020-11-24T11:30:23.805Z] moved to CLOSED state
> {code}
>  
> But by design of this test case, the stale node handler should not take effect:
> {code:java}
> // Make the stale, dead and server failure timeout higher so that a dead
> // node is not detected at SCM as well as the pipeline close action
> // never gets initiated early at Datanode in the test.{code}
>  
> This test case relies on ReplicationManager to close the OPEN container in 
> SCM, so that SCM won't hold the delete blocks command. 
> But the command disappears, because 
> ReplicationManager#isContainerUnderReplicated does not consider OPEN 
> containers; it only takes care of CLOSED and QUASI_CLOSED containers.
>  
> After talking with [~Sammi]: by design, we should avoid replicating 
> containers in the DELETING or DELETED state. 
> ReplicationManager#isContainerUnderReplicated should consider OPEN containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

