[ https://issues.apache.org/jira/browse/HDDS-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated HDDS-4511:
----------------------------
    Description: 
This improvement was inspired by fixing TestDeleteWithSlowFollower, which is 
broken in HDDS-2823.

 

In the test case TestDeleteWithSlowFollower, the following trace appears in the 
log:
{code:java}
2020-11-24 19:32:13,551 [EventQueue-StaleNodeForStaleNodeHandler] INFO  node.StaleNodeHandler (StaleNodeHandler.java:onMessage(58)) - Datanode 132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null} moved to stale state. Finalizing its pipelines [PipelineID=6f0e173c-b5e2-4dc6-99e1-854aafdc8295, PipelineID=c78bc2fb-dca1-4e09-ba71-dd824e2d4e73]
2020-11-24 19:32:13,552 [EventQueue-StaleNodeForStaleNodeHandler] INFO  pipeline.SCMPipelineManager (PipelineManagerV2Impl.java:closePipeline(389)) - Pipeline Pipeline[ Id: 6f0e173c-b5e2-4dc6-99e1-854aafdc8295, Nodes: 132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}46a77559-9d5c-4a1d-bad7-e7eb7b9c32da{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}524fea63-ad85-4a3a-bcfb-ac40dfe3d5e7{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:OPEN, leaderId:46a77559-9d5c-4a1d-bad7-e7eb7b9c32da, CreationTimestamp2020-11-24T11:30:23.805Z] moved to CLOSED state
{code}
 

But by the design of this test case, the stale node handler should not take effect:
{code:java}
// Make the stale, dead and server failure timeout higher so that a dead
// node is not detected at SCM as well as the pipeline close action
// never gets initiated early at Datanode in the test.{code}
 

This test case relies on ReplicationManager to close the OPEN container in SCM, 
so that SCM won't hold the delete blocks command. 

ReplicationManager can send out the close container command either because the 
container is OPEN but under-replicated, or because it is OPEN but has a CLOSED 
replica.

Since the default interval of RM is 5m, the test case actually relies on the "it 
is an OPEN container but under-replicated" condition to avoid triggering the 
stale node handler.

 

But the command is never sent, because 
ReplicationManager#isContainerUnderReplicated does not consider OPEN containers; 
it only takes care of CLOSED and QUASI_CLOSED containers.

 

After talking with [~Sammi]: by design, we should avoid replicating containers in 
the DELETING or DELETED state. ReplicationManager#isContainerUnderReplicated 
should consider OPEN containers.
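
The proposed behaviour can be illustrated with a minimal sketch. This is not the actual ReplicationManager code: the class name, the State enum, and the replica-count parameters are hypothetical stand-ins. It assumes that only DELETING and DELETED containers are skipped, so an OPEN container with too few replicas counts as under-replicated.

```java
import java.util.EnumSet;
import java.util.Set;

final class UnderReplicationSketch {

  // Hypothetical stand-in for the container states named in this issue.
  enum State { OPEN, QUASI_CLOSED, CLOSED, DELETING, DELETED }

  // Assumption: only containers being deleted are exempt from the check.
  private static final Set<State> SKIPPED =
      EnumSet.of(State.DELETING, State.DELETED);

  // Returns true when the container should be treated as under-replicated.
  static boolean isUnderReplicated(State state, int healthyReplicas,
      int requiredReplicas) {
    if (SKIPPED.contains(state)) {
      return false;
    }
    // OPEN, QUASI_CLOSED and CLOSED are all compared against the factor.
    return healthyReplicas < requiredReplicas;
  }

  public static void main(String[] args) {
    // An OPEN container with 2 of 3 replicas, as in the slow-follower test.
    System.out.println(isUnderReplicated(State.OPEN, 2, 3));
    // A container in DELETING state is skipped regardless of replica count.
    System.out.println(isUnderReplicated(State.DELETING, 1, 3));
  }
}
```

With a check shaped like this, the OPEN container in TestDeleteWithSlowFollower would be reported as under-replicated once a replica is missing, which is the condition the test relies on for the close container command.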

  was:
This improvement was inspired by fixing TestDeleteWithSlowFollower, which is 
broken in HDDS-2823.

 

In the test case TestDeleteWithSlowFollower, the following trace appears in the 
log:
{code:java}
2020-11-24 19:32:13,551 [EventQueue-StaleNodeForStaleNodeHandler] INFO  node.StaleNodeHandler (StaleNodeHandler.java:onMessage(58)) - Datanode 132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null} moved to stale state. Finalizing its pipelines [PipelineID=6f0e173c-b5e2-4dc6-99e1-854aafdc8295, PipelineID=c78bc2fb-dca1-4e09-ba71-dd824e2d4e73]
2020-11-24 19:32:13,552 [EventQueue-StaleNodeForStaleNodeHandler] INFO  pipeline.SCMPipelineManager (PipelineManagerV2Impl.java:closePipeline(389)) - Pipeline Pipeline[ Id: 6f0e173c-b5e2-4dc6-99e1-854aafdc8295, Nodes: 132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}46a77559-9d5c-4a1d-bad7-e7eb7b9c32da{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}524fea63-ad85-4a3a-bcfb-ac40dfe3d5e7{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:OPEN, leaderId:46a77559-9d5c-4a1d-bad7-e7eb7b9c32da, CreationTimestamp2020-11-24T11:30:23.805Z] moved to CLOSED state
{code}
 

But by the design of this test case, the stale node handler should not take effect:
{code:java}
// Make the stale, dead and server failure timeout higher so that a dead
// node is not detected at SCM as well as the pipeline close action
// never gets initiated early at Datanode in the test.{code}
 

This test case relies on ReplicationManager to close the OPEN container in SCM, 
so that SCM won't hold the delete blocks command. 

But the command is never sent, because 
ReplicationManager#isContainerUnderReplicated does not consider OPEN containers; 
it only takes care of CLOSED and QUASI_CLOSED containers.

 

After talking with [~Sammi]: by design, we should avoid replicating containers in 
the DELETING or DELETED state. ReplicationManager#isContainerUnderReplicated 
should consider OPEN containers.


> ReplicationManager#isContainerUnderReplicated should consider OPEN container
> ----------------------------------------------------------------------------
>
>                 Key: HDDS-4511
>                 URL: https://issues.apache.org/jira/browse/HDDS-4511
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: SCM
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Major
>              Labels: pull-request-available
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
