[ https://issues.apache.org/jira/browse/HDDS-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125986#comment-17125986 ]

Nanda kumar commented on HDDS-3481:
-----------------------------------

In clusters where the disks and network are slow, we should set 
{{hdds.scm.replication.event.timeout}} to a higher value. Increasing this property 
will avoid the issue; the right value depends on the performance of the cluster.

While increasing this value we should also keep the trade-off in mind: if the node 
that receives the replication command is unable to copy the container, or the node 
goes down, we delay re-replication, because the inflight replication is only timed 
out after {{hdds.scm.replication.event.timeout}}.
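
For example, on a cluster where a single container copy can take hours, the timeout 
could be raised in ozone-site.xml; the 30m below is only an illustration, the right 
value has to come from the copy times actually observed on the cluster:

{code:xml}
<property>
  <name>hdds.scm.replication.event.timeout</name>
  <!-- 10 minutes was not enough on the reported cluster; pick a value based on real copy times. -->
  <value>30m</value>
</property>
{code}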

If Container C1 is under-replicated and has only one replica, on Datanode D1:
 * C1 [D1]
 * SCM will send the replicate command to D3 and D4
 * If D3 and D4 receive the replicate command but are not able to process it 
(because of a bug, or because the node goes down)
 * SCM will not send the replicate command again for container C1 until the 
already sent commands time out ({{hdds.scm.replication.event.timeout}})
 * Container C1 will have only one replica during this time period (see the 
sketch below)
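
A minimal sketch of that bookkeeping, with simplified names instead of the real 
ReplicationManager classes: while the two commands are inflight they count towards 
the expected replica count, so SCM sends nothing new and C1 stays on a single 
datanode until the inflight entries expire.

{code:java}
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of SCM's inflight-replication bookkeeping (illustrative only).
class InflightReplicationSketch {
  static final int REPLICATION_FACTOR = 3;

  // containerId -> send times of replicate commands that have neither completed nor timed out
  private final Map<Long, List<Instant>> inflightReplication = new HashMap<>();
  private final Duration eventTimeout; // hdds.scm.replication.event.timeout

  InflightReplicationSketch(Duration eventTimeout) {
    this.eventTimeout = eventTimeout;
  }

  /** Returns how many new replicate commands SCM would send for the container right now. */
  int commandsToSend(long containerId, int currentReplicas, Instant now) {
    List<Instant> inflight =
        inflightReplication.computeIfAbsent(containerId, id -> new ArrayList<>());
    // Drop entries that have been inflight longer than the event timeout.
    inflight.removeIf(sentAt -> Duration.between(sentAt, now).compareTo(eventTimeout) > 0);

    int missing = REPLICATION_FACTOR - (currentReplicas + inflight.size());
    for (int i = 0; i < missing; i++) {
      inflight.add(now); // each new replicate command immediately counts as inflight
    }
    return Math.max(missing, 0);
  }
}
{code}

With one replica and the two unacknowledged commands inflight, {{commandsToSend}} 
returns 0 on every pass, which is exactly the window in which C1 has a single copy; 
if the commands failed silently, nothing shortens that window except the timeout.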

As Marton suggested, we can add a mechanism for the datanode to report the status of 
replicate/delete container commands back to SCM, but this will make the code more 
complex.
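
If we did add such feedback, a rough, hypothetical sketch of the SCM side could look 
like the following (the names below are made up for illustration and are not the 
actual Ozone command-status API): on a FAILED report the inflight entry is dropped 
immediately, so re-replication does not have to wait for the full event timeout.

{code:java}
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical datanode -> SCM feedback for replicate/delete container commands.
class CommandStatusSketch {

  enum CommandResult { IN_PROGRESS, COMPLETED, FAILED }

  // containerId -> datanodes that still have an outstanding replicate command
  private final Map<Long, List<UUID>> inflightReplication = new ConcurrentHashMap<>();

  void onStatusReport(long containerId, UUID datanode, CommandResult result) {
    if (result == CommandResult.IN_PROGRESS) {
      return; // the copy is still running, keep the inflight entry
    }
    // COMPLETED or FAILED: the command is no longer inflight. On FAILED the next
    // ReplicationManager pass can pick a new target immediately instead of waiting
    // for hdds.scm.replication.event.timeout to expire.
    List<UUID> targets = inflightReplication.get(containerId);
    if (targets != null) {
      targets.remove(datanode);
    }
  }
}
{code}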



[~yjxxtd], did increasing the {{hdds.scm.replication.event.timeout}} value 
solve this issue?

> SCM ask 31 datanodes to replicate the same container
> ----------------------------------------------------
>
>                 Key: HDDS-3481
>                 URL: https://issues.apache.org/jira/browse/HDDS-3481
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: SCM
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Critical
>              Labels: TriagePending
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>
> *What's the problem?*
> As the images show, SCM asked 31 datanodes to replicate container 2037, sending a 
> new replicate command every 10 minutes from 2020-04-17 23:38:51. Then at 
> 2020-04-18 08:58:52 SCM found that the replica count of container 2037 was 12, 
> and asked 11 datanodes to delete container 2037.
>  !screenshot-1.png! 
>  !screenshot-2.png! 
> *What's the reason?*
> SCM checks whether (container replica count + 
> inflightReplication.get(containerId).size() - 
> inflightDeletion.get(containerId).size()) is less than 3. If it is less than 3, 
> SCM asks some datanode to replicate the container and adds the action to 
> inflightReplication.get(containerId). The replicate action timeout is 10 
> minutes; when an action times out, SCM deletes it from 
> inflightReplication.get(containerId), as the image shows. Then (container 
> replica count + inflightReplication.get(containerId).size() - 
> inflightDeletion.get(containerId).size()) is less than 3 again, and SCM asks 
> another datanode to replicate the container.
> Because replicating a container takes a long time, it sometimes cannot finish 
> within 10 minutes, so 31 datanodes ended up replicating the container, one more 
> every 10 minutes. 19 of the 31 datanodes replicated the container from the same 
> source datanode, which also put heavy pressure on that source datanode and made 
> the replication even slower. In fact, the first replication took 4 hours to 
> finish. 
>  !screenshot-4.png! 
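
A back-of-the-envelope sketch of the feedback loop described in the quoted report, 
under the simplifying assumption that exactly one replicate command is inflight at a 
time and ignoring everything after the first copy completes:

{code:java}
import java.time.Duration;

// Rough model of the loop above: the first copy takes ~4 hours, but SCM re-issues
// the replicate command every time the previous one times out after 10 minutes.
public class ReplicationChurnSketch {
  public static void main(String[] args) {
    Duration eventTimeout = Duration.ofMinutes(10); // hdds.scm.replication.event.timeout
    Duration firstCopy = Duration.ofHours(4);       // time the first replication actually took

    // Every expired inflight entry makes the replica check fail again,
    // so yet another datanode is asked to replicate the same container.
    long redundantCommands = firstCopy.toMinutes() / eventTimeout.toMinutes();
    System.out.println("Replicate commands issued before the first copy finished: "
        + redundantCommands); // ~24 here; the report saw 31 target datanodes overall
  }
}
{code}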


