[
https://issues.apache.org/jira/browse/HDDS-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129894#comment-17129894
]
runzhiwang commented on HDDS-3481:
----------------------------------
bq. What was the cause of the delay? Was the node slow/overloaded?
[~arp] The cause is that the source datanode was replicating too many
containers to other datanodes, so the source datanode became very slow.
> SCM ask 31 datanodes to replicate the same container
> ----------------------------------------------------
>
> Key: HDDS-3481
> URL: https://issues.apache.org/jira/browse/HDDS-3481
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: SCM
> Reporter: runzhiwang
> Assignee: runzhiwang
> Priority: Blocker
> Labels: TriagePending
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png,
> screenshot-4.png
>
>
> *What's the problem?*
> As the image shows, SCM asked 31 datanodes to replicate container 2037,
> sending a request every 10 minutes starting at 2020-04-17 23:38:51. At
> 2020-04-18 08:58:52, SCM found that container 2037 had 12 replicas, so it
> asked 11 datanodes to delete container 2037.
> !screenshot-1.png!
> !screenshot-2.png!
> *What's the reason?*
> SCM checks whether (container replica count +
> inflightReplication.get(containerId).size() -
> inflightDeletion.get(containerId).size()) is less than 3. If it is, SCM
> asks some datanode to replicate the container and adds the action to
> inflightReplication.get(containerId). The replicate action times out after
> 10 minutes; when an action times out, SCM deletes it from
> inflightReplication.get(containerId), as the image shows. The sum (container
> replica count + inflightReplication.get(containerId).size() -
> inflightDeletion.get(containerId).size()) is then less than 3 again, so SCM
> asks another datanode to replicate the container.
> Because replicating a container takes a long time, it sometimes cannot
> finish within 10 minutes, so SCM kept issuing replicate requests every 10
> minutes until 31 datanodes had been asked to replicate the container. 19 of
> the 31 datanodes replicated the container from the same source datanode,
> which put heavy load on the source datanode and made replication even
> slower. The first replication actually took 4 hours to finish.
> !screenshot-4.png!
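
The check and timeout behaviour described in the quoted issue can be
illustrated with a small simulation. This is only a sketch under stated
assumptions, not the actual SCM code: the class ReplicationLoopSketch, its
methods, and the fixed 10-minute tick are hypothetical names and
simplifications introduced here, and in-flight deletions are held at zero.
It shows how dropping a timed-out action from the in-flight list, while the
copy is still running, makes the check fire again on every tick.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the loop described in the issue; not SCM code. */
final class ReplicationLoopSketch {
    static final int REQUIRED_REPLICAS = 3; // threshold quoted in the issue
    static final int TIMEOUT_MINUTES = 10;  // replicate action timeout quoted in the issue

    /** replicas + in-flight replications - in-flight deletions, as in the issue. */
    static int effectiveReplicas(int replicas, int inflightReplication, int inflightDeletion) {
        return replicas + inflightReplication - inflightDeletion;
    }

    /**
     * Simulates one container, running the check every TIMEOUT_MINUTES.
     * Each copy takes copyMinutes. A timed-out action is removed from the
     * in-flight list although its copy keeps running on the datanode.
     * Returns the total number of replicate commands issued.
     */
    static int simulate(int initialReplicas, int copyMinutes, int totalMinutes) {
        int replicas = initialReplicas;
        List<Integer> inflight = new ArrayList<>(); // start minute of each tracked action
        List<Integer> running = new ArrayList<>();  // start minute of each copy still running
        int commandsIssued = 0;

        for (int now = 0; now <= totalMinutes; now += TIMEOUT_MINUTES) {
            final int t = now;

            // Copies that have finished each add one replica.
            int before = running.size();
            running.removeIf(start -> t - start >= copyMinutes);
            replicas += before - running.size();

            // Actions leave the in-flight list on completion or on timeout.
            inflight.removeIf(start ->
                t - start >= copyMinutes || t - start >= TIMEOUT_MINUTES);

            // The check from the issue (in-flight deletions assumed to be 0).
            while (effectiveReplicas(replicas, inflight.size(), 0) < REQUIRED_REPLICAS) {
                inflight.add(t);
                running.add(t);
                commandsIssued++;
            }
        }
        return commandsIssued;
    }

    public static void main(String[] args) {
        // A copy that finishes inside the timeout needs a single command.
        System.out.println(simulate(2, 5, 60));   // prints 1
        // A 4-hour copy triggers a new command on every 10-minute tick.
        System.out.println(simulate(2, 240, 60)); // prints 7
    }
}
```

In this sketch the second call issues one command per tick because each
timed-out action stops being counted while no replica has been added yet.
Once the slow copies finally complete, the replica count overshoots 3, which
matches the over-replication and later deletion reported in the issue.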