[
https://issues.apache.org/jira/browse/HDDS-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129023#comment-17129023
]
Nanda kumar commented on HDDS-3481:
-----------------------------------
Having a closed feedback loop for the replication command will increase the
complexity of the code both on the SCM side and on the datanode side.
Some time back we had code with such a closed feedback loop; we refactored it
because the code was getting too complex to understand.
I can see the problem here.
What do we gain by introducing a feedback loop for the replication
command?
* If the datanode has not yet started the replication even after a configured
interval, we cancel the operation on that datanode and replicate the container
somewhere else.
Is there anything else that I'm missing?
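For concreteness, the cancel-and-retry step above could look roughly like the
following sketch; every name here is hypothetical, not taken from the actual
SCM code:
{code:java}
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hedged sketch only: hypothetical names, not the real SCM classes.
class FeedbackLoopSketch {
  // Assumed configurable "must have started by now" deadline.
  static final Duration START_TIMEOUT = Duration.ofMinutes(5);

  record InFlight(String datanode, Instant issued, boolean started) {}

  final List<InFlight> inflight = new ArrayList<>(); // one container's actions

  // Cancel commands the datanode never started and retry elsewhere.
  void checkStartDeadline() {
    Iterator<InFlight> it = inflight.iterator();
    while (it.hasNext()) {
      InFlight a = it.next();
      if (!a.started()
          && Duration.between(a.issued(), Instant.now())
                     .compareTo(START_TIMEOUT) > 0) {
        it.remove();                  // forget this attempt in SCM
        sendCancel(a.datanode());     // new command path on the datanode
        scheduleOnAnotherDatanode();  // new retry path in SCM
      }
    }
  }

  void sendCancel(String dn) { /* placeholder */ }
  void scheduleOnAnotherDatanode() { /* placeholder */ }
}
{code}
Note how even this sketch needs both a cancel handler on the datanode and a
retry path in SCM, which is exactly the extra complexity on both sides.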
What are all the cases where a datanode might not start processing the
replication command after receiving it?
1. Datanode goes down after receiving the replicate command
2. Datanode goes into a very long GC pause or hangs
3. Datanode is not able to process the replicate command because of
HDDS-3451/HDDS-3459 or similar issues
In both case 1 and case 2, the datanode will stop sending heartbeats to SCM,
so we can remove the {{inFlightReplication}} entries in ReplicationManager for
the datanodes that are marked as stale. This should be sufficient.
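A rough sketch of that cleanup, assuming a stale-node callback and hypothetical
names (this is not the actual ReplicationManager code):
{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch: hypothetical names, not the actual ReplicationManager code.
class StaleNodeCleanupSketch {
  // containerId -> datanodes that have an in-flight replication for it.
  final Map<Long, List<String>> inflightReplication = new ConcurrentHashMap<>();

  // Invoked when SCM marks a datanode stale (heartbeats stopped);
  // this covers cases 1 and 2 above without any feedback loop.
  void onDatanodeStale(String staleDatanode) {
    inflightReplication.values()
        .forEach(dns -> dns.removeIf(dn -> dn.equals(staleDatanode)));
    // The next ReplicationManager pass sees the smaller in-flight count
    // and schedules a fresh replication elsewhere if one is still needed.
  }
}
{code}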
For case 3, we should be able to have a proper solution as part of HDDS-3459.
Is there any other scenario that I'm missing?
Unless there is a very strong argument, I would prefer not to have a closed
feedback loop for the replication command.
> SCM ask 31 datanodes to replicate the same container
> ----------------------------------------------------
>
> Key: HDDS-3481
> URL: https://issues.apache.org/jira/browse/HDDS-3481
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: SCM
> Reporter: runzhiwang
> Assignee: runzhiwang
> Priority: Blocker
> Labels: TriagePending
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png,
> screenshot-4.png
>
>
> *What's the problem?*
> As the image shows, SCM asked 31 datanodes to replicate container 2037 every
> 10 minutes, starting at 2020-04-17 23:38:51. At 2020-04-18 08:58:52 SCM found
> that the replica count of container 2037 was 12, so it asked 11 datanodes to
> delete container 2037.
> !screenshot-1.png!
> !screenshot-2.png!
> *What's the reason?*
> SCM checks whether (container replica count +
> inflightReplication.get(containerId).size() -
> inflightDeletion.get(containerId).size()) is less than 3. If it is, SCM asks
> some datanode to replicate the container and adds the action to
> inflightReplication.get(containerId). The replicate action times out after 10
> minutes; when an action times out, SCM removes it from
> inflightReplication.get(containerId), as the image shows. Then (container
> replica count + inflightReplication.get(containerId).size() -
> inflightDeletion.get(containerId).size()) is less than 3 again, and SCM asks
> another datanode to replicate the container.
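> For illustration, the loop described above behaves roughly like this sketch
> (the names are illustrative, not the exact SCM code):
> {code:java}
> import java.time.Duration;
> import java.time.Instant;
> import java.util.ArrayList;
> import java.util.List;
>
> class ReplicationLoopSketch {
>   record Action(Instant issued) {}
>
>   final List<Action> inflight = new ArrayList<>(); // one container's actions
>
>   void onReplicationCheck(int currentReplicas, int inflightDeletes) {
>     // 1. Every action older than 10 minutes is simply dropped.
>     inflight.removeIf(a ->
>         Duration.between(a.issued(), Instant.now()).toMinutes() >= 10);
>     // 2. With the timed-out actions gone, the count falls below 3 again.
>     if (currentReplicas + inflight.size() - inflightDeletes < 3) {
>       // 3. So yet another datanode is asked to replicate the same
>       //    container, while the earlier copies may still be running.
>       inflight.add(new Action(Instant.now()));
>     }
>   }
> }
> {code}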
> Because replicating a container takes a long time, it sometimes cannot
> finish within 10 minutes, so 31 datanodes ended up replicating the container
> every 10 minutes. 19 of the 31 datanodes replicated the container from the
> same source datanode, which also put heavy pressure on that source datanode
> and made replication even slower. In fact, it took 4 hours to finish the
> first replication.
> !screenshot-4.png!