[ https://issues.apache.org/jira/browse/HDDS-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDDS-9281:
---------------------------------
    Labels: pull-request-available  (was: )

> The DatanodeCommand sent in LegacyReplicationManager does not set the deadline
> ------------------------------------------------------------------------------
>
>                 Key: HDDS-9281
>                 URL: https://issues.apache.org/jira/browse/HDDS-9281
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: GuoHao
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2023-09-14-16-50-00-841.png, 
> image-2023-09-14-16-55-17-541.png
>
>
> Description:
> When a host is shut down and stays down past the dead-node threshold, the
> node's state changes to DEAD, and the SCM replication manager schedules
> re-replication of the containers whose replicas that node holds.
> When the cluster holds a lot of data but has few nodes, the remaining
> datanodes receive many replication tasks and queue them for execution. If the
> DEAD datanode is restarted while these tasks are running and reports its
> containers to SCM again, the replication tasks already in the queue are still
> executed.
> The number of in-flight replication tasks per datanode looks like this:
> !image-2023-09-14-16-50-00-841.png!
>  
> Each datanode command has a deadline. ReplicationManager sets it, but
> LegacyReplicationManager does not. If LegacyReplicationManager also set the
> deadline, a replication task still queued on the datanode when its deadline
> passes would be dropped instead of executed.
>  
> ReplicationManager code:
>  # org.apache.hadoop.hdds.scm.container.replication.ReplicationManager#sendDatanodeCommand
> {code:java}
> public void sendDatanodeCommand(SCMCommand<?> command,
>     ContainerInfo containerInfo, DatanodeDetails target)
>     throws NotLeaderException {
>   long scmDeadline = clock.millis() + rmConf.eventTimeout;
>   sendDatanodeCommand(command, containerInfo, target, scmDeadline);
> } {code}
>  
> LegacyReplicationManager code:
>  # org.apache.hadoop.hdds.scm.container.replication.LegacyReplicationManager#sendAndTrackDatanodeCommand
>
> {code:java}
> private <T extends Message> boolean sendAndTrackDatanodeCommand(
>     final DatanodeDetails datanode,
>     final SCMCommand<T> command,
>     final Predicate<InflightAction> tracker) {
>   try {
>     command.setTerm(scmContext.getTermOfLeader());
>   } catch (NotLeaderException nle) {
>     LOG.warn("Skip sending datanode command,"
>         + " since current SCM is not leader.", nle);
>     return false;
>   }
>   final boolean allowed = tracker.test(
>       new InflightAction(datanode, clock.millis()));
>   if (!allowed) {
>     return false;
>   }
>   final CommandForDatanode<T> datanodeCommand =
>       new CommandForDatanode<>(datanode.getUuid(), command);
>   eventPublisher.fireEvent(SCMEvents.DATANODE_COMMAND, datanodeCommand);
>   return true;
> } {code}
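A minimal sketch of the behaviour the report asks for, assuming that a deadline of 0 stands for "no deadline" and that the datanode checks the deadline when it takes a command off its queue. `DeadlineSketch`, `Command`, `send`, `drain` and `hasExpired` are hypothetical names for illustration only, not Ozone's classes; the SCM side simply mirrors the `clock.millis() + rmConf.eventTimeout` computation shown in `ReplicationManager#sendDatanodeCommand`.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, simplified model of the deadline idea in HDDS-9281.
// These classes are illustrative only; they are not Ozone's real API.
public class DeadlineSketch {

  // Stand-in for a datanode command. In this sketch a deadline of 0 models
  // an unset deadline (the legacy path, per the report).
  static final class Command {
    final String name;
    long deadlineMs; // epoch millis; 0 = never expires

    Command(String name) {
      this.name = name;
    }

    boolean hasExpired(long nowMs) {
      return deadlineMs > 0 && nowMs > deadlineMs;
    }
  }

  // SCM side: stamp a deadline before queuing the command, mirroring
  // clock.millis() + rmConf.eventTimeout in ReplicationManager.
  static void send(Command cmd, Clock clock, long eventTimeoutMs,
      Queue<Command> datanodeQueue) {
    cmd.deadlineMs = clock.millis() + eventTimeoutMs;
    datanodeQueue.add(cmd);
  }

  // Datanode side: drain the queue, dropping commands whose deadline passed.
  static List<String> drain(Queue<Command> queue, Clock clock) {
    List<String> executed = new ArrayList<>();
    Command cmd;
    while ((cmd = queue.poll()) != null) {
      if (cmd.hasExpired(clock.millis())) {
        continue; // stale replication task: skip instead of executing it
      }
      executed.add(cmd.name);
    }
    return executed;
  }

  public static void main(String[] args) {
    Clock sendClock =
        Clock.fixed(Instant.parse("2023-09-14T08:00:00Z"), ZoneOffset.UTC);
    long eventTimeoutMs = Duration.ofMinutes(10).toMillis();
    Queue<Command> queue = new ArrayDeque<>();

    // Legacy path: the command is queued without a deadline.
    queue.add(new Command("replicate-without-deadline"));
    // Proposed change: the command carries a 10-minute deadline.
    send(new Command("replicate-with-deadline"), sendClock, eventTimeoutMs, queue);

    // The datanode reaches the queue 15 minutes later.
    Clock laterClock = Clock.offset(sendClock, Duration.ofMinutes(15));
    System.out.println(drain(queue, laterClock));
    // prints [replicate-without-deadline]
  }
}
```

Injecting a `java.time.Clock` keeps the expiry check deterministic: `Clock.fixed` pins the send time and `Clock.offset` models the datanode reaching the queue later, so only the command without a deadline survives the delay.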



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
