[ 
https://issues.apache.org/jira/browse/HDDS-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

GuoHao updated HDDS-9281:
-------------------------
    Description: 

When a host is shut down and the stale-node deadline is exceeded, the node's 
state changes to DEAD, and SCM's replication manager schedules re-replication 
of the container replicas that node was responsible for.

When there is a lot of data on few nodes, the remaining datanodes receive too 
many replication tasks and queue them for execution. If the dead datanode is 
restarted while those tasks are still queued and reports its containers back to 
SCM, the queued replication tasks are nevertheless still executed.

The number of in-flight replication tasks per datanode looks like this:

!image-2023-09-14-16-50-00-841.png!

 

Each datanode command can carry a deadline, but I don't see one being set in 
LegacyReplicationManager; it is only set in ReplicationManager. If 
LegacyReplicationManager also set a deadline, a replication task that reaches 
its deadline would no longer be executed on the datanode.

 

ReplicationManager code, see:
 # org.apache.hadoop.hdds.scm.container.replication.ReplicationManager#sendDatanodeCommand
{code:java}
public void sendDatanodeCommand(SCMCommand<?> command,
    ContainerInfo containerInfo, DatanodeDetails target)
    throws NotLeaderException {
  long scmDeadline = clock.millis() + rmConf.eventTimeout;
  sendDatanodeCommand(command, containerInfo, target, scmDeadline);
} {code}
 

LegacyReplicationManager code, see:
 # org.apache.hadoop.hdds.scm.container.replication.LegacyReplicationManager#sendAndTrackDatanodeCommand

 
{code:java}
private <T extends Message> boolean sendAndTrackDatanodeCommand(
    final DatanodeDetails datanode,
    final SCMCommand<T> command,
    final Predicate<InflightAction> tracker) {
  try {
    command.setTerm(scmContext.getTermOfLeader());
  } catch (NotLeaderException nle) {
    LOG.warn("Skip sending datanode command,"
        + " since current SCM is not leader.", nle);
    return false;
  }
  final boolean allowed = tracker.test(
      new InflightAction(datanode, clock.millis()));
  if (!allowed) {
    return false;
  }
  final CommandForDatanode<T> datanodeCommand =
      new CommandForDatanode<>(datanode.getUuid(), command);
  eventPublisher.fireEvent(SCMEvents.DATANODE_COMMAND, datanodeCommand);
  return true;
} {code}
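To illustrate the effect a deadline would have, here is a self-contained sketch (plain Java, not Ozone code; the Task record, the shouldExecute check, and the timestamps are hypothetical names chosen for illustration): a datanode draining its queue skips any task whose deadline has already passed.
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

public class DeadlineDemo {
  // Hypothetical stand-in for a queued replication task. The deadline would
  // be assigned by SCM when the command is sent, analogous to
  // clock.millis() + rmConf.eventTimeout in ReplicationManager above.
  record Task(String containerId, long deadlineMillis) {}

  // Datanode-side check: only execute tasks whose deadline has not passed.
  static boolean shouldExecute(Task t, long nowMillis) {
    return nowMillis <= t.deadlineMillis;
  }

  public static void main(String[] args) {
    long now = 1_000_000L;
    long eventTimeout = 30_000L; // analogous to rmConf.eventTimeout
    Queue<Task> queue = new ArrayDeque<>();
    queue.add(new Task("container-1", now + eventTimeout));
    queue.add(new Task("container-2", now - 1)); // already expired
    long drainTime = now + 10_000L; // datanode drains the queue later
    for (Task t : queue) {
      System.out.println(t.containerId() + " execute=" + shouldExecute(t, drainTime));
    }
  }
}
{code}
With such a deadline in place, tasks left over from a stale re-replication decision would expire in the queue instead of running after the dead node has already come back and re-reported its containers.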
 



> The DatanodeCommand sent in LegacyReplicationManager does not set the deadline
> ------------------------------------------------------------------------------
>
>                 Key: HDDS-9281
>                 URL: https://issues.apache.org/jira/browse/HDDS-9281
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: GuoHao
>            Priority: Major
>         Attachments: image-2023-09-14-16-50-00-841.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
