[
https://issues.apache.org/jira/browse/HDDS-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
GuoHao updated HDDS-9281:
-------------------------
Description:
When a host is shut down and the node's heartbeat deadline is exceeded, the node
state changes to DEAD, and the SCM's replication manager schedules
re-replication of the containers that node held. On a cluster with a lot of
data and few nodes, the surviving datanodes receive a large number of
replication tasks and queue them for execution. If the dead datanode is
restarted while these tasks are still queued and reports its containers back to
the SCM, the queued replication tasks are still executed even though they are
no longer needed.
The in-flight replication task count per datanode looks like this:
!image-2023-09-14-16-50-00-841.png!
Each datanode command can carry a deadline, but I don't see it being set in
LegacyReplicationManager; it is only set in ReplicationManager. If
LegacyReplicationManager also set the deadline, a replication task on the
datanode would no longer be executed once its deadline is reached.
ReplicationManager code, see
org.apache.hadoop.hdds.scm.container.replication.ReplicationManager#sendDatanodeCommand:
{code:java}
public void sendDatanodeCommand(SCMCommand<?> command,
    ContainerInfo containerInfo, DatanodeDetails target)
    throws NotLeaderException {
  long scmDeadline = clock.millis() + rmConf.eventTimeout;
  sendDatanodeCommand(command, containerInfo, target, scmDeadline);
} {code}
LegacyReplicationManager code, see
org.apache.hadoop.hdds.scm.container.replication.LegacyReplicationManager#sendAndTrackDatanodeCommand:
{code:java}
private <T extends Message> boolean sendAndTrackDatanodeCommand(
    final DatanodeDetails datanode,
    final SCMCommand<T> command,
    final Predicate<InflightAction> tracker) {
  try {
    command.setTerm(scmContext.getTermOfLeader());
  } catch (NotLeaderException nle) {
    LOG.warn("Skip sending datanode command,"
        + " since current SCM is not leader.", nle);
    return false;
  }
  final boolean allowed = tracker.test(
      new InflightAction(datanode, clock.millis()));
  if (!allowed) {
    return false;
  }
  final CommandForDatanode<T> datanodeCommand =
      new CommandForDatanode<>(datanode.getUuid(), command);
  eventPublisher.fireEvent(SCMEvents.DATANODE_COMMAND, datanodeCommand);
  return true;
} {code}
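The expiry behaviour being requested can be sketched with a small self-contained example (this is not Ozone code; the class and field names below are illustrative stand-ins): tasks are enqueued with a deadline, and any task whose deadline has passed by the time the executor reaches it is dropped instead of run.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneId;
import java.util.ArrayDeque;
import java.util.Queue;

public class DeadlineQueueDemo {
  // Illustrative stand-in for a replication command carrying a deadline.
  record ReplicationTask(long containerId, long deadlineEpochMs) {}

  // Drains the queue, skipping tasks whose deadline has already passed.
  static int executeQueued(Queue<ReplicationTask> queue, Clock clock) {
    int executed = 0;
    while (!queue.isEmpty()) {
      ReplicationTask task = queue.poll();
      if (clock.millis() > task.deadlineEpochMs()) {
        continue; // stale: deadline expired while the task sat in the queue
      }
      executed++; // this is where the actual replica copy would happen
    }
    return executed;
  }

  public static void main(String[] args) {
    long now = 1_000_000L;
    Clock enqueueClock = Clock.fixed(Instant.ofEpochMilli(now), ZoneId.of("UTC"));
    // 30s deadline at enqueue time, analogous to scmDeadline above.
    long deadline = enqueueClock.millis() + 30_000;

    Queue<ReplicationTask> queue = new ArrayDeque<>();
    queue.add(new ReplicationTask(1, deadline));
    queue.add(new ReplicationTask(2, deadline));

    // Simulate the executor reaching the queue 60s later, e.g. after the
    // source node has already come back and re-reported its containers.
    Clock lateClock = Clock.fixed(Instant.ofEpochMilli(now + 60_000), ZoneId.of("UTC"));
    System.out.println(executeQueued(queue, lateClock)); // prints 0: both expired
  }
}
```

With a deadline attached, the backlog of queued tasks shown in the attached graph would drain itself instead of executing redundant copies.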
> The DatanodeCommand sent in LegacyReplicationManager does not set the deadline
> ------------------------------------------------------------------------------
>
> Key: HDDS-9281
> URL: https://issues.apache.org/jira/browse/HDDS-9281
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: GuoHao
> Priority: Major
> Attachments: image-2023-09-14-16-50-00-841.png
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]