[ https://issues.apache.org/jira/browse/HDDS-7198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600987#comment-17600987 ]

Ethan Rose commented on HDDS-7198:
----------------------------------

{quote}For EC, things will be worse. We have opted to do container copy for EC, 
so there will only ever be the decommission node as the source.
{quote}
Interesting. We may want to file an EC improvement jira for this. Sounds like 
decommissioning a datanode with most/all EC data could be extremely slow with 
the current implementation.
{quote}A decommissioning node will have zero write load. Perhaps we could 
return it last for normal reads to alleviate that load on it too.
{quote}
This is an interesting idea. If we want to do this, we should check that the 
replica order is preserved at every step from SCM -> OM -> client, although the 
WIP OM container cache might throw this off.
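For illustration, "return it last" could be a stable sort applied before SCM 
hands the list out, something like the sketch below. The types here (Replica, 
NodeState) are hypothetical stand-ins, not the actual SCM classes:
{code:java}
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for the real SCM replica/node types.
enum NodeState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED }

record Replica(String datanodeId, NodeState state) { }

final class DecomAwareOrder {
  // Stable sort: in-service replicas keep their existing relative order,
  // while decommissioning replicas sink to the end of the list.
  static void decomLast(List<Replica> replicas) {
    replicas.sort(Comparator.comparingInt(
        r -> r.state() == NodeState.DECOMMISSIONING ? 1 : 0));
  }
}
{code}
The ordering only helps if OM (including the WIP container cache) and the 
client preserve the list order rather than re-shuffling it.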
{quote}However we don't want to de-prioritize the decommissioning nodes 
completely - if they are not serving writes and potentially not reads, they 
will be otherwise idle.
{quote}
I'm not sure having the decom node completely idle is a bad thing, depending on 
the reason for the decom. If the node just needs an OS upgrade/tuning, for 
example, it is fine to take some load while decommissioning. If the node has a 
faulty NIC or a slow/failing OS disk, then it would be best if the node stays 
idle and decom just serves as a guarantee to the admin that they can safely 
discard everything on the node when the process completes. Maybe in the faulty 
case, just shutting the node down and letting the system mark it as dead is 
what the admin should do here instead.

If decom is only intended to be used for fully functioning nodes and we do not 
want the node idle, then having SCM specify the order for replication with the 
decom node at different positions in the list would be doable, but would 
require designing some potentially involved heuristics on the SCM side to 
balance the load.
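
As a purely illustrative example of such a heuristic, SCM could score each 
candidate source by its queued replication work plus a tunable penalty for 
decommissioning nodes, so the decom node still takes some load without 
dominating. All names below are hypothetical, not actual SCM code:
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative SCM-side ordering heuristic, not the real implementation.
final class SourcePrioritizer {
  record Source(String datanodeId, int queuedReplications,
      boolean decommissioning) { }

  // Lower score = tried first by the target datanode. A decommissioning
  // node is penalized by a tunable amount but never excluded, so it still
  // absorbs part of the replication load instead of sitting idle.
  static List<Source> order(List<Source> candidates, int decomPenalty) {
    List<Source> ordered = new ArrayList<>(candidates);
    ordered.sort(Comparator.comparingInt(
        s -> s.queuedReplications() + (s.decommissioning() ? decomPenalty : 0)));
    return ordered;
  }
}
{code}
The tricky part would be picking decomPenalty: it effectively encodes how much 
less load we trust a decommissioning node to handle relative to a healthy one.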

> Datanodes should avoid using decommissioning nodes as a container replication 
> source
> ------------------------------------------------------------------------------------
>
>                 Key: HDDS-7198
>                 URL: https://issues.apache.org/jira/browse/HDDS-7198
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Datanode, SCM
>            Reporter: Ethan Rose
>            Priority: Major
>
> Currently when SCM tells a target datanode to replicate a container, it sends 
> the target datanode an ordered list of source datanodes it should download 
> the container from. The target then shuffles the list and tries to download 
> from the sources in the resulting order one by one until one of them succeeds.
> In failure scenarios this works fine. The node that had the failure will not 
> be included in the source list, distributing the source replication load 
> throughout the cluster. However, when a datanode is decommissioning, it will 
> be included in the source list with no distinction from other replicas, 
> causing it to bear a disproportionate amount of the replication load.
> For example, if every container in the cluster has three replicas and one 
> datanode is being decommissioned, the decommissioning node will be the source 
> for roughly 33% of the replications, while the remaining 67% will be 
> distributed throughout the cluster based on the placement of the other 
> container replicas. With datanodes currently throttled at 10 concurrent 
> replication requests, this will place continuous load on the decommissioning 
> node (which may already be in a bad state, often the very reason it is being 
> removed), while decreasing parallelization of the overall replications 
> required. A sketch of this source-selection loop follows below.
>  
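
For reference, here is a minimal sketch of the selection loop described above, 
together with one possible tweak: shuffle healthy sources first and only fall 
back to decommissioning sources. All names are hypothetical; this is not the 
actual datanode replication code:
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

final class ContainerDownloader {
  // Current behavior: shuffle the SCM-provided source list and try each
  // node until a download succeeds. The tweak: shuffle healthy sources
  // separately and append decommissioning sources at the end, so they are
  // only used when no healthy replica can serve the container.
  static String downloadFromFirstAvailable(List<String> sources,
      Predicate<String> isDecommissioning) {
    List<String> healthy = new ArrayList<>();
    List<String> decom = new ArrayList<>();
    for (String source : sources) {
      (isDecommissioning.test(source) ? decom : healthy).add(source);
    }
    Collections.shuffle(healthy);
    Collections.shuffle(decom);
    healthy.addAll(decom);
    for (String source : healthy) {
      if (tryDownload(source)) {
        return source;
      }
    }
    return null; // every source failed
  }

  private static boolean tryDownload(String source) {
    // Placeholder for the real container download attempt.
    return true;
  }
}
{code}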


