Wei-Chiu Chuang created HDDS-15412:
--------------------------------------
Summary: [DataNode] Disk volume-specific container replication
thread pool
Key: HDDS-15412
URL: https://issues.apache.org/jira/browse/HDDS-15412
Project: Apache Ozone
Issue Type: Task
Components: EC, Ozone Datanode
Reporter: Wei-Chiu Chuang
Attachments: Screenshot 2026-05-26 at 11.30.41 PM.png, Screenshot
2026-05-26 at 11.31.00 PM.png, Screenshot 2026-05-27 at 6.57.48 AM.png
We noticed a pattern during EC decommission: replication starts fast, with
every disk running at full speed, but it doesn't last. After a while, overall
replication gradually slows until only one disk runs at full speed.
!Screenshot 2026-05-26 at 11.30.41 PM.png|width=100%!
!Screenshot 2026-05-26 at 11.31.00 PM.png|width=100%!
!Screenshot 2026-05-27 at 6.57.48 AM.png|width=100%!
It turns out that when a disk is assigned multiple replication tasks
concurrently, those tasks slow down and delay other replication tasks, even
those assigned to different disks. Eventually, the rest of the disks become
idle while that particular disk is full of replication tasks.
Proposal: Building on top of HDDS-15073, we need to create a separate thread
pool for each disk, so that if replication tasks assigned to one disk start
to back off, they don't interfere with replication tasks assigned to other
disks.
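A rough sketch of the idea, to make the proposal concrete. This is not the actual Ozone code and does not reflect how HDDS-15073 is structured; the class name PerVolumeReplicationPools, the string volume IDs, and the threadsPerVolume parameter are all made up for illustration. It only shows that with one pool per volume, tasks stalled on one disk cannot occupy the threads serving another disk.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: one fixed-size thread pool per disk volume. */
public class PerVolumeReplicationPools implements AutoCloseable {
  private final int threadsPerVolume;
  // Assumption: a volume is identified by a plain string such as its mount path.
  private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();

  public PerVolumeReplicationPools(int threadsPerVolume) {
    this.threadsPerVolume = threadsPerVolume;
  }

  /** Runs the task on the pool owned by the given volume, creating it lazily. */
  public Future<?> submit(String volumeId, Runnable task) {
    return pools
        .computeIfAbsent(volumeId, v -> Executors.newFixedThreadPool(threadsPerVolume))
        .submit(task);
  }

  @Override
  public void close() {
    pools.values().forEach(ExecutorService::shutdownNow);
  }

  public static void main(String[] args) throws Exception {
    try (PerVolumeReplicationPools p = new PerVolumeReplicationPools(1)) {
      CountDownLatch slowDiskGate = new CountDownLatch(1);
      // Simulate several replication tasks stalled on one slow disk.
      for (int i = 0; i < 4; i++) {
        p.submit("/data1", () -> {
          try {
            slowDiskGate.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // A task for a different disk uses its own pool, so it is not queued
      // behind the stalled tasks; get() would time out if it were.
      Future<?> other = p.submit("/data2", () -> { });
      other.get(5, TimeUnit.SECONDS);
      System.out.println("task on /data2 completed while /data1 was stalled");
      slowDiskGate.countDown();
    }
  }
}
```

With a single shared pool of the same total size, the four stalled tasks could hold every thread and the /data2 task would wait behind them, which matches the pattern in the screenshots. The trade-off is more threads overall and the need to size each pool, which this sketch does not address.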