[jira] [Updated] (HDDS-11461) Improve the impact of DataNode I/O

Shilun Fan (Jira) Sun, 15 Sep 2024 20:36:05 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-11461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shilun Fan updated HDDS-11461:
------------------------------
    Description: 
Our object storage service is built on Ozone and currently has over 3K nodes 
across different clusters. Customers have strict requirements for the P99 
latency of access to our system.

Under normal circumstances, reading 200 bytes of data might take 10ms to 20ms. 
However, monitoring data sometimes shows that reading 200 bytes can take up to 
500ms.

Upon investigating the issue on the DN, we found that when the machine hosting 
the DN experiences high I/O wait or system load, the performance of DN access 
is adversely affected.

The factors contributing to high I/O wait or system load are diverse, including 
DataScanner scans, EC block recovery, or containers being in an UnderReplicated 
state.

We aim to design a mechanism that allows the DN to sense the system's I/O 
conditions to some extent (such as high system load, high I/O wait, slow 
network, or slow disk) and report this data to the SCM.
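
As one possible illustration of the sensing step, the sketch below estimates 
the I/O wait fraction from two samples of the aggregate "cpu" line in Linux 
/proc/stat, where iowait is the fifth counter. The function names and the 0.2 
threshold are assumptions made for this example, not part of Ozone.

```python
def parse_cpu_line(line: str) -> list[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_fraction(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two samples.

    Counter order: user nice system idle iowait irq softirq steal ...
    Only the first eight counters are summed, because guest time is
    already included in user and nice.
    """
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total


# Hypothetical threshold; a real value would need tuning per cluster.
IOWAIT_THRESHOLD = 0.2


def is_io_degraded(before: str, after: str) -> bool:
    """True when the sampled iowait share exceeds the assumed threshold."""
    return iowait_fraction(before, after) > IOWAIT_THRESHOLD
```

A DN-side reporter could take such samples periodically and include the 
result in the data it sends to the SCM.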

This data will be used to enhance system functionality:

When a DN detects high I/O load or degraded read/write performance:

- Automatically reduce the rate of DataScanner scans.
- If a specific disk's performance deteriorates, skip that disk during data 
writes.
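
The two DN-side reactions above could be sketched roughly as follows. This is 
only an illustration: the Volume class, the linear back-off, and all 
thresholds are hypothetical and do not describe existing Ozone code.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    path: str
    avg_write_latency_ms: float  # assumed to come from a local disk probe


def throttled_scan_rate(base_bytes_per_sec: int, iowait: float,
                        threshold: float = 0.2,
                        min_fraction: float = 0.1) -> int:
    """Scale the scanner bandwidth down as iowait rises above the threshold.

    At or below the threshold the base rate is kept. Above it, the rate
    falls linearly and reaches min_fraction of the base at iowait = 1.0.
    """
    if iowait <= threshold:
        return base_bytes_per_sec
    excess = (iowait - threshold) / (1.0 - threshold)
    scale = max(1.0 - excess * (1.0 - min_fraction), min_fraction)
    return round(base_bytes_per_sec * scale)


def writable_volumes(volumes: list[Volume],
                     slow_latency_ms: float = 100.0) -> list[Volume]:
    """Skip slow disks for writes, unless that would leave no disk at all."""
    healthy = [v for v in volumes if v.avg_write_latency_ms <= slow_latency_ms]
    return healthy if healthy else list(volumes)
```

Falling back to all volumes when every disk looks slow is a design choice in 
this sketch, so that writes keep working on a uniformly loaded node.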

When the SCM detects high I/O or degraded read/write performance on DNs:

- Issue commands to bypass these poorly performing DNs.
- When returning a list of DNs to clients for data reads, place the degraded 
DNs at the end of the list.
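
For the read path, placing degraded DNs last can be expressed as a stable 
sort that keeps the existing order within each group. The sketch below 
assumes the SCM already knows which DNs are degraded; the function name and 
the string identifiers are hypothetical.

```python
def order_for_read(datanodes: list[str], degraded: set[str]) -> list[str]:
    """Return the DNs with healthy ones first and degraded ones last.

    sorted() is stable and False sorts before True, so the original
    relative order (for example, by network distance) is preserved
    inside the healthy group and inside the degraded group.
    """
    return sorted(datanodes, key=lambda dn: dn in degraded)
```

Because the degraded DNs stay in the list, clients can still fall back to 
them if every healthy replica is unavailable.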


> Improve the impact of DataNode I/O
> ----------------------------------
>
>                 Key: HDDS-11461
>                 URL: https://issues.apache.org/jira/browse/HDDS-11461
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Datanode
>         Environment: Our object storage service is built on Ozone and 
> currently has over 3K nodes across different clusters. Customers have high 
> demands for the P99 latency of our system access.
> Under normal circumstances, reading 200 bytes of data might take 10ms to 
> 20ms. However, monitoring data sometimes shows that reading 200 bytes can 
> take up to 500ms.
> Upon investigating the issue with the DN, we find that when the machine 
> hosting the DN experiences high I/O wait or system load, the performance of 
> DN access is adversely affected.
> The factors contributing to high I/O wait or system load are diverse, 
> including DataScanner scans, EC block recovery, or containers being in an 
> UnderReplicated state.
> We aim to design a mechanism that allows DN to sense the system's I/O 
> conditions to some extent (such as high system load, high I/O wait, slow 
> network, or slow disk) and report this data to the SCM.
> This data will be used to enhance system functionality:
> When a DN detects high I/O or degraded read/write performance:
> - Automatically reduce the rate of DataScanner scans.
> - If a specific disk's performance deteriorates, skip that disk during data 
> writes.
> When the SCM detects high I/O or degraded read/write performance on DNs:
> - Issue commands to bypass these poorly performing DNs.
> - When returning a list of DNs to clients for data reads, place the degraded 
> DNs at the end of the list.
>            Reporter: Shilun Fan
>            Assignee: Shilun Fan
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

