[ 
https://issues.apache.org/jira/browse/HDDS-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816877#comment-17816877
 ] 

Kohei Sugihara commented on HDDS-9979:
--------------------------------------

The original Grafana screenshot hides some values. I've attached the exact 
values, in bytes, corresponding to the picture for all nodes.
 * [^cluster1.csv]
 * [^cluster2.csv]

Here is the actual disk usage, measured with the `du` command by summing up 
all container directories.
 * [^du.cluster1.csv]
 * [^du.cluster2.csv]

The distinction between container versions V1, V2, and V3 is for our internal 
purposes, so it's extra information for this issue. To keep things simple, we 
identify the container version from the `.container` file, and directories 
that do not have any `.container` file are recognized as unknown.
The point of these values is that the per-node sum exceeds the reserved space 
(20%) on most nodes in Cluster #1. Cluster #1 has a long history: it was 
launched with Ozone 1.1 in Jan 2021 and has occasionally experienced full 
disks and replication storms. Therefore it has some leaks. In any case, the 
metrics should report the actual cluster state precisely even if it has leaks.

> Sum of used, available, and reserved exceeds the physical volume size
> ---------------------------------------------------------------------
>
>                 Key: HDDS-9979
>                 URL: https://issues.apache.org/jira/browse/HDDS-9979
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: DN
>    Affects Versions: 1.4.0
>            Reporter: Kohei Sugihara
>            Assignee: Zita Dombi
>            Priority: Major
>         Attachments: cluster1.csv, cluster1.png, cluster2.csv, cluster2.png, 
> du.cluster1.csv, du.cluster2.csv, image-2024-01-08-19-31-38-781.png
>
>
> While reviewing DN metrics, I noticed that the sum of Used, Available, and 
> Reserved differs from the actual volume size. I haven't searched Jira 
> thoroughly for similar existing issues, so I'd appreciate it if you could 
> point me to any you know of. We experienced this issue in two clusters. 
> Cluster #1 holds a lot of data and has experienced full disks many times.
> h2. Example 1: Cluster #1
> This cluster consists of 36 nodes. Each node has 36 -24- 14 TB HDD drives. 
> The expected total capacity of a single node is: 36 bays * 14 TB * 10^12 / 
> 1024^4 = 458 TiB, so the sum of 
> {{volume_info_metrics_{used,available,reserved}}} should equal 458 TiB. 
> However, we observe different results.
> cluster1.png shows a stacked bar graph. The reported metrics vary and 
> exceed 458 TiB.
> !cluster1.png!
> h2. Example 2: Cluster #2
> This is another example; each node has twelve 14 TB HDD drives. The expected 
> total capacity of a single node is: 12 bays * 14 TB * 10^12 / 1024^4 = 
> 153 TiB.
> cluster2.png shows a stacked bar graph. The reported metrics are almost the 
> same across DNs, but some exceptions exceed the physical capacity.
> !cluster2.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
