[jira] [Commented] (HDDS-13882) Datanode status as HEALTHY even for NoDiskSpace

Sumit Agrawal (Jira) Tue, 18 Nov 2025 03:02:04 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-13882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039125#comment-18039125
 ]


Sumit Agrawal commented on HDDS-13882:
--------------------------------------

READ_ONLY mode is a temporary condition that can change very frequently, so it 
should not be a node state. The datanode report already identifies whether 
enough space is available, so perhaps the Recon UI could identify those DNs 
from the datanode report and show a warning. The same applies to the DN UI.
 # Add a metric to represent this condition.
 # Have Recon or another UI display the same information.
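
As a rough illustration of the idea above, here is a minimal sketch of how a UI 
or metric could flag low-space DNs from reported space. It is not Ozone code: 
the class LowSpaceDatanodeCheck, the NodeSpace type, and the host name 
example-dn are hypothetical, and the real datanode report has a different 
shape. The two thresholds are the byte values from the pipeline create error 
quoted in this issue, and 4.5 GB is taken as 4.5 * 1024^3 bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LowSpaceDatanodeCheck {
    // Requirements quoted in the "ozone admin pipeline create" error in this issue.
    static final long REQUIRED_METADATA_BYTES = 1_073_741_824L; // 1 GiB
    static final long REQUIRED_DATA_BYTES = 5_368_709_120L;     // 5 GiB

    /** Hypothetical, simplified view of the space one datanode reports. */
    static final class NodeSpace {
        final String hostname;
        final long metadataRemainingBytes;
        final long dataRemainingBytes;

        NodeSpace(String hostname, long metadataRemainingBytes, long dataRemainingBytes) {
            this.hostname = hostname;
            this.metadataRemainingBytes = metadataRemainingBytes;
            this.dataRemainingBytes = dataRemainingBytes;
        }
    }

    /** True when the node could host a new container under both requirements. */
    static boolean hasEnoughSpace(NodeSpace node) {
        return node.metadataRemainingBytes >= REQUIRED_METADATA_BYTES
            && node.dataRemainingBytes >= REQUIRED_DATA_BYTES;
    }

    /** Host names a UI could mark with a warning; the list size could back a metric. */
    static List<String> lowSpaceNodes(List<NodeSpace> nodes) {
        List<String> result = new ArrayList<>();
        for (NodeSpace node : nodes) {
            if (!hasEnoughSpace(node)) {
                result.add(node.hostname);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long fourAndHalfGiB = 4_831_838_208L; // 4.5 * 1024^3, as in the df output
        long hundredGiB = 107_374_182_400L;   // assumed value for a roomy node
        List<NodeSpace> nodes = Arrays.asList(
            new NodeSpace("installer-4.domain", fourAndHalfGiB, fourAndHalfGiB),
            new NodeSpace("example-dn", hundredGiB, hundredGiB));
        for (String host : lowSpaceNodes(nodes)) {
            System.out.println("WARN: " + host + " is HEALTHY but low on space");
        }
    }
}
```

The node keeps its HEALTHY state here. Low space is only derived from the 
reported numbers each time they are read, which fits a condition that can 
change frequently.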

 

> Datanode status as HEALTHY even for NoDiskSpace
> -----------------------------------------------
>
>                 Key: HDDS-13882
>                 URL: https://issues.apache.org/jira/browse/HDDS-13882
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 2.0.0
>            Reporter: Soumitra Sulav
>            Assignee: Siddhant Sangwan
>            Priority: Critical
>
>  
> Datanode status is shown as HEALTHY even when the available capacity of the 
> datanode dir on each datanode is just 4.5 GB.
> Pipeline creation fails because it cannot allocate the minimum 5 GB for a 
> container.
> {code:java}
> scm@installer-4:~$ ozone admin pipeline create
> Unable to find enough nodes that meet the space requirement of 1073741824 
> bytes for metadata and 5368709120 bytes for data in healthy node set. Nodes 
> required: 1 Found: 0
> scm@installer-4:~$ df -Th
> Filesystem      Type      Size  Used Avail Use% Mounted on
> /dev/root       ext4      7.6G  3.2G  4.5G  42% /
> devtmpfs        devtmpfs  1.9G     0  1.9G   0% /dev
> tmpfs           tmpfs     1.9G     0  1.9G   0% /dev/shm
> tmpfs           tmpfs     382M  944K  381M   1% /run
> tmpfs           tmpfs     5.0M     0  5.0M   0% /run/lock
> tmpfs           tmpfs     1.9G     0  1.9G   0% /sys/fs/cgroup
> /dev/loop0      squashfs   24M   24M     0 100% /snap/amazon-ssm-agent/11321
> /dev/loop1      squashfs   60M   60M     0 100% /snap/core20/2603
> /dev/loop3      squashfs   92M   92M     0 100% /snap/lxd/32669
> /dev/loop2      squashfs   69M   69M     0 100% /snap/core22/2012
> /dev/loop4      squashfs   45M   45M     0 100% /snap/snapd/24672
> /dev/nvme0n1p15 vfat       98M  6.3M   92M   7% /boot/efi
> tmpfs           tmpfs     382M     0  382M   0% /run/user/0
> root@installer-4:~# ozone admin datanode list
> Datanode: 920fd52c-9140-46b8-bc19-1c6ddb701cba 
> (/default-rack/10.65.157.76/installer-4.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: b7b077dd-7797-4992-9960-753cadeb51bb 
> (/default-rack/10.65.147.12/installer-6.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: c2af08f4-4494-46df-81e1-ea0eebc6b150 
> (/default-rack/10.65.154.72/installer-9.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: e6b7edc3-eb20-451f-947e-99147e76c188 
> (/default-rack/10.65.156.11/installer-7.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: e6dfe2cc-03ec-4747-95cc-a6b4a957bdf2 
> (/default-rack/10.65.159.174/installer-8.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: 001e2d76-ef05-493a-bdb6-d9debf1af2ea 
> (/default-rack/10.65.144.98/installer-5.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: 3261dcbf-7f78-4164-b9e2-b13e03f9a8d2 
> (/default-rack/10.65.150.32/installer-10.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster. {code}
> Due to this, SCM always stays in safe mode even though all prechecks have 
> passed.
> {code:java}
> root@installer-4:~# egrep 'SafeMode|Entering startup' 
> /opt/ozone/current/logs/ozone-scm-scm-installer-4.domain.log
> 2025-11-04 12:13:54,164 [main] INFO 
> org.apache.hadoop.hdds.scm.node.SCMNodeManager: Entering startup safe mode.
> 2025-11-04 12:13:54,295 [main] INFO 
> org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: Refreshed 
> Containers with one replica threshold count 0, with ec n replica threshold 
> count 0.
> 2025-11-04 12:13:54,298 [main] INFO 
> org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total 
> pipeline count is 0, healthy pipeline threshold count is 0
> 2025-11-04 12:13:54,299 [main] INFO 
> org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total 
> pipeline count is 0, pipeline's with at least one datanode reported threshold 
> count is 0
> 2025-11-04 12:13:55,006 [main] INFO org.apache.hadoop.hdds.scm.ha.SCMContext: 
> Update SafeModeStatus from SafeModeStatus{safeModeStatus=true, 
> preCheckPassed=false} to SafeModeStatus{safeModeStatus=true, 
> preCheckPassed=false}.
> 2025-11-04 12:14:00,321 
> [b1623451-5346-40b2-b1fc-5faa7c649b83@group-7D6A6F9524D7-StateMachineUpdater] 
> INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: 
> Refreshed total pipeline count is 0, healthy pipeline threshold count is 0
> 2025-11-04 12:14:00,321 
> [b1623451-5346-40b2-b1fc-5faa7c649b83@group-7D6A6F9524D7-StateMachineUpdater] 
> INFO org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: Refreshed 
> Containers with one replica threshold count 0, with ec n replica threshold 
> count 0.
> 2025-11-04 12:14:00,321 
> [b1623451-5346-40b2-b1fc-5faa7c649b83@group-7D6A6F9524D7-StateMachineUpdater] 
> INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: 
> Refreshed Total pipeline count is 0, pipeline's with at least one datanode 
> reported threshold count is 0
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-ContainerRegistrationReportForContainerSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: ContainerSafeModeRule 
> rule is successfully validated
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. 1 
> DataNodes registered, 1 required.
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: DataNodeSafeModeRule 
> rule is successfully validated
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: All SCM safe mode pre 
> check rules have passed
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.ha.SCMContext: Update SafeModeStatus from 
> SafeModeStatus{safeModeStatus=true, preCheckPassed=false} to 
> SafeModeStatus{safeModeStatus=true, preCheckPassed=true}.
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.pipeline.BackgroundPipelineCreator: trigger a 
> one-shot run on scm1-RatisPipelineUtilsThread.
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: 
> AtleastOneDatanodeReportedRule rule is successfully validated {code}
> {code:java}
> root@installer-4:~# ozone admin safemode status --verbose
> SCM is in safe mode.
> validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required 
> datanodes (=1)
> validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines 
> (=0) >= healthyPipelineThresholdCount (=0)
> validated:true, ContainerSafeModeRule, 100.00% of [Ratis] Containers(0 / 0) 
> with at least one reported replica (=1.00) >= safeModeCutoff (=0.99);
> 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= 
> safeModeCutoff (=0.99);
> validated:true, AtleastOneDatanodeReportedRule, reported Ratis/THREE 
> pipelines with at least one datanode (=0) >= threshold (=0)
> root@installer-4:~# ozone admin scm roles
> installer-4.vpc.cloudera.com:9894:LEADER:b1623451-5346-40b2-b1fc-5faa7c649b83:10.65.157.76
> installer-6.vpc.cloudera.com:9894:FOLLOWER:02a1f01e-0f38-427c-959d-c7733a07d106:10.65.147.12
> installer-5.vpc.cloudera.com:9894:FOLLOWER:738a934b-6466-46d1-b2e2-b10dcdaa45ec:10.65.144.98
> root@installer-4:~# ozone admin om roles
> om1 : FOLLOWER (installer-4.vpc.cloudera.com)
> om2 : FOLLOWER (installer-5.vpc.cloudera.com)
> om3 : LEADER (installer-6.vpc.cloudera.com){code}



