[jira] [Assigned] (HDDS-13882) Datanode status as HEALTHY even for NoDiskSpace

Krishna Kumar Asawa (Jira) Sun, 16 Nov 2025 21:38:07 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-13882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Krishna Kumar Asawa reassigned HDDS-13882:
------------------------------------------

    Assignee: Siddhant Sangwan

> Datanode status as HEALTHY even for NoDiskSpace
> -----------------------------------------------
>
>                 Key: HDDS-13882
>                 URL: https://issues.apache.org/jira/browse/HDDS-13882
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 2.0.0
>            Reporter: Soumitra Sulav
>            Assignee: Siddhant Sangwan
>            Priority: Critical
>
>  
> Datanode status is shown as HEALTHY even when the available capacity in the 
> datanode dir on each datanode is just 4.5 GB.
> Pipeline creation fails because it cannot allocate the minimum 5 GB for a 
> container.
> {code:java}
> scm@installer-4:~$ ozone admin pipeline create
> Unable to find enough nodes that meet the space requirement of 1073741824 
> bytes for metadata and 5368709120 bytes for data in healthy node set. Nodes 
> required: 1 Found: 0
> scm@installer-4:~$ df -Th
> Filesystem      Type      Size  Used Avail Use% Mounted on
> /dev/root       ext4      7.6G  3.2G  4.5G  42% /
> devtmpfs        devtmpfs  1.9G     0  1.9G   0% /dev
> tmpfs           tmpfs     1.9G     0  1.9G   0% /dev/shm
> tmpfs           tmpfs     382M  944K  381M   1% /run
> tmpfs           tmpfs     5.0M     0  5.0M   0% /run/lock
> tmpfs           tmpfs     1.9G     0  1.9G   0% /sys/fs/cgroup
> /dev/loop0      squashfs   24M   24M     0 100% /snap/amazon-ssm-agent/11321
> /dev/loop1      squashfs   60M   60M     0 100% /snap/core20/2603
> /dev/loop3      squashfs   92M   92M     0 100% /snap/lxd/32669
> /dev/loop2      squashfs   69M   69M     0 100% /snap/core22/2012
> /dev/loop4      squashfs   45M   45M     0 100% /snap/snapd/24672
> /dev/nvme0n1p15 vfat       98M  6.3M   92M   7% /boot/efi
> tmpfs           tmpfs     382M     0  382M   0% /run/user/0
> root@installer-4:~# ozone admin datanode list
> Datanode: 920fd52c-9140-46b8-bc19-1c6ddb701cba 
> (/default-rack/10.65.157.76/installer-4.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: b7b077dd-7797-4992-9960-753cadeb51bb 
> (/default-rack/10.65.147.12/installer-6.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: c2af08f4-4494-46df-81e1-ea0eebc6b150 
> (/default-rack/10.65.154.72/installer-9.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: e6b7edc3-eb20-451f-947e-99147e76c188 
> (/default-rack/10.65.156.11/installer-7.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: e6dfe2cc-03ec-4747-95cc-a6b4a957bdf2 
> (/default-rack/10.65.159.174/installer-8.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: 001e2d76-ef05-493a-bdb6-d9debf1af2ea 
> (/default-rack/10.65.144.98/installer-5.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: 3261dcbf-7f78-4164-b9e2-b13e03f9a8d2 
> (/default-rack/10.65.150.32/installer-10.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster. {code}
> Because of this, SCM stays in safe mode even though all prechecks have 
> passed.
> {code:java}
> root@installer-4:~# egrep 'SafeMode|Entering startup' 
> /opt/ozone/current/logs/ozone-scm-scm-installer-4.domain.log
> 2025-11-04 12:13:54,164 [main] INFO 
> org.apache.hadoop.hdds.scm.node.SCMNodeManager: Entering startup safe mode.
> 2025-11-04 12:13:54,295 [main] INFO 
> org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: Refreshed 
> Containers with one replica threshold count 0, with ec n replica threshold 
> count 0.
> 2025-11-04 12:13:54,298 [main] INFO 
> org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total 
> pipeline count is 0, healthy pipeline threshold count is 0
> 2025-11-04 12:13:54,299 [main] INFO 
> org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total 
> pipeline count is 0, pipeline's with at least one datanode reported threshold 
> count is 0
> 2025-11-04 12:13:55,006 [main] INFO org.apache.hadoop.hdds.scm.ha.SCMContext: 
> Update SafeModeStatus from SafeModeStatus{safeModeStatus=true, 
> preCheckPassed=false} to SafeModeStatus{safeModeStatus=true, 
> preCheckPassed=false}.
> 2025-11-04 12:14:00,321 
> [b1623451-5346-40b2-b1fc-5faa7c649b83@group-7D6A6F9524D7-StateMachineUpdater] 
> INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: 
> Refreshed total pipeline count is 0, healthy pipeline threshold count is 0
> 2025-11-04 12:14:00,321 
> [b1623451-5346-40b2-b1fc-5faa7c649b83@group-7D6A6F9524D7-StateMachineUpdater] 
> INFO org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: Refreshed 
> Containers with one replica threshold count 0, with ec n replica threshold 
> count 0.
> 2025-11-04 12:14:00,321 
> [b1623451-5346-40b2-b1fc-5faa7c649b83@group-7D6A6F9524D7-StateMachineUpdater] 
> INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: 
> Refreshed Total pipeline count is 0, pipeline's with at least one datanode 
> reported threshold count is 0
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-ContainerRegistrationReportForContainerSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: ContainerSafeModeRule 
> rule is successfully validated
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. 1 
> DataNodes registered, 1 required.
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: DataNodeSafeModeRule 
> rule is successfully validated
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: All SCM safe mode pre 
> check rules have passed
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.ha.SCMContext: Update SafeModeStatus from 
> SafeModeStatus{safeModeStatus=true, preCheckPassed=false} to 
> SafeModeStatus{safeModeStatus=true, preCheckPassed=true}.
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.pipeline.BackgroundPipelineCreator: trigger a 
> one-shot run on scm1-RatisPipelineUtilsThread.
> 2025-11-04 12:14:35,593 
> [scm1-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule] INFO 
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: 
> AtleastOneDatanodeReportedRule rule is successfully validated {code}
> {code:java}
> root@installer-4:~# ozone admin safemode status --verbose
> SCM is in safe mode.
> validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required 
> datanodes (=1)
> validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines 
> (=0) >= healthyPipelineThresholdCount (=0)
> validated:true, ContainerSafeModeRule, 100.00% of [Ratis] Containers(0 / 0) 
> with at least one reported replica (=1.00) >= safeModeCutoff (=0.99);
> 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= 
> safeModeCutoff (=0.99);
> validated:true, AtleastOneDatanodeReportedRule, reported Ratis/THREE 
> pipelines with at least one datanode (=0) >= threshold (=0)
> root@installer-4:~# ozone admin scm roles
> installer-4.vpc.cloudera.com:9894:LEADER:b1623451-5346-40b2-b1fc-5faa7c649b83:10.65.157.76
> installer-6.vpc.cloudera.com:9894:FOLLOWER:02a1f01e-0f38-427c-959d-c7733a07d106:10.65.147.12
> installer-5.vpc.cloudera.com:9894:FOLLOWER:738a934b-6466-46d1-b2e2-b10dcdaa45ec:10.65.144.98
> root@installer-4:~# ozone admin om roles
> om1 : FOLLOWER (installer-4.vpc.cloudera.com)
> om2 : FOLLOWER (installer-5.vpc.cloudera.com)
> om3 : LEADER (installer-6.vpc.cloudera.com){code}
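
The byte counts in the pipeline create error above can be checked against the df output with a short calculation. This is only a sketch, not Ozone code: the constant and function names are hypothetical, and it assumes the 4.5G that df reports for / is roughly 4.5 GiB, since df -h rounds and uses powers of 1024.

```python
GIB = 1024 ** 3

# Values copied from the "ozone admin pipeline create" error message.
METADATA_REQUIRED_BYTES = 1073741824
DATA_REQUIRED_BYTES = 5368709120

# "Avail" for / from the df -Th output. Assumption: df's rounded
# "4.5G" is treated as exactly 4.5 GiB for this illustration.
AVAILABLE_BYTES = int(4.5 * GIB)


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def has_enough_space(available: int, required: int) -> bool:
    """Hypothetical helper: True if the available bytes cover the requirement."""
    return available >= required


if __name__ == "__main__":
    print("metadata requirement (GiB):", to_gib(METADATA_REQUIRED_BYTES))
    print("data requirement (GiB):", to_gib(DATA_REQUIRED_BYTES))
    print("available (GiB):", to_gib(AVAILABLE_BYTES))
    print("enough for data:", has_enough_space(AVAILABLE_BYTES, DATA_REQUIRED_BYTES))
```

Under these assumptions the data requirement works out to 5 GiB and the metadata requirement to 1 GiB, so a volume with about 4.5 GiB free cannot satisfy the data requirement, which is consistent with "Nodes required: 1 Found: 0" in the error.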



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
