[
https://issues.apache.org/jira/browse/HDDS-13882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Krishna Kumar Asawa reassigned HDDS-13882:
------------------------------------------
Assignee: Siddhant Sangwan
> Datanode status as HEALTHY even for NoDiskSpace
> -----------------------------------------------
>
> Key: HDDS-13882
> URL: https://issues.apache.org/jira/browse/HDDS-13882
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 2.0.0
> Reporter: Soumitra Sulav
> Assignee: Siddhant Sangwan
> Priority: Critical
>
>
> Datanode status is shown as HEALTHY even when the available capacity on the
> datanode dir of each datanode is just 4.5 GB.
> Pipeline creation fails because it cannot allocate the minimum 5 GB for a
> container.
> {code:java}
> scm@installer-4:~$ ozone admin pipeline create
> Unable to find enough nodes that meet the space requirement of 1073741824
> bytes for metadata and 5368709120 bytes for data in healthy node set. Nodes
> required: 1 Found: 0
> scm@installer-4:~$ df -Th
> Filesystem Type Size Used Avail Use% Mounted on
> /dev/root ext4 7.6G 3.2G 4.5G 42% /
> devtmpfs devtmpfs 1.9G 0 1.9G 0% /dev
> tmpfs tmpfs 1.9G 0 1.9G 0% /dev/shm
> tmpfs tmpfs 382M 944K 381M 1% /run
> tmpfs tmpfs 5.0M 0 5.0M 0% /run/lock
> tmpfs tmpfs 1.9G 0 1.9G 0% /sys/fs/cgroup
> /dev/loop0 squashfs 24M 24M 0 100% /snap/amazon-ssm-agent/11321
> /dev/loop1 squashfs 60M 60M 0 100% /snap/core20/2603
> /dev/loop3 squashfs 92M 92M 0 100% /snap/lxd/32669
> /dev/loop2 squashfs 69M 69M 0 100% /snap/core22/2012
> /dev/loop4 squashfs 45M 45M 0 100% /snap/snapd/24672
> /dev/nvme0n1p15 vfat 98M 6.3M 92M 7% /boot/efi
> tmpfs tmpfs 382M 0 382M 0% /run/user/0
> root@installer-4:~# ozone admin datanode list
> Datanode: 920fd52c-9140-46b8-bc19-1c6ddb701cba
> (/default-rack/10.65.157.76/installer-4.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: b7b077dd-7797-4992-9960-753cadeb51bb
> (/default-rack/10.65.147.12/installer-6.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: c2af08f4-4494-46df-81e1-ea0eebc6b150
> (/default-rack/10.65.154.72/installer-9.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: e6b7edc3-eb20-451f-947e-99147e76c188
> (/default-rack/10.65.156.11/installer-7.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: e6dfe2cc-03ec-4747-95cc-a6b4a957bdf2
> (/default-rack/10.65.159.174/installer-8.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: 001e2d76-ef05-493a-bdb6-d9debf1af2ea
> (/default-rack/10.65.144.98/installer-5.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster.
> Datanode: 3261dcbf-7f78-4164-b9e2-b13e03f9a8d2
> (/default-rack/10.65.150.32/installer-10.domain/0 pipelines)
> Operational State: IN_SERVICE
> Health State: HEALTHY
> Related pipelines:
> No pipelines in cluster. {code}
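
The error message and the `df` output above can be reconciled with a little arithmetic. The following is a minimal sketch, not SCM's actual placement code: the constant and function names are hypothetical, the two byte values are copied from the error message, and it assumes that `df -h` reports "G" in units of 1024^3 bytes and that the datanode data dir lives on the root filesystem shown.

```python
GIB = 1024 ** 3

# Byte values quoted in the `ozone admin pipeline create` error message.
METADATA_REQUIRED_BYTES = 1_073_741_824  # 1 GiB
DATA_REQUIRED_BYTES = 5_368_709_120      # 5 GiB

# Assumption: the 4.5G "Avail" value that df reports for / is the space
# available to the datanode data dir.
AVAILABLE_BYTES = int(4.5 * GIB)


def meets_data_requirement(available_bytes: int) -> bool:
    """Hypothetical check: is there room for the data space requirement?"""
    return available_bytes >= DATA_REQUIRED_BYTES


# How far the node falls short of the data requirement, in bytes.
shortfall = DATA_REQUIRED_BYTES - AVAILABLE_BYTES

print(meets_data_requirement(AVAILABLE_BYTES))  # False
print(shortfall / GIB)                          # 0.5
```

Under these assumptions every node falls 0.5 GiB short of the data requirement, which matches "Nodes required: 1 Found: 0" even though all seven datanodes are listed as HEALTHY.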
> Due to this, SCM always stays in SafeMode even though all prechecks have
> passed.
> {code:java}
> root@installer-4:~# egrep 'SafeMode|Entering startup'
> /opt/ozone/current/logs/ozone-scm-scm-installer-4.domain.log
> 2025-11-04 12:13:54,164 [main] INFO
> org.apache.hadoop.hdds.scm.node.SCMNodeManager: Entering startup safe mode.
> 2025-11-04 12:13:54,295 [main] INFO
> org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: Refreshed
> Containers with one replica threshold count 0, with ec n replica threshold
> count 0.
> 2025-11-04 12:13:54,298 [main] INFO
> org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total
> pipeline count is 0, healthy pipeline threshold count is 0
> 2025-11-04 12:13:54,299 [main] INFO
> org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total
> pipeline count is 0, pipeline's with at least one datanode reported threshold
> count is 0
> 2025-11-04 12:13:55,006 [main] INFO org.apache.hadoop.hdds.scm.ha.SCMContext:
> Update SafeModeStatus from SafeModeStatus{safeModeStatus=true,
> preCheckPassed=false} to SafeModeStatus{safeModeStatus=true,
> preCheckPassed=false}.
> 2025-11-04 12:14:00,321
> [b1623451-5346-40b2-b1fc-5faa7c649b83@group-7D6A6F9524D7-StateMachineUpdater]
> INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
> Refreshed total pipeline count is 0, healthy pipeline threshold count is 0
> 2025-11-04 12:14:00,321
> [b1623451-5346-40b2-b1fc-5faa7c649b83@group-7D6A6F9524D7-StateMachineUpdater]
> INFO org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: Refreshed
> Containers with one replica threshold count 0, with ec n replica threshold
> count 0.
> 2025-11-04 12:14:00,321
> [b1623451-5346-40b2-b1fc-5faa7c649b83@group-7D6A6F9524D7-StateMachineUpdater]
> INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
> Refreshed Total pipeline count is 0, pipeline's with at least one datanode
> reported threshold count is 0
> 2025-11-04 12:14:35,593
> [scm1-EventQueue-ContainerRegistrationReportForContainerSafeModeRule] INFO
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: ContainerSafeModeRule
> rule is successfully validated
> 2025-11-04 12:14:35,593
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. 1
> DataNodes registered, 1 required.
> 2025-11-04 12:14:35,593
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: DataNodeSafeModeRule
> rule is successfully validated
> 2025-11-04 12:14:35,593
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: All SCM safe mode pre
> check rules have passed
> 2025-11-04 12:14:35,593
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO
> org.apache.hadoop.hdds.scm.ha.SCMContext: Update SafeModeStatus from
> SafeModeStatus{safeModeStatus=true, preCheckPassed=false} to
> SafeModeStatus{safeModeStatus=true, preCheckPassed=true}.
> 2025-11-04 12:14:35,593
> [scm1-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO
> org.apache.hadoop.hdds.scm.pipeline.BackgroundPipelineCreator: trigger a
> one-shot run on scm1-RatisPipelineUtilsThread.
> 2025-11-04 12:14:35,593
> [scm1-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule] INFO
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> AtleastOneDatanodeReportedRule rule is successfully validated {code}
> {code:java}
> root@installer-4:~# ozone admin safemode status --verbose
> SCM is in safe mode.
> validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required
> datanodes (=1)
> validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines
> (=0) >= healthyPipelineThresholdCount (=0)
> validated:true, ContainerSafeModeRule, 100.00% of [Ratis] Containers(0 / 0)
> with at least one reported replica (=1.00) >= safeModeCutoff (=0.99);
> 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >=
> safeModeCutoff (=0.99);
> validated:true, AtleastOneDatanodeReportedRule, reported Ratis/THREE
> pipelines with at least one datanode (=0) >= threshold (=0)
> root@installer-4:~# ozone admin scm roles
> installer-4.vpc.cloudera.com:9894:LEADER:b1623451-5346-40b2-b1fc-5faa7c649b83:10.65.157.76
> installer-6.vpc.cloudera.com:9894:FOLLOWER:02a1f01e-0f38-427c-959d-c7733a07d106:10.65.147.12
> installer-5.vpc.cloudera.com:9894:FOLLOWER:738a934b-6466-46d1-b2e2-b10dcdaa45ec:10.65.144.98
> root@installer-4:~# ozone admin om roles
> om1 : FOLLOWER (installer-4.vpc.cloudera.com)
> om2 : FOLLOWER (installer-5.vpc.cloudera.com)
> om3 : LEADER (installer-6.vpc.cloudera.com){code}
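
In the verbose safe mode status above, two of the rules compare a count of 0 against a threshold of 0, so they validate without any pipeline existing. The sketch below picks those rules out of the status text. It is only an illustration: the function name is hypothetical, the regular expressions are fitted to the output format shown in this report and are not a stable CLI contract, and the lines are copied unwrapped (the ContainerSafeModeRule line is left out because it carries two comparisons).

```python
import re

# Verbose safe mode status lines copied from the report, unwrapped.
STATUS_LINES = [
    "validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= "
    "required datanodes (=1)",
    "validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines "
    "(=0) >= healthyPipelineThresholdCount (=0)",
    "validated:true, AtleastOneDatanodeReportedRule, reported Ratis/THREE "
    "pipelines with at least one datanode (=0) >= threshold (=0)",
]

# Matches "(=<actual>) >= <threshold name> (=<threshold>)".
COMPARISON = re.compile(r"\(=([0-9.]+)\) >= \w[\w ]*? \(=([0-9.]+)\)")
# Matches the rule name that follows "validated:<bool>, ".
RULE_NAME = re.compile(r"validated:\w+, (\w+),")


def trivially_satisfied_rules(lines):
    """Return the names of rules whose threshold is 0."""
    names = []
    for line in lines:
        name = RULE_NAME.match(line)
        comparison = COMPARISON.search(line)
        if name and comparison and float(comparison.group(2)) == 0:
            names.append(name.group(1))
    return names


print(trivially_satisfied_rules(STATUS_LINES))
# ['HealthyPipelineSafeModeRule', 'AtleastOneDatanodeReportedRule']
```

Both pipeline-based rules pass on a zero threshold, so "validated:true" on these lines does not by itself show that any pipeline was created.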