[
https://issues.apache.org/jira/browse/HDDS-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-11780:
----------------------------------
Labels: pull-request-available (was: )
> Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes
> --------------------------------------------------------------------
>
> Key: HDDS-11780
> URL: https://issues.apache.org/jira/browse/HDDS-11780
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Client, SCM
> Reporter: Arafat Khan
> Assignee: Arafat Khan
> Priority: Major
> Labels: pull-request-available
>
> h4. Issue:
> There was an observed delay in the SCM exiting safe mode, taking slightly
> over a minute in one instance. This issue is related to the
> {*}HealthyPipelineSafeModeRule{*}, which depends on DataNodes reporting the
> health of pipelines to the SCM. Delays in pipeline reporting can impact the
> time taken to meet the criteria for exiting safe mode, thereby delaying
> client operations like writes.
> h4. Cause:
> The delay was caused by the following factors:
> # {*}DataNode Registration{*}: DataNodes register with the SCM using
> heartbeat intervals, which take around 30 seconds. If the leader {*}SCM is
> restarted and regains leadership{*}, all DataNodes and pipelines need to
> re-register, leading to delays.
> # {*}Pipeline Health Reporting{*}: Pipelines are marked as "healthy" only
> when all associated DataNodes are active and reporting. This process may take
> additional time due to the stabilisation of pipelines and network conditions.
> h4. Logs:
> From the logs:
> {code:java}
> 2024-10-24 00:45:49,630 INFO
> [main]-org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: containers
> with one replica threshold count 0
> 2024-10-24 00:45:49,636 INFO
> [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total
> pipeline count is 4, healthy pipeline threshold count is 1
> 2024-10-24 00:45:49,639 INFO
> [main]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
> Total pipeline count is 4, pipeline's with at least one datanode reported
> threshold count is 4
> 2024-10-24 00:45:55,452 INFO [main]-org.apache.hadoop.hdds.scm.ha.SCMContext:
> Update SafeModeStatus from SafeModeStatus{safeModeStatus=true,
> preCheckPassed=false} to SafeModeStatus{safeModeStatus=true,
> preCheckPassed=false}.
> 2024-10-24 00:46:01,908 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
> Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
> 2024-10-24 00:46:01,911 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule:
> Refreshed one replica container threshold 0, currentThreshold 0
> 2024-10-24 00:46:01,912 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
> Refreshed Total pipeline count is 4, pipeline's with at least one datanode
> reported threshold count is 4
> 2024-10-24 00:47:00,770 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
> Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
> 2024-10-24 00:47:00,770 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> ContainerSafeModeRule rule is successfully validated
> 2024-10-24 00:47:00,770 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
> Refreshed Total pipeline count is 4, pipeline's with at least one datanode
> reported threshold count is 4
> org.apache.hadoop.hdds.scm.exceptions.SCMException: SafeModePrecheck failed
> for deleteBlocks
> 2024-10-24 00:47:09,206 INFO
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. 1 DataNodes registered, 1 required.
> 2024-10-24 00:47:09,303 INFO
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> DataNodeSafeModeRule rule is successfully validated
> 2024-10-24 00:47:09,303 INFO
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> All SCM safe mode pre check rules have passed
> 2024-10-24 00:47:09,303 INFO
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.ha.SCMContext:
> Update SafeModeStatus from SafeModeStatus{safeModeStatus=true,
> preCheckPassed=false} to SafeModeStatus{safeModeStatus=true,
> preCheckPassed=true}.
> 2024-10-24 00:47:09,303 INFO
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.pipeline.BackgroundPipelineCreator:
> trigger a one-shot run on node3-RatisPipelineUtilsThread.
> 2024-10-24 00:47:09,207 INFO
> [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Pipelines with at least one datanode reported count is 2,
> required at least one datanode reported per pipeline count is 4
> 2024-10-24 00:47:09,304 INFO
> [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Pipelines with at least one datanode reported count is 4,
> required at least one datanode reported per pipeline count is 4
> 2024-10-24 00:47:09,304 INFO
> [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> AtleastOneDatanodeReportedRule rule is successfully validated
> 2024-10-24 00:47:09,620 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Healthy pipelines reported count is 0, required healthy
> pipeline reported count is 1
> 2024-10-24 00:47:09,620 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Healthy pipelines reported count is 0, required healthy
> pipeline reported count is 1
> 2024-10-24 00:47:09,620 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Healthy pipelines reported count is 0, required healthy
> pipeline reported count is 1
> 2024-10-24 00:47:09,620 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Healthy pipelines reported count is 0, required healthy
> pipeline reported count is 1
> 2024-10-24 00:47:09,935 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Healthy pipelines reported count is 1, required healthy
> pipeline reported count is 1
> 2024-10-24 00:47:09,935 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> HealthyPipelineSafeModeRule rule is successfully validated
> 2024-10-24 00:47:09,935 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> ScmSafeModeManager, all rules are successfully validated
> 2024-10-24 00:47:09,935 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM exiting safe mode.
> 2024-10-24 00:47:09,936 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.ha.SCMContext:
> Update SafeModeStatus from SafeModeStatus{safeModeStatus=true,
> preCheckPassed=true} to SafeModeStatus{safeModeStatus=false,
> preCheckPassed=true}.
> {code}
> The *HealthyPipelineSafeModeRule* took a significant time to validate the
> required healthy pipeline threshold, contributing to the delay.
> h4. Impact:
> Due to this delay, Ozone client write operations were impacted, as they rely
> on the SCM being fully operational to allocate blocks or create keys.
> h3. Resolution:
> To prevent write failures during such delays, we propose increasing the
> following configurations in the Ozone client:
> * {{BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS}}
> * {{BLOCK_ALLOCATION_RETRY_COUNT}}
> This will allow the client to wait longer (slightly over a minute) for the
> SCM to exit safe mode, ensuring that writes are not prematurely failed during
> such scenarios.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]