[
https://issues.apache.org/jira/browse/HDDS-11521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ashish Kumar resolved HDDS-11521.
---------------------------------
Resolution: Not A Problem
SCM already retries on a different pipeline when pipeline creation fails, but
a pipeline can only be created when there are enough healthy DNs.
The client in this case does not need to retry block allocation on a different
pipeline, as that is already handled in SCM.
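For illustration only, here is a minimal sketch of the fallback described above. It assumes a simplified model: the class, enum, and method names (PipelineRetrySketch, PipelineCreator, choosePipeline) are hypothetical and are not the actual SCM classes. It only mirrors the "Retrying get pipelines call once" message in the quoted log.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical, simplified model of the retry; not the actual Ozone SCM code. */
public class PipelineRetrySketch {

  enum State { OPEN, CLOSED }

  /** Minimal stand-in for a pipeline: an id plus its current state. */
  static final class Pipeline {
    final String id;
    final State state;

    Pipeline(String id, State state) {
      this.id = id;
      this.state = state;
    }
  }

  /** Stand-in for "create a pipeline and wait until it is ready". */
  interface PipelineCreator {
    Pipeline createAndWait() throws IOException;
  }

  /**
   * Tries to create a new pipeline. If that fails (for example, on the
   * readiness timeout), falls back once to any pipeline that is already OPEN.
   */
  static Pipeline choosePipeline(PipelineCreator creator, List<Pipeline> known)
      throws IOException {
    try {
      return creator.createAndWait();
    } catch (IOException e) {
      // Single retry: look for an existing OPEN pipeline instead.
      for (Pipeline p : known) {
        if (p.state == State.OPEN) {
          return p;
        }
      }
      // Nothing usable, e.g. too few healthy datanodes to form a pipeline.
      throw new IOException("No open pipeline available", e);
    }
  }

  public static void main(String[] args) throws IOException {
    List<Pipeline> known = List.of(
        new Pipeline("closed-1", State.CLOSED),
        new Pipeline("open-1", State.OPEN));
    Pipeline chosen = choosePipeline(
        () -> { throw new IOException("Pipeline is not ready in 60000 ms"); },
        known);
    System.out.println("Allocated on pipeline " + chosen.id);
  }
}
```

In this model the allocation fails only when no OPEN pipeline exists at all, which corresponds to the case of not having enough healthy DNs.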
> Race condition between pipeline close and block allocation causes client
> aborts
> -------------------------------------------------------------------------------
>
> Key: HDDS-11521
> URL: https://issues.apache.org/jira/browse/HDDS-11521
> Project: Apache Ozone
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Wei-Chiu Chuang
> Assignee: Ashish Kumar
> Priority: Critical
>
> We had an HMaster abort prematurely. Looking at the relevant logs (HMaster,
> SCM), there appears to be a race condition: if the client is waiting to
> allocate a new block while the pipeline of that block is closed, the client
> waits for up to 60 seconds and then aborts without retrying.
> Expected behavior: the client should (either after being preempted when the
> pipeline is closed, or after waiting for the 60-second timeout) retry with
> another pipeline.
> Relevant log:
> Pipeline creation:
> {noformat}
> 2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on
> 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending
> CreatePipelineCommand for pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0
> to datanode:b097b750-84ac-4aac-98b2-0917935b7cda
> 2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on
> 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending
> CreatePipelineCommand for pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0
> to datanode:0abd3422-fb3b-48dc-9dfa-27978cc3e1d6
> 2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on
> 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending
> CreatePipelineCommand for pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0
> to datanode:06c64b46-3f66-45e6-8c65-1ce0bd979379
> {noformat}
> Pipeline close:
> {noformat}
> 2024-10-01 09:51:24,132 INFO
> [node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.node.StaleNodeHandler:
> Datanode
> 06c64b46-3f66-45e6-8c65-1ce0bd979379(ccycloud-5.quasar-aljjma.root.comops.site/10.140.13.6)
> moved to stale state. Finalizing its pipelines
> [PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0,
> PipelineID=ffe5aa54-f12b-4334-aae4-5921f54bb916,
> PipelineID=053b7d1d-e351-454b-94f1-f2cf81c403df,
> PipelineID=4b647943-30e5-49d7-8f4a-cd374b7e8e1b,
> PipelineID=1b5f1653-671b-4959-a684-2c8eb7a6b96f]
> 2024-10-01 09:51:24,140 INFO
> [node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl:
> Pipeline PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 moved to CLOSED
> state
> {noformat}
> Pipeline allocation timeout:
> {noformat}
> 2024-10-01 09:52:07,368 WARN [IPC Server handler 95 on
> 9863]-org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider:
> Pipeline creation failed for repConfig: RATIS/THREE. Retrying get pipelines
> call once.
> java.io.IOException: Pipeline 48431096-9933-46d6-a462-abfc89ecd8b0 is not
> ready in 60000 ms
> at
> org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.waitOnePipelineReady(PipelineManagerImpl.java:772)
> at
> org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.waitPipelineReady(PipelineManagerImpl.java:725)
> at
> org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:103)
> at
> org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
> at
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:163)
> at
> org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:216)
> at
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:198)
> {noformat}