[ 
https://issues.apache.org/jira/browse/HDDS-11521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDDS-11521:
-----------------------------------
    Affects Version/s: 2.0.0

> Race condition between pipeline close and block allocation causes client 
> aborts
> -------------------------------------------------------------------------------
>
>                 Key: HDDS-11521
>                 URL: https://issues.apache.org/jira/browse/HDDS-11521
>             Project: Apache Ozone
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Wei-Chiu Chuang
>            Priority: Critical
>
> We had an HMaster abort prematurely. Judging from the relevant logs (HMaster, 
> SCM), there appears to be a race condition: if the client is waiting to 
> allocate a new block while the pipeline for that block is closed, the client 
> waits for up to 60 seconds and then aborts without retrying.
> Expected behavior: the client should retry with another pipeline, either as 
> soon as the pipeline is closed or after the 60-second timeout expires.
> Relevant log:
> Pipeline creation:
> {noformat}
> 2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on 
> 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending 
> CreatePipelineCommand for 
> pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 to 
> datanode:b097b750-84ac-4aac-98b2-0917935b7cda
> 2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on 
> 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending 
> CreatePipelineCommand for 
> pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 to 
> datanode:0abd3422-fb3b-48dc-9dfa-27978cc3e1d6
> 2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on 
> 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending 
> CreatePipelineCommand for 
> pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 to 
> datanode:06c64b46-3f66-45e6-8c65-1ce0bd979379
> {noformat}
> Pipeline close:
> {noformat}
> 2024-10-01 09:51:24,132 INFO 
> [node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.node.StaleNodeHandler:
>  Datanode 
> 06c64b46-3f66-45e6-8c65-1ce0bd979379(ccycloud-5.quasar-aljjma.root.comops.site/10.140.13.6)
>  moved to stale state. Finalizing its pipelines 
> [PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0, 
> PipelineID=ffe5aa54-f12b-4334-aae4-5921f54bb916, 
> PipelineID=053b7d1d-e351-454b-94f1-f2cf81c403df, 
> PipelineID=4b647943-30e5-49d7-8f4a-cd374b7e8e1b, 
> PipelineID=1b5f1653-671b-4959-a684-2c8eb7a6b96f]
> 2024-10-01 09:51:24,140 INFO 
> [node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl:
>  Pipeline PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 moved to CLOSED 
> state
> {noformat}
> Pipeline allocation timeout:
> {noformat}
> 2024-10-01 09:52:07,368 WARN [IPC Server handler 95 on 
> 9863]-org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider: 
> Pipeline creation failed for repConfig: RATIS/THREE. Retrying get pipelines 
> call once.
> java.io.IOException: Pipeline 48431096-9933-46d6-a462-abfc89ecd8b0 is not 
> ready in 60000 ms
>         at 
> org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.waitOnePipelineReady(PipelineManagerImpl.java:772)
>         at 
> org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.waitPipelineReady(PipelineManagerImpl.java:725)
>         at 
> org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:103)
>         at 
> org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
>         at 
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:163)
>         at 
> org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:216)
>         at 
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:198)
> {noformat}
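
In the logs above, the pipeline moved to CLOSED at 09:51:24,140, yet the wait only failed at 09:52:07,368, roughly 43 seconds later. The expected behavior from the description (stop waiting once the pipeline is closed, then retry with another pipeline) might look roughly like the sketch below. Every name in it (`PipelineRetrySketch`, `Pipeline`, `State`, `WaitResult`, `waitUntilReady`, `firstReady`) is hypothetical and chosen for illustration; none of it is Ozone's actual API, and it is not a proposed patch.

```java
import java.util.List;

// Hypothetical sketch; none of these types are Apache Ozone classes.
final class PipelineRetrySketch {

    enum State { ALLOCATED, OPEN, CLOSED }

    enum WaitResult { READY, CLOSED, TIMED_OUT }

    /** Minimal stand-in for a pipeline whose state another thread may change. */
    static final class Pipeline {
        final String id;
        volatile State state;

        Pipeline(String id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    /**
     * Polls the pipeline until it is OPEN, until it is CLOSED, or until the
     * timeout elapses. A CLOSED pipeline ends the wait immediately instead of
     * consuming the rest of the timeout.
     */
    static WaitResult waitUntilReady(Pipeline p, long timeoutMs, long pollMs) {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            State s = p.state;
            if (s == State.OPEN) {
                return WaitResult.READY;
            }
            if (s == State.CLOSED) {
                return WaitResult.CLOSED;
            }
            if (System.nanoTime() - deadline >= 0) {
                return WaitResult.TIMED_OUT;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(
                    "Interrupted while waiting for pipeline " + p.id, e);
            }
        }
    }

    /**
     * Tries each candidate in order and returns the first one that becomes
     * ready. A closed or timed-out candidate leads to a retry with the next
     * candidate rather than an abort.
     */
    static Pipeline firstReady(List<Pipeline> candidates, long timeoutMs, long pollMs) {
        for (Pipeline p : candidates) {
            if (waitUntilReady(p, timeoutMs, pollMs) == WaitResult.READY) {
                return p;
            }
        }
        throw new IllegalStateException("No candidate pipeline became ready");
    }
}
```

With a candidate list whose first pipeline is already CLOSED and whose second is OPEN, `firstReady` returns the second pipeline without waiting out the timeout on the first. Polling is used here only to keep the sketch short; a notification from the pipeline-close path would avoid the polling delay.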



