Wei-Chiu Chuang created HDDS-11521:
--------------------------------------

             Summary: Race condition between pipeline close and block 
allocation causes client aborts
                 Key: HDDS-11521
                 URL: https://issues.apache.org/jira/browse/HDDS-11521
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Wei-Chiu Chuang


An HMaster aborted prematurely. Judging from the relevant logs (HMaster, SCM), 
there appears to be a race condition: if a client is waiting to allocate a new 
block when the block's pipeline is closed, the client waits for up to 60 
seconds and then aborts without retrying.

Expected behavior: the client should retry with another pipeline, whether its 
wait is preempted when the pipeline is closed or runs to the 60-second timeout.
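
A minimal sketch of that expected behavior, for illustration only. The names 
PipelineRetrySketch, Pipeline, State, waitReady and firstReady are hypothetical 
and are not Ozone classes or methods; the real wait happens in 
PipelineManagerImpl.waitOnePipelineReady and may be structured differently. The 
sketch assumes the waiter can observe the pipeline state: it stops waiting as 
soon as the pipeline is CLOSED, and the caller moves on to the next candidate 
pipeline on either a close or a timeout.

```java
import java.io.IOException;
import java.util.List;

final class PipelineRetrySketch {
    // Simplified pipeline states for this sketch only.
    enum State { ALLOCATED, OPEN, CLOSED }

    /** Hypothetical stand-in for a pipeline; not an Ozone class. */
    static final class Pipeline {
        final String id;
        volatile State state;

        Pipeline(String id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    /** Waits until the pipeline is OPEN; fails fast if it is CLOSED. */
    static void waitReady(Pipeline p, long timeoutMs)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            State s = p.state;
            if (s == State.OPEN) {
                return;
            }
            if (s == State.CLOSED) {
                // Preempt the wait instead of sleeping until the timeout.
                throw new IOException(
                    "Pipeline " + p.id + " was closed while waiting");
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IOException(
                    "Pipeline " + p.id + " is not ready in " + timeoutMs + " ms");
            }
            Thread.sleep(10);
        }
    }

    /** Tries each candidate in turn; a close or a timeout moves to the next. */
    static Pipeline firstReady(List<Pipeline> candidates, long timeoutMs)
            throws IOException, InterruptedException {
        IOException last = null;
        for (Pipeline p : candidates) {
            try {
                waitReady(p, timeoutMs);
                return p;
            } catch (IOException e) {
                last = e; // remember the failure and retry with another pipeline
            }
        }
        throw new IOException("No usable pipeline among "
            + candidates.size() + " candidates", last);
    }
}
```

With a closed pipeline first in the candidate list and an open one second, 
firstReady returns the open pipeline immediately rather than blocking for the 
full timeout on the closed one.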

Relevant log:

Pipeline creation:
{noformat}
2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 to datanode:b097b750-84ac-4aac-98b2-0917935b7cda
2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 to datanode:0abd3422-fb3b-48dc-9dfa-27978cc3e1d6
2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 to datanode:06c64b46-3f66-45e6-8c65-1ce0bd979379
{noformat}

Pipeline close:
{noformat}
2024-10-01 09:51:24,132 INFO [node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.node.StaleNodeHandler: Datanode 06c64b46-3f66-45e6-8c65-1ce0bd979379(ccycloud-5.quasar-aljjma.root.comops.site/10.140.13.6) moved to stale state. Finalizing its pipelines [PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0, PipelineID=ffe5aa54-f12b-4334-aae4-5921f54bb916, PipelineID=053b7d1d-e351-454b-94f1-f2cf81c403df, PipelineID=4b647943-30e5-49d7-8f4a-cd374b7e8e1b, PipelineID=1b5f1653-671b-4959-a684-2c8eb7a6b96f]
2024-10-01 09:51:24,140 INFO [node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl: Pipeline PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 moved to CLOSED state
{noformat}

Pipeline allocation timeout:
{noformat}
2024-10-01 09:52:07,368 WARN [IPC Server handler 95 on 9863]-org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider: Pipeline creation failed for repConfig: RATIS/THREE. Retrying get pipelines call once.
java.io.IOException: Pipeline 48431096-9933-46d6-a462-abfc89ecd8b0 is not ready in 60000 ms
        at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.waitOnePipelineReady(PipelineManagerImpl.java:772)
        at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.waitPipelineReady(PipelineManagerImpl.java:725)
        at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:103)
        at org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
        at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:163)
        at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:216)
        at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:198)
{noformat}
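
To make the timeline explicit, the intervals between the three excerpts can be 
computed from their timestamps. The helper below is a small illustrative sketch 
(PipelineTimeline and millisBetween are hypothetical names) and assumes all 
three excerpts come from the same SCM log and therefore share one clock.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class PipelineTimeline {
    // Timestamp layout used by the log lines quoted in this report.
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Milliseconds elapsed between two log timestamps. */
    static long millisBetween(String from, String to) {
        return Duration.between(
            LocalDateTime.parse(from, LOG_TS),
            LocalDateTime.parse(to, LOG_TS)).toMillis();
    }
}
```

Applied to the logs: millisBetween("2024-10-01 09:51:07,285", 
"2024-10-01 09:51:24,140") is 16855, so the pipeline moved to CLOSED about 
16.9 seconds after the CreatePipelineCommand was sent. From the close at 
09:51:24,140 to the warning at 09:52:07,368 is 43228 ms, so the allocation 
kept waiting roughly 43 seconds on a pipeline that was already closed. The 
total of 60083 ms is consistent with the 60000 ms wait reported in the 
exception.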


