Wei-Chiu Chuang created HDDS-11521:
--------------------------------------
Summary: Race condition between pipeline close and block
allocation causes client aborts
Key: HDDS-11521
URL: https://issues.apache.org/jira/browse/HDDS-11521
Project: Apache Ozone
Issue Type: Bug
Reporter: Wei-Chiu Chuang
An HMaster aborted prematurely. Based on the relevant logs (HMaster, SCM),
there appears to be a race condition: if a client is waiting to allocate a
new block while the pipeline for that block is closed, the client waits for
up to 60 seconds and then aborts without retrying.
Expected behavior: the client should either be preempted when the pipeline is
closed or wait for the 60-second timeout, and then retry with another pipeline.
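To illustrate the expected behavior, here is a minimal sketch of a wait loop that stops as soon as the pipeline is observed CLOSED and then falls through to the next candidate pipeline. All names here (PipelineWaitSketch, waitOnePipeline, pickReadyPipeline, the State enum, and the lookup function) are hypothetical and are not the SCM API; the real logic lives in PipelineManagerImpl and WritableRatisContainerProvider and may need a different fix.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the Ozone code base.
class PipelineWaitSketch {
    enum State { ALLOCATED, OPEN, CLOSED }

    // Outcome of waiting on a single pipeline.
    enum WaitResult { READY, CLOSED, TIMED_OUT }

    // Polls the pipeline state until it is OPEN, it is CLOSED, or the
    // timeout elapses. A CLOSED pipeline ends the wait immediately
    // instead of consuming the full timeout.
    static WaitResult waitOnePipeline(String id, Function<String, State> lookup,
                                      long timeoutMs, long pollMs) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMs * 1_000_000L;
        while (true) {
            State state = lookup.apply(id);
            if (state == State.OPEN) {
                return WaitResult.READY;
            }
            if (state == State.CLOSED) {
                return WaitResult.CLOSED;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return WaitResult.TIMED_OUT;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up on this pipeline.
                Thread.currentThread().interrupt();
                return WaitResult.TIMED_OUT;
            }
        }
    }

    // Tries each candidate in order and returns the first pipeline that
    // becomes ready, or null if every candidate was closed or timed out.
    static String pickReadyPipeline(List<String> candidates,
                                    Function<String, State> lookup,
                                    long timeoutMs, long pollMs) {
        for (String id : candidates) {
            if (waitOnePipeline(id, lookup, timeoutMs, pollMs) == WaitResult.READY) {
                return id;
            }
        }
        return null;
    }
}
```

With a lookup that reports the first candidate as CLOSED and the second as OPEN, pickReadyPipeline returns the second candidate without waiting out the timeout on the first.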
Relevant log:
Pipeline creation:
{noformat}
2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 to datanode:b097b750-84ac-4aac-98b2-0917935b7cda
2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 to datanode:0abd3422-fb3b-48dc-9dfa-27978cc3e1d6
2024-10-01 09:51:07,285 INFO [IPC Server handler 95 on 9863]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 to datanode:06c64b46-3f66-45e6-8c65-1ce0bd979379
{noformat}
Pipeline close:
{noformat}
2024-10-01 09:51:24,132 INFO [node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.node.StaleNodeHandler: Datanode 06c64b46-3f66-45e6-8c65-1ce0bd979379(ccycloud-5.quasar-aljjma.root.comops.site/10.140.13.6) moved to stale state. Finalizing its pipelines [PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0, PipelineID=ffe5aa54-f12b-4334-aae4-5921f54bb916, PipelineID=053b7d1d-e351-454b-94f1-f2cf81c403df, PipelineID=4b647943-30e5-49d7-8f4a-cd374b7e8e1b, PipelineID=1b5f1653-671b-4959-a684-2c8eb7a6b96f]
2024-10-01 09:51:24,140 INFO [node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl: Pipeline PipelineID=48431096-9933-46d6-a462-abfc89ecd8b0 moved to CLOSED state
{noformat}
Pipeline allocation timeout:
{noformat}
2024-10-01 09:52:07,368 WARN [IPC Server handler 95 on 9863]-org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider: Pipeline creation failed for repConfig: RATIS/THREE. Retrying get pipelines call once.
java.io.IOException: Pipeline 48431096-9933-46d6-a462-abfc89ecd8b0 is not ready in 60000 ms
	at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.waitOnePipelineReady(PipelineManagerImpl.java:772)
	at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.waitPipelineReady(PipelineManagerImpl.java:725)
	at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:103)
	at org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
	at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:163)
	at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:216)
	at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:198)
{noformat}