Attila Doroszlai created HDDS-8982:
--------------------------------------
Summary: Log flooded by WritableRatisContainerProvider if
pipeline's nodes are not found
Key: HDDS-8982
URL: https://issues.apache.org/jira/browse/HDDS-8982
Project: Apache Ozone
Issue Type: Bug
Components: SCM
Reporter: Attila Doroszlai
HDDS-8981 revealed an infinite loop.
Steps (based on broken test case):
# Stop all datanodes
# Restart SCM
# Force SCM exit safe mode
# Allocate block
WritableRatisContainerProvider is stuck in an infinite loop:
{code}
[IPC Server handler 0 on default port 15001] WARN node.SCMNodeManager
(SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat,
datanode b1b8f5aa-dc08-49ca-95ee-cd36b9a3d13c not found.
[IPC Server handler 0 on default port 15001] WARN node.SCMNodeManager
(SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat,
datanode d37619cb-6c2f-4ca2-895f-33a0c8ee91be not found.
[IPC Server handler 0 on default port 15001] WARN node.SCMNodeManager
(SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat,
datanode 65bcd90d-da66-4520-87f5-dc49e35f7dc8 not found.
[IPC Server handler 0 on default port 15001] WARN
container.ContainerManagerImpl
(ContainerManagerImpl.java:getMatchingContainer(342)) - Container allocation
failed on pipeline=Pipeline[ Id: b7134617-f12a-460c-9ef7-36bb6f2810c6, Nodes:
b1b8f5aa-dc08-49ca-95ee-cd36b9a3d13c(localhost/127.0.0.1)d37619cb-6c2f-4ca2-895f-33a0c8ee91be(localhost/127.0.0.1)65bcd90d-da66-4520-87f5-dc49e35f7dc8(localhost/127.0.0.1),
ReplicationConfig: RATIS/THREE, State:OPEN,
leaderId:65bcd90d-da66-4520-87f5-dc49e35f7dc8,
CreationTimestamp2023-07-06T11:51:52.213+02:00[Europe/Vienna]]
java.lang.IllegalArgumentException
at
com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
at
org.apache.hadoop.hdds.scm.node.SCMNodeManager.minHealthyVolumeNum(SCMNodeManager.java:1118)
at
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.minHealthyVolumeNum(PipelineManagerImpl.java:648)
at
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getOpenContainerCountPerPipeline(ContainerManagerImpl.java:349)
at
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getMatchingContainer(ContainerManagerImpl.java:329)
at
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.selectContainer(WritableRatisContainerProvider.java:204)
at
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:101)
at
org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:70)
at
org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:164)
at
org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
at
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:192)
at
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:142)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:113)
at
org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:14238)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
{code}
Pipeline is open, datanodes are stopped, thus they do not register after SCM
restart.
{{minHealthyVolumeNum}} throws exception, {{getMatchingContainer}} catches it
and returns {{null}}. The loop in {{WritableRatisContainerProvider}} exits
only after such pipelines are closed (4,5 minutes, 57768149 iterations in this
case):
{code:title=disabled per-attempt log, log number of attempts on exit from loop}
2023-07-06 12:48:05,378 [IPC Server handler 3 on default port 15001] INFO
ha.SequenceIdGenerator (SequenceIdGenerator.java:getNextId(140)) - Allocate a
batch for delTxnId, change lastId from 0 to 1000.
...
2023-07-06 12:52:33,880 [BackgroundPipelineScrubberThread] INFO
pipeline.PipelineManagerImpl (PipelineManagerImpl.java:closePipeline(510)) -
Pipeline Pipeline[ Id: 1fcef9a6-655d-485d-ac83-20ec2bcc9225, Nodes:
9ec03363-f1aa-410a-aa90-5cac6833e36c(localhost/127.0.0.1), ReplicationConfig:
RATIS/ONE, State:OPEN, leaderId:9ec03363-f1aa-410a-aa90-5cac6833e36c,
CreationTimestamp2023-07-06T12:47:33.874+02:00[Europe/Vienna]] moved to CLOSED
state
2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] INFO
pipeline.WritableRatisContainerProvider
(WritableRatisContainerProvider.java:getContainer(158)) - Could not find
available pipeline of repConfig: RATIS/THREE even after retrying
2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] ERROR
pipeline.WritableRatisContainerProvider
(WritableRatisContainerProvider.java:getContainer(177)) - Give up after
57768149 attempts, unable to allocate block size=4194304, RATIS/THREE
2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] ERROR
block.BlockManagerImpl (BlockManagerImpl.java:allocateBlock(172)) - Unable to
allocate a block for the size: 4194304, replicationConfig: RATIS/THREE
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]