Attila Doroszlai created HDDS-8982:
--------------------------------------

             Summary: Log flooded by WritableRatisContainerProvider if 
pipeline's nodes are not found
                 Key: HDDS-8982
                 URL: https://issues.apache.org/jira/browse/HDDS-8982
             Project: Apache Ozone
          Issue Type: Bug
          Components: SCM
            Reporter: Attila Doroszlai


HDDS-8981 revealed an infinite loop.

Steps (based on broken test case):

# Stop all datanodes
# Restart SCM
# Force SCM exit safe mode
# Allocate block

WritableRatisContainerProvider is stuck in an infinite loop:

{code}
[IPC Server handler 0 on default port 15001] WARN  node.SCMNodeManager 
(SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, 
datanode b1b8f5aa-dc08-49ca-95ee-cd36b9a3d13c not found.
[IPC Server handler 0 on default port 15001] WARN  node.SCMNodeManager 
(SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, 
datanode d37619cb-6c2f-4ca2-895f-33a0c8ee91be not found.
[IPC Server handler 0 on default port 15001] WARN  node.SCMNodeManager 
(SCMNodeManager.java:minHealthyVolumeNum(1114)) - Cannot generate NodeStat, 
datanode 65bcd90d-da66-4520-87f5-dc49e35f7dc8 not found.
[IPC Server handler 0 on default port 15001] WARN  
container.ContainerManagerImpl 
(ContainerManagerImpl.java:getMatchingContainer(342)) - Container allocation 
failed on pipeline=Pipeline[ Id: b7134617-f12a-460c-9ef7-36bb6f2810c6, Nodes: 
b1b8f5aa-dc08-49ca-95ee-cd36b9a3d13c(localhost/127.0.0.1)d37619cb-6c2f-4ca2-895f-33a0c8ee91be(localhost/127.0.0.1)65bcd90d-da66-4520-87f5-dc49e35f7dc8(localhost/127.0.0.1),
 ReplicationConfig: RATIS/THREE, State:OPEN, 
leaderId:65bcd90d-da66-4520-87f5-dc49e35f7dc8, 
CreationTimestamp2023-07-06T11:51:52.213+02:00[Europe/Vienna]]
java.lang.IllegalArgumentException
        at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
        at 
org.apache.hadoop.hdds.scm.node.SCMNodeManager.minHealthyVolumeNum(SCMNodeManager.java:1118)
        at 
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.minHealthyVolumeNum(PipelineManagerImpl.java:648)
        at 
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getOpenContainerCountPerPipeline(ContainerManagerImpl.java:349)
        at 
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getMatchingContainer(ContainerManagerImpl.java:329)
        at 
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.selectContainer(WritableRatisContainerProvider.java:204)
        at 
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:101)
        at 
org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:70)
        at 
org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:164)
        at 
org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
        at 
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:192)
        at 
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:142)
        at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
        at 
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:113)
        at 
org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:14238)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
{code}

Pipeline is open, datanodes are stopped, thus they do not register after SCM 
restart.  

{{minHealthyVolumeNum}} throws exception, {{getMatchingContainer}} catches it 
and returns {{null}}.  The loop in {{WritableRatisContainerProvider}} exits 
only after such pipelines are closed (4,5 minutes, 57768149 iterations in this 
case):

{code:title=disabled per-attempt log, log number of attempts on exit from loop}
2023-07-06 12:48:05,378 [IPC Server handler 3 on default port 15001] INFO  
ha.SequenceIdGenerator (SequenceIdGenerator.java:getNextId(140)) - Allocate a 
batch for delTxnId, change lastId from 0 to 1000.
...
2023-07-06 12:52:33,880 [BackgroundPipelineScrubberThread] INFO  
pipeline.PipelineManagerImpl (PipelineManagerImpl.java:closePipeline(510)) - 
Pipeline Pipeline[ Id: 1fcef9a6-655d-485d-ac83-20ec2bcc9225, Nodes: 
9ec03363-f1aa-410a-aa90-5cac6833e36c(localhost/127.0.0.1), ReplicationConfig: 
RATIS/ONE, State:OPEN, leaderId:9ec03363-f1aa-410a-aa90-5cac6833e36c, 
CreationTimestamp2023-07-06T12:47:33.874+02:00[Europe/Vienna]] moved to CLOSED 
state
2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] INFO  
pipeline.WritableRatisContainerProvider 
(WritableRatisContainerProvider.java:getContainer(158)) - Could not find 
available pipeline of repConfig: RATIS/THREE even after retrying
2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] ERROR 
pipeline.WritableRatisContainerProvider 
(WritableRatisContainerProvider.java:getContainer(177)) - Give up after 
57768149 attempts, unable to allocate block size=4194304, RATIS/THREE
2023-07-06 12:52:33,880 [IPC Server handler 0 on default port 15001] ERROR 
block.BlockManagerImpl (BlockManagerImpl.java:allocateBlock(172)) - Unable to 
allocate a block for the size: 4194304, replicationConfig: RATIS/THREE
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to