[ 
https://issues.apache.org/jira/browse/HDDS-3669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136587#comment-17136587
 ] 

Nanda kumar commented on HDDS-3669:
-----------------------------------

I agree that the proposed code will make it robust, but we should never land in 
such a state in the first place.

Maybe we should just check whether the pipeline is in CLOSED state before 
removing it. If the pipeline is in CLOSED state, it has already been removed 
from {{query2OpenPipelines}}. If the pipeline is not in CLOSED state, removing 
the pipeline should throw an exception.
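
A minimal sketch of that check, for illustration only. It is not the actual 
SCM code: {{PipelineRemovalSketch}}, {{StateMapSketch}}, the two-value 
{{PipelineState}} enum, the String pipeline ids and the exception types are 
all hypothetical, and the {{openPipelines}} set is only a simplified stand-in 
for {{query2OpenPipelines}}.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PipelineRemovalSketch {

  /** Hypothetical, simplified set of pipeline states. */
  enum PipelineState { OPEN, CLOSED }

  /** Hypothetical stand-in for the pipeline state map. */
  static class StateMapSketch {
    private final Map<String, PipelineState> pipelines = new HashMap<>();
    // Simplified stand-in for query2OpenPipelines.
    private final Set<String> openPipelines = new HashSet<>();

    void addPipeline(String id) {
      pipelines.put(id, PipelineState.OPEN);
      openPipelines.add(id);
    }

    /** Closing is the only step that drops the pipeline from the open index. */
    void closePipeline(String id) {
      requireKnown(id);
      pipelines.put(id, PipelineState.CLOSED);
      openPipelines.remove(id);
    }

    /** Removal is allowed only for CLOSED pipelines; anything else throws. */
    void removePipeline(String id) {
      PipelineState state = requireKnown(id);
      if (state != PipelineState.CLOSED) {
        throw new IllegalStateException(
            "Pipeline " + id + " is in state " + state + ", expected CLOSED");
      }
      // A CLOSED pipeline has already left openPipelines, so only the
      // main map needs to be updated here.
      pipelines.remove(id);
    }

    boolean isOpen(String id) {
      return openPipelines.contains(id);
    }

    boolean contains(String id) {
      return pipelines.containsKey(id);
    }

    private PipelineState requireKnown(String id) {
      PipelineState state = pipelines.get(id);
      if (state == null) {
        throw new IllegalArgumentException("Pipeline " + id + " not found");
      }
      return state;
    }
  }

  public static void main(String[] args) {
    StateMapSketch map = new StateMapSketch();
    map.addPipeline("p1");
    try {
      map.removePipeline("p1"); // still OPEN, so this is rejected
    } catch (IllegalStateException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
    map.closePipeline("p1");
    map.removePipeline("p1"); // CLOSED, so removal succeeds
    System.out.println("Still tracked: " + map.contains("p1"));
  }
}
```

With this shape, a pipeline can never be dropped from the main map while it is 
still listed as open, so a caller cannot pick an open pipeline that later turns 
out to be missing.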

> SCM Infinite loop in BlockManagerImpl.allocateBlock
> ---------------------------------------------------
>
>                 Key: HDDS-3669
>                 URL: https://issues.apache.org/jira/browse/HDDS-3669
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: SCM
>    Affects Versions: 0.6.0
>            Reporter: maobaolong
>            Assignee: maobaolong
>            Priority: Major
>              Labels: Triaged
>
> The following steps reproduce this issue:
> - Start a new Ozone cluster with only a factor-three pipeline.
> - Put a big file (1G) into the cluster; during the put, kill the 
> leader datanode of this pipeline.
> The put command hangs, and the following log fills the SCM log file:
> 2020-05-27 17:32:46,988 [IPC Server handler 23 on default port 9863] WARN 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager: Container 
> allocation failed for pipeline=Pipeline[ Id: 
> bf7dd356-2d97-4b2a-8a81-e2ddd25bc5a1, Nodes: 
> e859cad9-c7f6-451a-a039-af06103aa978{ip: 127.0.0.1, host: localhost, 
> networkLocation: /default-rack, certSerialId: 
> null}1cd2bf20-a791-42a0-b4cd-b26d995cb8eb{ip: 127.0.0.1, host: localhost, 
> networkLocation: /default-rack, certSerialId: 
> null}0827f3bb-0d94-435a-a157-4db2c84cdedf{ip: 127.0.0.1, host: localhost, 
> networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:3, 
> State:OPEN, leaderId:0827f3bb-0d94-435a-a157-4db2c84cdedf, 
> CreationTimestamp2020-05-27T08:05:36.590Z] requiredSize=268435456 {}
> org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: 
> PipelineID=bf7dd356-2d97-4b2a-8a81-e2ddd25bc5a1 not found
>         at 
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getContainers(PipelineStateMap.java:301)
>         at 
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager.getContainers(PipelineStateManager.java:95)
>         at 
> org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager.getContainersInPipeline(SCMPipelineManager.java:360)
>         at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.getContainersForOwner(SCMContainerManager.java:507)
>         at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.getMatchingContainer(SCMContainerManager.java:428)
>         at 
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:230)
>         at 
> org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:190)
>         at 
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:167)
>         at 
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:119)
>         at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:74)
>         at 
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:100)
>         at 
> org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:13303)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)



