[ 
https://issues.apache.org/jira/browse/HDDS-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDDS-6546:
-----------------------------
    Description: 
The upgrade acceptance test has been especially flaky recently:

{code}
15:11:05.927    INFO    Running command 'ozone sh key put /new2-volume/new2-bucket/new2-key /opt/hadoop/NOTICE.txt 2>&1'.
15:11:08.856    INFO    ${rc} = 255
15:11:08.856    INFO    ${output} = INTERNAL_ERROR Allocated 0 blocks. Requested 1 blocks
{code}

from https://github.com/apache/ozone/pull/3199#issuecomment-1081394937

~~Goal: Check and wait for SCM/DN readiness before creating key.~~

As [~erose] and I dug into the issue, it appears that the pipeline maps on the 
SCM side become inconsistent: a pipeline ID is retrieved from 
query2OpenPipelines but has already been removed from pipelineMap in 
PipelineStateMap:

{code}
scm_1    | 2022-03-23 14:56:32,154 [IPC Server handler 90 on default port 9863] ERROR block.BlockManagerImpl: Pipeline Machine count is zero.
scm_1    | org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=8515aa81-2361-482a-82a8-bc5b5340dc23 not found
scm_1    |      at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:157)
scm_1    |      at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerImpl.getPipeline(PipelineStateManagerImpl.java:137)
scm_1    |      at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
scm_1    |      at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
scm_1    |      at java.base/java.lang.reflect.Method.invoke(Method.java:566)
scm_1    |      at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeLocal(SCMHAInvocationHandler.java:83)
scm_1    |      at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:68)
scm_1    |      at com.sun.proxy.$Proxy16.getPipeline(Unknown Source)
scm_1    |      at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.getPipeline(PipelineManagerImpl.java:212)
scm_1    |      at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.newBlock(BlockManagerImpl.java:200)
scm_1    |      at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:180)
scm_1    |      at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
scm_1    |      at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:180)
scm_1    |      at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:130)
scm_1    |      at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
scm_1    |      at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:112)
scm_1    |      at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:13937)
scm_1    |      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:466)
scm_1    |      at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
scm_1    |      at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
scm_1    |      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
scm_1    |      at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
scm_1    |      at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
scm_1    |      at java.base/java.security.AccessController.doPrivileged(Native Method)
scm_1    |      at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
scm_1    |      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
scm_1    |      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
{code}

from https://github.com/elek/ozone-build-results/blob/master/2022/03/23/13958/acceptance-misc/upgrade/1.1.0-1.2.0/docker-1.2.0-finalized.log#L3539-L3567

The issue dates back to at least November-December 2021:

https://github.com/elek/ozone-build-results/blob/master/2021/12/08/11982/acceptance-misc/docker-1.2.0-finalized.log#L3269
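
To make the suspected inconsistency concrete, here is a minimal toy model in Java. It is not the Ozone implementation: the class ToyStateMap, its methods, and the "RATIS/THREE" query key are hypothetical, and only the field names pipelineMap and query2OpenPipelines are borrowed from the description above. It assumes the failure mode is an ID that is dropped from pipelineMap while still listed in query2OpenPipelines; the actual root cause is not confirmed here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.UUID;

/** Toy model only; this is NOT the Ozone PipelineStateMap implementation. */
public class PipelineMapSketch {

  /** Hypothetical stand-in for PipelineStateMap with two indexes. */
  static class ToyStateMap {
    // Primary index: pipeline ID -> pipeline description.
    private final Map<UUID, String> pipelineMap = new HashMap<>();
    // Secondary index: replication query -> IDs of open pipelines.
    private final Map<String, List<UUID>> query2OpenPipelines = new HashMap<>();

    void addOpenPipeline(String query, UUID id, String description) {
      pipelineMap.put(id, description);
      query2OpenPipelines.computeIfAbsent(query, k -> new ArrayList<>()).add(id);
    }

    /** Removes from the primary index only, leaving a stale ID behind. */
    void removeFromPrimaryOnly(UUID id) {
      pipelineMap.remove(id);
    }

    /** Removes from both indexes so they stay consistent. */
    void removeFromBoth(UUID id) {
      pipelineMap.remove(id);
      query2OpenPipelines.values().forEach(ids -> ids.remove(id));
    }

    List<UUID> openPipelineIds(String query) {
      return new ArrayList<>(
          query2OpenPipelines.getOrDefault(query, Collections.emptyList()));
    }

    String getPipeline(UUID id) {
      String pipeline = pipelineMap.get(id);
      if (pipeline == null) {
        // Mirrors the "PipelineID=... not found" message in the SCM log.
        throw new NoSuchElementException("PipelineID=" + id + " not found");
      }
      return pipeline;
    }
  }

  /** Returns true if every ID listed as open still resolves in pipelineMap. */
  static boolean allOpenPipelinesResolve(ToyStateMap map, String query) {
    for (UUID id : map.openPipelineIds(query)) {
      try {
        map.getPipeline(id);
      } catch (NoSuchElementException e) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    String query = "RATIS/THREE"; // hypothetical query key
    UUID id = UUID.randomUUID();

    ToyStateMap stale = new ToyStateMap();
    stale.addOpenPipeline(query, id, "pipeline-a");
    stale.removeFromPrimaryOnly(id);
    System.out.println(allOpenPipelinesResolve(stale, query)); // prints false

    ToyStateMap consistent = new ToyStateMap();
    consistent.addOpenPipeline(query, id, "pipeline-a");
    consistent.removeFromBoth(id);
    System.out.println(allOpenPipelinesResolve(consistent, query)); // prints true
  }
}
```

In the stale case a caller that picks an ID from the open-pipeline list and then looks it up hits the not-found path, which is the shape of the PipelineNotFoundException in the trace above.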

  was:
The upgrade acceptance test has been especially flaky recently:

{code}
15:11:05.927    INFO    Running command 'ozone sh key put /new2-volume/new2-bucket/new2-key /opt/hadoop/NOTICE.txt 2>&1'.
15:11:08.856    INFO    ${rc} = 255
15:11:08.856    INFO    ${output} = INTERNAL_ERROR Allocated 0 blocks. Requested 1 blocks
{code}

from https://github.com/apache/ozone/pull/3199#issuecomment-1081394937

Goal: Check and wait for SCM/DN readiness before creating key.


> Fix flaky SCM initialization in upgrade acceptance test
> -------------------------------------------------------
>
>                 Key: HDDS-6546
>                 URL: https://issues.apache.org/jira/browse/HDDS-6546
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: test
>            Reporter: Siyao Meng
>            Priority: Major


