[
https://issues.apache.org/jira/browse/HDDS-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522413#comment-17522413
]
Siyao Meng commented on HDDS-6546:
----------------------------------
So I was able to "set a breakpoint" at the failure to pause the upgrade test
run (so the cluster doesn't get destroyed when a test case failed, but the
services are still running) after maybe 20 runs 5 min each, then I tried the
exact same {{key put}} command in scm container (after ~9 hours. I let it run
overnight). And it looks like SCM fixes itself during the time (but I have not
found any signs in the SCM logs that this pipeline map is fixed, maybe in the
trace/debug log).
The manual key put is successful. I also can't find the problemsome pipeline ID
in Recon Web UI.
{code:title=Time in UTC. SCM throwing PipelineNotFoundException from
PipelineStateMap.getPipeline, the proof of a repro}
scm_1 | 2022-04-14 06:15:33,270 [IPC Server handler 0 on default port 9863]
WARN ha.SequenceIdGenerator: Failed to allocate a batch for containerId,
expected lastId is 0, actual lastId is 1.
scm_1 | 2022-04-14 06:15:33,271 [IPC Server handler 0 on default port 9863]
INFO ha.SequenceIdGenerator: Allocate a batch for containerId, change lastId
from 1 to 1001.
scm_1 | 2022-04-14 06:15:33,274 [IPC Server handler 0 on default port 9863]
ERROR block.BlockManagerImpl: Pipeline Machine count is zero.
scm_1 | org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException:
PipelineID=10d41c84-82d8-4973-b57a-7002a391f096 not found
scm_1 | at
org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:157)
scm_1 | at
org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerImpl.getPipeline(PipelineStateManagerImpl.java:137)
scm_1 | at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown
Source)
scm_1 | at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
scm_1 | at java.base/java.lang.reflect.Method.invoke(Method.java:566)
scm_1 | at
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeLocal(SCMHAInvocationHandler.java:83)
scm_1 | at
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:68)
scm_1 | at com.sun.proxy.$Proxy16.getPipeline(Unknown Source)
scm_1 | at
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.getPipeline(PipelineManagerImpl.java:212)
scm_1 | at
org.apache.hadoop.hdds.scm.block.BlockManagerImpl.newBlock(BlockManagerImpl.java:200)
scm_1 | at
org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:180)
scm_1 | at
org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
scm_1 | at
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:180)
scm_1 | at
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:130)
scm_1 | at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
scm_1 | at
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:112)
scm_1 | at
org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:13937)
scm_1 | at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:466)
scm_1 | at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
scm_1 | at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
scm_1 | at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
scm_1 | at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
scm_1 | at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
scm_1 | at java.base/java.security.AccessController.doPrivileged(Native
Method)
scm_1 | at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
scm_1 | at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
scm_1 | at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
scm_1 | 2022-04-14 06:18:28,187 [ReplicationMonitor] INFO
container.ReplicationManager: Replication Manager is not ready to run until
300000ms after safemode exit
{code}
{code:title=Attempting a manual repro but the pipeline seems to be fixed}
$ pwd
${HOME}/repo/ozone/hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/upgrade
$ docker-compose -f compose/ha/docker-compose.yaml exec scm bash
bash-4.2$ ozone sh bucket list /new2-volume/
[ {
"metadata" : { },
"volumeName" : "new2-volume",
"name" : "new2-bucket",
"storageType" : "DISK",
"versioning" : false,
"usedBytes" : 0,
"usedNamespace" : 0,
"creationTime" : "2022-04-14T06:15:32.154Z",
"modificationTime" : "2022-04-14T06:15:32.154Z",
"quotaInBytes" : -1,
"quotaInNamespace" : -1,
"bucketLayout" : "OBJECT_STORE",
"owner" : "hadoop",
"link" : false
} ]
bash-4.2$ ozone sh key list /new2-volume/new2-bucket/
[ ]
bash-4.2$ ozone sh key put /new2-volume/new2-bucket/new2-key
/opt/hadoop/NOTICE.txt
bash-4.2$ ozone sh key list /new2-volume/new2-bucket/
[ {
"volumeName" : "new2-volume",
"bucketName" : "new2-bucket",
"name" : "new2-key",
"dataSize" : 17526,
"creationTime" : "2022-04-14T16:17:27.052Z",
"modificationTime" : "2022-04-14T16:17:28.217Z",
"replicationConfig" : {
"replicationFactor" : "THREE",
"requiredNodes" : 3,
"replicationType" : "RATIS"
},
"replicationFactor" : 3,
"replicationType" : "RATIS"
} ]
{code}
> SCM attempts to allocate block in a CLOSED pipeline after upgrade finalization
> ------------------------------------------------------------------------------
>
> Key: HDDS-6546
> URL: https://issues.apache.org/jira/browse/HDDS-6546
> Project: Apache Ozone
> Issue Type: Bug
> Components: test
> Reporter: Siyao Meng
> Priority: Critical
> Attachments: acceptance-misc-run-6012168314.zip
>
>
> Found in upgrade acceptance test (in GH action misc), which is especially
> flaky recently:
> {code}
> 15:11:05.927 INFO Running command 'ozone sh key put
> /new2-volume/new2-bucket/new2-key /opt/hadoop/NOTICE.txt 2>&1'.
> 15:11:08.856 INFO ${rc} = 255
> 15:11:08.856 INFO ${output} = INTERNAL_ERROR Allocated 0 blocks.
> Requested 1 blocks
> {code}
> from https://github.com/apache/ozone/pull/3199#issuecomment-1081394937
> -Goal: Check and wait for SCM/DN readiness before creating key.-
> As [~erose] and I dug into the issue it seems there are some pipelineMap
> inconsistencies happening on the SCM side, where a pipeline ID is retrieved
> from {{query2OpenPipelines}} when SCM is trying to get an OPEN pipeline but
> is already removed from {{pipelineMap}} in {{PipelineStateMap}} as it has
> already been CLOSED previously:
> {code}
> scm_1 | 2022-03-23 14:56:32,154 [IPC Server handler 90 on default port
> 9863] ERROR block.BlockManagerImpl: Pipeline Machine count is zero.
> scm_1 | org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException:
> PipelineID=8515aa81-2361-482a-82a8-bc5b5340dc23 not found
> scm_1 | at
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:157)
> scm_1 | at
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerImpl.getPipeline(PipelineStateManagerImpl.java:137)
> scm_1 | at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown
> Source)
> scm_1 | at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> scm_1 | at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> scm_1 | at
> org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeLocal(SCMHAInvocationHandler.java:83)
> scm_1 | at
> org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:68)
> scm_1 | at com.sun.proxy.$Proxy16.getPipeline(Unknown Source)
> scm_1 | at
> org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.getPipeline(PipelineManagerImpl.java:212)
> scm_1 | at
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.newBlock(BlockManagerImpl.java:200)
> scm_1 | at
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:180)
> scm_1 | at
> org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
> scm_1 | at
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:180)
> scm_1 | at
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:130)
> scm_1 | at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> scm_1 | at
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:112)
> scm_1 | at
> org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:13937)
> scm_1 | at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:466)
> scm_1 | at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
> scm_1 | at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
> scm_1 | at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
> scm_1 | at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
> scm_1 | at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
> scm_1 | at java.base/java.security.AccessController.doPrivileged(Native
> Method)
> scm_1 | at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> scm_1 | at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
> scm_1 | at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
> {code}
> while the pipeline {{8515aa81-2361-482a-82a8-bc5b5340dc23}} has obviously
> been CLOSED earlier:
> {code}
> scm_1 | 2022-03-23 14:53:51,127 [IPC Server handler 0 on default port
> 9860] INFO pipeline.PipelineManagerImpl: Pipeline Pipeline[ Id:
> 630ef3dc-4fdf-4d5d-a72e-1f5feba433fa, Nodes:
> c6fe6b6e-0b93-4766-9b16-2d622849069e{ip: 10.9.0.15, host: ha_dn1_1.ha_net,
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856,
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null,
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0},
> ReplicationConfig: RATIS/ONE, State:OPEN,
> leaderId:c6fe6b6e-0b93-4766-9b16-2d622849069e,
> CreationTimestamp2022-03-23T14:53:12.187247Z[UTC]] moved to CLOSED state
> scm_1 | 2022-03-23 14:53:51,127 [IPC Server handler 0 on default port
> 9860] INFO pipeline.PipelineManagerImpl: Container #1 closed for
> pipeline=PipelineID=8515aa81-2361-482a-82a8-bc5b5340dc23
> scm_1 | 2022-03-23 14:53:51,128 [IPC Server handler 0 on default port
> 9860] INFO pipeline.PipelineManagerImpl: Container #3 closed for
> pipeline=PipelineID=8515aa81-2361-482a-82a8-bc5b5340dc23
> scm_1 | 2022-03-23 14:53:51,131 [IPC Server handler 0 on default port
> 9860] INFO pipeline.PipelineManagerImpl: Pipeline Pipeline[ Id:
> 8515aa81-2361-482a-82a8-bc5b5340dc23, Nodes:
> 1caafab5-8815-4e66-92ed-d7e53d381e4b{ip: 10.9.0.16, host: ha_dn2_1.ha_net,
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856,
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null,
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec:
> 0}45e92118-31a9-4b6f-9e18-0a4abff7f9e5{ip: 10.9.0.17, host: ha_dn3_1.ha_net,
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856,
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null,
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec:
> 0}c6fe6b6e-0b93-4766-9b16-2d622849069e{ip: 10.9.0.15, host: ha_dn1_1.ha_net,
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856,
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null,
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0},
> ReplicationConfig: RATIS/THREE, State:OPEN,
> leaderId:1caafab5-8815-4e66-92ed-d7e53d381e4b,
> CreationTimestamp2022-03-23T14:53:12.187694Z[UTC]] moved to CLOSED state
> {code}
> from
> https://github.com/elek/ozone-build-results/blob/master/2022/03/23/13958/acceptance-misc/upgrade/1.1.0-1.2.0/docker-1.2.0-finalized.log#L3539-L3567
> And it seems that the issue dates back to at least Nov-Dec last year (2021):
> https://github.com/elek/ozone-build-results/blob/master/2021/12/08/11982/acceptance-misc/docker-1.2.0-finalized.log#L3269
> Potentially more (and earlier) failures can be found in
> https://elek.github.io/ozone-build-results/
> Click into {{acceptance (misc)}} failures and there is a chance that you see
> the same {{PipelineNotFoundException}} inside
> {{./upgrade/1.1.0-1.2.0/docker-1.2.0-finalized.log}}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]