[ 
https://issues.apache.org/jira/browse/HDDS-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522516#comment-17522516
 ] 

Siyao Meng edited comment on HDDS-6546 at 4/14/22 11:39 PM:
------------------------------------------------------------

Oh. It looks like the (potential) second bug in the comment above is caused by SCM 
somehow failing to finalize the upgrade, leaving it [stuck with pipeline 
creation 
frozen|https://github.com/apache/ozone/blob/d6e0e4747f09ce5c04bcd2ac42cf9ce3e73198db/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java#L64-L67].
 How strange:

{code:java|title=SCMUpgradeFinalizer#preFinalizeUpgrade}
    // Pipeline creation will remain frozen until postFinalizeUpgrade()
    pipelineManager.freezePipelineCreation();

    waitForAllPipelinesToDestroy(pipelineManager);
{code}
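Since {{waitForAllPipelinesToDestroy}} runs only after creation has been frozen, a single pipeline that is never destroyed would leave finalization blocked forever, with creation still frozen. A minimal self-contained model of that freeze-then-wait pattern (hypothetical names; the real {{SCMUpgradeFinalizer}} is more involved):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of SCM's freeze-then-wait finalization step (hypothetical names). */
class PipelineFreezeModel {
    private final AtomicBoolean creationFrozen = new AtomicBoolean(false);
    private final Set<String> pipelines = ConcurrentHashMap.newKeySet();

    void addPipeline(String id) {
        if (creationFrozen.get()) {
            throw new IllegalStateException("Pipeline creation is frozen");
        }
        pipelines.add(id);
    }

    void destroyPipeline(String id) {
        pipelines.remove(id);
    }

    /**
     * Mirrors preFinalizeUpgrade: freeze creation, then wait for all
     * pipelines to drain. Returns false on timeout, i.e. the "stuck" case.
     */
    boolean preFinalize(long timeoutMillis) throws InterruptedException {
        creationFrozen.set(true);
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!pipelines.isEmpty()) {
            if (System.nanoTime() > deadline) {
                return false; // some pipeline never got destroyed; creation stays frozen
            }
            Thread.sleep(10);
        }
        return true;
    }

    /** Mirrors postFinalizeUpgrade: only reached if the wait completed. */
    void postFinalize() {
        creationFrozen.set(false);
    }

    boolean isFrozen() {
        return creationFrozen.get();
    }
}
```

If the wait never returns, {{postFinalize}} is never reached, which would explain an SCM that keeps rejecting pipeline creation after a supposedly finished finalization.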

And now it is reliably reproducible on my Linux box (Arch Linux, x86_64, kernel 
5.17.2-arch3-1, docker 1:20.10.14-1, containerd 1.6.2-1, docker-compose 1.29.2 
build 5becea4c), even after I manually moved the data directory away 
({{./upgrades/non-rolling-upgrade/1.1.0-1.2.0/data}}) and restarted the test. Hmm

UPDATE: It turns out that once I comment out these two trace log lines in 
./ha/docker-compose.yaml , the problem goes away. There is definitely something 
wrong on the trace logging side. Potentially the trace logging path has some 
side effect that BREAKS the upgrade finalization path. (Or there could be 
something wrong with my Arch Linux box, but that is very unlikely. It can be 
disproven if someone else runs my [debug 
branch|https://github.com/smengcl/hadoop-ozone/commits/HDDS-6546-debug2])

{code:yaml|title=trace logger wreaking havoc?}
#      OZONE_OPTS: -Dhadoop.root.logger=TRACE,console
#      OZONE_DAEMON_ROOT_LOGGER: TRACE,RFA
{code}

Also, the CLI reports {{FINALIZATION_DONE}}:

{code}
bash-4.2$ ozone admin scm finalizationstatus
FINALIZATION_DONE
bash-4.2$ ozone sh key put /new2-volume/new2-bucket/new2-key 
/opt/hadoop/NOTICE.txt
INTERNAL_ERROR Allocated 0 blocks. Requested 1 blocks
bash-4.2$ ozone admin scm finalizationstatus
FINALIZATION_DONE
bash-4.2$ ozone admin scm finalizeupgrade
Upgrade has already been finalized.
Exiting...
bash-4.2$ ozone sh key put /new2-volume/new2-bucket/new2-key 
/opt/hadoop/NOTICE.txt
INTERNAL_ERROR Allocated 0 blocks. Requested 1 blocks
{code}
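The transcript above matches what would happen if no OPEN pipeline survives finalization: the allocator finds zero OPEN pipelines and returns fewer blocks than requested. A minimal sketch of that shape (hypothetical names; not the actual {{BlockManagerImpl}} logic):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch (hypothetical names) of why allocation can yield 0 of 1 blocks. */
class BlockAllocationSketch {
    enum State { OPEN, CLOSED }

    private final Map<String, State> pipelineMap = new HashMap<>();

    void putPipeline(String id, State state) {
        pipelineMap.put(id, state);
    }

    /** Returns one placeholder per requested block that found an OPEN pipeline. */
    List<String> allocateBlocks(int requested) {
        List<String> allocated = new ArrayList<>();
        for (int i = 0; i < requested; i++) {
            // Pick any OPEN pipeline. If every pipeline is CLOSED and creation
            // is still frozen, none exists and nothing gets allocated.
            pipelineMap.entrySet().stream()
                .filter(e -> e.getValue() == State.OPEN)
                .findFirst()
                .ifPresent(e -> allocated.add("block-on-" + e.getKey()));
        }
        if (allocated.size() < requested) {
            // Mirrors the shape of the observed error message.
            System.out.println("INTERNAL_ERROR Allocated " + allocated.size()
                + " blocks. Requested " + requested + " blocks");
        }
        return allocated;
    }
}
```

So even though the CLI says {{FINALIZATION_DONE}}, every {{ozone sh key put}} fails until an OPEN pipeline exists again.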

My debug2 branch, which includes the docker-compose YAML and script 
modifications to stop cluster destruction on failure (it works locally, but I am 
not sure it works on GitHub Actions): https://github.com/smengcl/hadoop-ozone/commits/HDDS-6546-debug2

To run it:

{code}
mvn clean install -DskipTests -e -Dmaven.javadoc.skip=true -DskipShade
cd hadoop-ozone/dist/target/ozone-*/compose/upgrade

while true; do ./test.sh; sleep 5; done
{code}

Then it should pause when the test fails.


> SCM attempts to allocate block in a CLOSED pipeline after upgrade finalization
> ------------------------------------------------------------------------------
>
>                 Key: HDDS-6546
>                 URL: https://issues.apache.org/jira/browse/HDDS-6546
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: test
>            Reporter: Siyao Meng
>            Priority: Critical
>         Attachments: acceptance-misc-run-6012168314.zip, 
> potential_bug2_recon_1.png, potential_bug2_recon_2.png, 
> potential_bug2_scm_1.log, potential_bug2_scmwebui_1.png
>
>
> Found in the upgrade acceptance test (in GH action misc), which has been 
> especially flaky recently:
> {code}
> 15:11:05.927  INFO    Running command 'ozone sh key put 
> /new2-volume/new2-bucket/new2-key /opt/hadoop/NOTICE.txt 2>&1'.       
> 15:11:08.856  INFO    ${rc} = 255     
> 15:11:08.856  INFO    ${output} = INTERNAL_ERROR Allocated 0 blocks. 
> Requested 1 blocks       
> {code}
> from https://github.com/apache/ozone/pull/3199#issuecomment-1081394937
> -Goal: Check and wait for SCM/DN readiness before creating key.-
> As [~erose] and I dug into the issue, it seems there are some pipelineMap 
> inconsistencies on the SCM side: a pipeline ID is returned by 
> {{query2OpenPipelines}} when SCM tries to get an OPEN pipeline, but that 
> pipeline has already been removed from {{pipelineMap}} in {{PipelineStateMap}} 
> because it was CLOSED earlier:
> {code}
> scm_1    | 2022-03-23 14:56:32,154 [IPC Server handler 90 on default port 
> 9863] ERROR block.BlockManagerImpl: Pipeline Machine count is zero.
> scm_1    | org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: 
> PipelineID=8515aa81-2361-482a-82a8-bc5b5340dc23 not found
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:157)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerImpl.getPipeline(PipelineStateManagerImpl.java:137)
> scm_1    |    at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown 
> Source)
> scm_1    |    at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> scm_1    |    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeLocal(SCMHAInvocationHandler.java:83)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:68)
> scm_1    |    at com.sun.proxy.$Proxy16.getPipeline(Unknown Source)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.getPipeline(PipelineManagerImpl.java:212)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.newBlock(BlockManagerImpl.java:200)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:180)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:180)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:130)
> scm_1    |    at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:112)
> scm_1    |    at 
> org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:13937)
> scm_1    |    at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:466)
> scm_1    |    at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
> scm_1    |    at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
> scm_1    |    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
> scm_1    |    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
> scm_1    |    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
> scm_1    |    at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
> scm_1    |    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> scm_1    |    at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
> scm_1    |    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
> {code}
> while the pipeline {{8515aa81-2361-482a-82a8-bc5b5340dc23}} has obviously 
> been CLOSED earlier:
> {code}
> scm_1    | 2022-03-23 14:53:51,127 [IPC Server handler 0 on default port 
> 9860] INFO pipeline.PipelineManagerImpl: Pipeline Pipeline[ Id: 
> 630ef3dc-4fdf-4d5d-a72e-1f5feba433fa, Nodes: 
> c6fe6b6e-0b93-4766-9b16-2d622849069e{ip: 10.9.0.15, host: ha_dn1_1.ha_net, 
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, 
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, 
> ReplicationConfig: RATIS/ONE, State:OPEN, 
> leaderId:c6fe6b6e-0b93-4766-9b16-2d622849069e, 
> CreationTimestamp2022-03-23T14:53:12.187247Z[UTC]] moved to CLOSED state
> scm_1    | 2022-03-23 14:53:51,127 [IPC Server handler 0 on default port 
> 9860] INFO pipeline.PipelineManagerImpl: Container #1 closed for 
> pipeline=PipelineID=8515aa81-2361-482a-82a8-bc5b5340dc23
> scm_1    | 2022-03-23 14:53:51,128 [IPC Server handler 0 on default port 
> 9860] INFO pipeline.PipelineManagerImpl: Container #3 closed for 
> pipeline=PipelineID=8515aa81-2361-482a-82a8-bc5b5340dc23
> scm_1    | 2022-03-23 14:53:51,131 [IPC Server handler 0 on default port 
> 9860] INFO pipeline.PipelineManagerImpl: Pipeline Pipeline[ Id: 
> 8515aa81-2361-482a-82a8-bc5b5340dc23, Nodes: 
> 1caafab5-8815-4e66-92ed-d7e53d381e4b{ip: 10.9.0.16, host: ha_dn2_1.ha_net, 
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, 
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
> 0}45e92118-31a9-4b6f-9e18-0a4abff7f9e5{ip: 10.9.0.17, host: ha_dn3_1.ha_net, 
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, 
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
> 0}c6fe6b6e-0b93-4766-9b16-2d622849069e{ip: 10.9.0.15, host: ha_dn1_1.ha_net, 
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, 
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, 
> ReplicationConfig: RATIS/THREE, State:OPEN, 
> leaderId:1caafab5-8815-4e66-92ed-d7e53d381e4b, 
> CreationTimestamp2022-03-23T14:53:12.187694Z[UTC]] moved to CLOSED state
> {code}
> from 
> https://github.com/elek/ozone-build-results/blob/master/2022/03/23/13958/acceptance-misc/upgrade/1.1.0-1.2.0/docker-1.2.0-finalized.log#L3539-L3567
> And it seems that the issue dates back to at least Nov-Dec last year (2021):
> https://github.com/elek/ozone-build-results/blob/master/2021/12/08/11982/acceptance-misc/docker-1.2.0-finalized.log#L3269
> Potentially more (and earlier) failures can be found at 
> https://elek.github.io/ozone-build-results/
> Click into {{acceptance (misc)}} failures; there is a chance you will see 
> the same {{PipelineNotFoundException}} inside 
> {{./upgrade/1.1.0-1.2.0/docker-1.2.0-finalized.log}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
