[ https://issues.apache.org/jira/browse/HDDS-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522908#comment-17522908 ]

Siyao Meng commented on HDDS-6546:
----------------------------------

Back to the CLOSED-pipeline {{allocateBlock}} issue: so far it appears to be a 
benign race condition that shouldn't have serious consequences.

The CLOSED pipeline sometimes remains in the {{query2OpenPipelines}} map right 
after upgrade finalization and is erroneously retrieved during 
{{allocateBlock}}, causing {{PipelineNotFoundException}} on the SCM side and 
{{INTERNAL_ERROR Allocated 0 blocks. Requested 1 blocks}} on the client side. 
The stale entry is seemingly cleaned up by 
{{ReplicationMonitor}}/{{ReplicationManager}} shortly after safemode exit (5 
min):

{code:title=A second repro on my Linux x86_64 box, after another ~80 runs (~7 hours)}
scm_1    | 2022-04-15 11:44:02,526 [IPC Server handler 2 on default port 9860] 
INFO upgrade.UpgradeFinalizer: Open pipeline found after SCM finalization
scm_1    | 2022-04-15 11:44:02,526 [IPC Server handler 2 on default port 9860] 
INFO upgrade.UpgradeFinalizer: Finalization is done.
scm_1    | 2022-04-15 11:44:02,526 [IPC Server handler 2 on default port 9860] 
INFO upgrade.UpgradeFinalizer: Finalization is done.
scm_1    | 2022-04-15 11:44:06,868 
[EventQueue-PipelineReportForPipelineReportHandler] INFO 
pipeline.PipelineManagerImpl: Pipeline Pipeline[ Id: 
562f32c1-7377-4e73-987e-61de0d68b16a, Nodes: 724eac09-3e86-4ab6-b7ee-420ba505ea84{ip: 10.9.0.16, host: ha_dn2_1.ha_net, ports: [REPLICATION=9886, 
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
networkLocation: /default-rack, certSerialId: null, persistedOpState:
IN_SERVICE, persistedOpStateExpiryEpochSec: 
0}f33c42e3-1206-4f40-a3a3-9b8b2e4fb6f4{ip: 10.9.0.17, host: ha_dn3_1.ha_net, 
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
STANDALONE=9859],
networkLocation: /default-rack, certSerialId: null, persistedOpState: 
IN_SERVICE, persistedOpStateExpiryEpochSec: 
0}c4ea4688-0390-4b98-8740-f9e3e25a30ce{ip: 10.9.0.15, host: ha_dn1_1.ha_net, 
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
networkLocation: /default-rack, certSerialId: null, persistedOpState: 
IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: RATIS/THREE, State:ALLOCATED, leaderId:c4ea4688-0390-4b98-8740-f9e3e25a30ce, 
CreationTimestamp2022-04-15T11:43:22.523Z[UTC]] moved to OPEN state
scm_1    | 2022-04-15 11:44:12,089 [SCMBlockDeletingService#0] INFO 
block.SCMBlockDeletingService: Totally added 3 blocks to be deleted for 3 
datanodes, task elapsed time: 1ms
scm_1    | 2022-04-15 11:44:37,677 [IPC Server handler 13 on default port 9863] 
WARN ha.SequenceIdGenerator: Failed to allocate a batch for containerId, 
expected lastId is 0, actual lastId is 1.
scm_1    | 2022-04-15 11:44:37,677 [IPC Server handler 13 on default port 9863] 
INFO ha.SequenceIdGenerator: Allocate a batch for containerId, change lastId 
from 1 to 1001.
scm_1    | 2022-04-15 11:44:37,682 [IPC Server handler 13 on default port 9863] 
ERROR block.BlockManagerImpl: Pipeline Machine count is zero.
scm_1    | org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: 
PipelineID=d4e631f8-0594-49c2-8259-e9324f8c2637 not found
scm_1    |      at 
org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:157)
scm_1    |      at 
org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerImpl.getPipeline(PipelineStateManagerImpl.java:137)
scm_1    |      at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown 
Source)
scm_1    |      at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
scm_1    |      at java.base/java.lang.reflect.Method.invoke(Method.java:566)
scm_1    |      at 
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeLocal(SCMHAInvocationHandler.java:83)
scm_1    |      at 
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:68)
scm_1    |      at com.sun.proxy.$Proxy16.getPipeline(Unknown Source)
scm_1    |      at 
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.getPipeline(PipelineManagerImpl.java:212)
scm_1    |      at 
org.apache.hadoop.hdds.scm.block.BlockManagerImpl.newBlock(BlockManagerImpl.java:200)
scm_1    |      at 
org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:180)
scm_1    |      at 
org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
scm_1    |      at 
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:180)
scm_1    |      at 
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:130)
scm_1    |      at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
scm_1    |      at 
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:112)
scm_1    |      at 
org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:13937)
scm_1    |      at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:466)
scm_1    |      at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
scm_1    |      at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
scm_1    |      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
scm_1    |      at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
scm_1    |      at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
scm_1    |      at java.base/java.security.AccessController.doPrivileged(Native 
Method)
scm_1    |      at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
scm_1    |      at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
scm_1    |      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
scm_1    | 2022-04-15 11:47:10,652 [ReplicationMonitor] INFO 
container.ReplicationManager: Replication Manager is not ready to run until 
300000ms after safemode exit
scm_1    | 2022-04-15 11:52:10,655 [ReplicationMonitor] INFO 
container.ReplicationManager: Sending close container command for container #1 
to datanode 724eac09-3e86-4ab6-b7ee-420ba505ea84{ip: 10.9.0.16, host: ha_dn2_1.ha_net, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}.
scm_1    | 2022-04-15 11:52:10,655 [ReplicationMonitor] INFO 
container.ReplicationManager: Sending close container command for container #3 
to datanode 724eac09-3e86-4ab6-b7ee-420ba505ea84{ip: 10.9.0.16, host: ha_dn2_1.ha_net, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}.
scm_1    | 2022-04-15 11:52:10,655 [ReplicationMonitor] INFO 
container.ReplicationManager: Replication Monitor Thread took 4 milliseconds 
for processing 3 containers.
scm_1    | 2022-04-15 11:57:10,656 [ReplicationMonitor] INFO 
container.ReplicationManager: Replication Monitor Thread took 0 milliseconds 
for processing 3 containers.
{code}
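For illustration, the suspected inconsistency can be sketched as a primary map and a derived "open pipelines" index getting out of sync. This is a minimal hypothetical sketch, not the actual {{PipelineStateMap}} code; all class, field, and method names below are made up for the illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the suspected race: a pipeline is removed from the
// primary map, but a stale entry survives in the derived "open pipelines"
// index, so a later lookup through the index throws.
public class StalePipelineSketch {
  static class PipelineNotFoundException extends RuntimeException {
    PipelineNotFoundException(String id) {
      super("PipelineID=" + id + " not found");
    }
  }

  // Primary source of truth (analogous to pipelineMap in PipelineStateMap).
  private final Map<String, String> pipelineMap = new ConcurrentHashMap<>();
  // Derived index of OPEN pipelines (analogous to query2OpenPipelines).
  private final List<String> openPipelines = new ArrayList<>();

  void addOpenPipeline(String id) {
    pipelineMap.put(id, "OPEN");
    openPipelines.add(id);
  }

  // Buggy close path: removes the pipeline from the primary map but forgets
  // to also remove it from the derived index.
  void closePipelineForgettingIndex(String id) {
    pipelineMap.remove(id);
    // BUG: openPipelines.remove(id) is missing, leaving a stale entry.
  }

  // Allocation path: picks an "open" pipeline from the index, then resolves
  // it in the primary map, which is where the exception surfaces.
  String allocateBlock() {
    String id = openPipelines.get(0); // stale ID retrieved here
    String state = pipelineMap.get(id);
    if (state == null) {
      throw new PipelineNotFoundException(id); // matches the SCM log above
    }
    return id;
  }

  public static void main(String[] args) {
    StalePipelineSketch scm = new StalePipelineSketch();
    scm.addOpenPipeline("8515aa81-2361-482a-82a8-bc5b5340dc23");
    scm.closePipelineForgettingIndex("8515aa81-2361-482a-82a8-bc5b5340dc23");
    try {
      scm.allocateBlock();
    } catch (PipelineNotFoundException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In this toy version the fix is obvious (remove from both structures under one lock); in SCM the open question is which code path leaves the stale entry behind after finalization.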

At this moment I cannot exclude the possibility that [the procedures in the 
{{SCMUpgradeFinalizer}} 
class|https://github.com/apache/ozone/blob/d6e0e4747f09ce5c04bcd2ac42cf9ce3e73198db/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java#L89]
 cause this temporary glitch, one reason being that the error is always 
observed after {{--- RUNNING WITH NEW VERSION 1.2.0 FINALIZED ---}}.

On a side note, interestingly it seems much easier to repro this on GitHub 
Actions (triggered in roughly 1 in 3 to 1 in 5 runs) than on my local Linux 
box (1 in 20 to 1 in 80). Perhaps a slower machine, a different distro, or an 
older Linux kernel also helps trigger the race condition.

> SCM attempts to allocate block in a CLOSED pipeline after upgrade finalization
> ------------------------------------------------------------------------------
>
>                 Key: HDDS-6546
>                 URL: https://issues.apache.org/jira/browse/HDDS-6546
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: test
>            Reporter: Siyao Meng
>            Priority: Critical
>         Attachments: acceptance-misc-run-6012168314.zip, 
> potential_bug2_recon_1.png, potential_bug2_recon_2.png, 
> potential_bug2_scm_1.log, potential_bug2_scmwebui_1.png
>
>
> Found in the upgrade acceptance test (in the GH Actions misc job), which has 
> been especially flaky recently:
> {code}
> 15:11:05.927  INFO    Running command 'ozone sh key put 
> /new2-volume/new2-bucket/new2-key /opt/hadoop/NOTICE.txt 2>&1'.       
> 15:11:08.856  INFO    ${rc} = 255     
> 15:11:08.856  INFO    ${output} = INTERNAL_ERROR Allocated 0 blocks. 
> Requested 1 blocks       
> {code}
> from https://github.com/apache/ozone/pull/3199#issuecomment-1081394937
> -Goal: Check and wait for SCM/DN readiness before creating key.-
> As [~erose] and I dug into the issue, it seems there are some pipelineMap 
> inconsistencies on the SCM side: a pipeline ID is retrieved from 
> {{query2OpenPipelines}} when SCM tries to get an OPEN pipeline, but it has 
> already been removed from {{pipelineMap}} in {{PipelineStateMap}} because 
> the pipeline was CLOSED earlier:
> {code}
> scm_1    | 2022-03-23 14:56:32,154 [IPC Server handler 90 on default port 
> 9863] ERROR block.BlockManagerImpl: Pipeline Machine count is zero.
> scm_1    | org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: 
> PipelineID=8515aa81-2361-482a-82a8-bc5b5340dc23 not found
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:157)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerImpl.getPipeline(PipelineStateManagerImpl.java:137)
> scm_1    |    at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown 
> Source)
> scm_1    |    at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> scm_1    |    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeLocal(SCMHAInvocationHandler.java:83)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:68)
> scm_1    |    at com.sun.proxy.$Proxy16.getPipeline(Unknown Source)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.getPipeline(PipelineManagerImpl.java:212)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.newBlock(BlockManagerImpl.java:200)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:180)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:180)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:130)
> scm_1    |    at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> scm_1    |    at 
> org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:112)
> scm_1    |    at 
> org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:13937)
> scm_1    |    at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:466)
> scm_1    |    at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
> scm_1    |    at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
> scm_1    |    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
> scm_1    |    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
> scm_1    |    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
> scm_1    |    at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
> scm_1    |    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> scm_1    |    at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
> scm_1    |    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
> {code}
> while the pipeline {{8515aa81-2361-482a-82a8-bc5b5340dc23}} has obviously 
> been CLOSED earlier:
> {code}
> scm_1    | 2022-03-23 14:53:51,127 [IPC Server handler 0 on default port 
> 9860] INFO pipeline.PipelineManagerImpl: Pipeline Pipeline[ Id: 
> 630ef3dc-4fdf-4d5d-a72e-1f5feba433fa, Nodes: 
> c6fe6b6e-0b93-4766-9b16-2d622849069e{ip: 10.9.0.15, host: ha_dn1_1.ha_net, 
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, 
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, 
> ReplicationConfig: RATIS/ONE, State:OPEN, 
> leaderId:c6fe6b6e-0b93-4766-9b16-2d622849069e, 
> CreationTimestamp2022-03-23T14:53:12.187247Z[UTC]] moved to CLOSED state
> scm_1    | 2022-03-23 14:53:51,127 [IPC Server handler 0 on default port 
> 9860] INFO pipeline.PipelineManagerImpl: Container #1 closed for 
> pipeline=PipelineID=8515aa81-2361-482a-82a8-bc5b5340dc23
> scm_1    | 2022-03-23 14:53:51,128 [IPC Server handler 0 on default port 
> 9860] INFO pipeline.PipelineManagerImpl: Container #3 closed for 
> pipeline=PipelineID=8515aa81-2361-482a-82a8-bc5b5340dc23
> scm_1    | 2022-03-23 14:53:51,131 [IPC Server handler 0 on default port 
> 9860] INFO pipeline.PipelineManagerImpl: Pipeline Pipeline[ Id: 
> 8515aa81-2361-482a-82a8-bc5b5340dc23, Nodes: 
> 1caafab5-8815-4e66-92ed-d7e53d381e4b{ip: 10.9.0.16, host: ha_dn2_1.ha_net, 
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, 
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
> 0}45e92118-31a9-4b6f-9e18-0a4abff7f9e5{ip: 10.9.0.17, host: ha_dn3_1.ha_net, 
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, 
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
> 0}c6fe6b6e-0b93-4766-9b16-2d622849069e{ip: 10.9.0.15, host: ha_dn1_1.ha_net, 
> ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
> STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, 
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, 
> ReplicationConfig: RATIS/THREE, State:OPEN, 
> leaderId:1caafab5-8815-4e66-92ed-d7e53d381e4b, 
> CreationTimestamp2022-03-23T14:53:12.187694Z[UTC]] moved to CLOSED state
> {code}
> from 
> https://github.com/elek/ozone-build-results/blob/master/2022/03/23/13958/acceptance-misc/upgrade/1.1.0-1.2.0/docker-1.2.0-finalized.log#L3539-L3567
> And it seems that the issue dates back to at least November-December 2021:
> https://github.com/elek/ozone-build-results/blob/master/2021/12/08/11982/acceptance-misc/docker-1.2.0-finalized.log#L3269
> Potentially more (and earlier) failures can be found in 
> https://elek.github.io/ozone-build-results/
> Click into {{acceptance (misc)}} failures and there is a chance you will see 
> the same {{PipelineNotFoundException}} in 
> {{./upgrade/1.1.0-1.2.0/docker-1.2.0-finalized.log}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
