GlenGeng commented on pull request #2295:
URL: https://github.com/apache/ozone/pull/2295#issuecomment-852035731
Hey guys, in our production cluster, we meet the similar contention problem,
and the consequence is worse: all SCMs crashed at the same time, and can not be
restarted again.
The pipeline is `fa2906e2-d989-402f-8148-d9bef3ac5258`,
1, pipeline is created at `14:38:06`
> 2021-05-30 14:38:06,651 [RatisPipelineUtilsThread - 0] INFO
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending
CreatePipelineCommand for
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 to
datanode:71935d59-fadc-4951-a08f-035450e70737
2021-05-30 14:38:06,651 [RatisPipelineUtilsThread - 0] INFO
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending
CreatePipelineCommand for
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 to
datanode:0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7
2021-05-30 14:38:06,651 [RatisPipelineUtilsThread - 0] INFO
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending
CreatePipelineCommand for
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 to
datanode:160ad09d-73b5-4392-b08b-3b93befd71d2
2021-05-30 14:38:06,653
[9e77f811-8df6-4a59-9642-0f40d6f01764@group-58D600BFB107-StateMachineUpdater]
INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Created pipeline
Pipeline[ Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes:
71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports:
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec:
0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139,
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack561349, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec:
0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51,
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack5827
88, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}, Type:RATIS, Factor:THREE, State:ALLOCATED,
leaderId:, CreationTimestamp2021-05-30T06:38:06.651Z].
2, pipeline action and pipeline scrubber is triggered at the same time, at
`14:44:06`, and the pipeline is moved to CLOSED state twice, and removed by one
of them.
> 2021-05-30 14:44:06,650
[EventQueue-PipelineActionsForPipelineActionHandler] INFO
org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline
action CLOSE for PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 from datanode
0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7. Reason :
0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7 is in candidate state for 304440ms
2021-05-30 14:44:06,654 [RatisPipelineUtilsThread - 0] INFO
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerV2Impl: Scrubbing pipeline:
id: PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 since it stays at ALLOCATED
stage for -6 mins.
2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler]
INFO org.apache.hadoop.hdds.scm.pipeline.PipelineManagerV2Impl: Pipeline
Pipeline[ Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes:
71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports:
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec:
0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139,
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack561349, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec:
0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51,
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, persiste
dOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, Type:RATIS,
Factor:THREE, State:ALLOCATED, leaderId:,
CreationTimestamp2021-05-30T06:38:06.651Z] moved to CLOSED state
2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler]
INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to
datanode 71935d59-fadc-4951-a08f-035450e70737
2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler]
INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to
datanode 0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7
2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler]
INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to
datanode 160ad09d-73b5-4392-b08b-3b93befd71d2
2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerV2Impl: Pipeline Pipeline[
Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes:
71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports:
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec:
0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139,
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack561349, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec:
0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51,
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null,
persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}, Type:RATIS, Factor:THREE, State:ALLOCATED,
leaderId:, CreationTimestamp2021-05-30T06:38:06.651Z] moved to CLOSED state
2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to
datanode 71935d59-fadc-4951-a08f-035450e70737
2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to
datanode 0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7
2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to
datanode 160ad09d-73b5-4392-b08b-3b93befd71d2
2021-05-30 14:44:06,659
[9e77f811-8df6-4a59-9642-0f40d6f01764@group-58D600BFB107-StateMachineUpdater]
INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline
Pipeline[ Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes:
71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports:
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec:
0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139,
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack561349, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec:
0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51,
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858,
STANDALONE=9859], networkLocation: /rack582788, cert
SerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec:
0}, Type:RATIS, Factor:THREE, State:CLOSED, leaderId:,
CreationTimestamp2021-05-30T06:38:06.651Z] removed.
3, when the second removePipeline is applied by SCMStateMachine, the
PipelineNotFoundException crashed the SCM, and SCM refuses to start again.
> 2021-05-30 14:44:06,659
[EventQueue-PipelineActionsForPipelineActionHandler] INFO
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler: Invoking method public
abstract void
org.apache.hadoop.hdds.scm.pipeline.StateManager.removePipeline(org.apache.hadoop.hdds.protocol.proto.HddsProtos$PipelineID)
throws java.io.IOException on target
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl@335aa187, cost 1709.382us
2021-05-30 14:44:06,662
[9e77f811-8df6-4a59-9642-0f40d6f01764@group-58D600BFB107-StateMachineUpdater]
ERROR org.apache.ratis.statemachine.StateMachine: Terminating with exit status
1: PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 not found
org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException:
PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 not found
at
org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:131)
at
org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.removePipeline(PipelineStateMap.java:343)
at
org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerV2Impl.removePipeline(PipelineStateManagerV2Impl.java:228)
at sun.reflect.GeneratedMethodAccessor132.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:155)
at
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:132)
at
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1661)
at
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:234)
at
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:179)
at java.lang.Thread.run(Thread.java:748)
2021-05-30 14:44:06,664 [shutdown-hook-0] INFO
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down StorageContainerManager at
tdw-11-0-119-77/11.0.119.77
************************************************************/
I was planing to create a jira for the contention issue, seems your guys
meet it as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]