GlenGeng commented on pull request #2295:
URL: https://github.com/apache/ozone/pull/2295#issuecomment-852035731


   Hey guys, in our production cluster, we meet the similar contention problem, 
and the consequence is worse: all SCMs crashed at the same time, and can not be 
restarted again.
   
   The pipeline is `fa2906e2-d989-402f-8148-d9bef3ac5258`,
   
   1, pipeline is created at `14:38:06`
   
   > 2021-05-30 14:38:06,651 [RatisPipelineUtilsThread - 0] INFO 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending 
CreatePipelineCommand for 
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 to 
datanode:71935d59-fadc-4951-a08f-035450e70737
   2021-05-30 14:38:06,651 [RatisPipelineUtilsThread - 0] INFO 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending 
CreatePipelineCommand for 
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 to 
datanode:0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7
   2021-05-30 14:38:06,651 [RatisPipelineUtilsThread - 0] INFO 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Sending 
CreatePipelineCommand for 
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 to 
datanode:160ad09d-73b5-4392-b08b-3b93befd71d2
   2021-05-30 14:38:06,653 
[9e77f811-8df6-4a59-9642-0f40d6f01764@group-58D600BFB107-StateMachineUpdater] 
INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Created pipeline 
Pipeline[ Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes: 
71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports: 
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139, 
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack561349, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51, 
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack5827
 88, certSerialId: null, persistedOpState: IN_SERVICE, 
persistedOpStateExpiryEpochSec: 0}, Type:RATIS, Factor:THREE, State:ALLOCATED, 
leaderId:, CreationTimestamp2021-05-30T06:38:06.651Z].
   
   
   2, pipeline action and pipeline scrubber is triggered at the same time, at 
`14:44:06`, and the pipeline is moved to CLOSED state twice, and removed by one 
of them.
   
   > 2021-05-30 14:44:06,650 
[EventQueue-PipelineActionsForPipelineActionHandler] INFO 
org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline 
action CLOSE for PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 from datanode 
0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7. Reason : 
0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7 is in candidate state for 304440ms
   2021-05-30 14:44:06,654 [RatisPipelineUtilsThread - 0] INFO 
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerV2Impl: Scrubbing pipeline: 
id: PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 since it stays at ALLOCATED 
stage for -6 mins.
   2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler] 
INFO org.apache.hadoop.hdds.scm.pipeline.PipelineManagerV2Impl: Pipeline 
Pipeline[ Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes: 
71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports: 
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139, 
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack561349, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51, 
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, persiste
 dOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, Type:RATIS, 
Factor:THREE, State:ALLOCATED, leaderId:, 
CreationTimestamp2021-05-30T06:38:06.651Z] moved to CLOSED state
   2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler] 
INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send 
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to 
datanode 71935d59-fadc-4951-a08f-035450e70737
   2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler] 
INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send 
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to 
datanode 0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7
   2021-05-30 14:44:06,654 [EventQueue-PipelineActionsForPipelineActionHandler] 
INFO org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send 
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to 
datanode 160ad09d-73b5-4392-b08b-3b93befd71d2
   2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO 
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerV2Impl: Pipeline Pipeline[ 
Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes: 
71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports: 
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139, 
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack561349, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51, 
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, 
persistedOpState: IN_SERVICE, 
 persistedOpStateExpiryEpochSec: 0}, Type:RATIS, Factor:THREE, State:ALLOCATED, 
leaderId:, CreationTimestamp2021-05-30T06:38:06.651Z] moved to CLOSED state
   2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send 
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to 
datanode 71935d59-fadc-4951-a08f-035450e70737
   2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send 
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to 
datanode 0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7
   2021-05-30 14:44:06,657 [RatisPipelineUtilsThread - 0] INFO 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider: Send 
pipeline:PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 close command to 
datanode 160ad09d-73b5-4392-b08b-3b93befd71d2
   2021-05-30 14:44:06,659 
[9e77f811-8df6-4a59-9642-0f40d6f01764@group-58D600BFB107-StateMachineUpdater] 
INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline 
Pipeline[ Id: fa2906e2-d989-402f-8148-d9bef3ac5258, Nodes: 
71935d59-fadc-4951-a08f-035450e70737{ip: 11.61.45.54, host: 11.61.45.54, ports: 
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack582788, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
0}0a16a9a7-1af0-4fbe-9b32-9e67df46b4c7{ip: 11.26.17.139, host: 11.26.17.139, 
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack561349, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
0}160ad09d-73b5-4392-b08b-3b93befd71d2{ip: 11.61.45.51, host: 11.61.45.51, 
ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9858, RATIS_SERVER=9858, 
STANDALONE=9859], networkLocation: /rack582788, cert
 SerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 
0}, Type:RATIS, Factor:THREE, State:CLOSED, leaderId:, 
CreationTimestamp2021-05-30T06:38:06.651Z] removed.
   
   
   3, when the second removePipeline is applied by SCMStateMachine, the 
PipelineNotFoundException crashed the SCM, and SCM refuses to start again.
   
   > 2021-05-30 14:44:06,659 
[EventQueue-PipelineActionsForPipelineActionHandler] INFO 
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler: Invoking method public 
abstract void 
org.apache.hadoop.hdds.scm.pipeline.StateManager.removePipeline(org.apache.hadoop.hdds.protocol.proto.HddsProtos$PipelineID)
 throws java.io.IOException on target 
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl@335aa187, cost 1709.382us
   2021-05-30 14:44:06,662 
[9e77f811-8df6-4a59-9642-0f40d6f01764@group-58D600BFB107-StateMachineUpdater] 
ERROR org.apache.ratis.statemachine.StateMachine: Terminating with exit status 
1: PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 not found
   org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: 
PipelineID=fa2906e2-d989-402f-8148-d9bef3ac5258 not found
           at 
org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:131)
           at 
org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.removePipeline(PipelineStateMap.java:343)
           at 
org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerV2Impl.removePipeline(PipelineStateManagerV2Impl.java:228)
           at sun.reflect.GeneratedMethodAccessor132.invoke(Unknown Source)
           at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at 
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:155)
           at 
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:132)
           at 
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1661)
           at 
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:234)
           at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:179)
           at java.lang.Thread.run(Thread.java:748)
   2021-05-30 14:44:06,664 [shutdown-hook-0] INFO 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG: 
   /************************************************************
   SHUTDOWN_MSG: Shutting down StorageContainerManager at 
tdw-11-0-119-77/11.0.119.77
   ************************************************************/
   
   
   I was planing to create a jira for the contention issue, seems your guys 
meet it as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to