[ 
https://issues.apache.org/jira/browse/HDDS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-5336:
---------------------------------
    Labels: pull-request-available  (was: )

> Fix datanode capacity related race condition
> --------------------------------------------
>
>                 Key: HDDS-5336
>                 URL: https://issues.apache.org/jira/browse/HDDS-5336
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ethan Rose
>            Assignee: Ethan Rose
>            Priority: Major
>              Labels: pull-request-available
>
> After merging master into the upgrade branch in HDDS-5321, an intermittent 
> failure was noticed in TestSCMNodeManager#testLayoutOnHeartbeat: 
> https://github.com/apache/ozone/runs/2787582345
> The issue occurs in SCMNodeManager#register, which adds the node to the 
> nodeStateManager, firing the NEW_NODE event, before it processes the node 
> report containing storage information for the new node. The event triggers a 
> one-shot run of the background pipeline creator, which reads the node's 
> storage information to determine whether it can hold a pipeline. If the 
> storage report has not yet been processed when this happens, no pipeline is 
> created to use the new node when it is registered, because the node still 
> appears to have no free space.
> Relevant log lines from the test failure:
> {code}
> 2021-06-09 21:04:44,087 [Listener at 0.0.0.0/34005] INFO  
> net.NetworkTopologyImpl (NetworkTopologyImpl.java:add(112)) - Added a new 
> node: /default-rack/b06583c0-2c53-452b-83e4-398ff0104f72
> 2021-06-09 21:04:44,087 [RatisPipelineUtilsThread - 0] WARN  
> pipeline.PipelinePlacementPolicy 
> (PipelinePlacementPolicy.java:filterViableNodes(151)) - Pipeline creation 
> failed due to no sufficient healthy datanodes. Required 3. Found 2.
> 2021-06-09 21:04:44,088 [EventQueue-NewNodeForNewNodeHandler] INFO  
> pipeline.BackgroundPipelineCreator 
> (BackgroundPipelineCreatorV2.java:notifyEventTriggered(282)) - trigger a 
> one-shot run on RatisPipelineUtilsThread.
> 2021-06-09 21:04:44,088 [RatisPipelineUtilsThread - 0] INFO  
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$create$0(170)) - Sending 
> CreatePipelineCommand for 
> pipeline:PipelineID=8bfba789-d337-4fed-9eb6-b1debd3d19e8 to 
> datanode:b06583c0-2c53-452b-83e4-398ff0104f72
> 2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO  
> pipeline.PipelineStateManager 
> (PipelineStateManagerV2Impl.java:addPipeline(101)) - Created pipeline 
> Pipeline[ Id: 8bfba789-d337-4fed-9eb6-b1debd3d19e8, Nodes: 
> b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host: 
> localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, 
> REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: 
> /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, 
> persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: RATIS/ONE, 
> State:ALLOCATED, leaderId:, CreationTimestamp2021-06-09T21:04:44.088Z].
> 2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO  
> ha.SCMHAInvocationHandler (SCMHAInvocationHandler.java:invokeRatis(113)) - 
> Invoking method public abstract void 
> org.apache.hadoop.hdds.scm.pipeline.StateManager.addPipeline(org.apache.hadoop.hdds.protocol.proto.HddsProtos$Pipeline)
>  throws java.io.IOException on target 
> org.apache.hadoop.hdds.scm.ha.MockSCMHAManager$MockRatisServer@5bf60155, cost 
> 655.117us
> 2021-06-09 21:04:44,091 [RatisPipelineUtilsThread - 0] WARN  
> pipeline.PipelinePlacementPolicy 
> (PipelinePlacementPolicy.java:filterViableNodes(170)) - Pipeline creation 
> failed due to no sufficient healthy datanodes with enough space for even a 
> single container. Required 3. Found 2. Container size 5368709120.
> 2021-06-09 21:04:44,092 [Listener at 0.0.0.0/34005] INFO  node.SCMNodeManager 
> (SCMNodeManager.java:register(386)) - Registered Data node : 
> b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host: 
> localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, 
> REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: 
> /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, 
> persistedOpStateExpiryEpochSec: 0}
> 2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 3 pipelines 
> of type RATIS and factor ONE.
> 2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines 
> of type RATIS and factor THREE.
> 2021-06-09 21:04:45,094 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines 
> of type RATIS and factor THREE.
> 2021-06-09 21:04:46,094 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines 
> of type RATIS and factor THREE.
> 2021-06-09 21:04:47,095 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines 
> of type RATIS and factor THREE.
> 2021-06-09 21:04:48,096 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines 
> of type RATIS and factor THREE.
> 2021-06-09 21:04:49,096 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines 
> of type RATIS and factor THREE.
> 2021-06-09 21:04:50,097 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines 
> of type RATIS and factor THREE.
> 2021-06-09 21:04:51,097 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines 
> of type RATIS and factor THREE.
> 2021-06-09 21:04:52,098 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines 
> of type RATIS and factor THREE.
> 2021-06-09 21:04:53,098 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines 
> of type RATIS and factor THREE.
> 2021-06-09 21:04:54,099 [Listener at 0.0.0.0/34005] INFO  
> node.TestSCMNodeManager 
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines 
> of type RATIS and factor THREE.
> {code}
> Note that the new node is the third node registered, so we would expect a 
> Ratis factor three pipeline to be created after this event. Factor one 
> pipeline creation succeeds for this new node due to HDDS-5337, although this 
> is not related to this test failure.
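
The ordering problem described above can be illustrated with a minimal,
deterministic sketch. This is not Ozone code: the class
RegistrationOrderSketch and its methods are hypothetical stand-ins, the
NEW_NODE handler is called synchronously to model the unlucky interleaving
(in the real system it runs on another thread, which is why the failure is
intermittent), and handling the storage report before announcing the node is
only one possible ordering, assumed here for contrast rather than taken from
the actual patch.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of the registration ordering; not Ozone code. */
public class RegistrationOrderSketch {
  // Container size taken from the log lines quoted above.
  static final long CONTAINER_SIZE = 5368709120L;

  private final Set<String> nodes = new LinkedHashSet<>();
  // Free space per node; a node with no processed report counts as 0 bytes.
  private final Map<String, Long> freeBytes = new HashMap<>();
  private int factorThreePipelines = 0;

  /** Stand-in for the one-shot pipeline creator run triggered by NEW_NODE. */
  private void onNewNode() {
    long usable = nodes.stream()
        .filter(n -> freeBytes.getOrDefault(n, 0L) >= CONTAINER_SIZE)
        .count();
    if (usable >= 3) {
      factorThreePipelines++;
    }
  }

  /** Order described in the issue: the event fires before the report is stored. */
  void registerEventFirst(String id, long reportedFree) {
    nodes.add(id);
    onNewNode();                      // the new node still looks empty here
    freeBytes.put(id, reportedFree);
  }

  /** Assumed alternative order: store the report, then fire the event. */
  void registerReportFirst(String id, long reportedFree) {
    freeBytes.put(id, reportedFree);
    nodes.add(id);
    onNewNode();                      // the new node's space is visible here
  }

  /** Registers three nodes and returns how many factor-three pipelines exist. */
  static int run(boolean reportFirst) {
    RegistrationOrderSketch scm = new RegistrationOrderSketch();
    for (String id : new String[] {"dn1", "dn2", "dn3"}) {
      if (reportFirst) {
        scm.registerReportFirst(id, 2 * CONTAINER_SIZE);
      } else {
        scm.registerEventFirst(id, 2 * CONTAINER_SIZE);
      }
    }
    return scm.factorThreePipelines;
  }

  public static void main(String[] args) {
    System.out.println("event first:  " + run(false) + " factor-three pipelines");
    System.out.println("report first: " + run(true) + " factor-three pipelines");
  }
}
```

With the event fired first, the third registration sees only two nodes with
enough space, which matches the "Required 3. Found 2." warning in the log.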


