[
https://issues.apache.org/jira/browse/HDDS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Rose updated HDDS-5336:
-----------------------------
Description:
After merging master into the upgrade branch in HDDS-5321, an intermittent
failure was noticed in TestSCMNodeManager#testLayoutOnHeartbeat:
https://github.com/apache/ozone/runs/2787582345
The issue occurs in SCMNodeManager#register, where the node is added to the
nodeStateManager firing the NEW_NODE event, before the node report containing
storage information for the new node is processed. The event triggers a one
shot run on the background pipeline creator which will read the node's storage
information to determine if it can hold a pipeline. If the storage report has
not yet been processed when this happens, no pipeline will be created to use
the new node when it is registered, because the node still appears to have no
free space.
Relevant log lines from the test failure:
{code}
2021-06-09 21:04:44,087 [Listener at 0.0.0.0/34005] INFO
net.NetworkTopologyImpl (NetworkTopologyImpl.java:add(112)) - Added a new node:
/default-rack/b06583c0-2c53-452b-83e4-398ff0104f72
2021-06-09 21:04:44,087 [RatisPipelineUtilsThread - 0] WARN
pipeline.PipelinePlacementPolicy
(PipelinePlacementPolicy.java:filterViableNodes(151)) - Pipeline creation
failed due to no sufficient healthy datanodes. Required 3. Found 2.
2021-06-09 21:04:44,088 [EventQueue-NewNodeForNewNodeHandler] INFO
pipeline.BackgroundPipelineCreator
(BackgroundPipelineCreatorV2.java:notifyEventTriggered(282)) - trigger a
one-shot run on RatisPipelineUtilsThread.
2021-06-09 21:04:44,088 [RatisPipelineUtilsThread - 0] INFO
pipeline.RatisPipelineProvider
(RatisPipelineProvider.java:lambda$create$0(170)) - Sending
CreatePipelineCommand for
pipeline:PipelineID=8bfba789-d337-4fed-9eb6-b1debd3d19e8 to
datanode:b06583c0-2c53-452b-83e4-398ff0104f72
2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO
pipeline.PipelineStateManager
(PipelineStateManagerV2Impl.java:addPipeline(101)) - Created pipeline Pipeline[
Id: 8bfba789-d337-4fed-9eb6-b1debd3d19e8, Nodes:
b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host:
localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0,
RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId:
null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0},
ReplicationConfig: RATIS/ONE, State:ALLOCATED, leaderId:,
CreationTimestamp2021-06-09T21:04:44.088Z].
2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO
ha.SCMHAInvocationHandler (SCMHAInvocationHandler.java:invokeRatis(113)) -
Invoking method public abstract void
org.apache.hadoop.hdds.scm.pipeline.StateManager.addPipeline(org.apache.hadoop.hdds.protocol.proto.HddsProtos$Pipeline)
throws java.io.IOException on target
org.apache.hadoop.hdds.scm.ha.MockSCMHAManager$MockRatisServer@5bf60155, cost
655.117us
2021-06-09 21:04:44,091 [RatisPipelineUtilsThread - 0] WARN
pipeline.PipelinePlacementPolicy
(PipelinePlacementPolicy.java:filterViableNodes(170)) - Pipeline creation
failed due to no sufficient healthy datanodes with enough space for even a
single container. Required 3. Found 2. Container size 5368709120.
2021-06-09 21:04:44,092 [Listener at 0.0.0.0/34005] INFO node.SCMNodeManager
(SCMNodeManager.java:register(386)) - Registered Data node :
b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host:
localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0,
RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId:
null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 3 pipelines of
type RATIS and factor ONE.
2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:45,094 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:46,094 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:47,095 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:48,096 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:49,096 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:50,097 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:51,097 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:52,098 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:53,098 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:54,099 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
{code}
Note that the new node is the third node registered, so we would expect a Ratis
factor three pipeline to be created after this event. Factor one pipeline
creation succeeds for this new node due to HDDS-5337.
was:
After merging master into the upgrade branch in HDDS-5321, an intermittent
failure was noticed in TestSCMNodeManager#testLayoutOnHeartbeat:
https://github.com/apache/ozone/runs/2787582345
The issue occurs in SCMNodeManager#register, where the node is added to the
nodeStateManager firing the NEW_NODE event, before the node report containing
storage information for the new node is processed. The event triggers a one
shot run on the background pipeline creator which will read the node's storage
information to determine if it can hold a pipeline. If the storage report has
not yet been processed when this happens, no pipeline will be created to use
the new node when it is registered, because the node still appears to have no
free space.
Relevant log lines from the test failure:
{code}
2021-06-09 21:04:44,087 [Listener at 0.0.0.0/34005] INFO
net.NetworkTopologyImpl (NetworkTopologyImpl.java:add(112)) - Added a new node:
/default-rack/b06583c0-2c53-452b-83e4-398ff0104f72
2021-06-09 21:04:44,087 [RatisPipelineUtilsThread - 0] WARN
pipeline.PipelinePlacementPolicy
(PipelinePlacementPolicy.java:filterViableNodes(151)) - Pipeline creation
failed due to no sufficient healthy datanodes. Required 3. Found 2.
2021-06-09 21:04:44,088 [EventQueue-NewNodeForNewNodeHandler] INFO
pipeline.BackgroundPipelineCreator
(BackgroundPipelineCreatorV2.java:notifyEventTriggered(282)) - trigger a
one-shot run on RatisPipelineUtilsThread.
2021-06-09 21:04:44,088 [RatisPipelineUtilsThread - 0] INFO
pipeline.RatisPipelineProvider
(RatisPipelineProvider.java:lambda$create$0(170)) - Sending
CreatePipelineCommand for
pipeline:PipelineID=8bfba789-d337-4fed-9eb6-b1debd3d19e8 to
datanode:b06583c0-2c53-452b-83e4-398ff0104f72
2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO
pipeline.PipelineStateManager
(PipelineStateManagerV2Impl.java:addPipeline(101)) - Created pipeline Pipeline[
Id: 8bfba789-d337-4fed-9eb6-b1debd3d19e8, Nodes:
b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host:
localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0,
RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId:
null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0},
ReplicationConfig: RATIS/ONE, State:ALLOCATED, leaderId:,
CreationTimestamp2021-06-09T21:04:44.088Z].
2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO
ha.SCMHAInvocationHandler (SCMHAInvocationHandler.java:invokeRatis(113)) -
Invoking method public abstract void
org.apache.hadoop.hdds.scm.pipeline.StateManager.addPipeline(org.apache.hadoop.hdds.protocol.proto.HddsProtos$Pipeline)
throws java.io.IOException on target
org.apache.hadoop.hdds.scm.ha.MockSCMHAManager$MockRatisServer@5bf60155, cost
655.117us
2021-06-09 21:04:44,091 [RatisPipelineUtilsThread - 0] WARN
pipeline.PipelinePlacementPolicy
(PipelinePlacementPolicy.java:filterViableNodes(170)) - Pipeline creation
failed due to no sufficient healthy datanodes with enough space for even a
single container. Required 3. Found 2. Container size 5368709120.
2021-06-09 21:04:44,092 [Listener at 0.0.0.0/34005] INFO node.SCMNodeManager
(SCMNodeManager.java:register(386)) - Registered Data node :
b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host:
localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0,
RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId:
null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 3 pipelines of
type RATIS and factor ONE.
2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:45,094 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:46,094 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:47,095 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:48,096 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:49,096 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:50,097 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:51,097 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:52,098 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:53,098 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
2021-06-09 21:04:54,099 [Listener at 0.0.0.0/34005] INFO
node.TestSCMNodeManager
(TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of
type RATIS and factor THREE.
{code}
> Fix datanode capacity related race condition
> --------------------------------------------
>
> Key: HDDS-5336
> URL: https://issues.apache.org/jira/browse/HDDS-5336
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ethan Rose
> Assignee: Ethan Rose
> Priority: Major
>
> After merging master into the upgrade branch in HDDS-5321, an intermittent
> failure was noticed in TestSCMNodeManager#testLayoutOnHeartbeat:
> https://github.com/apache/ozone/runs/2787582345
> The issue occurs in SCMNodeManager#register, where the node is added to the
> nodeStateManager firing the NEW_NODE event, before the node report containing
> storage information for the new node is processed. The event triggers a one
> shot run on the background pipeline creator which will read the node's
> storage information to determine if it can hold a pipeline. If the storage
> report has not yet been processed when this happens, no pipeline will be
> created to use the new node when it is registered, because the node still
> appears to have no free space.
> Relevant log lines from the test failure:
> {code}
> 2021-06-09 21:04:44,087 [Listener at 0.0.0.0/34005] INFO
> net.NetworkTopologyImpl (NetworkTopologyImpl.java:add(112)) - Added a new
> node: /default-rack/b06583c0-2c53-452b-83e4-398ff0104f72
> 2021-06-09 21:04:44,087 [RatisPipelineUtilsThread - 0] WARN
> pipeline.PipelinePlacementPolicy
> (PipelinePlacementPolicy.java:filterViableNodes(151)) - Pipeline creation
> failed due to no sufficient healthy datanodes. Required 3. Found 2.
> 2021-06-09 21:04:44,088 [EventQueue-NewNodeForNewNodeHandler] INFO
> pipeline.BackgroundPipelineCreator
> (BackgroundPipelineCreatorV2.java:notifyEventTriggered(282)) - trigger a
> one-shot run on RatisPipelineUtilsThread.
> 2021-06-09 21:04:44,088 [RatisPipelineUtilsThread - 0] INFO
> pipeline.RatisPipelineProvider
> (RatisPipelineProvider.java:lambda$create$0(170)) - Sending
> CreatePipelineCommand for
> pipeline:PipelineID=8bfba789-d337-4fed-9eb6-b1debd3d19e8 to
> datanode:b06583c0-2c53-452b-83e4-398ff0104f72
> 2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO
> pipeline.PipelineStateManager
> (PipelineStateManagerV2Impl.java:addPipeline(101)) - Created pipeline
> Pipeline[ Id: 8bfba789-d337-4fed-9eb6-b1debd3d19e8, Nodes:
> b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host:
> localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0,
> REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation:
> /default-rack, certSerialId: null, persistedOpState: IN_SERVICE,
> persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: RATIS/ONE,
> State:ALLOCATED, leaderId:, CreationTimestamp2021-06-09T21:04:44.088Z].
> 2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO
> ha.SCMHAInvocationHandler (SCMHAInvocationHandler.java:invokeRatis(113)) -
> Invoking method public abstract void
> org.apache.hadoop.hdds.scm.pipeline.StateManager.addPipeline(org.apache.hadoop.hdds.protocol.proto.HddsProtos$Pipeline)
> throws java.io.IOException on target
> org.apache.hadoop.hdds.scm.ha.MockSCMHAManager$MockRatisServer@5bf60155, cost
> 655.117us
> 2021-06-09 21:04:44,091 [RatisPipelineUtilsThread - 0] WARN
> pipeline.PipelinePlacementPolicy
> (PipelinePlacementPolicy.java:filterViableNodes(170)) - Pipeline creation
> failed due to no sufficient healthy datanodes with enough space for even a
> single container. Required 3. Found 2. Container size 5368709120.
> 2021-06-09 21:04:44,092 [Listener at 0.0.0.0/34005] INFO node.SCMNodeManager
> (SCMNodeManager.java:register(386)) - Registered Data node :
> b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host:
> localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0,
> REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation:
> /default-rack, certSerialId: null, persistedOpState: IN_SERVICE,
> persistedOpStateExpiryEpochSec: 0}
> 2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 3 pipelines
> of type RATIS and factor ONE.
> 2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines
> of type RATIS and factor THREE.
> 2021-06-09 21:04:45,094 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines
> of type RATIS and factor THREE.
> 2021-06-09 21:04:46,094 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines
> of type RATIS and factor THREE.
> 2021-06-09 21:04:47,095 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines
> of type RATIS and factor THREE.
> 2021-06-09 21:04:48,096 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines
> of type RATIS and factor THREE.
> 2021-06-09 21:04:49,096 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines
> of type RATIS and factor THREE.
> 2021-06-09 21:04:50,097 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines
> of type RATIS and factor THREE.
> 2021-06-09 21:04:51,097 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines
> of type RATIS and factor THREE.
> 2021-06-09 21:04:52,098 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines
> of type RATIS and factor THREE.
> 2021-06-09 21:04:53,098 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines
> of type RATIS and factor THREE.
> 2021-06-09 21:04:54,099 [Listener at 0.0.0.0/34005] INFO
> node.TestSCMNodeManager
> (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines
> of type RATIS and factor THREE.
> {code}
> Note that the new node is the third node registered, so we would expect a
> Ratis factor three pipeline to be created after this event. Factor one
> pipeline creation succeeds for this new node due to HDDS-5337.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]