Stephen O'Donnell created HDDS-3107:
---------------------------------------

             Summary: Pipelines may not be rack aware on cluster startup
                 Key: HDDS-3107
                 URL: https://issues.apache.org/jira/browse/HDDS-3107
             Project: Hadoop Distributed Data Store
          Issue Type: Sub-task
          Components: SCM
    Affects Versions: 0.6.0
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


Given a 6-node cluster with 2 racks (3 nodes per rack), it is possible for a 
pipeline to be created in a non-rack-aware way on startup.

Using a robot test like the one in HDDS-3084, I can intermittently see that if 
all nodes from one rack register first, pipeline creation is triggered on them, 
resulting in a pipeline that is entirely on one rack. The next 3 nodes then 
register and, as there are no nodes available on the other rack, they too join 
a "one rack" pipeline.

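The sequence described above can be reproduced with a toy model. This is only a 
sketch and not SCM code: the class StartupRaceSketch, its Node type and the 
simulate method are hypothetical names. It assumes that a three-node pipeline 
is created as soon as three unassigned nodes are registered, and that each node 
joins at most one such pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the startup race; not the actual SCM code. */
final class StartupRaceSketch {

  /** Hypothetical stand-in for a registered datanode. */
  static final class Node {
    final String name;
    final String rack;

    Node(String name, String rack) {
      this.name = name;
      this.rack = rack;
    }
  }

  /**
   * Registers nodes one at a time and creates a three-node pipeline as soon
   * as three unassigned nodes exist. Assumption: placement only sees nodes
   * registered so far, and a node joins at most one three-node pipeline.
   * Returns the set of racks used by each pipeline, in creation order.
   */
  static List<Set<String>> simulate(List<Node> registrationOrder) {
    List<Node> unassigned = new ArrayList<>();
    List<Set<String>> racksPerPipeline = new ArrayList<>();
    for (Node node : registrationOrder) {
      unassigned.add(node);
      if (unassigned.size() == 3) {
        Set<String> racks = new TreeSet<>();
        for (Node member : unassigned) {
          racks.add(member.rack);
        }
        racksPerPipeline.add(racks);
        unassigned.clear();
      }
    }
    return racksPerPipeline;
  }

  public static void main(String[] args) {
    // Registration order from this report: all rack1 nodes arrive first.
    List<Node> order = Arrays.asList(
        new Node("datanode_3", "/rack1"), new Node("datanode_2", "/rack1"),
        new Node("datanode_1", "/rack1"), new Node("datanode_5", "/rack2"),
        new Node("datanode_6", "/rack2"), new Node("datanode_4", "/rack2"));
    System.out.println(simulate(order));
  }
}
```

With the rack1-first order, both pipelines in this model end up on a single 
rack, while an interleaved registration order spreads each pipeline over both 
racks. The outcome depends only on arrival order.
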
The log snippet below shows this happening. I will attach the full 
docker-compose log:

{code}
egrep "Sending CreatePipelineCommand|Registered Data node|Created pipe" 
docker-ozone-topology-ozone-topology-readdata-scm.log
scm_1         | 2020-02-28 12:27:57,826 [IPC Server handler 6 on 9861] INFO 
node.SCMNodeManager: Registered Data node : 
74084fe6-60a9-45d6-b02c-a9fa7ed24e3a{ip: 10.5.0.6, host: 
ozone-topology_datanode_3_1.ozone-topology_net, networkLocation: /rack1, 
certSerialId: null}
scm_1         | 2020-02-28 12:27:57,840 [IPC Server handler 9 on 9861] INFO 
node.SCMNodeManager: Registered Data node : 
32be7fa9-1ff6-4bb3-8bed-8648d276ae07{ip: 10.5.0.5, host: 
ozone-topology_datanode_2_1.ozone-topology_net, networkLocation: /rack1, 
certSerialId: null}
scm_1         | 2020-02-28 12:27:57,903 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=16806a56-8e35-46b2-aefd-cb5232d6f5f7 to 
datanode:32be7fa9-1ff6-4bb3-8bed-8648d276ae07
scm_1         | 2020-02-28 12:27:57,924 [RatisPipelineUtilsThread] INFO 
pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 
16806a56-8e35-46b2-aefd-cb5232d6f5f7, Nodes: 
32be7fa9-1ff6-4bb3-8bed-8648d276ae07{ip: 10.5.0.5, host: 
ozone-topology_datanode_2_1.ozone-topology_net, networkLocation: /rack1, 
certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, 
CreationTimestamp2020-02-28T12:27:57.891553Z]
scm_1         | 2020-02-28 12:27:57,932 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=5a3edf1e-84f6-48ef-a333-6f3e924898a6 to 
datanode:74084fe6-60a9-45d6-b02c-a9fa7ed24e3a
scm_1         | 2020-02-28 12:27:57,933 [RatisPipelineUtilsThread] INFO 
pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 
5a3edf1e-84f6-48ef-a333-6f3e924898a6, Nodes: 
74084fe6-60a9-45d6-b02c-a9fa7ed24e3a{ip: 10.5.0.6, host: 
ozone-topology_datanode_3_1.ozone-topology_net, networkLocation: /rack1, 
certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, 
CreationTimestamp2020-02-28T12:27:57.932422Z]
scm_1         | 2020-02-28 12:27:58,213 [IPC Server handler 8 on 9861] INFO 
node.SCMNodeManager: Registered Data node : 
4ce489a3-e3da-4f2a-9ddc-b01b634a68b6{ip: 10.5.0.4, host: 
ozone-topology_datanode_1_1.ozone-topology_net, networkLocation: /rack1, 
certSerialId: null}
scm_1         | 2020-02-28 12:27:58,216 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=ba2034fc-cb11-482a-9843-435294862240 to 
datanode:4ce489a3-e3da-4f2a-9ddc-b01b634a68b6
scm_1         | 2020-02-28 12:27:58,216 [RatisPipelineUtilsThread] INFO 
pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 
ba2034fc-cb11-482a-9843-435294862240, Nodes: 
4ce489a3-e3da-4f2a-9ddc-b01b634a68b6{ip: 10.5.0.4, host: 
ozone-topology_datanode_1_1.ozone-topology_net, networkLocation: /rack1, 
certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, 
CreationTimestamp2020-02-28T12:27:58.216275Z]
scm_1         | 2020-02-28 12:27:58,218 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=4f16913d-ec06-44b4-a577-6664a517e401 to 
datanode:4ce489a3-e3da-4f2a-9ddc-b01b634a68b6
scm_1         | 2020-02-28 12:27:58,219 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=4f16913d-ec06-44b4-a577-6664a517e401 to 
datanode:74084fe6-60a9-45d6-b02c-a9fa7ed24e3a
scm_1         | 2020-02-28 12:27:58,220 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=4f16913d-ec06-44b4-a577-6664a517e401 to 
datanode:32be7fa9-1ff6-4bb3-8bed-8648d276ae07
scm_1         | 2020-02-28 12:27:58,221 [RatisPipelineUtilsThread] INFO 
pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 
4f16913d-ec06-44b4-a577-6664a517e401, Nodes: 
4ce489a3-e3da-4f2a-9ddc-b01b634a68b6{ip: 10.5.0.4, host: 
ozone-topology_datanode_1_1.ozone-topology_net, networkLocation: /rack1, 
certSerialId: null}74084fe6-60a9-45d6-b02c-a9fa7ed24e3a{ip: 10.5.0.6, host: 
ozone-topology_datanode_3_1.ozone-topology_net, networkLocation: /rack1, 
certSerialId: null}32be7fa9-1ff6-4bb3-8bed-8648d276ae07{ip: 10.5.0.5, host: 
ozone-topology_datanode_2_1.ozone-topology_net, networkLocation: /rack1, 
certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:null, 
CreationTimestamp2020-02-28T12:27:58.218896Z]
scm_1         | 2020-02-28 12:27:58,645 [IPC Server handler 7 on 9861] INFO 
node.SCMNodeManager: Registered Data node : 
66ec72b2-4be5-453f-ac44-cc9857bad5f0{ip: 10.5.0.8, host: 
ozone-topology_datanode_5_1.ozone-topology_net, networkLocation: /rack2, 
certSerialId: null}
scm_1         | 2020-02-28 12:27:58,645 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=4739840f-8bb3-4742-ac5e-ac519b51e0fd to 
datanode:66ec72b2-4be5-453f-ac44-cc9857bad5f0
scm_1         | 2020-02-28 12:27:58,647 [RatisPipelineUtilsThread] INFO 
pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 
4739840f-8bb3-4742-ac5e-ac519b51e0fd, Nodes: 
66ec72b2-4be5-453f-ac44-cc9857bad5f0{ip: 10.5.0.8, host: 
ozone-topology_datanode_5_1.ozone-topology_net, networkLocation: /rack2, 
certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, 
CreationTimestamp2020-02-28T12:27:58.645455Z]
scm_1         | 2020-02-28 12:27:59,339 [IPC Server handler 7 on 9861] INFO 
node.SCMNodeManager: Registered Data node : 
9be38eea-bacc-434a-876d-50b105d4daa2{ip: 10.5.0.9, host: 
ozone-topology_datanode_6_1.ozone-topology_net, networkLocation: /rack2, 
certSerialId: null}
scm_1         | 2020-02-28 12:27:59,340 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=555b9a1d-1c4a-4d9f-b198-492da7005ccd to 
datanode:9be38eea-bacc-434a-876d-50b105d4daa2
scm_1         | 2020-02-28 12:27:59,341 [RatisPipelineUtilsThread] INFO 
pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 
555b9a1d-1c4a-4d9f-b198-492da7005ccd, Nodes: 
9be38eea-bacc-434a-876d-50b105d4daa2{ip: 10.5.0.9, host: 
ozone-topology_datanode_6_1.ozone-topology_net, networkLocation: /rack2, 
certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, 
CreationTimestamp2020-02-28T12:27:59.340193Z]
scm_1         | 2020-02-28 12:27:59,672 [IPC Server handler 6 on 9861] INFO 
node.SCMNodeManager: Registered Data node : 
cc1827a2-e4d2-47b4-a13a-1d990c6e36e1{ip: 10.5.0.7, host: 
ozone-topology_datanode_4_1.ozone-topology_net, networkLocation: /rack2, 
certSerialId: null}
scm_1         | 2020-02-28 12:27:59,673 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=a6d77ef7-52c0-4f6a-8c22-f0b405da08a1 to 
datanode:cc1827a2-e4d2-47b4-a13a-1d990c6e36e1
scm_1         | 2020-02-28 12:27:59,674 [RatisPipelineUtilsThread] INFO 
pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 
a6d77ef7-52c0-4f6a-8c22-f0b405da08a1, Nodes: 
cc1827a2-e4d2-47b4-a13a-1d990c6e36e1{ip: 10.5.0.7, host: 
ozone-topology_datanode_4_1.ozone-topology_net, networkLocation: /rack2, 
certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, 
CreationTimestamp2020-02-28T12:27:59.673585Z]
scm_1         | 2020-02-28 12:27:59,683 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=70cfd35d-b778-42df-bcba-3ba14bd8ead0 to 
datanode:9be38eea-bacc-434a-876d-50b105d4daa2
scm_1         | 2020-02-28 12:27:59,683 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=70cfd35d-b778-42df-bcba-3ba14bd8ead0 to 
datanode:66ec72b2-4be5-453f-ac44-cc9857bad5f0
scm_1         | 2020-02-28 12:27:59,683 [RatisPipelineUtilsThread] INFO 
pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for 
pipeline:PipelineID=70cfd35d-b778-42df-bcba-3ba14bd8ead0 to 
datanode:cc1827a2-e4d2-47b4-a13a-1d990c6e36e1
scm_1         | 2020-02-28 12:27:59,684 [RatisPipelineUtilsThread] INFO 
pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 
70cfd35d-b778-42df-bcba-3ba14bd8ead0, Nodes: 
9be38eea-bacc-434a-876d-50b105d4daa2{ip: 10.5.0.9, host: 
ozone-topology_datanode_6_1.ozone-topology_net, networkLocation: /rack2, 
certSerialId: null}66ec72b2-4be5-453f-ac44-cc9857bad
{code}
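
Spotting the single-rack pipelines in a log like this by eye is tedious, so the 
check can be scripted. The following is a minimal sketch, assuming each log 
entry has been joined back onto one line (the entries above are wrapped by the 
mail client) and that the "Created pipeline" format matches what is shown here. 
The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: flag single-rack Factor:THREE pipelines in SCM log lines. */
final class PipelineLogCheckSketch {

  // Field names as they appear in the "Created pipeline" entries above.
  private static final Pattern RACK =
      Pattern.compile("networkLocation: ([^,}\\s]+)");
  private static final Pattern FACTOR = Pattern.compile("Factor:(\\w+)");

  /** Returns the distinct networkLocation values found in one log line. */
  static Set<String> racksIn(String logLine) {
    Set<String> racks = new TreeSet<>();
    Matcher m = RACK.matcher(logLine);
    while (m.find()) {
      racks.add(m.group(1));
    }
    return racks;
  }

  /** True if the line reports a Factor:THREE pipeline on exactly one rack. */
  static boolean isSingleRackFactorThree(String logLine) {
    Matcher factor = FACTOR.matcher(logLine);
    return logLine.contains("Created pipeline")
        && factor.find()
        && "THREE".equals(factor.group(1))
        && racksIn(logLine).size() == 1;
  }

  public static void main(String[] args) {
    // Shortened version of the 4f16913d... entry from the log above.
    String line = "pipeline.PipelineStateManager: Created pipeline Pipeline[ "
        + "Id: 4f16913d-ec06-44b4-a577-6664a517e401, Nodes: "
        + "4ce489a3{ip: 10.5.0.4, networkLocation: /rack1, certSerialId: null}"
        + "74084fe6{ip: 10.5.0.6, networkLocation: /rack1, certSerialId: null}"
        + "32be7fa9{ip: 10.5.0.5, networkLocation: /rack1, certSerialId: null}"
        + ", Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:null]";
    System.out.println(racksIn(line) + " " + isSingleRackFactorThree(line));
  }
}
```

Factor:ONE pipelines are ignored by this check because a single-node pipeline 
is always on one rack.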

I believe there are a few things to consider here:

1) Do we need a better way to see if rack awareness is enabled? Currently we 
check the network topology for a count of rack nodes, but these are only 
created as the nodes register. Should we use the cluster map to determine the 
intended number of racks on the cluster?

2) Should we fall back to non-rack-aware placement so easily? Pipelines are 
long lived, and if they are created non-rack-aware, they will stay that way, 
potentially forever. Maybe we need to delay pipeline creation on startup until 
the node count settles?

3) If a pipeline or new container is being placed non-rack-aware in a 
rack-aware cluster, should we complain loudly in the logs, in JMX, and in 
Recon?

4) Do we need something to check for non-rack-aware pipelines and fix them 
where possible? E.g. if we have 2 racks and stop 1 rack, then we must create a 
non-rack-aware pipeline to keep on writing, but when the other rack is 
restarted, that pipeline should be destroyed and a new rack-aware one created.
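
To make points 3 and 4 concrete, here is a rough sketch of what a periodic 
check could decide for each pipeline. It is not existing SCM code; the class, 
enum and method names are hypothetical. It assumes "rack aware" means the 
pipeline's nodes span at least two racks, and that the caller can supply the 
set of racks that currently have healthy nodes.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Sketch of a per-pipeline rack-awareness audit; names are hypothetical. */
final class RackAwarenessAuditSketch {

  enum Verdict {
    /** The pipeline already spans at least two racks. */
    RACK_AWARE,
    /** Single-rack pipeline, but no other rack is currently available. */
    FALLBACK_ACCEPTABLE,
    /** Single-rack pipeline although another rack is available. */
    SHOULD_RECREATE
  }

  /**
   * @param pipelineNodeRacks rack of each node in the pipeline
   * @param racksWithHealthyNodes racks that currently have healthy nodes
   */
  static Verdict audit(Collection<String> pipelineNodeRacks,
                       Collection<String> racksWithHealthyNodes) {
    Set<String> used = new HashSet<>(pipelineNodeRacks);
    if (used.size() >= 2) {
      return Verdict.RACK_AWARE;
    }
    Set<String> available = new HashSet<>(racksWithHealthyNodes);
    // Simplification: a second healthy rack is treated as usable without
    // checking whether its nodes have capacity for another pipeline.
    if (available.size() >= 2) {
      return Verdict.SHOULD_RECREATE;
    }
    return Verdict.FALLBACK_ACCEPTABLE;
  }
}
```

In the 2-rack example from point 4, a pipeline created while one rack is 
stopped would be FALLBACK_ACCEPTABLE, and would become SHOULD_RECREATE once 
the other rack is back. That transition is also a natural place to log a 
warning or raise a metric.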


