[
https://issues.apache.org/jira/browse/HDDS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siddharth Wagle updated HDDS-3107:
----------------------------------
Parent Issue: HDDS-3722 (was: HDDS-698)
> Pipelines may not be rack aware on cluster startup
> --------------------------------------------------
>
> Key: HDDS-3107
> URL: https://issues.apache.org/jira/browse/HDDS-3107
> Project: Hadoop Distributed Data Store
> Issue Type: Sub-task
> Components: SCM
> Affects Versions: 0.6.0
> Reporter: Stephen O'Donnell
> Priority: Major
> Attachments: docker-ozone-topology-ozone-topology-readdata-scm.log
>
>
> Given a 6-node cluster with 2 racks (3 nodes per rack), it is
> possible for pipelines to be created in a non-rack-aware way on startup.
> Using a robot test like the one in HDDS-3084, I intermittently see that
> if all nodes from one rack register first, pipeline creation is
> triggered on them, resulting in a pipeline that is entirely on one rack. The
> next 3 nodes then register and, as no nodes are available on the other rack,
> they too join a "one rack" pipeline.
> The log snippet below shows this happening. I will attach the full
> docker-compose log:
> {code}
> egrep "Sending CreatePipelineCommand|Registered Data node|Created pipe" docker-ozone-topology-ozone-topology-readdata-scm.log
> scm_1 | 2020-02-28 12:27:57,826 [IPC Server handler 6 on 9861] INFO
> node.SCMNodeManager: Registered Data node :
> 74084fe6-60a9-45d6-b02c-a9fa7ed24e3a{ip: 10.5.0.6, host:
> ozone-topology_datanode_3_1.ozone-topology_net, networkLocation: /rack1,
> certSerialId: null}
> scm_1 | 2020-02-28 12:27:57,840 [IPC Server handler 9 on 9861] INFO
> node.SCMNodeManager: Registered Data node :
> 32be7fa9-1ff6-4bb3-8bed-8648d276ae07{ip: 10.5.0.5, host:
> ozone-topology_datanode_2_1.ozone-topology_net, networkLocation: /rack1,
> certSerialId: null}
> scm_1 | 2020-02-28 12:27:57,903 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=16806a56-8e35-46b2-aefd-cb5232d6f5f7 to
> datanode:32be7fa9-1ff6-4bb3-8bed-8648d276ae07
> scm_1 | 2020-02-28 12:27:57,924 [RatisPipelineUtilsThread] INFO
> pipeline.PipelineStateManager: Created pipeline Pipeline[ Id:
> 16806a56-8e35-46b2-aefd-cb5232d6f5f7, Nodes:
> 32be7fa9-1ff6-4bb3-8bed-8648d276ae07{ip: 10.5.0.5, host:
> ozone-topology_datanode_2_1.ozone-topology_net, networkLocation: /rack1,
> certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null,
> CreationTimestamp2020-02-28T12:27:57.891553Z]
> scm_1 | 2020-02-28 12:27:57,932 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=5a3edf1e-84f6-48ef-a333-6f3e924898a6 to
> datanode:74084fe6-60a9-45d6-b02c-a9fa7ed24e3a
> scm_1 | 2020-02-28 12:27:57,933 [RatisPipelineUtilsThread] INFO
> pipeline.PipelineStateManager: Created pipeline Pipeline[ Id:
> 5a3edf1e-84f6-48ef-a333-6f3e924898a6, Nodes:
> 74084fe6-60a9-45d6-b02c-a9fa7ed24e3a{ip: 10.5.0.6, host:
> ozone-topology_datanode_3_1.ozone-topology_net, networkLocation: /rack1,
> certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null,
> CreationTimestamp2020-02-28T12:27:57.932422Z]
> scm_1 | 2020-02-28 12:27:58,213 [IPC Server handler 8 on 9861] INFO
> node.SCMNodeManager: Registered Data node :
> 4ce489a3-e3da-4f2a-9ddc-b01b634a68b6{ip: 10.5.0.4, host:
> ozone-topology_datanode_1_1.ozone-topology_net, networkLocation: /rack1,
> certSerialId: null}
> scm_1 | 2020-02-28 12:27:58,216 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=ba2034fc-cb11-482a-9843-435294862240 to
> datanode:4ce489a3-e3da-4f2a-9ddc-b01b634a68b6
> scm_1 | 2020-02-28 12:27:58,216 [RatisPipelineUtilsThread] INFO
> pipeline.PipelineStateManager: Created pipeline Pipeline[ Id:
> ba2034fc-cb11-482a-9843-435294862240, Nodes:
> 4ce489a3-e3da-4f2a-9ddc-b01b634a68b6{ip: 10.5.0.4, host:
> ozone-topology_datanode_1_1.ozone-topology_net, networkLocation: /rack1,
> certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null,
> CreationTimestamp2020-02-28T12:27:58.216275Z]
> scm_1 | 2020-02-28 12:27:58,218 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=4f16913d-ec06-44b4-a577-6664a517e401 to
> datanode:4ce489a3-e3da-4f2a-9ddc-b01b634a68b6
> scm_1 | 2020-02-28 12:27:58,219 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=4f16913d-ec06-44b4-a577-6664a517e401 to
> datanode:74084fe6-60a9-45d6-b02c-a9fa7ed24e3a
> scm_1 | 2020-02-28 12:27:58,220 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=4f16913d-ec06-44b4-a577-6664a517e401 to
> datanode:32be7fa9-1ff6-4bb3-8bed-8648d276ae07
> scm_1 | 2020-02-28 12:27:58,221 [RatisPipelineUtilsThread] INFO
> pipeline.PipelineStateManager: Created pipeline Pipeline[ Id:
> 4f16913d-ec06-44b4-a577-6664a517e401, Nodes:
> 4ce489a3-e3da-4f2a-9ddc-b01b634a68b6{ip: 10.5.0.4, host:
> ozone-topology_datanode_1_1.ozone-topology_net, networkLocation: /rack1,
> certSerialId: null}74084fe6-60a9-45d6-b02c-a9fa7ed24e3a{ip: 10.5.0.6, host:
> ozone-topology_datanode_3_1.ozone-topology_net, networkLocation: /rack1,
> certSerialId: null}32be7fa9-1ff6-4bb3-8bed-8648d276ae07{ip: 10.5.0.5, host:
> ozone-topology_datanode_2_1.ozone-topology_net, networkLocation: /rack1,
> certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED,
> leaderId:null, CreationTimestamp2020-02-28T12:27:58.218896Z]
> scm_1 | 2020-02-28 12:27:58,645 [IPC Server handler 7 on 9861] INFO
> node.SCMNodeManager: Registered Data node :
> 66ec72b2-4be5-453f-ac44-cc9857bad5f0{ip: 10.5.0.8, host:
> ozone-topology_datanode_5_1.ozone-topology_net, networkLocation: /rack2,
> certSerialId: null}
> scm_1 | 2020-02-28 12:27:58,645 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=4739840f-8bb3-4742-ac5e-ac519b51e0fd to
> datanode:66ec72b2-4be5-453f-ac44-cc9857bad5f0
> scm_1 | 2020-02-28 12:27:58,647 [RatisPipelineUtilsThread] INFO
> pipeline.PipelineStateManager: Created pipeline Pipeline[ Id:
> 4739840f-8bb3-4742-ac5e-ac519b51e0fd, Nodes:
> 66ec72b2-4be5-453f-ac44-cc9857bad5f0{ip: 10.5.0.8, host:
> ozone-topology_datanode_5_1.ozone-topology_net, networkLocation: /rack2,
> certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null,
> CreationTimestamp2020-02-28T12:27:58.645455Z]
> scm_1 | 2020-02-28 12:27:59,339 [IPC Server handler 7 on 9861] INFO
> node.SCMNodeManager: Registered Data node :
> 9be38eea-bacc-434a-876d-50b105d4daa2{ip: 10.5.0.9, host:
> ozone-topology_datanode_6_1.ozone-topology_net, networkLocation: /rack2,
> certSerialId: null}
> scm_1 | 2020-02-28 12:27:59,340 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=555b9a1d-1c4a-4d9f-b198-492da7005ccd to
> datanode:9be38eea-bacc-434a-876d-50b105d4daa2
> scm_1 | 2020-02-28 12:27:59,341 [RatisPipelineUtilsThread] INFO
> pipeline.PipelineStateManager: Created pipeline Pipeline[ Id:
> 555b9a1d-1c4a-4d9f-b198-492da7005ccd, Nodes:
> 9be38eea-bacc-434a-876d-50b105d4daa2{ip: 10.5.0.9, host:
> ozone-topology_datanode_6_1.ozone-topology_net, networkLocation: /rack2,
> certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null,
> CreationTimestamp2020-02-28T12:27:59.340193Z]
> scm_1 | 2020-02-28 12:27:59,672 [IPC Server handler 6 on 9861] INFO
> node.SCMNodeManager: Registered Data node :
> cc1827a2-e4d2-47b4-a13a-1d990c6e36e1{ip: 10.5.0.7, host:
> ozone-topology_datanode_4_1.ozone-topology_net, networkLocation: /rack2,
> certSerialId: null}
> scm_1 | 2020-02-28 12:27:59,673 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=a6d77ef7-52c0-4f6a-8c22-f0b405da08a1 to
> datanode:cc1827a2-e4d2-47b4-a13a-1d990c6e36e1
> scm_1 | 2020-02-28 12:27:59,674 [RatisPipelineUtilsThread] INFO
> pipeline.PipelineStateManager: Created pipeline Pipeline[ Id:
> a6d77ef7-52c0-4f6a-8c22-f0b405da08a1, Nodes:
> cc1827a2-e4d2-47b4-a13a-1d990c6e36e1{ip: 10.5.0.7, host:
> ozone-topology_datanode_4_1.ozone-topology_net, networkLocation: /rack2,
> certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null,
> CreationTimestamp2020-02-28T12:27:59.673585Z]
> scm_1 | 2020-02-28 12:27:59,683 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=70cfd35d-b778-42df-bcba-3ba14bd8ead0 to
> datanode:9be38eea-bacc-434a-876d-50b105d4daa2
> scm_1 | 2020-02-28 12:27:59,683 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=70cfd35d-b778-42df-bcba-3ba14bd8ead0 to
> datanode:66ec72b2-4be5-453f-ac44-cc9857bad5f0
> scm_1 | 2020-02-28 12:27:59,683 [RatisPipelineUtilsThread] INFO
> pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for
> pipeline:PipelineID=70cfd35d-b778-42df-bcba-3ba14bd8ead0 to
> datanode:cc1827a2-e4d2-47b4-a13a-1d990c6e36e1
> scm_1 | 2020-02-28 12:27:59,684 [RatisPipelineUtilsThread] INFO
> pipeline.PipelineStateManager: Created pipeline Pipeline[ Id:
> 70cfd35d-b778-42df-bcba-3ba14bd8ead0, Nodes:
> 9be38eea-bacc-434a-876d-50b105d4daa2{ip: 10.5.0.9, host:
> ozone-topology_datanode_6_1.ozone-topology_net, networkLocation: /rack2,
> certSerialId: null}66ec72b2-4be5-453f-ac44-cc9857bad
> {code}
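
The check done by eye on the log above can be scripted. The following is a minimal sketch, not part of Ozone: it assumes each log entry sits on a single line (as in the attached log file, not as wrapped in this email), and the regular expressions are guesses based only on the "Created pipeline" lines quoted above. The function name `single_rack_pipelines` is hypothetical.

```python
import re

# Patterns inferred from the quoted log lines; they are assumptions about the
# SCM log format, not a stable interface.
PIPELINE_RE = re.compile(
    r"Created pipeline Pipeline\[ Id: (?P<id>[0-9a-f-]+), "
    r"Nodes: (?P<nodes>.*?), Type:\w+, Factor:(?P<factor>\w+)"
)
RACK_RE = re.compile(r"networkLocation: (?P<rack>[^,}]+)")


def single_rack_pipelines(log_lines):
    """Return (pipeline_id, rack) for every multi-node pipeline on one rack."""
    flagged = []
    for line in log_lines:
        match = PIPELINE_RE.search(line)
        if match is None:
            continue  # not a "Created pipeline" entry
        racks = RACK_RE.findall(match.group("nodes"))
        # Single-node pipelines are trivially on one rack, so skip them.
        if len(racks) > 1 and len(set(racks)) == 1:
            flagged.append((match.group("id"), racks[0]))
    return flagged
```

Passing the lines of the attached log file to `single_rack_pipelines` would list the pipelines whose members all report the same `networkLocation`.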
> I believe there are a few things to consider here:
> 1) Do we need a better way to see if rack awareness is enabled? Currently we
> check the network topology for a count of rack nodes, but these are only
> created as the nodes register. Should we use the cluster map to determine the
> intended number of racks on the cluster?
> 2) Should we fall back to non-rack-aware placement so easily? Pipelines are
> long lived, and if they are created non-rack-aware, they will stay that way,
> potentially forever. Maybe we need to delay pipeline creation on startup
> until the node count settles?
> 3) If a pipeline or new container is being placed non-rack-aware in a
> rack-aware cluster, should we complain loudly in the logs, in JMX, and in Recon?
> 4) Do we need something that checks for non-rack-aware pipelines and fixes
> them where possible? E.g. if we have 2 racks and stop 1 rack, then we must
> create a non-rack-aware pipeline to keep on writing, but when the other rack is
> restarted, that pipeline should be destroyed and a new rack-aware one created.
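
Points 1, 2 and 4 above could be sketched roughly as follows. This is an illustration under stated assumptions, not the SCM implementation: `Pipeline`, `ready_to_create`, `is_rack_aware` and `pipelines_to_rebuild` are hypothetical names, the expected rack count is assumed to come from the configured cluster map, and the wait limit is an invented parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    # Hypothetical model: only the fields needed for the rack check.
    pipeline_id: str
    node_racks: tuple  # rack of each member, e.g. ("/rack1", "/rack1", "/rack2")


def ready_to_create(registered_racks, expected_rack_count, waited_s, max_wait_s):
    """Points 1 and 2: hold back pipeline creation until enough racks registered.

    expected_rack_count is assumed to come from the cluster map, not from the
    racks seen so far. After max_wait_s we give up waiting and allow creation.
    """
    needed = min(expected_rack_count, 2)
    if len(set(registered_racks)) >= needed:
        return True
    return waited_s >= max_wait_s


def is_rack_aware(pipeline):
    """A multi-node pipeline is rack aware if its members span more than one rack."""
    return len(pipeline.node_racks) < 2 or len(set(pipeline.node_racks)) > 1


def pipelines_to_rebuild(pipelines, healthy_racks):
    """Point 4: single-rack pipelines that could be replaced now that 2+ racks are up."""
    if len(set(healthy_racks)) < 2:
        return []  # nothing better can be built yet
    return [p for p in pipelines if not is_rack_aware(p)]
```

In the startup sequence from the log, `ready_to_create` would return false while only the three /rack1 nodes are registered, and a periodic call to `pipelines_to_rebuild` would flag any single-rack pipeline once both racks are healthy.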