Hi All, I would like to revisit how network topology is handled during pipeline creation and destruction.
Pipelines are created by the RatisPipelineProvider, which delegates responsibility for picking the pipeline nodes to the PipelinePlacementPolicy. When picking the nodes for a pipeline, the PipelinePlacementPolicy checks whether a topology with more than 1 rack is present, and if so, tries to create pipelines spanning multiple racks. Otherwise it selects random nodes. This is the fallback mechanism, intended for clusters with a single rack or no topology configured.

As I have raised before, we have a few problems:

1) On cluster startup, pipeline creation is triggered as soon as nodes register. If at least 3 nodes from 1 rack register before any others, they can form a pipeline which is not rack aware. We have somewhat fixed this with safemode rules.

2) If the nodes per rack are not perfectly balanced, it is possible for 3 DNs in 1 rack to have capacity for more pipelines while all other nodes have none. In that case the fallback mechanism kicks in and a non-rack-aware pipeline is created.

3) If only 1 rack is available for some time (restart or rack outage), the cluster will create new pipelines on that rack, and these will never be destroyed, even when the missing rack returns to service.

Proposal 1:

If the cluster has multiple racks AND there are healthy nodes covering at least 2 racks, where healthy is defined as a node which is registered and not stale or dead, then we should not allow "fallback" pipelines (pipelines which span only 1 rack) to be created.

This means that on a badly configured cluster - eg Rack 1 = 10 nodes, Rack 2 = 1 node - the pipeline limit will be constrained by the capacity of that 1 node on rack 2. Even a setup like Rack 1 = 10 nodes, Rack 2 = 5 nodes would be constrained. IMO this constraint is better than creating non-rack-aware pipelines, as the cluster setup should be fixed. It should also handle the case where the cluster degrades to 1 rack, since the healthy node definition will notice that only 1 rack is alive.

It would be quite easy to implement this in PipelinePlacementPolicy#filterViableNodes, as we already get the list of healthy nodes there and then exclude overloaded nodes. A rough sketch of the check is below.
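To make the idea concrete, here is a minimal, self-contained sketch of the multi-rack check. The NodeInfo type and its fields are simplified stand-ins for the real DatanodeDetails / NodeManager APIs, which I have not mirrored exactly:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Sketch of the multi-rack check proposed for
 * PipelinePlacementPolicy#filterViableNodes. NodeInfo is a stand-in
 * for the real datanode/topology types.
 */
public class MultiRackCheckSketch {

  /** Minimal stand-in for a registered datanode with topology info. */
  static final class NodeInfo {
    final String rack;          // network location, e.g. "/rack1"
    final boolean staleOrDead;  // true if the node missed heartbeats

    NodeInfo(String rack, boolean staleOrDead) {
      this.rack = rack;
      this.staleOrDead = staleOrDead;
    }
  }

  /**
   * Returns true if fallback (single-rack) pipelines should be refused:
   * the cluster has multiple racks AND healthy (registered, not
   * stale/dead) nodes cover at least 2 of them.
   */
  static boolean shouldRejectFallback(List<NodeInfo> allNodes) {
    // Racks known to the topology, regardless of node health.
    Set<String> allRacks = allNodes.stream()
        .map(n -> n.rack)
        .collect(Collectors.toSet());
    // Racks covered by healthy nodes only.
    Set<String> healthyRacks = allNodes.stream()
        .filter(n -> !n.staleOrDead)
        .map(n -> n.rack)
        .collect(Collectors.toSet());
    return allRacks.size() > 1 && healthyRacks.size() > 1;
  }

  public static void main(String[] args) {
    List<NodeInfo> nodes = List.of(
        new NodeInfo("/rack1", false),
        new NodeInfo("/rack1", false),
        new NodeInfo("/rack1", false),
        new NodeInfo("/rack2", true));  // the only rack2 node is dead
    // Multi-rack cluster, but healthy nodes cover a single rack, so
    // fallback stays allowed: prints false.
    System.out.println(shouldRejectFallback(nodes));
  }
}
```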
Questions:

1. Pipeline creation does not consider capacity - do we need to consider capacity in the "healthy node" definition? Eg, extend it to nodes which are not stale or dead and have X bytes of available space? What if no nodes have enough space?

2. What happens to a pipeline if one node in the pipeline runs out of space? Will this be detected and the pipeline destroyed?

Proposal 2:

The PipelineManager already has a thread called the BackgroundPipelineCreator. As well as creating pipelines, I think it should check existing pipelines using a rule similar to proposal 1. Ie, if the cluster has multiple racks and there are healthy nodes spanning more than 1 rack, it should destroy non-rack-aware pipelines. This would handle the case where the cluster degrades to a single rack, single-rack pipelines get created, and the cluster then returns to multi-rack. It would also allow any non-rack-aware pipelines created at startup to be cleaned up.

Questions:

1. Should the pipeline destruction be throttled? Consider the case where the cluster goes from 2 racks to 1 rack. All nodes on the remaining rack will be involved in non-rack-aware pipelines up to their pipeline limit. When the second rack comes back online, no new pipelines can be created until we free capacity on the existing nodes.

2. Assuming the destruction is throttled, I would welcome ideas on metrics to drive the throttle that work for small and large clusters alike. Perhaps something as simple as "destroy at least 1 and at most X% of bad pipelines, then run createPipelines, sleep, repeat" - see the sketch in the P.S. below. Note that in a very small cluster - 6 nodes, 3 nodes per rack - if 1 rack is down and its 3 nodes are in a pipeline, we cannot create a new pipeline without briefly dropping to zero pipelines on the cluster.

I would like to get some agreement on the proposals before making any code changes. Please let me know if there is anything I have missed or other potential problems this could introduce.

Thanks, Stephen.
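P.S. For illustration, here is a rough sketch of the "destroy a bounded batch, create, sleep, repeat" loop from question 2 above. All names here (badPipelines, destroyPipeline, createPipelines, maxDestroyPercent) are hypothetical placeholders, not the real BackgroundPipelineCreator API:

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/**
 * Sketch of a throttled cleanup loop for non-rack-aware pipelines.
 * Each round destroys at least 1 and at most maxDestroyPercent of the
 * bad pipelines, triggers pipeline creation, then sleeps.
 */
public class ThrottledPipelineCleanupSketch {

  private final Supplier<List<String>> badPipelines; // IDs of non-rack-aware pipelines
  private final int maxDestroyPercent;               // e.g. 10 => at most 10% per round
  private final long sleepMillis;

  public ThrottledPipelineCleanupSketch(Supplier<List<String>> badPipelines,
      int maxDestroyPercent, long sleepMillis) {
    this.badPipelines = badPipelines;
    this.maxDestroyPercent = maxDestroyPercent;
    this.sleepMillis = sleepMillis;
  }

  public void run() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      List<String> bad = badPipelines.get();
      if (!bad.isEmpty()) {
        // Destroy at least 1, at most maxDestroyPercent of the bad pipelines.
        int quota = Math.max(1, bad.size() * maxDestroyPercent / 100);
        for (String id : bad.subList(0, Math.min(quota, bad.size()))) {
          destroyPipeline(id);
        }
        // Let the creator fill the freed capacity, ideally across racks.
        createPipelines();
      }
      TimeUnit.MILLISECONDS.sleep(sleepMillis);
    }
  }

  // Placeholders for the real SCM calls.
  private void destroyPipeline(String id) {
    System.out.println("destroying pipeline " + id);
  }

  private void createPipelines() {
    System.out.println("running createPipelines");
  }
}
```

The Math.max(1, ...) is there so small clusters still make forward progress when X% of a handful of bad pipelines rounds down to zero.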