Hi All, I would like to revisit how network topology is handled during pipeline creation and destruction.
Pipelines are created by the RatisPipelineProvider, which delegates responsibility for picking the pipeline nodes to the PipelinePlacementPolicy. When picking the nodes for a pipeline, the PipelinePlacementPolicy checks whether a topology with more than 1 rack is present, and if so, tries to create pipelines spanning multiple racks. Otherwise it selects random nodes. This is the fallback mechanism, intended for clusters with a single rack or no topology configured.

As I have raised before, we have a few problems:

1) On cluster startup, pipeline creation is triggered as soon as nodes register. If at least 3 nodes from 1 rack register before any others, they can form a pipeline which is not rack aware. We have somewhat fixed this with safemode rules.

2) If the nodes per rack are not perfectly balanced, it is possible for 3 DNs in 1 rack to have capacity for more pipelines while all other nodes have none. In that case the fallback mechanism kicks in and a non-rack-aware pipeline is created.

3) If only 1 rack is available for some time (restart or rack outage), the cluster will create new pipelines on that rack, and these will never be destroyed, even when the missing rack returns to service.

Proposal 1:

If the cluster has multiple racks AND there are healthy nodes covering at least 2 racks, where healthy is defined as a node which is registered and not stale or dead, then we should not allow "fallback" pipelines (pipelines which span only 1 rack) to be created.

This means that on a badly configured cluster - eg Rack 1 = 10 nodes, Rack 2 = 1 node - the pipeline limit will be constrained by the capacity of that 1 node on rack 2. Even a setup like Rack 1 = 10 nodes, Rack 2 = 5 nodes would be constrained. IMO this constraint is better than creating non-rack-aware pipelines, as the cluster setup should be fixed. It should also handle the case where the cluster degrades to 1 rack, since the healthy node definition will notice that only 1 rack is alive.

It would be quite easy to implement this in PipelinePlacementPolicy#filterViableNodes, as we already get the list of healthy nodes there and then exclude overloaded nodes. A rough sketch of the check is below.
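To make the idea concrete, here is a minimal, self-contained sketch of the multi-rack check. The NodeInfo type and its fields are simplified stand-ins for the real DatanodeDetails / NodeManager APIs, which I have not mirrored exactly:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Sketch of the multi-rack check proposed for
 * PipelinePlacementPolicy#filterViableNodes. NodeInfo is a stand-in
 * for the real datanode/topology types.
 */
public class MultiRackCheckSketch {

  /** Minimal stand-in for a registered datanode with topology info. */
  static final class NodeInfo {
    final String rack;          // network location, e.g. "/rack1"
    final boolean staleOrDead;  // true if the node missed heartbeats

    NodeInfo(String rack, boolean staleOrDead) {
      this.rack = rack;
      this.staleOrDead = staleOrDead;
    }
  }

  /**
   * Returns true if fallback (single-rack) pipelines should be refused:
   * the cluster has multiple racks AND healthy (registered, not
   * stale/dead) nodes cover at least 2 of them.
   */
  static boolean shouldRejectFallback(List<NodeInfo> allNodes) {
    // Racks known to the topology, regardless of node health.
    Set<String> allRacks = allNodes.stream()
        .map(n -> n.rack)
        .collect(Collectors.toSet());
    // Racks covered by healthy nodes only.
    Set<String> healthyRacks = allNodes.stream()
        .filter(n -> !n.staleOrDead)
        .map(n -> n.rack)
        .collect(Collectors.toSet());
    return allRacks.size() > 1 && healthyRacks.size() > 1;
  }

  public static void main(String[] args) {
    List<NodeInfo> nodes = List.of(
        new NodeInfo("/rack1", false),
        new NodeInfo("/rack1", false),
        new NodeInfo("/rack1", false),
        new NodeInfo("/rack2", true));  // the only rack2 node is dead
    // Multi-rack cluster, but healthy nodes cover a single rack, so
    // fallback stays allowed: prints false.
    System.out.println(shouldRejectFallback(nodes));
  }
}
```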
Questions:

1. Pipeline creation does not consider capacity - do we need to consider capacity in the "healthy node" definition? Eg, extend it to nodes which are not stale or dead and have X bytes of available space? What if no nodes have enough space?

2. What happens to a pipeline if one node in the pipeline runs out of space? Will this be detected and the pipeline destroyed?

Proposal 2:

The PipelineManager already has a thread called the BackgroundPipelineCreator. As well as creating pipelines, I think it should check existing pipelines using a rule similar to proposal 1. Ie, if the cluster has multiple racks and there are healthy nodes spanning more than 1 rack, it should destroy non-rack-aware pipelines. This would handle the case where the cluster degrades to a single rack, single-rack pipelines get created, and the cluster then returns to multi-rack. It would also allow any non-rack-aware pipelines created at startup to be cleaned up.

Questions:

1. Should the pipeline destruction be throttled? Consider the case where the cluster goes from 2 racks to 1 rack. All nodes on the remaining rack will be involved in non-rack-aware pipelines up to their pipeline limit. When the second rack comes back online, no new pipelines can be created until we free capacity on the existing nodes.

2. Assuming the destruction is throttled, I would welcome ideas on metrics to drive the throttle that work for small and large clusters alike. Perhaps something as simple as "destroy at least 1 and at most X% of bad pipelines, then run createPipelines, sleep, repeat" - see the sketch in the P.S. below. Note that in a very small cluster - 6 nodes, 3 nodes per rack - if 1 rack is down and its 3 nodes are in a pipeline, we cannot create a new pipeline without briefly dropping to zero pipelines on the cluster.

I would like to get some agreement on the proposals before making any code changes. Please let me know if there is anything I have missed or other potential problems this could introduce.

Thanks, Stephen.
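P.S. For illustration, here is a rough sketch of the "destroy a bounded batch, create, sleep, repeat" loop from question 2 above. All names here (badPipelines, destroyPipeline, createPipelines, maxDestroyPercent) are hypothetical placeholders, not the real BackgroundPipelineCreator API:

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/**
 * Sketch of a throttled cleanup loop for non-rack-aware pipelines.
 * Each round destroys at least 1 and at most maxDestroyPercent of the
 * bad pipelines, triggers pipeline creation, then sleeps.
 */
public class ThrottledPipelineCleanupSketch {

  private final Supplier<List<String>> badPipelines; // IDs of non-rack-aware pipelines
  private final int maxDestroyPercent;               // e.g. 10 => at most 10% per round
  private final long sleepMillis;

  public ThrottledPipelineCleanupSketch(Supplier<List<String>> badPipelines,
      int maxDestroyPercent, long sleepMillis) {
    this.badPipelines = badPipelines;
    this.maxDestroyPercent = maxDestroyPercent;
    this.sleepMillis = sleepMillis;
  }

  public void run() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      List<String> bad = badPipelines.get();
      if (!bad.isEmpty()) {
        // Destroy at least 1, at most maxDestroyPercent of the bad pipelines.
        int quota = Math.max(1, bad.size() * maxDestroyPercent / 100);
        for (String id : bad.subList(0, Math.min(quota, bad.size()))) {
          destroyPipeline(id);
        }
        // Let the creator fill the freed capacity, ideally across racks.
        createPipelines();
      }
      TimeUnit.MILLISECONDS.sleep(sleepMillis);
    }
  }

  // Placeholders for the real SCM calls.
  private void destroyPipeline(String id) {
    System.out.println("destroying pipeline " + id);
  }

  private void createPipelines() {
    System.out.println("running createPipelines");
  }
}
```

The Math.max(1, ...) is there so small clusters still make forward progress when X% of a handful of bad pipelines rounds down to zero.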