Andrew Wong created KUDU-3127:
---------------------------------
Summary: Have replica placement and rebalancing consider
non-uniform hardware
Key: KUDU-3127
URL: https://issues.apache.org/jira/browse/KUDU-3127
Project: Kudu
Issue Type: Improvement
Components: CLI, master
Reporter: Andrew Wong
We've seen multiple deployments suffer because data centers don't always have
uniform hardware. Oftentimes, racks are composed of whatever hardware we can
salvage from other projects. As such, Kudu's assumption that all tablet servers
should be treated equally (aside from location awareness) can be a bad one.
There are a few pieces to making this better:
* Having Kudu determine the relative capacity of each tablet server (either
automatically, or as input by an operator)
* Updating the replica placement policy to account for capacity across tablet
servers
* Bonus: have Kudu account for the space currently used on each tablet server
Some things that might be worth considering:
* It seems reasonable to assume that data directories are independent of one
another, so we should be able to determine with relative ease the total
capacity of a server by summing the capacities of its data directories. This
doesn't account for colocated WAL directories, but that might be a fine
limitation, since we expect WAL usage to vary wildly as ingest workloads vary.
The capacity could be heartbeated to masters periodically, or fetched via RPC
by rebalancer tooling.
* Updating the placement policy seems trickier, since there are a lot of nice
properties that come from using the PO2C (power-of-two-choices) algorithm (e.g.
eventual fixing of skew) and from assuming that all tablets have equal weight
(e.g. it's harder to fall into the trap of moving a replica, only to move it
back). Some variant of PO2C based on _available space_ instead of replica count
might be worth considering for initial placement and for defining balance.
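
To make the two considerations above concrete, here is a rough sketch of what
a PO2C variant keyed on available space might look like. It is not Kudu code:
the DataDir and TabletServer classes and the pick_server and place_replicas
functions are hypothetical names invented for illustration. It assumes that
data directories are independent, and it ignores WAL directories and location
awareness entirely.

```python
import random
from dataclasses import dataclass, field


@dataclass
class DataDir:
    """One data directory on a tablet server (hypothetical model)."""
    capacity_bytes: int
    used_bytes: int


@dataclass
class TabletServer:
    """A tablet server described only by its data directories (hypothetical)."""
    uuid: str
    data_dirs: list = field(default_factory=list)

    def total_capacity(self) -> int:
        # Assumes data directories are independent, so capacities simply add up.
        return sum(d.capacity_bytes for d in self.data_dirs)

    def available_space(self) -> int:
        # Clamp at zero in case a directory reports more used than capacity.
        return sum(max(d.capacity_bytes - d.used_bytes, 0) for d in self.data_dirs)


def pick_server(servers, rng=random):
    """Sample two distinct candidates and keep the one with more available space."""
    if not servers:
        raise ValueError("no tablet servers to choose from")
    if len(servers) == 1:
        return servers[0]
    a, b = rng.sample(servers, 2)
    return a if a.available_space() >= b.available_space() else b


def place_replicas(servers, num_replicas, rng=random):
    """Choose num_replicas distinct servers, one two-choice pick at a time."""
    if num_replicas > len(servers):
        raise ValueError("not enough tablet servers for the requested replicas")
    remaining = list(servers)
    chosen = []
    for _ in range(num_replicas):
        server = pick_server(remaining, rng)
        chosen.append(server)
        remaining.remove(server)  # never place two replicas on one server
    return chosen
```

Whether to compare absolute available space, as done here, or the fraction of
capacity that is still free is an open design choice. The absolute form steers
new replicas toward larger servers, while the fractional form would aim for
equal percentage utilization across servers.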