[
https://issues.apache.org/jira/browse/KUDU-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112718#comment-17112718
]
Andrew Wong commented on KUDU-3127:
-----------------------------------
Another thing to be wary of is that any capacity-based placement policy for
initial placement may fall victim to the following trap:
* For the sake of example, let's say we use PO2C with available space as our
metric.
* When creating a table, the replica creation doesn't affect disk space much
since replicas may remain empty long before users start writing to them.
* As such, when adding many tablets replicas, the replicas end up skewed
against a snapshot of the available space across the cluster, and will continue
to be added with this skew until data is actually written to the newly created
tables.
* As a more concrete example, say we had a single replica on one tserver that
took 10MB, and we add 100 new replicas that will eventually take 10GB each, but
for the time being are left empty. The tserver with the initial 10MB will
likely remain mostly empty, while other tservers eventually fill up.
That isn't to say we shouldn't do this, if auto-rebalancing also accounts for
available space, this initial placing may not matter much.
FWIW, tablet data dir group selection also has this problem. The current
implementation uses PO2C and considers available space in case two data
directory candidates have the same number of tablet replicas assigned, and
otherwise use replica count. That said, it's probably more common for there to
be heterogeneous servers in a cluster vs heterogeneous devices in a server.
> Have replica placement and rebalancing consider non-uniform hardware
> --------------------------------------------------------------------
>
> Key: KUDU-3127
> URL: https://issues.apache.org/jira/browse/KUDU-3127
> Project: Kudu
> Issue Type: Improvement
> Components: CLI, master
> Reporter: Andrew Wong
> Priority: Major
>
> We've seen multiple deployments suffer from the fact of life that data
> centers don't always have uniform hardware. Often times, racks are comprised
> of whatever hardware we can salvage from other projects. As such, Kudu's
> assumption that all tablet servers should be treated equally (sans location
> awareness) can be a bad one.
> There are a few pieces to making this better:
> * Having Kudu determine the relative capacities of each tablet servers
> (either automatically, or as input by an operator)
> * Updating the replica placement policy to account for capacity across
> tablet servers
> * Bonus: have Kudu account for the current size used on each tablet server
> Some things that might be worth considering:
> * It seems reasonable to assume that each data directory is independent of
> one another, so we should be able to determine with relative ease the total
> capacity of a server by aggregating the total capacities of its data
> directories. This doesn't account for colocated WAL directories, but that
> might be a fine limitation, since we expect WAL usage to vary wildly as
> ingest workloads vary. The capacity could be heartbeated to masters
> periodically, or fetched via RPC by rebalancer tooling.
> * Updating the placement policy seems trickier, since there are a lot of
> nice properties with using the PO2C algorithm (e.g. eventual fixing of skew),
> and with assuming that all tablets have equal weight (e.g. it's harder to
> fall into the trap of moving a replica, only to move it back). Some variant
> of PO2C, but based on _available space_ instead of replica count might be
> worth considering for initial placement and for defining balance.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)