[
https://issues.apache.org/jira/browse/HBASE-22686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884168#comment-16884168
]
Andrew Purtell edited comment on HBASE-22686 at 7/12/19 9:09 PM:
-----------------------------------------------------------------
{quote}For non-default number of {{hbase.regionserver.wal.max.splitters}} it
could result in a lot of requests to ZK only to find that those tasks have
already been taken.
{quote}
Factoring in the removal of that enumeration of the regionserver znode children
on each pass through the loop (one getChildren call issued just to determine the
number of live regionservers), I think the total work will be less.
{quote}each regionserver doesn't try to take more than its fair share.
{quote}
This I don't understand. The default limit is two, so now a regionserver can
grab two tasks instead of one. That is hardly antithetical to the notion of fair
share. If we configure a regionserver to accept even more concurrent work than
this, its "fair share" is still exactly what we are configuring, by intent.
That said, abstractly speaking, I can see the usefulness of a load-leveling
strategy for even work distribution over the cluster, but we still have one by
default because the process of acquiring work is probabilistic. Maybe if someone
configures a regionserver to allow something like 100 concurrent split tasks,
uneven distribution could be a problem. That is possible but very unlikely, as
it's not what a reasonable cluster operator would choose.
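Purely as an illustration of that probabilistic leveling (a toy model, not HBase
code; all numbers are made up): if every live regionserver races for each task
and an arbitrary racer wins, claims spread out roughly evenly across the fleet.
{code:java}
import java.util.Random;

// Toy model: each of N regionservers races for each task and a random racer
// wins, approximating the unsynchronized grab in the real worker loop.
public class SplitRaceSim {
  public static void main(String[] args) {
    final int servers = 10;    // hypothetical fleet size
    final int tasks = 10_000;  // hypothetical split backlog
    int[] claimed = new int[servers];
    Random rnd = new Random(42);
    for (int t = 0; t < tasks; t++) {
      claimed[rnd.nextInt(servers)]++; // arbitrary race winner
    }
    for (int rs = 0; rs < servers; rs++) {
      System.out.printf("rs%d claimed %d tasks%n", rs, claimed[rs]);
    }
    // Each server lands near tasks/servers = 1,000 claims: the
    // "load leveling by default" effect described above.
  }
}
{code}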
I have a +1 on the PR, so I intend to commit this. Let me know if there are any
objections.
> ZkSplitLogWorkerCoordination doesn't allow a regionserver to pick up all of
> the split work it is capable of
> -----------------------------------------------------------------------------------------------------------
>
> Key: HBASE-22686
> URL: https://issues.apache.org/jira/browse/HBASE-22686
> Project: HBase
> Issue Type: Bug
> Reporter: Andrew Purtell
> Assignee: Andrew Purtell
> Priority: Major
> Fix For: 3.0.0, 1.5.0, 2.3.0, 2.2.1, 1.3.6, 1.4.11, 2.1.7
>
> Attachments: HBASE-22686-branch-1.patch, HBASE-22686.patch
>
>
> A region hosted by a crashed regionserver cannot be reassigned until the
> crashed regionserver's write-ahead logs have been processed and split into
> per-region recovered edits files. Reassignment of a region from a crashed
> server will be held up by the distributed split work backlog. Every
> regionserver runs a background daemon thread that manages the acquisition and
> execution of distributed log split tasks. This thread registers a watcher on
> a znode managed by the master. When the master, while processing a server
> shutdown, crash, or cluster restart, detects the presence of unprocessed WAL
> files, it will register those WAL files for processing under the
> znode. One or more live regionservers will attempt to get an exclusive lock
> on an entry. One of them wins, splits the WAL file, deletes the entry, and
> then acquires more work or goes back to sleep if the worklist is empty. A
> regionserver can acquire at most a fixed number of log split tasks determined
> by configuration, hbase.regionserver.wal.max.splitters (default 2). If the
> number of entries/logs to process exceeds the number of available split
> workers in the cluster, perhaps due to the correlated failure of a
> significant subset of the fleet, then splitting work will fall behind.
> Regions may remain in RIT until the backlog is cleared.
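> As a rough sketch of the claim step (assumptions: a raw ZooKeeper client and a
> hypothetical helper name; the real coordination lives in
> ZkSplitLogWorkerCoordination behind HBase's ZKUtil wrappers), a worker wins a
> task by rewriting the task znode's data conditioned on the version it read, so
> exactly one racer can succeed:
> {code:java}
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> public final class TaskClaimSketch {
>   /** Claim a split task znode with a version-checked write; one racer wins. */
>   static boolean tryClaimTask(ZooKeeper zk, String taskZnode, byte[] ownedState)
>       throws InterruptedException {
>     try {
>       Stat stat = new Stat();
>       zk.getData(taskZnode, false, stat);                   // read state + version
>       zk.setData(taskZnode, ownedState, stat.getVersion()); // CAS on that version
>       return true;   // we own the task now
>     } catch (KeeperException.BadVersionException e) {
>       return false;  // another regionserver claimed the task first
>     } catch (KeeperException e) {
>       return false;  // treat other ZK errors as "not ours"
>     }
>   }
> }
> {code}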
> However, the regionserver side coordination logic -
> ZkSplitLogWorkerCoordination - only allows a regionserver to grab one task
> at a time. Nearby javadoc says "This policy puts an upper-limit on the
> number of simultaneous log splitting that could be happening in a cluster."
> That upper limit will be the number of currently live regionservers. I don't
> fully understand why this is necessary or appropriate, because a regionserver
> can handle more than one task at once and, in fact, the maximum number of
> concurrent split tasks it can accept is configurable.
> {code:java}
> /**
>  * This function calculates how many splitters it could create based on expected
>  * average tasks per RS and the hard limit upper bound (maxConcurrentTasks) set
>  * by configuration. <br>
>  * At any given time, a RS allows spawning MIN(Expected Tasks/RS, Hard Upper Bound)
>  * @param numTasks current total number of available tasks
>  */
> private int calculateAvailableSplitters(int numTasks) {
>   // at least one RS (itself) available
>   int availableRSs = 1;
>   try {
>     List<String> regionServers = ZKUtil.listChildrenNoWatch(watcher, watcher.rsZNode);
>     availableRSs = Math.max(availableRSs, (regionServers == null) ? 0 : regionServers.size());
>   } catch (KeeperException e) {
>     // do nothing
>     LOG.debug("getAvailableRegionServers got ZooKeeper exception", e);
>   }
>   // expected tasks per RS = ceil(numTasks / availableRSs), floored at one
>   int expectedTasksPerRS = (numTasks / availableRSs) + ((numTasks % availableRSs == 0) ? 0 : 1);
>   expectedTasksPerRS = Math.max(1, expectedTasksPerRS);
>   // calculate how many more splitters we could spawn
>   return Math.min(expectedTasksPerRS, maxConcurrentTasks) - this.tasksInProgress.get();
> }
> {code}
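> To make the arithmetic concrete (numbers are hypothetical): with numTasks = 10
> and 4 live regionservers, expectedTasksPerRS = ceil(10 / 4) = 3; if
> maxConcurrentTasks = 2 and one task is already in progress, the method returns
> min(3, 2) - 1 = 1. Conversely, with numTasks = 5 and 10 live regionservers,
> expectedTasksPerRS = 1, so even a worker configured with a large
> maxConcurrentTasks would be allowed only one task.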
> Shouldn't this simply be:
> {code:java}
> private int calculateAvailableSplitters() {
>   return maxConcurrentTasks - tasksInProgress.get();
> }
> {code}
> ?
> This is branch-1.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)