[ https://issues.apache.org/jira/browse/HBASE-22686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884168#comment-16884168 ]

Andrew Purtell edited comment on HBASE-22686 at 7/12/19 9:09 PM:
-----------------------------------------------------------------

{quote}For non-default number of {{hbase.regionserver.wal.max.splitters}} it 
could result in a lot of requests to ZK only to find that those tasks have 
already been taken.
{quote}
Factoring in the removal of that enumeration of the regionserver znode children on every pass 
through the loop, done just to determine the number of live regionservers, I think the total 
work will be less.
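To put rough numbers on that (the figures below are purely illustrative, not measurements):
{code:java}
// Purely illustrative arithmetic; the attempt count is made up.
public class ZkCallCount {
  public static void main(String[] args) {
    int acquisitionAttempts = 1000;      // hypothetical task-acquisition attempts during a split storm
    // Today: each attempt enumerates the rs znode (getChildren) and then tries to grab a task.
    int zkCallsToday = acquisitionAttempts * 2;
    // With the patch: only the task grab itself remains per attempt.
    int zkCallsProposed = acquisitionAttempts;
    System.out.println("ZooKeeper round trips saved: " + (zkCallsToday - zkCallsProposed));
  }
}
{code}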
{quote}each regionserver doesn't try to take more than its fair share.
{quote}
This I don't understand. The default limit is two. So now a regionserver can 
grab two tasks instead of one. Hardly antithetical to the notion of fair share. 

If we configure a regionserver to accept even more concurrent work than this, 
its "fair share" is still exactly what we are configuring, by intent.

That said, abstractly speaking, I can see the usefulness of a load leveling strategy for even 
work distribution over the cluster, but we still have one by default, because the process of 
acquiring work is probabilistic. Maybe if someone configures a regionserver to allow something 
like 100 concurrent split tasks, uneven distribution could become a problem. That is possible 
but very unlikely, as it's not what a reasonable cluster operator would choose.
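As a quick sanity check on the "probabilistic acquisition levels the load" point, a toy 
simulation (entirely hypothetical numbers, unrelated to the actual ZK race or to the patch):
{code:java}
import java.util.Random;

// Toy model: each task goes to a uniformly random regionserver that still has a free slot,
// standing in for the winner of the ZK race. All numbers are made up for illustration.
public class SplitRaceSim {
  public static void main(String[] args) {
    int regionServers = 10;
    int capacityPerRS = 2;   // hbase.regionserver.wal.max.splitters default
    int tasks = 15;          // outstanding WAL split tasks (must not exceed total capacity here)
    int[] assigned = new int[regionServers];
    Random rnd = new Random(42);
    for (int t = 0; t < tasks; t++) {
      int rs;
      do {
        rs = rnd.nextInt(regionServers);       // random winner of the race
      } while (assigned[rs] >= capacityPerRS); // full regionservers sit this one out
      assigned[rs]++;
    }
    for (int i = 0; i < regionServers; i++) {
      System.out.println("RS" + i + " -> " + assigned[i] + " task(s)");
    }
  }
}
{code}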

I have a +1 on the PR, so I intend to commit this. Let me know if there are any 
objections.



> ZkSplitLogWorkerCoordination doesn't allow a regionserver to pick up all of 
> the split work it is capable of
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22686
>                 URL: https://issues.apache.org/jira/browse/HBASE-22686
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Major
>             Fix For: 3.0.0, 1.5.0, 2.3.0, 2.2.1, 1.3.6, 1.4.11, 2.1.7
>
>         Attachments: HBASE-22686-branch-1.patch, HBASE-22686.patch
>
>
> A region hosted by a crashed regionserver cannot be reassigned until the 
> crashed regionserver's write-ahead logs have been processed and split into 
> per-region recovered edits files. Reassignment of a region from a crashed 
> server will be held up by the distributed split work backlog. Every 
> regionserver runs a background daemon thread that manages the acquisition and 
> execution of distributed log split tasks. This thread registers a watcher on 
> a znode managed by the master. When the master is processing a server shutdown, crash, or 
> cluster restart and detects unprocessed WAL files, it registers those WAL files for processing 
> under this znode. One or more live regionservers then attempt to take an exclusive lock on an 
> entry; one of them wins, splits the WAL file, deletes the entry, and then acquires more work or 
> goes back to sleep if the worklist is empty. (A rough sketch of this acquisition pattern 
> appears at the end of this description.) A 
> regionserver can acquire at most a fixed number of log split tasks determined 
> by configuration, hbase.regionserver.wal.max.splitters (default 2). If the 
> number of entries/logs to process exceeds the number of available split 
> workers in the cluster, perhaps due to the correlated failure of a 
> significant subset of the fleet, then splitting work will fall behind. 
> Regions may remain in RIT until the backlog is cleared.
> However, the regionserver side coordination logic, ZkSplitLogWorkerCoordination, only allows a 
> regionserver to grab one task at a time. Nearby javadoc says "This policy puts an upper-limit 
> on the number of simultaneous log splitting that could be happening in a cluster." That upper 
> limit will be the number of currently live regionservers. I don't feel I understand exactly 
> why this is necessary or appropriate, because a regionserver can handle more than one task at 
> once, and in fact the maximum number of concurrent split tasks it can accept is configurable.
> {code:java}
>   /**
>    * This function calculates how many splitters it could create based on expected average
>    * tasks per RS and the hard limit upper bound (maxConcurrentTasks) set by configuration. <br>
>    * At any given time, a RS allows spawning MIN(Expected Tasks/RS, Hard Upper Bound)
>    * @param numTasks current total number of available tasks
>    */
>   private int calculateAvailableSplitters(int numTasks) {
>     // at least one RS (itself) available
>     int availableRSs = 1;
>     try {
>       List<String> regionServers =
>           ZKUtil.listChildrenNoWatch(watcher, watcher.rsZNode);
>       availableRSs = Math.max(availableRSs, (regionServers == null) ? 0 : regionServers.size());
>     } catch (KeeperException e) {
>       // do nothing
>       LOG.debug("getAvailableRegionServers got ZooKeeper exception", e);
>     }
>     int expectedTasksPerRS = (numTasks / availableRSs) + ((numTasks % availableRSs == 0) ? 0 : 1);
>     expectedTasksPerRS = Math.max(1, expectedTasksPerRS); // at least one
>     // calculate how many more splitters we could spawn
>     return Math.min(expectedTasksPerRS, maxConcurrentTasks) - this.tasksInProgress.get();
>   }
> {code}
> Shouldn't this simply be:
> {code:java}
>   private int calculateAvailableSplitters() {
>     return maxConcurrentTasks - tasksInProgress.get();
>   }
> {code}
> ?
> This is branch-1.
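> For reference, a rough sketch of the acquire-an-entry pattern described above, written against 
> the raw ZooKeeper client rather than HBase's actual SplitLogTask/ZkSplitLogWorkerCoordination 
> code (the znode path and payload below are made up for illustration):
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.util.List;
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
> 
> // Illustration only: each worker races to "own" a task znode with a version-checked setData;
> // exactly one racer wins each entry, which is the exclusive lock described above.
> public class SplitTaskGrabber {
>   public static void main(String[] args) throws Exception {
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
>     String tasksZnode = "/hbase/splitWAL";  // hypothetical parent znode holding the worklist
>     byte[] ownedMarker = "OWNED by this-regionserver".getBytes(StandardCharsets.UTF_8); // hypothetical payload
>     List<String> tasks = zk.getChildren(tasksZnode, false);
>     for (String task : tasks) {
>       Stat stat = new Stat();
>       zk.getData(tasksZnode + "/" + task, false, stat);
>       try {
>         // Succeeds only if nobody changed the entry since we read it: we won the race.
>         zk.setData(tasksZnode + "/" + task, ownedMarker, stat.getVersion());
>         System.out.println("won " + task + "; would split the WAL here, then delete the entry");
>         break;
>       } catch (KeeperException.BadVersionException lostRace) {
>         // Another regionserver grabbed this entry first; try the next one.
>       }
>     }
>     zk.close();
>   }
> }
> {code}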


