[
https://issues.apache.org/jira/browse/HBASE-10924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksandr Shulman updated HBASE-10924:
--------------------------------------
Status: Patch Available (was: Open)
> [region_mover]: Adjust region_mover script to retry unloading a server a
> configurable number of times in case of region splits/merges
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-10924
> URL: https://issues.apache.org/jira/browse/HBASE-10924
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 0.94.15
> Reporter: Aleksandr Shulman
> Assignee: Aleksandr Shulman
> Labels: region_mover, rolling_upgrade
> Fix For: 0.94.20
>
> Attachments: HBASE-10924-0.94-v2.patch, HBASE-10924-0.94-v3.patch
>
>
> Observed behavior:
> In about 5% of cases, my rolling upgrade tests fail because regions get stuck
> during a region server unload. My theory is that this occurs when region
> assignment information changes between the time the region list is generated
> and the time a given region is moved.
> A split or merge is an example of such a change.
> Example:
> Regionserver A has 100 regions (#0-#99). The balancer is turned off and the
> region_mover script is called to unload this regionserver. The script
> generates the list of 100 regions to be moved and then proceeds down that
> list, moving the regions off in series. However, region #84 splits into two
> daughter regions while regions #0-#83 are being moved.
> The script gets stuck trying to move #84, times out, and then the failure
> bubbles up (attempt 1 failed).
> Proposed solution:
> This specific failure mode should be caught, and the region_mover script
> should then regenerate the region list and attempt to move off all regions
> still on the server. On this second attempt it will have 16+1 (due to the
> split) regions to move. There is a good chance that it will be able to
> move all 17 off without issues. However, should it encounter the same issue
> (attempt 2 failed), it will retry again. This process will continue until the
> maximum number of unload attempts has been reached.
> This is not foolproof, but suppose for the sake of argument that 5% of
> unload attempts hit this issue. With a maximum of 3 attempts, and assuming
> attempts fail independently, the unload failure probability drops from 0.05
> to 0.000125 (0.05^3).
> Next steps:
> I am looking for feedback on this approach. If it seems like a sensible
> approach, I will create a strawman patch and test it.
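
The retry behaviour proposed above could be sketched roughly as follows. This is an illustrative Python sketch only, not the region_mover script and not the contents of the attached patches. The names unload_with_retries, list_regions, move_region, RegionMoveError and all_attempts_fail_probability are hypothetical, and the sketch assumes that a stuck or timed-out move surfaces as an exception and that the region list is regenerated at the start of every attempt.

```python
from typing import Callable, Iterable, List, Optional


class RegionMoveError(Exception):
    """Hypothetical error: a single region move failed or timed out."""


def unload_with_retries(
    list_regions: Callable[[], Iterable[str]],
    move_region: Callable[[str], None],
    max_attempts: int = 3,
) -> int:
    """Try to unload a server, regenerating the region list on each attempt.

    Returns the number of the attempt that succeeded. Raises RegionMoveError
    once max_attempts attempts have failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[RegionMoveError] = None
    for attempt in range(1, max_attempts + 1):
        # Re-read the list so that daughters of a split (or the result of a
        # merge) replace the stale entry that caused the previous failure.
        regions: List[str] = list(list_regions())
        try:
            for region in regions:
                move_region(region)
            return attempt
        except RegionMoveError as err:
            # Remember the failure and start over with a fresh region list.
            last_error = err

    raise RegionMoveError(
        f"unload failed after {max_attempts} attempts"
    ) from last_error


def all_attempts_fail_probability(p_fail: float, max_attempts: int) -> float:
    """Probability that every attempt fails, assuming independent attempts."""
    return p_fail ** max_attempts
```

With p_fail = 0.05 and max_attempts = 3, the second helper reproduces the 0.05^3 estimate from the description. That estimate holds only if attempts fail independently, which a table that splits frequently may not satisfy.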
--
This message was sent by Atlassian JIRA
(v6.2#6252)