[
https://issues.apache.org/jira/browse/HBASE-28533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Roudnitsky updated HBASE-28533:
--------------------------------------
Description:
Depending on where the split procedure fails, SplitTableRegionProcedure
rollback can leave the parent region's RegionStateNode in SPLITTING after
rollback is complete, even though the parent region is still online on the
assigned region server. This leaves the active HMaster believing that the
parent region is offline according to its RegionStates, and causes subsequent
procedures that require the region to be online, such as merge/split/move, to
fail to start. One workaround is to restart the active HMaster to reset the
in-memory record of region states.
Two scenarios where this can happen:
* If we get to SPLIT_TABLE_REGION_CLOSE_PARENT_REGION and the parent region has
a replica which is in transition, the unassign procedure in that step is never
created/rolled back and we are left with the parent region state in SPLITTING.
* If region quotas are enabled and a split is run for a region whose namespace
is at its maximum region quota limit, we will fail in
SPLIT_TABLE_REGION_PRE_OPERATION with QuotaExceededException and we are left
with the parent region state in SPLITTING.
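The quota scenario can be illustrated with a small standalone model. This is only a sketch and not HBase code: the class SplitRollbackSketch, its nested types, and the revertStateOnRollback flag are hypothetical names invented for illustration, and the quota check is reduced to "the namespace is already at its limit". It shows how a rollback that does not undo the in-memory state change leaves the region looking unavailable to later procedures.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model (NOT HBase code) of a split whose rollback may forget the region state. */
public class SplitRollbackSketch {
    enum RegionState { OPEN, SPLITTING }
    enum Step { PREPARE, PRE_OPERATION }

    /** Hypothetical stand-in for the master's in-memory record of one region. */
    static final class RegionNode {
        RegionState state = RegionState.OPEN;
    }

    /** Hypothetical stand-in for a quota failure. */
    static final class QuotaExceeded extends RuntimeException {
        QuotaExceeded(String message) { super(message); }
    }

    /**
     * Runs the first two steps of the modeled split.
     * Returns true if the quota check passed, false if the split was rolled back.
     * revertStateOnRollback is an illustrative switch, not a real HBase setting.
     */
    static boolean runSplit(RegionNode parent, int regionCount, int maxRegions,
                            boolean revertStateOnRollback) {
        Deque<Step> executed = new ArrayDeque<>();
        try {
            // Step 1: mark the parent as SPLITTING in memory only.
            executed.push(Step.PREPARE);
            parent.state = RegionState.SPLITTING;

            // Step 2: quota check; fails when the namespace is at its limit.
            executed.push(Step.PRE_OPERATION);
            if (regionCount >= maxRegions) {
                throw new QuotaExceeded("namespace is at its region quota limit");
            }
            return true;
        } catch (QuotaExceeded e) {
            // Roll back executed steps in reverse order.
            while (!executed.isEmpty()) {
                Step step = executed.pop();
                if (step == Step.PREPARE && revertStateOnRollback) {
                    parent.state = RegionState.OPEN;
                }
            }
            return false;
        }
    }

    /** Later procedures (merge/split/move) are modeled as requiring OPEN. */
    static boolean canStartMerge(RegionNode region) {
        return region.state == RegionState.OPEN;
    }

    public static void main(String[] args) {
        RegionNode stuck = new RegionNode();
        runSplit(stuck, 2, 2, false);
        System.out.println("rollback without revert: state=" + stuck.state
            + ", merge allowed=" + canStartMerge(stuck));

        RegionNode restored = new RegionNode();
        runSplit(restored, 2, 2, true);
        System.out.println("rollback with revert: state=" + restored.state
            + ", merge allowed=" + canStartMerge(restored));
    }
}
```

In this model the region that was rolled back without a revert stays in SPLITTING and is rejected by canStartMerge, which mirrors the "is not OPEN; state=SPLITTING" error reported for the real cluster.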
To reproduce the region quota case in HBase shell:
{code:java}
> create_namespace 'test_ns', {'hbase.namespace.quota.maxregions'=> 2}
> create 'test_ns:test_table', 'f1', {NUMREGIONS => 2, SPLITALGO => 'UniformSplit'}
> region_a = <first region from list_regions 'test_ns:test_table'>
> region_b = <second region from list_regions 'test_ns:test_table'>
> split region_a, 'x'
# HMaster will report:
pid=405, state=ROLLEDBACK,
exception=org.apache.hadoop.hbase.quotas.QuotaExceededException via
master-split-regions:org.apache.hadoop.hbase.quotas.QuotaExceededException:
Region split not possible for :<region_a> as quota limits are exceeded ;
SplitTableRegionProcedure table=test_ns:test_table, parent=...
> merge_region region_a, region_b
ERROR: org.apache.hadoop.hbase.exceptions.MergeRegionException:
org.apache.hadoop.hbase.client.DoNotRetryRegionException: <region_a> is not
OPEN; state=SPLITTING
> stop_master # trigger hmaster failover
> merge_region region_a, region_b # merge now succeeds
{code}
was:
Depending on where the split procedure fails, SplitTableRegionProcedure
rollback can leave the parent region's RegionStateNode in SPLITTING after
rollback is complete, even though the parent region is still online on the
assigned region server. This leaves the active HMaster believing that the
parent region is offline according to its RegionStates, and causes subsequent
procedures that require the region to be online, such as merge/split/move, to
fail to start. Workaround is to restart the active HMaster to reset the
in-memory record of region states.
Two scenarios where this happens:
* When a SplitTableRegionProcedure is run for a region whose namespace is at
its maximum region quota limit, the split procedure will fail and roll back,
and HMaster's in-memory RegionStateNode for the region is left in a SPLITTING
state. HMaster will then refuse to start any subsequent merge/split/move
procedures for that region because it believes the region is not OPEN, until it
is restarted and the in-memory record of region states is reset.
In the first step of the split procedure, SPLIT_TABLE_REGION_PREPARE, the
parent region's RegionStateNode state is set to SPLITTING, and the transition
is not written to the meta table. In the next step,
SPLIT_TABLE_REGION_PRE_OPERATION, the region quota check is done,
QuotaExceededException is thrown, and the procedure ends in ROLLEDBACK state
without reverting the RegionStateNode back to OPEN state. HMaster is left
believing the region is in a SPLITTING state according to its in-memory
RegionStates, while the region is still online on the assigned region server
and according to meta.
To reproduce in HBase shell:
{code:java}
> create_namespace 'test_ns', {'hbase.namespace.quota.maxregions'=> 2}
> create 'test_ns:test_table', 'f1', {NUMREGIONS => 2, SPLITALGO => 'UniformSplit'}
> region_a = <first region from list_regions 'test_ns:test_table'>
> region_b = <second region from list_regions 'test_ns:test_table'>
> split region_a, 'x'
# HMaster will report:
pid=405, state=ROLLEDBACK,
exception=org.apache.hadoop.hbase.quotas.QuotaExceededException via
master-split-regions:org.apache.hadoop.hbase.quotas.QuotaExceededException:
Region split not possible for :<region_a> as quota limits are exceeded ;
SplitTableRegionProcedure table=test_ns:test_table, parent=...
> merge_region region_a, region_b
ERROR: org.apache.hadoop.hbase.exceptions.MergeRegionException:
org.apache.hadoop.hbase.client.DoNotRetryRegionException: <region_a> is not
OPEN; state=SPLITTING
> stop_master # trigger hmaster failover
> merge_region region_a, region_b # merge now succeeds
{code}
> Split procedure rollback can leave parent region state in SPLITTING after
> completion
> ------------------------------------------------------------------------------------
>
> Key: HBASE-28533
> URL: https://issues.apache.org/jira/browse/HBASE-28533
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Environment: Tested on HBase Version 2.5.8 and latest master branch
> Reporter: Daniel Roudnitsky
> Assignee: Daniel Roudnitsky
> Priority: Major