[
https://issues.apache.org/jira/browse/HBASE-28068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nick Dimiduk reopened HBASE-28068:
----------------------------------
Re-opening for branch-2.4 backport.
> Add hbase.normalizer.merge.merge_request_max_number_of_regions property to
> limit max number of regions in a merge request for merge normalization
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-28068
> URL: https://issues.apache.org/jira/browse/HBASE-28068
> Project: HBase
> Issue Type: Improvement
> Components: Normalizer
> Affects Versions: 2.4.0, 2.5.0, 2.6.0, 3.0.0-alpha-4, 4.0.0-alpha-1
> Reporter: Ravi Kishore Valeti
> Assignee: Rahul Kumar
> Priority: Minor
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1, 4.0.0-alpha-1
>
>
> In our production environment, while investigating an issue, we observed that
> the Normalizer had scheduled a single merge procedure on one RS that covered
> 27K+ empty regions of a table (a failed copy table job had left these empty
> regions behind).
> The procedure got stuck, and the procedure framework eventually bailed out
> after ~40 mins. This recurred on every normalizer run until we deleted the
> table manually.
> Logs
> Normalizer triggers a merge procedure
> normalizer.RegionNormalizerWorker - NormalizationTarget[regionInfo={ENCODED
> => 6e8606335a62f6bafceb017dc7edfdf5, NAME => 'TEST.TEST_TABLE,XXXX.',
> STARTKEY => 'XXXX', ENDKEY => 'YYYY'},regionSizeMb=0],
> NormalizationTarget[regionInfo={ENCODED => 79607df308d7618e632abe8a12c1bf6b,
> NAME => 'TEST.TEST_TABLE,XXXX', STARTKEY => 'XXYY', ENDKEY =>
> 'YYZZ'},regionSizeMb=0]]] resulting in pid 21968356
> Procedure immediately gets stuck
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run
> time 12.4850 sec
> Finally fails after ~40 mins
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run
> time *40 mins, 58.055 sec*
> Bails out with RuntimeException
> procedure2.ProcedureExecutor - force=false
> java.lang.UnsupportedOperationException: pid=21968356,
> state=FAILED:MERGE_TABLE_REGIONS_UPDATE_META, locked=true,
> exception=java.lang.RuntimeException via CODE-BUG: Uncaught runtime
> exception: pid=21968356, state=RUNNABLE:MERGE_TABLE_REGIONS_UPDATE_META,
> locked=true; MergeTableRegionsProcedure table=TEST.TEST_TABLEXXXX,
> regions=[269a1b168af497cce9ba6d3d581568f2
> .
> .
> .
> .
> *27K+ regions printed here]*
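
For illustration only, here is a minimal Java sketch of the idea behind the
proposed hbase.normalizer.merge.merge_request_max_number_of_regions property:
capping how many regions a single merge request may carry, so that 27K+
candidates are split into many small requests instead of one huge procedure.
The class name MergeRequestLimiter, the method limit, and the cap of 100 used
in main are hypothetical and are not taken from the HBase code base or from
the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeRequestLimiter {

    /**
     * Splits an ordered list of adjacent merge candidates into requests that
     * each contain at most maxRegions regions. Hypothetical helper, not HBase API.
     */
    static List<List<String>> limit(List<String> candidates, int maxRegions) {
        if (maxRegions < 2) {
            throw new IllegalArgumentException("a merge needs at least two regions");
        }
        List<List<String>> requests = new ArrayList<>();
        for (int start = 0; start < candidates.size(); start += maxRegions) {
            int end = Math.min(start + maxRegions, candidates.size());
            // A single leftover region cannot be merged on its own, so skip it.
            if (end - start >= 2) {
                requests.add(new ArrayList<>(candidates.subList(start, end)));
            }
        }
        return requests;
    }

    public static void main(String[] args) {
        // Placeholder names standing in for 27,000 empty regions.
        List<String> regions = new ArrayList<>();
        for (int i = 0; i < 27_000; i++) {
            regions.add("region-" + i);
        }
        // The cap of 100 is an assumed value chosen for this example.
        List<List<String>> requests = limit(regions, 100);
        System.out.println(requests.size() + " merge requests");
    }
}
```

With a cap in place, each merge procedure touches a bounded number of
regions, so one stuck or failed request no longer involves the whole table.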
--
This message was sent by Atlassian Jira
(v8.20.10#820010)