[ 
https://issues.apache.org/jira/browse/HBASE-28068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Dimiduk reopened HBASE-28068:
----------------------------------

> Add hbase.normalizer.merge.merge_request_max_number_of_regions property to 
> limit max number of regions in a merge request for merge normalization
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-28068
>                 URL: https://issues.apache.org/jira/browse/HBASE-28068
>             Project: HBase
>          Issue Type: Improvement
>          Components: Normalizer
>    Affects Versions: 2.4.0, 2.5.0, 2.6.0, 3.0.0-alpha-4, 4.0.0-alpha-1
>            Reporter: Ravi Kishore Valeti
>            Assignee: Rahul Kumar
>            Priority: Minor
>             Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1, 4.0.0-alpha-1
>
>
> In our production environment, while investigating an issue, we observed that 
> the Normalizer had scheduled a single merge procedure on an RS that covered 
> 27K+ empty regions of one table (the empty regions were left behind by a 
> failed copy table job).
> The procedure got stuck, and the procedure framework eventually bailed out 
> after ~40 mins. This repeated on every normalizer run until we deleted the 
> table manually.
> Logs
> Normalizer triggers a merge procedure
> normalizer.RegionNormalizerWorker - NormalizationTarget[regionInfo={ENCODED 
> => 6e8606335a62f6bafceb017dc7edfdf5, NAME => 'TEST.TEST_TABLE,XXXX.', 
> STARTKEY => 'XXXX', ENDKEY => 'YYYY'},regionSizeMb=0], 
> NormalizationTarget[regionInfo={ENCODED => 79607df308d7618e632abe8a12c1bf6b, 
> NAME => 'TEST.TEST_TABLE,XXXX', STARTKEY => 'XXYY', ENDKEY => 
> 'YYZZ'},regionSizeMb=0]]] resulting in *pid 21968356*
> Procedure immediately gets stuck
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time 12.4850 sec
> Finally fails after ~40 mins
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time *40 mins, 58.055 sec*
> Bails out with RuntimeException
> procedure2.ProcedureExecutor - force=false
> java.lang.UnsupportedOperationException: pid=21968356, 
> state=FAILED:MERGE_TABLE_REGIONS_UPDATE_META, locked=true, 
> exception=java.lang.RuntimeException via CODE-BUG: Uncaught runtime 
> exception: pid=21968356, state=RUNNABLE:MERGE_TABLE_REGIONS_UPDATE_META, 
> locked=true; MergeTableRegionsProcedure table=TEST.TEST_TABLEXXXX, 
> regions=[269a1b168af497cce9ba6d3d581568f2
> .
> .
> .
> .
> *27K+ regions printed here]*
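
The cap named in the issue title can be pictured with a small sketch. The code below is not the HBase patch; it only illustrates the general idea of splitting one long run of adjacent merge candidates into several requests, each bounded by a maximum region count. The class name, method name, and the cap value used in the usage note are hypothetical and chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: caps how many regions
// a single merge request may contain.
final class MergeRequestLimiter {

    private MergeRequestLimiter() {
    }

    /**
     * Splits an ordered list of adjacent regions into consecutive chunks
     * of at most maxRegionsPerRequest entries. A trailing chunk with fewer
     * than two regions is dropped, because a merge needs at least two.
     */
    static <T> List<List<T>> splitIntoMergeRequests(List<T> adjacentRegions,
                                                   int maxRegionsPerRequest) {
        if (maxRegionsPerRequest < 2) {
            throw new IllegalArgumentException("a merge request needs at least 2 regions");
        }
        List<List<T>> requests = new ArrayList<>();
        for (int start = 0; start < adjacentRegions.size(); start += maxRegionsPerRequest) {
            int end = Math.min(start + maxRegionsPerRequest, adjacentRegions.size());
            if (end - start >= 2) {
                // Copy the slice so each request owns its own list.
                requests.add(new ArrayList<>(adjacentRegions.subList(start, end)));
            }
        }
        return requests;
    }
}
```

Under these assumptions, 27,000 empty regions with an assumed cap of 100 would produce 270 merge requests of 100 regions each, instead of one procedure carrying all 27,000. How the actual normalizer handles a leftover single region, or which default the property uses, is not stated in this message.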



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
