[
https://issues.apache.org/jira/browse/HBASE-28068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764315#comment-17764315
]
Viraj Jasani edited comment on HBASE-28068 at 9/12/23 5:49 PM:
---------------------------------------------------------------
{quote}I was thinking of exposing a property like
_*hbase.normalizer.plan_region_limit*_ (limit on every plan)
{quote}
Thanks [~rkrahul324], this sounds great!
Since we know the consequences of an unlimited number of region merges being
triggered by the normalizer, the default value can be kept as low as 10 (or a bit
higher, but not higher than 50), so that only 10 regions are merged at a
time, 10 more in the next run, and so on.
IMO, we don't need to keep the default value as Long.MAX_VALUE. Even though it
would take many normalizer runs to completely fix ~25k regions of size 0, that is
preferable to the procedure resources being heavily occupied by a
single normalizer run.
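To make the batching idea concrete, here is a minimal, self-contained sketch. It is not the normalizer's actual code: the class `MergeBatchSketch`, the methods `nextBatch` and `runsNeeded`, and the use of plain strings for regions are all hypothetical, and the property name is only the one proposed in this thread. The run estimate also assumes the merged result regions are not counted again.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of capping the regions handed to one merge plan per run.
final class MergeBatchSketch {
  // Property name as proposed in this thread; not a confirmed HBase setting.
  static final String PLAN_REGION_LIMIT_KEY = "hbase.normalizer.plan_region_limit";
  // Low default suggested above (assumption, still under discussion).
  static final int DEFAULT_PLAN_REGION_LIMIT = 10;

  /** Returns a copy of at most {@code limit} leading regions to merge in this run. */
  public static List<String> nextBatch(List<String> emptyRegions, int limit) {
    if (limit <= 0) {
      throw new IllegalArgumentException("limit must be positive: " + limit);
    }
    int end = Math.min(limit, emptyRegions.size());
    return new ArrayList<>(emptyRegions.subList(0, end));
  }

  /** Rough number of runs until every region has been part of one batch. */
  public static long runsNeeded(long regionCount, int limit) {
    if (limit <= 0) {
      throw new IllegalArgumentException("limit must be positive: " + limit);
    }
    return (regionCount + limit - 1) / limit; // ceiling division
  }

  public static void main(String[] args) {
    List<String> pending = new ArrayList<>();
    for (int i = 0; i < 25; i++) {
      pending.add("region-" + i);
    }
    int run = 0;
    while (!pending.isEmpty()) {
      List<String> batch = nextBatch(pending, DEFAULT_PLAN_REGION_LIMIT);
      // Drop the regions that were just handed to this run's merge plan.
      pending.subList(0, batch.size()).clear();
      run++;
      System.out.println("run " + run + ": merging " + batch.size() + " regions");
    }
  }
}
```

With a limit of 10, each run submits a small merge instead of one procedure carrying every empty region, at the cost of needing many more runs to work through a large backlog.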
WDYT [~ndimiduk] [~zhangduo] [~apurtell] [~rvaleti]?
was (Author: vjasani):
{quote}I was thinking of exposing a property like
_*hbase.normalizer.plan_region_limit*_ (limit on every plan)
{quote}
sounds good, though given we know the consequences of unlimited num of merges
that can be triggered, the default value can be kept as low as 10 so we will
have only 10 regions merged at a time and in the next run, 10 more regions can
be merged and so on.
IMO, we don't need to keep the default value as Long.MAX_VALUE.
WDYT [~zhangduo] [~apurtell] [~rvaleti]?
> Normalizer should batch merging 0 sized/empty regions
> -----------------------------------------------------
>
> Key: HBASE-28068
> URL: https://issues.apache.org/jira/browse/HBASE-28068
> Project: HBase
> Issue Type: Improvement
> Components: Normalizer
> Affects Versions: 2.5.5
> Reporter: Ravi Kishore Valeti
> Assignee: Rahul Kumar
> Priority: Minor
> Fix For: 2.6.0, 3.0.0
>
>
> In our production environment, while investigating an issue, we observed that
> the Normalizer had scheduled a single merge procedure on an RS with
> 27K+ empty regions of a table to merge (a failed copy table job
> had left 27K+ empty regions of the table).
> This caused the procedure to get stuck, and the
> procedure framework eventually bailed out after ~40 mins. This happened on each
> normalizer run until we deleted the table manually.
> Logs
> Normalizer triggers a merge procedure
> normalizer.RegionNormalizerWorker - NormalizationTarget[regionInfo=\{ENCODED
> => 6e8606335a62f6bafceb017dc7edfdf5, NAME => 'TEST.TEST_TABLE,XXXX.',
> STARTKEY => 'XXXX', ENDKEY => 'YYYY'},{*}regionSizeMb=0{*}],
> NormalizationTarget[regionInfo=\{ENCODED => 79607df308d7618e632abe8a12c1bf6b,
> NAME => 'TEST.TEST_TABLE,XXXX', STARTKEY => 'XXYY', ENDKEY =>
> 'YYZZ'},{*}regionSizeMb=0]{*}]] resulting in *pid 21968356*
> Procedure immediately gets stuck
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run
> time 12.4850 sec
> Finally fails after ~40 mins
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run
> time *40 mins, 58.055 sec*
> Bails out with RuntimeException
> procedure2.ProcedureExecutor - force=false
> java.lang.UnsupportedOperationException: pid=21968356,
> state=FAILED:MERGE_TABLE_REGIONS_UPDATE_META, locked=true,
> exception=java.lang.{*}RuntimeException via CODE-BUG: Uncaught runtime
> exception{*}: pid=21968356, state=RUNNABLE:MERGE_TABLE_REGIONS_UPDATE_META,
> locked=true; MergeTableRegionsProcedure table=TEST.TEST_TABLEXXXX,
> {*}regions={*}{*}[269a1b168af497cce9ba6d3d581568f2{*}
> .
> .
> .
> .
> *27K+ regions printed here]*