From your first email:

bq. including some with many regions in the KB size

Do you know if the above was the result of operation(s) from the normalizer?
Assuming you use the standard max hfile size, there shouldn't be such small
regions.

Cheers

On Sat, Apr 21, 2018 at 10:18 AM, Tim Robertson <[email protected]> wrote:

> Thanks Ted,
>
> I should have been explicit - for the cases I've been working with they can
> make their apps effectively go "read-only" for this house keeping step. At
> the end a change of app config or a couple of table name changes (short
> outage) would be needed.
>
> I've been using the SimpleNormalizer in 1.2.0 (CDH 5.12+) - I'll dig into
> the recent changes. I had to run several iterations of small region
> merging, plus a few iterations of SimpleNormalization to get a decent
> result which took a long time (days). On Normalizer - I had wondered if an
> approach of determining a good set of splits up front might be portable
> into a Normalizer implementation.
>
> I suspect a one time rewrite is cheaper than normalization when a table is
> in really bad shape.
>
> Thanks again,
> Tim
>
> On Sat, Apr 21, 2018 at 6:59 PM, Ted Yu <[email protected]> wrote:
>
> > Looking at proposed flow, have you considered the new data coming in
> > between steps #a and #d ?
> >
> > Also, how would client application switch between the original table and
> > the new table ?
> >
> > BTW since you mentioned SimpleNormalizer, which release are you using
> > (just want to see if all recent fixes to SimpleNormalizer were in the
> > version you use) ?
> >
> > Cheers
> >
> > On Sat, Apr 21, 2018 at 9:48 AM, Tim Robertson <[email protected]>
> > wrote:
> >
> > > Hi folks
> > >
> > > Recently I've seen a few clusters with badly unbalanced tables,
> > > including some with many regions in the KB size. It seems it is easy
> > > to overlook this in ops.
> > >
> > > Understandably SimpleNormalizer does a fairly poor job at addressing
> > > this - takes a long time, doesn't aggressively merge small regions,
> > > eagerly splits well sized regions if many small ones exist etc. It
> > > works well if enabled on a well set up table though.
> > >
> > > I have been exploring approaches to tackle:
> > > 1) determining region splits for a one time bulk load into a presplit
> > > table[1] and
> > > 2) approaches to fixing really badly skewed tables.
> > >
> > > I was thinking of creating a Jira which I'd assign to myself to add a
> > > utility tool that would:
> > >
> > > a) read the HFiles for a table (optionally performing a MC first to
> > > discard old edits)
> > > b) analyze the block headers and determine splits that would take you
> > > back to regions at e.g. 80% hbase.hregion.max.filesize
> > > c) create a new pre-split table
> > > d) run a table copy (or bulkload?)
> > >
> > > Does such a thing exist anywhere and I'm just missing it, or does
> > > anyone know of a better approach please?
> > >
> > > Thoughts, criticism, requests very welcome.
> > >
> > > Thanks,
> > > Tim
> > >
> > > [1]
> > > https://github.com/opencore/hbase-bulk-load-balanced/blob/master/src/test/java/com/opencore/hbase/example/ExampleUsageTest.java
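
For illustration only, step b) of the proposed flow could be sketched roughly as below. This is not an existing HBase tool or API: the function name choose_split_keys and its input shape are hypothetical. It assumes some earlier step has already read the HFiles and produced, for each data block, the first row key and the on-disk size in bytes. The 10 GB figure is the stock default for hbase.hregion.max.filesize and should be replaced with the table's actual setting. The sketch walks the blocks in key order and starts a new region whenever adding the next block would push the running size past the target (80% of the max file size by default).

```python
from typing import Iterable, List, Tuple

# Assumption: stock default of hbase.hregion.max.filesize (10 GB).
DEFAULT_MAX_FILESIZE = 10 * 1024 ** 3


def choose_split_keys(blocks: Iterable[Tuple[bytes, int]],
                      max_filesize: int = DEFAULT_MAX_FILESIZE,
                      fill_ratio: float = 0.8) -> List[bytes]:
    """Return split keys so each region holds roughly fill_ratio * max_filesize.

    blocks is a hypothetical input: (first_row_key, size_in_bytes) per data
    block, gathered from all HFiles of the table by some other step.
    """
    target = int(max_filesize * fill_ratio)
    if target <= 0:
        raise ValueError("max_filesize * fill_ratio must be positive")

    splits: List[bytes] = []
    current = 0  # bytes accumulated in the region being filled
    # Sort by row key so blocks from different HFiles interleave correctly.
    for first_key, size in sorted(blocks, key=lambda b: b[0]):
        # Close the current region if this block would overshoot the target.
        # The current > 0 guard avoids emitting a split before any data.
        if current > 0 and current + size > target:
            splits.append(first_key)
            current = 0
        current += size
    return splits
```

The returned keys would then be used as the split points when creating the new table in step c). Note that the sizes seen here are those of the existing files, so a new table with different compression or encoding settings could end up with regions of a different size than targeted.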
