If this approach were taken and "new" regions were created instead of splitting, then because the keys are monotonically increasing you would always have one "hot" region receiving all of the new data. That would limit load throughput.
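A common way around that hot region (a standard pattern, not something proposed in this thread) is to salt the rowkey: prefix each monotonically increasing id with a hash-derived bucket so writes spread across several regions. This is just a minimal sketch; the bucket count and key format here are hypothetical choices.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical number of pre-split regions

def salted_key(sequential_id: int) -> str:
    """Prefix a monotonically increasing id with a hash-derived bucket.

    Writes then spread across NUM_BUCKETS regions instead of all
    landing in the single region that owns the tail of the key space.
    """
    digest = hashlib.md5(str(sequential_id).encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    # zero-pad both parts so keys still sort sensibly within a bucket
    return f"{bucket:02d}-{sequential_id:012d}"

# print a few example keys; consecutive ids scatter across buckets
# (collisions are of course possible with only 16 buckets)
for i in range(100, 104):
    print(salted_key(i))
```

The trade-off is the usual one: range scans over the original key order now require one scan per bucket, which is why the question below about keeping sequential keys but changing the split strategy is interesting.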
-----Original Message-----
From: Fernando Padilla [mailto:f...@alum.mit.edu]
Sent: Wednesday, June 15, 2011 3:28 PM
To: user@hbase.apache.org
Subject: Re: random versus ordered input strategies

I'm sorry to butt in, but I have a question about that "split region in half" growth strategy. If it's too off-topic, you can ignore it.

I know that people always complain about using HBase for monotonically increasing inputs, and recommend using a random or hashed key strategy to distribute the load across servers. I've always wondered whether you could change the "split region in half" strategy to "create new region", and whether such a strategy could be useful for people who want monotonically increasing inputs. It wouldn't be as efficient as fully randomly distributed keys, but it might be a good enough compromise to keep code complexity down.

From your example: if you have an almost full region with keys 1-10 and I go to add key 11, right now it would split in half: 1-5, 6-11, incurring the overhead of creating new regions and doing lots of data copying to create all of the new region data files, etc. To add further insult, once that new region is close to full we have 1-5, 6-16; once 17 is added, it goes through a region split once again: 1-5, 6-10, 11-16, leaving half-filled regions.

What if instead, when you had a nearly full region with keys 1-10 and then added key 11, because it's in a "monotonically increasing inputs" mode it left region 1-10 alone and simply created a brand new region, 11-11 (hopefully on a different region server)?

I know that someone must have thought about this already, and must have good reasons not to follow it, but I've never seen this discussed in the architecture docs or on the mailing list... so just wondering... :)

On 6/15/11 10:58 AM, Chris Tarnas wrote:
> There are a few ways:
>
> 1) Dynamically as data is added. You start with one region and all data goes there. When a region grows too big, it gets split in half.
> So if a region had keys 1-10, we now have 1-5 and 5-10.
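The overhead difference the question raises can be seen in a toy model. This is only a sketch under stated assumptions, not HBase's actual split logic: region capacity is a made-up constant, a "split" is modeled as rewriting every row of the full region, and the "create new region" mode assumes purely monotonic inserts so the tail region is always the one that fills.

```python
REGION_CAPACITY = 10  # hypothetical rows per region, for illustration only

def split_in_half(n_keys):
    """Insert keys 1..n; when a region overflows, split it in half.

    Returns (regions, rows_copied), where rows_copied counts rows
    rewritten into new store files during splits.
    """
    regions = [[]]           # each region is a sorted list of keys
    copied = 0
    for k in range(1, n_keys + 1):
        tail = regions[-1]   # monotonic keys always land in the last region
        tail.append(k)
        if len(tail) > REGION_CAPACITY:
            mid = len(tail) // 2
            regions[-1] = tail[:mid]
            regions.append(tail[mid:])
            copied += len(tail)   # both halves get rewritten
    return regions, copied

def create_new_region(n_keys):
    """Insert keys 1..n; when the tail region fills, open an empty one."""
    regions = [[]]
    copied = 0
    for k in range(1, n_keys + 1):
        if len(regions[-1]) >= REGION_CAPACITY:
            regions.append([])    # no data movement at all
        regions[-1].append(k)
    return regions, copied

# With 30 monotonic keys, split-in-half repeatedly rewrites the tail
# region and leaves half-filled regions behind, while create-new-region
# copies nothing and leaves full regions.
```

In this model the "create new region" mode indeed avoids all copying, which is the appeal of the idea; what it does not show is the hot-region effect, since either way every insert goes to the single tail region.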