> The biggest legitimate reason to run a smaller region size is if your
> data set is small (let's say 400 MB) but highly accessed, so you want a
> good spread of regions across your cluster.

That's exactly it: my input dataset was 500 MB total (~1,000,000 rows), and it 
was being stored as just one region on one regionserver.
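
For context, the split threshold being discussed here is the
hbase.hregion.max.filesize setting, which can also be overridden per table at
creation time. Below is a minimal sketch of that per-table override, assuming a
0.90-era Java client API; the table name, column family, and the 64 MB value are
purely illustrative, not settings recommended anywhere in this thread:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateSmallRegionTable {
  public static void main(String[] args) throws Exception {
    // Hypothetical table and column family names, for illustration only.
    HTableDescriptor desc = new HTableDescriptor("MyTable");
    desc.addFamily(new HColumnDescriptor("Message"));
    // Per-table override of hbase.hregion.max.filesize (in bytes):
    // a region splits once its largest store file grows past this size.
    desc.setMaxFileSize(64L * 1024 * 1024); // 64 MB, illustrative only
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    admin.createTable(desc);
  }
}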

In response to St.Ack: I don't think my regions are performing too many 
splits. The regionserver logs mainly consist of the occasional ZooKeeper 
connection error and these two messages, repeated:

2009-12-22 15:21:50,415 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: 
Cache Stats: Sizes: Total=6.556961MB (6875472), Free=792.61804MB (831120240), 
Max=799.175MB (837995712), Counts: Blocks=0, Access=25755, Hit=0, Miss=25755, 
Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%, 
Evicted/Run=NaN

2009-12-22 15:20:35,073 DEBUG org.apache.hadoop.hbase.regionserver.Store: 
Skipping major compaction of Message because one (major) compacted file only 
and elapsedTime 339624149ms is < ttl=9223372036854775807

Are you suggesting performance would improve if the dataset were larger? 
What other parameters can be fine-tuned based on data size?

Thanks
-Mark
-----Original Message-----
From: Ryan Rawson [mailto:[email protected]]
Sent: Tuesday, December 22, 2009 11:28 PM
To: [email protected]
Subject: Re: Smaller Region Size?

The biggest legitimate reason to run a smaller region size is if your
data set is small (let's say 400 MB) but highly accessed, so you want a
good spread of regions across your cluster.

Another is to run a larger region size if you have a huge table and
you want to keep the absolute region count low. I am not 100% sold on
this yet.

I have a patch that can keep performance high when writing to a heavily
split table, by using parallel puts. It has been shown to keep aggregate
performance really high, and I hope it will make 0.20.3.
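
To be clear, what follows is not that patch -- just a rough client-side sketch
of the general idea of issuing puts from several threads so that writes land on
many regions in parallel. It assumes a 0.90-era Java client API, and the table
name, column family, thread count, and buffer sizes are made up for illustration:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelPutSketch {
  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    final int threads = 4;            // illustrative thread count
    final int rowsPerThread = 250000; // ~1M rows total, as in this thread
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      final int id = t;
      pool.submit(new Runnable() {
        public void run() {
          try {
            // One HTable per thread: HTable instances are not thread-safe.
            HTable table = new HTable(conf, "MyTable");  // hypothetical name
            table.setAutoFlush(false);                   // buffer writes client-side
            table.setWriteBufferSize(4L * 1024 * 1024);  // illustrative 4 MB buffer
            List<Put> batch = new ArrayList<Put>(1000);
            for (int i = 0; i < rowsPerThread; i++) {
              Put put = new Put(Bytes.toBytes("row-" + id + "-" + i));
              put.add(Bytes.toBytes("Message"), Bytes.toBytes("body"),
                      Bytes.toBytes("value-" + i));
              batch.add(put);
              if (batch.size() == 1000) { // ship puts in chunks
                table.put(batch);
                batch.clear();
              }
            }
            if (!batch.isEmpty()) {
              table.put(batch);
            }
            table.flushCommits(); // push anything left in the write buffer
            table.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}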

On Tue, Dec 22, 2009 at 2:31 PM, stack <[email protected]> wrote:
> On Tue, Dec 22, 2009 at 8:57 AM, Mark Vigeant
> <[email protected]>wrote:
>
>> J-D,
>>
>> I noticed that performance for uploading data into tables got a lot better
>> as I lowered the max file size -- but only up to a certain point, after
>> which performance began slowing down again.
>>
>>
> Tell us more.  What kinda size changes did you make?  How many regions were
> created?  Is the slowdown because the table is splitting all the time?  If
> you study the regionserver logs, can you make out what the regionservers are
> spending their time doing?
>
>
>
>> Is there a rule of thumb/formula/notion to rely on when setting this
>> parameter for optimal performance? Thanks!
>>
>>
> We have the most experience running the defaults.  Generally folks go up from
> the default size because they want to host more data in about the same number
> of regions.  Going down from the default I've not seen much of.
>
> St.Ack
>

