Re: Smaller Region Size?

Andrew Purtell Wed, 23 Dec 2009 09:48:00 -0800

How do you have clocks set up on your systems, Mark? Are you using NTP to keep
them sane? Am I correct that they are sometimes running backward?
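
One rough way to check for a backward-running clock is to sample the wall clock on a node and look for readings that are earlier than the one before. The Python sketch below is only an illustration: the function names and the sampling interval are made up here, and it does not replace checking the NTP status on each machine.

```python
import time


def backward_steps(samples):
    """Return (index, delta_seconds) for each sample earlier than its predecessor."""
    steps = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta < 0:
            steps.append((i, delta))
    return steps


def sample_wall_clock(count, interval_s):
    """Collect `count` wall-clock readings taken `interval_s` seconds apart."""
    readings = []
    for _ in range(count):
        readings.append(time.time())
        time.sleep(interval_s)
    return readings


if __name__ == "__main__":
    # An empty list means no backward step was seen during this short window.
    print(backward_steps(sample_wall_clock(count=5, interval_s=0.01)))
```

A short window like this only catches a step that happens while the script runs, so it would need to run for much longer (or be fed timestamps from existing logs) to say anything useful.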



   - Andy



----- Original Message ----
> From: Mark Vigeant <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Wed, December 23, 2009 9:09:04 AM
> Subject: RE: Smaller Region Size?
> 
> > The biggest legitimate reason to run smaller region size is if your
> > data set is small (let's say 400 MB) but highly accessed, so you want a
> > good spread of regions across your cluster.
> 
> That's exactly it: my input dataset was 500MB total (~1,000,000 rows) and it
> was getting stored as just one region on one regionserver.
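
As a back-of-the-envelope illustration of why a lower max file size spreads a 500 MB dataset over more regions, the Python sketch below divides the dataset size by a few candidate limits. It is a rough estimate only: the limits shown are example values (not settings taken from this thread), the helper name is hypothetical, and real region counts also depend on compression, column families and when splits actually trigger.

```python
import math

MB = 1024 * 1024


def estimated_regions(dataset_bytes, max_filesize_bytes):
    """Rough estimate: one region per max_filesize worth of data, at least one."""
    if max_filesize_bytes <= 0:
        raise ValueError("max_filesize_bytes must be positive")
    return max(1, math.ceil(dataset_bytes / max_filesize_bytes))


if __name__ == "__main__":
    dataset = 500 * MB  # dataset size quoted above
    for limit_mb in (256, 64, 16):  # example limits, chosen for illustration
        print(limit_mb, "MB limit ->", estimated_regions(dataset, limit_mb * MB), "regions")
```

The same arithmetic also hints at the other side of the trade-off: very small limits mean many regions and many splits during the upload.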
> 
> In response to St. Ack, I don't think my regions are performing too many
> splits: the regionserver logs mainly consist of the occasional ZooKeeper
> Connection error and these two repeatedly:
> 
> 2009-12-22 15:21:50,415 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: 
> Cache Stats: Sizes: Total=6.556961MB (6875472), Free=792.61804MB (831120240), 
> Max=799.175MB (837995712), Counts: Blocks=0, Access=25755, Hit=0, Miss=25755, 
> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%, 
> Evicted/Run=NaN
> 
> 2009-12-22 15:20:35,073 DEBUG org.apache.hadoop.hbase.regionserver.Store:
> Skipping major compaction of Message because one (major) compacted file only
> and elapsedTime 339624149ms is < ttl=9223372036854775807
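
To put the two numbers in that compaction message into readable units, here is a small Python sketch. The values are copied from the log line above; reading the ttl value as "no expiry configured" is an inference from its size, not something confirmed in this thread.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24

elapsed_ms = 339624149           # elapsedTime from the log line
ttl = 9223372036854775807        # ttl from the log line


def ms_to_days(ms):
    """Convert milliseconds to days."""
    return ms / MS_PER_DAY


if __name__ == "__main__":
    # The ttl equals the largest signed 64-bit integer.
    print("ttl is max signed 64-bit value:", ttl == 2**63 - 1)
    print("elapsed days:", round(ms_to_days(elapsed_ms), 2))
```

So the file is a few days old and the ttl is effectively unbounded, which is consistent with the compaction being skipped each time the check runs.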
> 
> You're suggesting the performance would be improved if the dataset was
> larger? What are other parameters that can be fine-tuned to optimize based
> off data size?
> 
> Thanks
> -Mark
> -----Original Message-----
> From: Ryan Rawson [mailto:[email protected]]
> Sent: Tuesday, December 22, 2009 11:28 PM
> To: [email protected]
> Subject: Re: Smaller Region Size?
> 
> The biggest legitimate reason to run smaller region size is if your
> data set is small (let's say 400 MB) but highly accessed, so you want a
> good spread of regions across your cluster.
> 
> Another is to run a larger region if you have a huge table and
> you want to keep absolute region count low. I am not 100% sold on this
> yet.
> 
> I have a patch that can keep performance high during a highly split
> table, by using parallel puts. This has been proven to keep aggregate
> performance really high, and I hope it will make 0.20.3.
> 
> On Tue, Dec 22, 2009 at 2:31 PM, stack wrote:
> > On Tue, Dec 22, 2009 at 8:57 AM, Mark Vigeant
> > wrote:
> >
> >> J-D,
> >>
> >> I noticed that performance for uploading data into tables got a lot better
> >> as I lowered the max file size -- but up until a certain point, where the
> >> performance began slowing down again.
> >>
> >>
> > Tell us more.  What kinda size changes did you make?  How many regions were
> > created?  Is the slow down because table is splitting all the time?  If you
> > study regionserver logs, can you make out what the regionservers are
> > spending their time doing?
> >
> >
> >
> >> Is there a rule of thumb/formula/notion to rely on when setting this
> >> parameter for optimal performance? Thanks!
> >>
> >>
> > We have most experience running defaults.  Generally folks go up from the
> > default size because they want to host more data in about the same number
> > of regions.  Going down from the default I've not seen much of.
> >
> > St.Ack
> >
> 
