How do you have clocks set up on your systems, Mark? Are you using NTP to keep them sane? Am I correct that they are sometimes running backward?

   - Andy

----- Original Message ----
> From: Mark Vigeant <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Wed, December 23, 2009 9:09:04 AM
> Subject: RE: Smaller Region Size?
>
> > The biggest legitimate reason to run smaller region size is if your
> > data set is small (lets say 400mb) but highly accessed, so you want a
> > good spread of regions across your cluster.
>
> That's exactly it, my input dataset was 500MB total (~1,000,000 rows) and it
> was getting stored as just one region on one regionserver.
>
> In response to St. Ack, I don't think my regions are performing too many
> splits: the regionserver logs mainly consist of the occasional ZooKeeper
> Connection error and these two repeatedly:
>
> 2009-12-22 15:21:50,415 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> Cache Stats: Sizes: Total=6.556961MB (6875472), Free=792.61804MB (831120240),
> Max=799.175MB (837995712), Counts: Blocks=0, Access=25755, Hit=0, Miss=25755,
> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%,
> Evicted/Run=NaN
>
> 2009-12-22 15:20:35,073 DEBUG org.apache.hadoop.hbase.regionserver.Store:
> Skipping major compaction of Message because one (major) compacted file only
> and elapsedTime 339624149ms is < ttl=9223372036854775807
>
> You're suggesting the performance would be improved if the dataset was
> larger? What are other parameters that can be fine-tuned to optimize based
> off data size?
>
> Thanks
> -Mark
>
> -----Original Message-----
> From: Ryan Rawson [mailto:[email protected]]
> Sent: Tuesday, December 22, 2009 11:28 PM
> To: [email protected]
> Subject: Re: Smaller Region Size?
>
> The biggest legitimate reason to run smaller region size is if your
> data set is small (lets say 400mb) but highly accessed, so you want a
> good spread of regions across your cluster.
>
> Another is to run a larger region if you are having a huge table and
> you want to keep absolute region count low. I am not 100% sold on this
> yet.
>
> I have a patch that can keep performance high during a highly split
> table, by using parallel puts. This has been proven to keep aggregate
> performance really high, and I hope it will make 0.20.3.
>
> On Tue, Dec 22, 2009 at 2:31 PM, stack wrote:
> > On Tue, Dec 22, 2009 at 8:57 AM, Mark Vigeant
> > wrote:
> >
> >> J-D,
> >>
> >> I noticed that performance for uploading data into tables got a lot better
> >> as I lowered the max file size -- but up until a certain point, where the
> >> performance began slowing down again.
> >>
> >
> > Tell us more. What kinda size changes did you make? How many regions were
> > created? Is the slow down because table is splitting all the time? If you
> > study regionserver logs, can you make out what the regionservers are
> > spending their times doing?
> >
> >> Is there a rule of thumb/formula/notion to rely on when setting this
> >> parameter for optimal performance? Thanks!
> >>
> >
> > We have most experience running defaults. Generally folks go up from the
> > default size because they want to host more data in about same number or
> > regions. Going down from the default I've not seen much of.
> >
> > St.Ack
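
One way to follow up on the clock question is to scan a regionserver log for places where a line's timestamp is earlier than the one before it. The sketch below is only an illustration: it assumes every log entry starts with a `YYYY-MM-DD HH:MM:SS,mmm` stamp like the two lines quoted above, and `parse_stamp` and `find_backward_steps` are hypothetical helper names, not part of HBase.

```python
from datetime import datetime

# Assumed format: the leading "YYYY-MM-DD HH:MM:SS,mmm" stamp seen in the
# two quoted regionserver log lines.
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
STAMP_LENGTH = len("2009-12-22 15:21:50,415")


def parse_stamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
    except ValueError:
        return None


def find_backward_steps(lines):
    """Return (previous, current) timestamp pairs where time goes backward."""
    steps = []
    previous = None
    for line in lines:
        current = parse_stamp(line)
        if current is None:
            continue  # continuation line without its own timestamp
        if previous is not None and current < previous:
            steps.append((previous, current))
        previous = current
    return steps
```

Running `find_backward_steps(open(path))` over a log file (with `path` pointing at that file) returns an empty list when timestamps never decrease. A non-empty result is only a hint: lines pasted out of order into a mail would trigger it just as a clock step would.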
