> The biggest legitimate reason to run a smaller region size is if your
> data set is small (let's say 400 MB) but highly accessed, so you want a
> good spread of regions across your cluster.
That's exactly it: my input dataset was 500 MB total (~1,000,000 rows) and it was getting stored as just one region on one regionserver.

In response to St.Ack, I don't think my regions are performing too many splits. The regionserver logs mainly consist of the occasional ZooKeeper connection error and these two messages, repeated:

2009-12-22 15:21:50,415 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=6.556961MB (6875472), Free=792.61804MB (831120240), Max=799.175MB (837995712), Counts: Blocks=0, Access=25755, Hit=0, Miss=25755, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%, Evicted/Run=NaN

2009-12-22 15:20:35,073 DEBUG org.apache.hadoop.hbase.regionserver.Store: Skipping major compaction of Message because one (major) compacted file only and elapsedTime 339624149ms is < ttl=9223372036854775807

Are you suggesting the performance would improve if the dataset were larger? What other parameters can be tuned based on data size?

Thanks
-Mark

-----Original Message-----
From: Ryan Rawson [mailto:[email protected]]
Sent: Tuesday, December 22, 2009 11:28 PM
To: [email protected]
Subject: Re: Smaller Region Size?

The biggest legitimate reason to run a smaller region size is if your data set is small (let's say 400 MB) but highly accessed, so you want a good spread of regions across your cluster.

Another is to run a larger region size if you have a huge table and you want to keep the absolute region count low.

I am not 100% sold on this yet. I have a patch that can keep performance high on a highly split table by using parallel puts. This has been proven to keep aggregate performance really high, and I hope it will make it into 0.20.3.

On Tue, Dec 22, 2009 at 2:31 PM, stack <[email protected]> wrote:
> On Tue, Dec 22, 2009 at 8:57 AM, Mark Vigeant <[email protected]> wrote:
>
>> J-D,
>>
>> I noticed that performance for uploading data into tables got a lot better
>> as I lowered the max file size -- but only up to a certain point, where the
>> performance began slowing down again.
>>
>
> Tell us more. What kind of size changes did you make? How many regions were
> created? Is the slowdown because the table is splitting all the time? If you
> study the regionserver logs, can you make out what the regionservers are
> spending their time doing?
>
>> Is there a rule of thumb/formula/notion to rely on when setting this
>> parameter for optimal performance? Thanks!
>>
>
> We have the most experience running the defaults. Generally folks go up from
> the default size because they want to host more data in about the same number
> of regions. Going down from the default I've not seen much of.
>
> St.Ack
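
For anyone weighing the same trade-off, here is a rough sketch of how the max file size relates to region count for a dataset of a given size. It is a back-of-the-envelope model, not HBase's actual split logic. It assumes a region splits once it grows past the configured maximum (the hbase.hregion.max.filesize setting) and that each daughter region starts at about half that size, so the region count should land somewhere between size/max and 2*size/max. The function name and the example sizes are made up for illustration. Compression, data still held in memory, and uneven row keys are all ignored.

```python
import math

MB = 1024 * 1024


def estimate_region_count(dataset_bytes, max_filesize_bytes):
    """Return a rough (low, high) range for the number of regions.

    Assumption (not HBase's real algorithm): regions end up between
    half the max file size and the full max file size.
    """
    if dataset_bytes <= 0 or max_filesize_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Fewest regions: every region is filled right up to the maximum.
    low = max(1, math.ceil(dataset_bytes / max_filesize_bytes))
    # Most regions: every region has just split and is about half full.
    high = max(1, math.ceil(2 * dataset_bytes / max_filesize_bytes))
    return low, high


if __name__ == "__main__":
    dataset = 500 * MB  # the dataset size mentioned in this thread
    for max_mb in (1024, 256, 64):  # hypothetical settings to compare
        low, high = estimate_region_count(dataset, max_mb * MB)
        print(f"max file size {max_mb} MB: roughly {low} to {high} regions")
```

Comparing the estimated range with the number of regionservers gives a quick sense of whether a small, heavily accessed table can spread across the cluster at all. It says nothing about where the split overhead starts to outweigh the benefit, which is what the regionserver logs are for.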
