The biggest legitimate reason to run a smaller region size is if your data set is small (let's say 400 MB) but highly accessed, so you want a good spread of regions across your cluster.
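As a minimal sketch of how you'd do that (assuming the 0.20-era client API; the table and family names here are made up), you can override the cluster-wide hbase.hregion.max.filesize knob per table at creation time, e.g. splitting at 128 MB instead of the 0.20 default of 256 MB so the small hot data set spreads over more regionservers:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateSmallRegionTable {
  public static void main(String[] args) throws Exception {
    // Hypothetical table "hot_table" with one family "d".
    HTableDescriptor desc = new HTableDescriptor("hot_table");
    desc.addFamily(new HColumnDescriptor("d"));
    // Per-table override of hbase.hregion.max.filesize, in bytes:
    // regions split once a store file grows past this size.
    desc.setMaxFileSize(128 * 1024 * 1024L);
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    admin.createTable(desc);
  }
}

The same value can of course be set cluster-wide in hbase-site.xml instead; the per-table override just keeps the experiment scoped to the one hot table.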
Another is to run larger regions if you have a huge table and want to keep the absolute region count low. I am not 100% sold on this yet. I have a patch that can keep performance high on a heavily split table by using parallel puts (a rough client-side sketch of the idea follows the quoted thread below). This has been proven to keep aggregate performance really high, and I hope it will make 0.20.3.

On Tue, Dec 22, 2009 at 2:31 PM, stack <[email protected]> wrote:

> On Tue, Dec 22, 2009 at 8:57 AM, Mark Vigeant
> <[email protected]> wrote:
>
>> J-D,
>>
>> I noticed that performance for uploading data into tables got a lot
>> better as I lowered the max file size -- but only up to a certain point,
>> where the performance began slowing down again.
>
> Tell us more. What kind of size changes did you make? How many regions
> were created? Is the slowdown because the table is splitting all the
> time? If you study the regionserver logs, can you make out what the
> regionservers are spending their time doing?
>
>> Is there a rule of thumb/formula/notion to rely on when setting this
>> parameter for optimal performance? Thanks!
>
> We have the most experience running the defaults. Generally folks go up
> from the default size because they want to host more data in about the
> same number of regions. Going down from the default I've not seen much
> of.
>
> St.Ack
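For the curious, here is the rough client-side sketch I mentioned. To be clear, this is NOT the patch itself (that changes the client internals); it just illustrates the idea of keeping puts flowing in parallel so one region's split doesn't stall the whole upload. It assumes the 0.20-era API and the hypothetical "hot_table"/"d" names from above; each thread gets its own HTable since HTable is not thread-safe:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelPutter {
  public static void main(String[] args) throws Exception {
    final HBaseConfiguration conf = new HBaseConfiguration();
    final int threads = 8;
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      final int id = t;
      pool.submit(new Runnable() {
        public void run() {
          try {
            // One HTable per thread; HTable is not thread-safe.
            HTable table = new HTable(conf, "hot_table");
            table.setAutoFlush(false);               // buffer puts client-side
            table.setWriteBufferSize(2 * 1024 * 1024);
            for (int i = 0; i < 100000; i++) {
              Put p = new Put(Bytes.toBytes("row-" + id + "-" + i));
              p.add(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes(i));
              table.put(p);
            }
            table.flushCommits();                    // push the remaining buffer
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
  }
}

With autoFlush off and several writer threads, a split only blocks the threads whose buffered puts target the splitting region, so aggregate throughput stays high even on a heavily split table.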
