[ 
https://issues.apache.org/jira/browse/HBASE-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182117#comment-13182117
 ] 

stack commented on HBASE-5138:
------------------------------

+1 on nice doc.
                
> [ref manual] Add a discussion on the number of regions
> ------------------------------------------------------
>
>                 Key: HBASE-5138
>                 URL: https://issues.apache.org/jira/browse/HBASE-5138
>             Project: HBase
>          Issue Type: Task
>            Reporter: Jean-Daniel Cryans
>
> ntelford on IRC made the good point that we say people shouldn't have too 
> many regions, but we don't say why. His problem currently is:
> {quote}
> 09:21 < ntelford> problem is, if you're running MR jobs on a subset of that 
> data, you need the regions to be as small as possible otherwise tasks don't 
> get allocated in parallel much
> 09:22 < ntelford> so we've found we have to strike a balance between keeping 
> them small for MR and keeping them large for HBase to behave well
> 09:22 < ntelford> we erred on the side of smaller regions because our MR 
> issues were more immediate - we couldn't find any documentation or anecdotal 
> evidence as to why HBase doesn't like lots of regions
> {quote}
> The three main issues I can think of when having too many regions are:
>  - MSLAB requires 2MB per memstore (that's 2MB per family per region). 1000 
> regions that have 2 families each use 3.9GB of heap before storing any data 
> at all. NB: the 2MB value is configurable.
>  - if you fill all the regions at roughly the same rate, hitting the global 
> memstore usage limit forces tiny flushes when you have too many regions, 
> which in turn generates compactions. Rewriting the same data tens of times is 
> the last thing you want. For example, fill 1000 regions (with one family 
> each) equally, and consider a lower bound for global memstore usage of 5GB 
> (the region server would have a big heap). Once usage reaches 5GB, it will 
> force flush the biggest region; at that point almost all regions should hold 
> about 5MB of data, so it would flush that amount. 5MB of inserts later, it 
> would flush another region that now has a bit over 5MB of data, and so on.
>  - the new master is allergic to tons of regions, and will take a lot of time 
> assigning them and moving them around in batches. The reason is that it's 
> heavy on ZK usage, and it's not very async at the moment (could really be 
> improved).
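
The arithmetic behind the first two points can be checked with a short sketch. This is only back-of-the-envelope math using the figures quoted above (2MB MSLAB chunk, 1000 regions, 5GB global memstore lower bound); the function names are hypothetical and nothing here calls HBase.

```python
MB = 1024 * 1024
GB = 1024 * MB

def mslab_overhead_bytes(regions, families_per_region, chunk_bytes=2 * MB):
    """Heap reserved by MSLAB before any data arrives: one chunk per memstore,
    and one memstore per family per region. chunk_bytes is the configurable
    2MB value from the description."""
    return regions * families_per_region * chunk_bytes

def forced_flush_bytes(global_memstore_bytes, regions):
    """Rough size of each forced flush, assuming every region fills at the
    same rate so the global limit is spread evenly across all regions."""
    return global_memstore_bytes / regions

overhead = mslab_overhead_bytes(regions=1000, families_per_region=2)
flush = forced_flush_bytes(5 * GB, regions=1000)

print(f"MSLAB overhead: {overhead / GB:.1f} GB")   # 3.9 GB
print(f"Forced flush size: {flush / MB:.1f} MB")   # 5.1 MB
```

With the same 5GB bound, 100 regions would flush roughly 51MB at a time, which is why fewer, larger regions produce fewer tiny files to compact.
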
> Another issue is the effect of the number of regions on mapreduce jobs. 
> Keeping 5 regions per RS would be too low for a job, whereas 1000 will 
> generate too many maps. This comes back to ntelford's problem of needing to 
> scan portions of tables. To solve his problem, we discussed using a custom 
> input format that generates many splits per region.
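
A minimal sketch of the "many splits per region" idea, assuming row keys can be treated as big-endian integers and divided evenly. The function name `split_key_range` is hypothetical, and this is not the input format discussed on IRC; a real one would emit one MapReduce split per sub-range. It also assumes both keys are non-empty, which does not hold for the first and last region of a table.

```python
from typing import List, Tuple

def split_key_range(start_key: bytes, end_key: bytes, n: int) -> List[Tuple[bytes, bytes]]:
    """Divide the half-open range [start_key, end_key) into up to n contiguous
    sub-ranges by interpolating between the keys as big-endian integers."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Right-pad the shorter key with zero bytes so both have the same width.
    width = max(len(start_key), len(end_key))
    lo = int.from_bytes(start_key.ljust(width, b"\x00"), "big")
    hi = int.from_bytes(end_key.ljust(width, b"\x00"), "big")
    if lo >= hi:
        raise ValueError("start_key must sort before end_key")
    n = min(n, hi - lo)  # cannot produce more sub-ranges than distinct keys
    bounds = [lo + (hi - lo) * i // n for i in range(n + 1)]
    keys = [b.to_bytes(width, "big") for b in bounds]
    # Preserve the region's original boundaries exactly (without padding).
    keys[0], keys[-1] = start_key, end_key
    return list(zip(keys[:-1], keys[1:]))

# One region covering [row0, row9) becomes three map inputs instead of one.
for start, end in split_key_range(b"row0", b"row9", 3):
    print(start, end)
```

Even division only balances the work if keys are spread uniformly across the range; skewed key distributions would need sampled split points instead.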

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
