Matt Corgan summarized the pro and con of having large number of regions
here:

https://issues.apache.org/jira/browse/HBASE-7667?focusedCommentId=13575024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13575024

Cheers

On Sun, Feb 10, 2013 at 7:43 PM, Joarder KAMAL <joard...@gmail.com> wrote:

> Hi Kevin,
>
> Thanks a lot for your great answers.
> Regarding Q5. To clarify,
>
> lets say Facebook is using HBase for the integrated messaging/chat/email
> system in a very large-scale setup. And schema design of such system can
> change over the years (even over the months). Workload patterns may also
> change due to different usage characteristics (like the rate of messaging
> may be higher during a protest/specific event in a particular country). So,
> region/table hotspots have been created at random region servers within the
> cluster despite careful schema design and pre-planning.
>
> The facebook team rush to split the hotspotted regions manually and
> redistribute them over a new set of physical machines which are recently
> added to the system to increase scalability in the face of high user
> demand. Now hotspotted region data could be transferred into new physical
> machines gradually to handle the situations. Now if the shard (region) size
> is small enough then data transfer cost over the network could be minimum
> otherwise large volume of data needs to be transferred instantly.
>
> I have found in many places it is discouraged to have a large number of
> regions systems. However, would it be possible to have very large number of
> regions in a system thus minimizing data transfer cost in case hotspotting
> due to workload/design characteristics. Is there any drawbacks or known
> side-effects? I am rethinking other possibilities other pre-planned schema
> and row-key designs.
>
>
> Thanks again.
>
> Regards,
> Joarder Kamal
>
>
> On 11 February 2013 13:32, Kevin O'dell <kevin.od...@cloudera.com> wrote:
>
> > Hi Joarder,
> >
> >   Welcome to the HBase world.  Let me take some time to address your
> > questions the best I can:
> >
> >  1. How often you are facing Region or Table Hotspotting in HBase
> >    production systems? <--- Hotspotting is not something that just
> happens.
> >  This is usually caused by bad key design and writing to one region more
> > than the others.  I would recommend watching some of Lar's YouTube videos
> > on Schema Design in HBase.
> >    2. If a hotspot is created, how quickly it is automatically cleared
> out
> >    (assuming sudden workload change)? <--- It will not be automatically
> > "cleared out" I think you may be misinformed here.  Basically, it is on
> you
> > to watch you table and your write distribution and determine that you
> have
> > a hotspot and take the necessary action.  Usually the only action is to
> > split the region.  If hotspots become a habitual problem you would most
> > likely want to go back and re-evaluate your current key.
> >    3. How often this kind of situation happens - A hotspot is detected
> and
> >    vanished out before taking an action? or hotspots stays longer period
> of
> >    time? <--- Please see above
> >    4. Or if the hotspot is stays, how it is handled (in general) in
> >    production system? <--- Some people have to hotspot on purpose early
> on,
> > because they only write to a subset of regions.  You will have to
> manually
> > watch for hotspots(which is much easier in later releases).
> >    5. How large data transfer cost is minimized or avoid for re-sharding
> >    regions within a cluster in a single data center or within WAN? <---
> Not
> > quite sure what you are saying here, so I will take a best guess at it.
> >  Sharding is handled in HBase by region splitting.  The best way to
> success
> > in HBase is to understand your data and you needs BEFORE you create you
> > table and start writing into HBase.  This way you can presplit your table
> > to handle the incoming data and you won't have to do a massive amounts of
> > splits.  Later you can allow HBase to split your tables manually, or you
> > can set the maxfile size high and manually control the splits or
> sharding.
> >    6. Is hotspoting in HBase cluster is really a issue (big!) nowadays
> for
> >    OLAP workloads and real-time analytics? <--- Just design your schema
> > correctly and this should not be a problem for you.
> >
> > Please let me know if this answers your questions.
> >
> > On Sun, Feb 10, 2013 at 9:17 PM, Joarder KAMAL <joard...@gmail.com>
> wrote:
> >
> > > This is my first email in the group. I am having a more general and
> > > open-ended question but hope to get some reasoning from the HBase user
> > > communities.
> > > I am a very basic HBase user and still learning. My intention to use
> > HBase
> > > in one of our research project. Recently I was looking through Lars
> > > George's book "HBase - The Definitive Guide" and two particular topics
> > > caught my eyes. One is 'Region and Table Hotspotting' and the other is
> > > 'Region Auto-Sharding and Merging'.
> > >
> > > *Scenario: *
> > > If a hotspot is created in a particular region or in a table (having
> > > multiple regions) due to sudden workload change, then one may split the
> > > region into further small pieces and distributed it to a number of
> > > available physical machine in the cluster. This process should require
> > > large data transfer between different machines in the cluster and
> incur a
> > > performance cost. One may also change the 'key' definition and manage
> the
> > > regions. But I am not sure how effective or logical to change key
> designs
> > > on a production system.
> > >
> > > *Questions:*
> > >
> > >    1. How often you are facing Region or Table Hotspotting in HBase
> > >    production systems?
> > >    2. If a hotspot is created, how quickly it is automatically cleared
> > out
> > >    (assuming sudden workload change)?
> > >    3. How often this kind of situation happens - A hotspot is detected
> > and
> > >    vanished out before taking an action? or hotspots stays longer
> period
> > of
> > >    time?
> > >    4. Or if the hotspot is stays, how it is handled (in general) in
> > >    production system?
> > >    5. How large data transfer cost is minimized or avoid for
> re-sharding
> > >    regions within a cluster in a single data center or within WAN?
> > >    6. Is hotspoting in HBase cluster is really a issue (big!) nowadays
> > for
> > >    OLAP workloads and real-time analytics?
> > >
> > >
> > > Further directions to more information about region/table hotspotting
> is
> > > most welcome.
> > >
> > > Many thanks in advance.
> > >
> > > Regards,
> > > Joarder Kamal
> > >
> >
> >
> >
> > --
> > Kevin O'Dell
> > Customer Operations Engineer, Cloudera
> >
>

Reply via email to