Dear Nick,

Thanks a lot for the nice explanation. I'm just left with one small point of
confusion:
As you said, "Your logical partitioning remains the same whether it's being
served by N, 2N, or 3.5N+2 RegionServers." When a region is split into two,
I guess the logical partitioning changes, right? Could you kindly clarify a
bit more? My question is: without making any change to my initial
row-key-generation function, how will new writes go to the new regions? I
assume it is hard to predict the number of RS up front, as well as to create
pre-split regions, in very large-scale production systems. I am not worried
about the default load-balancing behaviour of HBase; St.Ack and you have
clearly explained that already.

For example, in the excerpt below from Lars George's book, he uses <# of RS>
while computing the prefix; I guess <# of Regions> could also be used (if
regions are pre-split):

----------------------------------------------------------------------------
*Salting*
You can use a salting prefix to the key that guarantees a spread of all rows
across all region servers. For example:

byte prefix = (byte) (Long.hashCode(timestamp) % <number of region servers>);
byte[] rowkey = Bytes.add(Bytes.toBytes(prefix), Bytes.toBytes(timestamp));

This formula will generate enough prefix numbers to ensure that rows are
sent to all region servers. Of course, the formula assumes a *specific*
number of servers, and if you are planning to *grow your cluster* you should
set this number to a *multiple* instead.
----------------------------------------------------------------------------
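To make the question concrete, here is a rough sketch of what I mean by
computing the salt against a fixed bucket count instead of <# of RS> or
<# of Regions>. This is my own illustration, not from the book; SALT_BUCKETS,
the class name, and the key layout are all just assumptions:

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {

  // Fixed, never-changing number of salt buckets (my own constant; 16 is
  // only an example). The bucket count is independent of how many
  // RegionServers or regions currently exist.
  private static final int SALT_BUCKETS = 16;

  // Build a salted row key: [1-byte bucket prefix][8-byte timestamp].
  public static byte[] saltedRowKey(long timestamp) {
    // Mask keeps the hash non-negative before taking the modulus.
    int bucket = (Long.valueOf(timestamp).hashCode() & 0x7fffffff) % SALT_BUCKETS;
    return Bytes.add(new byte[] { (byte) bucket }, Bytes.toBytes(timestamp));
  }
}

If I read your reply correctly, as long as SALT_BUCKETS never changes, the
logical key space stays the same no matter how many regions or RegionServers
end up serving it, so neither a region split nor a new RS requires touching
this function. (A corresponding sketch of the N-bucket read is at the very
end of this mail, after the quoted thread.)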
Thanks again ...

Regards,
Joarder Kamal


On 3 July 2013 03:37, Nick Dimiduk <[email protected]> wrote:

> Hi Joarder,
>
> I think you're slightly confused about the impact of using a hashed (or
> sometimes called "salted") prefix for your rowkeys. This strategy for
> rowkey design has an impact on the logical ordering of your data, not
> necessarily the physical distribution of your data. In HBase, these are
> orthogonal concerns. It means that to execute a bucket-agnostic query, the
> client must initiate N scans. However, there's no guarantee that all
> regions starting with the same hash land on the same RegionServer. Region
> assignment is a complex beast; as I understand, it's based on a randomish,
> load-based assignment.
>
> Take a look at your existing table distributed on a size-N cluster. Do all
> regions that fall within the first bucket sit on the same RegionServer?
> Likely not. However, look at the number of regions assigned to each
> RegionServer. This should be close to even. Adding a new RegionServer to
> the cluster will result in some of those regions migrating from the other
> servers to the new one. The impact will be a decrease in the average
> number of regions served per RegionServer.
>
> Your logical partitioning remains the same whether it's being served by N,
> 2N, or 3.5N+2 RegionServers. Your client always needs to execute that
> bucket-agnostic query as N scans, touching each of the N buckets.
> Precisely which RegionServers are touched by any given scan depends
> entirely on how the balancer has distributed load on your cluster.
>
> Thanks,
> Nick
>
> On Thu, Jun 27, 2013 at 5:02 PM, Joarder KAMAL <[email protected]> wrote:
>
> > Thanks St.Ack for mentioning the load-balancer.
> >
> > But my question was two-fold:
> > Case-1. If a new RS is added, then the load-balancer will do its job,
> > assuming no new region has been created in the meantime. // As you've
> > already answered.
> >
> > Case-2. Whether a new RS is added or not, if an existing region is split
> > into two, how will the new writes go to the new region? Because, let's
> > say initially the hash function was calculated with *N* regions and now
> > there are *N+1* regions in the cluster.
> >
> > In that case, do I need to change the hash function and reshuffle all
> > the existing data within the cluster? Or does HBase have some mechanism
> > to handle this?
> >
> > Many thanks again for helping me out...
> >
> > Regards,
> > Joarder Kamal
> >
> > On 28 June 2013 02:26, Stack <[email protected]> wrote:
> >
> > > On Wed, Jun 26, 2013 at 4:24 PM, Joarder KAMAL <[email protected]>
> > > wrote:
> > >
> > > > Maybe a simple question to answer for experienced HBase users and
> > > > developers:
> > > >
> > > > If I use hash partitioning to evenly distribute write workloads into
> > > > my region servers and later add a new region server to scale, or
> > > > split an existing region, then do I need to change my hash function
> > > > and re-shuffle all the existing data between all the region servers
> > > > (old and new)? Or is there any better solution for this? Any
> > > > guidance would be very much helpful.
> > >
> > > You do not need to change your hash function.
> > >
> > > When you add a new regionserver, the balancer will move some of the
> > > existing regions to the new host.
> > >
> > > St.Ack
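P.S. For my own understanding of the "N scans" Nick mentions above, a
bucket-agnostic read over a time range might look roughly like the sketch
below: one Scan per salt prefix, merged on the client. It reuses the assumed
SALT_BUCKETS constant from the sketch earlier in this mail; HTableInterface
and the two-argument Scan constructor are the client API of the 0.94-era
releases, so the exact classes may differ in other HBase versions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketAgnosticRead {

  private static final int SALT_BUCKETS = 16; // must match the write side

  // Fetch every row whose timestamp falls in [fromTs, toTs) by issuing one
  // scan per salt bucket and concatenating the results. Ordering across
  // buckets is not preserved; a client-side merge would be needed for that.
  public static List<Result> scanAllBuckets(HTableInterface table,
                                            long fromTs, long toTs)
      throws IOException {
    List<Result> results = new ArrayList<Result>();
    for (int bucket = 0; bucket < SALT_BUCKETS; bucket++) {
      byte[] prefix = new byte[] { (byte) bucket };
      byte[] start = Bytes.add(prefix, Bytes.toBytes(fromTs));
      byte[] stop = Bytes.add(prefix, Bytes.toBytes(toTs));
      Scan scan = new Scan(start, stop); // start inclusive, stop exclusive
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          results.add(r);
        }
      } finally {
        scanner.close();
      }
    }
    return results;
  }
}

Which RegionServers those N scans actually touch is then entirely up to the
balancer, exactly as Nick describes.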
