> This is the first time I've heard of a region split taking 4 minutes. For us, it's always on the order of seconds. That's true even for a large 50+gb region. It might be worth looking into why that's so slow for you.
For us also, the split takes less than a second. But to split a recently split region, I have to wait around 4 minutes for the master to clean up all the parent references. It throws this exception:

2024-02-06 20:52:33,389 DEBUG [.default.FPBQ.Fifo.handler=249,queue=15,port=61000] assignment.SplitTableRegionProcedure - Splittable=false state=OPEN, location=<RS-name>
2024-02-06 20:52:37,482 DEBUG [.default.FPBQ.Fifo.handler=249,queue=15,port=61000] ipc.MetricsHBaseServer - Unknown exception type
org.apache.hadoop.hbase.DoNotRetryIOException: 23d5a72661ce2027cb7388b694dc235a NOT splittable
  at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.checkSplittable(SplitTableRegionProcedure.java:231)
  at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:134)
  at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:1031)
  at org.apache.hadoop.hbase.master.HMaster$3.run(HMaster.java:2198)
  at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.submitProcedure(MasterProcedureUtil.java:132)
  at org.apache.hadoop.hbase.master.HMaster.splitRegion(HMaster.java:2191)
  at org.apache.hadoop.hbase.master.MasterRpcServices.splitRegion(MasterRpcServices.java:860)
  at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
  at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
  at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
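For reference, this is roughly the client-side workaround we have been considering: keep retrying the split until the master has cleaned up the parent references and the region becomes splittable again. This is only a rough, untested sketch against the HBase 2.x Admin API; the table name, split point, timeout and sleep interval are made-up placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitWithRetry {

  /** Retry the split until the master stops rejecting it as "NOT splittable". */
  public static void splitWhenSplittable(Admin admin, TableName table, byte[] splitPoint)
      throws Exception {
    long deadline = System.currentTimeMillis() + 10 * 60 * 1000L; // placeholder 10 min budget
    while (true) {
      try {
        // The master rejects this with DoNotRetryIOException("... NOT splittable")
        // while the freshly split parent's reference files are still around.
        admin.split(table, splitPoint);
        return;
      } catch (IOException e) {
        boolean notSplittable = e.getMessage() != null && e.getMessage().contains("NOT splittable");
        if (!notSplittable || System.currentTimeMillis() > deadline) {
          throw e;
        }
        Thread.sleep(10_000L); // placeholder backoff while the parent references get cleaned up
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Placeholder table name and split point.
      splitWhenSplittable(admin, TableName.valueOf("MY_TABLE"), Bytes.toBytes("tenant123"));
    }
  }
}

Even with such a retry loop, every round of splits still has to sit out the parent-reference cleanup, which is where the roughly 4 minutes per round in the calculation quoted below comes from.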
On Wed, Feb 7, 2024 at 10:02 AM Bryan Beaudreault <bbeaudrea...@apache.org> wrote:

> This is the first time I've heard of a region split taking 4 minutes. For us, it's always on the order of seconds. That's true even for a large 50+gb region. It might be worth looking into why that's so slow for you.
>
> On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
>
> > Thank you Andrew, Bryan and Duo for your responses.
> >
> > > My main thought is that a migration like this should use bulk loading,
> > > But also, I think, that data transfer should be in bulk
> >
> > We are working on moving to bulk loading.
> >
> > > With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration.
> >
> > Trying to do some back-of-the-envelope calculations: in a production environment, it took around 4 minutes to split a recently split region which had 4 store files with a total of 5 GB of data. Assume we are migrating 5000 tenants at a time; normally around 10% of the tenants (500 tenants) have data spread across more than 1000 regions, and we have around 10 huge tables where we store the tenants' data for different use cases. All of the above numbers are on the *conservative* side.
> >
> > To create a split structure for 1000 regions, we need 10 rounds of splits (2^10 = 1024), assuming we split the regions in parallel within each round. Each split takes around 4 minutes, so creating 1000 regions for just 1 tenant and 1 table takes around 40 minutes. For 10 tables for 1 tenant, it takes around 400 minutes.
> >
> > For 500 tenants, this will take around *140 days*. To reduce this time further, we can also create the split structure for each tenant and each table in parallel, but this would put a lot of pressure on the cluster, require a lot of operational overhead, and we would still end up with the whole process taking days, if not months.
> >
> > Since we are moving our infrastructure to Public Cloud, we anticipate this huge migration happening once every month.
> >
> > > Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> >
> > IMHO, for all the reasons stated above, I believe this is necessary.
> >
> > On Mon, Jan 29, 2024 at 6:25 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> >
> > > As it is called 'pre' split, it means that it can only happen when there is no data in the table.
> > >
> > > If there is already data in the table, you cannot always create 'empty' regions, as you do not know whether there is already data in the given range...
> > >
> > > And technically, if you want to split an HFile into more than 2 parts, you need to design a new algorithm, as HBase currently only supports a top reference and a bottom reference...
> > >
> > > Thanks.
> > >
> > > On Sat, Jan 27, 2024 at 02:16, Bryan Beaudreault <bbeaudrea...@apache.org> wrote:
> > >
> > > > My main thought is that a migration like this should use bulk loading, which should be relatively easy given you already use MR (HFileOutputFormat2). It doesn't solve the region-splitting problem. With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration. Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> > > >
> > > > On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> > > >
> > > > > Hi Everyone,
> > > > > At my workplace, we use HBase + Phoenix to run our customer workloads. Most of our Phoenix tables are multi-tenant and we store the tenantID as the leading part of the rowkey. Each tenant belongs to only 1 HBase cluster. Due to capacity planning, hardware refresh cycles and, most recently, the move to public cloud initiatives, we have to migrate a tenant from one HBase cluster (source cluster) to another HBase cluster (target cluster). Normally we migrate a lot of tenants (in the tens of thousands) at a time and hence we have to copy a huge amount of data (in TBs) from multiple source clusters to a single target cluster. We have an internal tool which uses the MapReduce framework to copy the data. Since all of these tenants don't have any presence on the target cluster (note that the table is NOT empty, since we have data for other tenants in the target cluster), they start with one region and, through the organic split process, the data gets distributed among different regions and different regionservers.
> > > > > But the organic splitting process takes a lot of time and, due to the distributed nature of the MR framework, it causes hotspotting issues on the target cluster which often last for days. This causes availability issues where the CPU and/or disks are saturated on the regionservers ingesting the data. It also causes a lot of replication-related alerts (Age of last ship, LogQueue size) which go on for days.
> > > > >
> > > > > In order to handle the huge influx of data, we should ideally pre-split the table on the target based on the split structure present on the source cluster. If we pre-split and create empty regions with the right region boundaries, it will help distribute the load to different regions and regionservers and will prevent hotspotting.
> > > > >
> > > > > Problems with the above approach:
> > > > > 1. Currently we allow pre-splitting only while creating a new table. But in our production env, we already have the table created for other tenants. So we would like to pre-split an existing table for new tenants.
> > > > > 2. Currently we split a given region into just 2 daughter regions. But if we have the split point information from the source cluster and the data for the to-be-migrated tenant is split across 100 regions on the source side, we would ideally like to create 100 empty regions on the target cluster.
> > > > >
> > > > > Trying to get early feedback from the community. Do you all think this is a good idea? Open to other suggestions also.
> > > > >
> > > > > Thank you,
> > > > > Rushabh.
> > > > >
> > >
>
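To make the Admin.splitRegion suggestion quoted above concrete, here is a rough, untested sketch (against the HBase 2.x client API) of the kind of helper we have in mind: take the region boundaries exported from the source cluster for a tenant and apply them to the already-existing table on the target cluster as repeated two-way splits, waiting for each split to finish before moving on. The table name, boundaries, timeout and sleep interval are made-up placeholders, and error handling is minimal.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExistingTable {

  /** Apply each split point with a two-way Admin.split and wait until the region count grows. */
  public static void preSplit(Admin admin, TableName table, byte[][] splitPoints) throws Exception {
    for (byte[] splitPoint : splitPoints) {
      int before = admin.getRegions(table).size();
      admin.split(table, splitPoint); // splits the region currently containing splitPoint
      long deadline = System.currentTimeMillis() + 10 * 60 * 1000L; // placeholder budget
      while (admin.getRegions(table).size() <= before) {
        if (System.currentTimeMillis() > deadline) {
          throw new IllegalStateException(
            "Timed out waiting for split at " + Bytes.toStringBinary(splitPoint));
        }
        Thread.sleep(5_000L); // daughter regions come online asynchronously
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Placeholder boundaries; in practice these would be the start keys of the
    // tenant's regions exported from the source cluster.
    byte[][] boundaries = {
      Bytes.toBytes("tenant123_row05000"),
      Bytes.toBytes("tenant123_row10000"),
    };
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      preSplit(admin, TableName.valueOf("MY_TABLE"), boundaries);
    }
  }
}

Note that a freshly created daughter region may still refuse the next split until the master has cleaned up the parent's reference files (the "NOT splittable" exception at the top of this mail), so in practice this loop would be combined with the retry sketch earlier in the thread. A master-side splitRegion that accepts a byte[][] of split points would let us avoid driving all of this from the client.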
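For completeness, here is also a rough, untested sketch of the bulk-load path suggested above: generate HFiles from MapReduce with HFileOutputFormat2, then hand them to the cluster in one step. The class, table, paths, column family and input format are made-up placeholders, and the BulkLoadHFiles tool assumes HBase 2.2 or later (older releases use LoadIncrementalHFiles). The relevant point for this thread is that configureIncrementalLoad partitions the job by the target table's current region boundaries, which is exactly why pre-splitting the target table to match the source layout matters.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.tool.BulkLoadHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TenantBulkLoad {

  /** Placeholder mapper: turns "rowkey<TAB>value" lines into Puts for a single column. */
  public static class LineToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private static final byte[] CF = Bytes.toBytes("0");   // placeholder column family
    private static final byte[] QUAL = Bytes.toBytes("v"); // placeholder qualifier

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      ctx.write(new ImmutableBytesWritable(row),
        new Put(row).addColumn(CF, QUAL, Bytes.toBytes(parts[1])));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("MY_TABLE");      // placeholder
    Path input = new Path("/tmp/tenant-migration/input");     // placeholder
    Path hfileDir = new Path("/tmp/tenant-migration/hfiles"); // placeholder

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName)) {

      Job job = Job.getInstance(conf, "tenant-migration-bulkload");
      job.setJarByClass(TenantBulkLoad.class);
      job.setInputFormatClass(TextInputFormat.class);
      FileInputFormat.addInputPath(job, input);
      job.setMapperClass(LineToPutMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);

      // Sets up HFileOutputFormat2 with a total-order partitioner keyed by the
      // target table's current region boundaries.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
      FileOutputFormat.setOutputPath(job, hfileDir);

      if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("bulk load HFile generation failed");
      }

      // Atomically adopt the generated HFiles into the serving regions.
      BulkLoadHFiles.create(conf).bulkLoad(tableName, hfileDir);
    }
  }
}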