Rushabh already covered this, but a split is not complete until the region
can be split again. This is a very important nuance. The daughter regions
come online very quickly, as designed, but background housekeeping
(compaction) must then rewrite the referenced parent data into the
daughters' own store files before the daughters become splittable.
Depending on compaction pressure, compaction queue depth, and the settings
of various tunables, waiting for split daughters to become ready to split
again can take anywhere from many minutes to hours.
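As an aside, there is no public Admin API that directly reports whether a
daughter still carries reference files, so from a client the best we can do
is approximate. Below is a minimal sketch, purely illustrative: it requests
a major compaction on each region of an assumed table and then polls
Admin.getCompactionStateForRegion until everything reports NONE, treating
that as a rough proxy for "the references have been rewritten and the
daughters are splittable again." The table name and polling interval are
assumptions.

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.CompactionState;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionInfo;

public class WaitForSplittableDaughters {
  public static void main(String[] args) throws Exception {
    TableName table = TableName.valueOf("my_table"); // assumed table name
    try (Connection conn =
           ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Ask for major compactions so reference files left behind by splits
      // are rewritten into each daughter's own store files.
      for (RegionInfo region : admin.getRegions(table)) {
        admin.majorCompactRegion(region.getRegionName());
      }
      // The requests above are asynchronous; give them a moment to be
      // picked up before we start polling.
      TimeUnit.SECONDS.sleep(30);
      // Poll until no region of the table reports an in-progress compaction.
      boolean busy = true;
      while (busy) {
        busy = false;
        for (RegionInfo region : admin.getRegions(table)) {
          CompactionState state =
            admin.getCompactionStateForRegion(region.getRegionName());
          if (state != CompactionState.NONE) {
            busy = true;
            break;
          }
        }
        if (busy) {
          TimeUnit.SECONDS.sleep(30); // assumed polling interval
        }
      }
    }
  }
}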

So let's say we have replicas of a table at two sites, site A and site B.
The region boundaries of this table in A and B will be different. Now let's
also say that table data is stored with a key prefix mapping to every
unique tenant. When migrating a tenant, the data copy will hotspot on the
region(s) hosting keys with the tenant's prefix. This is fine if there are
enough regions to absorb the load. We run into trouble when the region
boundaries in the sub-keyspace of interest differ substantially between B
and A. We get hotspotting and impact to operations until organic splitting
eventually mitigates it, but that can also take many minutes to hours, with
noticeable performance degradation in the meantime. To avoid that
degradation we pace the sender, but then the copy may take so long that it
misses the SLA for the migration. To make the data movement performant and
stay within SLA, we want to apply one or more splits or merges so that the
region boundaries in B roughly align with those in A, avoiding hotspotting.
This would also make shipping the data by bulk load instead more efficient,
by minimizing the amount of HFile splitting necessary to load the files at
the receiver.
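As a rough sketch of what boundary alignment looks like with today's API,
assuming one connection per site and a hypothetical tenant prefix, one
might read the start keys of the source-side regions in the tenant's
sub-keyspace and replay them on the target as individual split requests.
Admin.split(TableName, byte[]) splits whichever region currently contains
the point, one daughter pair at a time, so successive points that land in
the same daughter must wait for it to become splittable again, which is
exactly the slowness described above.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.RegionInfo;
import org.apache.hadoop.hbase.util.Bytes;

public class AlignTenantBoundaries {
  /** Collect start keys of source-side (site A) regions that begin with
   *  the tenant's prefix. */
  static List<byte[]> tenantSplitPoints(Admin sourceAdmin, TableName table,
      byte[] tenantPrefix) throws Exception {
    List<byte[]> points = new ArrayList<>();
    for (RegionInfo region : sourceAdmin.getRegions(table)) {
      byte[] start = region.getStartKey();
      if (Bytes.startsWith(start, tenantPrefix)) {
        points.add(start);
      }
    }
    return points;
  }

  /** Apply each split point on the target (site B), one pair of daughters
   *  per request, which is all the current API offers. */
  static void applySplitPoints(Admin targetAdmin, TableName table,
      List<byte[]> splitPoints) throws Exception {
    for (byte[] splitPoint : splitPoints) {
      // Splits the target region that currently contains splitPoint.
      targetAdmin.split(table, splitPoint);
      // In practice we would have to wait here for the new daughters to
      // become splittable before issuing another point that falls inside
      // them.
    }
  }
}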

So let's say we have some regions that need to be split N ways, where N is
on the order of ~10 (more than 1 and less than 100), in order to (roughly)
align region boundaries. We think this calls for an enhancement to the
split request API so that a single request can produce a requested number
of daughter regions. Today a split always produces one pair. Instead we
might want 2, 5, 10, conceivably more. And it would be nice if guideposts
for multi-way splitting could be passed in as a byte[][] of split points.
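To make the shape of the proposal concrete, here is one possible signature,
purely a sketch and not an existing HBase API; the method name, return
type, and exact semantics are all open for discussion:

// Hypothetical addition to org.apache.hadoop.hbase.client.Admin -- NOT an
// existing API, just an illustration of the proposal above.
public interface MultiSplitAdmin {
  /**
   * Split the given region at each of the supplied split points, producing
   * splitPoints.length + 1 daughter regions from a single request. The
   * split points must be sorted and must fall within the region's key
   * range.
   */
  java.util.concurrent.Future<Void> splitRegionAsync(byte[] regionName,
      byte[][] splitPoints) throws java.io.IOException;
}

As Duo points out later in the thread, doing this efficiently on the server
side would also need a reference scheme beyond today's top/bottom half
references, so a store file can be logically divided more than two ways.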

On Wed, Feb 7, 2024 at 10:03 AM Bryan Beaudreault <bbeaudrea...@apache.org>
wrote:

> This is the first time I've heard of a region split taking 4 minutes. For
> us, it's always on the order of seconds. That's true even for a large 50+ GB
> region. It might be worth looking into why that's so slow for you.
>
> On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah
> <rushabh.s...@salesforce.com.invalid> wrote:
>
> > Thank you Andrew, Bryan and Duo for your responses.
> >
> > > My main thought is that a migration like this should use bulk loading,
> > > But also, I think, that data transfer should be in bulk
> >
> > We are working on moving to bulk loading.
> >
> > > With Admin.splitRegion, you can specify a split point. You can use that
> > > to iteratively add a bunch of regions wherever you need them in the
> > > keyspace. Yes, it's 2 at a time, but it should still be quick enough in
> > > the grand scheme of a large migration.
> >
> >
> > Trying to do some back-of-the-envelope calculations:
> > In a production environment, it took around 4 minutes to split a
> > recently split region which had 4 store files with a total of 5 GB of
> > data.
> > Assume we are migrating 5000 tenants at a time; normally around 10% of
> > the tenants (500 tenants) have data spread across more than 1000
> > regions. We have around 10 huge tables where we store the tenants' data
> > for different use cases.
> > All the above numbers are on the *conservative* side.
> >
> > To create a split structure for 1000 regions, we need 10 iterations of
> > splits (2^10 = 1024). This assumes we are splitting the regions in
> > parallel.
> > Each split takes around 4 minutes. So to create 1000 regions just for 1
> > tenant and for 1 table, it takes around 40 minutes.
> > For 10 tables for 1 tenant, it takes around 400 minutes.
> >
> > For 500 tenants, this will take around *140 days*. To reduce this time
> > further, we could also create a split structure for each tenant and each
> > table in parallel.
> > But this would put a lot of pressure on the cluster, would require a lot
> > of operational overhead, and we would still end up with the whole
> > process taking days, if not months.
> >
> > Since we are moving our infrastructure to the public cloud, we
> > anticipate this huge migration happening once every month.
> >
> >
> > > Adding a splitRegion method that takes byte[][] for multiple split
> > > points would be a nice UX improvement, but not strictly necessary.
> >
> > IMHO, for all the reasons stated above, I believe this is necessary.
> >
> >
> >
> >
> >
> > On Mon, Jan 29, 2024 at 6:25 AM 张铎(Duo Zhang) <palomino...@gmail.com>
> > wrote:
> >
> > > As it is called a 'pre' split, it means that it can only happen when
> > > there is no data in the table.
> > >
> > > If there is already data in the table, you cannot always create
> > > 'empty' regions, as you do not know whether there is already data in
> > > the given range...
> > >
> > > And technically, if you want to split an HFile into more than 2 parts,
> > > you need to design a new algorithm, as HBase currently only supports a
> > > top reference and a bottom reference...
> > >
> > > Thanks.
> > >
> > > Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Sat, Jan 27,
> > > 2024 at 02:16:
> > > >
> > > > My main thought is that a migration like this should use bulk
> > > > loading, which should be relatively easy given you already use MR
> > > > (HFileOutputFormat2). It doesn't solve the region-splitting problem.
> > > > With Admin.splitRegion, you can specify a split point. You can use
> > > > that to iteratively add a bunch of regions wherever you need them in
> > > > the keyspace. Yes, it's 2 at a time, but it should still be quick
> > > > enough in the grand scheme of a large migration. Adding a splitRegion
> > > > method that takes byte[][] for multiple split points would be a nice
> > > > UX improvement, but not strictly necessary.
> > > >
> > > > On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah
> > > > <rushabh.s...@salesforce.com.invalid> wrote:
> > > >
> > > > > Hi Everyone,
> > > > > At my workplace, we use HBase + Phoenix to run our customer
> > > > > workloads. Most of our Phoenix tables are multi-tenant and we store
> > > > > the tenantID as the leading part of the rowkey. Each tenant belongs
> > > > > to only one HBase cluster.
> > > > > Due to capacity planning, hardware refresh cycles, and most
> > > > > recently our move to public cloud initiatives, we have to migrate
> > > > > tenants from one HBase cluster (source cluster) to another HBase
> > > > > cluster (target cluster). Normally we migrate a lot of tenants (in
> > > > > the 10s of thousands) at a time and hence we have to copy a huge
> > > > > amount of data (in TBs) from multiple source clusters to a single
> > > > > target cluster. We have an internal tool which uses the MapReduce
> > > > > framework to copy the data. Since these tenants don't have any
> > > > > presence on the target cluster (note that the table is NOT empty,
> > > > > since we have data for other tenants in the target cluster), they
> > > > > start with one region, and through the organic split process the
> > > > > data gets distributed among different regions and regionservers.
> > > > > But the organic splitting process takes a lot of time and, due to
> > > > > the distributed nature of the MR framework, it causes hotspotting
> > > > > issues on the target cluster which often last for days. This causes
> > > > > availability issues with CPU and/or disk saturation on the
> > > > > regionservers ingesting the data. It also causes a lot of
> > > > > replication-related alerts (Age of last ship, LogQueue size) which
> > > > > go on for days.
> > > > >
> > > > > In order to handle the huge influx of data, we should ideally
> > > > > pre-split the table on the target based on the split structure
> > > > > present on the source cluster. If we pre-split and create empty
> > > > > regions with the right region boundaries, it will help distribute
> > > > > the load to different regions and region servers and will prevent
> > > > > hotspotting.
> > > > >
> > > > > Problems with the above approach:
> > > > > 1. Currently we allow pre-splitting only while creating a new
> > > > > table. But in our production environment, we already have the table
> > > > > created for other tenants. So we would like to pre-split an
> > > > > existing table for new tenants.
> > > > > 2. Currently we split a given region into just 2 daughter regions.
> > > > > But if we have the split point information from the source cluster,
> > > > > and the data for the to-be-migrated tenant is split across 100
> > > > > regions on the source side, we would ideally like to create 100
> > > > > empty regions on the target cluster.
> > > > >
> > > > > Trying to get early feedback from the community. Do you all think
> > > > > this is a good idea? Open to other suggestions also.
> > > > >
> > > > >
> > > > > Thank you,
> > > > > Rushabh.
> > > > >
> > >
> >
>


-- 
Best regards,
Andrew

Unrest, ignorance distilled, nihilistic imbeciles -
    It's what we’ve earned
Welcome, apocalypse, what’s taken you so long?
Bring us the fitting end that we’ve been counting on
   - A23, Welcome, Apocalypse
