Yep, I forgot about that nuance. I agree we can add a splitRegion overload
which takes a byte[][] for multiple split points.
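
Just to make the proposal concrete, here is a rough sketch of what the overload
could look like (hypothetical signature; the exact name and semantics would be
settled in the design discussion):

    /**
     * Hypothetical addition to org.apache.hadoop.hbase.client.Admin: split the
     * given region at each of the supplied split points in one request,
     * producing splitPoints.length + 1 daughters. Points outside the region's
     * key range, or duplicates, would be rejected.
     */
    java.util.concurrent.Future<Void> splitRegionAsync(byte[] regionName, byte[][] splitPoints);

Whether the server implements this as a true multi-way split or as an internally
chained series of two-way splits is a separate question (see Duo's note further
down about top/bottom references).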

On Thu, Feb 8, 2024 at 8:23 PM Andrew Purtell <apurt...@apache.org> wrote:

> Rushabh already covered this, but splitting is not complete until the region
> can be split again. This is a very important nuance. The daughter regions
> are online very quickly, as designed, but then background housekeeping
> (compaction) must copy the data before the daughters become splittable.
> Depending on compaction pressure, compaction queue depth, and the settings
> of various tunables, waiting for some split daughters to become ready to
> split again can take many minutes to hours.
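
(As far as I know there is no direct client-side "is this region splittable
again?" check; one rough proxy we could use when scripting these splits is to
poll compaction state after requesting a major compaction on the daughter, on
the assumption that the reference files have been rewritten by the time that
compaction completes. A minimal sketch:)

    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.CompactionState;

    public class SplitPacing {
      // Rough proxy only, not an authoritative check: assumes the daughter's
      // reference files have been rewritten once a requested major compaction
      // has run to completion on it.
      public static void waitUntilLikelySplittable(Admin admin, byte[] regionName)
          throws Exception {
        admin.majorCompactRegion(regionName);
        Thread.sleep(5_000L); // the request is queued asynchronously; let it start
        while (admin.getCompactionStateForRegion(regionName) != CompactionState.NONE) {
          Thread.sleep(5_000L);
        }
      }
    }
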
>
> So let's say we have replicas of a table at two sites, site A and site B.
> The region boundaries of this table in A and B will be different. Now let's
> also say that table data is stored with a key prefix mapping to every
> unique tenant. When migrating a tenant, data copy will hotspot on the
> region(s) hosting keys with the tenant's prefix. This is fine if there are
> enough regions to absorb the load. We run into trouble when the region
> boundaries in the sub-keyspace of interest are quite different in B versus
> A. We get hotspotting and impact to operations until organic splitting
> eventually mitigates the hotspotting, but this might also require many
> minutes to hours, with noticeable performance degradation in the meantime.
> To avoid that degradation we pace the sender, but then the copy may take so
> long that it misses the SLA for the migration. To make the data movement
> performant and stay within SLA we want to apply one or more splits or merges so
> that the region boundaries in B roughly align to those in A, avoiding
> hotspotting. This will also make shipping the data by bulk load more efficient,
> by minimizing the amount of HFile splitting necessary to load the files at the
> receiver.
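
(For what it's worth, this alignment step can already be scripted against the
existing Admin API one split point at a time. A minimal sketch, with connection
setup elided and with no pacing or retries, neither of which you would want to
skip in practice:)

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.RegionInfo;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AlignBoundaries {
      // Issue a split on the target (B) cluster for every source (A) region
      // start key that falls strictly inside an existing target region.
      public static void align(Connection source, Connection target, TableName table)
          throws Exception {
        try (Admin srcAdmin = source.getAdmin(); Admin dstAdmin = target.getAdmin()) {
          for (RegionInfo src : srcAdmin.getRegions(table)) {
            byte[] point = src.getStartKey();
            if (point.length == 0) continue; // first region has an open start key
            for (RegionInfo dst : dstAdmin.getRegions(table)) {
              boolean afterStart = Bytes.compareTo(point, dst.getStartKey()) > 0;
              boolean beforeEnd = dst.getEndKey().length == 0
                  || Bytes.compareTo(point, dst.getEndKey()) < 0;
              if (afterStart && beforeEnd) {
                // Daughters come online quickly, but as noted above they may not
                // be splittable again right away, so real usage needs pacing.
                dstAdmin.splitRegionAsync(dst.getRegionName(), point).get();
                break;
              }
            }
          }
        }
      }
    }

The pain point being discussed is that this is one split per boundary, and a
boundary that lands in a just-split daughter has to wait out compaction first.
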
>
> So let's say we have some regions that need to be split N ways, where N is on
> the order of ~10 (more than 1 and less than 100), in order to (roughly) align
> region boundaries. We think this calls for an enhancement to the split request
> API where the split should produce a requested number of daughter pairs. Today
> that is always 1 pair. Instead we might want 2, 5, 10, conceivably more. And it
> would be nice if the guideposts for multi-way splitting could be sent over as a
> byte[][].
>
> On Wed, Feb 7, 2024 at 10:03 AM Bryan Beaudreault <bbeaudrea...@apache.org>
> wrote:
>
> > This is the first time I've heard of a region split taking 4 minutes. For
> > us, it's always on the order of seconds. That's true even for a large 50+gb
> > region. It might be worth looking into why that's so slow for you.
> >
> > On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah
> > <rushabh.s...@salesforce.com.invalid> wrote:
> >
> > > Thank you Andrew, Bryan and Duo for your responses.
> > >
> > > > My main thought is that a migration like this should use bulk loading,
> > > > But also, I think, that data transfer should be in bulk
> > >
> > > We are working on moving to bulk loading.
> > >
> > > > With Admin.splitRegion, you can specify a split point. You can use that
> > > > to iteratively add a bunch of regions wherever you need them in the
> > > > keyspace. Yes, it's 2 at a time, but it should still be quick enough in
> > > > the grand scheme of a large migration.
> > >
> > >
> > > Here are some back-of-the-envelope calculations.
> > > In a production environment, it took around 4 minutes to split a recently
> > > split region which had 4 store files with a total of 5 GB of data.
> > > Assume we are migrating 5000 tenants at a time; normally around 10% of the
> > > tenants (500 tenants) have data spread across more than 1000 regions. We
> > > have around 10 huge tables where we store the tenants' data for different
> > > use cases. All the above numbers are on the *conservative* side.
> > >
> > > To create a split structure for 1000 regions, we need 10 iterations of
> > > splits (2^10 = 1024), assuming we split the regions in parallel within each
> > > iteration. Each split takes around 4 minutes, so creating 1000 regions for
> > > just 1 tenant and 1 table takes around 40 minutes. For 10 tables for 1
> > > tenant, it takes around 400 minutes.
> > >
> > > For 500 tenants, this will take around *140 days*. To reduce this time
> > > further, we could also create the split structure for each tenant and each
> > > table in parallel, but this would put a lot of pressure on the cluster,
> > > would require a lot of operational overhead, and we would still end up with
> > > the whole process taking days, if not months.
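
(Spelling out that arithmetic, under the assumptions above of ~4 minutes per
split, ~1000 regions per tenant-table via iterative binary splits, 10 tables per
tenant, and 500 large tenants:)

    public class SplitTimeEstimate {
      public static void main(String[] args) {
        double minutesPerSplit = 4;                                 // observed in prod
        int rounds = (int) Math.ceil(Math.log(1000) / Math.log(2)); // 10 rounds, 2^10 = 1024
        double perTable = rounds * minutesPerSplit;                 // ~40 minutes per table
        double perTenant = perTable * 10;                           // ~400 minutes for 10 tables
        double totalMinutes = perTenant * 500;                      // 500 large tenants
        System.out.printf("per table: %.0f min, per tenant: %.0f min, total: ~%.0f days%n",
            perTable, perTenant, totalMinutes / (60 * 24));         // ~139 days
      }
    }
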
> > >
> > > Since we are moving our infrastructure to Public Cloud, we anticipate this
> > > huge migration happening once every month.
> > >
> > >
> > > > Adding a splitRegion method that takes byte[][] for multiple split points
> > > > would be a nice UX improvement, but not strictly necessary.
> > >
> > > IMHO for all the reasons stated above, I believe this is necessary.
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Jan 29, 2024 at 6:25 AM 张铎(Duo Zhang) <palomino...@gmail.com>
> > > wrote:
> > >
> > > > As it is called 'pre' split, it means that it can only happen when
> > > > there is no data in the table.
> > > >
> > > > If there is already data in the table, you can not always create
> > > > 'empty' regions, as you do not know whether there is already data in
> > > > the given range...
> > > >
> > > > And technically, if you want to split an HFile into more than 2 parts,
> > > > you need to design a new algorithm, as currently HBase only supports
> > > > top references and bottom references...
> > > >
> > > > Thanks.
> > > >
> > > > Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Sat, Jan 27, 2024 at 02:16:
> > > > >
> > > > > My main thought is that a migration like this should use bulk loading,
> > > > > which should be relatively easy given you already use MR
> > > > > (HFileOutputFormat2). It doesn't solve the region-splitting problem. With
> > > > > Admin.splitRegion, you can specify a split point. You can use that to
> > > > > iteratively add a bunch of regions wherever you need them in the
> > > > > keyspace. Yes, it's 2 at a time, but it should still be quick enough in
> > > > > the grand scheme of a large migration. Adding a splitRegion method that
> > > > > takes byte[][] for multiple split points would be a nice UX improvement,
> > > > > but not strictly necessary.
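
(For the bulk-load leg Bryan mentions, a minimal sketch of the driver side; the
copy job's input and mapper wiring is assumed to be configured elsewhere, the
output directory is a placeholder, and this uses the 2.x BulkLoadHFiles tool --
older versions would use LoadIncrementalHFiles instead:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.tool.BulkLoadHFiles;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TenantBulkLoad {
      // `job` is assumed to already carry the MR copy job's input and mapper
      // wiring; this adds the HFile output side and then bulk-loads the result.
      public static void writeAndLoad(Configuration conf, Job job, TableName table,
          Path hfileDir) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table htable = conn.getTable(table);
             RegionLocator locator = conn.getRegionLocator(table)) {
          // Installs a partitioner keyed off the target table's region boundaries,
          // so reducers write one HFile set per region, which is why aligning
          // boundaries up front pays off at load time too.
          HFileOutputFormat2.configureIncrementalLoad(job, htable, locator);
          FileOutputFormat.setOutputPath(job, hfileDir);
          if (!job.waitForCompletion(true)) {
            throw new IllegalStateException("copy job failed");
          }
          // Atomically move the generated HFiles into the target table's regions.
          BulkLoadHFiles.create(conf).bulkLoad(table, hfileDir);
        }
      }
    }
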
> > > > >
> > > > > On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah
> > > > > <rushabh.s...@salesforce.com.invalid> wrote:
> > > > >
> > > > > > Hi Everyone,
> > > > > > At my workplace, we use HBase + Phoenix to run our customer workloads.
> > > > > > Most of our Phoenix tables are multi-tenant and we store the tenantID
> > > > > > as the leading part of the rowkey. Each tenant belongs to only 1 HBase
> > > > > > cluster. Due to capacity planning, hardware refresh cycles and, most
> > > > > > recently, the move to public cloud initiatives, we have to migrate a
> > > > > > tenant from one HBase cluster (source cluster) to another HBase
> > > > > > cluster (target cluster). Normally we migrate a lot of tenants (in the
> > > > > > 10s of thousands) at a time, and hence we have to copy a huge amount
> > > > > > of data (in TBs) from multiple source clusters to a single target
> > > > > > cluster. We have an internal tool which uses the MapReduce framework
> > > > > > to copy the data. Since all of these tenants don't have any presence
> > > > > > on the target cluster (note that the table is NOT empty, since we have
> > > > > > data for other tenants in the target cluster), they start with one
> > > > > > region, and through the organic split process the data gets
> > > > > > distributed among different regions and different regionservers. But
> > > > > > the organic splitting process takes a lot of time, and due to the
> > > > > > distributed nature of the MR framework it causes hotspotting issues on
> > > > > > the target cluster which often last for days. This causes availability
> > > > > > issues where the CPU and/or disks are saturated on the regionservers
> > > > > > ingesting the data. It also causes a lot of replication-related alerts
> > > > > > (age of last ship, LogQueue size) which go on for days.
> > > > > >
> > > > > > In order to handle the huge influx of data, we should ideally
> > > > > > pre-split the table on the target based on the split structure present
> > > > > > on the source cluster. If we pre-split and create empty regions with
> > > > > > the right region boundaries, it will help distribute the load to
> > > > > > different regions and region servers and will prevent hotspotting.
> > > > > >
> > > > > > Problems with the above approach:
> > > > > > 1. Currently we allow pre-splitting only while creating a new table.
> > > > > > But in our production env, we already have the table created for other
> > > > > > tenants, so we would like to pre-split an existing table for new
> > > > > > tenants.
> > > > > > 2. Currently we split a given region into just 2 daughter regions. But
> > > > > > if we have the split point information from the source cluster, and
> > > > > > the data for the to-be-migrated tenant is split across 100 regions on
> > > > > > the source side, we would ideally like to create 100 empty regions on
> > > > > > the target cluster.
> > > > > >
> > > > > > Trying to get early feedback from the community. Do you all think
> > > > > > this is a good idea? Open to other suggestions also.
> > > > > >
> > > > > >
> > > > > > Thank you,
> > > > > > Rushabh.
> > > > > >
> > > >
> > >
> >
>
>
> --
> Best regards,
> Andrew
>
> Unrest, ignorance distilled, nihilistic imbeciles -
>     It's what we’ve earned
> Welcome, apocalypse, what’s taken you so long?
> Bring us the fitting end that we’ve been counting on
>    - A23, Welcome, Apocalypse
>
