Thank you Andrew, Duo and Bryan for the helpful discussion. I will create a JIRA soon.
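For concreteness, here is a rough sketch of how we would have to drive today's two-way split API to reach a set of target boundaries, versus the single multi-point request discussed below. This is only an illustration against the HBase 2.x Admin client (getRegions/splitRegionAsync); the byte[][] overload shown in the trailing comment is the hypothetical API from this thread, not something that exists today, and its exact shape would be settled in the JIRA.

  import java.util.Arrays;
  import java.util.List;

  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.RegionInfo;

  public final class SplitToBoundaries {

    // Split 'table' until every key in 'splitPoints' is a region boundary,
    // using today's one-point-at-a-time API. Each call yields exactly one
    // daughter pair; in practice the daughters are not splittable again
    // until the post-split compaction finishes, which is where the hours
    // add up.
    static void splitTo(Admin admin, TableName table, List<byte[]> splitPoints)
        throws Exception {
      for (byte[] point : splitPoints) {
        // Re-read the region list each time so newly created daughters are seen.
        for (RegionInfo region : admin.getRegions(table)) {
          boolean alreadyBoundary = Arrays.equals(region.getStartKey(), point);
          if (!alreadyBoundary && region.containsRow(point)) {
            // Two-way split at the requested point; blocks until the split
            // procedure completes.
            admin.splitRegionAsync(region.getRegionName(), point).get();
            break;
          }
        }
      }
      // The overload discussed in this thread would collapse the loop into a
      // single request per region, e.g. (hypothetical, not an existing API):
      // admin.splitRegionAsync(regionName, new byte[][] { p1, p2, p3 });
    }
  }

As Andrew points out below, the expensive part is not the split request itself but waiting for the daughters to finish post-split compaction and become splittable again, which is why a single N-way split per region is attractive.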
On Fri, Feb 9, 2024 at 7:13 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

> Actually the 'split HFile by reading and then writing to multiple places' approach does not increase the offline region time a lot, it just increases the overall split processing time.
>
> Anyway, these are all just some initial thoughts. If we think this is a useful new feature, we could open an issue and also a design doc.
>
> Thanks.
>
> Andrew Purtell <andrew.purt...@gmail.com> wrote on Fri, Feb 9, 2024 at 11:09:
> >
> > Agreed, the key is going to be changing the design of references to support this.
> >
> > Splits must continue to minimize offline time. Increasing that time at all would not be acceptable.
> >
> > > On Feb 8, 2024, at 5:46 PM, 张铎 <palomino...@gmail.com> wrote:
> > >
> > > As I said above, split is not like merge; simply changing the Admin API to take a byte[][] does not actually help here.
> > >
> > > For online splitting of a region, our algorithm only supports splitting into two sub regions, and then we do a compaction to clean up all the reference files and prepare for the next split.
> > >
> > > I'm not saying this is impossible, but I think the most difficult challenge here is how to design a new reference file algorithm to support referencing a 'range' of an HFile, not only the top half or bottom half. In this way we could support splitting a region directly into more than 2 sub regions.
> > >
> > > Or maybe we could change the way we split regions: instead of creating reference files, we directly read an HFile and output multiple HFiles for different ranges, put them in different region directories, and also make the flush write to multiple places (the current region and all the sub regions). Once everything is fine, we offline the old region (also making it do a final flush), and then online all the sub regions. In this way we have much lower overall write amplification, but the problem is it will take a very long time when splitting, and also the failure recovery will be more complicated.
> > >
> > > Thanks.
> > >
> > > Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Fri, Feb 9, 2024 at 09:27:
> > >>
> > >> Yep, I forgot about that nuance. I agree we can add a splitRegion overload which takes a byte[][] for multiple split points.
> > >>
> > >>> On Thu, Feb 8, 2024 at 8:23 PM Andrew Purtell <apurt...@apache.org> wrote:
> > >>>
> > >>> Rushabh already covered this, but splitting is not complete until the region can be split again. This is a very important nuance. The daughter regions are online very quickly, as designed, but then background housekeeping (compaction) must copy the data before the daughters become splittable. Depending on compaction pressure, compaction queue depth, and the settings of various tunables, waiting for some split daughters to become ready to split again can take many minutes to hours.
> > >>>
> > >>> So let's say we have replicas of a table at two sites, site A and site B. The region boundaries of this table in A and B will be different. Now let's also say that table data is stored with a key prefix mapping to every unique tenant. When migrating a tenant, data copy will hotspot on the region(s) hosting keys with the tenant's prefix. This is fine if there are enough regions to absorb the load. We run into trouble when the region boundaries in the sub-keyspace of interest are quite different in B versus A. We get hotspotting and impact to operations until organic splitting eventually mitigates the hotspotting, but this might also require many minutes to hours, with noticeable performance degradation in the meantime. To avoid that degradation we pace the sender, but then the copy may take so long as to miss the SLA for the migration. To make the data movement performant and stay within SLA we want to apply one or more splits or merges so the region boundaries in B roughly align to A, avoiding hotspotting. This will also make shipping this data by bulk load more efficient, by minimizing the amount of HFile splitting necessary to load the files at the receiver.
> > >>>
> > >>> So let's say we have some regions that need to be split N ways, where N is on the order of ~10 (more than 1 and less than 100), in order to (roughly) align region boundaries. We think this calls for an enhancement to the split request API where the split should produce a requested number of daughter pairs. Today that is always 1 pair. Instead we might want 2, 5, 10, conceivably more. And it would be nice if guideposts for multi-way splitting could be sent over as a byte[][].
> > >>>
> > >>> On Wed, Feb 7, 2024 at 10:03 AM Bryan Beaudreault <bbeaudrea...@apache.org> wrote:
> > >>>
> > >>>> This is the first time I've heard of a region split taking 4 minutes. For us, it's always on the order of seconds. That's true even for a large 50+ GB region. It might be worth looking into why that's so slow for you.
> > >>>>
> > >>>> On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> > >>>>
> > >>>>> Thank you Andrew, Bryan and Duo for your responses.
> > >>>>>
> > >>>>>> My main thought is that a migration like this should use bulk loading,
> > >>>>>> But also, I think, that data transfer should be in bulk
> > >>>>>
> > >>>>> We are working on moving to bulk loading.
> > >>>>>
> > >>>>>> With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration.
> > >>>>>
> > >>>>> Trying to do some back-of-the-envelope calculations.
> > >>>>> In a production environment, it took around 4 minutes to split a recently split region which had 4 store files with a total of 5 GB of data. Assume we are migrating 5000 tenants at a time, and normally around 10% of the tenants (500 tenants) have data spread across more than 1000 regions. We have around 10 huge tables where we store the tenants' data for different use cases. All the above numbers are on the *conservative* side.
> > >>>>>
> > >>>>> To create a split structure of 1000 regions, we need 10 rounds of splits (2^10 = 1024), assuming we split the regions in parallel. Each split takes around 4 minutes, so to create 1000 regions for just 1 tenant and 1 table it takes around 40 minutes. For 10 tables for 1 tenant, it takes around 400 minutes.
> > >>>>>
> > >>>>> For 500 tenants, this will take around *140 days*. To reduce this time further, we can also create the split structure for each tenant and each table in parallel. But this would put a lot of pressure on the cluster, require a lot of operational overhead, and still leave the whole process taking days, if not months.
> > >>>>>
> > >>>>> Since we are moving our infrastructure to the public cloud, we anticipate this huge migration happening once every month.
> > >>>>>
> > >>>>>> Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> > >>>>>
> > >>>>> IMHO, for all the reasons stated above, I believe this is necessary.
> > >>>>>
> > >>>>> On Mon, Jan 29, 2024 at 6:25 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> > >>>>>
> > >>>>>> As it is called 'pre' split, it means that it can only happen when there is no data in the table.
> > >>>>>>
> > >>>>>> If there is already data in the table, you cannot always create 'empty' regions, as you do not know whether there is already data in the given range...
> > >>>>>>
> > >>>>>> And technically, if you want to split an HFile into more than 2 parts, you need to design a new algorithm, as right now in HBase we only support top references and bottom references...
> > >>>>>>
> > >>>>>> Thanks.
> > >>>>>>
> > >>>>>> Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Sat, Jan 27, 2024 at 02:16:
> > >>>>>>>
> > >>>>>>> My main thought is that a migration like this should use bulk loading, which should be relatively easy given you already use MR (HFileOutputFormat2). It doesn't solve the region-splitting problem. With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration. Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> > >>>>>>>
> > >>>>>>> On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> > >>>>>>>
> > >>>>>>>> Hi Everyone,
> > >>>>>>>> At my workplace, we use HBase + Phoenix to run our customer workloads. Most of our Phoenix tables are multi-tenant and we store the tenantID as the leading part of the rowkey. Each tenant belongs to only 1 HBase cluster. Due to capacity planning, hardware refresh cycles and, most recently, public cloud initiatives, we have to migrate a tenant from one HBase cluster (source cluster) to another HBase cluster (target cluster). Normally we migrate a lot of tenants (in the tens of thousands) at a time and hence have to copy a huge amount of data (in TBs) from multiple source clusters to a single target cluster. We have an internal tool which uses the MapReduce framework to copy the data. Since all of these tenants don't have any presence on the target cluster (note that the table is NOT empty, since we have data for other tenants in the target cluster), they start with one region, and through the organic split process the data gets distributed among different regions and different regionservers. But the organic splitting process takes a lot of time and, due to the distributed nature of the MR framework, it causes hotspotting on the target cluster which often lasts for days. This causes availability issues where the CPU and/or disks are saturated on the regionservers ingesting the data. It also causes a lot of replication-related alerts (Age of last ship, LogQueue size) which go on for days.
> > >>>>>>>>
> > >>>>>>>> In order to handle the huge influx of data, we should ideally pre-split the table on the target based on the split structure present on the source cluster. If we pre-split and create empty regions with the right region boundaries, it will help distribute the load to different regions and region servers and will prevent hotspotting.
> > >>>>>>>>
> > >>>>>>>> Problems with the above approach:
> > >>>>>>>> 1. Currently we allow pre-splitting only while creating a new table. But in our production env, we already have the table created for other tenants. So we would like to pre-split an existing table for new tenants.
> > >>>>>>>> 2. Currently we split a given region into just 2 daughter regions. But if we have the split point information from the source cluster and the data for the to-be-migrated tenant is split across 100 regions on the source side, we would ideally like to create 100 empty regions on the target cluster.
> > >>>>>>>>
> > >>>>>>>> Trying to get early feedback from the community. Do you all think this is a good idea? Open to other suggestions also.
> > >>>>>>>>
> > >>>>>>>> Thank you,
> > >>>>>>>> Rushabh.
> > >>>
> > >>> --
> > >>> Best regards,
> > >>> Andrew
> > >>>
> > >>> Unrest, ignorance distilled, nihilistic imbeciles -
> > >>> It's what we've earned
> > >>> Welcome, apocalypse, what's taken you so long?
> > >>> Bring us the fitting end that we've been counting on
> > >>> - A23, Welcome, Apocalypse