Actually, the 'split an HFile by reading it and then writing to multiple places' approach does not increase the region offline time much; it mainly increases the overall split processing time.
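To make that concrete, here is a rough, compile-only sketch of the data flow. readParentStoreFile and writerForDaughter are placeholders, not existing HBase APIs; a real implementation would sit on top of the HFile reader and StoreFileWriter machinery.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellComparator;

// Illustrative only: stream the cells of one parent store file into N
// range-bounded outputs while the parent region stays online. The reader
// and writer plumbing is abstracted away behind placeholder methods.
public abstract class MultiWaySplitSketch {

  /** Placeholder: cells of one parent store file, in sorted order. */
  protected abstract Iterable<Cell> readParentStoreFile() throws IOException;

  /** Placeholder: open a writer for the i-th daughter region's directory. */
  protected abstract CellSink writerForDaughter(int daughterIndex) throws IOException;

  public interface CellSink {
    void append(Cell cell) throws IOException;
    void close() throws IOException;
  }

  /** splitPoints are sorted row keys; the number of daughters is splitPoints.size() + 1. */
  public void rewrite(List<byte[]> splitPoints) throws IOException {
    int daughter = 0;
    CellSink out = writerForDaughter(daughter);
    for (Cell cell : readParentStoreFile()) {
      // Roll to the next daughter once this cell's row reaches the next split point.
      while (daughter < splitPoints.size()
          && CellComparator.getInstance().compareRows(cell, splitPoints.get(daughter), 0,
            splitPoints.get(daughter).length) >= 0) {
        out.close();
        daughter++;
        out = writerForDaughter(daughter);
      }
      out.append(cell);
    }
    out.close();
  }
}

With this flow the parent only needs a short offline window at the very end, for the final flush and the metadata swap, which is why the cost shows up as extra split processing time rather than extra offline time.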
Anyway, all these are just some initial thoughts. If we think this is a useful new feature, we could open an issue and also a design doc. Thanks.

Andrew Purtell <andrew.purt...@gmail.com> wrote on Fri, Feb 9, 2024 at 11:09:
>
> Agreed, the key is going to be changing the design of references to support this new capability.
>
> Splits must continue to minimize offline time. Increasing that time at all would not be acceptable.
>
> On Feb 8, 2024, at 5:46 PM, 张铎 <palomino...@gmail.com> wrote:
> >
> > As I said above, split is not like merge; simply changing the Admin API to take a byte[][] does not actually help here.
> >
> > For online splitting of a region, our algorithm only supports splitting into two sub regions, and then we do a compaction to clean up all the reference files and prepare for the next split.
> >
> > I'm not saying this is impossible, but I think the most difficult challenge here is how to design a new reference file algorithm to support referencing a 'range' of an HFile, not only the top half or the bottom half. In this way we could support splitting a region directly into more than 2 sub regions.
> >
> > Or maybe we could change the way we split regions: instead of creating reference files, we directly read an HFile and output multiple HFiles for the different ranges, put them in the different region directories, and also make flushes write to multiple places (the current region and all the sub regions). Once everything is fine, we offline the old region (also making it do a final flush) and then online all the sub regions. In this way we have much lower overall write amplification, but the problem is that the split itself will take a very long time, and failure recovery will be more complicated.
> >
> > Thanks.
> >
> > Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Fri, Feb 9, 2024 at 09:27:
> >>
> >> Yep, I forgot about that nuance. I agree we can add a splitRegion overload which takes a byte[][] for multiple split points.
> >>
> >>> On Thu, Feb 8, 2024 at 8:23 PM Andrew Purtell <apurt...@apache.org> wrote:
> >>>
> >>> Rushabh already covered this, but splitting is not complete until the region can be split again. This is a very important nuance. The daughter regions are online very quickly, as designed, but then background housekeeping (compaction) must copy the data before the daughters become splittable. Depending on compaction pressure, compaction queue depth, and the settings of various tunables, waiting for some split daughters to become ready to split again can take many minutes to hours.
> >>>
> >>> So let's say we have replicas of a table at two sites, site A and site B. The region boundaries of this table in A and B will be different. Now let's also say that table data is stored with a key prefix mapping to every unique tenant. When migrating a tenant, the data copy will hotspot on the region(s) hosting keys with the tenant's prefix. This is fine if there are enough regions to absorb the load. We run into trouble when the region boundaries in the sub-keyspace of interest are quite different in B versus A. We get hotspotting and impact to operations until organic splitting eventually mitigates the hotspotting, but this might also require many minutes to hours, with noticeable performance degradation in the meantime.
> >>> To avoid that degradation we pace the sender, but then the copy may take so long as to miss the SLA for the migration. To make the data movement performant and stay within the SLA, we want to apply one or more splits or merges so the region boundaries in B roughly align to A, avoiding hotspotting. This will also make shipping this data by bulk load instead more efficient, by minimizing the amount of HFile splitting necessary to load the files at the receiver.
> >>>
> >>> So let's say we have some regions that need to be split N ways, where N is on the order of ~10, by which I mean more than 1 and less than 100, in order to (roughly) align region boundaries. We think this calls for an enhancement to the split request API where the split should produce a requested number of daughter pairs. Today that is always 1 pair. Instead we might want 2, 5, 10, conceivably more. And it would be nice if the guideposts for multi-way splitting could be sent over in a byte[][].
> >>>
> >>> On Wed, Feb 7, 2024 at 10:03 AM Bryan Beaudreault <bbeaudrea...@apache.org> wrote:
> >>>
> >>>> This is the first time I've heard of a region split taking 4 minutes. For us, it's always on the order of seconds. That's true even for a large 50+ GB region. It might be worth looking into why that's so slow for you.
> >>>>
> >>>> On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> >>>>>
> >>>>> Thank you Andrew, Bryan and Duo for your responses.
> >>>>>
> >>>>>> My main thought is that a migration like this should use bulk loading,
> >>>>>> But also, I think, that data transfer should be in bulk
> >>>>>
> >>>>> We are working on moving to bulk loading.
> >>>>>
> >>>>>> With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration.
> >>>>>
> >>>>> Trying to do some back-of-the-envelope calculations.
> >>>>> In a production environment, it took around 4 minutes to split a recently split region which had 4 store files with a total of 5 GB of data. Assume we are migrating 5000 tenants at a time; normally around 10% of the tenants (500 tenants) have data spread across more than 1000 regions. We have around 10 huge tables where we store the tenants' data for different use cases. All the above numbers are on the *conservative* side.
> >>>>>
> >>>>> To create a split structure for 1000 regions, we need 10 iterations of splits (2^10 = 1024). This assumes we are splitting the regions in parallel. Each split takes around 4 minutes. So to create 1000 regions just for 1 tenant and 1 table, it takes around 40 minutes. For 10 tables for 1 tenant, it takes around 400 minutes.
> >>>>>
> >>>>> For 500 tenants, this will take around *140 days*. To reduce this time further, we could also create the split structure for each tenant and each table in parallel, but that would put a lot of pressure on the cluster, require a lot of operational overhead, and we would still end up with the whole process taking days, if not months.
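For reference, the arithmetic behind that ~140 day figure, using the numbers above (4 minutes per split, 10 sequential rounds of parallel splits to reach ~1024 regions, 10 tables, 500 large tenants handled one after another):

    10 rounds x 4 min      =  40 min per table per tenant
    40 min x 10 tables     = 400 min per tenant
    400 min x 500 tenants  = 200,000 min, roughly 3,333 hours, roughly 139 days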
> >>>>>
> >>>>> Since we are moving our infrastructure to the public cloud, we anticipate this kind of huge migration happening once every month.
> >>>>>
> >>>>>> Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> >>>>>
> >>>>> IMHO, for all the reasons stated above, I believe this is necessary.
> >>>>>
> >>>>> On Mon, Jan 29, 2024 at 6:25 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> >>>>>>
> >>>>>> As it is called 'pre' split, it means that it can only happen when there is no data in the table.
> >>>>>>
> >>>>>> If there is already data in the table, you cannot always create 'empty' regions, as you do not know whether there is already data in the given range...
> >>>>>>
> >>>>>> And technically, if you want to split an HFile into more than 2 parts, you need to design a new algorithm, as right now HBase only supports a top reference and a bottom reference...
> >>>>>>
> >>>>>> Thanks.
> >>>>>>
> >>>>>> Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Sat, Jan 27, 2024 at 02:16:
> >>>>>>>
> >>>>>>> My main thought is that a migration like this should use bulk loading, which should be relatively easy given you already use MR (HFileOutputFormat2). It doesn't solve the region-splitting problem. With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration. Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> >>>>>>>
> >>>>>>> On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> >>>>>>>>
> >>>>>>>> Hi Everyone,
> >>>>>>>> At my workplace, we use HBase + Phoenix to run our customer workloads. Most of our Phoenix tables are multi-tenant and we store the tenantID as the leading part of the rowkey. Each tenant belongs to only 1 HBase cluster. Due to capacity planning, hardware refresh cycles and, most recently, move-to-public-cloud initiatives, we have to migrate a tenant from one HBase cluster (source cluster) to another HBase cluster (target cluster). Normally we migrate a lot of tenants (in the 10s of thousands) at a time and hence we have to copy a huge amount of data (in TBs) from multiple source clusters to a single target cluster. We have an internal tool which uses the MapReduce framework to copy the data. Since these tenants don't have any presence on the target cluster (note that the table is NOT empty, since we have data for other tenants in the target cluster), they start with one region, and through the organic split process the data gets distributed among different regions and different regionservers.
> >>>>>>>> But the organic splitting process takes a lot of time, and due to the distributed nature of the MR framework it causes hotspotting issues on the target cluster which often last for days. This causes availability issues where the CPU is saturated and/or the disks are saturated on the regionservers ingesting the data. Also this causes a lot of replication-related alerts (Age of last ship, LogQueue size) which go on for days.
> >>>>>>>>
> >>>>>>>> In order to handle the huge influx of data, we should ideally pre-split the table on the target based on the split structure present on the source cluster. If we pre-split and create empty regions with the right region boundaries, it will help distribute the load to different regions and region servers and will prevent hotspotting.
> >>>>>>>>
> >>>>>>>> Problems with the above approach:
> >>>>>>>> 1. Currently we allow pre-splitting only while creating a new table. But in our production env, we already have the table created for other tenants. So we would like to pre-split an existing table for new tenants.
> >>>>>>>> 2. Currently we split a given region into just 2 daughter regions. But if we have the split point information from the source cluster, and the data for the to-be-migrated tenant is split across 100 regions on the source side, we would ideally like to create 100 empty regions on the target cluster.
> >>>>>>>>
> >>>>>>>> Trying to get early feedback from the community. Do you all think this is a good idea? Open to other suggestions also.
> >>>>>>>>
> >>>>>>>> Thank you,
> >>>>>>>> Rushabh.
> >>>
> >>> --
> >>> Best regards,
> >>> Andrew
> >>>
> >>> Unrest, ignorance distilled, nihilistic imbeciles -
> >>> It's what we’ve earned
> >>> Welcome, apocalypse, what’s taken you so long?
> >>> Bring us the fitting end that we’ve been counting on
> >>>   - A23, Welcome, Apocalypse
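For anyone following along, the workaround Bryan describes above (iteratively splitting an existing, non-empty table at boundaries copied from the source cluster, two daughters at a time, with the current public Admin API) looks roughly like the sketch below. The table name and split points are made up for illustration, and in practice you would also wait for each daughter's post-split compaction to finish before splitting it again, which is exactly the delay discussed earlier in the thread.

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionInfo;
import org.apache.hadoop.hbase.util.Bytes;

public class IterativePreSplit {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName table = TableName.valueOf("tenant_data");  // made-up table name
    byte[][] splitPoints = {                              // made-up boundaries copied from the source cluster
      Bytes.toBytes("tenantA_010"),
      Bytes.toBytes("tenantA_020"),
      Bytes.toBytes("tenantA_030"),
    };
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      for (byte[] splitPoint : splitPoints) {
        // Find the region that currently contains this split point.
        RegionInfo target = null;
        for (RegionInfo ri : admin.getRegions(table)) {
          if (ri.containsRow(splitPoint)) {
            target = ri;
            break;
          }
        }
        if (target == null || Bytes.equals(target.getStartKey(), splitPoint)) {
          continue; // already a region boundary, nothing to do
        }
        // One call produces exactly two daughters, which is why a byte[][]
        // overload (or a true multi-way split) would be a nicer fit here.
        admin.splitRegionAsync(target.getRegionName(), splitPoint).get(10, TimeUnit.MINUTES);
      }
    }
  }
}

The multi-split overload being proposed would collapse the inner loop into a single request, e.g. something shaped like splitRegionAsync(regionName, byte[][] splitPoints); that signature does not exist today and is only meant to illustrate the API shape under discussion.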