Hi Everyone, Very nice discussion around the issues we may face with multi-region splitting. I have written up how these challenges can be handled in the design doc linked below. Please spare a few cycles to review it and share your feedback.
https://docs.google.com/document/d/1CW_5NAEG4NdU8jrxmTg07DpCByiF1va0Tsrze8kMEYo/edit Thanks, Rajeshbabu. On Fri, Feb 9, 2024 at 11:40 PM Viraj Jasani <vjas...@apache.org> wrote: > I agree with above points that split taking more time should be equivalent > to actual writes being blocked ever since parent region is brought offline > and hence it is an issue with the availability. So, more region splits with > the same approach does mean that we have longer availability issue. > Daughter regions are created only after the parent region is brought > offline. > > Besides, all the complications with split procedure going wrong and whether > we rollback or only roll forward -- depending on whether we have reached > PONR in the procedure -- would all now be multiplied with the number of > splits we consider. > > > On Fri, Feb 9, 2024 at 9:05 AM Andrew Purtell <apurt...@apache.org> wrote: > > > > Or maybe we could change the way on how we split regions, instead of > > > creating reference files, we directly read a HFile and output multiple > > > HFiles in different ranges, put them in different region directories > > > > > Actually the 'split HFile by reading and then writing to multiple > > > places' approach does not increase the offline region time a lot, it > > > just increases the overall split processing time. > > > > Those file writes would need to complete before the daughters are brought > > online, unless we keep the parent open until all the daughter writes are > > completed, and then we have the problem of what to do with writes to the > > parent in the meantime. They would need to be sent on to the appropriate > > daughter too. If we try to avoid this by making the parent read only > during > > the split, that would be equivalent to the parent being offline in our > > production. Not being ready for writes is equivalent to not available. > > > > Otherwise I have to believe that writing real HFiles instead of > references > > would take longer than writing references, and so assuming parent is > > offline while daughters are being created as is the case today, the total > > time offline during the split would have to be longer. When writing > > reference files there is the file create time and that's it. When writing > > real HFiles there is the file create time and then the time it takes to > > stream through ranges of however many GB of data exist in the region, > many > > read and write IOs. > > > > > > On Fri, Feb 9, 2024 at 7:13 AM 张铎(Duo Zhang) <palomino...@gmail.com> > > wrote: > > > > > Actually the 'split HFile by reading and then writing to multiple > > > places' approach does not increase the offline region time a lot, it > > > just increases the overall split processing time. > > > > > > Anyway, all these things are all just some initial thoughts. If we > > > think this is a useful new feature, we could open an issue and also > > > design doc. > > > > > > Thanks. > > > > > > Andrew Purtell <andrew.purt...@gmail.com> 于2024年2月9日周五 11:09写道: > > > > > > > > Agreed, the key is going to be changing the design of references to > > > support this new. > > > > > > > > Splits must continue to minimize offline time. Increasing that time > at > > > all would not be acceptable. > > > > > > > > > On Feb 8, 2024, at 5:46 PM, 张铎 <palomino...@gmail.com> wrote: > > > > > > > > > > As I said above, split is not like merge, just simply changing the > > > > > Admin API to take a byte[][] does not actually help here. 
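For concreteness, a minimal sketch of the byte[][] overload being discussed, using a hypothetical MultiPointSplitAdmin interface purely for illustration; no such method exists on org.apache.hadoop.hbase.client.Admin today, and as noted above the API shape is the easy part compared to the reference-file design:

    import java.io.IOException;
    import java.util.concurrent.Future;

    // Hypothetical shape of the multi-point split API discussed in this thread.
    // The existing Admin.splitRegionAsync(byte[], byte[]) takes a single split
    // point; this byte[][] variant is NOT part of HBase.
    public interface MultiPointSplitAdmin {
      /**
       * Ask the master to split the given region into splitPoints.length + 1
       * daughters, using the sorted, non-overlapping split points as the new
       * region boundaries.
       */
      Future<Void> splitRegionAsync(byte[] regionName, byte[][] splitPoints)
          throws IOException;
    }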
> > > > > > > > > > For online splitting a region, our algorithm only supports > splitting > > > > > to two sub regions, and then we do compaction to clean up all the > > > > > reference files and prepare for the next split. > > > > > > > > > > I'm not saying this is impossible, but I think the most difficult > > > > > challenge here is how to design a new reference file algorithm to > > > > > support referencing a 'range' of a HFile, not only top half or > bottom > > > > > half. > > > > > In this way we can support splitting a region directly to more > than 2 > > > > > sub regions. > > > > > > > > > > Or maybe we could change the way on how we split regions, instead > of > > > > > creating reference files, we directly read a HFile and output > > multiple > > > > > HFiles in different ranges, put them in different region > directories, > > > > > and also make the flush write to multiple places(the current region > > > > > and all the sub regions), and once everything is fine, we offline > the > > > > > old region(also makes it do a final flush), and then online all the > > > > > sub regions. In this way we have nuch lower overall write > > > > > amplification, but the problem is it will take a very long time > when > > > > > splitting, and also the fail recovery will be more complicated. > > > > > > > > > > Thanks. > > > > > > > > > > Bryan Beaudreault <bbeaudrea...@apache.org> 于2024年2月9日周五 09:27写道: > > > > >> > > > > >> Yep, I forgot about that nuance. I agree we can add a splitRegion > > > overload > > > > >> which takes a byte[][] for multiple split points. > > > > >> > > > > >>> On Thu, Feb 8, 2024 at 8:23 PM Andrew Purtell < > apurt...@apache.org > > > > > > wrote: > > > > >>> > > > > >>> Rushabh already covered this but splitting is not complete until > > the > > > region > > > > >>> can be split again. This is a very important nuance. The daughter > > > regions > > > > >>> are online very quickly, as designed, but then background > > > housekeeping > > > > >>> (compaction) must copy the data before the daughters become > > > splittable. > > > > >>> Depending on compaction pressure, compaction queue depth, and the > > > settings > > > > >>> of various tunables, waiting for some split daughters to become > > > ready to > > > > >>> split again can take many minutes to hours. > > > > >>> > > > > >>> So let's say we have replicas of a table at two sites, site A and > > > site B. > > > > >>> The region boundaries of this table in A and B will be different. > > > Now let's > > > > >>> also say that table data is stored with a key prefix mapping to > > every > > > > >>> unique tenant. When migrating a tenant, data copy will hotspot on > > the > > > > >>> region(s) hosting keys with the tenant's prefix. This is fine if > > > there are > > > > >>> enough regions to absorb the load. We run into trouble when the > > > region > > > > >>> boundaries in the sub-keyspace of interest are quite different > in B > > > versus > > > > >>> A. We get hotspotting and impact to operations until organic > > > splitting > > > > >>> eventually mitigates the hotspotting, but this might also require > > > many > > > > >>> minutes to hours, with noticeable performance degradation in the > > > meantime. > > > > >>> To avoid that degradation we pace the sender but then the copy > may > > > take so > > > > >>> long as to miss SLA for the migration. 
To make the data movement > > > performant > > > > >>> and stay within SLA we want to apply one or more splits or merges > > so > > > the > > > > >>> region boundaries B roughly align to A, avoiding hotspotting. > This > > > will > > > > >>> also make shipping this data by bulk load instead efficient too > by > > > > >>> minimizing the amount of HFile splitting necessary to load them > at > > > the > > > > >>> receiver. > > > > >>> > > > > >>> So let's say we have some regions that need to be split N ways, > > > where N is > > > > >>> order of ~10, by that I mean more than 1 and less than 100, in > > order > > > to > > > > >>> (roughly) align region boundaries. We think this calls for an > > > enhancement > > > > >>> to the split request API where the split should produce a > requested > > > number > > > > >>> of daughter-pairs. Today that is always 1 pair. Instead we might > > > want 2, 5, > > > > >>> 10, conceivably more. And it would be nice if guideposts for > > > multi-way > > > > >>> splitting can be sent over in byte[][]. > > > > >>> > > > > >>> On Wed, Feb 7, 2024 at 10:03 AM Bryan Beaudreault < > > > bbeaudrea...@apache.org > > > > >>>> > > > > >>> wrote: > > > > >>> > > > > >>>> This is the first time I've heard of a region split taking 4 > > > minutes. For > > > > >>>> us, it's always on the order of seconds. That's true even for a > > > large > > > > >>> 50+gb > > > > >>>> region. It might be worth looking into why that's so slow for > you. > > > > >>>> > > > > >>>> On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah > > > > >>>> <rushabh.s...@salesforce.com.invalid> wrote: > > > > >>>> > > > > >>>>> Thank you Andrew, Bryan and Duo for your responses. > > > > >>>>> > > > > >>>>>> My main thought is that a migration like this should use bulk > > > > >>> loading, > > > > >>>>>> But also, I think, that data transfer should be in bulk > > > > >>>>> > > > > >>>>> We are working on moving to bulk loading. > > > > >>>>> > > > > >>>>>> With Admin.splitRegion, you can specify a split point. You can > > use > > > > >>> that > > > > >>>>> to > > > > >>>>> iteratively add a bunch of regions wherever you need them in > the > > > > >>>> keyspace. > > > > >>>>> Yes, it's 2 at a time, but it should still be quick enough in > the > > > grand > > > > >>>>> scheme of a large migration. > > > > >>>>> > > > > >>>>> > > > > >>>>> Trying to do some back of the envelope calculations. > > > > >>>>> In a production environment, it took around 4 minutes to split > a > > > > >>> recently > > > > >>>>> split region which had 4 store files with a total of 5 GB of > > data. > > > > >>>>> Assuming we are migrating 5000 tenants at a time and normally > we > > > have > > > > >>>>> around 10% of the tenants (500 tenants) which have data > > > > >>>>> spread across more than 1000 regions. We have around 10 huge > > tables > > > > >>>> where > > > > >>>>> we store the tenant's data for different use cases. > > > > >>>>> All the above numbers are on the *conservative* side. > > > > >>>>> > > > > >>>>> To create a split structure for 1000 regions, we need 10 > > > iterations of > > > > >>>> the > > > > >>>>> splits (2^10 = 1024). This assumes we are parallely splitting > the > > > > >>>> regions. > > > > >>>>> Each split takes around 4 minutes. So to create 1000 regions > just > > > for 1 > > > > >>>>> tenant and for 1 table, it takes around 40 minutes. > > > > >>>>> For 10 tables for 1 tenant, it takes around 400 minutes. > > > > >>>>> > > > > >>>>> For 500 tenants, this will take around *140 days*. 
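Spelling out the arithmetic behind the 140-day figure above, as a quick sanity check (a deliberately rough model that assumes the split rounds for a given table run strictly one after another; the class name is just for illustration):

    // Back-of-envelope check of the numbers quoted above.
    public class MigrationSplitEstimate {
      public static void main(String[] args) {
        int splitRounds = 10;      // 2^10 = 1024 >= 1000 target regions
        int minutesPerRound = 4;   // observed ~4 minutes per region split
        int tables = 10;           // large multi-tenant tables per tenant
        int largeTenants = 500;    // ~10% of the 5000 tenants being migrated
        long totalMinutes = (long) splitRounds * minutesPerRound * tables * largeTenants;
        System.out.printf("%d minutes ~= %.0f days%n",
            totalMinutes, totalMinutes / (60.0 * 24));
        // prints: 200000 minutes ~= 139 days, i.e. the ~140 days above
      }
    }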
To reduce > this > > > time > > > > >>>>> further, we can also create a split structure for each tenant > and > > > each > > > > >>>>> table in parallel. > > > > >>>>> But this would put a lot of pressure on the cluster and also it > > > will > > > > >>>>> require a lot of operational overhead and still we will end up > > with > > > > >>>>> the whole process taking days, if not months. > > > > >>>>> > > > > >>>>> Since we are moving our infrastructure to Public Cloud, we > > > anticipate > > > > >>>> this > > > > >>>>> huge migration happening once every month. > > > > >>>>> > > > > >>>>> > > > > >>>>>> Adding a splitRegion method that takes byte[][] for multiple > > split > > > > >>>> points > > > > >>>>> would be a nice UX improvement, but not > > > > >>>>> strictly necessary. > > > > >>>>> > > > > >>>>> IMHO for all the reasons stated above, I believe this is > > necessary. > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> On Mon, Jan 29, 2024 at 6:25 AM 张铎(Duo Zhang) < > > > palomino...@gmail.com> > > > > >>>>> wrote: > > > > >>>>> > > > > >>>>>> As it is called 'pre' split, it means that it can only happen > > when > > > > >>>>>> there is no data in table. > > > > >>>>>> > > > > >>>>>> If there are already data in the table, you can not always > > create > > > > >>>>>> 'empty' regions, as you do not know whether there are already > > > data in > > > > >>>>>> the given range... > > > > >>>>>> > > > > >>>>>> And technically, if you want to split a HFile into more than 2 > > > parts, > > > > >>>>>> you need to design new algorithm as now in HBase we only > support > > > top > > > > >>>>>> reference and bottom reference... > > > > >>>>>> > > > > >>>>>> Thanks. > > > > >>>>>> > > > > >>>>>> Bryan Beaudreault <bbeaudrea...@apache.org> 于2024年1月27日周六 > > > 02:16写道: > > > > >>>>>>> > > > > >>>>>>> My main thought is that a migration like this should use bulk > > > > >>>> loading, > > > > >>>>>>> which should be relatively easy given you already use MR > > > > >>>>>>> (HFileOutputFormat2). It doesn't solve the region-splitting > > > > >>> problem. > > > > >>>>> With > > > > >>>>>>> Admin.splitRegion, you can specify a split point. You can use > > > that > > > > >>> to > > > > >>>>>>> iteratively add a bunch of regions wherever you need them in > > the > > > > >>>>>> keyspace. > > > > >>>>>>> Yes, it's 2 at a time, but it should still be quick enough in > > the > > > > >>>> grand > > > > >>>>>>> scheme of a large migration. Adding a splitRegion method that > > > takes > > > > >>>>>>> byte[][] for multiple split points would be a nice UX > > > improvement, > > > > >>>> but > > > > >>>>>> not > > > > >>>>>>> strictly necessary. > > > > >>>>>>> > > > > >>>>>>> On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah > > > > >>>>>>> <rushabh.s...@salesforce.com.invalid> wrote: > > > > >>>>>>> > > > > >>>>>>>> Hi Everyone, > > > > >>>>>>>> At my workplace, we use HBase + Phoenix to run our customer > > > > >>>>> workloads. > > > > >>>>>> Most > > > > >>>>>>>> of our phoenix tables are multi-tenant and we store the > > tenantID > > > > >>> as > > > > >>>>> the > > > > >>>>>>>> leading part of the rowkey. Each tenant belongs to only 1 > > hbase > > > > >>>>>> cluster. > > > > >>>>>>>> Due to capacity planning, hardware refresh cycles and most > > > > >>> recently > > > > >>>>>> move to > > > > >>>>>>>> public cloud initiatives, we have to migrate a tenant from > one > > > > >>>> hbase > > > > >>>>>>>> cluster (source cluster) to another hbase cluster (target > > > > >>> cluster). 
> > > > >>>>>>>> Normally we migrate a lot of tenants (in 10s of thousands) > at > > a > > > > >>>> time > > > > >>>>>> and > > > > >>>>>>>> hence we have to copy a huge amount of data (in TBs) from > > > > >>> multiple > > > > >>>>>> source > > > > >>>>>>>> clusters to a single target cluster. We have our internal > tool > > > > >>>> which > > > > >>>>>> uses > > > > >>>>>>>> MapReduce framework to copy the data. Since all of these > > tenants > > > > >>>>> don’t > > > > >>>>>> have > > > > >>>>>>>> any presence on the target cluster (Note that the table is > NOT > > > > >>>> empty > > > > >>>>>> since > > > > >>>>>>>> we have data for other tenants in the target cluster), they > > > start > > > > >>>>> with > > > > >>>>>> one > > > > >>>>>>>> region and due to an organic split process, the data gets > > > > >>>> distributed > > > > >>>>>> among > > > > >>>>>>>> different regions and different regionservers. But the > organic > > > > >>>>>> splitting > > > > >>>>>>>> process takes a lot of time and due to the distributed > nature > > of > > > > >>>> the > > > > >>>>> MR > > > > >>>>>>>> framework, it causes hotspotting issues on the target > cluster > > > > >>> which > > > > >>>>>> often > > > > >>>>>>>> lasts for days. This causes availability issues where the > CPU > > is > > > > >>>>>> saturated > > > > >>>>>>>> and/or disk saturation on the regionservers ingesting the > > data. > > > > >>>> Also > > > > >>>>>> this > > > > >>>>>>>> causes a lot of replication related alerts (Age of last > ship, > > > > >>>>> LogQueue > > > > >>>>>>>> size) which goes on for days. > > > > >>>>>>>> > > > > >>>>>>>> In order to handle the huge influx of data, we should > ideally > > > > >>>>>> pre-split the > > > > >>>>>>>> table on the target based on the split structure present on > > the > > > > >>>>> source > > > > >>>>>>>> cluster. If we pre-split and create empty regions with right > > > > >>> region > > > > >>>>>>>> boundaries it will help to distribute the load to different > > > > >>> regions > > > > >>>>> and > > > > >>>>>>>> region servers and will prevent hotspotting. > > > > >>>>>>>> > > > > >>>>>>>> Problems with the above approach: > > > > >>>>>>>> 1. Currently we allow pre splitting only while creating a > new > > > > >>>> table. > > > > >>>>>> But in > > > > >>>>>>>> our production env, we already have the table created for > > other > > > > >>>>>> tenants. So > > > > >>>>>>>> we would like to pre-split an existing table for new > tenants. > > > > >>>>>>>> 2. Currently we split a given region into just 2 daughter > > > > >>> regions. > > > > >>>>> But > > > > >>>>>> if > > > > >>>>>>>> we have the split points information from the source cluster > > and > > > > >>> if > > > > >>>>> the > > > > >>>>>>>> data for the to-be-migrated tenant is split across 100 > regions > > > on > > > > >>>> the > > > > >>>>>>>> source side, we would ideally like to create 100 empty > regions > > > on > > > > >>>> the > > > > >>>>>>>> target cluster. > > > > >>>>>>>> > > > > >>>>>>>> Trying to get early feedback from the community. Do you all > > > think > > > > >>>>> this > > > > >>>>>> is a > > > > >>>>>>>> good idea? Open to other suggestions also. > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> Thank you, > > > > >>>>>>>> Rushabh. 
> > -- > > Best regards, > > Andrew
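For anyone who wants to try the interim approach suggested in the thread, i.e. iteratively splitting the existing, non-empty table two ways at a time using split points copied from the source cluster, here is a rough sketch against the current client API (the class and method names are made up for illustration; pacing, error handling, and waiting for the daughters to become splittable again are intentionally left out):

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;

    public class TenantPreSplit {
      /**
       * splitPoints are assumed to be the tenant-prefixed region boundaries
       * taken from the source cluster, sorted and de-duplicated.
       */
      public static void applySplitPoints(Connection connection, TableName table,
                                          List<byte[]> splitPoints) throws IOException {
        try (Admin admin = connection.getAdmin()) {
          for (byte[] splitPoint : splitPoints) {
            // Splits whichever region currently contains splitPoint into two
            // daughters at that key; repeating this over all boundaries
            // approximates the desired N-way split, two daughters at a time.
            admin.split(table, splitPoint);
            // In practice the caller must wait here for the daughters to come
            // online (and become splittable again) before issuing further
            // splits into the same key range -- which is exactly the latency
            // problem discussed in this thread.
          }
        }
      }
    }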