Actually, the 'split an HFile by reading it and then writing to multiple places' approach does not increase the region offline time much; it mainly increases the overall split processing time.
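To make that concrete, here is a rough, compile-only sketch of the data flow. readParentStoreFile and writerForDaughter are placeholders, not existing HBase APIs; a real implementation would sit on top of the HFile reader and StoreFileWriter machinery.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellComparator;

// Illustrative only: stream the cells of one parent store file into N
// range-bounded outputs while the parent region stays online. The reader
// and writer plumbing is abstracted away behind placeholder methods.
public abstract class MultiWaySplitSketch {

  /** Placeholder: cells of one parent store file, in sorted order. */
  protected abstract Iterable<Cell> readParentStoreFile() throws IOException;

  /** Placeholder: open a writer for the i-th daughter region's directory. */
  protected abstract CellSink writerForDaughter(int daughterIndex) throws IOException;

  public interface CellSink {
    void append(Cell cell) throws IOException;
    void close() throws IOException;
  }

  /** splitPoints are sorted row keys; the number of daughters is splitPoints.size() + 1. */
  public void rewrite(List<byte[]> splitPoints) throws IOException {
    int daughter = 0;
    CellSink out = writerForDaughter(daughter);
    for (Cell cell : readParentStoreFile()) {
      // Roll to the next daughter once this cell's row reaches the next split point.
      while (daughter < splitPoints.size()
          && CellComparator.getInstance().compareRows(cell, splitPoints.get(daughter), 0,
            splitPoints.get(daughter).length) >= 0) {
        out.close();
        daughter++;
        out = writerForDaughter(daughter);
      }
      out.append(cell);
    }
    out.close();
  }
}

With this flow the parent only needs a short offline window at the very end, for the final flush and the metadata swap, which is why the cost shows up as extra split processing time rather than extra offline time.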
Anyway, all these are just some initial thoughts. If we think this is a useful new feature, we could open an issue and also a design doc. Thanks.

Andrew Purtell <andrew.purt...@gmail.com> wrote on Fri, Feb 9, 2024 at 11:09:
>
> Agreed, the key is going to be changing the design of references to support this new capability.
>
> Splits must continue to minimize offline time. Increasing that time at all would not be acceptable.
>
> On Feb 8, 2024, at 5:46 PM, 张铎 <palomino...@gmail.com> wrote:
> >
> > As I said above, split is not like merge; simply changing the Admin API to take a byte[][] does not actually help here.
> >
> > For online splitting of a region, our algorithm only supports splitting into two sub regions, and then we do a compaction to clean up all the reference files and prepare for the next split.
> >
> > I'm not saying this is impossible, but I think the most difficult challenge here is how to design a new reference file algorithm to support referencing a 'range' of an HFile, not only the top half or the bottom half. In this way we could support splitting a region directly into more than 2 sub regions.
> >
> > Or maybe we could change the way we split regions: instead of creating reference files, we directly read an HFile and output multiple HFiles for the different ranges, put them in the different region directories, and also make flushes write to multiple places (the current region and all the sub regions). Once everything is fine, we offline the old region (also making it do a final flush) and then online all the sub regions. In this way we have much lower overall write amplification, but the problem is that the split itself will take a very long time, and failure recovery will be more complicated.
> >
> > Thanks.
> >
> > Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Fri, Feb 9, 2024 at 09:27:
> >>
> >> Yep, I forgot about that nuance. I agree we can add a splitRegion overload which takes a byte[][] for multiple split points.
> >>
> >>> On Thu, Feb 8, 2024 at 8:23 PM Andrew Purtell <apurt...@apache.org> wrote:
> >>>
> >>> Rushabh already covered this, but splitting is not complete until the region can be split again. This is a very important nuance. The daughter regions are online very quickly, as designed, but then background housekeeping (compaction) must copy the data before the daughters become splittable. Depending on compaction pressure, compaction queue depth, and the settings of various tunables, waiting for some split daughters to become ready to split again can take many minutes to hours.
> >>>
> >>> So let's say we have replicas of a table at two sites, site A and site B. The region boundaries of this table in A and B will be different. Now let's also say that table data is stored with a key prefix mapping to every unique tenant. When migrating a tenant, the data copy will hotspot on the region(s) hosting keys with the tenant's prefix. This is fine if there are enough regions to absorb the load. We run into trouble when the region boundaries in the sub-keyspace of interest are quite different in B versus A. We get hotspotting and impact to operations until organic splitting eventually mitigates the hotspotting, but this might also require many minutes to hours, with noticeable performance degradation in the meantime.
> >>> To avoid that degradation we pace the sender, but then the copy may take so long as to miss the SLA for the migration. To make the data movement performant and stay within the SLA, we want to apply one or more splits or merges so the region boundaries in B roughly align to A, avoiding hotspotting. This will also make shipping this data by bulk load instead more efficient, by minimizing the amount of HFile splitting necessary to load the files at the receiver.
> >>>
> >>> So let's say we have some regions that need to be split N ways, where N is on the order of ~10, by which I mean more than 1 and less than 100, in order to (roughly) align region boundaries. We think this calls for an enhancement to the split request API where the split should produce a requested number of daughter pairs. Today that is always 1 pair. Instead we might want 2, 5, 10, conceivably more. And it would be nice if the guideposts for multi-way splitting could be sent over in a byte[][].
> >>>
> >>> On Wed, Feb 7, 2024 at 10:03 AM Bryan Beaudreault <bbeaudrea...@apache.org> wrote:
> >>>
> >>>> This is the first time I've heard of a region split taking 4 minutes. For us, it's always on the order of seconds. That's true even for a large 50+ GB region. It might be worth looking into why that's so slow for you.
> >>>>
> >>>> On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> >>>>>
> >>>>> Thank you Andrew, Bryan and Duo for your responses.
> >>>>>
> >>>>>> My main thought is that a migration like this should use bulk loading,
> >>>>>> But also, I think, that data transfer should be in bulk
> >>>>>
> >>>>> We are working on moving to bulk loading.
> >>>>>
> >>>>>> With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration.
> >>>>>
> >>>>> Trying to do some back-of-the-envelope calculations.
> >>>>> In a production environment, it took around 4 minutes to split a recently split region which had 4 store files with a total of 5 GB of data. Assume we are migrating 5000 tenants at a time; normally around 10% of the tenants (500 tenants) have data spread across more than 1000 regions. We have around 10 huge tables where we store the tenants' data for different use cases. All the above numbers are on the *conservative* side.
> >>>>>
> >>>>> To create a split structure for 1000 regions, we need 10 iterations of splits (2^10 = 1024). This assumes we are splitting the regions in parallel. Each split takes around 4 minutes. So to create 1000 regions just for 1 tenant and 1 table, it takes around 40 minutes. For 10 tables for 1 tenant, it takes around 400 minutes.
> >>>>>
> >>>>> For 500 tenants, this will take around *140 days*. To reduce this time further, we could also create the split structure for each tenant and each table in parallel, but that would put a lot of pressure on the cluster, require a lot of operational overhead, and we would still end up with the whole process taking days, if not months.
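For reference, the arithmetic behind that ~140 day figure, using the numbers above (4 minutes per split, 10 sequential rounds of parallel splits to reach ~1024 regions, 10 tables, 500 large tenants handled one after another):

    10 rounds x 4 min      =  40 min per table per tenant
    40 min x 10 tables     = 400 min per tenant
    400 min x 500 tenants  = 200,000 min, roughly 3,333 hours, roughly 139 days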
> >>>>>
> >>>>> Since we are moving our infrastructure to the public cloud, we anticipate this kind of huge migration happening once every month.
> >>>>>
> >>>>>> Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> >>>>>
> >>>>> IMHO, for all the reasons stated above, I believe this is necessary.
> >>>>>
> >>>>> On Mon, Jan 29, 2024 at 6:25 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> >>>>>>
> >>>>>> As it is called 'pre' split, it means that it can only happen when there is no data in the table.
> >>>>>>
> >>>>>> If there is already data in the table, you cannot always create 'empty' regions, as you do not know whether there is already data in the given range...
> >>>>>>
> >>>>>> And technically, if you want to split an HFile into more than 2 parts, you need to design a new algorithm, as right now HBase only supports a top reference and a bottom reference...
> >>>>>>
> >>>>>> Thanks.
> >>>>>>
> >>>>>> Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Sat, Jan 27, 2024 at 02:16:
> >>>>>>>
> >>>>>>> My main thought is that a migration like this should use bulk loading, which should be relatively easy given you already use MR (HFileOutputFormat2). It doesn't solve the region-splitting problem. With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration. Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> >>>>>>>
> >>>>>>> On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> >>>>>>>>
> >>>>>>>> Hi Everyone,
> >>>>>>>> At my workplace, we use HBase + Phoenix to run our customer workloads. Most of our Phoenix tables are multi-tenant and we store the tenantID as the leading part of the rowkey. Each tenant belongs to only 1 HBase cluster. Due to capacity planning, hardware refresh cycles and, most recently, move-to-public-cloud initiatives, we have to migrate a tenant from one HBase cluster (source cluster) to another HBase cluster (target cluster). Normally we migrate a lot of tenants (in the 10s of thousands) at a time and hence we have to copy a huge amount of data (in TBs) from multiple source clusters to a single target cluster. We have an internal tool which uses the MapReduce framework to copy the data. Since these tenants don't have any presence on the target cluster (note that the table is NOT empty, since we have data for other tenants in the target cluster), they start with one region, and through the organic split process the data gets distributed among different regions and different regionservers.
> >>>>>>>> But the organic splitting process takes a lot of time, and due to the distributed nature of the MR framework it causes hotspotting issues on the target cluster which often last for days. This causes availability issues where the CPU is saturated and/or the disks are saturated on the regionservers ingesting the data. Also this causes a lot of replication-related alerts (Age of last ship, LogQueue size) which go on for days.
> >>>>>>>>
> >>>>>>>> In order to handle the huge influx of data, we should ideally pre-split the table on the target based on the split structure present on the source cluster. If we pre-split and create empty regions with the right region boundaries, it will help distribute the load to different regions and region servers and will prevent hotspotting.
> >>>>>>>>
> >>>>>>>> Problems with the above approach:
> >>>>>>>> 1. Currently we allow pre-splitting only while creating a new table. But in our production env, we already have the table created for other tenants. So we would like to pre-split an existing table for new tenants.
> >>>>>>>> 2. Currently we split a given region into just 2 daughter regions. But if we have the split point information from the source cluster, and the data for the to-be-migrated tenant is split across 100 regions on the source side, we would ideally like to create 100 empty regions on the target cluster.
> >>>>>>>>
> >>>>>>>> Trying to get early feedback from the community. Do you all think this is a good idea? Open to other suggestions also.
> >>>>>>>>
> >>>>>>>> Thank you,
> >>>>>>>> Rushabh.
> >>>
> >>> --
> >>> Best regards,
> >>> Andrew
> >>>
> >>> Unrest, ignorance distilled, nihilistic imbeciles -
> >>> It's what we’ve earned
> >>> Welcome, apocalypse, what’s taken you so long?
> >>> Bring us the fitting end that we’ve been counting on
> >>>   - A23, Welcome, Apocalypse
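For anyone following along, the workaround Bryan describes above (iteratively splitting an existing, non-empty table at boundaries copied from the source cluster, two daughters at a time, with the current public Admin API) looks roughly like the sketch below. The table name and split points are made up for illustration, and in practice you would also wait for each daughter's post-split compaction to finish before splitting it again, which is exactly the delay discussed earlier in the thread.

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionInfo;
import org.apache.hadoop.hbase.util.Bytes;

public class IterativePreSplit {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName table = TableName.valueOf("tenant_data");  // made-up table name
    byte[][] splitPoints = {                              // made-up boundaries copied from the source cluster
      Bytes.toBytes("tenantA_010"),
      Bytes.toBytes("tenantA_020"),
      Bytes.toBytes("tenantA_030"),
    };
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      for (byte[] splitPoint : splitPoints) {
        // Find the region that currently contains this split point.
        RegionInfo target = null;
        for (RegionInfo ri : admin.getRegions(table)) {
          if (ri.containsRow(splitPoint)) {
            target = ri;
            break;
          }
        }
        if (target == null || Bytes.equals(target.getStartKey(), splitPoint)) {
          continue; // already a region boundary, nothing to do
        }
        // One call produces exactly two daughters, which is why a byte[][]
        // overload (or a true multi-way split) would be a nicer fit here.
        admin.splitRegionAsync(target.getRegionName(), splitPoint).get(10, TimeUnit.MINUTES);
      }
    }
  }
}

The multi-split overload being proposed would collapse the inner loop into a single request, e.g. something shaped like splitRegionAsync(regionName, byte[][] splitPoints); that signature does not exist today and is only meant to illustrate the API shape under discussion.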