Thank you Andrew, Duo and Bryan for the helpful discussion. I will create a JIRA soon.
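For concreteness, here is a rough sketch of how we would have to drive today's two-way split API to reach a set of target boundaries, versus the single multi-point request discussed below. This is only an illustration against the HBase 2.x Admin client (getRegions/splitRegionAsync); the byte[][] overload shown in the trailing comment is the hypothetical API from this thread, not something that exists today, and its exact shape would be settled in the JIRA.

  import java.util.Arrays;
  import java.util.List;

  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.RegionInfo;

  public final class SplitToBoundaries {

    // Split 'table' until every key in 'splitPoints' is a region boundary,
    // using today's one-point-at-a-time API. Each call yields exactly one
    // daughter pair; in practice the daughters are not splittable again
    // until the post-split compaction finishes, which is where the hours
    // add up.
    static void splitTo(Admin admin, TableName table, List<byte[]> splitPoints)
        throws Exception {
      for (byte[] point : splitPoints) {
        // Re-read the region list each time so newly created daughters are seen.
        for (RegionInfo region : admin.getRegions(table)) {
          boolean alreadyBoundary = Arrays.equals(region.getStartKey(), point);
          if (!alreadyBoundary && region.containsRow(point)) {
            // Two-way split at the requested point; blocks until the split
            // procedure completes.
            admin.splitRegionAsync(region.getRegionName(), point).get();
            break;
          }
        }
      }
      // The overload discussed in this thread would collapse the loop into a
      // single request per region, e.g. (hypothetical, not an existing API):
      // admin.splitRegionAsync(regionName, new byte[][] { p1, p2, p3 });
    }
  }

As Andrew points out below, the expensive part is not the split request itself but waiting for the daughters to finish post-split compaction and become splittable again, which is why a single N-way split per region is attractive.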
On Fri, Feb 9, 2024 at 7:13 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

> Actually the 'split HFile by reading and then writing to multiple places' approach does not increase the offline region time a lot, it just increases the overall split processing time.
>
> Anyway, these are all just some initial thoughts. If we think this is a useful new feature, we could open an issue and also a design doc.
>
> Thanks.
>
> Andrew Purtell <andrew.purt...@gmail.com> wrote on Fri, Feb 9, 2024 at 11:09:
> >
> > Agreed, the key is going to be changing the design of references to support this.
> >
> > Splits must continue to minimize offline time. Increasing that time at all would not be acceptable.
> >
> > > On Feb 8, 2024, at 5:46 PM, 张铎 <palomino...@gmail.com> wrote:
> > >
> > > As I said above, split is not like merge; simply changing the Admin API to take a byte[][] does not actually help here.
> > >
> > > For online splitting of a region, our algorithm only supports splitting into two sub regions, and then we do a compaction to clean up all the reference files and prepare for the next split.
> > >
> > > I'm not saying this is impossible, but I think the most difficult challenge here is how to design a new reference file algorithm to support referencing a 'range' of an HFile, not only the top half or bottom half. In this way we could support splitting a region directly into more than 2 sub regions.
> > >
> > > Or maybe we could change the way we split regions: instead of creating reference files, we directly read an HFile and output multiple HFiles for different ranges, put them in different region directories, and also make the flush write to multiple places (the current region and all the sub regions). Once everything is fine, we offline the old region (also making it do a final flush), and then online all the sub regions. In this way we have much lower overall write amplification, but the problem is it will take a very long time when splitting, and also the failure recovery will be more complicated.
> > >
> > > Thanks.
> > >
> > > Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Fri, Feb 9, 2024 at 09:27:
> > >>
> > >> Yep, I forgot about that nuance. I agree we can add a splitRegion overload which takes a byte[][] for multiple split points.
> > >>
> > >>> On Thu, Feb 8, 2024 at 8:23 PM Andrew Purtell <apurt...@apache.org> wrote:
> > >>>
> > >>> Rushabh already covered this, but splitting is not complete until the region can be split again. This is a very important nuance. The daughter regions are online very quickly, as designed, but then background housekeeping (compaction) must copy the data before the daughters become splittable. Depending on compaction pressure, compaction queue depth, and the settings of various tunables, waiting for some split daughters to become ready to split again can take many minutes to hours.
> > >>>
> > >>> So let's say we have replicas of a table at two sites, site A and site B. The region boundaries of this table in A and B will be different. Now let's also say that table data is stored with a key prefix mapping to every unique tenant. When migrating a tenant, data copy will hotspot on the region(s) hosting keys with the tenant's prefix. This is fine if there are enough regions to absorb the load. We run into trouble when the region boundaries in the sub-keyspace of interest are quite different in B versus A. We get hotspotting and impact to operations until organic splitting eventually mitigates the hotspotting, but this might also require many minutes to hours, with noticeable performance degradation in the meantime. To avoid that degradation we pace the sender, but then the copy may take so long as to miss the SLA for the migration. To make the data movement performant and stay within SLA we want to apply one or more splits or merges so the region boundaries in B roughly align to A, avoiding hotspotting. This will also make shipping this data by bulk load more efficient, by minimizing the amount of HFile splitting necessary to load the files at the receiver.
> > >>>
> > >>> So let's say we have some regions that need to be split N ways, where N is on the order of ~10 (more than 1 and less than 100), in order to (roughly) align region boundaries. We think this calls for an enhancement to the split request API where the split should produce a requested number of daughter pairs. Today that is always 1 pair. Instead we might want 2, 5, 10, conceivably more. And it would be nice if guideposts for multi-way splitting could be sent over as a byte[][].
> > >>>
> > >>> On Wed, Feb 7, 2024 at 10:03 AM Bryan Beaudreault <bbeaudrea...@apache.org> wrote:
> > >>>
> > >>>> This is the first time I've heard of a region split taking 4 minutes. For us, it's always on the order of seconds. That's true even for a large 50+ GB region. It might be worth looking into why that's so slow for you.
> > >>>>
> > >>>> On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> > >>>>
> > >>>>> Thank you Andrew, Bryan and Duo for your responses.
> > >>>>>
> > >>>>>> My main thought is that a migration like this should use bulk loading,
> > >>>>>> But also, I think, that data transfer should be in bulk
> > >>>>>
> > >>>>> We are working on moving to bulk loading.
> > >>>>>
> > >>>>>> With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration.
> > >>>>>
> > >>>>> Trying to do some back-of-the-envelope calculations.
> > >>>>> In a production environment, it took around 4 minutes to split a recently split region which had 4 store files with a total of 5 GB of data. Assume we are migrating 5000 tenants at a time, and normally around 10% of the tenants (500 tenants) have data spread across more than 1000 regions. We have around 10 huge tables where we store the tenants' data for different use cases. All the above numbers are on the *conservative* side.
> > >>>>>
> > >>>>> To create a split structure of 1000 regions, we need 10 rounds of splits (2^10 = 1024), assuming we split the regions in parallel. Each split takes around 4 minutes, so to create 1000 regions for just 1 tenant and 1 table it takes around 40 minutes. For 10 tables for 1 tenant, it takes around 400 minutes.
> > >>>>>
> > >>>>> For 500 tenants, this will take around *140 days*. To reduce this time further, we can also create the split structure for each tenant and each table in parallel. But this would put a lot of pressure on the cluster, require a lot of operational overhead, and still leave the whole process taking days, if not months.
> > >>>>>
> > >>>>> Since we are moving our infrastructure to the public cloud, we anticipate this huge migration happening once every month.
> > >>>>>
> > >>>>>> Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> > >>>>>
> > >>>>> IMHO, for all the reasons stated above, I believe this is necessary.
> > >>>>>
> > >>>>> On Mon, Jan 29, 2024 at 6:25 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> > >>>>>
> > >>>>>> As it is called 'pre' split, it means that it can only happen when there is no data in the table.
> > >>>>>>
> > >>>>>> If there is already data in the table, you cannot always create 'empty' regions, as you do not know whether there is already data in the given range...
> > >>>>>>
> > >>>>>> And technically, if you want to split an HFile into more than 2 parts, you need to design a new algorithm, as right now in HBase we only support top references and bottom references...
> > >>>>>>
> > >>>>>> Thanks.
> > >>>>>>
> > >>>>>> Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Sat, Jan 27, 2024 at 02:16:
> > >>>>>>>
> > >>>>>>> My main thought is that a migration like this should use bulk loading, which should be relatively easy given you already use MR (HFileOutputFormat2). It doesn't solve the region-splitting problem. With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration. Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> > >>>>>>>
> > >>>>>>> On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> > >>>>>>>
> > >>>>>>>> Hi Everyone,
> > >>>>>>>> At my workplace, we use HBase + Phoenix to run our customer workloads. Most of our Phoenix tables are multi-tenant and we store the tenantID as the leading part of the rowkey. Each tenant belongs to only 1 HBase cluster. Due to capacity planning, hardware refresh cycles and, most recently, public cloud initiatives, we have to migrate a tenant from one HBase cluster (source cluster) to another HBase cluster (target cluster). Normally we migrate a lot of tenants (in the tens of thousands) at a time and hence have to copy a huge amount of data (in TBs) from multiple source clusters to a single target cluster. We have an internal tool which uses the MapReduce framework to copy the data. Since all of these tenants don't have any presence on the target cluster (note that the table is NOT empty, since we have data for other tenants in the target cluster), they start with one region, and through the organic split process the data gets distributed among different regions and different regionservers. But the organic splitting process takes a lot of time and, due to the distributed nature of the MR framework, it causes hotspotting on the target cluster which often lasts for days. This causes availability issues where the CPU and/or disks are saturated on the regionservers ingesting the data. It also causes a lot of replication-related alerts (Age of last ship, LogQueue size) which go on for days.
> > >>>>>>>>
> > >>>>>>>> In order to handle the huge influx of data, we should ideally pre-split the table on the target based on the split structure present on the source cluster. If we pre-split and create empty regions with the right region boundaries, it will help distribute the load to different regions and region servers and will prevent hotspotting.
> > >>>>>>>>
> > >>>>>>>> Problems with the above approach:
> > >>>>>>>> 1. Currently we allow pre-splitting only while creating a new table. But in our production env, we already have the table created for other tenants. So we would like to pre-split an existing table for new tenants.
> > >>>>>>>> 2. Currently we split a given region into just 2 daughter regions. But if we have the split point information from the source cluster and the data for the to-be-migrated tenant is split across 100 regions on the source side, we would ideally like to create 100 empty regions on the target cluster.
> > >>>>>>>>
> > >>>>>>>> Trying to get early feedback from the community. Do you all think this is a good idea? Open to other suggestions also.
> > >>>>>>>>
> > >>>>>>>> Thank you,
> > >>>>>>>> Rushabh.
> > >>>
> > >>> --
> > >>> Best regards,
> > >>> Andrew
> > >>>
> > >>> Unrest, ignorance distilled, nihilistic imbeciles -
> > >>> It's what we've earned
> > >>> Welcome, apocalypse, what's taken you so long?
> > >>> Bring us the fitting end that we've been counting on
> > >>> - A23, Welcome, Apocalypse