I agree with the points above: a split taking more time is equivalent to actual writes being blocked from the moment the parent region is brought offline, and hence it is an availability issue. So more region splits with the same approach do mean a longer window of unavailability. Daughter regions are created only after the parent region has been brought offline.
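To make "the same approach" concrete, here is a rough sketch of what it means in client terms today (the class name, method, and waiting logic below are illustrative, not an existing tool): a multi-way split is approximated by repeated two-way splits through the existing Admin API, and every round takes the parent region through a full offline/online cycle.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionInfo;
import org.apache.hadoop.hbase.util.Bytes;

public class IterativeSplit {
  // Approximate a multi-way split by splitting two ways at each desired point.
  public static void splitAt(TableName table, List<byte[]> splitPoints) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      for (byte[] splitPoint : splitPoints) {
        // Find the region currently containing this split point.
        RegionInfo parent = admin.getRegions(table).stream()
            .filter(r -> r.containsRow(splitPoint))
            .findFirst()
            .orElseThrow(() -> new IllegalStateException(
                "No region for " + Bytes.toStringBinary(splitPoint)));
        // Two-way split at the requested point. The parent is offline from
        // here until the daughters (holding reference files) are opened.
        admin.splitRegionAsync(parent.getRegionName(), splitPoint).get();
        // Before a daughter can be split again we must also wait for
        // compaction to rewrite its reference files into real HFiles,
        // which is where most of the end-to-end time goes.
      }
    }
  }
}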
Besides, all the complications with the split procedure going wrong, and whether we roll back or only roll forward (depending on whether we have reached the PONR in the procedure), would now be multiplied by the number of splits we consider.

On Fri, Feb 9, 2024 at 9:05 AM Andrew Purtell <apurt...@apache.org> wrote:
>
> > Or maybe we could change the way we split regions: instead of
> > creating reference files, we directly read an HFile and output multiple
> > HFiles covering different ranges, and put them in different region directories
>
> > Actually the 'split HFile by reading and then writing to multiple
> > places' approach does not increase the offline region time a lot, it
> > just increases the overall split processing time.
>
> Those file writes would need to complete before the daughters are brought
> online, unless we keep the parent open until all the daughter writes are
> completed, and then we have the problem of what to do with writes to the
> parent in the meantime. They would need to be sent on to the appropriate
> daughter too. If we try to avoid this by making the parent read-only during
> the split, that would be equivalent to the parent being offline in our
> production. Not being ready for writes is equivalent to not available.
>
> Otherwise I have to believe that writing real HFiles instead of references
> would take longer than writing references, and so, assuming the parent is
> offline while the daughters are being created, as is the case today, the
> total time offline during the split would have to be longer. When writing
> reference files there is the file create time and that's it. When writing
> real HFiles there is the file create time and then the time it takes to
> stream through ranges of however many GB of data exist in the region, with
> many read and write IOs.
>
> On Fri, Feb 9, 2024 at 7:13 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>
> > Actually the 'split HFile by reading and then writing to multiple
> > places' approach does not increase the offline region time a lot, it
> > just increases the overall split processing time.
> >
> > Anyway, all these things are just some initial thoughts. If we
> > think this is a useful new feature, we could open an issue and also a
> > design doc.
> >
> > Thanks.
> >
> > Andrew Purtell <andrew.purt...@gmail.com> wrote on Fri, Feb 9, 2024 at 11:09:
> > >
> > > Agreed, the key is going to be changing the design of references to
> > > support this new feature.
> > >
> > > Splits must continue to minimize offline time. Increasing that time at
> > > all would not be acceptable.
> > >
> > > > On Feb 8, 2024, at 5:46 PM, 张铎 <palomino...@gmail.com> wrote:
> > > >
> > > > As I said above, split is not like merge; simply changing the
> > > > Admin API to take a byte[][] does not actually help here.
> > > >
> > > > For online splitting of a region, our algorithm only supports splitting
> > > > into two sub regions, and then we do a compaction to clean up all the
> > > > reference files and prepare for the next split.
> > > >
> > > > I'm not saying this is impossible, but I think the most difficult
> > > > challenge here is how to design a new reference file algorithm that
> > > > supports referencing a 'range' of an HFile, not only the top half or the
> > > > bottom half. In this way we can support splitting a region directly
> > > > into more than 2 sub regions.
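Purely to illustrate the kind of 'range reference' being described here (nothing like this exists in HBase today; the class and field names below are hypothetical), such a reference would have to carry an explicit key range rather than a single split key plus a top/bottom flag:

// Hypothetical sketch, not existing HBase code.
final class RangeReference {
  private final byte[] startKey;    // inclusive start of the referenced range
  private final byte[] endKey;      // exclusive end of the referenced range
  private final String parentFile;  // name of the parent store file

  RangeReference(byte[] startKey, byte[] endKey, String parentFile) {
    this.startKey = startKey;
    this.endKey = endKey;
    this.parentFile = parentFile;
  }

  // A reader over such a reference would clip its scans to [startKey, endKey)
  // instead of to one side of a single split key, as the current top/bottom
  // half-store-file readers do.
}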
> > > >
> > > > Or maybe we could change the way we split regions: instead of
> > > > creating reference files, we directly read an HFile and output multiple
> > > > HFiles covering different ranges, put them in different region directories,
> > > > and also make the flush write to multiple places (the current region
> > > > and all the sub regions); once everything is fine, we offline the
> > > > old region (also making it do a final flush) and then online all the
> > > > sub regions. In this way we have much lower overall write
> > > > amplification, but the problem is that it will take a very long time
> > > > when splitting, and the failure recovery will also be more complicated.
> > > >
> > > > Thanks.
> > > >
> > > > Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Fri, Feb 9, 2024 at 09:27:
> > > >>
> > > >> Yep, I forgot about that nuance. I agree we can add a splitRegion overload
> > > >> which takes a byte[][] for multiple split points.
> > > >>
> > > >>> On Thu, Feb 8, 2024 at 8:23 PM Andrew Purtell <apurt...@apache.org> wrote:
> > > >>>
> > > >>> Rushabh already covered this, but splitting is not complete until the region
> > > >>> can be split again. This is a very important nuance. The daughter regions
> > > >>> are online very quickly, as designed, but then background housekeeping
> > > >>> (compaction) must copy the data before the daughters become splittable.
> > > >>> Depending on compaction pressure, compaction queue depth, and the settings
> > > >>> of various tunables, waiting for some split daughters to become ready to
> > > >>> split again can take many minutes to hours.
> > > >>>
> > > >>> So let's say we have replicas of a table at two sites, site A and site B.
> > > >>> The region boundaries of this table in A and B will be different. Now let's
> > > >>> also say that table data is stored with a key prefix mapping to every
> > > >>> unique tenant. When migrating a tenant, the data copy will hotspot on the
> > > >>> region(s) hosting keys with the tenant's prefix. This is fine if there are
> > > >>> enough regions to absorb the load. We run into trouble when the region
> > > >>> boundaries in the sub-keyspace of interest are quite different in B versus
> > > >>> A. We get hotspotting and impact to operations until organic splitting
> > > >>> eventually mitigates the hotspotting, but this might also require many
> > > >>> minutes to hours, with noticeable performance degradation in the meantime.
> > > >>> To avoid that degradation we pace the sender, but then the copy may take so
> > > >>> long as to miss the SLA for the migration. To make the data movement
> > > >>> performant and stay within SLA, we want to apply one or more splits or
> > > >>> merges so the region boundaries in B roughly align to A, avoiding
> > > >>> hotspotting. This will also make shipping this data by bulk load instead
> > > >>> more efficient, by minimizing the amount of HFile splitting necessary to
> > > >>> load the files at the receiver.
> > > >>>
> > > >>> So let's say we have some regions that need to be split N ways, where N is
> > > >>> on the order of ~10, by which I mean more than 1 and less than 100, in
> > > >>> order to (roughly) align region boundaries. We think this calls for an
> > > >>> enhancement to the split request API where the split should produce a
> > > >>> requested number of daughter-pairs. Today that is always 1 pair. Instead
> > > >>> we might want 2, 5, 10, conceivably more.
> > > >>> And it would be nice if guideposts for multi-way
> > > >>> splitting could be sent over in a byte[][].
> > > >>>
> > > >>> On Wed, Feb 7, 2024 at 10:03 AM Bryan Beaudreault <bbeaudrea...@apache.org> wrote:
> > > >>>
> > > >>>> This is the first time I've heard of a region split taking 4 minutes. For
> > > >>>> us, it's always on the order of seconds. That's true even for a large
> > > >>>> 50+ GB region. It might be worth looking into why that's so slow for you.
> > > >>>>
> > > >>>> On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah
> > > >>>> <rushabh.s...@salesforce.com.invalid> wrote:
> > > >>>>
> > > >>>>> Thank you Andrew, Bryan and Duo for your responses.
> > > >>>>>
> > > >>>>>> My main thought is that a migration like this should use bulk loading,
> > > >>>>>> But also, I think, that data transfer should be in bulk
> > > >>>>>
> > > >>>>> We are working on moving to bulk loading.
> > > >>>>>
> > > >>>>>> With Admin.splitRegion, you can specify a split point. You can use that to
> > > >>>>>> iteratively add a bunch of regions wherever you need them in the keyspace.
> > > >>>>>> Yes, it's 2 at a time, but it should still be quick enough in the grand
> > > >>>>>> scheme of a large migration.
> > > >>>>>
> > > >>>>> Trying to do some back-of-the-envelope calculations:
> > > >>>>> In a production environment, it took around 4 minutes to split a recently
> > > >>>>> split region which had 4 store files with a total of 5 GB of data.
> > > >>>>> Assume we are migrating 5000 tenants at a time; normally around 10% of the
> > > >>>>> tenants (500 tenants) have data spread across more than 1000 regions. We
> > > >>>>> have around 10 huge tables where we store the tenants' data for different
> > > >>>>> use cases. All the above numbers are on the *conservative* side.
> > > >>>>>
> > > >>>>> To create a split structure of 1000 regions, we need 10 iterations of
> > > >>>>> splits (2^10 = 1024). This assumes the splits within each iteration run in
> > > >>>>> parallel. Each split takes around 4 minutes, so to create 1000 regions
> > > >>>>> just for 1 tenant and 1 table, it takes around 40 minutes (10 sequential
> > > >>>>> iterations). For 10 tables for 1 tenant, it takes around 400 minutes.
> > > >>>>>
> > > >>>>> For 500 tenants handled one after another, this will take around *140
> > > >>>>> days*. To reduce this time further, we can also create the split structure
> > > >>>>> for each tenant and each table in parallel. But this would put a lot of
> > > >>>>> pressure on the cluster, it would require a lot of operational overhead,
> > > >>>>> and we would still end up with the whole process taking days, if not
> > > >>>>> months.
> > > >>>>>
> > > >>>>> Since we are moving our infrastructure to the public cloud, we anticipate
> > > >>>>> this huge migration happening once every month.
> > > >>>>>
> > > >>>>>> Adding a splitRegion method that takes byte[][] for multiple split points
> > > >>>>>> would be a nice UX improvement, but not strictly necessary.
> > > >>>>>
> > > >>>>> IMHO, for all the reasons stated above, I believe this is necessary.
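As a strawman for the API shape being asked for here (no such overload exists in Admin today; the interface name and signature below are only a guess), a multi-split request could look something like the following, so that N split points yield N+1 daughters in one operation instead of roughly log2(N) dependent rounds of two-way splits:

import java.util.concurrent.Future;

// Hypothetical sketch only. Today's Admin exposes
// splitRegionAsync(byte[] regionName, byte[] splitPoint) with a single point.
public interface MultiSplitAdmin {
  /**
   * Split one region into multiple daughters at the given split points,
   * e.g. split points copied from the source cluster's region boundaries
   * for the tenant being migrated.
   */
  Future<Void> splitRegionAsync(byte[] regionName, byte[][] splitPoints);
}

With something like that, the 1000-region layout above would need a single round of requests instead of roughly 10 dependent rounds of 4-minute splits.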
> > > >>>>>
> > > >>>>> On Mon, Jan 29, 2024 at 6:25 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> > > >>>>>
> > > >>>>>> As it is called a 'pre' split, it can only happen when there is no
> > > >>>>>> data in the table.
> > > >>>>>>
> > > >>>>>> If there is already data in the table, you cannot always create
> > > >>>>>> 'empty' regions, as you do not know whether there is already data in
> > > >>>>>> the given range...
> > > >>>>>>
> > > >>>>>> And technically, if you want to split an HFile into more than 2 parts,
> > > >>>>>> you need to design a new algorithm, as today HBase only supports top
> > > >>>>>> references and bottom references...
> > > >>>>>>
> > > >>>>>> Thanks.
> > > >>>>>>
> > > >>>>>> Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Sat, Jan 27, 2024 at 02:16:
> > > >>>>>>>
> > > >>>>>>> My main thought is that a migration like this should use bulk loading,
> > > >>>>>>> which should be relatively easy given you already use MR
> > > >>>>>>> (HFileOutputFormat2). It doesn't solve the region-splitting problem. With
> > > >>>>>>> Admin.splitRegion, you can specify a split point. You can use that to
> > > >>>>>>> iteratively add a bunch of regions wherever you need them in the keyspace.
> > > >>>>>>> Yes, it's 2 at a time, but it should still be quick enough in the grand
> > > >>>>>>> scheme of a large migration. Adding a splitRegion method that takes
> > > >>>>>>> byte[][] for multiple split points would be a nice UX improvement, but not
> > > >>>>>>> strictly necessary.
> > > >>>>>>>
> > > >>>>>>> On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah
> > > >>>>>>> <rushabh.s...@salesforce.com.invalid> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi Everyone,
> > > >>>>>>>> At my workplace, we use HBase + Phoenix to run our customer workloads.
> > > >>>>>>>> Most of our Phoenix tables are multi-tenant and we store the tenantID as
> > > >>>>>>>> the leading part of the rowkey. Each tenant belongs to only 1 HBase
> > > >>>>>>>> cluster. Due to capacity planning, hardware refresh cycles, and most
> > > >>>>>>>> recently the move to public cloud initiatives, we have to migrate a
> > > >>>>>>>> tenant from one HBase cluster (source cluster) to another HBase cluster
> > > >>>>>>>> (target cluster). Normally we migrate a lot of tenants (in the tens of
> > > >>>>>>>> thousands) at a time and hence have to copy a huge amount of data (in
> > > >>>>>>>> TBs) from multiple source clusters to a single target cluster. We have an
> > > >>>>>>>> internal tool which uses the MapReduce framework to copy the data. Since
> > > >>>>>>>> these tenants don't have any presence on the target cluster (note that
> > > >>>>>>>> the table is NOT empty, since we have data for other tenants on the
> > > >>>>>>>> target cluster), they start with one region, and through the organic
> > > >>>>>>>> split process the data gets distributed among different regions and
> > > >>>>>>>> different regionservers.
> > > >>>>>>>> But the organic splitting
> > > >>>>>>>> process takes a lot of time, and due to the distributed nature of the MR
> > > >>>>>>>> framework it causes hotspotting issues on the target cluster which often
> > > >>>>>>>> last for days. This causes availability issues where the CPU and/or the
> > > >>>>>>>> disks are saturated on the regionservers ingesting the data. It also
> > > >>>>>>>> causes a lot of replication-related alerts (Age of last ship, LogQueue
> > > >>>>>>>> size) which go on for days.
> > > >>>>>>>>
> > > >>>>>>>> In order to handle the huge influx of data, we should ideally pre-split
> > > >>>>>>>> the table on the target based on the split structure present on the
> > > >>>>>>>> source cluster. If we pre-split and create empty regions with the right
> > > >>>>>>>> region boundaries, it will help distribute the load to different regions
> > > >>>>>>>> and regionservers and will prevent hotspotting.
> > > >>>>>>>>
> > > >>>>>>>> Problems with the above approach:
> > > >>>>>>>> 1. Currently we allow pre-splitting only while creating a new table. But
> > > >>>>>>>> in our production env, the table is already created for other tenants. So
> > > >>>>>>>> we would like to pre-split an existing table for new tenants.
> > > >>>>>>>> 2. Currently we split a given region into just 2 daughter regions. But if
> > > >>>>>>>> we have the split point information from the source cluster, and the data
> > > >>>>>>>> for the to-be-migrated tenant is split across 100 regions on the source
> > > >>>>>>>> side, we would ideally like to create 100 empty regions on the target
> > > >>>>>>>> cluster.
> > > >>>>>>>>
> > > >>>>>>>> Trying to get early feedback from the community. Do you all think this is
> > > >>>>>>>> a good idea? Open to other suggestions also.
> > > >>>>>>>>
> > > >>>>>>>> Thank you,
> > > >>>>>>>> Rushabh.
> > > >>>
> > > >>> --
> > > >>> Best regards,
> > > >>> Andrew
> > > >>>
> > > >>> Unrest, ignorance distilled, nihilistic imbeciles -
> > > >>> It's what we’ve earned
> > > >>> Welcome, apocalypse, what’s taken you so long?
> > > >>> Bring us the fitting end that we’ve been counting on
> > > >>> - A23, Welcome, Apocalypse
>
> --
> Best regards,
> Andrew
>
> Unrest, ignorance distilled, nihilistic imbeciles -
> It's what we’ve earned
> Welcome, apocalypse, what’s taken you so long?
> Bring us the fitting end that we’ve been counting on
> - A23, Welcome, Apocalypse
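As an aside on the bulk-load path mentioned earlier in the thread, a minimal sketch follows (the class name, job name, and overall wiring are illustrative, and the mapper that reads the source data is omitted). HFileOutputFormat2 partitions its output by the destination table's current region boundaries, which is exactly why pre-aligning those boundaries with the source cluster matters:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.tool.BulkLoadHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TenantBulkLoad {
  public static void run(Configuration conf, TableName table, Path hfileDir) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table htable = conn.getTable(table);
         RegionLocator locator = conn.getRegionLocator(table)) {
      Job job = Job.getInstance(conf, "tenant-migration-" + table);
      // Mapper setup for the copy job omitted; this call wires the output
      // format to the destination table's current region boundaries.
      HFileOutputFormat2.configureIncrementalLoad(job, htable, locator);
      FileOutputFormat.setOutputPath(job, hfileDir);
      if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("HFile generation failed");
      }
      // Move the generated HFiles into the destination regions
      // (BulkLoadHFiles in HBase 2.2+; older releases use LoadIncrementalHFiles).
      BulkLoadHFiles.create(conf).bulkLoad(table, hfileDir);
    }
  }
}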