Hi Everyone, Very nice discussion around the issues we may face with multi-region splitting. I have written up how these challenges can be handled in the design doc linked below. Please spare a few cycles to review it and share your feedback.
https://docs.google.com/document/d/1CW_5NAEG4NdU8jrxmTg07DpCByiF1va0Tsrze8kMEYo/edit Thanks, Rajeshbabu. On Fri, Feb 9, 2024 at 11:40 PM Viraj Jasani <vjas...@apache.org> wrote: > I agree with above points that split taking more time should be equivalent > to actual writes being blocked ever since parent region is brought offline > and hence it is an issue with the availability. So, more region splits with > the same approach does mean that we have longer availability issue. > Daughter regions are created only after the parent region is brought > offline. > > Besides, all the complications with split procedure going wrong and whether > we rollback or only roll forward -- depending on whether we have reached > PONR in the procedure -- would all now be multiplied with the number of > splits we consider. > > > On Fri, Feb 9, 2024 at 9:05 AM Andrew Purtell <apurt...@apache.org> wrote: > > > > Or maybe we could change the way on how we split regions, instead of > > > creating reference files, we directly read a HFile and output multiple > > > HFiles in different ranges, put them in different region directories > > > > > Actually the 'split HFile by reading and then writing to multiple > > > places' approach does not increase the offline region time a lot, it > > > just increases the overall split processing time. > > > > Those file writes would need to complete before the daughters are brought > > online, unless we keep the parent open until all the daughter writes are > > completed, and then we have the problem of what to do with writes to the > > parent in the meantime. They would need to be sent on to the appropriate > > daughter too. If we try to avoid this by making the parent read only > during > > the split, that would be equivalent to the parent being offline in our > > production. Not being ready for writes is equivalent to not available. > > > > Otherwise I have to believe that writing real HFiles instead of > references > > would take longer than writing references, and so assuming parent is > > offline while daughters are being created as is the case today, the total > > time offline during the split would have to be longer. When writing > > reference files there is the file create time and that's it. When writing > > real HFiles there is the file create time and then the time it takes to > > stream through ranges of however many GB of data exist in the region, > many > > read and write IOs. > > > > > > On Fri, Feb 9, 2024 at 7:13 AM 张铎(Duo Zhang) <palomino...@gmail.com> > > wrote: > > > > > Actually the 'split HFile by reading and then writing to multiple > > > places' approach does not increase the offline region time a lot, it > > > just increases the overall split processing time. > > > > > > Anyway, all these things are all just some initial thoughts. If we > > > think this is a useful new feature, we could open an issue and also > > > design doc. > > > > > > Thanks. > > > > > > Andrew Purtell <andrew.purt...@gmail.com> 于2024年2月9日周五 11:09写道: > > > > > > > > Agreed, the key is going to be changing the design of references to > > > support this new. > > > > > > > > Splits must continue to minimize offline time. Increasing that time > at > > > all would not be acceptable. > > > > > > > > > On Feb 8, 2024, at 5:46 PM, 张铎 <palomino...@gmail.com> wrote: > > > > > > > > > > As I said above, split is not like merge, just simply changing the > > > > > Admin API to take a byte[][] does not actually help here. 
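For concreteness, a minimal sketch of the byte[][] overload being discussed, using a hypothetical MultiPointSplitAdmin interface purely for illustration; no such method exists on org.apache.hadoop.hbase.client.Admin today, and as noted above the API shape is the easy part compared to the reference-file design:

    import java.io.IOException;
    import java.util.concurrent.Future;

    // Hypothetical shape of the multi-point split API discussed in this thread.
    // The existing Admin.splitRegionAsync(byte[], byte[]) takes a single split
    // point; this byte[][] variant is NOT part of HBase.
    public interface MultiPointSplitAdmin {
      /**
       * Ask the master to split the given region into splitPoints.length + 1
       * daughters, using the sorted, non-overlapping split points as the new
       * region boundaries.
       */
      Future<Void> splitRegionAsync(byte[] regionName, byte[][] splitPoints)
          throws IOException;
    }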
> > > > > > > > > > For online splitting a region, our algorithm only supports > splitting > > > > > to two sub regions, and then we do compaction to clean up all the > > > > > reference files and prepare for the next split. > > > > > > > > > > I'm not saying this is impossible, but I think the most difficult > > > > > challenge here is how to design a new reference file algorithm to > > > > > support referencing a 'range' of a HFile, not only top half or > bottom > > > > > half. > > > > > In this way we can support splitting a region directly to more > than 2 > > > > > sub regions. > > > > > > > > > > Or maybe we could change the way on how we split regions, instead > of > > > > > creating reference files, we directly read a HFile and output > > multiple > > > > > HFiles in different ranges, put them in different region > directories, > > > > > and also make the flush write to multiple places(the current region > > > > > and all the sub regions), and once everything is fine, we offline > the > > > > > old region(also makes it do a final flush), and then online all the > > > > > sub regions. In this way we have nuch lower overall write > > > > > amplification, but the problem is it will take a very long time > when > > > > > splitting, and also the fail recovery will be more complicated. > > > > > > > > > > Thanks. > > > > > > > > > > Bryan Beaudreault <bbeaudrea...@apache.org> 于2024年2月9日周五 09:27写道: > > > > >> > > > > >> Yep, I forgot about that nuance. I agree we can add a splitRegion > > > overload > > > > >> which takes a byte[][] for multiple split points. > > > > >> > > > > >>> On Thu, Feb 8, 2024 at 8:23 PM Andrew Purtell < > apurt...@apache.org > > > > > > wrote: > > > > >>> > > > > >>> Rushabh already covered this but splitting is not complete until > > the > > > region > > > > >>> can be split again. This is a very important nuance. The daughter > > > regions > > > > >>> are online very quickly, as designed, but then background > > > housekeeping > > > > >>> (compaction) must copy the data before the daughters become > > > splittable. > > > > >>> Depending on compaction pressure, compaction queue depth, and the > > > settings > > > > >>> of various tunables, waiting for some split daughters to become > > > ready to > > > > >>> split again can take many minutes to hours. > > > > >>> > > > > >>> So let's say we have replicas of a table at two sites, site A and > > > site B. > > > > >>> The region boundaries of this table in A and B will be different. > > > Now let's > > > > >>> also say that table data is stored with a key prefix mapping to > > every > > > > >>> unique tenant. When migrating a tenant, data copy will hotspot on > > the > > > > >>> region(s) hosting keys with the tenant's prefix. This is fine if > > > there are > > > > >>> enough regions to absorb the load. We run into trouble when the > > > region > > > > >>> boundaries in the sub-keyspace of interest are quite different > in B > > > versus > > > > >>> A. We get hotspotting and impact to operations until organic > > > splitting > > > > >>> eventually mitigates the hotspotting, but this might also require > > > many > > > > >>> minutes to hours, with noticeable performance degradation in the > > > meantime. > > > > >>> To avoid that degradation we pace the sender but then the copy > may > > > take so > > > > >>> long as to miss SLA for the migration. 
To make the data movement > > > performant > > > > >>> and stay within SLA we want to apply one or more splits or merges > > so > > > the > > > > >>> region boundaries B roughly align to A, avoiding hotspotting. > This > > > will > > > > >>> also make shipping this data by bulk load instead efficient too > by > > > > >>> minimizing the amount of HFile splitting necessary to load them > at > > > the > > > > >>> receiver. > > > > >>> > > > > >>> So let's say we have some regions that need to be split N ways, > > > where N is > > > > >>> order of ~10, by that I mean more than 1 and less than 100, in > > order > > > to > > > > >>> (roughly) align region boundaries. We think this calls for an > > > enhancement > > > > >>> to the split request API where the split should produce a > requested > > > number > > > > >>> of daughter-pairs. Today that is always 1 pair. Instead we might > > > want 2, 5, > > > > >>> 10, conceivably more. And it would be nice if guideposts for > > > multi-way > > > > >>> splitting can be sent over in byte[][]. > > > > >>> > > > > >>> On Wed, Feb 7, 2024 at 10:03 AM Bryan Beaudreault < > > > bbeaudrea...@apache.org > > > > >>>> > > > > >>> wrote: > > > > >>> > > > > >>>> This is the first time I've heard of a region split taking 4 > > > minutes. For > > > > >>>> us, it's always on the order of seconds. That's true even for a > > > large > > > > >>> 50+gb > > > > >>>> region. It might be worth looking into why that's so slow for > you. > > > > >>>> > > > > >>>> On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah > > > > >>>> <rushabh.s...@salesforce.com.invalid> wrote: > > > > >>>> > > > > >>>>> Thank you Andrew, Bryan and Duo for your responses. > > > > >>>>> > > > > >>>>>> My main thought is that a migration like this should use bulk > > > > >>> loading, > > > > >>>>>> But also, I think, that data transfer should be in bulk > > > > >>>>> > > > > >>>>> We are working on moving to bulk loading. > > > > >>>>> > > > > >>>>>> With Admin.splitRegion, you can specify a split point. You can > > use > > > > >>> that > > > > >>>>> to > > > > >>>>> iteratively add a bunch of regions wherever you need them in > the > > > > >>>> keyspace. > > > > >>>>> Yes, it's 2 at a time, but it should still be quick enough in > the > > > grand > > > > >>>>> scheme of a large migration. > > > > >>>>> > > > > >>>>> > > > > >>>>> Trying to do some back of the envelope calculations. > > > > >>>>> In a production environment, it took around 4 minutes to split > a > > > > >>> recently > > > > >>>>> split region which had 4 store files with a total of 5 GB of > > data. > > > > >>>>> Assuming we are migrating 5000 tenants at a time and normally > we > > > have > > > > >>>>> around 10% of the tenants (500 tenants) which have data > > > > >>>>> spread across more than 1000 regions. We have around 10 huge > > tables > > > > >>>> where > > > > >>>>> we store the tenant's data for different use cases. > > > > >>>>> All the above numbers are on the *conservative* side. > > > > >>>>> > > > > >>>>> To create a split structure for 1000 regions, we need 10 > > > iterations of > > > > >>>> the > > > > >>>>> splits (2^10 = 1024). This assumes we are parallely splitting > the > > > > >>>> regions. > > > > >>>>> Each split takes around 4 minutes. So to create 1000 regions > just > > > for 1 > > > > >>>>> tenant and for 1 table, it takes around 40 minutes. > > > > >>>>> For 10 tables for 1 tenant, it takes around 400 minutes. > > > > >>>>> > > > > >>>>> For 500 tenants, this will take around *140 days*. 
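Spelling out the arithmetic behind the 140-day figure above, as a quick sanity check (a deliberately rough model that assumes the split rounds for a given table run strictly one after another; the class name is just for illustration):

    // Back-of-envelope check of the numbers quoted above.
    public class MigrationSplitEstimate {
      public static void main(String[] args) {
        int splitRounds = 10;      // 2^10 = 1024 >= 1000 target regions
        int minutesPerRound = 4;   // observed ~4 minutes per region split
        int tables = 10;           // large multi-tenant tables per tenant
        int largeTenants = 500;    // ~10% of the 5000 tenants being migrated
        long totalMinutes = (long) splitRounds * minutesPerRound * tables * largeTenants;
        System.out.printf("%d minutes ~= %.0f days%n",
            totalMinutes, totalMinutes / (60.0 * 24));
        // prints: 200000 minutes ~= 139 days, i.e. the ~140 days above
      }
    }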
To reduce > this > > > time > > > > >>>>> further, we can also create a split structure for each tenant > and > > > each > > > > >>>>> table in parallel. > > > > >>>>> But this would put a lot of pressure on the cluster and also it > > > will > > > > >>>>> require a lot of operational overhead and still we will end up > > with > > > > >>>>> the whole process taking days, if not months. > > > > >>>>> > > > > >>>>> Since we are moving our infrastructure to Public Cloud, we > > > anticipate > > > > >>>> this > > > > >>>>> huge migration happening once every month. > > > > >>>>> > > > > >>>>> > > > > >>>>>> Adding a splitRegion method that takes byte[][] for multiple > > split > > > > >>>> points > > > > >>>>> would be a nice UX improvement, but not > > > > >>>>> strictly necessary. > > > > >>>>> > > > > >>>>> IMHO for all the reasons stated above, I believe this is > > necessary. > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> On Mon, Jan 29, 2024 at 6:25 AM 张铎(Duo Zhang) < > > > palomino...@gmail.com> > > > > >>>>> wrote: > > > > >>>>> > > > > >>>>>> As it is called 'pre' split, it means that it can only happen > > when > > > > >>>>>> there is no data in table. > > > > >>>>>> > > > > >>>>>> If there are already data in the table, you can not always > > create > > > > >>>>>> 'empty' regions, as you do not know whether there are already > > > data in > > > > >>>>>> the given range... > > > > >>>>>> > > > > >>>>>> And technically, if you want to split a HFile into more than 2 > > > parts, > > > > >>>>>> you need to design new algorithm as now in HBase we only > support > > > top > > > > >>>>>> reference and bottom reference... > > > > >>>>>> > > > > >>>>>> Thanks. > > > > >>>>>> > > > > >>>>>> Bryan Beaudreault <bbeaudrea...@apache.org> 于2024年1月27日周六 > > > 02:16写道: > > > > >>>>>>> > > > > >>>>>>> My main thought is that a migration like this should use bulk > > > > >>>> loading, > > > > >>>>>>> which should be relatively easy given you already use MR > > > > >>>>>>> (HFileOutputFormat2). It doesn't solve the region-splitting > > > > >>> problem. > > > > >>>>> With > > > > >>>>>>> Admin.splitRegion, you can specify a split point. You can use > > > that > > > > >>> to > > > > >>>>>>> iteratively add a bunch of regions wherever you need them in > > the > > > > >>>>>> keyspace. > > > > >>>>>>> Yes, it's 2 at a time, but it should still be quick enough in > > the > > > > >>>> grand > > > > >>>>>>> scheme of a large migration. Adding a splitRegion method that > > > takes > > > > >>>>>>> byte[][] for multiple split points would be a nice UX > > > improvement, > > > > >>>> but > > > > >>>>>> not > > > > >>>>>>> strictly necessary. > > > > >>>>>>> > > > > >>>>>>> On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah > > > > >>>>>>> <rushabh.s...@salesforce.com.invalid> wrote: > > > > >>>>>>> > > > > >>>>>>>> Hi Everyone, > > > > >>>>>>>> At my workplace, we use HBase + Phoenix to run our customer > > > > >>>>> workloads. > > > > >>>>>> Most > > > > >>>>>>>> of our phoenix tables are multi-tenant and we store the > > tenantID > > > > >>> as > > > > >>>>> the > > > > >>>>>>>> leading part of the rowkey. Each tenant belongs to only 1 > > hbase > > > > >>>>>> cluster. > > > > >>>>>>>> Due to capacity planning, hardware refresh cycles and most > > > > >>> recently > > > > >>>>>> move to > > > > >>>>>>>> public cloud initiatives, we have to migrate a tenant from > one > > > > >>>> hbase > > > > >>>>>>>> cluster (source cluster) to another hbase cluster (target > > > > >>> cluster). 
> > > > >>>>>>>> Normally we migrate a lot of tenants (in 10s of thousands) > at > > a > > > > >>>> time > > > > >>>>>> and > > > > >>>>>>>> hence we have to copy a huge amount of data (in TBs) from > > > > >>> multiple > > > > >>>>>> source > > > > >>>>>>>> clusters to a single target cluster. We have our internal > tool > > > > >>>> which > > > > >>>>>> uses > > > > >>>>>>>> MapReduce framework to copy the data. Since all of these > > tenants > > > > >>>>> don’t > > > > >>>>>> have > > > > >>>>>>>> any presence on the target cluster (Note that the table is > NOT > > > > >>>> empty > > > > >>>>>> since > > > > >>>>>>>> we have data for other tenants in the target cluster), they > > > start > > > > >>>>> with > > > > >>>>>> one > > > > >>>>>>>> region and due to an organic split process, the data gets > > > > >>>> distributed > > > > >>>>>> among > > > > >>>>>>>> different regions and different regionservers. But the > organic > > > > >>>>>> splitting > > > > >>>>>>>> process takes a lot of time and due to the distributed > nature > > of > > > > >>>> the > > > > >>>>> MR > > > > >>>>>>>> framework, it causes hotspotting issues on the target > cluster > > > > >>> which > > > > >>>>>> often > > > > >>>>>>>> lasts for days. This causes availability issues where the > CPU > > is > > > > >>>>>> saturated > > > > >>>>>>>> and/or disk saturation on the regionservers ingesting the > > data. > > > > >>>> Also > > > > >>>>>> this > > > > >>>>>>>> causes a lot of replication related alerts (Age of last > ship, > > > > >>>>> LogQueue > > > > >>>>>>>> size) which goes on for days. > > > > >>>>>>>> > > > > >>>>>>>> In order to handle the huge influx of data, we should > ideally > > > > >>>>>> pre-split the > > > > >>>>>>>> table on the target based on the split structure present on > > the > > > > >>>>> source > > > > >>>>>>>> cluster. If we pre-split and create empty regions with right > > > > >>> region > > > > >>>>>>>> boundaries it will help to distribute the load to different > > > > >>> regions > > > > >>>>> and > > > > >>>>>>>> region servers and will prevent hotspotting. > > > > >>>>>>>> > > > > >>>>>>>> Problems with the above approach: > > > > >>>>>>>> 1. Currently we allow pre splitting only while creating a > new > > > > >>>> table. > > > > >>>>>> But in > > > > >>>>>>>> our production env, we already have the table created for > > other > > > > >>>>>> tenants. So > > > > >>>>>>>> we would like to pre-split an existing table for new > tenants. > > > > >>>>>>>> 2. Currently we split a given region into just 2 daughter > > > > >>> regions. > > > > >>>>> But > > > > >>>>>> if > > > > >>>>>>>> we have the split points information from the source cluster > > and > > > > >>> if > > > > >>>>> the > > > > >>>>>>>> data for the to-be-migrated tenant is split across 100 > regions > > > on > > > > >>>> the > > > > >>>>>>>> source side, we would ideally like to create 100 empty > regions > > > on > > > > >>>> the > > > > >>>>>>>> target cluster. > > > > >>>>>>>> > > > > >>>>>>>> Trying to get early feedback from the community. Do you all > > > think > > > > >>>>> this > > > > >>>>>> is a > > > > >>>>>>>> good idea? Open to other suggestions also. > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> Thank you, > > > > >>>>>>>> Rushabh. 
> > -- > > Best regards, > > Andrew
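For anyone who wants to try the interim approach suggested in the thread, i.e. iteratively splitting the existing, non-empty table two ways at a time using split points copied from the source cluster, here is a rough sketch against the current client API (the class and method names are made up for illustration; pacing, error handling, and waiting for the daughters to become splittable again are intentionally left out):

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;

    public class TenantPreSplit {
      /**
       * splitPoints are assumed to be the tenant-prefixed region boundaries
       * taken from the source cluster, sorted and de-duplicated.
       */
      public static void applySplitPoints(Connection connection, TableName table,
                                          List<byte[]> splitPoints) throws IOException {
        try (Admin admin = connection.getAdmin()) {
          for (byte[] splitPoint : splitPoints) {
            // Splits whichever region currently contains splitPoint into two
            // daughters at that key; repeating this over all boundaries
            // approximates the desired N-way split, two daughters at a time.
            admin.split(table, splitPoint);
            // In practice the caller must wait here for the daughters to come
            // online (and become splittable again) before issuing further
            // splits into the same key range -- which is exactly the latency
            // problem discussed in this thread.
          }
        }
      }
    }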