Yep, I forgot about that nuance. I agree we can add a splitRegion overload which takes a byte[][] for multiple split points.
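To make that concrete, a rough sketch of what the overload could look like. This is only a strawman for discussion, not existing API: the method name, whether it lives on Admin or AsyncAdmin, and the return type are all open questions here.

import java.io.IOException;
import java.util.concurrent.Future;

// Strawman only: a multi-point variant of the existing
// splitRegionAsync(byte[] regionName, byte[] splitPoint). Accepting N
// guideposts would yield N+1 daughters from a single request instead of
// forcing clients to drive log2(N) rounds of two-way splits.
public interface MultiSplitAdminSketch {

  // splitPoints are assumed to be sorted, de-duplicated, and strictly inside
  // the region's [startKey, endKey) range; anything else would be rejected.
  Future<Void> splitRegionAsync(byte[] regionName, byte[][] splitPoints)
      throws IOException;
}

On the server side this presumably also needs the point Duo raised addressed: split reference files today only come as top/bottom halves, so an N-way split of an HFile needs either a new reference scheme or an internal chain of two-way splits.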
On Thu, Feb 8, 2024 at 8:23 PM Andrew Purtell <apurt...@apache.org> wrote:

> Rushabh already covered this but splitting is not complete until the region can be split again. This is a very important nuance. The daughter regions are online very quickly, as designed, but then background housekeeping (compaction) must copy the data before the daughters become splittable. Depending on compaction pressure, compaction queue depth, and the settings of various tunables, waiting for some split daughters to become ready to split again can take many minutes to hours.
>
> So let's say we have replicas of a table at two sites, site A and site B. The region boundaries of this table in A and B will be different. Now let's also say that table data is stored with a key prefix mapping to every unique tenant. When migrating a tenant, data copy will hotspot on the region(s) hosting keys with the tenant's prefix. This is fine if there are enough regions to absorb the load. We run into trouble when the region boundaries in the sub-keyspace of interest are quite different in B versus A. We get hotspotting and impact to operations until organic splitting eventually mitigates the hotspotting, but this might also require many minutes to hours, with noticeable performance degradation in the meantime. To avoid that degradation we pace the sender, but then the copy may take so long as to miss the SLA for the migration. To make the data movement performant and stay within SLA we want to apply one or more splits or merges so the region boundaries in B roughly align to those in A, avoiding hotspotting. This will also make shipping this data by bulk load instead more efficient, by minimizing the amount of HFile splitting necessary to load the files at the receiver.
>
> So let's say we have some regions that need to be split N ways, where N is on the order of ~10 (by that I mean more than 1 and less than 100), in order to (roughly) align region boundaries. We think this calls for an enhancement to the split request API where the split should produce a requested number of daughter-pairs. Today that is always 1 pair. Instead we might want 2, 5, 10, conceivably more. And it would be nice if guideposts for multi-way splitting can be sent over in a byte[][].
>
> On Wed, Feb 7, 2024 at 10:03 AM Bryan Beaudreault <bbeaudrea...@apache.org> wrote:
>
> > This is the first time I've heard of a region split taking 4 minutes. For us, it's always on the order of seconds. That's true even for a large 50+gb region. It might be worth looking into why that's so slow for you.
> >
> > On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> >
> > > Thank you Andrew, Bryan and Duo for your responses.
> > >
> > > > My main thought is that a migration like this should use bulk loading,
> > > > But also, I think, that data transfer should be in bulk
> > >
> > > We are working on moving to bulk loading.
> > >
> > > > With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration.
> > >
> > > Trying to do some back-of-the-envelope calculations.
> > > In a production environment, it took around 4 minutes to split a recently split region which had 4 store files with a total of 5 GB of data.
> > > Assume we are migrating 5000 tenants at a time; normally around 10% of the tenants (500 tenants) have data spread across more than 1000 regions. We have around 10 huge tables where we store the tenants' data for different use cases. All the above numbers are on the *conservative* side.
> > >
> > > To create a split structure for 1000 regions, we need 10 iterations of splits (2^10 = 1024), assuming we split the regions in parallel. Each split takes around 4 minutes, so to create 1000 regions for just 1 tenant and 1 table, it takes around 40 minutes. For 10 tables for 1 tenant, it takes around 400 minutes.
> > >
> > > For 500 tenants, this will take around *140 days*. To reduce this time further, we can also create a split structure for each tenant and each table in parallel. But this would put a lot of pressure on the cluster, require a lot of operational overhead, and still the whole process would take days, if not months.
> > >
> > > Since we are moving our infrastructure to the public cloud, we anticipate this huge migration happening once every month.
> > >
> > > > Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> > >
> > > IMHO, for all the reasons stated above, I believe this is necessary.
> > >
> > > On Mon, Jan 29, 2024 at 6:25 AM 张铎 (Duo Zhang) <palomino...@gmail.com> wrote:
> > >
> > > > As it is called 'pre' split, it means that it can only happen when there is no data in the table.
> > > >
> > > > If there is already data in the table, you cannot always create 'empty' regions, as you do not know whether there is already data in the given range...
> > > >
> > > > And technically, if you want to split an HFile into more than 2 parts, you need to design a new algorithm, as right now HBase only supports a top reference and a bottom reference...
> > > >
> > > > Thanks.
> > > >
> > > > Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Sat, Jan 27, 2024 at 02:16:
> > > >
> > > > > My main thought is that a migration like this should use bulk loading, which should be relatively easy given you already use MR (HFileOutputFormat2). It doesn't solve the region-splitting problem. With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration. Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> > > > >
> > > > > On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> > > > >
> > > > > > Hi Everyone,
> > > > > > At my workplace, we use HBase + Phoenix to run our customer workloads. Most of our Phoenix tables are multi-tenant and we store the tenantID as the leading part of the rowkey. Each tenant belongs to only 1 HBase cluster.
> > > > > > Due to capacity planning, hardware refresh cycles, and most recently the move to public cloud, we have to migrate tenants from one HBase cluster (source cluster) to another HBase cluster (target cluster). Normally we migrate a lot of tenants (in the 10s of thousands) at a time, and hence we have to copy a huge amount of data (in TBs) from multiple source clusters to a single target cluster. We have an internal tool which uses the MapReduce framework to copy the data. Since all of these tenants don't have any presence on the target cluster (note that the table is NOT empty, since we have data for other tenants in the target cluster), they start with one region, and through the organic split process the data gets distributed among different regions and different regionservers. But the organic splitting process takes a lot of time, and due to the distributed nature of the MR framework it causes hotspotting issues on the target cluster which often last for days. This causes availability issues where the CPU and/or disks are saturated on the regionservers ingesting the data. It also causes a lot of replication-related alerts (Age of last ship, LogQueue size) which go on for days.
> > > > > >
> > > > > > In order to handle the huge influx of data, we should ideally pre-split the table on the target based on the split structure present on the source cluster. If we pre-split and create empty regions with the right region boundaries, it will help distribute the load to different regions and region servers and will prevent hotspotting.
> > > > > >
> > > > > > Problems with the above approach:
> > > > > > 1. Currently we allow pre-splitting only while creating a new table. But in our production env, we already have the table created for other tenants. So we would like to pre-split an existing table for new tenants.
> > > > > > 2. Currently we split a given region into just 2 daughter regions. But if we have the split point information from the source cluster and the data for the to-be-migrated tenant is split across 100 regions on the source side, we would ideally like to create 100 empty regions on the target cluster.
> > > > > >
> > > > > > Trying to get early feedback from the community. Do you all think this is a good idea? Open to other suggestions also.
> > > > > >
> > > > > > Thank you,
> > > > > > Rushabh.
>
> --
> Best regards,
> Andrew
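For completeness, below is a rough sketch of the interim approach Bryan describes above: drive the existing two-way split API with split points copied from the source cluster. It is only a sketch under assumptions, not a tested tool; loadSourceBoundaries and waitUntilSplittableAgain are hypothetical placeholders, and the wait step is where the ~4 minutes per round from Rushabh's numbers goes.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionInfo;

// Sketch: approximate the source cluster's region boundaries on the target by
// issuing one two-way split per desired boundary. With the current API this is
// inherently sequential per region subtree: a daughter cannot be split again
// until post-split compaction has rewritten its reference files, which is the
// 4-minutes-per-round cost discussed in this thread (10 rounds for ~1000
// regions, since 2^10 = 1024).
public class AlignRegionBoundariesSketch {

  public static void main(String[] args) throws Exception {
    TableName table = TableName.valueOf(args[0]);

    try (Connection conn =
            ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      for (byte[] splitPoint : loadSourceBoundaries(args[1])) {
        // Splits the target region currently containing splitPoint.
        admin.split(table, splitPoint);
        waitUntilSplittableAgain(admin, table, splitPoint);
      }
      List<RegionInfo> regions = admin.getRegions(table);
      System.out.println("Target regions after alignment: " + regions.size());
    }
  }

  // Hypothetical helper: in a real tool these boundaries would come from the
  // source cluster (e.g. its region start keys filtered to the tenant prefix).
  private static byte[][] loadSourceBoundaries(String source) {
    return new byte[0][];
  }

  // Hypothetical helper: a real implementation would confirm the daughters no
  // longer carry reference files (post-split compaction done) before splitting
  // them again; a fixed sleep is only a stand-in here.
  private static void waitUntilSplittableAgain(Admin admin, TableName table,
      byte[] splitPoint) throws IOException, InterruptedException {
    Thread.sleep(5_000L);
  }
}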