> This is the first time I've heard of a region split taking 4 minutes. For us, it's always on the order of seconds. That's true even for a large 50+gb region. It might be worth looking into why that's so slow for you.
For us also, the split takes less than a second. But to split a recently split region, I have to wait around 4 minutes for the master to clean up all the parent references. It throws this exception:

2024-02-06 20:52:33,389 DEBUG [.default.FPBQ.Fifo.handler=249,queue=15,port=61000] assignment.SplitTableRegionProcedure - Splittable=false state=OPEN, location=<RS-name>
2024-02-06 20:52:37,482 DEBUG [.default.FPBQ.Fifo.handler=249,queue=15,port=61000] ipc.MetricsHBaseServer - Unknown exception type
org.apache.hadoop.hbase.DoNotRetryIOException: 23d5a72661ce2027cb7388b694dc235a NOT splittable
  at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.checkSplittable(SplitTableRegionProcedure.java:231)
  at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:134)
  at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:1031)
  at org.apache.hadoop.hbase.master.HMaster$3.run(HMaster.java:2198)
  at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.submitProcedure(MasterProcedureUtil.java:132)
  at org.apache.hadoop.hbase.master.HMaster.splitRegion(HMaster.java:2191)
  at org.apache.hadoop.hbase.master.MasterRpcServices.splitRegion(MasterRpcServices.java:860)
  at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
  at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
  at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
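For reference, this is roughly the client-side workaround we have been considering: keep retrying the split until the master has cleaned up the parent references and the region becomes splittable again. This is only a rough, untested sketch against the HBase 2.x Admin API; the table name, split point, timeout and sleep interval are made-up placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitWithRetry {

  /** Retry the split until the master stops rejecting it as "NOT splittable". */
  public static void splitWhenSplittable(Admin admin, TableName table, byte[] splitPoint)
      throws Exception {
    long deadline = System.currentTimeMillis() + 10 * 60 * 1000L; // placeholder 10 min budget
    while (true) {
      try {
        // The master rejects this with DoNotRetryIOException("... NOT splittable")
        // while the freshly split parent's reference files are still around.
        admin.split(table, splitPoint);
        return;
      } catch (IOException e) {
        boolean notSplittable = e.getMessage() != null && e.getMessage().contains("NOT splittable");
        if (!notSplittable || System.currentTimeMillis() > deadline) {
          throw e;
        }
        Thread.sleep(10_000L); // placeholder backoff while the parent references get cleaned up
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Placeholder table name and split point.
      splitWhenSplittable(admin, TableName.valueOf("MY_TABLE"), Bytes.toBytes("tenant123"));
    }
  }
}

Even with such a retry loop, every round of splits still has to sit out the parent-reference cleanup, which is where the roughly 4 minutes per round in the calculation quoted below comes from.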
On Wed, Feb 7, 2024 at 10:02 AM Bryan Beaudreault <bbeaudrea...@apache.org> wrote:

> This is the first time I've heard of a region split taking 4 minutes. For us, it's always on the order of seconds. That's true even for a large 50+gb region. It might be worth looking into why that's so slow for you.
>
> On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
>
> > Thank you Andrew, Bryan and Duo for your responses.
> >
> > > My main thought is that a migration like this should use bulk loading,
> > > But also, I think, that data transfer should be in bulk
> >
> > We are working on moving to bulk loading.
> >
> > > With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration.
> >
> > Trying to do some back-of-the-envelope calculations: in a production environment, it took around 4 minutes to split a recently split region which had 4 store files with a total of 5 GB of data. Assume we are migrating 5000 tenants at a time; normally around 10% of the tenants (500 tenants) have data spread across more than 1000 regions, and we have around 10 huge tables where we store the tenants' data for different use cases. All of the above numbers are on the *conservative* side.
> >
> > To create a split structure for 1000 regions, we need 10 rounds of splits (2^10 = 1024), assuming we split the regions in parallel within each round. Each split takes around 4 minutes, so creating 1000 regions for just 1 tenant and 1 table takes around 40 minutes. For 10 tables for 1 tenant, it takes around 400 minutes.
> >
> > For 500 tenants, this will take around *140 days*. To reduce this time further, we can also create the split structure for each tenant and each table in parallel, but this would put a lot of pressure on the cluster, require a lot of operational overhead, and we would still end up with the whole process taking days, if not months.
> >
> > Since we are moving our infrastructure to Public Cloud, we anticipate this huge migration happening once every month.
> >
> > > Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> >
> > IMHO, for all the reasons stated above, I believe this is necessary.
> >
> > On Mon, Jan 29, 2024 at 6:25 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> >
> > > As it is called 'pre' split, it means that it can only happen when there is no data in the table.
> > >
> > > If there is already data in the table, you cannot always create 'empty' regions, as you do not know whether there is already data in the given range...
> > >
> > > And technically, if you want to split an HFile into more than 2 parts, you need to design a new algorithm, as HBase currently only supports a top reference and a bottom reference...
> > >
> > > Thanks.
> > >
> > > On Sat, Jan 27, 2024 at 02:16, Bryan Beaudreault <bbeaudrea...@apache.org> wrote:
> > >
> > > > My main thought is that a migration like this should use bulk loading, which should be relatively easy given you already use MR (HFileOutputFormat2). It doesn't solve the region-splitting problem. With Admin.splitRegion, you can specify a split point. You can use that to iteratively add a bunch of regions wherever you need them in the keyspace. Yes, it's 2 at a time, but it should still be quick enough in the grand scheme of a large migration. Adding a splitRegion method that takes byte[][] for multiple split points would be a nice UX improvement, but not strictly necessary.
> > > >
> > > > On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah <rushabh.s...@salesforce.com.invalid> wrote:
> > > >
> > > > > Hi Everyone,
> > > > > At my workplace, we use HBase + Phoenix to run our customer workloads. Most of our Phoenix tables are multi-tenant and we store the tenantID as the leading part of the rowkey. Each tenant belongs to only 1 HBase cluster. Due to capacity planning, hardware refresh cycles and, most recently, the move to public cloud initiatives, we have to migrate a tenant from one HBase cluster (source cluster) to another HBase cluster (target cluster). Normally we migrate a lot of tenants (in the tens of thousands) at a time and hence we have to copy a huge amount of data (in TBs) from multiple source clusters to a single target cluster. We have an internal tool which uses the MapReduce framework to copy the data. Since all of these tenants don't have any presence on the target cluster (note that the table is NOT empty, since we have data for other tenants in the target cluster), they start with one region and, through the organic split process, the data gets distributed among different regions and different regionservers.
> > > > > But the organic splitting process takes a lot of time and, due to the distributed nature of the MR framework, it causes hotspotting issues on the target cluster which often last for days. This causes availability issues where the CPU and/or disks are saturated on the regionservers ingesting the data. It also causes a lot of replication-related alerts (Age of last ship, LogQueue size) which go on for days.
> > > > >
> > > > > In order to handle the huge influx of data, we should ideally pre-split the table on the target based on the split structure present on the source cluster. If we pre-split and create empty regions with the right region boundaries, it will help distribute the load to different regions and regionservers and will prevent hotspotting.
> > > > >
> > > > > Problems with the above approach:
> > > > > 1. Currently we allow pre-splitting only while creating a new table. But in our production env, we already have the table created for other tenants. So we would like to pre-split an existing table for new tenants.
> > > > > 2. Currently we split a given region into just 2 daughter regions. But if we have the split point information from the source cluster and the data for the to-be-migrated tenant is split across 100 regions on the source side, we would ideally like to create 100 empty regions on the target cluster.
> > > > >
> > > > > Trying to get early feedback from the community. Do you all think this is a good idea? Open to other suggestions also.
> > > > >
> > > > > Thank you,
> > > > > Rushabh.
> > > > >
> > >
>
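To make the Admin.splitRegion suggestion quoted above concrete, here is a rough, untested sketch (against the HBase 2.x client API) of the kind of helper we have in mind: take the region boundaries exported from the source cluster for a tenant and apply them to the already-existing table on the target cluster as repeated two-way splits, waiting for each split to finish before moving on. The table name, boundaries, timeout and sleep interval are made-up placeholders, and error handling is minimal.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExistingTable {

  /** Apply each split point with a two-way Admin.split and wait until the region count grows. */
  public static void preSplit(Admin admin, TableName table, byte[][] splitPoints) throws Exception {
    for (byte[] splitPoint : splitPoints) {
      int before = admin.getRegions(table).size();
      admin.split(table, splitPoint); // splits the region currently containing splitPoint
      long deadline = System.currentTimeMillis() + 10 * 60 * 1000L; // placeholder budget
      while (admin.getRegions(table).size() <= before) {
        if (System.currentTimeMillis() > deadline) {
          throw new IllegalStateException(
            "Timed out waiting for split at " + Bytes.toStringBinary(splitPoint));
        }
        Thread.sleep(5_000L); // daughter regions come online asynchronously
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Placeholder boundaries; in practice these would be the start keys of the
    // tenant's regions exported from the source cluster.
    byte[][] boundaries = {
      Bytes.toBytes("tenant123_row05000"),
      Bytes.toBytes("tenant123_row10000"),
    };
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      preSplit(admin, TableName.valueOf("MY_TABLE"), boundaries);
    }
  }
}

Note that a freshly created daughter region may still refuse the next split until the master has cleaned up the parent's reference files (the "NOT splittable" exception at the top of this mail), so in practice this loop would be combined with the retry sketch earlier in the thread. A master-side splitRegion that accepts a byte[][] of split points would let us avoid driving all of this from the client.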
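For completeness, here is also a rough, untested sketch of the bulk-load path suggested above: generate HFiles from MapReduce with HFileOutputFormat2, then hand them to the cluster in one step. The class, table, paths, column family and input format are made-up placeholders, and the BulkLoadHFiles tool assumes HBase 2.2 or later (older releases use LoadIncrementalHFiles). The relevant point for this thread is that configureIncrementalLoad partitions the job by the target table's current region boundaries, which is exactly why pre-splitting the target table to match the source layout matters.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.tool.BulkLoadHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TenantBulkLoad {

  /** Placeholder mapper: turns "rowkey<TAB>value" lines into Puts for a single column. */
  public static class LineToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private static final byte[] CF = Bytes.toBytes("0");   // placeholder column family
    private static final byte[] QUAL = Bytes.toBytes("v"); // placeholder qualifier

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      ctx.write(new ImmutableBytesWritable(row),
        new Put(row).addColumn(CF, QUAL, Bytes.toBytes(parts[1])));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("MY_TABLE");      // placeholder
    Path input = new Path("/tmp/tenant-migration/input");     // placeholder
    Path hfileDir = new Path("/tmp/tenant-migration/hfiles"); // placeholder

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName)) {

      Job job = Job.getInstance(conf, "tenant-migration-bulkload");
      job.setJarByClass(TenantBulkLoad.class);
      job.setInputFormatClass(TextInputFormat.class);
      FileInputFormat.addInputPath(job, input);
      job.setMapperClass(LineToPutMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);

      // Sets up HFileOutputFormat2 with a total-order partitioner keyed by the
      // target table's current region boundaries.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
      FileOutputFormat.setOutputPath(job, hfileDir);

      if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("bulk load HFile generation failed");
      }

      // Atomically adopt the generated HFiles into the serving regions.
      BulkLoadHFiles.create(conf).bulkLoad(tableName, hfileDir);
    }
  }
}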