[ https://issues.apache.org/jira/browse/HBASE-21464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908275#comment-16908275 ]
Andrew Purtell edited comment on HBASE-21464 at 8/15/19 4:49 PM:
-----------------------------------------------------------------

Due to time constraints I didn't find a root cause in the time I had to look into the issue. The bug with meta relocation was resolved by workaround. I would not be opposed to a new effort that reverts this change and attempts to fix (by rewrite, in my opinion) whatever issue exists in the recursive locateRegion and relocateRegion method chains, but have no immediate plans to do this.

I also felt rewriting what I wanted to rewrite would have been more risky. Likely the result would not be appropriate for a patch release.

However on HBASE-22855 Jianghua suggests a more complex scenario involving multi-process timing and if that's on the right track then the focus here on locate/relocateRegion was not correct. Suggest followup on HBASE-22855.

was (Author: apurtell):
    Due to time constraints I didn't find a root cause in the time I had to look into the issue. The bug with meta relocation was resolved by workaround. I would not be opposed to a new effort that reverts this change and attempts to fix (by rewrite, in my opinion) whatever issue exists in the recursive locateRegion and relocateRegion method chains, but have no immediate plans to do this.

> Splitting blocked with meta NSRE during split transaction
> ---------------------------------------------------------
>
>                 Key: HBASE-21464
>                 URL: https://issues.apache.org/jira/browse/HBASE-21464
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.5.0, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 1.4.8, 1.4.7
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Blocker
>             Fix For: 1.4.9
>
>         Attachments: HBASE-21464-branch-1.patch, HBASE-21464-branch-1.patch, HBASE-21464-branch-1.patch, HBASE-21464-branch-1.patch
>
>
> Splitting is blocked during split transaction.
> The split worker is trying to update meta but isn't able to relocate it after NSRE:
> {noformat}
> 2018-11-09 17:50:45,277 INFO [regionserver/ip-172-31-5-92.us-west-2.compute.internal/172.31.5.92:8120-splits-1541785709434] client.RpcRetryingCaller: Call exception, tries=13, retries=350, started=88590 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on ip-172-31-13-83.us-west-2.compute.internal,8120,1541785618832
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3088)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2198)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2396)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)row 'test,,1541785709452.5ba6596f0050c2dab969d152829227c6.44' on table 'hbase:meta' at region=hbase:meta,1.1588230740, hostname=ip-172-31-15-225.us-west-2.compute.internal,8120,1541785640586, seqNum=0
> {noformat}
> Clients, in this case YCSB, are hung with part of the keyspace missing:
> {noformat}
> 2018-11-09 17:51:06,033 DEBUG [hconnection-0x5739e567-shared--pool1-t165] client.ConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=hbase:meta, metaLocation=, attempt=14 of 35 failed; retrying after sleep of 20158 because: No server address listed in hbase:meta for region test,user307326104267982763,1541785754600.ef90030b05cb02305b75e9bfbc3ee081.
> containing row user3301635648728421323
> {noformat}
> Balancing is blocked indefinitely because the split transaction is stuck:
> {noformat}
> 2018-11-09 17:49:55,478 DEBUG [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=8100] master.HMaster: Not running balancer because 3 region(s) in transition: [{ef90030b05cb02305b75e9bfbc3ee081 state=SPLITTING_NEW, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute.internal,8120,1541785626417}, {5ba6596f0050c2dab969d152829227c6 state=SPLITTING, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute....
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)