[
https://issues.apache.org/jira/browse/HBASE-21464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705565#comment-16705565
]
Andrew Purtell edited comment on HBASE-21464 at 12/1/18 1:43 AM:
-----------------------------------------------------------------
I don't think recursive region relocation works the way we all expect, namely
that when we get an NSRE on meta we always end up in
ConnectionManager#locateMeta with useCache = false. The recursive region
relocation code as a whole is hard to understand and should be rewritten. I'm
not going to do that today. What I do
have is a patch that works reliably to fix the issue in my test environment
when meta is moved during split activity while preserving the intents of
HBASE-10785 (don't overload zookeeper with lookups by looking up meta every
time) and HBASE-19260 (don't overload zookeeper with unnecessary concurrent
lookups). There is a new limit on cache entry age for meta, hardcoded to 10
seconds (should it be configurable? I don't think it matters much...), to
prevent getting stuck on a stale meta location. Consider it a safety valve we
need while continuing to look at this problem.
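A minimal sketch of the age limit idea, for illustration only. This is not the
code in the attached patch: the class and method names (AgedLocationCache,
locate) are hypothetical, the Supplier stands in for the ZooKeeper lookup of
the meta location, and the clock is injected only so the age check can be
exercised without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: caches a single location string and refuses to serve
 * it from cache once the entry is older than maxAgeMs.
 */
class AgedLocationCache {
    private final long maxAgeMs;       // e.g. 10_000 for a 10 second limit
    private final LongSupplier clock;  // returns the current time in milliseconds
    private String location;           // cached location, null if nothing is cached
    private long cachedAtMs;           // time at which 'location' was stored

    AgedLocationCache(long maxAgeMs, LongSupplier clock) {
        this.maxAgeMs = maxAgeMs;
        this.clock = clock;
    }

    /**
     * Returns the cached location if the caller allows it and the entry is
     * still fresh; otherwise performs the lookup and caches the result.
     */
    synchronized String locate(boolean useCache, Supplier<String> lookup) {
        long now = clock.getAsLong();
        boolean fresh = location != null && (now - cachedAtMs) < maxAgeMs;
        if (useCache && fresh) {
            return location;
        }
        // Either the caller asked to bypass the cache or the entry is too old.
        location = lookup.get();
        cachedAtMs = now;
        return location;
    }
}
```

With a limit of 10 seconds, a caller that keeps passing useCache = true still
triggers a fresh lookup at most once every 10 seconds, so a stale location
cannot be served forever and the lookup rate stays bounded.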
How to reproduce:
* Run a load client. I use YCSB with 100 threads. The test table is named
"test".
* In the HBase shell: while true ; do sleep 30 ; balancer ; flush 'test';
compact 'test' ; split 'test' ; balancer ; done
You've hit the problem when the result of the shell 'balancer' command is
always false. On the master, you'll find a split in progress that can't
finish. On the regionserver attempting the split, you'll find the split
worker going back again and again, looking for meta, to the regionserver that
no longer hosts it.
was (Author: apurtell):
I don't think recursive region relocation works the way we are all expecting,
that when we NSRE on meta we will always end up in
ConnectionManager#locateRegion with useCache = false. The sum of recursive
region relocation code is hard to understand and should be rewritten. I'm not
going to do that today. What I do have is a patch that works reliably to fix
the issue in my test environment when meta is moved during split activity while
preserving the intents of HBASE-10785 (don't overload zookeeper with lookups by
looking up meta every time) and HBASE-19260 (don't overload zookeeper with
unnecessary concurrent lookups). There is a new limit on cache entry age for
meta, hardcoded to 10 seconds (should it be configurable? I don't think it
matters much...), to prevent getting stuck on a stale meta location. Consider
it a safety valve we need while continuing to look at this problem.
How to reproduce:
* Run a load client. I use YCSB with 100 threads. The test table is named
"test".
* In the HBase shell: while true ; do sleep 30 ; balancer ; flush 'test';
compact 'test' ; split 'test' ; balancer ; done
You've hit the problem when the result of the shell 'balancer' command is
always false. Go to the master, you'll find a split in progress that can't
finish. Go to the regionserver attempting the split and you'll find the split
worker going back again and again to the regionserver no longer hosting meta
looking for meta.
> Splitting blocked with meta NSRE during split transaction
> ---------------------------------------------------------
>
> Key: HBASE-21464
> URL: https://issues.apache.org/jira/browse/HBASE-21464
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.5.0, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 1.4.8, 1.4.7
> Reporter: Andrew Purtell
> Assignee: Andrew Purtell
> Priority: Blocker
> Fix For: 1.5.0, 1.4.9
>
> Attachments: HBASE-21464-branch-1.patch, HBASE-21464-branch-1.patch,
> HBASE-21464-branch-1.patch, HBASE-21464-branch-1.patch
>
>
> Splitting is blocked during split transaction. The split worker is trying to
> update meta but isn't able to relocate it after NSRE:
> {noformat}
> 2018-11-09 17:50:45,277 INFO
> [regionserver/ip-172-31-5-92.us-west-2.compute.internal/172.31.5.92:8120-splits-1541785709434]
> client.RpcRetryingCaller: Call exception, tries=13, retries=350,
> started=88590 ms ago, cancelled=false,
> msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1
> is not online on ip-172-31-13-83.us-west-2.compute.internal,8120,1541785618832
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3088)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2198)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2396)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> row 'test,,1541785709452.5ba6596f0050c2dab969d152829227c6.44' on table
> 'hbase:meta' at region=hbase:meta,1.1588230740,
> hostname=ip-172-31-15-225.us-west-2.compute.internal,8120,1541785640586,
> seqNum=0{noformat}
> Clients, in this case YCSB, are hung with part of the keyspace missing:
> {noformat}
> 2018-11-09 17:51:06,033 DEBUG [hconnection-0x5739e567-shared--pool1-t165]
> client.ConnectionManager$HConnectionImplementation: locateRegionInMeta
> parentTable=hbase:meta, metaLocation=, attempt=14 of 35 failed; retrying
> after sleep of 20158 because: No server address listed in hbase:meta for
> region
> test,user307326104267982763,1541785754600.ef90030b05cb02305b75e9bfbc3ee081.
> containing row user3301635648728421323{noformat}
> Balancing is blocked indefinitely because the split transaction is stuck:
> {noformat}
> 2018-11-09 17:49:55,478 DEBUG
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=8100] master.HMaster:
> Not running balancer because 3 region(s) in transition:
> [{ef90030b05cb02305b75e9bfbc3ee081 state=SPLITTING_NEW, ts=1541785754606,
> server=ip-172-31-5-92.us-west-2.compute.internal,8120,1541785626417},
> {5ba6596f0050c2dab969d152829227c6 state=SPLITTING, ts=1541785754606,
> server=ip-172-31-5-92.us-west-2.compute....{noformat}
>