[ 
https://issues.apache.org/jira/browse/HBASE-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821930#comment-16821930
 ] 

Duo Zhang commented on HBASE-22236:
-----------------------------------

OK, the problem is in this method:
{code}
  static boolean canUpdateOnError(HRegionLocation loc, HRegionLocation oldLoc) {
    // Do not need to update if no such location, or the location is newer, or the location is not
    // the same with us
    return oldLoc != null && oldLoc.getSeqNum() <= loc.getSeqNum() &&
      oldLoc.getServerName().equals(loc.getServerName());
  }
{code}

oldLoc.getServerName() returns null, so we get an NPE. The log below shows
that the old location's server name is null (hostname=null):

{noformat}
2019-04-18 16:54:05,605 DEBUG [Default-IPC-NioEventLoopGroup-8-5] client.AsyncRegionLocatorHelper(59): Try updating region=async,111,1555606423724.4b28e02c280866c0ac63dc1f20e9c274., hostname=asf904.gq1.ygridcore.net,34751,1555606417384, seqNum=9 , the old value is region=async,111,1555606444785.9f87a8c0763028897001a6b574f9bcd5., hostname=null, seqNum=1, error=org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: async,111,1555606423724.4b28e02c280866c0ac63dc1f20e9c274. is not online on asf904.gq1.ygridcore.net,34751,1555606417384
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3363)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3340)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1441)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2523)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
{noformat}

It can be fixed by adding a null check. But first I want to find out why we
can cache an HRegionLocation with a null server name...
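
For illustration only, the null check mentioned above might look roughly like the sketch below. This is not the attached patch. The Loc class is a hypothetical stand-in for HRegionLocation (with a plain String in place of ServerName) so that the snippet compiles on its own, and treating a null server name as "do not update" is an assumption.

```java
public class NullSafeUpdateCheck {

  // Hypothetical stand-in for HRegionLocation; the server name may be null,
  // as seen in the "hostname=null" log entry.
  static final class Loc {
    private final String serverName;
    private final long seqNum;

    Loc(String serverName, long seqNum) {
      this.serverName = serverName;
      this.seqNum = seqNum;
    }

    String getServerName() {
      return serverName;
    }

    long getSeqNum() {
      return seqNum;
    }
  }

  // Same conditions as the original method, plus a guard so that a cached
  // location without a server name returns false instead of throwing an NPE.
  static boolean canUpdateOnError(Loc loc, Loc oldLoc) {
    return oldLoc != null && oldLoc.getServerName() != null &&
      oldLoc.getSeqNum() <= loc.getSeqNum() &&
      oldLoc.getServerName().equals(loc.getServerName());
  }

  public static void main(String[] args) {
    Loc failed = new Loc("asf904.gq1.ygridcore.net,34751,1555606417384", 9);
    // Cached entry with no server name: no exception is thrown.
    System.out.println(canUpdateOnError(failed, new Loc(null, 1)));
    // Cached entry on the same server with an older seqNum.
    System.out.println(canUpdateOnError(failed,
      new Loc("asf904.gq1.ygridcore.net,34751,1555606417384", 1)));
  }
}
```

Such a guard only avoids the exception; it does not explain how an entry without a server name got into the cache in the first place.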

> TestAsyncTableGetMultiThreaded sometimes timed out
> --------------------------------------------------
>
>                 Key: HBASE-22236
>                 URL: https://issues.apache.org/jira/browse/HBASE-22236
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>         Attachments: HBASE-22236.patch
>
>
> https://builds.apache.org/job/HBase-Flaky-Tests/job/master/2992/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreaded-output.txt/*view*/
> After this line
> {noformat}
> 2019-04-14 04:44:41,736 INFO  [PEWorker-12] 
> procedure2.ProcedureExecutor(1410): Finished pid=117, state=SUCCESS, 
> hasLock=false; TransitRegionStateProcedure table=hbase:meta, 
> region=1588230740, REOPEN/MOVE in 2.0690sec
> {noformat}
> It seems we just do nothing until the test times out.
> Also, there is no main thread in the dump of hanging threads, which is a bit
> strange, although all the get threads are hanging there.
> Let me add some logs for better debugging first.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
