[ https://issues.apache.org/jira/browse/HBASE-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2614:
-------------------------

    Attachment: 2614.txt

This patch addresses issues seen in the log attached.

In the main, the issue is that a regionserver failed to init completely because 
of:

{code}
java.lang.NullPointerException
    at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:351)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:313)
    at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:860)
    at org.apache.hadoop.ipc.Client.call(Client.java:720)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy7.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:746)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.init(MiniHBaseCluster.java:161)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:427)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.run(MiniHBaseCluster.java:169)
    at java.lang.Thread.run(Thread.java:619)
{code}

The RS was inside its run method trying to get a filesystem instance AFTER it 
had registered itself with zk.

Our test utils wait on the regionserver to set its online flag before letting 
things proceed.  There was no provision for failed startup, so the test harness 
was waiting forever on an online flag that would never be set.  The thread 
dumps were showing this for the testing harness:

{code}
Thread 770 (Thread-692):
  State: TIMED_WAITING
  Blocked count: 134
  Waited count: 1362
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.hbase.util.JVMClusterUtil$RegionServerThread.waitForServerOnline(JVMClusterUtil.java:60)
    org.apache.hadoop.hbase.MiniHBaseCluster.startRegionServer(MiniHBaseCluster.java:227)
    org.apache.hadoop.hbase.master.TestMasterTransitions.testAddingServerBeforeOldIsDead2413(TestMasterTransitions.java:281)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
    org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
    org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    org.junit.internal.runners.statements.FailOnTimeout$1.run(FailOnTimeout.java:28)
Thread 539 (org.apache.hadoop.hdfs.server.datanode.dataxcei...@19789a96):
{code}
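
For context, the wait the harness is stuck in up in JVMClusterUtil is basically 
a sleep loop on the online flag; a rough paraphrase of the pre-patch shape (not 
the exact source, names approximate):

{code}
// Paraphrase of JVMClusterUtil$RegionServerThread.waitForServerOnline
// before this patch: the only exit is the online flag, so a RS that
// dies during init leaves us sleeping here forever.
public void waitForServerOnline() {
  while (!this.regionServer.isOnline()) {
    try {
      Thread.sleep(1000);
    } catch (InterruptedException e) {
      // Keep waiting; nothing here notices a failed startup.
    }
  }
}
{code}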

Now, the RS crashes after it's registered w/ zk, so it goes to abort.  ONLY, in 
the abort we throw an NPE because the presumption is that metrics have been 
initialized, only we hadn't gotten that far in init, so the abort is itself 
aborted; i.e. we don't call the stop method.  So, the RS is sort of still 
alive, alive enough to keep on hosting the zk client, heartbeating away as 
though the RS were still alive.
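
A minimal sketch of what the patch does in abort, assuming metrics is the only 
field abort touches that init may not have set up (see 2614.txt for the real 
change):

{code}
public void abort() {
  this.abortRequested = true;
  // Before the patch this was an unguarded this.metrics.toString();
  // when init died before metrics were created, the NPE here meant
  // stop() below never ran and the RS lingered as a zombie.
  if (this.metrics != null) {
    LOG.info("Dump of metrics: " + this.metrics.toString());
  }
  stop();
}
{code}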

Meanwhile, over on the master, we want to go down because the test says it's 
time for shutdown, only we won't shut down till all regionservers have closed 
and deregistered themselves from the master.  The above RS failed after it 
reported to the master for duty, so the master knows about it, and since the 
zombie RS keeps up its zk lease, the master thinks a RS out there is still 
alive.
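
The master-side wait is roughly this shape (a paraphrase with field names from 
memory, not the actual HMaster source):

{code}
// We block until every registered RS removes itself from the map, so
// a zombie RS that keeps its zk lease holds up shutdown indefinitely.
synchronized (this.serversToServerInfo) {
  while (this.serversToServerInfo.size() > 0) {
    LOG.info("Waiting on regionserver(s) to go down: " +
      this.serversToServerInfo.values());
    try {
      this.serversToServerInfo.wait(this.threadWakeFrequency);
    } catch (InterruptedException e) {
      // Ignore and re-check.
    }
  }
}
{code}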

The patch makes it so the test framework won't wait if the RS has 
shutdownRequested set (which it will when aborting), fixes abort so it makes no 
presumptions about metrics being initialized, and over on the master, prints 
out the servers we're waiting on up in the main loop too, to make this kind of 
thing easier to debug going forward.
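
For the first item, the fixed wait then looks something like the below; again 
a sketch, with the accessor isShutdownRequested illustrative rather than the 
exact name in the patch:

{code}
public void waitForServerOnline() {
  // Also exit if the RS has already decided to go down; an aborting
  // RS sets shutdownRequested and will never set its online flag.
  while (!this.regionServer.isOnline() &&
      !this.regionServer.isShutdownRequested()) {
    try {
      Thread.sleep(1000);
    } catch (InterruptedException e) {
      // Loop around and re-check both flags.
    }
  }
}
{code}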

> killing server in TestMasterTransitions causes NPEs and test deadlock
> ---------------------------------------------------------------------
>
>                 Key: HBASE-2614
>                 URL: https://issues.apache.org/jira/browse/HBASE-2614
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: stack
>             Fix For: 0.21.0
>
>         Attachments: 2614.txt, 
> org.apache.hadoop.hbase.master.TestMasterTransitions-output.txt.gz
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
