[
https://issues.apache.org/jira/browse/HBASE-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-2614:
-------------------------
Attachment: 2614.txt
This patch addresses issues seen in the log attached.
In the main, the issue is that a regionserver failed to init completely because
of:
{code}
java.lang.NullPointerException
at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:351)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:313)
at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:860)
at org.apache.hadoop.ipc.Client.call(Client.java:720)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy7.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:746)
at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.init(MiniHBaseCluster.java:161)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:427)
at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.run(MiniHBaseCluster.java:169)
at java.lang.Thread.run(Thread.java:619)
{code}
The regionserver was inside its run method, trying to get a filesystem instance
AFTER it had registered itself with zk.
Our test utils wait on the regionserver to set its online flag before letting
things proceed. There was no provision for failed startup, so the test harness
was waiting forever on an online flag that would never be set. The thread dumps
were showing this for the test harness:
{code}
Thread 770 (Thread-692):
State: TIMED_WAITING
Blocked count: 134
Waited count: 1362
Stack:
java.lang.Thread.sleep(Native Method)
org.apache.hadoop.hbase.util.JVMClusterUtil$RegionServerThread.waitForServerOnline(JVMClusterUtil.java:60)
org.apache.hadoop.hbase.MiniHBaseCluster.startRegionServer(MiniHBaseCluster.java:227)
org.apache.hadoop.hbase.master.TestMasterTransitions.testAddingServerBeforeOldIsDead2413(TestMasterTransitions.java:281)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
org.junit.internal.runners.statements.FailOnTimeout$1.run(FailOnTimeout.java:28)
Thread 539 (org.apache.hadoop.hdfs.server.datanode.dataxcei...@19789a96):
{code}
Now, the RS crashes after it has registered w/ zk, so it goes to abort. ONLY, in
the abort, we throw an NPE because the presumption is that metrics have been
initialized, only we haven't got that far in init, so the abort is aborted;
i.e. we don't call the stop method. So, the RS is sort of still alive, alive
enough to keep on hosting the zk client polling master as though the RS were
still alive.
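To make the "aborted abort" concrete, here is a minimal sketch, not the actual HRegionServer code. The class names `MetricsSketch` and `RegionServerSketch` and their methods are hypothetical stand-ins; the only point is that abort checks for a missing metrics object, so stop is always reached.

```java
// Hypothetical stand-ins for illustration only; not HBase classes.
class MetricsSketch {
    String dump() { return "requests=0"; }
}

class RegionServerSketch {
    // Stays null when init fails before metrics are created.
    private MetricsSketch metrics;
    private volatile boolean stopped = false;

    void initMetrics() { this.metrics = new MetricsSketch(); }

    void abort(String reason) {
        // Guard: abort may run before metrics exist. Dereferencing
        // metrics unconditionally here would throw an NPE and skip stop().
        String dump = (metrics == null) ? "metrics not initialized" : metrics.dump();
        System.err.println("Aborting: " + reason + " (" + dump + ")");
        stop(); // always reached, so the server cannot linger half-alive
    }

    void stop() { stopped = true; }

    boolean isStopped() { return stopped; }
}
```

Calling `abort` on an instance that never ran `initMetrics` still ends with `isStopped()` returning true, which is the property the zombie RS lacked.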
Meanwhile over on the master, we want to go out because the test says it's time
for shutdown, only we won't shut down till all regionservers have closed and
deregistered themselves from the master. The above RS failed after it reported
to the master for duty, so the master knows about it, and the RS is in a zombie
state that keeps up its zk lease, so the master thinks an RS out there is still
alive.
Patch makes it so the test framework won't wait if the RS has shutdownRequested
set (which it will when aborting), it fixes abort so it makes no presumptions
about metrics being initialized, and over on master, we'll print out the servers
we're waiting on up in the main loop too, to make this kind of thing easier to
debug going forward.
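A rough sketch of the test-harness side of the fix, assuming a polling wait like the one in the thread dump above. `OnlineWaiter` and its parameters are hypothetical, not the JVMClusterUtil signature; the two suppliers stand in for the regionserver's online flag and its shutdownRequested flag.

```java
// Hypothetical helper for illustration only; not the JVMClusterUtil API.
class OnlineWaiter {
    /**
     * Polls until the server reports online.
     * Returns true if it came online, false if it requested shutdown
     * (or the waiting thread was interrupted) first.
     */
    static boolean waitForServerOnline(java.util.function.BooleanSupplier isOnline,
                                       java.util.function.BooleanSupplier isShutdownRequested,
                                       long pollMillis) {
        while (!isOnline.getAsBoolean()) {
            // Without this check, a failed startup leaves the caller
            // waiting forever on a flag that will never be set.
            if (isShutdownRequested.getAsBoolean()) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true;
    }
}
```

Returning a boolean rather than silently falling through lets the caller tell a started server from an aborted one; whether the real patch reports this to the test is not stated here.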
> killing server in TestMasterTransitions causes NPEs and test deadlock
> ---------------------------------------------------------------------
>
> Key: HBASE-2614
> URL: https://issues.apache.org/jira/browse/HBASE-2614
> Project: HBase
> Issue Type: Bug
> Reporter: Andrew Purtell
> Assignee: stack
> Fix For: 0.21.0
>
> Attachments: 2614.txt,
> org.apache.hadoop.hbase.master.TestMasterTransitions-output.txt.gz
>
>