[ https://issues.apache.org/jira/browse/HBASE-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923754#action_12923754 ]
Kannan Muthukkaruppan commented on HBASE-3141:
----------------------------------------------
We ran into this today during a shutdown/startup.
In 0.89, the master code does things in this order:
{code}
In the constructor:
(i) this.rpcServer = HBaseRPC.getServer(this, a.getBindAddress()... ) // instantiate the server
(ii) Try to become "primary" master, by writing to zookeeper.
In the run loop:
(iii) startServiceThreads() --> this.rpcServer.start()
{code}
Step (ii) blocked indefinitely because a different master became the primary.
At startup, some Region Servers were nonetheless trying to report in to this
master, apparently because the /hbase/master ZK node from the previous shutdown
hadn't quite expired (?) and still had this master's info.
What if we simply moved (iii) ahead of (ii), i.e. started the rpcServer in the
constructor itself, before blocking on ZK's /hbase/master node?
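A minimal sketch of that reordering, assuming nothing about the real HMaster
code: SketchRpcServer, SketchMaster and StartupOrderSketch are hypothetical
stand-ins, and a CountDownLatch stands in for the blocking wait on ZK's
/hbase/master node. It only shows that the RPC server is already started while
the constructor is still parked in step (ii); it says nothing about how a
non-primary master should answer the calls it then receives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the master's RPC server (not the HBase class).
class SketchRpcServer {
    private final CountDownLatch started = new CountDownLatch(1);

    void start() { started.countDown(); }

    boolean awaitStarted(long timeoutMs) throws InterruptedException {
        return started.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}

// Hypothetical master whose constructor runs (iii) before (ii).
class SketchMaster {
    SketchMaster(SketchRpcServer rpcServer, CountDownLatch primaryElection)
            throws InterruptedException {
        rpcServer.start();        // (iii) start the RPC server first
        primaryElection.await();  // (ii) may block for as long as another master is primary
    }
}

class StartupOrderSketch {
    // Returns true if the RPC server was started while the constructor was still blocked.
    static boolean rpcStartedWhileBlocked() {
        SketchRpcServer server = new SketchRpcServer();
        CountDownLatch election = new CountDownLatch(1);
        Thread ctor = new Thread(() -> {
            try {
                new SketchMaster(server, election);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        ctor.setDaemon(true);
        ctor.start();
        try {
            boolean started = server.awaitStarted(2000);
            boolean stillBlocked = ctor.isAlive(); // election latch not released yet
            election.countDown();                  // let the constructor finish
            ctor.join(2000);
            return started && stillBlocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("RPC server up while master waits: " + rpcStartedWhileBlocked());
    }
}
```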
Todd's fix seems more elaborate -- is that extra state of "accepting calls"
really necessary?
Hairong has also suggested that we add timeouts on the HBaseRPC.getProxy()
calls. See the stack below, where the RS was stuck indefinitely waiting on the
above master.
{code}
"regionserver60020" prio=10 tid=0x00002aaeb4e5d000 nid=0x1cae in Object.wait() [0x000000004264e000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aaab7560fa8> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:732)
    - locked <0x00002aaab7560fa8> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:252)
    at $Proxy0.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:408)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:384)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:431)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:342)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1210)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1227)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:432)
    at java.lang.Thread.run(Thread.java:619)
{code}
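The stack above is parked in Object.wait() with no deadline. As a rough
illustration of the timeout idea, and not the actual HBaseClient code, here is
a sketch of a call object whose wait gives up after a deadline. SketchCall,
awaitResult and RpcTimeoutSketch are hypothetical names, and throwing
SocketTimeoutException is an assumption about how a timeout might be reported.

```java
import java.net.SocketTimeoutException;

// Hypothetical stand-in for an in-flight RPC call (not HBaseClient$Call).
class SketchCall {
    private boolean done = false;
    private Object value;

    // Called by the connection thread when the response arrives.
    synchronized void complete(Object v) {
        value = v;
        done = true;
        notifyAll();
    }

    // Waits for the response, but no longer than timeoutMs milliseconds.
    synchronized Object awaitResult(long timeoutMs)
            throws InterruptedException, SocketTimeoutException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!done) {
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                throw new SocketTimeoutException("no response within " + timeoutMs + " ms");
            }
            wait(remainingMs); // bounded wait; wait(0) would block forever
        }
        return value;
    }
}

class RpcTimeoutSketch {
    // True if a call that never completes gives up with a timeout.
    static boolean timesOut(long timeoutMs) {
        SketchCall call = new SketchCall();
        try {
            call.awaitResult(timeoutMs);
            return false;
        } catch (SocketTimeoutException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Returns the value of a call that completed before the wait began.
    static Object completedValue() {
        SketchCall call = new SketchCall();
        call.complete("v1");
        try {
            return call.awaitResult(50);
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("unanswered call timed out: " + timesOut(100));
        System.out.println("completed call value: " + completedValue());
    }
}
```

With a bound like this, the region server could go back and re-read the
master address instead of waiting forever on a master that never became
primary.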
> Master RPC server needs to be started before an RS can check in
> ---------------------------------------------------------------
>
> Key: HBASE-3141
> URL: https://issues.apache.org/jira/browse/HBASE-3141
> Project: HBase
> Issue Type: Bug
> Components: master
> Reporter: Jonathan Gray
> Priority: Critical
> Fix For: 0.90.0
>
>
> Starting up an RPC server is done in two steps. In the constructor, we
> instantiate the RPC server. Then in startServiceThreads() we start() it.
> If someone RPCs in between the instantiation and the start(), it seems that
> bad things can happen. We need to make sure this can't happen and there
> aren't any races here.