[ 
https://issues.apache.org/jira/browse/HBASE-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jieshan Bean updated HBASE-3989:
--------------------------------

    Attachment: HBASE-3989-RegionServer.patch

> Error occured while RegionServer report to Master "we are up" should get 
> master address again
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3989
>                 URL: https://issues.apache.org/jira/browse/HBASE-3989
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3989-RegionServer.patch
>
>
> I ran into a problem and, after some further analysis, found its cause (the 
> logs are attached at the end of this email).
> Consider the following scenario, which is similar to my problem:
> 1. For some unknown reason, the report to the master fails. The RegionServer 
> retries several times, but every attempt also fails.
> 2. During this time, the standby master becomes the active one. The retry 
> loop below keeps running and can never succeed: the master address has 
> changed, but the RegionServer does not know it and never reads it again.
>    while (!stopped && (masterAddress = getMaster()) == null) {
>       sleeper.sleep();
>       LOG.warn("Unable to get master for initialization");
>     }
>     MapWritable result = null;
>     long lastMsg = 0;
>     while (!stopped) {
>       try {
>         this.requestCount.set(0);
>         lastMsg = System.currentTimeMillis();
>         ZKUtil.setAddressAndWatch(zooKeeper,
>           ZKUtil.joinZNode(zooKeeper.rsZNode, ZKUtil.getNodeName(serverInfo)),
>           this.serverInfo.getServerAddress());
>         this.serverInfo.setLoad(buildServerLoad());
>         LOG.info("Telling master at " + masterAddress + " that we are up");
>         result = this.hbaseMaster.regionServerStartup(this.serverInfo,
>             EnvironmentEdgeManager.currentTimeMillis());
>         break;
>       } catch (RemoteException e) {
>         IOException ioe = e.unwrapRemoteException();
>         if (ioe instanceof ClockOutOfSyncException) {
>           LOG.fatal("Master rejected startup because clock is out of sync",
>               ioe);
>           // Re-throw IOE will cause RS to abort
>           throw ioe;
>         } else {
>           LOG.warn("remote error telling master we are up", e);
>         }
>       } catch (IOException e) {
>         LOG.warn("error telling master we are up", e);
>       } catch (KeeperException e) {
>         LOG.warn("error putting up ephemeral node in zookeeper", e);
>       }
>       sleeper.sleep(lastMsg);
>     }
> Here are the logs:
> 2011-06-13 11:25:12,236 WARN 
> org.apache.hadoop.hbase.regionserver.HRegionServer: error telling master we 
> are up
> java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:419)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
>       at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>       at $Proxy5.regionServerStartup(Unknown Source)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1511)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1479)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>       at java.lang.Thread.run(Thread.java:662)
> 2011-06-13 11:25:15,231 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 
> 157-5-111-22:20000 that we are up
> 2011-06-13 11:25:15,232 WARN 
> org.apache.hadoop.hbase.regionserver.HRegionServer: error telling master we 
> are up
> java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:419)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
>       at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>       at $Proxy5.regionServerStartup(Unknown Source)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1511)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1479)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>       at java.lang.Thread.run(Thread.java:662)
> 2011-06-13 11:25:18,225 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 
> 157-5-111-22:20000 that we are up
> So I think that when this error occurs, we should fetch the master address 
> again. That would solve the problem.
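
The proposed fix can be illustrated with a small, self-contained sketch. This is not the attached patch and does not use HBase classes: ReportForDutySketch, MasterRpc, masterLookup and maxAttempts are hypothetical names chosen for illustration, and the bounded attempt count stands in for the sleep-and-retry loop quoted above. The only point it shows is that the master address is looked up again before every attempt instead of once before the loop.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names come from HBase. */
final class ReportForDutySketch {

  /** Stand-in for the regionServerStartup RPC to the master. */
  interface MasterRpc {
    void reportStartup(String masterAddress) throws IOException;
  }

  /**
   * Tries to report to the master, asking masterLookup for the current
   * address before every attempt so that a failover is noticed.
   * Returns the address that accepted the report.
   */
  static String reportForDuty(Supplier<String> masterLookup,
                              MasterRpc rpc,
                              int maxAttempts) {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      // Key change: resolve the address inside the loop, not once outside it.
      String master = masterLookup.get();
      if (master == null) {
        continue; // no active master known yet; a real loop would sleep here
      }
      try {
        rpc.reportStartup(master);
        return master;
      } catch (IOException e) {
        last = e; // a real loop would log, sleep, and then retry
      }
    }
    throw new IllegalStateException(
        "could not report to a master after " + maxAttempts + " attempts", last);
  }

  public static void main(String[] args) {
    int[] lookups = {0};
    // The first two lookups return the old master, later ones the new master.
    Supplier<String> lookup =
        () -> ++lookups[0] <= 2 ? "old-master:20000" : "new-master:20000";
    // The old master refuses connections, as in the logs above.
    MasterRpc rpc = address -> {
      if (address.equals("old-master:20000")) {
        throw new ConnectException("Connection refused");
      }
    };
    System.out.println("Reported to " + reportForDuty(lookup, rpc, 5));
  }
}
```

With the address resolved inside the loop, the attempt that follows the failover reaches the new master. With the address resolved only once before the loop, every attempt would keep hitting the old address, which matches the repeated "Connection refused" entries in the logs.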

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
