[jira] [Commented] (HBASE-3989) Error occured while RegionServer report to Master "we are up" should get master address again

Jieshan Bean (JIRA) Thu, 16 Jun 2011 19:38:02 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050871#comment-13050871
 ]


Jieshan Bean commented on HBASE-3989:
-------------------------------------

Thanks stack.

+ What if the reason we failed was NOT because the master went away. Is it ok 
calling getMaster again setting up the rpc proxy though we already have a 
connection?

  Even if the failure was not caused by the master going away, I think it is 
still worth calling getMaster again. (If the problem really is that the master 
address changed, we would otherwise be stuck in an endless loop here.) There 
may be other possible causes, but we cannot be sure whether the master address 
has changed, so checking it again simply avoids this problem. 

+ Do we need this check also over in tryRegionServerReport? Won't it suffer 
same issue?

  tryRegionServerReport already fetches the master address again when the 
report fails, doesn't it?
  (I am not sure whether I have understood your question correctly.)
{noformat}
      try {
        msgs = this.hbaseMaster.regionServerReport(this.serverInfo,
          outboundMessages.toArray(HMsg.EMPTY_HMSG_ARRAY),
          getMostLoadedRegions());
        break;
      } catch (IOException ioe) {
        if (ioe instanceof RemoteException) {
          ioe = ((RemoteException)ioe).unwrapRemoteException();
        }
        if (ioe instanceof YouAreDeadException) {
          // This will be caught and handled as a fatal error in run()
          throw ioe;
        }
        // Couldn't connect to the master, get location from zk and reconnect
        // Method blocks until new master is found or we are stopped
        getMaster();
      }
{noformat}
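
To illustrate the same idea for the startup path, here is a minimal, self-contained sketch of a retry loop that resolves the master again after each failed attempt instead of reusing the cached proxy. This is not the attached patch and not HBase code: MasterRpc, MasterLocator, StartupRetrySketch and the maxAttempts bound are hypothetical names introduced only for this example, and the real loop sleeps between attempts and runs until the server is stopped.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Hypothetical stand-in for the master RPC proxy; not an HBase interface. */
interface MasterRpc {
    String regionServerStartup(String serverName) throws IOException;
}

/** Hypothetical: looks up the currently active master and returns a fresh proxy. */
interface MasterLocator {
    MasterRpc locate();
}

final class StartupRetrySketch {

    /**
     * Reports for duty, resolving the master again after every failed attempt
     * so that a failover to the standby master is noticed.
     * The attempt bound is a simplification made for this sketch.
     */
    static String reportForDuty(MasterLocator locator, String serverName,
                                int maxAttempts) throws IOException {
        MasterRpc master = locator.locate();
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return master.regionServerStartup(serverName);
            } catch (IOException e) {
                last = e;
                // Do not retry against the cached, possibly stale address:
                // look the master up again before the next attempt.
                master = locator.locate();
            }
        }
        throw new IOException("gave up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) throws IOException {
        // Simulated failover: the first lookup returns a dead master,
        // every later lookup returns the newly active one.
        int[] lookups = {0};
        MasterRpc dead = name -> { throw new ConnectException("Connection refused"); };
        MasterRpc active = name -> "ok:" + name;
        MasterLocator locator = () -> lookups[0]++ == 0 ? dead : active;
        System.out.println(reportForDuty(locator, "rs1", 5)); // prints "ok:rs1"
    }
}
```

In the simulated failover, the first attempt fails against the old master, and the second lookup returns the new one, so the second attempt succeeds. Without the lookup in the catch block, every retry would go to the dead address.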

> Error occured while RegionServer report to Master "we are up" should get 
> master address again
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3989
>                 URL: https://issues.apache.org/jira/browse/HBASE-3989
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3989-RegionServer.patch
>
>
> I ran into a problem and, after some further analysis, found the cause (the 
> logs are attached at the end of this email).
> Consider the following scenario, which is similar to my problem:
> 1. For some unclear reason, the report to the master failed. It was retried 
> several times and kept failing.
> 2. During this time, the standby master became the active one. The endless 
> loop keeps running and will never succeed, because the master address has 
> changed but the region server does not know it, and never will.
>     while (!stopped && (masterAddress = getMaster()) == null) {
>       sleeper.sleep();
>       LOG.warn("Unable to get master for initialization");
>     }
>     MapWritable result = null;
>     long lastMsg = 0;
>     while (!stopped) {
>       try {
>         this.requestCount.set(0);
>         lastMsg = System.currentTimeMillis();
>         ZKUtil.setAddressAndWatch(zooKeeper,
>           ZKUtil.joinZNode(zooKeeper.rsZNode, ZKUtil.getNodeName(serverInfo)),
>           this.serverInfo.getServerAddress());
>         this.serverInfo.setLoad(buildServerLoad());
>         LOG.info("Telling master at " + masterAddress + " that we are up");
>         result = this.hbaseMaster.regionServerStartup(this.serverInfo,
>             EnvironmentEdgeManager.currentTimeMillis());
>         break;
>       } catch (RemoteException e) {
>         IOException ioe = e.unwrapRemoteException();
>         if (ioe instanceof ClockOutOfSyncException) {
>           LOG.fatal("Master rejected startup because clock is out of sync",
>               ioe);
>           // Re-throw IOE will cause RS to abort
>           throw ioe;
>         } else {
>           LOG.warn("remote error telling master we are up", e);
>         }
>       } catch (IOException e) {
>         LOG.warn("error telling master we are up", e);
>       } catch (KeeperException e) {
>         LOG.warn("error putting up ephemeral node in zookeeper", e);
>       }
>       sleeper.sleep(lastMsg);
>     }
> Here are the logs:
> 2011-06-13 11:25:12,236 WARN 
> org.apache.hadoop.hbase.regionserver.HRegionServer: error telling master we 
> are up
> java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:419)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
>       at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>       at $Proxy5.regionServerStartup(Unknown Source)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1511)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1479)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>       at java.lang.Thread.run(Thread.java:662)
> 2011-06-13 11:25:15,231 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 
> 157-5-111-22:20000 that we are up
> 2011-06-13 11:25:15,232 WARN 
> org.apache.hadoop.hbase.regionserver.HRegionServer: error telling master we 
> are up
> java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:419)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
>       at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>       at 
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>       at $Proxy5.regionServerStartup(Unknown Source)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1511)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1479)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>       at java.lang.Thread.run(Thread.java:662)
> 2011-06-13 11:25:18,225 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 
> 157-5-111-22:20000 that we are up
> So I think that when the error occurs, we should fetch the master address 
> again. That would solve this problem.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
