Gregory Chanan created HBASE-6920:
-------------------------------------

             Summary: On timeout connecting to master, client can get stuck and never make progress
                 Key: HBASE-6920
                 URL: https://issues.apache.org/jira/browse/HBASE-6920
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.94.2
            Reporter: Gregory Chanan
            Assignee: Gregory Chanan
            Priority: Critical


HBASE-5058 appears to have introduced an issue where a timeout in 
HConnection.getMaster() can cause the client to never be able to connect to the 
master.  So, for example, an HBaseAdmin object can never successfully be 
initialized.

The issue is here:
{code}
if (tryMaster.isMasterRunning()) {
  this.master = tryMaster;
  this.masterLock.notifyAll();
  break;
}
{code}

If isMasterRunning times out, it throws an UndeclaredThrowableException, which 
is already not ideal, because it can propagate to the application.
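
To illustrate why a timeout could surface as an UndeclaredThrowableException, 
here is a minimal sketch. It assumes the master interface is called through a 
java.lang.reflect.Proxy and that the method declares no checked exceptions; the 
names MasterLike and timingOutProxy are hypothetical and are not HBase code.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.lang.reflect.UndeclaredThrowableException;
import java.net.SocketTimeoutException;

public class UndeclaredTimeoutSketch {
    /** Hypothetical interface: the method declares no checked exceptions. */
    public interface MasterLike {
        boolean isMasterRunning();
    }

    /** Builds a dynamic proxy whose handler always fails with a simulated timeout. */
    static MasterLike timingOutProxy() {
        InvocationHandler handler = (proxy, method, args) -> {
            // Checked exception that isMasterRunning() does not declare.
            throw new SocketTimeoutException("simulated RPC timeout");
        };
        return (MasterLike) Proxy.newProxyInstance(
            MasterLike.class.getClassLoader(),
            new Class<?>[] { MasterLike.class },
            handler);
    }

    /** Calls the proxy and reports the simple class name of the wrapped cause. */
    static String probe() {
        try {
            timingOutProxy().isMasterRunning();
            return "ok";
        } catch (UndeclaredThrowableException e) {
            // The proxy wraps the undeclared checked exception.
            return e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(probe()); // prints SocketTimeoutException
    }
}
```

A caller that only catches IOException would miss this exception, since 
UndeclaredThrowableException is a RuntimeException.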

But if the first call to getMaster times out, it will set masterChecked = true, 
which makes us never try to reconnect; that is, we will set this.master = null 
and just throw MasterNotRunningExceptions, without even trying to connect.
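
The stuck state can be reproduced with a toy model. This is a sketch under 
stated assumptions, not the HConnection implementation: ToyConnection, 
FlakyMaster and the connectAttempts counter are hypothetical, and setting 
masterChecked in a finally block is only a guess at how the flag ends up set 
after a timeout.

```java
import java.lang.reflect.UndeclaredThrowableException;
import java.net.SocketTimeoutException;

public class StuckMasterSketch {
    /** Hypothetical stand-in for the master RPC interface. */
    interface MasterProxy {
        boolean isMasterRunning();
    }

    static class MasterNotRunningException extends Exception {
        MasterNotRunningException(String msg) { super(msg); }
    }

    /** Toy model of the connection state described in this issue. */
    static class ToyConnection {
        private final Object masterLock = new Object();
        private final MasterProxy proxy;
        private MasterProxy master;
        private boolean masterChecked;
        int connectAttempts; // counts how often the master is actually probed

        ToyConnection(MasterProxy proxy) { this.proxy = proxy; }

        MasterProxy getMaster() throws MasterNotRunningException {
            synchronized (masterLock) {
                if (masterChecked) {
                    // Failure mode: once checked, no further connection attempt is made.
                    if (master == null) {
                        throw new MasterNotRunningException("no master, not retrying");
                    }
                    return master;
                }
                try {
                    connectAttempts++;
                    if (proxy.isMasterRunning()) {
                        master = proxy;
                        masterLock.notifyAll();
                    }
                } finally {
                    // Assumption: the flag is set even when the probe throws.
                    masterChecked = true;
                }
                if (master == null) {
                    throw new MasterNotRunningException("master not running");
                }
                return master;
            }
        }
    }

    /** Times out on the first probe only; every later probe would succeed. */
    static class FlakyMaster implements MasterProxy {
        private int calls;
        public boolean isMasterRunning() {
            if (calls++ == 0) {
                throw new UndeclaredThrowableException(
                    new SocketTimeoutException("simulated timeout"));
            }
            return true;
        }
    }

    /** Calls getMaster() three times and records what each call produced. */
    static String[] demo() {
        ToyConnection conn = new ToyConnection(new FlakyMaster());
        String[] outcomes = new String[3];
        for (int i = 0; i < outcomes.length; i++) {
            try {
                conn.getMaster();
                outcomes[i] = "connected";
            } catch (UndeclaredThrowableException e) {
                outcomes[i] = "timeout";
            } catch (MasterNotRunningException e) {
                outcomes[i] = "master-not-running";
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        // prints: timeout, master-not-running, master-not-running
        System.out.println(String.join(", ", demo()));
    }
}
```

In this model the master would answer on the second probe, but the probe is 
never issued again: connectAttempts stays at 1 and every later call fails.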

I tried out a 0.94 client (actually a 0.92 client with some 0.94 patches) on a 
cluster with some network issues, and it would repeatedly get stuck as 
described above.
