[
https://issues.apache.org/jira/browse/HBASE-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468215#comment-13468215
]
Gregory Chanan commented on HBASE-6920:
---------------------------------------
Lars, good question.
There are different failure cases. The basic question is whether reconnecting
to the master requires another call to HBaseRPC.getProxy (e.g. yes, if the
master went down).
This patch only fixes the case that doesn't require another call, e.g. the
single RPC call just timed out, but you didn't lose the connection to the master. If the
master went down and a different master took over then your client is stuck
(AFAIK, I'd need to actually test it). This is the same as before HBASE-5058
went in. I kept it this way because of the comment in HBASE-5058:
bq. The effect is that the current behavior is not changed. I.e. for a managed
connection we try only once
That is, I didn't want to change the existing behavior.
What you suggested seems reasonable -- I can try that out. Should it be a
different patch?
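
To make the two failure cases concrete, here is a rough sketch of the distinction. It is not the HBase client code and not the attached patch: MasterProxy, ProxyFactory and connect() are hypothetical stand-ins (ProxyFactory plays the role of HBaseRPC.getProxy), and treating a SocketTimeoutException as "connection still usable" is an assumption made only for illustration. The first catch branch corresponds to the case the patch handles; the second is the case it does not.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical model of the two failure cases; none of these types exist in HBase.
final class MasterReconnectSketch {

    /** Stand-in for the RPC proxy to the master. */
    interface MasterProxy {
        boolean isMasterRunning() throws IOException;
    }

    /** Stand-in for whatever creates the proxy (HBaseRPC.getProxy in the real client). */
    interface ProxyFactory {
        MasterProxy create() throws IOException;
    }

    /**
     * Tries up to maxAttempts times to reach a running master.
     * A timeout reuses the existing proxy; any other IOException is
     * treated as a lost connection and triggers creation of a new proxy.
     */
    static MasterProxy connect(ProxyFactory factory, int maxAttempts) throws IOException {
        MasterProxy proxy = factory.create();
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (proxy.isMasterRunning()) {
                    return proxy;
                }
            } catch (SocketTimeoutException e) {
                // Case 1: the single call timed out; assume the connection
                // is still usable and retry on the same proxy.
                last = e;
            } catch (IOException e) {
                // Case 2: assume the connection itself is gone (e.g. a different
                // master took over), so a fresh proxy is needed before retrying.
                last = e;
                proxy = factory.create();
            }
        }
        throw new IOException("master not reachable after " + maxAttempts + " attempts", last);
    }
}
```

With a proxy whose first call times out and whose second call succeeds, connect() returns after two calls and creates only one proxy. If the first call fails with a plain IOException instead, a second proxy is created before the retry.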
> On timeout connecting to master, client can get stuck and never make progress
> -----------------------------------------------------------------------------
>
> Key: HBASE-6920
> URL: https://issues.apache.org/jira/browse/HBASE-6920
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.94.2
> Reporter: Gregory Chanan
> Assignee: Gregory Chanan
> Priority: Critical
> Attachments: HBASE-6920.patch, HBASE-6920-v2.patch
>
>
> HBASE-5058 appears to have introduced an issue where a timeout in
> HConnection.getMaster() can cause the client to never be able to connect to
> the master. So, for example, an HBaseAdmin object can never successfully be
> initialized.
> The issue is here:
> {code}
> if (tryMaster.isMasterRunning()) {
>   this.master = tryMaster;
>   this.masterLock.notifyAll();
>   break;
> }
> {code}
> If isMasterRunning times out, it throws an UndeclaredThrowableException,
> which is already not ideal, because it can propagate to the application.
> But if the first call to getMaster succeeds, it will set masterChecked =
> true, which makes us never try to reconnect; that is, we will set this.master
> = null and just throw MasterNotRunningExceptions, without even trying to
> connect.
> I tried out a 94 client (actually a 92 client with some 94 patches) on a
> cluster with some network issues, and it would constantly get stuck as
> described above.