[ 
https://issues.apache.org/jira/browse/HBASE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994960#comment-12994960
 ] 

Cosmin Lehene commented on HBASE-3065:
--------------------------------------

We started getting tons of 
{code}org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for /hbase{code} both in our MR jobs and in 
our "unit" tests that use the MiniHBaseCluster after upgrading from 0.89 to 
0.90.





> Retry all 'retryable' zk operations; e.g. connection loss
> ---------------------------------------------------------
>
>                 Key: HBASE-3065
>                 URL: https://issues.apache.org/jira/browse/HBASE-3065
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.92.0
>
>
> The 'new' master refactored our zk code tidying up all zk accesses and 
> coralling them behind nice zk utility classes.  One improvement was letting 
> out all KeeperExceptions letting the client deal.  Thats good generally 
> because in old days, we'd suppress important state zk changes in state.  But 
> there is at least one case the new zk utility could handle for the 
> application and thats the class of retryable KeeperExceptions.  The one that 
> comes to mind is conection loss.  On connection loss we should retry the 
> just-failed operation.  Usually the retry will just work.  At worse, on 
> reconnect, we'll pick up the expired session event. 
> Adding in this change shouldn't be too bad given the refactor of zk corralled 
> all zk access into one or two classes only.
> One thing to consider though is how much we should retry.  We could retry on 
> a timer or we could retry for ever as long as the Stoppable interface is 
> passed so if another thread has stopped or aborted the hosting service, we'll 
> notice and give up trying.  Doing the latter is probably better than some 
> kinda timeout.
> HBASE-3062 adds a timed retry on the first zk operation.  This issue is about 
> generalizing what is over there across all zk access.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to