[jira] [Commented] (HBASE-3065) Retry all 'retryable' zk operations; e.g. connection loss

stack (JIRA) Fri, 29 Jul 2011 09:24:39 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072886#comment-13072886
 ]


stack commented on HBASE-3065:
------------------------------

Excellent, Ram.  I'd figured out the first part of your patch last night -- the 
metadata prefix -- but not the second.  Thank you.  Let me build on your patch 
for the rest of the fix.

So, yes, distributed log splitting interacts heavily with zk.  It went in 
after the original 3065 patch was done.  The URL encoding that distributed 
splitting applies to znode names clashes with the suffixes this patch adds to 
file names -- we need to find the places where we use the znode name and make 
sure we URL-decode it.  The second issue is the one you note above: this patch 
adds metadata at the front of the data and we need to strip it when reading in 
a few places.
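
Roughly the shape of the two fixes, as a sketch only.  Everything here is 
made up for illustration: the class name ZnodeDataSketch, the MAGIC marker 
byte, the "marker + 4-byte length + metadata + payload" layout and the example 
path are assumptions, not what the patch actually writes.  Note too that a 
raw payload that happens to start with the marker byte could be misread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper; names and byte layout are assumptions, not HBase code.
final class ZnodeDataSketch {
  // Assumed marker byte meaning "a metadata prefix follows".
  static final byte MAGIC = (byte) 0xFF;
  // 1 marker byte + 4-byte metadata length.
  static final int HEADER_LEN = 5;

  // URL-encode a log path so it can be used as a single znode name.
  static String encodeName(String logPath) {
    try {
      return URLEncoder.encode(logPath, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always available
    }
  }

  // Take the last component of a znode path and URL-decode it.
  static String decodeName(String znodePath) {
    String name = znodePath.substring(znodePath.lastIndexOf('/') + 1);
    try {
      return URLDecoder.decode(name, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e);
    }
  }

  // Write side: marker, metadata length, metadata, then the real payload.
  static byte[] prependMetadata(byte[] metadata, byte[] payload) {
    return ByteBuffer.allocate(HEADER_LEN + metadata.length + payload.length)
        .put(MAGIC).putInt(metadata.length).put(metadata).put(payload).array();
  }

  // Read side: drop the prefix if present, otherwise return data untouched.
  static byte[] stripMetadata(byte[] data) {
    if (data == null || data.length < HEADER_LEN || data[0] != MAGIC) {
      return data;
    }
    int metaLen = ByteBuffer.wrap(data, 1, 4).getInt();
    if (metaLen < 0 || metaLen > data.length - HEADER_LEN) {
      return data; // length field does not fit; treat as unprefixed
    }
    return Arrays.copyOfRange(data, HEADER_LEN + metaLen, data.length);
  }
}
```

Readers would call decodeName() wherever they turn a znode name back into a 
file name, and stripMetadata() wherever they read znode data.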

Let me see how far I get today (I'm gone for a week starting this evening...)

> Retry all 'retryable' zk operations; e.g. connection loss
> ---------------------------------------------------------
>
>                 Key: HBASE-3065
>                 URL: https://issues.apache.org/jira/browse/HBASE-3065
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: Liyin Tang
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: 3065-v3.txt, 3065-v4.txt, HBASE-3065-addendum.patch, 
> HBase-3065[r1088475]_1.patch, hbase3065_2.patch
>
>
> The 'new' master refactored our zk code, tidying up all zk accesses and 
> corralling them behind nice zk utility classes.  One improvement was letting 
> all KeeperExceptions out so the client deals with them.  That's good 
> generally because in the old days we'd suppress important zk state changes.  
> But there is at least one case the new zk utility could handle for the 
> application, and that's the class of retryable KeeperExceptions.  The one 
> that comes to mind is connection loss.  On connection loss we should retry 
> the just-failed operation.  Usually the retry will just work.  At worst, on 
> reconnect, we'll pick up the expired session event. 
> Adding in this change shouldn't be too bad given the refactor of zk corralled 
> all zk access into one or two classes only.
> One thing to consider though is how much we should retry.  We could retry on 
> a timer, or we could retry forever provided a Stoppable is passed in, so that 
> if another thread has stopped or aborted the hosting service, we'll notice 
> and give up trying.  Doing the latter is probably better than some kinda 
> timeout.
> HBASE-3062 adds a timed retry on the first zk operation.  This issue is about 
> generalizing what is over there across all zk access.
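
The "retry until stopped" option described above could look something like 
the sketch below.  It is illustrative only: ZkRetrySketch, retryUntilStopped 
and the pause parameter are invented names, and the nested Stoppable and 
ConnectionLossException are local stand-ins so the example compiles without 
HBase or ZooKeeper on the classpath.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch; not the code from any of the attached patches.
final class ZkRetrySketch {
  // Stand-in for the hosting service's stop/abort flag.
  interface Stoppable {
    boolean isStopped();
  }

  // Stand-in for a retryable zk failure such as connection loss.
  static class ConnectionLossException extends Exception {
    private static final long serialVersionUID = 1L;
  }

  // Run op, retrying on connection loss until it succeeds or the
  // stopper reports that the hosting service has stopped.
  static <T> T retryUntilStopped(Callable<T> op, Stoppable stopper,
      long pauseMillis) throws Exception {
    while (true) {
      try {
        return op.call();
      } catch (ConnectionLossException e) {
        if (stopper.isStopped()) {
          throw e; // service is going down; give up and surface the failure
        }
        Thread.sleep(pauseMillis); // brief pause before the next attempt
      }
    }
  }
}
```

Any other exception passes straight through to the caller, so only the 
retryable class is handled inside the utility.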

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
