[jira] [Commented] (HBASE-3065) Retry all 'retryable' zk operations; e.g. connection loss

ramkrishna.s.vasudevan (JIRA) Fri, 29 Jul 2011 03:37:55 -0700

    [ https://issues.apache.org/jira/browse/HBASE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072769#comment-13072769 ]


ramkrishna.s.vasudevan commented on HBASE-3065:
-----------------------------------------------

Reason for TestSplitLogManager failures
=======================================
Please find the analysis below.
-> RecoverableZooKeeper encodes the node name when creating a node. (Stack has 
already pointed this out.)
-> RecoverableZooKeeper also adds some metadata to the data when writing it:
{noformat}
byte[] newData = appendMetaData(data);
{noformat}
While some of the test cases execute, 
SplitLogManager.getDataSetWatchSuccess() gets invoked.
There we do some state comparisons like
{noformat}
    if (TaskState.TASK_UNASSIGNED.equals(data)) {
      LOG.debug("task not yet acquired " + path + " ver = " + version);
      handleUnassignedTask(path);
    } else if (TaskState.TASK_OWNED.equals(data)) {
      registerHeartbeat(path, version,
          TaskState.TASK_OWNED.getWriterName(data));
    } else if (TaskState.TASK_RESIGNED.equals(data)) {
      LOG.info("task " + path + " entered state " + new String(data));
      resubmit(path, true);
    }
{noformat}

Here the data variable still carries the metadata that was added when the data 
was written, whereas the TaskState value does not.
So every one of these comparisons fails.
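
To illustrate the mismatch, here is a minimal sketch. It is not the actual 
RecoverableZooKeeper code: the "META|" marker, its placement in front of the 
payload, the removeMetaData() helper and the "unassigned" state string are all 
assumptions made up for this example. It only shows that a byte-wise comparison 
against the bare state fails until the metadata is stripped again.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MetadataMismatchSketch {
    // Hypothetical marker; the real metadata layout is not described in this comment.
    static final byte[] META = "META|".getBytes(StandardCharsets.UTF_8);

    // Stand-in for appendMetaData(data): attaches the marker to the payload
    // (in front of it here, which is an assumption of this sketch).
    static byte[] appendMetaData(byte[] data) {
        byte[] out = new byte[META.length + data.length];
        System.arraycopy(META, 0, out, 0, META.length);
        System.arraycopy(data, 0, out, META.length, data.length);
        return out;
    }

    // Hypothetical inverse: returns the payload without the marker,
    // or the input unchanged if the marker is not present.
    static byte[] removeMetaData(byte[] stored) {
        if (stored.length < META.length) {
            return stored;
        }
        for (int i = 0; i < META.length; i++) {
            if (stored[i] != META[i]) {
                return stored;
            }
        }
        return Arrays.copyOfRange(stored, META.length, stored.length);
    }

    public static void main(String[] args) {
        // Hypothetical task state bytes, standing in for a TaskState value.
        byte[] state = "unassigned".getBytes(StandardCharsets.UTF_8);
        byte[] stored = appendMetaData(state);

        // Comparing the raw stored bytes with the bare state fails.
        System.out.println(Arrays.equals(state, stored));                 // false
        // Comparing after stripping the metadata succeeds.
        System.out.println(Arrays.equals(state, removeMetaData(stored))); // true
    }
}
```

In other words, either the reader has to strip the metadata before comparing, 
or the comparison has to be made metadata-aware.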

One more observation: the 'testOrphanTaskAcquisition()' test case needs some 
wait mechanism before proceeding, because the GetDataAsyncCallback call is 
asynchronous.
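
One possible shape for such a wait is sketched below. This is not the 
SplitLogManager test code: getDataAsync() is a made-up stand-in for an 
asynchronous call whose callback fires on another thread, and the result string 
is a placeholder. The point is only that the caller blocks on a latch, with a 
timeout, until the callback has run.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class AsyncWaitSketch {
    // Hypothetical asynchronous call: the "callback" runs on a separate thread,
    // stores its result and then releases the latch.
    static void getDataAsync(AtomicReference<String> result, CountDownLatch done) {
        Thread callbackThread = new Thread(() -> {
            result.set("callback processed");
            done.countDown();
        });
        callbackThread.start();
    }

    // Starts the asynchronous call and waits (bounded) for its callback.
    static String fetchAndWait(long timeoutMs) {
        AtomicReference<String> result = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);
        getDataAsync(result, done);
        try {
            if (!done.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException(
                    "callback did not complete within " + timeoutMs + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return result.get();
    }

    public static void main(String[] args) {
        // Only after the wait returns is it safe to assert on the callback's effect.
        System.out.println(fetchAndWait(5000));
    }
}
```

A polling loop with a deadline would work as well; the latch just avoids 
guessing a sleep interval.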

In RecoverableZooKeeper the javadoc itself says that creating a node should be 
handled carefully.
I have not yet covered all the failures completely, but this is the basic 
reason. I think the hang in the 'testTaskDone()' test case is due to the same 
problem.
I am not fully familiar with the splitlog feature with ZooKeeper; I will try to 
provide an addendum to this.

> Retry all 'retryable' zk operations; e.g. connection loss
> ---------------------------------------------------------
>
>                 Key: HBASE-3065
>                 URL: https://issues.apache.org/jira/browse/HBASE-3065
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: Liyin Tang
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: 3065-v3.txt, 3065-v4.txt, HBase-3065[r1088475]_1.patch, 
> hbase3065_2.patch
>
>
> The 'new' master refactored our zk code, tidying up all zk accesses and 
> corralling them behind nice zk utility classes.  One improvement was letting 
> out all KeeperExceptions, letting the client deal.  That's good generally 
> because in the old days, we'd suppress important zk changes in state.  But 
> there is at least one case the new zk utility could handle for the 
> application and that's the class of retryable KeeperExceptions.  The one that 
> comes to mind is connection loss.  On connection loss we should retry the 
> just-failed operation.  Usually the retry will just work.  At worst, on 
> reconnect, we'll pick up the expired session event. 
> Adding in this change shouldn't be too bad given the refactor of zk corralled 
> all zk access into one or two classes only.
> One thing to consider though is how much we should retry.  We could retry on 
> a timer or we could retry for ever as long as the Stoppable interface is 
> passed so if another thread has stopped or aborted the hosting service, we'll 
> notice and give up trying.  Doing the latter is probably better than some 
> kinda timeout.
> HBASE-3062 adds a timed retry on the first zk operation.  This issue is about 
> generalizing what is over there across all zk access.
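
For reference, the "retry for as long as the service is not stopped" policy 
from the description above could look roughly like the sketch below. It is an 
illustration only, not the patch: the Stoppable interface here is a cut-down 
hypothetical version with a single method, RetryableException is a made-up 
stand-in for a retryable KeeperException such as connection loss, and the fixed 
pause between attempts is an assumption.

```java
import java.util.function.Supplier;

public class RetrySketch {
    // Hypothetical, cut-down stop check (single method only).
    interface Stoppable {
        boolean isStopped();
    }

    // Hypothetical stand-in for a retryable failure such as connection loss.
    static class RetryableException extends RuntimeException {
        RetryableException(String message) {
            super(message);
        }
    }

    // Retries op on RetryableException until it succeeds or the service is stopped.
    static <T> T retryUntilStopped(Supplier<T> op, Stoppable stopper, long pauseMs) {
        while (true) {
            try {
                return op.get();
            } catch (RetryableException e) {
                if (stopper.isStopped()) {
                    throw e; // hosting service was stopped or aborted: give up
                }
                try {
                    Thread.sleep(pauseMs); // assumed fixed pause between attempts
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while retrying", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The operation fails twice with a retryable error, then succeeds.
        String result = retryUntilStopped(() -> {
            if (++calls[0] < 3) {
                throw new RetryableException("connection loss");
            }
            return "ok";
        }, () -> false, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Non-retryable exceptions are deliberately not caught, so they still reach the 
caller as before.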

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
