[jira] [Commented] (HELIX-96) ZkBaseDataAccessor.get() hangs during Zookeeper failure

kishore gopalakrishna (JIRA) Tue, 21 May 2013 21:27:43 -0700

    [ https://issues.apache.org/jira/browse/HELIX-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663750#comment-13663750 ]


kishore gopalakrishna commented on HELIX-96:
--------------------------------------------

Copying Ming's response from the mailing list.
From Ming:

We're running Zookeeper Standalone, Helix Admin, and Helix Controller all in 
one process called ZAC.
https://github.com/mingfang/apache-helix/blob/master/helix-example/src/main/java/org/apache/helix/examples/ZAC.java

I know it's not ideal to run Zookeeper standalone but it's easier for our ops 
team to manage.
In the event of ZAC failure I expect the cluster to continue to run.
The only difference is that there will be no controller to control the 
cluster, and therefore no HA.
But as long as we restart ZAC, everything should come back to normal.
And most of the time it does.

The problem is that sometimes our nodes hang at the stack trace I sent.
I believe this to be a (serious) bug, and the correct behavior is to throw 
some exception so that I can handle it and move on.
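
As an application-level illustration of the "throw an exception instead of hanging" behavior asked for above, a blocking read can be bounded with a timeout. This is only a sketch under assumptions: BoundedRead and callWithTimeout are hypothetical names, not part of Helix or zkclient, and the code uses nothing beyond java.util.concurrent. Whether the blocked call actually stops when cancelled depends on how the underlying client reacts to thread interruption, which is not verified here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: not part of Helix or zkclient.
final class BoundedRead {
    // Daemon threads so a stuck read cannot keep the JVM alive on shutdown.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-read");
        t.setDaemon(true);
        return t;
    });

    private BoundedRead() {}

    // Runs the call on a worker thread and waits at most timeoutMs for it.
    // Throws TimeoutException if the call has not finished in time.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; this only helps if the blocked call
            // responds to interruption.
            future.cancel(true);
            throw new TimeoutException("call did not complete within " + timeoutMs + " ms");
        } catch (ExecutionException e) {
            // Surface the original failure rather than the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }
}
```

The caller would wrap the blocking read (for example, the ZkBaseDataAccessor.get() call from the stack trace) in a Callable and catch TimeoutException to decide how to proceed. This works around the hang from the outside; it does not change the retry loop inside the client.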

                
> ZkBaseDataAccessor.get() hangs during Zookeeper failure
> -------------------------------------------------------
>
>                 Key: HELIX-96
>                 URL: https://issues.apache.org/jira/browse/HELIX-96
>             Project: Apache Helix
>          Issue Type: Bug
>          Components: helix-core
>    Affects Versions: 0.6.0-incubating
>            Reporter: Ming Fang
>            Assignee: Shi Lu
>
> During our failure testing with Zookeeper running in standalone mode, we 
> sometimes see our application hanging in the callstack below...
>    java.lang.Thread.State: TIMED_WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x187c1f10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.parkUntil(LockSupport.java:237)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUntil(AbstractQueuedSynchronizer.java:2072)
>       at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:636)
>       at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:619)
>       at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:615)
>       at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:679)
>       at org.apache.helix.manager.zk.ZkClient.readData(ZkClient.java:254)
>       at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
>       at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:315)
>       at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.get(ZkCacheBaseDataAccessor.java:461)
> The comment in ZKClient.java line 677 seems to say that eventually it would 
> get a Disconnected event and then throw an exception, but we waited for many 
> minutes.
> Also we were able to resume by simply restarting Zookeeper.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
