[jira] [Commented] (HELIX-96) ZkBaseDataAccessor.get() hangs during Zookeeper failure

kishore gopalakrishna (JIRA) Sun, 19 May 2013 22:29:57 -0700

    [ https://issues.apache.org/jira/browse/HELIX-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661782#comment-13661782 ]


kishore gopalakrishna commented on HELIX-96:
--------------------------------------------

Hi Ming,

Can you give me more details on your setup?

Do you have only one Zookeeper server in production? If you have more 
than one Zookeeper server, the client will try to connect to another server 
automatically.

What behavior are you expecting if Zookeeper goes down? Most access to ZK via 
Helix should be asynchronous and happen in the background, so it should not 
affect the regular code path of your application. This means that if Zookeeper 
is down, your application can continue without downtime, and once you bring 
Zookeeper back up, everything should continue to function as if nothing happened.

We can change the behavior to throw an exception after a connection timeout if 
we cannot establish a connection to Zookeeper. But it's not clear to me what we 
would do if we cannot connect.
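
To make the timeout idea concrete, here is a rough sketch of how a caller could 
bound a blocking read today, without any change in Helix. It uses only 
java.util.concurrent. BoundedRead and callWithTimeout are hypothetical names, 
not part of Helix or zkclient, and the Callable stands in for whatever read you 
are doing (for example a ZkBaseDataAccessor.get() call). Whether cancelling 
actually unblocks the underlying call depends on that call responding to thread 
interruption, which I am assuming here rather than asserting.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of Helix or zkclient.
final class BoundedRead {
    private BoundedRead() {}

    // Runs the read on a daemon worker thread and waits at most timeoutMs for it.
    // Throws TimeoutException if the read has not completed in time.
    static <T> T callWithTimeout(Callable<T> read, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-read");
            t.setDaemon(true); // a stuck read must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(read);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; the read may or may not honor it
            throw e;
        } catch (ExecutionException e) {
            // Surface the read's own exception instead of the wrapper where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A read that completes in time returns its value.
        System.out.println(callWithTimeout(() -> "value", 1000));

        // A read that blocks past the deadline surfaces as a TimeoutException.
        try {
            callWithTimeout(() -> {
                Thread.sleep(60_000);
                return "never";
            }, 100);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

This only moves the question of what to do when we cannot connect into the 
caller's catch block, so the application still has to decide how to handle it.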

Note that in your code, you can check whether you are connected to ZK by 
invoking manager.isConnected().
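
As a rough illustration of that check, the sketch below skips a read when the 
connection is reported as down. ConnectedGuard and readIfConnected are 
hypothetical names. The BooleanSupplier is a stand-in so the example compiles 
without Helix on the classpath; in real code you would pass 
manager::isConnected. Note that the connection can still drop between the check 
and the read, so this narrows the window for a hang but does not close it.

```java
import java.util.Optional;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of Helix.
final class ConnectedGuard {
    private ConnectedGuard() {}

    // Performs the read only if the connection check passes.
    // Returns Optional.empty() when disconnected or when the read yields null.
    static <T> Optional<T> readIfConnected(BooleanSupplier isConnected, Supplier<T> read) {
        if (!isConnected.getAsBoolean()) {
            return Optional.empty();
        }
        return Optional.ofNullable(read.get());
    }

    public static void main(String[] args) {
        // Stand-ins for manager::isConnected and an accessor read.
        System.out.println(readIfConnected(() -> true, () -> "record").isPresent());
        System.out.println(readIfConnected(() -> false, () -> "record").isPresent());
    }
}
```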

> ZkBaseDataAccessor.get() hangs during Zookeeper failure
> -------------------------------------------------------
>
>                 Key: HELIX-96
>                 URL: https://issues.apache.org/jira/browse/HELIX-96
>             Project: Apache Helix
>          Issue Type: Bug
>          Components: helix-core
>    Affects Versions: 0.6.0-incubating
>            Reporter: Ming Fang
>            Assignee: Shi Lu
>
> During our failure testing with Zookeeper running in standalone mode, we 
> sometimes see our application hanging in the callstack below...
>    java.lang.Thread.State: TIMED_WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x187c1f10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.parkUntil(LockSupport.java:237)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUntil(AbstractQueuedSynchronizer.java:2072)
>       at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:636)
>       at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:619)
>       at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:615)
>       at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:679)
>       at org.apache.helix.manager.zk.ZkClient.readData(ZkClient.java:254)
>       at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
>       at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:315)
>       at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.get(ZkCacheBaseDataAccessor.java:461)
> The comment in ZkClient.java line 677 seems to say that eventually it would 
> get a Disconnected event and then throw an exception, but we waited for many 
> minutes.
> Also we were able to resume by simply restarting Zookeeper.
