[
https://issues.apache.org/jira/browse/HELIX-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661782#comment-13661782
]
kishore gopalakrishna commented on HELIX-96:
--------------------------------------------
Hi Ming,
Can you give me more details on your setup?
Do you have only one ZooKeeper server in production? If you have more
than one ZooKeeper server, the client will try to connect to another server
automatically.
What behavior are you expecting if ZooKeeper goes down? Most access to ZK via
Helix should be asynchronous/background, so it should not impact the regular
code path of your application. This means that if ZooKeeper is down, your
application can continue without downtime, and once you bring ZooKeeper
back up, everything should continue to function as if nothing happened.
We can change the behavior to throw an exception after a connection timeout if
we cannot establish a connection to ZooKeeper. But it's not clear to me what
we would do if we cannot connect.
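In the meantime, a caller can bound the wait on its own side. The following is
only a rough sketch of that idea, not anything Helix provides: BoundedRead and
readWithTimeout are hypothetical names, and the blocking read is passed in as a
plain Callable so the sketch depends only on the JDK.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Helix): runs a blocking read on a worker
// thread and stops waiting for it after a deadline.
final class BoundedRead {
    // Daemon threads, so a read that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-read");
        t.setDaemon(true);
        return t;
    });

    private BoundedRead() {}

    // Returns the read's result, or throws TimeoutException after timeoutMs.
    static <T> T readWithTimeout(Callable<T> read, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(read);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Request interruption of the worker; the caller is released
            // either way, even if the read itself ignores the interrupt.
            future.cancel(true);
            throw e;
        }
    }
}
```

A call would look something like
BoundedRead.readWithTimeout(() -> accessor.get(path, null, 0), 5000), where
accessor, path, and the 5 second limit are placeholders. Whether the
underlying ZK read actually stops when interrupted is an assumption I have not
verified; if it does not, the worker thread stays parked until ZooKeeper
returns, but the calling thread gets its exception.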
Note that in your code, you can check whether you are connected to ZK by
invoking manager.isConnected().
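As a rough illustration of that check, here is a small guard that skips the
read when the connection is down and returns a fallback instead. ConnectedGuard
and readIfConnected are hypothetical names, and the connection check and the
read are passed in as functional interfaces so the sketch compiles without
Helix on the classpath.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical helper (not part of Helix): only attempts the read when the
// supplied connection check reports true.
final class ConnectedGuard {
    private ConnectedGuard() {}

    // Returns read.get() if connected, otherwise the fallback value.
    // The read is not invoked at all when the check fails.
    static <T> T readIfConnected(BooleanSupplier isConnected, Supplier<T> read, T fallback) {
        return isConnected.getAsBoolean() ? read.get() : fallback;
    }
}
```

With a HelixManager this could be called as
ConnectedGuard.readIfConnected(manager::isConnected, () -> accessor.get(path, null, 0), lastKnownValue),
where accessor, path, and lastKnownValue are placeholders. The check is only
advisory: the connection can still drop between the check and the read, so it
narrows the window for a hang rather than closing it.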
> ZkBaseDataAccessor.get() hangs during Zookeeper failure
> -------------------------------------------------------
>
> Key: HELIX-96
> URL: https://issues.apache.org/jira/browse/HELIX-96
> Project: Apache Helix
> Issue Type: Bug
> Components: helix-core
> Affects Versions: 0.6.0-incubating
> Reporter: Ming Fang
> Assignee: Shi Lu
>
> During our failure testing with Zookeeper running in standalone mode, we
> sometimes see our application hanging in the callstack below...
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x187c1f10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.parkUntil(LockSupport.java:237)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUntil(AbstractQueuedSynchronizer.java:2072)
> at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:636)
> at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:619)
> at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:615)
> at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:679)
> at org.apache.helix.manager.zk.ZkClient.readData(ZkClient.java:254)
> at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
> at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:315)
> at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.get(ZkCacheBaseDataAccessor.java:461)
> The comment in ZKClient.java line 677 seems to say that eventually it would
> get a Disconnected event and then throw an exception, but we waited for many
> minutes.
> Also we were able to resume by simply restarting Zookeeper.