We're running Zookeeper standalone, Helix Admin, and Helix Controller all in one 
process called ZAC.
https://github.com/mingfang/apache-helix/blob/master/helix-example/src/main/java/org/apache/helix/examples/ZAC.java

I know it's not ideal to run Zookeeper standalone but it's easier for our ops 
team to manage.
In the event of a ZAC failure I expect the cluster to continue to run.
The only difference is that there will be no controller managing the cluster, 
and therefore no HA.
But as long as we restart ZAC, everything should come back to normal.
And most of the time it does.

The problem is that sometimes our nodes hang in the stack trace I sent.
I believe this to be a (serious) bug, and the correct behavior is to throw an 
exception so that I can handle it and move on.
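
For illustration, here is a minimal caller-side sketch of the behavior described above: a blocking read is bounded by a timeout, so a hang surfaces as an exception that can be handled. The names BoundedRead and readWithTimeout are hypothetical and are not part of Helix or ZkClient. The Callable stands in for a call such as ZkBaseDataAccessor.get(), and the sketch assumes the underlying call responds to thread interruption when it is cancelled.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedRead {
    // Daemon threads, so a read that stays stuck cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-read");
        t.setDaemon(true);
        return t;
    });

    // Runs the (hypothetical) blocking read on a worker thread and waits at
    // most timeoutMs for its result; throws TimeoutException otherwise.
    public static <T> T readWithTimeout(Callable<T> read, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(read);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ask the worker to stop by interrupting it, then report the timeout.
            future.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // A read that returns promptly.
        String value = readWithTimeout(() -> "value", 1000);
        System.out.println(value);

        // A read that blocks far longer than the caller is willing to wait.
        try {
            readWithTimeout(() -> {
                Thread.sleep(60_000);
                return "never";
            }, 200);
        } catch (TimeoutException e) {
            System.out.println("read timed out; handling it and moving on");
        }
    }
}
```

This only limits how long the caller waits. It does not change how the client library itself retries the connection.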

On May 20, 2013, at 1:29 AM, kishore gopalakrishna (JIRA) <[email protected]> 
wrote:

> 
>    [ 
> https://issues.apache.org/jira/browse/HELIX-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661782#comment-13661782
>  ] 
> 
> kishore gopalakrishna commented on HELIX-96:
> --------------------------------------------
> 
> Hi Ming,
> 
> Can you give me more details on your setup?
> 
> Is it that you have only one zookeeper server in production? If you have 
> more than one zookeeper server, the client will try to connect to another 
> server automatically.
> 
> What behavior are you expecting if Zookeeper goes down? Most access to ZK via 
> Helix should be asynchronous/background, it should not impact the regular 
> code path of your application. What this means is if Zookeeper is down, your 
> application can continue without downtime, and once you bring the zookeeper 
> back up, everything should continue to function as if nothing happened. 
> 
> We can change the behavior to throw an exception after a connection timeout if 
> we cannot establish a connection to zookeeper. But it's not clear to me what 
> we would do if we cannot connect.
> 
> Note in your code, you can know if you are connected to zk by invoking 
> manager.isConnected()
> 
> 
>> ZkBaseDataAccessor.get() hangs during Zookeeper failure
>> -------------------------------------------------------
>> 
>>                Key: HELIX-96
>>                URL: https://issues.apache.org/jira/browse/HELIX-96
>>            Project: Apache Helix
>>         Issue Type: Bug
>>         Components: helix-core
>>   Affects Versions: 0.6.0-incubating
>>           Reporter: Ming Fang
>>           Assignee: Shi Lu
>> 
>> During our failure testing with Zookeeper running in standalone mode, we 
>> sometimes see our application hanging in the callstack below...
>>   java.lang.Thread.State: TIMED_WAITING (parking)
>>      at sun.misc.Unsafe.park(Native Method)
>>      - parking to wait for  <0x187c1f10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>      at java.util.concurrent.locks.LockSupport.parkUntil(LockSupport.java:237)
>>      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUntil(AbstractQueuedSynchronizer.java:2072)
>>      at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:636)
>>      at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:619)
>>      at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:615)
>>      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:679)
>>      at org.apache.helix.manager.zk.ZkClient.readData(ZkClient.java:254)
>>      at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
>>      at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:315)
>>      at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.get(ZkCacheBaseDataAccessor.java:461)
>> The comment in ZKClient.java line 677 seems to say that eventually it would 
>> get a Disconnected event and then throw an exception, but we waited for many 
>> minutes.
>> Also we were able to resume by simply restarting Zookeeper.
> 
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
