[ 
https://issues.apache.org/jira/browse/ACCUMULO-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758361#comment-13758361
 ] 

John Vines commented on ACCUMULO-1449:
--------------------------------------

So, poking through the code, I think the best plan of action is to let the 
KeeperException eventually percolate out to indicate an error. It seems get 
and getChildren are the only methods which use it, so we could just have them 
return null, but I'm concerned about overloading returns like that. I'm 
thinking the least intrusive way to handle this is to switch ZooKeeperInstance 
to use a new cache method which does a get with a timeout and can throw a 
KeeperException. That way the 60-100 or so usages of get and getChildren 
don't need to be updated to handle KeeperExceptions themselves.
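
To make the idea concrete, here is a rough sketch of the kind of bounded 
retry I have in mind. It is not the actual ZooCache code: the class name 
BoundedRetry, the method signature, and the timeout/sleep parameters are all 
made up for illustration, and a generic Callable stands in for the ZooKeeper 
read so the snippet is self-contained. In the real change the caught and 
rethrown type would be KeeperException.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Accumulo: retries an operation until it
// succeeds or a deadline passes, then lets the last failure percolate out.
final class BoundedRetry {

    private BoundedRetry() {}

    static <T> T retry(Callable<T> op, long timeoutMillis, long sleepMillis)
            throws Exception {
        // Deadline on the monotonic clock, so wall-clock changes don't matter.
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Never swallow an interrupt; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                // Assumption: in ZooCache this branch would catch
                // KeeperException only. Once the deadline has passed, stop
                // retrying and surface the error to the caller.
                if (System.nanoTime() - deadline >= 0) {
                    throw e;
                }
            }
            // Brief pause before the next attempt.
            Thread.sleep(sleepMillis);
        }
    }
}
```

ZooKeeperInstance would call something like this with its own timeout, while 
the existing get and getChildren keep their current signatures and retry 
behavior, so their callers are untouched.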
                
> Connector/ZooCache code enters infinite loop when Zookeeper connection lost.
> ----------------------------------------------------------------------------
>
>                 Key: ACCUMULO-1449
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1449
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.5.0
>         Environment: accumulo-1.5.0-RC4, zookeeper-3.4.5, hadoop-1.0.4, 
> CentOS 6.4
>            Reporter: Luke Brassard
>             Fix For: 1.5.1, 1.6.0
>
>
> While using 1.5.0-RC4 a long-lived {{Connector}} went into an infinite loop 
> of Zookeeper "ConnectionLoss" and "Session expired" failures. In a 
> multithreaded application, all using the same {{Connector}}, there were 
> errors whenever there were calls to {{conn.createScanner()}} and 
> {{conn.createBatchScanner()}}. Here are a couple stacktraces:
> {code}
> 2013-05-22 09:12:28,250 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
>       at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
>       at 
> org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
>       at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
>       at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
>       at 
> org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
>       at 
> org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
>       at 
> org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
>       at 
> org.apache.accumulo.core.client.impl.ConnectorImpl.createScanner(ConnectorImpl.java:137)
> {code}    
> {code}    
> 2013-05-22 09:12:23,849 [zookeeper.ZooCache] WARN : Zookeeper error, will 
> retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
>       at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
>       at 
> org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
>       at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
>       at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
>       at 
> org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
>       at 
> org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
>       at 
> org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
>       at 
> org.apache.accumulo.core.client.impl.ConnectorImpl.createBatchScanner(ConnectorImpl.java:89)
> {code}
> The method {{ZooCache.retry(ZooRunnable op)}} (ZooCache.java:128) has a 
> {{while(true)}} loop that should probably have a max retries or timeout that 
> will eventually cause the method to throw an exception that can be handled 
> appropriately by the client. As it is currently, this loop will never be 
> exited when Zookeeper continues to error.
> Note: There may have been a network hiccup that triggered the bug, but there 
> was no way to recover without restarting the application.
