[ https://issues.apache.org/jira/browse/OOZIE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056951#comment-14056951 ]
Purshotam Shah commented on OOZIE-1921:
---------------------------------------
Similar issue: https://issues.apache.org/jira/browse/BLUR-330
> Curator client reports connection loss to ZK under high load
> ------------------------------------------------------------
>
> Key: OOZIE-1921
> URL: https://issues.apache.org/jira/browse/OOZIE-1921
> Project: Oozie
> Issue Type: Bug
> Components: HA
> Affects Versions: trunk
> Reporter: Mona Chitnis
> Fix For: trunk
>
>
> Seeing two types of Connection Loss exceptions via Curator when running Oozie
> under high load (specifically, workflows with ~80 forked actions)
> h5. [1] (znode transaction type: delete)
> {code}
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode =
> ConnectionLoss
> at
> org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
> at
> org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
> at
> org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
> {code}
> h5. [2]
> {code}
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for
> /oozie/locks/0037706-140704041907-oozie-oozi-W
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
> {code}
> Tracing a particular job across the ZK trace logs (which report NoNode
> KeeperExceptions) and the Oozie logs, I found that after the ZooKeeper
> exceptions on 'delete' of the job's lock znode, that particular job never
> succeeds in acquiring the lock and proceeding.
> I'm not that familiar with when Oozie, via Curator, tries to delete znodes.
> OOZIE-1906 will introduce the Reaper.
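> For context, below is a minimal sketch of how a per-job lock is typically
> acquired through Curator. The InterProcessMutex recipe, the
> /oozie/locks/<job-id> path, and the timeout are illustrative assumptions,
> not necessarily the exact recipe Oozie uses:
> {code}
> import java.util.concurrent.TimeUnit;
>
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.recipes.locks.InterProcessMutex;
>
> public class JobLockSketch {
>     // Hypothetical helper: acquire the per-job lock znode, run the work, release.
>     static void withJobLock(CuratorFramework client, String jobId, Runnable work)
>             throws Exception {
>         InterProcessMutex lock = new InterProcessMutex(client, "/oozie/locks/" + jobId);
>         // Bounded wait: under heavy contention this returns false instead of blocking forever.
>         if (lock.acquire(30, TimeUnit.SECONDS)) {
>             try {
>                 work.run();
>             } finally {
>                 lock.release();
>             }
>         }
>     }
> }
> {code}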
> The exception stacktrace points to this Curator code:
> {code}
> ConnectionState.getZooKeeper() {
>     ...
>     boolean localIsConnected = isConnected.get();
>     if ( !localIsConnected )
>     {
>         checkTimeouts();
>     }
>     ...
> }
> {code}
> isConnected is FALSE, so the exception is thrown from checkTimeouts().
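> checkTimeouts() compares how long the client has been waiting for a
> connection against the configured connection/session timeouts, so those
> settings matter under load. A minimal sketch of building a Curator client
> with explicit timeouts and a retry policy is below; the connect string and
> the specific values are illustrative assumptions, not Oozie's actual
> configuration:
> {code}
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.retry.ExponentialBackoffRetry;
>
> public class CuratorClientSketch {
>     static CuratorFramework newClient() {
>         CuratorFramework client = CuratorFrameworkFactory.builder()
>                 .connectString("zk1:2181,zk2:2181,zk3:2181") // illustrative ZK quorum
>                 .sessionTimeoutMs(60000)                      // ZK session timeout
>                 .connectionTimeoutMs(15000)                   // timeout that checkTimeouts() enforces
>                 .retryPolicy(new ExponentialBackoffRetry(1000, 3))
>                 .build();
>         client.start();
>         return client;
>     }
> }
> {code}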
> I wasn't able to find any good docs or benchmarks explaining the timeout
> issues Curator would face under high load. My suspicion is that Curator may
> have limitations in how many concurrent requests for the same lock it can
> handle. In this particular stress test, there are 85 forked actions all
> contending for the same job lock. Hence we should implement some fallback
> mechanism in Oozie when invoking Curator APIs; one possibility is sketched below.
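> This is a minimal sketch of what such a fallback could look like, assuming a
> hypothetical retry wrapper around Curator's InterProcessMutex (the retry
> count, back-off, and lock path are illustrative assumptions, not an agreed
> design):
> {code}
> import java.util.concurrent.TimeUnit;
>
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.recipes.locks.InterProcessMutex;
>
> public class LockFallbackSketch {
>     // Hypothetical fallback: retry the acquire a few times with an increasing pause,
>     // and report a clear failure instead of leaving the job stuck without a lock.
>     static boolean acquireWithRetry(CuratorFramework client, String lockPath)
>             throws Exception {
>         InterProcessMutex lock = new InterProcessMutex(client, lockPath);
>         for (int attempt = 1; attempt <= 3; attempt++) {
>             try {
>                 if (lock.acquire(10, TimeUnit.SECONDS)) {
>                     return true;
>                 }
>             } catch (Exception e) {
>                 // e.g. ConnectionLoss from Curator; fall through and retry after a pause
>             }
>             if (attempt < 3) {
>                 Thread.sleep(1000L * attempt); // simple back-off between attempts
>             }
>         }
>         return false; // caller can fail the action cleanly or requeue it
>     }
> }
> {code}
> Whether retries like this belong in Oozie or should be delegated to
> Curator's retry policy is open for discussion.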