[ 
https://issues.apache.org/jira/browse/OOZIE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056951#comment-14056951
 ] 

Purshotam Shah commented on OOZIE-1921:
---------------------------------------

Similar issue: https://issues.apache.org/jira/browse/BLUR-330

> Curator client reports connection loss to ZK under high load
> ------------------------------------------------------------
>
>                 Key: OOZIE-1921
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1921
>             Project: Oozie
>          Issue Type: Bug
>          Components: HA
>    Affects Versions: trunk
>            Reporter: Mona Chitnis
>             Fix For: trunk
>
>
> Seeing two types of Connection Loss exceptions via Curator when running Oozie 
> in high load (specifically workflows with ~80 forked actions)
> h5. [1] (znode transaction type: delete)
> {code}
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
> ConnectionLoss
>         at 
> org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
>         at 
> org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
>         at 
> org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
> {code}
> h5. [2]
> {code}
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for
> /oozie/locks/0037706-140704041907-oozie-oozi-W
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
> {code}
> Tracking a particular job across the ZK trace logs (which report NoNode 
> KeeperExceptions) and the Oozie logs shows that after hitting the ZooKeeper 
> exceptions on the 'delete' of its job
> lock znode, that job never succeeds in acquiring the lock and proceeding.
> It is not clear when Oozie, via Curator, tries to delete znodes. 
> OOZIE-1906 will introduce the Reaper.
> The exception stacktrace points to this Curator code:
> {code}
> ConnectionState.getZooKeeper() {
>     ...
>     boolean localIsConnected = isConnected.get();
>     if ( !localIsConnected )
>     {
>         checkTimeouts();
>     }
>     ...
> }
> {code}
> isConnected is FALSE, so the exception is thrown from checkTimeouts(). 
> I wasn't able to find any good docs or benchmarks explaining the timeout 
> issues Curator would face under high load. My suspicion is that Curator may 
> have limitations on how many concurrent requests for the same lock it can 
> handle. In this particular stress test, 85 forked actions all contend for the 
> same job lock. Hence we should implement some fallback mechanism in Oozie 
> when invoking Curator APIs.
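
The fallback mechanism suggested at the end of the description could take the shape of an application-level retry around the Curator call, on top of Curator's own connection-level retry policies (such as ExponentialBackoffRetry). The sketch below is illustrative only; the ZkRetry class, its parameters, and the backoff choice are assumptions, not existing Oozie or Curator code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical application-level fallback: retry an operation that may fail
 * transiently (e.g. with a Curator/ZooKeeper connection-loss exception),
 * backing off exponentially between attempts.
 */
public class ZkRetry {

    public static <T> T withRetries(Callable<T> op, int maxAttempts, long baseSleepMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // back off: baseSleepMs, 2x, 4x, ... before the next attempt
                    TimeUnit.MILLISECONDS.sleep(baseSleepMs << (attempt - 1));
                }
            }
        }
        throw last;  // all attempts failed; surface the last error
    }
}
```

In the scenario above, such a wrapper would surround the lock acquisition (e.g. an InterProcessMutex.acquire() on /oozie/locks/...), so that a CuratorConnectionLossException during a load spike leads to a delayed re-attempt instead of the job permanently failing to get its lock.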



--
This message was sent by Atlassian JIRA
(v6.2#6252)
