[
https://issues.apache.org/jira/browse/OOZIE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mona Chitnis updated OOZIE-1921:
--------------------------------
Description:
Seeing two types of ConnectionLoss exceptions via Curator when running Oozie
under high load (specifically workflows with ~80 forked actions).
h5. [1] (znode transaction type: delete)
{code}
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
    at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
    at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
{code}
h5. [2]
{code}
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /oozie/locks/0037706-140704041907-oozie-oozi-W
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
{code}
Correlating a particular job between the ZK trace logs (which report NoNode
KeeperExceptions) and the Oozie logs, I found that after hitting the ZooKeeper
exceptions on the 'delete' of its job lock znode, that job never succeeds in
acquiring the lock and proceeding. I am not sure exactly when Oozie, via
Curator, tries to delete znodes; OOZIE-1906 will introduce the Reaper for that.
The exception stacktrace points to this Curator code:
{code}
// Curator's ConnectionState.getZooKeeper() (excerpt)
ZooKeeper getZooKeeper() throws Exception
{
    ...
    boolean localIsConnected = isConnected.get();
    if ( !localIsConnected )
    {
        checkTimeouts();   // throws the CuratorConnectionLossException seen in stacktrace [1]
    }
    ...
}
{code}
isConnected is FALSE here, so the exception is thrown from checkTimeouts().
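For context, the timeouts that checkTimeouts() enforces come from the values the
Curator client was built with, so the client configuration itself is one knob.
A minimal sketch of how such a client is typically constructed (the connect
string, timeout values and retry policy below are placeholder assumptions, not
Oozie's actual settings):
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorClientSketch {
    // Placeholder values for illustration; Oozie's real values come from its configuration.
    static CuratorFramework buildClient() {
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString("zk1:2181,zk2:2181,zk3:2181")        // assumed ZK ensemble
                .sessionTimeoutMs(60000)                            // ZK session timeout
                .connectionTimeoutMs(15000)                         // connection timeout checked by checkTimeouts()
                .retryPolicy(new ExponentialBackoffRetry(1000, 5))  // base sleep 1s, up to 5 retries
                .build();
        client.start();                                             // client must be started before use
        return client;
    }
}
{code}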
I wasn't able to find any good documentation or benchmarks explaining the
timeout issues Curator would face under high load. My suspicion is that Curator
may have limitations in how many concurrent requests for the same lock it can
handle: in this particular stress test, 85 forked actions are all contending
for the same job lock. Hence we should implement some fallback mechanism in
Oozie when invoking the Curator APIs.
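As one illustration of such a fallback (a sketch only, not existing Oozie code:
it assumes the job lock is a Curator InterProcessMutex under
/oozie/locks/<job-id>, and the class name, retry count and timeouts are made up
for the example), bounded retries with backoff around the acquire call, giving
up cleanly so the caller can requeue the command:
{code}
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;

public class LockFallbackSketch {

    private static final int MAX_ATTEMPTS = 3;    // hypothetical tuning knob
    private static final long WAIT_SECONDS = 5;   // per-attempt acquire timeout
    private static final long BACKOFF_MS = 1000;  // base sleep between attempts

    /**
     * Try to acquire the job lock, retrying when Curator reports connection loss.
     * Returns the held mutex, or null if the lock could not be acquired, so that
     * the caller can requeue the command instead of stranding the job.
     */
    public static InterProcessMutex tryAcquireJobLock(CuratorFramework client, String jobId)
            throws InterruptedException {
        InterProcessMutex lock = new InterProcessMutex(client, "/oozie/locks/" + jobId);
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                if (lock.acquire(WAIT_SECONDS, TimeUnit.SECONDS)) {
                    return lock;  // lock held; caller is responsible for release()
                }
                // acquire() timed out without an exception: another holder, retry below
            } catch (InterruptedException ie) {
                throw ie;         // do not swallow interruption
            } catch (Exception e) {
                // e.g. CuratorConnectionLossException / KeeperException.ConnectionLossException;
                // fall through and retry after a short backoff
            }
            Thread.sleep(BACKOFF_MS * attempt);  // simple linear backoff between attempts
        }
        return null;  // give up; the caller requeues or falls back to another mechanism
    }
}
{code}
The point is only that a ConnectionLoss during lock acquisition should not
permanently strand the job, as it does in the trace above.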
was:
Seeing two types of ConnectionLoss exceptions via Curator when running Oozie
under high load
h5. [1] (znode transaction type: delete)
{code}
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
    at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
    at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
{code}
h5. [2]
{code}
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /oozie/locks/0037706-140704041907-oozie-oozi-W
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
{code}
We should probably implement a fallback approach in Oozie when invoking the
Curator library, to handle any inherent limitations, but I was not able to find
much documentation about Curator benchmarks.
> Curator client reports connection loss to ZK under high load
> ------------------------------------------------------------
>
> Key: OOZIE-1921
> URL: https://issues.apache.org/jira/browse/OOZIE-1921
> Project: Oozie
> Issue Type: Bug
> Components: HA
> Affects Versions: trunk
> Reporter: Mona Chitnis
> Fix For: trunk
>
>
> Seeing two types of ConnectionLoss exceptions via Curator when running Oozie
> under high load (specifically workflows with ~80 forked actions).
> h5. [1] (znode transaction type: delete)
> {code}
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
>     at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
>     at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
> {code}
> h5. [2]
> {code}
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /oozie/locks/0037706-140704041907-oozie-oozi-W
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
> {code}
> Correlating a particular job between the ZK trace logs (which report NoNode
> KeeperExceptions) and the Oozie logs, I found that after hitting the ZooKeeper
> exceptions on the 'delete' of its job lock znode, that job never succeeds in
> acquiring the lock and proceeding. I am not sure exactly when Oozie, via
> Curator, tries to delete znodes; OOZIE-1906 will introduce the Reaper for that.
> The exception stacktrace points to this Curator code:
> {code}
> // Curator's ConnectionState.getZooKeeper() (excerpt)
> ZooKeeper getZooKeeper() throws Exception
> {
>     ...
>     boolean localIsConnected = isConnected.get();
>     if ( !localIsConnected )
>     {
>         checkTimeouts();   // throws the CuratorConnectionLossException seen in stacktrace [1]
>     }
>     ...
> }
> {code}
> isConnected is FALSE here, so the exception is thrown from checkTimeouts().
> I wasn't able to find any good documentation or benchmarks explaining the
> timeout issues Curator would face under high load. My suspicion is that
> Curator may have limitations in how many concurrent requests for the same lock
> it can handle: in this particular stress test, 85 forked actions are all
> contending for the same job lock. Hence we should implement some fallback
> mechanism in Oozie when invoking the Curator APIs.
--
This message was sent by Atlassian JIRA
(v6.2#6252)