[ 
https://issues.apache.org/jira/browse/SLIDER-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273544#comment-14273544
 ] 

Steve Loughran commented on SLIDER-748:
---------------------------------------

Logs show the AM spinning while it waits for ZK to come up:
{code}
steners registered.
2015-01-12 12:12:20,794 [Thread-13] DEBUG agent.HeartbeatMonitor (HeartbeatMonitor.java:run(65)) - Putting monitor to sleep for 60000 milliseconds
2015-01-12 12:12:26,538 [CuratorFramework-0] ERROR curator.ConnectionState (ConnectionState.java:checkTimeouts(201)) - Connection timed out for connection string (localhost:65385) and timeout (15000) / elapsed (15607)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
        at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:113)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:763)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:749)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:56)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl$3.call(CuratorFrameworkImpl.java:244)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2015-01-12 12:12:28,543 [CuratorFramework-0] ERROR curator.ConnectionState (ConnectionState.java:checkTimeouts(201)) - Connection timed out for connection string (localhost:65385) and timeout (15000) / elapsed (17613)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
{code}

Possible root causes:
# ZK isn't coming up
# the ZK connection configuration passed to the AM is wrong.
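
One way to start separating these two causes is to check whether anything is listening at the address in the connection string. The following is a minimal sketch using only the JDK; the class name {{ZkPortProbe}} and the 2000 ms timeout are assumptions made for illustration and are not part of Slider or Curator. The default host and port are taken from the log above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP listener exists at host:port. */
public class ZkPortProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    public static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: treat all as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults match the connection string in the log; override via arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 65385;
        System.out.println(host + ":" + port + " reachable=" + canConnect(host, port, 2000));
    }
}
```

A failed probe does not say which of the two causes applies on its own: it only shows that nothing accepts connections at the configured address, which is consistent with either ZK being down or the AM being pointed at the wrong port.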

There's a related question: should ZK binding and registration be async? 
There's no reason why not, though it would hide this situation, which would 
then only show up in the logs.
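
A rough sketch of what async binding could look like, assuming the blocking bind/registration step can be wrapped in a {{Runnable}}. The class {{AsyncZkBinding}}, its {{State}} enum and its method names are hypothetical and not Slider code; it uses {{CompletableFuture}}, so it needs Java 8 or later. The point is that the outcome is recorded where other code can read it rather than appearing only in the logs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical wrapper that runs a blocking bind off the caller's thread. */
public class AsyncZkBinding {

    public enum State { PENDING, BOUND, FAILED }

    private final AtomicReference<State> state = new AtomicReference<>(State.PENDING);
    private volatile Throwable lastError;

    /**
     * Runs blockingBind on the executor and records the outcome.
     * The returned future completes exceptionally if the bind throws.
     */
    public CompletableFuture<Void> bindAsync(Runnable blockingBind, Executor executor) {
        return CompletableFuture.runAsync(blockingBind, executor)
            .whenComplete((ignored, err) -> {
                if (err == null) {
                    state.set(State.BOUND);
                } else {
                    // Unwrap the CompletionException added by the async stage, if present.
                    lastError = (err instanceof CompletionException && err.getCause() != null)
                        ? err.getCause() : err;
                    state.set(State.FAILED);
                }
            });
    }

    public State state() {
        return state.get();
    }

    /** The failure cause if state() is FAILED, otherwise null. */
    public Throwable lastError() {
        return lastError;
    }
}
```

Startup can then continue while the bind is in flight, and a web UI or metrics endpoint can report {{state()}} and {{lastError()}} so that a stuck or failed binding stays visible.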

Maybe metrics or the web UI could track it.

Plan
# identify root cause
# make ZK binding async
# publish ZK connection state as a Codahale health check (somehow)
# fix root cause once failure handling has improved
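
For the health check step, the state to publish could be as simple as "connected, or disconnected for no longer than a grace period". Below is a library-free sketch of that logic; {{ZkConnectionHealth}} and its methods are hypothetical names, and wiring it into an actual Codahale health check and into Curator's connection state callbacks is deliberately left out. Timestamps are passed in by the caller so the logic can be exercised without a real clock.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical tracker for "how long has the ZK connection been down?". */
public class ZkConnectionHealth {

    private static final long CONNECTED = -1L;

    // Timestamp (ms) at which the connection was lost, or CONNECTED.
    private final AtomicLong disconnectedSinceMs;

    /** Starts in the disconnected state, as the AM does before ZK binding succeeds. */
    public ZkConnectionHealth(long nowMs) {
        this.disconnectedSinceMs = new AtomicLong(nowMs);
    }

    /** Call when the client reports that it is connected. */
    public void connected() {
        disconnectedSinceMs.set(CONNECTED);
    }

    /** Call when the connection is lost; repeated calls keep the earliest timestamp. */
    public void disconnected(long nowMs) {
        disconnectedSinceMs.compareAndSet(CONNECTED, nowMs);
    }

    /** Healthy if connected, or disconnected for at most graceMs. */
    public boolean isHealthy(long nowMs, long graceMs) {
        long since = disconnectedSinceMs.get();
        return since == CONNECTED || nowMs - since <= graceMs;
    }

    /** Short status line suitable for a health check message or a web UI. */
    public String describe(long nowMs) {
        long since = disconnectedSinceMs.get();
        return since == CONNECTED
            ? "ZK connected"
            : "ZK disconnected for " + (nowMs - since) + " ms";
    }
}
```

With a 15000 ms grace period, matching the timeout in the log above, an AM that has been waiting 15607 ms would report unhealthy instead of only logging the error.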

> TestAgentAMManagementWS.testAgentAMManagementWS failing
> -------------------------------------------------------
>
>                 Key: SLIDER-748
>                 URL: https://issues.apache.org/jira/browse/SLIDER-748
>             Project: Slider
>          Issue Type: Sub-task
>          Components: Web & REST
>    Affects Versions: Slider 0.70
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Critical
>             Fix For: Slider 0.70
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {{TestAgentAMManagementWS.testAgentAMManagementWS}} failing on jenkins. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
