[ https://issues.apache.org/jira/browse/SLIDER-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273544#comment-14273544 ]
Steve Loughran commented on SLIDER-748:
---------------------------------------
Logs show the AM is spinning, waiting for ZK to come up:
{code}
steners registered.
2015-01-12 12:12:20,794 [Thread-13] DEBUG agent.HeartbeatMonitor (HeartbeatMonitor.java:run(65)) - Putting monitor to sleep for 60000 milliseconds
2015-01-12 12:12:26,538 [CuratorFramework-0] ERROR curator.ConnectionState (ConnectionState.java:checkTimeouts(201)) - Connection timed out for connection string (localhost:65385) and timeout (15000) / elapsed (15607)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
    at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
    at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:113)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:763)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:749)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:56)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$3.call(CuratorFrameworkImpl.java:244)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2015-01-12 12:12:28,543 [CuratorFramework-0] ERROR curator.ConnectionState (ConnectionState.java:checkTimeouts(201)) - Connection timed out for connection string (localhost:65385) and timeout (15000) / elapsed (17613)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
{code}
Possible root causes:
# ZK isn't coming up.
# The configuration pointing at the ZK instance is wrong.
There's a related question: should ZK binding and registration be async?
There's no reason why not, though it would hide this situation, which would
then show up only in the logs.
Maybe metrics or the web UI could track it.
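A minimal sketch of what an async binding could look like, using only the JDK. {{AsyncZkBinder}} and its methods are hypothetical names, not Slider code, and the {{Callable}} stands in for whatever blocking call does the actual ZK bind and registration. The point is only that the outcome is recorded somewhere pollable rather than being visible solely in the logs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical one-shot helper: runs a blocking bind off the caller's thread. */
final class AsyncZkBinder {
    enum State { PENDING, BOUND, FAILED }

    private final AtomicReference<State> state = new AtomicReference<>(State.PENDING);
    private final AtomicReference<Exception> lastError = new AtomicReference<>();
    private final CountDownLatch finished = new CountDownLatch(1);
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "zk-binder");
        t.setDaemon(true); // never keep the JVM alive just for the bind
        return t;
    });

    /** Starts the bind and returns immediately; intended to be called once. */
    void bindAsync(Callable<Void> blockingBind) {
        executor.execute(() -> {
            try {
                blockingBind.call();
                state.set(State.BOUND);
            } catch (Exception e) {
                // Keep the failure so it can be surfaced outside the logs.
                lastError.set(e);
                state.set(State.FAILED);
            } finally {
                finished.countDown();
                executor.shutdown();
            }
        });
    }

    /** Waits up to timeoutMillis for the bind attempt to finish (mainly for tests). */
    boolean awaitFinished(long timeoutMillis) {
        try {
            return finished.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    State state() { return state.get(); }

    Exception lastError() { return lastError.get(); }
}
```

The caller would invoke {{bindAsync}} at startup and carry on; {{state()}} and {{lastError()}} are what a metric or web UI page could poll.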
Plan:
# Identify the root cause.
# Make ZK binding async.
# Publish the ZK connection state as a codahale health check (somehow).
# Fix the root cause once failure handling has improved.
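One possible shape for the health check step, sketched with the JDK only so it stands alone. {{ZkConnectionHealth}}, its {{Conn}} enum and its {{Result}} type are all hypothetical; in real code this would presumably subclass the codahale {{HealthCheck}} and be driven by a Curator connection state listener, and that wiring is an assumption left out here.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical holder for the last observed ZK connection state. */
final class ZkConnectionHealth {
    /** Illustrative states; a real client library defines its own. */
    enum Conn { NEVER_CONNECTED, CONNECTED, SUSPENDED, LOST }

    /** Stand-in for a health check result: a flag plus a human-readable message. */
    static final class Result {
        final boolean healthy;
        final String message;

        Result(boolean healthy, String message) {
            this.healthy = healthy;
            this.message = message;
        }
    }

    private final String connectString;
    private final AtomicReference<Conn> current =
            new AtomicReference<>(Conn.NEVER_CONNECTED);

    ZkConnectionHealth(String connectString) {
        this.connectString = connectString;
    }

    /** To be called by whatever listener observes the ZK client (assumed wiring). */
    void onStateChanged(Conn newState) {
        current.set(newState);
    }

    /** Healthy only while connected; otherwise reports the state and the target. */
    Result check() {
        Conn c = current.get();
        if (c == Conn.CONNECTED) {
            return new Result(true, "ZK connected: " + connectString);
        }
        return new Result(false, "ZK " + c + " for " + connectString);
    }
}
```

Including the connection string in the message means a wrong configuration and a ZK instance that never came up can at least be told apart from the check output.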
> TestAgentAMManagementWS.testAgentAMManagementWS failing
> -------------------------------------------------------
>
> Key: SLIDER-748
> URL: https://issues.apache.org/jira/browse/SLIDER-748
> Project: Slider
> Issue Type: Sub-task
> Components: Web & REST
> Affects Versions: Slider 0.70
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Critical
> Fix For: Slider 0.70
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> {{TestAgentAMManagementWS.testAgentAMManagementWS}} is failing on Jenkins.