[ https://issues.apache.org/jira/browse/STORM-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rick Kellogg updated STORM-131:
-------------------------------
    Component/s: storm-core

> Intermittent Zookeeper errors when shutting down local Topology
> ---------------------------------------------------------------
>
>                 Key: STORM-131
>                 URL: https://issues.apache.org/jira/browse/STORM-131
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>            Reporter: James Xu
>            Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/259
> We have a great many Storm integration tests in our project (Storm version 
> 0.7.3) that use a local topology. Only one topology is operational at any 
> moment in time. The tests are organized into groups, and each group works 
> within the boundaries of one topology. When a group finishes executing, it 
> shuts down its local cluster, and the next group of tests launches its own 
> cluster.
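> For illustration, each group follows roughly this lifecycle (a minimal 
> sketch; the helper name with-local-topology and the run-tests-fn hook are 
> ours, not Storm's API):
> {code}
> (import '[backtype.storm LocalCluster])
>
> ;; Minimal sketch of one test group's lifecycle: build a local cluster,
> ;; submit the topology under test, run the group, then tear everything down.
> (defn with-local-topology [topology-name conf topology run-tests-fn]
>   (let [cluster (LocalCluster.)]
>     (try
>       (.submitTopology cluster topology-name conf topology)
>       (run-tests-fn cluster)
>       (finally
>         ;; this is the shutdown that intermittently takes the JVM down
>         (.shutdown cluster)))))
> {code}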
> With remarkable regularity we see failures related to what looks like an 
> incorrect Zookeeper shutdown, which leads to a JVM exit (a disaster, as no 
> test information is recorded at the end). Here is what we see in the main 
> error log (log level: WARN and higher):
> {code}
> 2012-07-07 00:22:58,420 WARN [ConnectionStateManager-0|]@jenkins 
> com.netflix.curator.framework.state.ConnectionStateManager
> => There are no ConnectionStateListeners registered.
> 2012-07-07 00:22:58,534 WARN [Thread-23-EventThread|]@jenkins 
> backtype.storm.cluster
> => Received event :disconnected::none: with disconnected Zookeeper.
> 2012-07-07 00:23:00,013 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins 
> org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing 
> socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:01,527 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins 
> org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing 
> socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:03,510 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins 
> org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing 
> socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:04,687 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins 
> org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing 
> socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:05,961 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins 
> org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing 
> socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:07,588 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins 
> org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing 
> socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:07,691 ERROR [Thread-23-EventThread|]@jenkins 
> com.netflix.curator.framework.imps.CuratorFrameworkImpl
> => Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
> at 
> com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:380)
> at 
> com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:49)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 2012-07-07 00:23:07,697 WARN [ConnectionStateManager-0|]@jenkins 
> com.netflix.curator.framework.state.ConnectionStateManager
> => There are no ConnectionStateListeners registered.
> 2012-07-07 00:23:07,699 ERROR [Thread-23-EventThread|]@jenkins 
> backtype.storm.zookeeper
> => Unrecoverable Zookeeper error Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
> at 
> com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:380)
> at 
> com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:49)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> {code}
> And here is what we see in the Storm dedicated log file (log level: DEBUG):
> {code}
> 2012-07-07 00:22:58,306 INFO [main|]@jenkins backtype.storm.daemon.task
> => Shut down task TLTopology-1-1341620393:31
> 2012-07-07 00:22:58,306 INFO [main|]@jenkins backtype.storm.messaging.loader
> => Shutting down receiving-thread: [TLTopology-1-1341620393, 5]
> 2012-07-07 00:22:58,307 INFO [main|]@jenkins backtype.storm.messaging.loader
> => Waiting for receiving-thread:[TLTopology-1-1341620393, 5] to die
> 2012-07-07 00:22:58,307 INFO [Thread-319|]@jenkins 
> backtype.storm.messaging.loader
> => Receiving-thread:[TLTopology-1-1341620393, 5] received shutdown notice
> 2012-07-07 00:22:58,307 INFO [main|]@jenkins backtype.storm.messaging.loader
> => Shutdown receiving-thread: [TLTopology-1-1341620393, 5]
> 2012-07-07 00:22:58,307 INFO [main|]@jenkins backtype.storm.daemon.worker
> => Terminating zmq context
> 2012-07-07 00:22:58,307 INFO [main|]@jenkins backtype.storm.daemon.worker
> => Waiting for threads to die
> 2012-07-07 00:22:58,307 INFO [Thread-318|]@jenkins backtype.storm.util
> => Async loop interrupted!
> 2012-07-07 00:22:58,309 INFO [main|]@jenkins backtype.storm.daemon.worker
> => Disconnecting from storm cluster state context
> 2012-07-07 00:22:58,311 INFO [main|]@jenkins backtype.storm.daemon.worker
> => Shut down worker TLTopology-1-1341620393 
> 96e12303-4c22-4821-9f3b-3bce2230bf08 5
> 2012-07-07 00:22:58,311 DEBUG [main|]@jenkins backtype.storm.util
> => Rmr path 
> /tmp/f308eb0e-2e72-4221-9620-43e15a9c1bdc/workers/16966f32-d0d4-4ee1-a0fe-1d85fc4a478e/heartbeats
> 2012-07-07 00:22:58,313 DEBUG [main|]@jenkins backtype.storm.util
> => Removing path 
> /tmp/f308eb0e-2e72-4221-9620-43e15a9c1bdc/workers/16966f32-d0d4-4ee1-a0fe-1d85fc4a478e/pids
> 2012-07-07 00:22:58,313 DEBUG [main|]@jenkins backtype.storm.util
> => Removing path 
> /tmp/f308eb0e-2e72-4221-9620-43e15a9c1bdc/workers/16966f32-d0d4-4ee1-a0fe-1d85fc4a478e
> 2012-07-07 00:22:58,313 INFO [main|]@jenkins backtype.storm.daemon.supervisor
> => Shut down 
> 96e12303-4c22-4821-9f3b-3bce2230bf08:16966f32-d0d4-4ee1-a0fe-1d85fc4a478e
> 2012-07-07 00:22:58,314 INFO [main|]@jenkins backtype.storm.daemon.supervisor
> => Shutting down supervisor 96e12303-4c22-4821-9f3b-3bce2230bf08
> 2012-07-07 00:22:58,314 INFO [Thread-25|]@jenkins backtype.storm.event
> => Event manager interrupted
> 2012-07-07 00:22:58,315 INFO [Thread-26|]@jenkins backtype.storm.event
> => Event manager interrupted
> 2012-07-07 00:22:58,318 INFO [main|]@jenkins backtype.storm.testing
> => Shutting down in process zookeeper
> 2012-07-07 00:22:58,321 INFO [main|]@jenkins backtype.storm.testing
> => Done shutting down in process zookeeper
> 2012-07-07 00:22:58,321 INFO [main|]@jenkins backtype.storm.testing
> => Deleting temporary path /tmp/0202cf11-6ad7-4dda-94d6-622a63c9f6b6
> 2012-07-07 00:22:58,321 DEBUG [main|]@jenkins backtype.storm.util
> => Rmr path /tmp/0202cf11-6ad7-4dda-94d6-622a63c9f6b6
> 2012-07-07 00:22:58,322 INFO [main|]@jenkins backtype.storm.testing
> => Deleting temporary path /tmp/ee47e3e3-752f-40a8-b6a9-a197a9dda3de
> 2012-07-07 00:22:58,323 DEBUG [main|]@jenkins backtype.storm.util
> => Rmr path /tmp/ee47e3e3-752f-40a8-b6a9-a197a9dda3de
> 2012-07-07 00:22:58,323 INFO [main|]@jenkins backtype.storm.testing
> => Deleting temporary path /tmp/ece72b84-357e-4183-aeb5-e0d2dc5d6eca
> 2012-07-07 00:22:58,323 DEBUG [main|]@jenkins backtype.storm.util
> => Rmr path /tmp/ece72b84-357e-4183-aeb5-e0d2dc5d6eca
> 2012-07-07 00:22:58,326 INFO [main|]@jenkins backtype.storm.testing
> => Deleting temporary path /tmp/f308eb0e-2e72-4221-9620-43e15a9c1bdc
> 2012-07-07 00:22:58,326 DEBUG [main|]@jenkins backtype.storm.util
> => Rmr path /tmp/f308eb0e-2e72-4221-9620-43e15a9c1bdc
> 2012-07-07 00:22:58,534 WARN [Thread-23-EventThread|]@jenkins 
> backtype.storm.cluster
> => Received event :disconnected::none: with disconnected Zookeeper.
> 2012-07-07 00:23:07,699 ERROR [Thread-23-EventThread|]@jenkins 
> backtype.storm.zookeeper
> => Unrecoverable Zookeeper error Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
> at 
> com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:380)
> at 
> com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:49)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 2012-07-07 00:23:07,702 INFO [Thread-23-EventThread|]@jenkins 
> backtype.storm.util
> => Halting process: ("Unrecoverable Zookeeper error")
> {code}
> Personally, this looks like a threading issue to me. I wonder if there is 
> some form of workaround. I also understand that, since this is a "local" 
> topology issue, it might not receive due attention... However, a local 
> topology is fundamentally what new users start with when they begin to play 
> with Storm, and I think it is important to make that first experience 
> positive.
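> One workaround we could try looks roughly like this (an untested sketch; 
> shutdown-quietly is our name, and the sleep length is a guess tied to 
> supervisor.monitor.frequency.secs):
> {code}
> ;; Untested workaround sketch: kill the topology first and give the
> ;; supervisor's sync loop time to see the empty assignment, so there is
> ;; nothing left for it to (re)launch when the cluster is shut down.
> (defn shutdown-quietly [cluster topology-name]
>   (.killTopology cluster topology-name)
>   (Thread/sleep 10000) ;; longer than supervisor.monitor.frequency.secs (3s here)
>   (.shutdown cluster))
> {code}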
> Nathan, thank you very much for everything that you're doing.
> -Kyrill
> ----------
> dkincaid: Looking through the shutdown code for local clusters, I noticed a 
> comment in the code about a possible race condition. I'm wondering if we 
> could be running into it on our Jenkins server (which we know runs pretty 
> slowly). Is a worker getting restarted before the supervisor can be shut down?
> Here is the function with the comment:
> {code}
> (defn kill-local-storm-cluster [cluster-map]
>   (.shutdown (:nimbus cluster-map))
>   (.close (:state cluster-map))
>   (.disconnect (:storm-cluster-state cluster-map))
>   (doseq [s @(:supervisors cluster-map)]
>     (.shutdown-all-workers s)
>     ;; race condition here? will it launch the workers again?
>     (supervisor/kill-supervisor s))
>   (psim/kill-all-processes)
>   (log-message "Shutting down in process zookeeper")
>   (zk/shutdown-inprocess-zookeeper (:zookeeper cluster-map))
>   (log-message "Done shutting down in process zookeeper")
>   (doseq [t @(:tmp-dirs cluster-map)]
>     (log-message "Deleting temporary path " t)
>     (rmr t)
>     ))
> {code}
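> If that race is real, one candidate fix is simply to reorder the loop body 
> (a sketch, not the project's actual patch; it assumes shutdown-all-workers 
> is still safe to call once the supervisor's event manager has been stopped):
> {code}
> (defn kill-local-storm-cluster-reordered [cluster-map]
>   ;; ... same preamble as above ...
>   (doseq [s @(:supervisors cluster-map)]
>     ;; stop the supervisor first: its sync loop can no longer relaunch
>     ;; workers while they are being torn down
>     (supervisor/kill-supervisor s)
>     (.shutdown-all-workers s))
>   ;; ... same zookeeper and tmp-dir cleanup as above ...
>   )
> {code}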
> --------
> kyrill007: Fantastic catch, Dave!!! This is exactly what is happening: the 
> supervisor begins launching new workers while the old ones are still being 
> shut down. Here is the proof from the logs:
> The shutdown process is initiated at 04:37:05,136:
> {code}
> 2012-07-11 04:37:05,136 INFO [main|]@jenkins backtype.storm.daemon.nimbus
>   => Shutting down master
> 2012-07-11 04:37:05,145 INFO [main|]@jenkins backtype.storm.daemon.nimbus
>   => Shut down master
> 2012-07-11 04:37:05,151 INFO [main|]@jenkins backtype.storm.daemon.supervisor
>   => Shutting down 
> 5c48d4fc-769f-41ef-abd6-f92df60fa543:12eba15d-fb17-4a3c-8e25-1c0266eed04d
> 2012-07-11 04:37:05,152 INFO [main|]@jenkins backtype.storm.process-simulator
>   => Killing process ea132b37-dc6a-447c-b1de-ac6727c82cef
> 2012-07-11 04:37:05,152 INFO [main|]@jenkins backtype.storm.daemon.worker
>   => Shutting down worker TLTopology-1-1341981237 
> 5c48d4fc-769f-41ef-abd6-f92df60fa543 1
> 2012-07-11 04:37:05,152 INFO [main|]@jenkins backtype.storm.daemon.task
>   => Shutting down task TLTopology-1-1341981237:64
> 2012-07-11 04:37:05,153 INFO [Thread-129|]@jenkins backtype.storm.util
>   => Async loop interrupted!
> 2012-07-11 04:37:05,180 INFO [main|]@jenkins backtype.storm.daemon.task
>   => Shut down task TLTopology-1-1341981237:64
> 2012-07-11 04:37:05,180 INFO [main|]@jenkins backtype.storm.daemon.task
>   => Shutting down task TLTopology-1-1341981237:34
> {code}
> It continues for a while (we have a lot of workers). Then at 04:37:05,665 we 
> start seeing this:
> {code}
> 2012-07-11 04:37:05,665 DEBUG [Thread-19|]@jenkins 
> backtype.storm.daemon.supervisor
>   => Assigned tasks: {2 
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
> "TLTopology-1-1341981237", :task-ids (96 66 36 6 102 72 42 12 108 78 48 18 
> 114 84 54 24 120 90 60 30 126)}, 1 
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
> "TLTopology-1-1341981237", :task-ids (64 34 4 100 70 40 10 106 76 46 16 112 
> 82 52 22 118 88 58 28 124 94)}, 3 
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
> "TLTopology-1-1341981237", :task-ids (32 2 98 68 38 8 104 74 44 14 110 80 50 
> 20 116 86 56 26 122 92 62)}}
> 2012-07-11 04:37:05,665 DEBUG [Thread-19|]@jenkins 
> backtype.storm.daemon.supervisor
>   => Allocated: {"a724dc19-84ec-46dc-9768-afb73df94237" [:valid 
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1341981425, 
> :storm-id "TLTopology-1-1341981237", :task-ids #{96 66 36 6 102 72 42 12 108 
> 78 48 18 114 84 54 24 120 90 60 30 126}, :port 2}]}
> 2012-07-11 04:37:05,665 DEBUG [Thread-19|]@jenkins backtype.storm.util
>   => Making dirs at 
> /tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c/workers/a7f81ea0-a5f6-47de-9a89-47998b1e1639/pids
> 2012-07-11 04:37:05,666 DEBUG [Thread-19|]@jenkins backtype.storm.util
>   => Making dirs at 
> /tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c/workers/1b5c4c87-4e05-4cab-a580-ae1dabb3fd2e/pids
> 2012-07-11 04:37:05,666 INFO [main|]@jenkins backtype.storm.daemon.worker
>   => Shut down worker TLTopology-1-1341981237 
> 5c48d4fc-769f-41ef-abd6-f92df60fa543 2
> 2012-07-11 04:37:05,667 DEBUG [main|]@jenkins backtype.storm.util
>   => Rmr path 
> /tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c/workers/a724dc19-84ec-46dc-9768-afb73df94237/heartbeats
> 2012-07-11 04:37:05,669 DEBUG [main|]@jenkins backtype.storm.util
>   => Removing path 
> /tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c/workers/a724dc19-84ec-46dc-9768-afb73df94237/pids
> 2012-07-11 04:37:05,669 DEBUG [main|]@jenkins backtype.storm.util
>   => Removing path 
> /tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c/workers/a724dc19-84ec-46dc-9768-afb73df94237
> 2012-07-11 04:37:05,669 INFO [main|]@jenkins backtype.storm.daemon.supervisor
>   => Shut down 
> 5c48d4fc-769f-41ef-abd6-f92df60fa543:a724dc19-84ec-46dc-9768-afb73df94237
> 2012-07-11 04:37:05,669 INFO [main|]@jenkins backtype.storm.daemon.supervisor
>   => Shutting down supervisor 5c48d4fc-769f-41ef-abd6-f92df60fa543
> 2012-07-11 04:37:05,670 INFO [Thread-18|]@jenkins backtype.storm.event
>   => Event manager interrupted
> 2012-07-11 04:37:05,670 INFO [Thread-19|]@jenkins 
> backtype.storm.daemon.supervisor
>   => Launching worker with assignment 
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
> "TLTopology-1-1341981237", :task-ids (64 34 4 100 70 40 10 106 76 46 16 112 
> 82 52 22 118 88 58 28 124 94)} for this supervisor 
> 5c48d4fc-769f-41ef-abd6-f92df60fa543 on port 1 with id 
> a7f81ea0-a5f6-47de-9a89-47998b1e1639
> 2012-07-11 04:37:05,672 INFO [Thread-19|]@jenkins backtype.storm.daemon.worker
>   => Launching worker for TLTopology-1-1341981237 on 
> 5c48d4fc-769f-41ef-abd6-f92df60fa543:1 with id 
> a7f81ea0-a5f6-47de-9a89-47998b1e1639 and conf {"dev.zookeeper.path" 
> "/tmp/dev-storm-zookeeper", "topology.fall.back.on.java.serialization" true, 
> "zmq.linger.millis" 0, "topology.skip.missing.kryo.registrations" true, 
> "ui.childopts" "-Xmx768m", "storm.zookeeper.session.timeout" 20000, 
> "nimbus.reassign" true, "nimbus.monitor.freq.secs" 10, "java.library.path" 
> "/usr/local/lib:/opt/local/lib:/usr/lib", "storm.local.dir" 
> "/tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c", 
> "supervisor.worker.start.timeout.secs" 120, "nimbus.cleanup.inbox.freq.secs" 
> 600, "nimbus.inbox.jar.expiration.secs" 3600, "nimbus.host" "localhost", 
> "storm.zookeeper.port" 2000, "transactional.zookeeper.port" nil, 
> "transactional.zookeeper.servers" nil, "storm.zookeeper.root" "/storm", 
> "supervisor.enable" true, "storm.zookeeper.servers" ["localhost"], 
> "transactional.zookeeper.root" "/transactional", "topology.worker.childopts" 
> nil, "worker.childopts" "-Xmx768m", "supervisor.heartbeat.frequency.secs" 5, 
> "drpc.port" 3772, "supervisor.monitor.frequency.secs" 3, 
> "task.heartbeat.frequency.secs" 3, "topology.max.spout.pending" nil, 
> "storm.zookeeper.retry.interval" 1000, "supervisor.slots.ports" (1 2 3), 
> "topology.debug" false, "nimbus.task.launch.secs" 120, 
> "nimbus.supervisor.timeout.secs" 60, "topology.message.timeout.secs" 30, 
> "task.refresh.poll.secs" 10, "topology.workers" 1, "supervisor.childopts" 
> "-Xmx1024m", "nimbus.thrift.port" 6627, "topology.stats.sample.rate" 0.05, 
> "worker.heartbeat.frequency.secs" 1, "nimbus.task.timeout.secs" 30, 
> "drpc.invocations.port" 3773, "zmq.threads" 1, "storm.zookeeper.retry.times" 
> 5, "topology.state.synchronization.timeout.secs" 60, 
> "supervisor.worker.timeout.secs" 30, "nimbus.file.copy.expiration.secs" 600, 
> "drpc.request.timeout.secs" 600, "storm.local.mode.zmq" false, "ui.port" 
> 8080, "nimbus.childopts" "-Xmx1024m", "topology.ackers" 1, 
> "storm.cluster.mode" "local", "topology.optimize" true, 
> "topology.max.task.parallelism" nil}
> 2012-07-11 04:37:05,675 INFO [Thread-19|]@jenkins backtype.storm.event
>   => Event manager interrupted
> 2012-07-11 04:37:05,677 INFO [Thread-19-EventThread|]@jenkins 
> backtype.storm.zookeeper
>   => Zookeeper state update: :connected:none
> {code}
> which in the end results in this:
> {code}
> 2012-07-11 04:37:06,175 INFO [Thread-19-EventThread|]@jenkins 
> backtype.storm.zookeeper
>   => Zookeeper state update: :disconnected:none
> 2012-07-11 04:37:06,175 WARN [Thread-22-EventThread|]@jenkins 
> backtype.storm.cluster
>   => Received event :disconnected::none: with disconnected Zookeeper.
> 2012-07-11 04:37:15,923 ERROR [Thread-22-EventThread|]@jenkins 
> backtype.storm.zookeeper
>   => Unrecoverable Zookeeper error Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>     at 
> com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:380)
>     at 
> com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:49)
>     at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:613)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)
> 2012-07-11 04:37:15,926 INFO [Thread-22-EventThread|]@jenkins 
> backtype.storm.util
>   => Halting process: ("Unrecoverable Zookeeper error")
> {code}
> Dear Nathan,
> If this race condition could somehow be fixed (presumably it is not that hard 
> since we know what the problem is), it would be so much appreciated!!!
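> A guard flag would achieve the same without reordering the shutdown (again 
> a sketch; active?, launch-worker, and sync-processes-guarded are 
> hypothetical names, not Storm internals):
> {code}
> (def active? (atom true)) ;; hypothetical per-supervisor shutdown flag
>
> (defn sync-processes-guarded [supervisor assignments]
>   ;; the supervisor's periodic sync refuses to launch once shutdown began
>   (when @active?
>     (doseq [[port assignment] assignments]
>       (launch-worker supervisor port assignment)))) ;; hypothetical helper
>
> (defn kill-supervisor-guarded [s]
>   (reset! active? false) ;; flip the flag before anything else
>   (supervisor/kill-supervisor s))
> {code}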


