[
https://issues.apache.org/jira/browse/STORM-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717125#comment-14717125
]
Jackson Chung commented on STORM-130:
-------------------------------------
We just upgraded to 0.9.5 (because we also ran into this issue often on the previous
version), and we hit it again.
The supervisors did go down in our case.
One caveat about our upgrade: we started a new nimbus without any supervisors
attached, then deployed topologies (from CI/CD), and only afterwards built new
supervisors, which start the supervisor service on boot. However, in between, the
network service is restarted (because the hostname changes during the Chef build).
Just want to throw this out there in case it makes a difference.
We also see this:
{code}
SEVERE: RuntimeException while executing runnable
org.apache.storm.guava.util.concurrent.Futures$4@445058b with executor
org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService@691bc565
java.lang.RuntimeException: Failed to connect to
Netty-Client-usw2b-grunt-drone32-prod.amz.relateiq.com/10.30.103.202:6700
at backtype.storm.messaging.netty.Client.connect(Client.java:308)
at backtype.storm.messaging.netty.Client.access$1100(Client.java:78)
at backtype.storm.messaging.netty.Client$2.reconnectAgain(Client.java:297)
at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:283)
at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:275)
at org.apache.storm.guava.util.concurrent.Futures$4.run(Futures.java:1181)
at org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
at org.apache.storm.guava.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
at org.apache.storm.guava.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
at org.apache.storm.guava.util.concurrent.ListenableFutureTask.done(ListenableFutureTask.java:91)
at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:384)
at java.util.concurrent.FutureTask.set(FutureTask.java:233)
at java.util.concurrent.FutureTask.run(FutureTask.java:274)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Giving up to connect to
Netty-Client-usw2b-grunt-drone32-prod.amz.relateiq.com/10.30.103.202:6700 after
102 failed attempts
at backtype.storm.messaging.netty.Client.connect(Client.java:303)
{code}
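If the connection failures are only due to workers being slow to come up or transient
network churn (e.g. the hostname change during our Chef build), the Netty client's
retry budget can be raised so workers keep retrying longer before giving up. A minimal
sketch, assuming a 0.9.x cluster and that the storm.messaging.netty.* keys below are
still honored when set on the topology conf (they may need to go into storm.yaml
cluster-wide instead; check your version's defaults.yaml):
{code}
import backtype.storm.Config;

public class NettyRetryTuning {
    // Sketch only: this conf would be passed to StormSubmitter.submitTopology(...).
    public static Config conf() {
        Config conf = new Config();
        // Keys taken from the 0.9.x defaults.yaml; larger values make the Netty
        // client retry longer before "Giving up to connect ... after N failed attempts".
        conf.put("storm.messaging.netty.max_retries", 300);
        conf.put("storm.messaging.netty.min_wait_ms", 100);
        conf.put("storm.messaging.netty.max_wait_ms", 2000);
        return conf;
    }
}
{code}
This only papers over the symptom, of course; if the remote worker is gone because its
stormconf.ser was deleted (the root issue here), no amount of retrying will reconnect.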
> [Storm 0.8.2]: java.io.FileNotFoundException: File '../stormconf.ser' does
> not exist
> ------------------------------------------------------------------------------------
>
> Key: STORM-130
> URL: https://issues.apache.org/jira/browse/STORM-130
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.9.2-incubating, 0.9.3, 0.9.4
> Reporter: James Xu
> Assignee: Sriharsha Chintalapani
> Priority: Minor
> Fix For: 0.10.0, 0.9.5
>
> Attachments: README.txt, nimbus.log.gz, supervisor_logs.tar.gz,
> worker_logs.tar.gz, worker_logs_of_kafka_traffic.tar.gz,
> worker_logs_of_zookeeper_traffic_2015-04-11.tar.gz,
> worker_logs_of_zookeeper_traffic_2015-04-12.tar.gz,
> worker_logs_of_zookeeper_traffic_2015-04-13.tar.gz,
> workers_with_stormconf.ser.gz
>
>
> https://github.com/nathanmarz/storm/issues/438
> Hi developers,
> We hit a critical issue when deploying a Storm topology to our prod cluster.
> After deploying the topology we got this trace on the workers (Storm
> 0.8.2/zookeeper-3.3.6):
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection,
> connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181
> sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@254ba9a2
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server
> zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to
> zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on
> server zookeeper1.company.com/10.72.209.112:2181, sessionid =
> 0x13b3e4b5c780239, negotiated timeout = 20000
> 2013-01-14 10:57:39 zookeeper [INFO] Zookeeper state update: :connected:none
> 2013-01-14 10:57:39 ZooKeeper [INFO] Session: 0x13b3e4b5c780239 closed
> 2013-01-14 10:57:39 ClientCnxn [INFO] EventThread shut down
> 2013-01-14 10:57:39 CuratorFrameworkImpl [INFO] Starting
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection,
> connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181/storm
> sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@33a998c7
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server
> zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to
> zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on
> server zookeeper1.company.com/10.72.209.112:2181, sessionid =
> 0x13b3e4b5c78023a, negotiated timeout = 20000
> 2013-01-14 10:57:39 worker [ERROR] Error on initialization of server mk-worker
> java.io.FileNotFoundException: File
> '/tmp/storm/supervisor/stormdist/normalization-prod-1-1358161053/stormconf.ser'
> does not exist
> at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137)
> at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135)
> at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138)
> at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:146)
> at backtype.storm.daemon.worker$fn__4348$exec_fn__1228__auto____4349.invoke(worker.clj:332)
> at clojure.lang.AFn.applyToHelper(AFn.java:185)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at clojure.core$apply.invoke(core.clj:601)
> at backtype.storm.daemon.worker$fn__4348$mk_worker__4404.doInvoke(worker.clj:323)
> at clojure.lang.RestFn.invoke(RestFn.java:512)
> at backtype.storm.daemon.worker$_main.invoke(worker.clj:433)
> at clojure.lang.AFn.applyToHelper(AFn.java:172)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at backtype.storm.daemon.worker.main(Unknown Source)
> 2013-01-14 10:57:39 util [INFO] Halting process: ("Error on initialization")
> Supervisor trace:
> 2013-01-14 10:59:01 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26
> still hasn't started
> 2013-01-14 10:59:02 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26
> still hasn't started
> .......
> 2013-01-14 10:59:34 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26
> still hasn't started
> 2013-01-14 10:59:35 supervisor [INFO] Worker
> d6735377-f0d6-4247-9f35-c8620e2b0e26 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Worker
> 234264c6-d9d6-4e8a-ab0a-8926bdd6b536 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id
> 234264c6-d9d6-4e8a-ab0a-8926bdd6b536. Current supervisor time: 1358161175.
> State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down
> d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4819. Process is
> probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down
> d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id
> d6735377-f0d6-4247-9f35-c8620e2b0e26. Current supervisor time: 1358161175.
> State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down
> d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4809. Process is
> probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down
> d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> Thanks!
> Oleg M.
> ----------
> xiaokang: We also encountered this problem, and the DEBUG log showed that
> storm-code-map is null while new-assignment is not null. It may be that
> new-assignment is read from ZK after storm-code-map. So we moved the
> storm-code-map read to after all-assignment, and the problem disappeared.
> {code}
> (defn mk-synchronize-supervisor [supervisor sync-processes event-manager processes-event-manager]
>   (fn this []
>     (let [conf (:conf supervisor)
>           storm-cluster-state (:storm-cluster-state supervisor)
>           ^ISupervisor isupervisor (:isupervisor supervisor)
>           ^LocalState local-state (:local-state supervisor)
>           sync-callback (fn [& ignored] (.add event-manager this))
>           storm-code-map (read-storm-code-locations storm-cluster-state sync-callback)
>           downloaded-storm-ids (set (read-downloaded-storm-ids conf))
>           all-assignment (read-assignments
>                            storm-cluster-state
>                            (:supervisor-id supervisor)
>                            sync-callback)
>           new-assignment (->> all-assignment
>                               (filter-key #(.confirmAssigned isupervisor %)))
>           assigned-storm-ids (assigned-storm-ids-from-port-assignments new-assignment)
>           existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)]
>       (log-debug "Synchronizing supervisor")
>       (log-debug "Storm code map: " storm-code-map)
>       (log-debug "Downloaded storm ids: " downloaded-storm-ids)
>       (log-debug "All assignment: " all-assignment)
>       (log-debug "New assignment: " new-assignment)
> {code}
> ----------
> NJtwentyone: I ran into a similar problem once, though I haven't had it lately.
> I was going to investigate more before posting, but... There are two
> schedule-recurring calls of interest in mk-supervisor:
> (schedule-recurring (:timer supervisor) 0 10 (fn ))
> (schedule-recurring (:timer supervisor)
>                     0
>                     (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
>                     (fn )))
> So function-1 (synchronize-supervisor) will eventually remove that directory
> [../stormconf.ser],
> and function-2 (sync-processes) will eventually hang trying to create workers,
> repeatedly logging:
> 2013-01-13 22:13:13 b.s.d.supervisor [INFO]
> 6f73facd-8722-4b83-959c-a7b396c61224 still hasn't started
> I'm using the distributed setup (Storm 0.8.2 / ZooKeeper 3.4.5) on my Mac and an
> HP machine, running storm.starter.WordCountTopology.
> ----------
> miggi: Well, this problem was resolved only after installing a new ZK instance
> (3.3.6). We had other issues with deploying topologies on version 0.8.1, so we
> decided to use a new ZK instance.
> Btw: removing the Storm data folders didn't help.
> ----------
> nathanmarz: For those of you hitting the problem, it would be really useful
> if you could turn on DEBUG logging on the supervisor and show those logs when
> the problem is hit.
> ----------
> miggi: After a few months of stable operation, we ran into the same issue again.
> Logs ("topology.debug: true"):
> Supervisor logs: http://cl.ly/0p2W3N0A2V3S
> Worker logs: http://cl.ly/2w1j1r0a3X46
> Cleaning/restarting the ZK/Nimbus/worker components didn't help.
> We are down now: we haven't tried the solution proposed by xiaokang, but that
> will probably be our next step.
> Any fresh ideas?
> ----------
> devoncrouse: +1 - ran into this twice now. We've "reset" things by clearing
> out Storm stuff in Zk, wiping out Storm data directory, and restarting the
> topology, but it happens again in time.
> Storm 0.9.0-wip16
> Zk 3.3.6
> ----------
> d2r: +1, seen this happen many times. Same work-around.
> Most recently on Storm 0.8.2-wip22.
> I'll try to collect debug logs from the supervisor and have a look.
> ----------
> miggi: Hey guys,
> try restarting your cluster and make sure all previous workers (Java processes)
> were killed after this operation.
> We've also updated the cluster to Storm 0.8.3-wip1, which has a possible fix.
> Hope it helps
> ----------
> Oleg M.
> ----------
> devoncrouse: A simple restart doesn't cut it; we have to totally clear out all
> Storm state in ZK and the data directory (otherwise the cluster quickly becomes
> unbalanced with the same errors in the logs) - and again, we're on 0.9.0-wip16.
> ----------
> xiaokang: We encountered the same problem in our production cluster. The log
> showed that a worker died and the supervisor and nimbus both noticed it at the
> same time. Nimbus then assigned the worker to another host, and the supervisor's
> sync-supervisor deleted the topology's code while sync-processes was trying to
> restart the dead, timed-out worker.
> Our solution was to change the supervisor's worker timeout from 30 to 5 seconds
> while leaving nimbus at 30. It works.
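> For reference, the knobs in this work-around correspond to daemon-side settings
> that live in storm.yaml on the supervisor and nimbus hosts (they are not topology
> options). A minimal sketch that just records the key names and the values
> described, written as plain Java; treat the exact keys as an assumption to verify
> against your version's defaults.yaml:
> {code}
> import java.util.HashMap;
> import java.util.Map;
>
> public class SupervisorTimeoutNotes {
>     public static void main(String[] args) {
>         // storm.yaml overrides on each supervisor host: give up on a stuck worker
>         // after 5s instead of 30s, so the supervisor restarts it before nimbus
>         // reassigns the slot elsewhere.
>         Map<String, Object> supervisorYaml = new HashMap<String, Object>();
>         supervisorYaml.put("supervisor.worker.timeout.secs", 5);
>
>         // storm.yaml on the nimbus host is left at the default 30s.
>         Map<String, Object> nimbusYaml = new HashMap<String, Object>();
>         nimbusYaml.put("nimbus.task.timeout.secs", 30);
>
>         System.out.println("supervisor overrides: " + supervisorYaml);
>         System.out.println("nimbus (unchanged):   " + nimbusYaml);
>     }
> }
> {code}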
> ----------
> d2r: We have tried the same work-around from @xiaokang, but we still see the
> issue.
> ----------
> nathanmarz: @d2r This is fixed in the 0.8.3 wip, please try that out and let
> me know if it fixes the issue for you.
> ----------
> d2r: @nathanmarz Yes, pulling in the fix to supervisor.clj from 0.8.3 did
> resolve the issue for us. Thanks!
> ----------
> devoncrouse: Is this fix going to be applied to the newer builds (e.g. 0.9.0)
> as well?
> d5207b5
> ---------
> viceyang: We encountered the same problem; our version is 0.8.3. I read the
> supervisor code and it does not seem to be a bug. The key reason is that
> "synchronize-supervisor", the thread responsible for downloading and removing
> topology files, and "sync-processes", the thread responsible for starting worker
> processes, run asynchronously.
> Consider this case: synchronize-supervisor reads the assignment information from
> ZK, downloads the necessary files from nimbus, and writes local state. In another
> thread, the sync-processes function reads that local state to launch the worker
> process. If, before the worker process has started, synchronize-supervisor runs
> again and the topology's assignment has changed (caused by a rebalance, a worker
> timeout, etc.) so that the worker assigned to this supervisor has moved to another
> supervisor, synchronize-supervisor removes the now-unnecessary files (the jar
> file, the ser file, etc.). After this, the worker launched by sync-processes finds
> that the ser file does not exist, and this issue occurs.
> So although an exception occurs in this issue, the logic looks right. The
> exception can't be completely eliminated unless the asynchronous threading model
> is changed.
> There are ways to make the exception less likely. @xiaokang mentioned "change
> supervisor's worker timeout from 30 to 5 seconds and nimbus remain 30"; this
> reduces how often nimbus reassigns the topology. Another way is to change the
> "synchronize-supervisor" loop interval to something longer than the default 10
> seconds, such as 30 seconds.
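> To make the race concrete, here is a small self-contained toy sketch in Java (not
> Storm code; every name and timing in it is made up for illustration): one periodic
> task plays synchronize-supervisor and deletes the topology directory contents once
> the assignment has moved away, while a delayed task plays sync-processes/the
> worker launch and tries to read stormconf.ser from that directory.
> {code}
> import java.io.File;
> import java.io.FileNotFoundException;
> import java.io.IOException;
> import java.nio.file.Files;
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> public class SupervisorRaceSketch {
>     public static void main(String[] args) throws Exception {
>         File topoDir = Files.createTempDirectory("stormdist-demo").toFile();
>         File stormConf = new File(topoDir, "stormconf.ser");
>         stormConf.createNewFile();
>
>         // Flips to false when "nimbus" moves the assignment to another supervisor.
>         AtomicBoolean assignedHere = new AtomicBoolean(true);
>         ScheduledExecutorService timer = Executors.newScheduledThreadPool(2);
>
>         // Plays synchronize-supervisor: removes code for topologies no longer assigned here.
>         timer.scheduleAtFixedRate(() -> {
>             if (!assignedHere.get() && stormConf.delete()) {
>                 System.out.println("synchronize-supervisor: removed " + stormConf);
>             }
>         }, 0, 100, TimeUnit.MILLISECONDS);
>
>         // Plays sync-processes: the launched worker must read stormconf.ser.
>         timer.schedule(() -> {
>             try {
>                 if (!stormConf.exists()) {
>                     throw new FileNotFoundException("File '" + stormConf + "' does not exist");
>                 }
>                 System.out.println("worker: read " + stormConf + " and started");
>             } catch (IOException e) {
>                 System.out.println("worker: Error on initialization of server mk-worker: " + e);
>             }
>         }, 500, TimeUnit.MILLISECONDS);
>
>         Thread.sleep(200);
>         assignedHere.set(false);  // assignment moves away before the worker launch runs
>         Thread.sleep(1000);
>         timer.shutdown();
>     }
> }
> {code}
> Run as-is, it prints the same worker-side FileNotFoundException reported above;
> the equivalent ordering inside the real supervisor is what the 0.8.3-wip fix and
> the later STORM-130 changes guard against.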
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)