[
https://issues.apache.org/jira/browse/STORM-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Kellogg updated STORM-130:
-------------------------------
Component/s: storm-core
> [Storm 0.8.2]: java.io.FileNotFoundException: File '../stormconf.ser' does
> not exist
> ------------------------------------------------------------------------------------
>
> Key: STORM-130
> URL: https://issues.apache.org/jira/browse/STORM-130
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Affects Versions: 0.9.2-incubating, 0.9.3, 0.9.4
> Reporter: James Xu
> Assignee: Sriharsha Chintalapani
> Priority: Minor
> Fix For: 0.10.0, 0.9.5
>
> Attachments: README.txt, nimbus.log.gz, supervisor.log.2015-08-26,
> supervisor_logs.tar.gz, worker_logs.tar.gz,
> worker_logs_of_kafka_traffic.tar.gz,
> worker_logs_of_zookeeper_traffic_2015-04-11.tar.gz,
> worker_logs_of_zookeeper_traffic_2015-04-12.tar.gz,
> worker_logs_of_zookeeper_traffic_2015-04-13.tar.gz,
> workers_with_stormconf.ser.gz
>
>
> https://github.com/nathanmarz/storm/issues/438
> Hi developers,
> We met critical issue with deploying storm topology to our prod cluster.
> After deploying topology we got trace on workers (Storm
> 0.8.2/zookeeper-3.3.6) :
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection,
> connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181
> sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@254ba9a2
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server
> zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to
> zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on
> server zookeeper1.company.com/10.72.209.112:2181, sessionid =
> 0x13b3e4b5c780239, negotiated timeout = 20000
> 2013-01-14 10:57:39 zookeeper [INFO] Zookeeper state update: :connected:none
> 2013-01-14 10:57:39 ZooKeeper [INFO] Session: 0x13b3e4b5c780239 closed
> 2013-01-14 10:57:39 ClientCnxn [INFO] EventThread shut down
> 2013-01-14 10:57:39 CuratorFrameworkImpl [INFO] Starting
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection,
> connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181/storm
> sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@33a998c7
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server
> zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to
> zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on
> server zookeeper1.company.com/10.72.209.112:2181, sessionid =
> 0x13b3e4b5c78023a, negotiated timeout = 20000
> 2013-01-14 10:57:39 worker [ERROR] Error on initialization of server mk-worker
> java.io.FileNotFoundException: File
> '/tmp/storm/supervisor/stormdist/normalization-prod-1-1358161053/stormconf.ser'
> does not exist
> at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137)
> at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135)
> at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138)
> at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:146)
> at
> backtype.storm.daemon.worker$fn__4348$exec_fn__1228__auto____4349.invoke(worker.clj:332)
> at clojure.lang.AFn.applyToHelper(AFn.java:185)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at clojure.core$apply.invoke(core.clj:601)
> at
> backtype.storm.daemon.worker$fn__4348$mk_worker__4404.doInvoke(worker.clj:323)
> at clojure.lang.RestFn.invoke(RestFn.java:512)
> at backtype.storm.daemon.worker$_main.invoke(worker.clj:433)
> at clojure.lang.AFn.applyToHelper(AFn.java:172)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at backtype.storm.daemon.worker.main(Unknown Source)
> 2013-01-14 10:57:39 util [INFO] Halting process: ("Error on initialization")
> Supervisor trace:
> 2013-01-14 10:59:01 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26
> still hasn't started
> 2013-01-14 10:59:02 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26
> still hasn't starte
> .......
> 2013-01-14 10:59:34 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26
> still hasn't started
> 2013-01-14 10:59:35 supervisor [INFO] Worker
> d6735377-f0d6-4247-9f35-c8620e2b0e26 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Worker
> 234264c6-d9d6-4e8a-ab0a-8926bdd6b536 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id
> 234264c6-d9d6-4e8a-ab0a-8926bdd6b536. Current supervisor time: 1358161175.
> State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down
> d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4819. Process is
> probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down
> d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id
> d6735377-f0d6-4247-9f35-c8620e2b0e26. Current supervisor time: 1358161175.
> State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down
> d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4809. Process is
> probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down
> d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> Thanks!
> Oleg M.
> ----------
> xiaokang: We also encountered this problem and the DEBUG log showed that
> storm-code-map is null and new-assignment is not null. I may be that
> new-assignment is got from zk after storm-code-map. So we changed the
> storm-code-map after all-assignment and the problem disapeared.
> (defn mk-synchronize-supervisor [supervisor sync-processes event-manager
> processes-event-manager]
> (fn this []
> (let [conf (:conf supervisor)
> storm-cluster-state (:storm-cluster-state supervisor)
> ^ISupervisor isupervisor (:isupervisor supervisor)
> ^LocalState local-state (:local-state supervisor)
> sync-callback (fn [& ignored] (.add event-manager this))
> storm-code-map (read-storm-code-locations storm-cluster-state
> sync-callback)
> downloaded-storm-ids (set (read-downloaded-storm-ids conf))
> all-assignment (read-assignments
> storm-cluster-state
> (:supervisor-id supervisor)
> sync-callback)
> new-assignment (->> all-assignment
> (filter-key #(.confirmAssigned isupervisor %)))
> assigned-storm-ids (assigned-storm-ids-from-port-assignments
> new-assignment)
> existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)]
> (log-debug "Synchronizing supervisor")
> (log-debug "Storm code map: " storm-code-map)
> (log-debug "Downloaded storm ids: " downloaded-storm-ids)
> (log-debug "All assignment: " all-assignment)
> (log-debug "New assignment: " new-assignment)
> ----------
> NJtwentyone: I ran into having a similar problem once. Haven't had it lately.
> I was going to investigate more and before I created a posted but... There
> are two schedule-recurring functions of intertest in the function
> (mk-supervisor).
> (schedule-recurring (:timer supervisor) 0 10 (fn ))
> (schedule-recurring (:timer supervisor)
> 0
> (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
> (fn )))
> So function-1 (synchronize-supervisor) will eventually remove that dir
> [../stormconf.ser]
> And function-2 (sync-processes) will eventually hang wanting to create
> workers saying repeatedly
> 2013-01-13 22:13:13 b.s.d.supervisor [INFO]
> 6f73facd-8722-4b83-959c-a7b396c61224 still hasn't started
> I'm using the distributed setup of (Storm 0.8.2/zookeeper-3.4.5) on my mac
> and a hp using storm.starter.WordCountTopology
> ----------
> miggi: Well this problem was resolved only after reinstalling new instance of
> ZK (3.3.6), we had other issues with deploy topology on 0.8.1 version, so it
> was decided use new instance ZK.
> Btw: removing storm data folders didn't help
> ----------
> nathanmarz: For those of you hitting the problem, it would be really useful
> if you could turn on DEBUG logging on the supervisor and show those logs when
> the problem is hit.
> ----------
> miggi: After few months stable work - we ran into the same issue again.
> Logs ("topology.debug: true")
> Supervisor logs: http://cl.ly/0p2W3N0A2V3S
> Worker logs: http://cl.ly/2w1j1r0a3X46
> Cleaning/restarting ZK/Nimbus/Worker components didn't help.
> Now we down: haven't tried solution proposed by xiaokang but probably it
> would be our next steps.
> Any fresh ideas ?
> ----------
> devoncrouse: +1 - ran into this twice now. We've "reset" things by clearing
> out Storm stuff in Zk, wiping out Storm data directory, and restarting the
> topology, but it happens again in time.
> Storm 0.9.0-wip16
> Zk 3.3.6
> ----------
> d2r: +1 seen this happen many times. same work-around
> Most recently Storm 0.8.2-wip22
> I'll try to collect a look and debug logs from the supervisor and have a look.
> ----------
> miggi: Hey guys,
> try restart your cluster and ensure that all previous workers (java
> processes) was killed after this operation.
> Also we've updated cluster to storm 0.8.3-wip1 with possible fix.
> Hope it helps
> ----––––
> Oleg M.
> ----------
> devoncrouse: Simple restart doesn't cut it; have to totally clear out all
> Storm state in Zk and data directory (cluster otherwise quickly becomes
> unbalanced with same errors in the logs) - and again, we're on 0.9.0-wip16.
> ----------
> xiaokang: We encountered the same problem in production cluster. The log
> showed that a worker died and supervisor and nimbus found that at the same
> time. Than nimbus assign the worker to other host and the supervisor's
> sync-supervisor delete topo's code while sync-process try to restart the died
> timeouted worker.
> Our solution is change supervisor's woker timeout from 30 to 5 seconds and
> nimbus remain 30. It works.
> ----------
> d2r: We have tried the same work-around from @xiaokang, but we still see the
> issue.
> ----------
> nathanmarz: @d2r This is fixed in the 0.8.3 wip, please try that out and let
> me know if it fixes the issue for you.
> ----------
> d2r: @nathanmarz Yes, pulling in the fix to supervisor.clj from 0.8.3 did
> resolve the issue for us. Thanks!
> ----------
> devoncrouse: Is this fix going to be applied to the newer builds (e.g. 0.9.0)
> as well?
> d5207b5
> ---------
> viceyang: We encountered the same problem, our version is 0.8.3, i read
> supervisor code, it seems not a bug, the key reason is
> "synchronize-supervisor" which responsible for download file and remove file
> thread and "sync-processes" which responsible for start worker process thread
> is Asynchronous.
> see this case: in synchronize-supervisor read assigment information from zk,
> supervisor download necessary file from nimbus and write local state . in
> aother thread sync-processes funciton read local state to launch workor
> process, when the worker process has not start ,synchronize-supervisor
> function is called again topology's assignment information has changed (cased
> by rebalance,or worker time out etc.) worker assignment to this supervisor
> has move to another supervisor, synchronize-supervisor remove the unnecessary
> file (jar file and ser file etc.) , after this, worker launched by "
> sync-processes" ,ser file was not exsit , this issue occur.
> Though in this issue exception occured but logic sees right. this excetion
> can't compelety Eliminate unless change thread Asynchronous feather.
> there is some way reduction the exception occur, @xiaokang metioned "change
> supervisor's woker timeout from 30 to 5 seconds and nimbus remain 30" this
> method redeuce nimbus reassign topology. another way change
> "synchronize-supervisor" thread loop time to a longger than 10(default time)
> sec, such as 30 sec。
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)