[
https://issues.apache.org/jira/browse/STORM-133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Kellogg updated STORM-133:
-------------------------------
Component/s: storm-core
> FileNotFoundException during supervisor cleanup due to non-existant file
> ------------------------------------------------------------------------
>
> Key: STORM-133
> URL: https://issues.apache.org/jira/browse/STORM-133
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Reporter: James Xu
>
> https://github.com/nathanmarz/storm/issues/356
> During development runs with local clusters I've sometimes managed to
> encounter FileNotFoundException thrown from FileUtils.forceDelete during
> cleanup.
> Without knowing the internals of Storm it feels a bit strange that clean up...
> a) ...fails when trying to delete a non existing file/folder
> b) ...tries to delete a non-existant file
> Would suppressing FileNotFoundException from FileUtils.forceDelete or
> switching to FileUtils.deleteQuietly be reasonable?
> Sample stack trace:
> ERROR [2012-10-09 22:08:51,255] org.apache.zookeeper.server.NIOServerCnxn:
> Thread Thread[main,5,main] died
> ! java.io.FileNotFoundException: File does not exist:
> /var/folders/30/4f0z00f56sqfh512yr1l61ym0000gn/T/5844d75c-3801-4757-b7ea-f20ca76e4e52/workers/08a06e4f-4cbc-4e10-b013-ff4fcb4cbf51/heartbeats/1349813330619.version
> ! at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1386)
> ! at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1044)
> ! at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:977)
> ! at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381)
> ! at backtype.storm.util$rmr.invoke(util.clj:413)
> ! at
> backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
> ! at
> backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)
> ----------
> MichaelBlume: Getting the same error while trying to test with
> complete-topology
> Note: I'm using storm 0.8.1. Have reproduced on my mac and my ubuntu netbook.
> Can try with development version if desired.
> Reproduced with development version. It's a really weird bug, because the
> code is actually checking for the existence of the path before attempting to
> delete it.
> OK, I think I partially understand. I put some logging into rmr to log out a
> listing of the tree it's attempting to remove. I get this:
> [clojure-test] 3501 [Thread-10] INFO backtype.storm.util -
> ["/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats"
>
> "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990601844"
>
> "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990601844.version"
>
> "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990602801"
>
> "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990602801.version"]
> [clojure-test] 3518 [main] INFO backtype.storm.util -
> ["/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats"]
> Looks like backtype.storm.util.rmr's being called on the same dir from two
> threads. The second call happens when the directory's just about empty, but
> not quite gone, but by the time rmr calls forceDelete, the directory's gone
> entirely and an exception gets raised. I've thrown a band-aid on this on my
> fork, but the proper solution would be to figure out why two threads are
> deleting the same directory.
> ----------
> nathanmarz: Please attach a reproducible test case that I can run so I can
> look into what's going on. I run local cluster's all the time (Storm's unit
> tests bring clusters up and down dozens of times) and have never seen this.
> ----------
> chids: I've not been able to create a reliable test case but the condition
> does occur "regularly". Right now I'm running the testBasicTopology code and
> got bitten by this a couple of times in a row. Then it stops happening.
> @MichaelBlume's have you tried to identify what Thread-10 is?
> ----------
> travis: yea, I'm seeing this regularly as well, not quite sure what the
> pattern is yet
> I got stack traces of the two threads trying to rmr:
> 11073 [main] ERROR backtype.storm.util - Rmr path
> /var/folders/ll/0csh8smd025d9mgh2nf842400000gn/T//88915881-f92a-44be-93c1-434d9be6ac28/workers/bf2411b5-830a-4549-85e8-76637dcf352b/heartbeats
> java.lang.Exception
> at backtype.storm.util$rmr$fn__4117.invoke(util.clj:411)
> at backtype.storm.util$rmr.invoke(util.clj:411)
> at
> backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
> at
> backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)
> at
> backtype.storm.daemon.supervisor$fn__4733$exec_fn__1207__auto__$reify__4742.shutdown_all_workers(supervisor.clj:368)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
> at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:298)
> at backtype.storm.testing$kill_local_storm_cluster.invoke(testing.clj:154)
> ... (application specific stacktrack - let me know if it would be useful)
> 11073 [Thread-19] ERROR backtype.storm.util - Rmr path
> /var/folders/ll/0csh8smd025d9mgh2nf842400000gn/T//88915881-f92a-44be-93c1-434d9be6ac28/workers/bf2411b5-830a-4549-85e8-76637dcf352b/heartbeats
> java.lang.Exception
> at backtype.storm.util$rmr$fn__4117.invoke(util.clj:411)
> at backtype.storm.util$rmr.invoke(util.clj:411)
> at
> backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
> at
> backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)
> at
> backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:223)
> at clojure.lang.AFn.applyToHelper(AFn.java:161)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at clojure.core$apply.invoke(core.clj:603)
> at clojure.core$partial$fn__4070.doInvoke(core.clj:2343)
> at clojure.lang.RestFn.invoke(RestFn.java:397)
> at backtype.storm.event$event_manager$fn__2484.invoke(event.clj:24)
> at clojure.lang.AFn.run(AFn.java:24)
> at java.lang.Thread.run(Thread.java:680)
> Still not quite sure what this all means, but getting closer. This happens
> pretty much every time I run tests on my mac and on our CI box.
> Still no test case, but I did find a workaround. I was running the topology
> like:
> (with-local-cluster [cluster]
> (complete-topology cluster
> (my-topology drpc)
> :mock-sources {"actions" (map vector actions)
> "drpc-requests" requests}))
> Sleeping for 5 seconds before shutting down the cluster seems to fix it
> (shorter periods just made the error sporadic rather than constant:
> (with-local-cluster [cluster]
> (let [results
> (complete-topology cluster
> (my-topology drpc)
> :mock-sources {"actions" (map vector actions)
> "drpc-requests" requests})]
> (Thread/sleep 5000)
> results))
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)