James Xu created STORM-134:
------------------------------

             Summary: FileNotFoundException during supervisor cleanup due to 
non-existant file
                 Key: STORM-134
                 URL: https://issues.apache.org/jira/browse/STORM-134
             Project: Apache Storm (Incubating)
          Issue Type: Bug
            Reporter: James Xu


https://github.com/nathanmarz/storm/issues/356

During development runs with local clusters I've sometimes managed to encounter 
FileNotFoundException thrown from FileUtils.forceDelete during cleanup.

Without knowing the internals of Storm it feels a bit strange that clean up...
a) ...fails when trying to delete a non existing file/folder
b) ...tries to delete a non-existant file

Would suppressing FileNotFoundException from FileUtils.forceDelete or switching 
to FileUtils.deleteQuietly be reasonable?

Sample stack trace:

ERROR [2012-10-09 22:08:51,255] org.apache.zookeeper.server.NIOServerCnxn: 
Thread Thread[main,5,main] died
! java.io.FileNotFoundException: File does not exist: 
/var/folders/30/4f0z00f56sqfh512yr1l61ym0000gn/T/5844d75c-3801-4757-b7ea-f20ca76e4e52/workers/08a06e4f-4cbc-4e10-b013-ff4fcb4cbf51/heartbeats/1349813330619.version
! at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1386)
! at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1044)
! at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:977)
! at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381)
! at backtype.storm.util$rmr.invoke(util.clj:413)
! at 
backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
! at backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)

----------
MichaelBlume: Getting the same error while trying to test with complete-topology

Note: I'm using storm 0.8.1. Have reproduced on my mac and my ubuntu netbook. 
Can try with development version if desired.

Reproduced with development version. It's a really weird bug, because the code 
is actually checking for the existence of the path before attempting to delete 
it.

OK, I think I partially understand. I put some logging into rmr to log out a 
listing of the tree it's attempting to remove. I get this:

[clojure-test] 3501 [Thread-10] INFO backtype.storm.util - 
["/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats"
 
"/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990601844"
 
"/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990601844.version"
 
"/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990602801"
 
"/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990602801.version"]
[clojure-test] 3518 [main] INFO backtype.storm.util - 
["/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats"]

Looks like backtype.storm.util.rmr's being called on the same dir from two 
threads. The second call happens when the directory's just about empty, but not 
quite gone, but by the time rmr calls forceDelete, the directory's gone 
entirely and an exception gets raised. I've thrown a band-aid on this on my 
fork, but the proper solution would be to figure out why two threads are 
deleting the same directory.

----------
nathanmarz: Please attach a reproducible test case that I can run so I can look 
into what's going on. I run local cluster's all the time (Storm's unit tests 
bring clusters up and down dozens of times) and have never seen this.

----------
chids: I've not been able to create a reliable test case but the condition does 
occur "regularly". Right now I'm running the testBasicTopology code and got 
bitten by this a couple of times in a row. Then it stops happening.

@MichaelBlume's have you tried to identify what Thread-10 is?

----------
travis: yea, I'm seeing this regularly as well, not quite sure what the pattern 
is yet

I got stack traces of the two threads trying to rmr:

11073 [main] ERROR backtype.storm.util  - Rmr path 
/var/folders/ll/0csh8smd025d9mgh2nf842400000gn/T//88915881-f92a-44be-93c1-434d9be6ac28/workers/bf2411b5-830a-4549-85e8-76637dcf352b/heartbeats
java.lang.Exception
    at backtype.storm.util$rmr$fn__4117.invoke(util.clj:411)
    at backtype.storm.util$rmr.invoke(util.clj:411)
    at 
backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
    at 
backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)
    at 
backtype.storm.daemon.supervisor$fn__4733$exec_fn__1207__auto__$reify__4742.shutdown_all_workers(supervisor.clj:368)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
    at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:298)
    at backtype.storm.testing$kill_local_storm_cluster.invoke(testing.clj:154)
... (application specific stacktrack - let me know if it would be useful)
11073 [Thread-19] ERROR backtype.storm.util  - Rmr path 
/var/folders/ll/0csh8smd025d9mgh2nf842400000gn/T//88915881-f92a-44be-93c1-434d9be6ac28/workers/bf2411b5-830a-4549-85e8-76637dcf352b/heartbeats
java.lang.Exception
    at backtype.storm.util$rmr$fn__4117.invoke(util.clj:411)
    at backtype.storm.util$rmr.invoke(util.clj:411)
    at 
backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
    at 
backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)
    at 
backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:223)
    at clojure.lang.AFn.applyToHelper(AFn.java:161)
    at clojure.lang.AFn.applyTo(AFn.java:151)
    at clojure.core$apply.invoke(core.clj:603)
    at clojure.core$partial$fn__4070.doInvoke(core.clj:2343)
    at clojure.lang.RestFn.invoke(RestFn.java:397)
    at backtype.storm.event$event_manager$fn__2484.invoke(event.clj:24)
    at clojure.lang.AFn.run(AFn.java:24)
        at java.lang.Thread.run(Thread.java:680)
Still not quite sure what this all means, but getting closer. This happens 
pretty much every time I run tests on my mac and on our CI box.

Still no test case, but I did find a workaround. I was running the topology 
like:

(with-local-cluster [cluster]
    (complete-topology cluster
                       (my-topology drpc)
                       :mock-sources {"actions" (map vector actions)
                                      "drpc-requests" requests}))
Sleeping for 5 seconds before shutting down the cluster seems to fix it 
(shorter periods just made the error sporadic rather than constant:

(with-local-cluster [cluster]
  (let [results
        (complete-topology cluster
                           (my-topology drpc)
                           :mock-sources {"actions" (map vector actions)
                                          "drpc-requests" requests})]
    (Thread/sleep 5000)
    results))




--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

Reply via email to