[jira] [Updated] (STORM-133) FileNotFoundException during supervisor cleanup due to non-existant file

Rick Kellogg (JIRA) Thu, 08 Oct 2015 17:17:59 -0700

     [ 
https://issues.apache.org/jira/browse/STORM-133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rick Kellogg updated STORM-133:
-------------------------------
    Component/s: storm-core

> FileNotFoundException during supervisor cleanup due to non-existant file
> ------------------------------------------------------------------------
>
>                 Key: STORM-133
>                 URL: https://issues.apache.org/jira/browse/STORM-133
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>            Reporter: James Xu
>
> https://github.com/nathanmarz/storm/issues/356
> During development runs with local clusters I've sometimes managed to 
> encounter FileNotFoundException thrown from FileUtils.forceDelete during 
> cleanup.
> Without knowing the internals of Storm it feels a bit strange that clean up...
> a) ...fails when trying to delete a non existing file/folder
> b) ...tries to delete a non-existant file
> Would suppressing FileNotFoundException from FileUtils.forceDelete or 
> switching to FileUtils.deleteQuietly be reasonable?
> Sample stack trace:
> ERROR [2012-10-09 22:08:51,255] org.apache.zookeeper.server.NIOServerCnxn: 
> Thread Thread[main,5,main] died
> ! java.io.FileNotFoundException: File does not exist: 
> /var/folders/30/4f0z00f56sqfh512yr1l61ym0000gn/T/5844d75c-3801-4757-b7ea-f20ca76e4e52/workers/08a06e4f-4cbc-4e10-b013-ff4fcb4cbf51/heartbeats/1349813330619.version
> ! at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1386)
> ! at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1044)
> ! at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:977)
> ! at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381)
> ! at backtype.storm.util$rmr.invoke(util.clj:413)
> ! at 
> backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
> ! at 
> backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)
> ----------
> MichaelBlume: Getting the same error while trying to test with 
> complete-topology
> Note: I'm using storm 0.8.1. Have reproduced on my mac and my ubuntu netbook. 
> Can try with development version if desired.
> Reproduced with development version. It's a really weird bug, because the 
> code is actually checking for the existence of the path before attempting to 
> delete it.
> OK, I think I partially understand. I put some logging into rmr to log out a 
> listing of the tree it's attempting to remove. I get this:
> [clojure-test] 3501 [Thread-10] INFO backtype.storm.util - 
> ["/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats"
>  
> "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990601844"
>  
> "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990601844.version"
>  
> "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990602801"
>  
> "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990602801.version"]
> [clojure-test] 3518 [main] INFO backtype.storm.util - 
> ["/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats"]
> Looks like backtype.storm.util.rmr's being called on the same dir from two 
> threads. The second call happens when the directory's just about empty, but 
> not quite gone, but by the time rmr calls forceDelete, the directory's gone 
> entirely and an exception gets raised. I've thrown a band-aid on this on my 
> fork, but the proper solution would be to figure out why two threads are 
> deleting the same directory.
> ----------
> nathanmarz: Please attach a reproducible test case that I can run so I can 
> look into what's going on. I run local cluster's all the time (Storm's unit 
> tests bring clusters up and down dozens of times) and have never seen this.
> ----------
> chids: I've not been able to create a reliable test case but the condition 
> does occur "regularly". Right now I'm running the testBasicTopology code and 
> got bitten by this a couple of times in a row. Then it stops happening.
> @MichaelBlume's have you tried to identify what Thread-10 is?
> ----------
> travis: yea, I'm seeing this regularly as well, not quite sure what the 
> pattern is yet
> I got stack traces of the two threads trying to rmr:
> 11073 [main] ERROR backtype.storm.util  - Rmr path 
> /var/folders/ll/0csh8smd025d9mgh2nf842400000gn/T//88915881-f92a-44be-93c1-434d9be6ac28/workers/bf2411b5-830a-4549-85e8-76637dcf352b/heartbeats
> java.lang.Exception
>     at backtype.storm.util$rmr$fn__4117.invoke(util.clj:411)
>     at backtype.storm.util$rmr.invoke(util.clj:411)
>     at 
> backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
>     at 
> backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)
>     at 
> backtype.storm.daemon.supervisor$fn__4733$exec_fn__1207__auto__$reify__4742.shutdown_all_workers(supervisor.clj:368)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
>     at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:298)
>     at backtype.storm.testing$kill_local_storm_cluster.invoke(testing.clj:154)
> ... (application specific stacktrack - let me know if it would be useful)
> 11073 [Thread-19] ERROR backtype.storm.util  - Rmr path 
> /var/folders/ll/0csh8smd025d9mgh2nf842400000gn/T//88915881-f92a-44be-93c1-434d9be6ac28/workers/bf2411b5-830a-4549-85e8-76637dcf352b/heartbeats
> java.lang.Exception
>     at backtype.storm.util$rmr$fn__4117.invoke(util.clj:411)
>     at backtype.storm.util$rmr.invoke(util.clj:411)
>     at 
> backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
>     at 
> backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)
>     at 
> backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:223)
>     at clojure.lang.AFn.applyToHelper(AFn.java:161)
>     at clojure.lang.AFn.applyTo(AFn.java:151)
>     at clojure.core$apply.invoke(core.clj:603)
>     at clojure.core$partial$fn__4070.doInvoke(core.clj:2343)
>     at clojure.lang.RestFn.invoke(RestFn.java:397)
>     at backtype.storm.event$event_manager$fn__2484.invoke(event.clj:24)
>     at clojure.lang.AFn.run(AFn.java:24)
>         at java.lang.Thread.run(Thread.java:680)
> Still not quite sure what this all means, but getting closer. This happens 
> pretty much every time I run tests on my mac and on our CI box.
> Still no test case, but I did find a workaround. I was running the topology 
> like:
> (with-local-cluster [cluster]
>     (complete-topology cluster
>                        (my-topology drpc)
>                        :mock-sources {"actions" (map vector actions)
>                                       "drpc-requests" requests}))
> Sleeping for 5 seconds before shutting down the cluster seems to fix it 
> (shorter periods just made the error sporadic rather than constant:
> (with-local-cluster [cluster]
>   (let [results
>         (complete-topology cluster
>                            (my-topology drpc)
>                            :mock-sources {"actions" (map vector actions)
>                                           "drpc-requests" requests})]
>     (Thread/sleep 5000)
>     results))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (STORM-133) FileNotFoundException during supervisor cleanup due to non-existant file

Reply via email to