[ https://issues.apache.org/jira/browse/CRUNCH-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521506#comment-14521506 ]
Ben Roling commented on CRUNCH-515:
-----------------------------------

{quote}if the cleanup jobs that folks were using expected to see the string "crunch-" followed by some number of digits, wouldn't the UUID string (which is hex IIRC) cause their cleanup scripts to miss some directories b/c the pattern wouldn't match?{quote}

The /tmp cleanup I intend for us to be doing in our clusters will be more general even than just "crunch-" cleanup. I'm expecting we will delete _anything_ in /tmp older than X days. That said, you're right that it is possible there are some other crunch consumers that have set up specific enough /tmp/crunch- cleanup patterns to be broken by this change. My guess is that risk is relatively small but that is just my opinion.

{quote} any idea if those stray Crunch dirs are being left around by successful jobs, or jobs that have crashed {quote}

We don't have nearly as many failed or killed jobs as we have stray /tmp/crunch-* directories so most of them must be from successful jobs. I don't see any obvious way for me to trace back from the stray directories to the jobs that created them to be able to do analytics to identify which jobs are leaving behind the most stray directories. I suppose it _might_ be possible with searches of the job logs but I don't have an easy mechanism available to me to search across all of those at the moment. I can look into it a bit more though.

I talked with [~mkwhitacre] about this previously and he had a theory about some activity occurring after pipeline.done() that was resulting in the temp dirs being left behind. I don't remember the specifics of the theory and I haven't had a chance to try to validate it yet.
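To make the pattern concern from the first quoted question concrete, here is a minimal sketch. Both regular expressions are hypothetical: they stand in for whatever a site-specific cleanup script might use, not for any known script. The "crunch-" + UUID name is also only an assumption about what the new format could look like; the attached patch may format it differently.

```java
import java.util.UUID;
import java.util.regex.Pattern;

class CleanupPatternCheck {
    // Hypothetical strict pattern: "crunch-" followed only by decimal digits.
    static final Pattern DIGITS_ONLY = Pattern.compile("crunch-\\d+");

    // Hypothetical looser pattern: also accepts hex letters and dashes.
    static final Pattern LOOSE = Pattern.compile("crunch-[0-9a-fA-F-]+");

    // True only if the whole directory name matches the pattern.
    static boolean matches(Pattern pattern, String dirName) {
        return pattern.matcher(dirName).matches();
    }

    public static void main(String[] args) {
        // Directory name taken from the stack trace in the issue description.
        String current = "crunch-686245394";
        // Assumed UUID-based name; the real format depends on the patch.
        String uuidBased = "crunch-" + UUID.randomUUID();

        for (String name : new String[] {current, uuidBased}) {
            System.out.println(name
                    + " digitsOnly=" + matches(DIGITS_ONLY, name)
                    + " loose=" + matches(LOOSE, name));
        }
    }
}
```

A digits-only pattern can never match the UUID form, because the string form of a UUID always contains dashes. A blanket "older than X days" cleanup does not depend on the name at all.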
> Decrease probability of collision on Crunch temp directories
> ------------------------------------------------------------
>
>                 Key: CRUNCH-515
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-515
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.4, 0.11.0
>            Reporter: Ben Roling
>            Assignee: Josh Wills
>         Attachments: CRUNCH-515-1.patch
>
>
> I've heard reports of failures of Crunch pipelines at our organization due to
> collision on temp directories.
> Take the following stack trace from an old internal email thread I dug up as
> an example:
> {noformat}
> 2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
>     at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:1013)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:974)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:394)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:974)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:582)
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:340)
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:277)
>     at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:316)
>     at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:113)
>     at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:55)
>     at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:84)
>     at java.lang.Thread.run(Thread.java:682)
> {noformat}
> What we found in this case is the pre-existing directory was rather old. It
> hung around because we're doing a poor job of cleaning old garbage out of our
> HDFS /tmp directory. We intend to set up a job to delete stuff older than a
> couple of weeks or so out of /tmp but I think the chances of a collision will
> still be high enough that failures like this might still happen on occasion.
> The temp directory Crunch chooses is a random 31-bit value:
> https://github.com/apache/crunch/blob/apache-crunch-0.11.0/crunch-core/src/main/java/org/apache/crunch/impl/dist/DistributedPipeline.java#L326
> I say 31 bit value because it comes from a 32-bit random integer but only
> includes positive values, thereby excluding 1 bit.
> The following blog post shows some probabilities for 32-bit hash collisions,
> which are essentially the same problem:
> http://preshing.com/20110504/hash-collision-probabilities/
> Since we're dealing with 31 bits instead of 32 the probabilities will be
> higher than expressed there for 32 bits. Even with 32 bits the probability
> of collision is 1 in 100 with just 9292 values.
> I have not done any thorough investigation to understand why, but in our
> production environment we have a lot of Crunch jobs and we are leaving
> 200-300 stray Crunch temp directories per day. Depending on how aggressive
> we get with a scheduled job to clean old stuff out of temp we could still
> have a realistic chance of hitting a collision.
> My proposal is to change the random integer component of the temp path to a
> UUID or something similar to make it drastically more unlikely that a
> collision will ever occur regardless of whether or not "/tmp" is ever cleaned
> up.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
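
For anyone who wants to check the collision numbers in the issue description, here is a rough sketch using the standard birthday-bound approximation p ~= 1 - exp(-n(n-1) / (2N)), where n is the number of directories and N is the size of the name space. The class and method names are made up for illustration. The directory counts are assumptions derived from the "200-300 per day" and "couple of weeks" figures above, not measurements. The 122-bit row assumes a version 4 (random) UUID, which carries 122 random bits.

```java
class TempDirCollision {
    /**
     * Approximate probability that at least two of n uniformly random
     * names drawn from a space of the given size collide.
     */
    static double collisionProbability(long n, double spaceSize) {
        double exponent = -((double) n * (n - 1)) / (2.0 * spaceSize);
        // -expm1(x) equals 1 - e^x and stays accurate when x is tiny.
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        // 9292 is the count cited from the blog post; the others assume
        // 300 stray directories per day kept for two weeks or one year.
        long[] counts = {9292L, 300L * 14, 300L * 365};
        int[] bits = {31, 32, 122};

        for (long n : counts) {
            for (int b : bits) {
                double p = collisionProbability(n, Math.pow(2, b));
                System.out.printf("n=%d bits=%d p=%.3e%n", n, b, p);
            }
        }
    }
}
```

With n = 9292 and 32 bits this reproduces the roughly 1-in-100 figure quoted from the blog post, and the 31-bit probability for the same n comes out about twice as high.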