Ben Roling created CRUNCH-515:
---------------------------------

             Summary: Decrease probability of collision on Crunch temp directories
                 Key: CRUNCH-515
                 URL: https://issues.apache.org/jira/browse/CRUNCH-515
             Project: Crunch
          Issue Type: Improvement
          Components: Core
    Affects Versions: 0.11.0, 0.8.4
            Reporter: Ben Roling
            Assignee: Josh Wills


I've heard reports of Crunch pipelines failing at our organization because of 
collisions on temp directories.

The following stack trace, from an old internal email thread I dug up, is an 
example:

{noformat}
2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:1013)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:974)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:394)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:974)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:582)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:340)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:277)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:316)
    at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:113)
    at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:55)
    at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:84)
    at java.lang.Thread.run(Thread.java:682)
{noformat}

What we found in this case is that the pre-existing directory was rather old.  
It hung around because we're doing a poor job of cleaning old garbage out of 
our HDFS /tmp directory.  We intend to set up a job to delete anything older 
than a couple of weeks or so from /tmp, but I think the chance of a collision 
will remain high enough that failures like this could still happen on occasion.

The temp directory name Crunch chooses is built from a random 31-bit value:
https://github.com/apache/crunch/blob/apache-crunch-0.11.0/crunch-core/src/main/java/org/apache/crunch/impl/dist/DistributedPipeline.java#L326

I say 31-bit value because it comes from a 32-bit random integer but only 
includes non-negative values, thereby excluding 1 bit.
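
As a rough illustration of where the lost bit goes, here is a minimal sketch 
that assumes the name is produced by clearing the sign bit of a random int. The 
class and method names are hypothetical and this is not copied from the Crunch 
source linked above:

```java
import java.util.Random;

public class TempDirNameSketch {
    // Hypothetical helper, for illustration only: builds a "crunch-<n>" name
    // from a random int with its sign bit cleared.
    static String tempDirName(Random random) {
        // nextInt() yields 32 random bits; masking with Integer.MAX_VALUE
        // clears the sign bit, leaving a value in [0, 2^31 - 1].
        int value = random.nextInt() & Integer.MAX_VALUE;
        return "crunch-" + value;
    }

    public static void main(String[] args) {
        System.out.println(tempDirName(new Random()));
    }
}
```

Under that assumption there are only 2^31 possible names, half as many as a 
full 32-bit value would give.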

The following blog post shows some probabilities for 32-bit hash collisions, 
which are essentially the same problem:
http://preshing.com/20110504/hash-collision-probabilities/

Since we're dealing with 31 bits instead of 32, the probabilities will be 
higher than those given there for 32 bits.  Even with 32 bits the probability 
of collision is 1 in 100 with just 9292 values.
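
To make the comparison concrete, the sketch below uses the standard birthday 
approximation p ≈ 1 - exp(-k(k-1)/(2N)) for k values drawn uniformly from N 
possibilities. The class and method names are made up for illustration, and the 
calculation assumes the values are uniformly random and independent:

```java
public class CollisionOdds {
    // Birthday approximation: probability that at least two of k uniformly
    // random values drawn from n possibilities are equal.
    static double collisionProbability(long k, double n) {
        double exponent = -((double) k * (k - 1)) / (2.0 * n);
        return -Math.expm1(exponent); // 1 - exp(exponent), computed stably
    }

    public static void main(String[] args) {
        double n32 = Math.pow(2, 32);
        double n31 = Math.pow(2, 31);
        long k = 9292;
        System.out.printf("k=%d  32-bit: %.4f  31-bit: %.4f%n",
                k, collisionProbability(k, n32), collisionProbability(k, n31));
    }
}
```

For 9292 values this gives roughly 1% for 32 bits and roughly 2% for 31 bits, 
so dropping one bit approximately doubles the odds at this scale.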

I have not done any thorough investigation to understand why, but in our 
production environment we have a lot of Crunch jobs and we are leaving 200-300 
stray Crunch temp directories per day.  Depending on how aggressive we get with 
a scheduled job to clean old directories out of /tmp, we could still have a 
realistic chance of hitting a collision.

My proposal is to change the random integer component of the temp path to a 
UUID or something similar to make it drastically more unlikely that a collision 
will ever occur regardless of whether or not "/tmp" is ever cleaned up.
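
A minimal sketch of what the proposal could look like, assuming the existing 
"crunch-" prefix is kept and only the random component changes. The class and 
method names are hypothetical; this is not a patch against 
DistributedPipeline:

```java
import java.util.UUID;

public class UuidTempDirSketch {
    // Hypothetical helper, for illustration only: names the temp directory
    // with a random (version 4) UUID instead of a 31-bit integer.
    static String tempDirName() {
        // UUID.randomUUID() produces a type 4 UUID with 122 random bits.
        return "crunch-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(tempDirName());
    }
}
```

With 122 random bits instead of 31, a collision stays extremely unlikely even 
if stray directories accumulate in /tmp indefinitely.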



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
