Sean Owen created CRUNCH-575:
--------------------------------

             Summary: DistributedPipeline temp dir choice can collide with itself
                 Key: CRUNCH-575
                 URL: https://issues.apache.org/jira/browse/CRUNCH-575
             Project: Crunch
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.12.0
            Reporter: Sean Owen
            Assignee: Josh Wills
            Priority: Minor


We've observed that Crunch jobs can fail because the output temp dir already 
exists:

{code}
2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
{code}

One possible cause is the choice of random directory name, which is based on a 
random nonnegative 32-bit int and so has only 2^31 possible values. By the 
birthday bound, the chance of at least one collision exceeds 50% at about 
55,000 temp dirs, which is not unimaginable.
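
As a rough check of that figure, here is a minimal sketch of the standard 
birthday approximation. It assumes directory names are drawn uniformly and 
independently from 2^31 values; the class and method names are illustrative 
only and are not part of Crunch.

```java
final class CollisionEstimate {

    /**
     * Approximate probability that at least two of n uniform, independent
     * draws from a space of the given size coincide:
     * p ~= 1 - exp(-n(n-1) / (2 * space)).
     */
    static double collisionProbability(long n, double space) {
        double exponent = ((double) n * (n - 1)) / (2.0 * space);
        return -Math.expm1(-exponent); // equals 1 - e^(-exponent)
    }

    /**
     * Approximate number of draws at which the collision probability
     * reaches p: n ~= sqrt(2 * space * ln(1 / (1 - p))).
     */
    static double drawsForProbability(double p, double space) {
        return Math.sqrt(2.0 * space * Math.log(1.0 / (1.0 - p)));
    }

    public static void main(String[] args) {
        double space31 = Math.pow(2, 31); // nonnegative 32-bit int
        double space63 = Math.pow(2, 63); // nonnegative 64-bit long
        System.out.printf("2^31 values: 50%% collision chance near %.0f dirs%n",
                drawsForProbability(0.5, space31));
        System.out.printf("2^63 values: 50%% collision chance near %.3e dirs%n",
                drawsForProbability(0.5, space63));
    }
}
```

Under these assumptions the 2^31 case lands in the mid-54,000s, consistent 
with the "about 55,000" estimate, while the 2^63 case is in the billions.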

A suggested fix, at least for that theoretical cause, is to generate a much 
larger random value. 64 bits should put this firmly in the realm of the 
extremely improbable: a 50% chance of collision would then take billions of 
temp dirs, not tens of thousands.
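
One way the suggestion could look, as a hypothetical sketch rather than the 
actual Crunch code or the eventual patch: draw a random long and clear its 
sign bit so the name stays a nonnegative decimal number like the existing 
"crunch-686245394". The class name, method name, and use of SecureRandom are 
all assumptions made for illustration. Clearing the sign bit leaves 63 random 
bits rather than 64, which is still in the billions range.

```java
import java.security.SecureRandom;

final class TempDirNames {

    // Hypothetical helper; SecureRandom is chosen here only for illustration.
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns prefix + "-" + a random nonnegative long (63 random bits). */
    static String randomDirName(String prefix) {
        // Masking with Long.MAX_VALUE clears the sign bit, so the value
        // is always nonnegative, including when nextLong() returns MIN_VALUE.
        long value = RANDOM.nextLong() & Long.MAX_VALUE;
        return prefix + "-" + value;
    }
}
```

A larger name space only addresses the random-collision cause; if the 
directory already exists for some other reason, the job would still fail the 
same way.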

(HT [~wilfreds] / CC [~tomwhite])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)