[ https://issues.apache.org/jira/browse/CRUNCH-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved CRUNCH-575.
------------------------------
    Resolution: Duplicate

Oops, definitely a duplicate. Will continue there.

> DistributedPipeline temp dir choice can collide with itself
> -----------------------------------------------------------
>
>                 Key: CRUNCH-575
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-575
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.12.0
>            Reporter: Sean Owen
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: CRUNCH_575.patch
>
>
> We've observed that Crunch jobs can fail because the output temp dir already
> exists:
> {code}
> 2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
>     at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
> {code}
> One possible cause is the choice of random directory name, which is based on
> a random nonnegative 32-bit int. The chance of collision is more than 50% at
> about 55,000 temp dirs, which is not unimaginable.
> A suggested fix, at least for that theoretical cause, is to generate a much
> larger random value. 64 bits should put this firmly in the realm of extremely
> improbable (billions, not tens of thousands).
> (HT [~wilfreds] / CC [~tomwhite])
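
The collision figures quoted in the report follow from the standard birthday approximation, P(collision) ≈ 1 - exp(-n(n-1)/(2N)) for n names drawn uniformly from N possibilities. The sketch below is only an illustration of that arithmetic, not the attached patch: the class `TempDirCollision`, its methods, and the `crunch-` naming helper are hypothetical names chosen here, and the helper assumes a `SecureRandom` source, which the report does not specify.

```java
import java.security.SecureRandom;

// Hypothetical illustration class; not part of Crunch.
class TempDirCollision {

    // Birthday approximation: probability that at least two of n uniformly
    // random names collide when there are `space` possible names.
    static double collisionProbability(double n, double space) {
        return -Math.expm1(-n * (n - 1) / (2.0 * space));
    }

    // Approximate number of names at which the collision probability
    // reaches 50%: n ~ sqrt(2 * N * ln 2).
    static double namesForEvenOdds(double space) {
        return Math.sqrt(2.0 * space * Math.log(2.0));
    }

    // Hypothetical naming helper: draws a random long and clears the sign
    // bit, which leaves 63 random bits in a nonnegative value.
    static String randomDirName(SecureRandom rng) {
        long value = rng.nextLong() & Long.MAX_VALUE;
        return "crunch-" + value;
    }

    public static void main(String[] args) {
        double space31 = Math.pow(2, 31); // nonnegative 32-bit int
        double space64 = Math.pow(2, 64); // full 64-bit value

        System.out.printf("50%% point, 31 bits: %.0f names%n", namesForEvenOdds(space31));
        System.out.printf("50%% point, 64 bits: %.3e names%n", namesForEvenOdds(space64));
        System.out.printf("P(collision) at 55,000 names, 31 bits: %.4f%n",
                collisionProbability(55_000, space31));
        System.out.println("Example name: " + randomDirName(new SecureRandom()));
    }
}
```

Under these assumptions the 50% point for a nonnegative 32-bit int lands just under the 55,000 dirs mentioned in the report, and the 64-bit case lands in the billions. Note that a nonnegative `long` carries 63 random bits rather than 64, which lowers the 50% point somewhat but keeps it in the billions.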