There's likely another gotcha: various logs and job config files are written to the _logs directory under the output directory, so you'd need to uniquify that as well. There may be other traps, but I don't know them :)
This might be a bit of a frustrating endeavour, since you're trying to override behaviour that's been baked into Hadoop for a while. Why in particular do you need all your jobs to emit to a common directory? You could probably save yourself some headache by writing to subdirectories of a common dir, e.g., rather than having jobs 0..n write to /user/foo/commonoutput, just write to /user/foo/outputs/0, /user/foo/outputs/1, etc.

If you need to collect the various outputs together to use in a subsequent MR job, you can call FileInputFormat.addInputPath() multiple times on the various directories. Or you could modify your downstream logic to either recursively descend a level into the hierarchy, or use FileSystem.rename() to move the files from the different directories into a single aggregate directory after all the jobs have succeeded. (A rough sketch of both approaches follows the quoted message below.)

- Aaron

On Mon, Jul 20, 2009 at 11:51 AM, Thibaut_ <[email protected]> wrote:
>
> Hi,
>
> I'm trying to run a few parallel jobs which have the same input directory
> and the same output directory.
>
> I modified the FileInputClass to check for non-zero files, and also
> modified the output class to allow non-empty directories (the input
> directory = output directory in my case). I made sure that each job's
> output is unique, thus there are no file conflicts there.
>
> Everything runs fine for a while, but I'm having problems with the
> temporary directory:
> java.io.IOException: The temporary job-output directory
> hdfs://internal1:50010/user/root/0/_temporary doesn't exist!
>
> I could go further down and try to make the _temporary directory job
> dependent. But before I do that, I would like to know if there are other
> traps/errors I could run into when running parallel jobs having the same
> output/input directory?
>
> (Btw this is hadoop-0.20.0)
>
> Thanks,
> Thibaut
>
> --
> View this message in context:
> http://www.nabble.com/Running-parallel-jobs-having-the-same-output-directory-tp24575402p24575402.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
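
P.S. Here's a rough, untested sketch of what I mean, using the old org.apache.hadoop.mapred API (as in 0.20). The class name CollectOutputs, the /user/foo/outputs/<i> and /user/foo/commonoutput paths, and the numJobs parameter are just placeholders for whatever naming you end up choosing:

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CollectOutputs {

  // Option 1: feed the per-job output dirs into one downstream job by
  // adding each directory as an input path.
  public static JobConf downstreamConf(int numJobs) {
    JobConf conf = new JobConf(CollectOutputs.class);
    for (int i = 0; i < numJobs; i++) {
      // placeholder path scheme; substitute whatever your jobs write to
      FileInputFormat.addInputPath(conf, new Path("/user/foo/outputs/" + i));
    }
    // ... set mapper/reducer, output path, etc. before submitting
    return conf;
  }

  // Option 2: after all jobs have succeeded, rename their output files
  // into a single aggregate directory. Prefixing with the job index avoids
  // name clashes between the part-XXXXX files of different jobs.
  public static void mergeOutputs(JobConf conf, int numJobs) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path dest = new Path("/user/foo/commonoutput");
    fs.mkdirs(dest);
    for (int i = 0; i < numJobs; i++) {
      Path src = new Path("/user/foo/outputs/" + i);
      for (FileStatus stat : fs.listStatus(src)) {
        // you may want to skip the _logs subdirectory here rather than
        // moving it along with the data files
        fs.rename(stat.getPath(), new Path(dest, i + "-" + stat.getPath().getName()));
      }
    }
  }
}

Either way you avoid fighting the _temporary bookkeeping, since each job keeps its own output directory until you're done with it.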
