[ https://issues.apache.org/jira/browse/MAPREDUCE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832199#action_12832199 ]

Jim Finnessy commented on MAPREDUCE-1471:
-----------------------------------------

It is a pretty common use case for me, and I'm guessing others, to write a 
number of files to the same output directory from concurrent jobs. 

For instance, counting the number of occurrences of a daily event for an 
entire month.

Setting up the directory structure:
.../2008/07/

All of the count files for that month end up in that directory.

When running concurrent jobs 1 and 2, Hadoop creates the following temporary 
directories/files. I use a modified version of SequenceFileOutputFormat that 
allows me to name the output file what I want and to run multiple jobs with 
the same working path (a sketch of such a subclass follows below):
.../2008/07/_temporary/_attempt_local_0001_r_000000_0
.../2008/07/_temporary/_attempt_local_0002_r_000000_0
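
For reference, my subclass looks roughly like this. This is a simplified 
sketch, not my exact code, and the "custom.output.name" property is just an 
illustrative way to pass the desired file name in:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class NamedSequenceFileOutputFormat<K, V>
    extends SequenceFileOutputFormat<K, V> {

  @Override
  public Path getDefaultWorkFile(TaskAttemptContext context,
      String extension) throws IOException {
    FileOutputCommitter committer =
        (FileOutputCommitter) getOutputCommitter(context);
    // Use a caller-supplied name instead of the generated
    // part-r-NNNNN name. "custom.output.name" is illustrative,
    // not a standard Hadoop property.
    String name =
        context.getConfiguration().get("custom.output.name", "part");
    return new Path(committer.getWorkPath(), name + extension);
  }
}

With a unique name per job, two jobs can share .../2008/07/ safely right up 
until one of them runs cleanup.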

When job 1 completes, in order to clean up its temporary files, it removes
.../2008/07/_temporary/

This then blows away the temporary files for job 2.

I would say that normally this would not be a Hadoop problem, since it is only 
because I extended SequenceFileOutputFormat to allow multiple jobs to share 
the same working path (so long as the output file name is unique, which it is) 
that I hit it. However, it is the cleanupJob method in FileOutputCommitter 
that causes the problem, and since the committer field in FileOutputFormat is 
private, I cannot extend and replace the FileOutputCommitter with my own. 
Currently I have overridden the getOutputCommitter(context) method in my own 
FileOutputFormat to work around this (see the sketch below), but if the class 
ever starts accessing the committer field without going through that method, 
I'm in trouble again.
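
The workaround looks roughly like this (again a sketch; 
SafeFileOutputCommitter is the custom committer sketched after my suggestion 
below):

import java.io.IOException;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SafeSequenceFileOutputFormat<K, V>
    extends SequenceFileOutputFormat<K, V> {

  private OutputCommitter committer = null;

  // Bypass the private committer field in FileOutputFormat by
  // overriding the accessor and returning our own committer.
  @Override
  public synchronized OutputCommitter getOutputCommitter(
      TaskAttemptContext context) throws IOException {
    if (committer == null) {
      committer = new SafeFileOutputCommitter(
          getOutputPath(context), context);
    }
    return committer;
  }
}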

So....I'd really appreciate it if either the committer field in 
FileOutputFormat were made protected rather than private, so that client 
applications can override it with their own FileOutputCommitter, or jobs only 
cleaned up the temporary files that they created rather than recursively 
deleting the top-level directory (_temporary). A sketch of the latter follows.
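
To illustrate the second option, something along these lines would cover my 
case. This is just a sketch; the job-id matching is simplistic and assumes 
attempt directory names like the ones shown above:

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

public class SafeFileOutputCommitter extends FileOutputCommitter {

  private final Path outputPath;

  public SafeFileOutputCommitter(Path outputPath,
      TaskAttemptContext context) throws IOException {
    super(outputPath, context);
    this.outputPath = outputPath;
  }

  @Override
  public void cleanupJob(JobContext context) throws IOException {
    // Instead of recursively deleting working_path/_temporary,
    // delete only the attempt directories that belong to this job.
    Path tmpDir = new Path(outputPath, TEMP_DIR_NAME);
    FileSystem fs = tmpDir.getFileSystem(context.getConfiguration());
    // JobID.toString() gives e.g. "job_local_0001", while the attempt
    // dirs are named e.g. "_attempt_local_0001_r_000000_0", so match
    // on the part after the "job_" prefix. (Simplistic, but enough
    // to show the idea.)
    String jobPart = context.getJobID().toString().substring(4);
    if (fs.exists(tmpDir)) {
      for (FileStatus status : fs.listStatus(tmpDir)) {
        if (status.getPath().getName().contains(jobPart)) {
          fs.delete(status.getPath(), true);
        }
      }
    }
  }
}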

Thanks,
Jim



> FileOutputCommitter does not safely clean up its temporary files
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1471
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1471
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>            Reporter: Jim Finnessy
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When the FileOutputCommitter cleans up during its cleanupJob method, it 
> potentially deletes the temporary files of other concurrent jobs.
> Since the temporary files for all concurrent jobs are written to 
> working_path/_temporary/, any job that shares the same working_path will 
> remove the temporary files of all currently executing jobs when it removes 
> working_path/_temporary during job cleanup.
> If the output file name is guaranteed by the client application to be 
> unique, the temporary files/directories should also be guaranteed to be 
> unique to avoid this problem. I suggest modifying cleanupJob to only remove 
> files that it created itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
