[jira] [Commented] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable

Andrew Otto (Jira) Fri, 11 Oct 2024 07:16:35 -0700


    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888669#comment-17888669
 ]


Andrew Otto commented on MAPREDUCE-7331:
----------------------------------------

[The Wikimedia Foundation was just 
hit|https://phabricator.wikimedia.org/T376882] by this _temporary directory 
bug.  We didn't notice, and have been losing small amounts of data for 
over a year because of it.

Having the same temp directory used by multiple jobs seems like a strange 
design decision, no?  Is there something I'm missing? Is there any reason for 
this at all?

I suppose, if a job fails and its _temporary directory is not cleaned up, the 
next job will run and end up deleting _temporary. So I guess that's good?  The 
downside is this bug though, which seems much worse than leftover per-job 
_temporary directories.

 

A configurable temp directory would certainly help, but why not just make the 
default temp directory unique per job, e.g. {{_temporary_{job_id}}}?
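
To make the failure mode concrete, here is a minimal sketch that only simulates the directory layout on the local filesystem. It does not use Hadoop at all. The class PendingDirDemo, its helpers, and the job IDs are hypothetical names invented for illustration, and the cleanup step is a simplification of what the committer does when a job finishes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PendingDirDemo {
    // Hypothetical helper: the shared name (current behaviour) versus a
    // per-job name (the proposal in this comment).
    static Path pendingDir(Path outputDir, String jobId, boolean perJob) {
        return outputDir.resolve(perJob ? "_temporary_" + jobId : "_temporary");
    }

    // Recursively delete a directory, standing in for job cleanup.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Job B writes a pending file, then job A finishes and cleans up its own
    // pending directory. Returns true if job B's file is still there.
    static boolean jobBSurvivesCleanupOfJobA(boolean perJob) throws IOException {
        Path out = Files.createTempDirectory("table");
        try {
            Path a = pendingDir(out, "job_A", perJob);
            Path b = pendingDir(out, "job_B", perJob);
            Files.createDirectories(a);
            Files.createDirectories(b);
            Path bFile = Files.write(b.resolve("part-00000"), new byte[] {1});
            deleteRecursively(a); // job A's cleanup
            return Files.exists(bFile);
        } finally {
            deleteRecursively(out);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("shared _temporary:   " + jobBSurvivesCleanupOfJobA(false));
        System.out.println("per-job _temporary_: " + jobBSurvivesCleanupOfJobA(true));
    }
}
```

With the shared name, both jobs resolve to the same directory, so job A's cleanup removes job B's pending file. With per-job names the two directories are disjoint and job B's file survives.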

> Make temporary directory used by FileOutputCommitter configurable
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7331
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv2
>    Affects Versions: 3.0.0
>         Environment: CDH 6.2.1 Hadoop 3.0.0
>            Reporter: Bimalendu Choudhary
>            Priority: Major
>
> Spark SQL applications use FileOutputCommitter to commit and merge their files 
> under a table directory. The hardcoded PENDING_DIR_NAME = _temporary 
> directory results in multiple applications using the same temporary directory. 
> This causes unwanted results, with one application interfering with other 
> applications' temporary files, or even deleting the temporary files of 
> another. There is currently no way for an application to have its own 
> unique path for temporary files and so avoid interference from 
> other, totally independent applications. I think the temporary directory 
> used by FileOutputCommitter should be made configurable, so that the 
> caller can supply its own unique value as required and keep its files from 
> being deleted or overwritten by other applications. 
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
>  public static final String PENDING_DIR_NAME =
>  "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>  
> Spark applications could use this to handle stage failures, where temporary 
> directories left by previous attempts cause problems, and it would help in 
> many other situations. 
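
As a rough illustration of the configurable name described in the quoted issue, the lookup could fall back to the current default when the key is unset. This is only a sketch under stated assumptions: a plain Map stands in for the Hadoop configuration object, the class PendingDirConfig and method pendingDirName are hypothetical, the key is the one proposed in the issue rather than a confirmed Hadoop property, and rejecting path separators is a design choice of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

public class PendingDirConfig {
    // Default directory name, as in the quoted proposal.
    static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
    // Configuration key proposed in the issue (an assumption here, not a
    // confirmed Hadoop property).
    static final String PENDING_DIR_NAME = "mapreduce.fileoutputcommitter.tempdir";

    // Resolve the pending directory name from a configuration map,
    // falling back to the default when the key is unset or blank.
    static String pendingDirName(Map<String, String> conf) {
        String value = conf.get(PENDING_DIR_NAME);
        if (value == null || value.trim().isEmpty()) {
            return PENDING_DIR_NAME_DEFAULT;
        }
        String name = value.trim();
        // Sketch-only rule: the name must be a single path component.
        if (name.contains("/")) {
            throw new IllegalArgumentException(
                PENDING_DIR_NAME + " must not contain '/': " + name);
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(pendingDirName(conf)); // prints _temporary

        conf.put(PENDING_DIR_NAME, "_temporary_app_1234");
        System.out.println(pendingDirName(conf)); // prints _temporary_app_1234
    }
}
```

A caller that wants isolation would set the key to a value unique to its application or job before creating the committer. Jobs that leave the key unset would keep today's behaviour.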



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

