[ https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888669#comment-17888669 ]
Andrew Otto commented on MAPREDUCE-7331:
----------------------------------------

[The Wikimedia Foundation was just hit|https://phabricator.wikimedia.org/T376882] by this _temporary directory bug. We didn't quite realize, and have been losing small amounts of data for over a year because of it.

Having the same temp directory used by multiple jobs seems like a strange design decision, no? Is there something I'm missing? Is there any reason for this at all?

I suppose, if a job fails and its _temporary directory is not cleaned up, the next job will run and end up deleting _temporary. So I guess that's good? The downside is this bug though, which seems much worse than leftover per-job _temporary directories.

A configurable temp directory would certainly help, but why not just make the default temp directory unique per job, e.g. {{_temporary_{job_id}}}?

> Make temporary directory used by FileOutputCommitter configurable
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7331
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv2
>    Affects Versions: 3.0.0
>         Environment: CDH 6.2.1 Hadoop 3.0.0
>            Reporter: Bimalendu Choudhary
>            Priority: Major
>
> Spark SQL applications use FileOutputCommitter to commit and merge their files
> under a table directory. The hardcoded PENDING_DIR_NAME = _temporary
> directory results in multiple applications using the same temporary directory.
> This causes unwanted results, with one application interfering with another
> application's temporary files, or even ending up deleting them. There is no
> way right now for applications to have their own unique path to store
> temporary files and avoid interference from other, totally independent
> applications.
> I think the temporary directory being used by FileOutputCommitter should be
> made configurable, to let the caller supply its own unique value as required
> and avoid it getting deleted or overwritten by other applications.
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
> public static final String PENDING_DIR_NAME =
> "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>
> This can be used very efficiently by Spark applications to handle even stage
> failures, where temporary directories from previous attempts cause problems,
> and can help in many other situations.
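
To make the two options discussed here concrete, the following is a minimal sketch, not the actual FileOutputCommitter code. It uses a plain Map in place of a Hadoop Configuration, and the class name, method names, output path, and job IDs are all hypothetical. The configuration key is the one proposed in the issue description and is not known to exist in any Hadoop release.

```java
import java.util.HashMap;
import java.util.Map;

public class PendingDirSketch {
    // Default matches the directory name the issue says is currently hardcoded.
    public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
    // Configuration key proposed in the issue description (an assumption, not a real Hadoop key).
    public static final String PENDING_DIR_NAME = "mapreduce.fileoutputcommitter.tempdir";

    // Option 1: configurable name. Use the caller's value if set, else the default.
    public static String configuredPendingDir(Map<String, String> conf, String outputPath) {
        String name = conf.getOrDefault(PENDING_DIR_NAME, PENDING_DIR_NAME_DEFAULT);
        return outputPath + "/" + name;
    }

    // Option 2: unique per job by default, i.e. _temporary_{job_id}.
    public static String perJobPendingDir(String outputPath, String jobId) {
        return outputPath + "/" + PENDING_DIR_NAME_DEFAULT + "_" + jobId;
    }

    public static void main(String[] args) {
        String output = "/warehouse/table"; // hypothetical shared output directory
        Map<String, String> conf = new HashMap<>();

        // With no override, every job resolves to the same shared directory.
        System.out.println(configuredPendingDir(conf, output));

        // With an override, the caller is responsible for choosing a unique name.
        conf.put(PENDING_DIR_NAME, "_temporary_app1");
        System.out.println(configuredPendingDir(conf, output));

        // Per-job default: two concurrent jobs never share a pending directory,
        // so cleanup by one cannot delete the other's files.
        System.out.println(perJobPendingDir(output, "job_1"));
        System.out.println(perJobPendingDir(output, "job_2"));
    }
}
```

The trade-off is the one raised in the comment above: with per-job names, a failed job's leftover directory is no longer cleaned up by the next job writing to the same output path, so stale directories would need separate cleanup.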