[
https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888669#comment-17888669
]
Andrew Otto commented on MAPREDUCE-7331:
----------------------------------------
[The Wikimedia Foundation was just
hit|https://phabricator.wikimedia.org/T376882] by this _temporary directory
bug. We didn't quite realize, and have been losing small amounts of data for
over a year because of it.
Having the same temp directory used by multiple jobs seems like a strange
design decision, no? Is there something I'm missing? Is there any reason for
this at all?
I suppose, if a job fails and its _temporary directory is not cleaned up, the
next job will run and end up deleting _temporary. So I guess that's good? The
downside is this bug though, which seems much worse than leftover per-job
_temporary directories.
A configurable temp directory would certainly help, but why not just make the
default temp directory unique per job, e.g. {{_temporary_{job_id}}}?
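To make the difference concrete, here is a minimal sketch of the two naming schemes. It is not Hadoop code: the class name, both helper methods, the output path, and the job IDs are hypothetical, and it uses plain java.nio paths rather than Hadoop's Path, only to show why a shared name collides and a per-job name does not.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PendingDirSketch {
    // Same literal as the hardcoded name quoted in the issue description.
    static final String PENDING_DIR_NAME = "_temporary";

    // Shared scheme: the path depends only on the output directory,
    // so every job writing there resolves to the same directory.
    static Path sharedPendingDir(Path outputDir) {
        return outputDir.resolve(PENDING_DIR_NAME);
    }

    // Hypothetical per-job scheme: the job ID is appended to the name,
    // so two jobs with different IDs never resolve to the same directory.
    static Path perJobPendingDir(Path outputDir, String jobId) {
        return outputDir.resolve(PENDING_DIR_NAME + "_" + jobId);
    }

    public static void main(String[] args) {
        Path out = Paths.get("/warehouse/table"); // illustrative output dir
        String jobA = "job_0001";                 // illustrative job IDs
        String jobB = "job_0002";

        // Both jobs get the same shared path, so cleanup by one job
        // can delete files the other is still writing.
        boolean sharedCollides =
            sharedPendingDir(out).equals(sharedPendingDir(out));
        System.out.println("shared paths collide: " + sharedCollides);

        // Per-job paths differ, so cleanup is scoped to one job.
        boolean perJobCollides =
            perJobPendingDir(out, jobA).equals(perJobPendingDir(out, jobB));
        System.out.println("per-job paths collide: " + perJobCollides);
    }
}
```

The trade-off is the one mentioned above: with per-job names, a failed job's directory is no longer removed as a side effect of the next run, so it would need its own cleanup.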
> Make temporary directory used by FileOutputCommitter configurable
> -----------------------------------------------------------------
>
> Key: MAPREDUCE-7331
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: mrv2
> Affects Versions: 3.0.0
> Environment: CDH 6.2.1 Hadoop 3.0.0
> Reporter: Bimalendu Choudhary
> Priority: Major
>
> Spark SQL applications use FileOutputCommitter to commit and merge their files
> under a table directory. The hardcoded PENDING_DIR_NAME = _temporary
> directory results in multiple applications using the same temporary directory.
> This causes one application to interfere with another application's
> temporary files, and one application can end up deleting the temporary
> files of another. There is currently no way for applications to use
> their own unique path for temporary files and so avoid interference from
> other, totally independent applications. I think the temporary directory
> used by FileOutputCommitter should be made configurable, so that the
> caller can supply its own unique value as required and keep its files from
> being deleted or overwritten by other applications.
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
> public static final String PENDING_DIR_NAME =
> "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>
> This can be used very efficiently by Spark applications to handle even stage
> failures where temporary directories from previous attempts cause problems,
> and can help in many other situations.
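For illustration, the lookup proposed in the issue description could look roughly like the sketch below. The property key is the one proposed in the issue and is not an existing Hadoop setting as far as this thread shows. The class name, the helper method, and the constant name PENDING_DIR_NAME for the key are assumptions, and a plain Map stands in for Hadoop's Configuration so the example is self-contained.

```java
import java.util.HashMap;
import java.util.Map;

public class TempDirConfigSketch {
    // Default keeps today's behaviour when nothing is configured.
    static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
    // Key proposed in the issue; the constant name is an assumption.
    static final String PENDING_DIR_NAME =
        "mapreduce.fileoutputcommitter.tempdir";

    // Hypothetical helper: returns the configured temp directory name,
    // falling back to the default. The Map is a stand-in for a real
    // job configuration object.
    static String pendingDirName(Map<String, String> conf) {
        return conf.getOrDefault(PENDING_DIR_NAME, PENDING_DIR_NAME_DEFAULT);
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        System.out.println(pendingDirName(defaults)); // prints "_temporary"

        // A caller that wants isolation sets its own unique name.
        Map<String, String> custom = new HashMap<>();
        custom.put(PENDING_DIR_NAME, "_temporary_app_42"); // illustrative value
        System.out.println(pendingDirName(custom)); // prints "_temporary_app_42"
    }
}
```

Falling back to the existing name means jobs that do not set the property would behave exactly as they do now.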