[
https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307365#comment-17307365
]
Bimalendu Choudhary edited comment on MAPREDUCE-7331 at 3/23/21, 7:16 PM:
--------------------------------------------------------------------------
The temporary files get deleted at the end of commitJob: we get the
pendingJobAttemptsPath and simply delete that path, so anything inside it gets
deleted. I don't think the underlying task attempt paths get deleted
individually. So in my case it does not matter whether the other application had
the same MapReduce job ID or not. Even if they share the same
jobId/task attempt path, they will be writing to different partition directories
inside it.
To me it looks like one application finishes first and ends up deleting the whole
_temporary directory. For now, the workaround we are trying out is configuring
the committer not to delete the _temporary directory at the end when we know
we have multiple Spark applications using the same output directory.
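If the cluster's Hadoop build includes the cleanup-skipping option for FileOutputCommitter (I believe the key is mapreduce.fileoutputcommitter.cleanup.skipped, but verify it exists in your version), the workaround above could be expressed as a job-level setting, e.g. in mapred-site.xml or the per-job configuration:

```xml
<!-- Assumption: this key is available in the cluster's Hadoop version.
     Skipping cleanup leaves _temporary behind after commit, so the shared
     _temporary directory must be removed later by some external step. -->
<property>
  <name>mapreduce.fileoutputcommitter.cleanup.skipped</name>
  <value>true</value>
</property>
```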
In my case we are running multiple Spark applications, each processing an
individual partition of the same table, to speed up the processing. Since they
all write to separate partitions, there is no chance of data interference, but
we still end up getting a FileNotFoundException.
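The failure mode above can be reproduced with a minimal standalone sketch (plain java.nio, not the actual Hadoop FileOutputCommitter code): two "applications" stage task files under the same hardcoded _temporary subtree of one output directory, the first one commits and recursively deletes _temporary, and the second one's pending file is gone.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical simulation of two jobs sharing one output directory's
// _temporary subtree, mirroring FileOutputCommitter's hardcoded pending dir.
public class SharedTempDirDemo {
    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("table");
        Path tmp = out.resolve("_temporary");
        // Both applications stage their task output under the same _temporary.
        Path app1Task = tmp.resolve("0/task_app1/part-00000");
        Path app2Task = tmp.resolve("0/task_app2/part-00000");
        Files.createDirectories(app1Task.getParent());
        Files.createDirectories(app2Task.getParent());
        Files.write(app1Task, "a".getBytes());
        Files.write(app2Task, "b".getBytes());
        // App 1 finishes first: its commitJob deletes the whole _temporary tree.
        deleteRecursively(tmp);
        // App 2 now cannot find its own pending file -> FileNotFoundException
        // in the real committer.
        System.out.println("app2 file exists after app1 commit: "
                + Files.exists(app2Task));
    }

    static void deleteRecursively(Path p) throws IOException {
        if (Files.isDirectory(p)) {
            try (DirectoryStream<Path> ds = Files.newDirectoryStream(p)) {
                for (Path child : ds) {
                    deleteRecursively(child);
                }
            }
        }
        Files.delete(p);
    }
}
```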
> Make temporary directory used by FileOutputCommitter configurable
> -----------------------------------------------------------------
>
> Key: MAPREDUCE-7331
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 3.0.0
> Environment: CDH 6.2.1 Hadoop 3.0.0
> Reporter: Bimalendu Choudhary
> Priority: Major
>
> Spark SQL applications use FileOutputCommitter to commit and merge their files
> under a table directory. The hardcoded PENDING_DIR_NAME = _temporary
> directory results in multiple applications using the same temporary directory.
> This causes unwanted results: one application interferes with other
> applications' temporary files, and one application can end up deleting the
> temporary files of another. There is currently no way for applications to have
> their own unique path for storing temporary files, to avoid interference from
> other, totally independent applications. I think the temporary directory
> used by FileOutputCommitter should be made configurable, so the caller can
> supply its own unique value as required and avoid the directory being
> deleted or overwritten by other applications.
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
> public static final String FILEOUTPUTCOMMITTER_PENDING_DIR_NAME =
> "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>
> This could be used very effectively by Spark applications, and would even help
> with stage failures, where temporary directories left over from previous
> attempts cause problems.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]