Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
> The FileCommitProtocol is an internal API, and our current implementation
> does store task-level data temporarily in a staging directory (see
> HadoopMapReduceCommitProtocol). That said, we can fix the FileCommitProtocol
> so it can roll back a committed task, as long as the job is not committed.
>
> As an example, we can change the semantics of FileCommitProtocol.commitTask
> and say that it may be called multiple times for the same task, with the
> later call winning, and change HadoopMapReduceCommitProtocol to support that.
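The "commitTask may be called multiple times, later wins" semantics from the quoted proposal can be sketched roughly as follows. This is a hypothetical illustration, not Spark's internal FileCommitProtocol API; the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "commitTask may be called multiple times, and
// the later call wins". Re-committing a task id replaces the earlier
// output, which is only safe before the job itself is committed.
public class LastWinsCommitProtocol {
    private final Map<Integer, String> committed = new HashMap<>();
    private boolean jobCommitted = false;

    public void commitTask(int taskId, String output) {
        if (jobCommitted) {
            throw new IllegalStateException("job already committed");
        }
        committed.put(taskId, output); // later commit overwrites the earlier one
    }

    public Map<Integer, String> commitJob() {
        jobCommitted = true;
        return new HashMap<>(committed);
    }

    public static void main(String[] args) {
        LastWinsCommitProtocol p = new LastWinsCommitProtocol();
        p.commitTask(0, "attempt-1-output");
        p.commitTask(0, "attempt-2-output"); // re-commit: the later call wins
        System.out.println(p.commitJob().get(0)); // prints attempt-2-output
    }
}
```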
Hmm, I wasn't aware that option was added. It looks like it only applies if
you use dynamicPartitionOverwrite, which appears to target very specific
output committers for SQL. I don't see how that works with all output
committers: I can write an output committer that never writes anything to
disk; it might write to a DB, HBase, or any custom sink, and these
staging-directory moves don't make sense in that context.
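To make that concern concrete, here is a hypothetical committer that publishes rows to an external store rather than moving files. Nothing below is Spark or Hadoop API; it is only meant to show that once a task commit is a flush to an external system, there is no staging directory or file rename for a rollback to operate on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical committer that "commits" by flushing buffered rows to an
// external store (a stand-in for a DB or HBase client). There is no
// staging directory and no file move, so rename-based rollback of a
// committed task has nothing to undo.
public class ExternalStoreCommitter {
    private final List<String> buffered = new ArrayList<>();
    private final List<String> store; // stand-in for the external system

    public ExternalStoreCommitter(List<String> store) {
        this.store = store;
    }

    public void write(String row) {
        buffered.add(row); // task-side buffering only; nothing visible yet
    }

    public void commitTask() {
        store.addAll(buffered); // once flushed, rows are externally visible
        buffered.clear();
    }
}
```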
Note that it also has a performance impact; that is why the MapReduce output
committers stopped doing this, see the v2 algorithm selected via
mapreduce.fileoutputcommitter.algorithm.version
(https://issues.apache.org/jira/browse/SPARK-20107). I realize performance
may need to be sacrificed in certain cases, but I wanted to point this out at
least. My main concern is really the statement above: I don't see how this
works for all output committers.
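For reference, the commit-algorithm version mentioned above is a Hadoop configuration property; a minimal sketch of selecting the v2 algorithm (the property name comes from the discussion, the XML shape is the standard Hadoop site-config form):

```xml
<property>
  <name>mapreduce.fileoutputcommitter.algorithm.version</name>
  <value>2</value>
</property>
```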