Jungtaek Lim created SPARK-27210:
------------------------------------
Summary: Cleanup incomplete output files in
ManifestFileCommitProtocol if task is aborted
Key: SPARK-27210
URL: https://issues.apache.org/jira/browse/SPARK-27210
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim
Unlike HadoopMapReduceCommitProtocol, ManifestFileCommitProtocol doesn't clean
up incomplete output files in either case: when a task is aborted or when a
job is aborted.
HadoopMapReduceCommitProtocol writes intermediate files to a staging
directory, so when a job is aborted it can simply delete that directory to
clean up everything. It also makes an extra effort to clean up intermediate
files on the task side when a task is aborted.
ManifestFileCommitProtocol does no cleanup at all; it only maintains the
metadata listing which output files were written completely. It would be
better if ManifestFileCommitProtocol made a best effort to clean up. It is not
clear whether it can do job-level cleanup, since it doesn't use a staging
directory, but it can clearly still make a best effort at task-level cleanup.
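
The task-level cleanup proposed above could look roughly like the sketch
below. It is an illustration only, not Spark code: the class TaskFileTracker
and its methods are hypothetical names, and it uses the local file system in
place of a Hadoop FileSystem. The idea is that each task attempt remembers the
files it started writing and, on abort, tries to delete them without letting a
failed delete mask the original task failure.

```python
import os
import tempfile


class TaskFileTracker:
    """Hypothetical helper: remembers files one task attempt started writing."""

    def __init__(self):
        self.added_files = []

    def new_task_temp_file(self, directory, name):
        # Record the path before any bytes are written, so an abort that
        # happens mid-write still knows about the file.
        path = os.path.join(directory, name)
        self.added_files.append(path)
        return path

    def commit_task(self):
        # On success, the completed files are reported so they can be
        # listed in the manifest metadata.
        return list(self.added_files)

    def abort_task(self):
        # Best effort: try every file, never raise, and return the paths
        # that could not be deleted so the caller can log them.
        leftovers = []
        for path in self.added_files:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # never created, nothing to clean up
            except OSError:
                leftovers.append(path)
        self.added_files = []
        return leftovers
```

Calling abort_task() after a failed write removes whatever the attempt left
behind; files that were tracked but never created are skipped silently.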