[
https://issues.apache.org/jira/browse/SPARK-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Patrick Wendell updated SPARK-1100:
-----------------------------------
Assignee: Patrick Wendell (was: Patrick Cogan)
> saveAsTextFile shouldn't clobber by default
> -------------------------------------------
>
> Key: SPARK-1100
> URL: https://issues.apache.org/jira/browse/SPARK-1100
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 0.9.0
> Reporter: Diana Carroll
> Assignee: Patrick Wendell
> Fix For: 1.0.0
>
>
> If I call rdd.saveAsTextFile with an existing directory, it will cheerfully
> and silently overwrite the files in there. This is bad enough if it means
> I've accidentally blown away the results of a job that might have taken
> minutes or hours to run. But it's worse if the second job happens to have
> fewer partitions than the first...in that case, my output directory now
> contains some "part" files from the earlier job, and some "part" files from
> the later job. The only way to tell them apart is by timestamp.
> I wonder if Spark's saveAsTextFile shouldn't work more like Hadoop MapReduce,
> which insists that the output directory not exist before the job starts.
> Similarly, HDFS won't overwrite files by default. Perhaps there could be an
> optional argument for saveAsTextFile that indicates whether it should delete
> the existing directory before starting. (I can't see any time I'd want to
> allow writing to an existing directory with data already in it. Would the mix
> of output from different tasks ever be desirable?)
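A minimal sketch of the partition-mixing hazard described above, using plain Python rather than Spark: save_parts is a hypothetical stand-in that mimics how saveAsTextFile writes one "part-NNNNN" file per partition, silently reusing an existing directory.

```python
import os
import tempfile

def save_parts(out_dir, partitions):
    # Mimic saveAsTextFile: write one "part-NNNNN" file per partition,
    # silently reusing the directory if it already exists.
    os.makedirs(out_dir, exist_ok=True)
    for i, records in enumerate(partitions):
        with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
            f.write("\n".join(records))

out = os.path.join(tempfile.mkdtemp(), "output")
save_parts(out, [["a"], ["b"], ["c"], ["d"]])  # first job: 4 partitions
save_parts(out, [["x"], ["y"]])                # second job: 2 partitions

# The directory now mixes new output (part-00000, part-00001) with
# stale output left over from the first job (part-00002, part-00003).
print(sorted(os.listdir(out)))
# → ['part-00000', 'part-00001', 'part-00002', 'part-00003']
```

Failing fast when the directory already exists, as Hadoop's output-spec check does, would prevent this mixed state entirely.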
--
This message was sent by Atlassian JIRA
(v6.2#6252)