GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/3832
[SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs
This patch disables output spec validation for jobs launched through Spark
Streaming, because the validation interferes with checkpoint recovery.
Hadoop OutputFormats have a `checkOutputSpecs` method which performs
certain checks prior to writing output, such as checking whether the output
directory already exists. SPARK-1100 added checks for FileOutputFormat,
SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and
SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just
FileOutputFormat.
In Spark Streaming, we might have to re-process a batch during checkpoint
recovery, so `save` actions may be called multiple times. In addition to
`DStream`'s own save actions, users might use `transform` or `foreachRDD` and
call the `RDD` and `PairRDD` save actions. When output spec validation is
enabled, the second call to any of these actions fails because the output
already exists.
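To make the failure mode concrete, here is a minimal sketch that uses plain Scala and `java.nio.file` rather than Spark or Hadoop. The `save` function and the `RepeatedSaveDemo` object are hypothetical stand-ins for a save action with an output-existence check; they are not part of this patch, and the exception type is only illustrative.

```scala
import java.nio.file.{Files, Path}
import scala.util.Try

object RepeatedSaveDemo {
  // Hypothetical stand-in for a save action: when `validate` is true it
  // refuses to write into a directory that already exists.
  def save(dir: Path, validate: Boolean): Unit = {
    if (validate && Files.exists(dir)) {
      throw new IllegalStateException(s"Output directory $dir already exists")
    }
    Files.createDirectories(dir) // no-op if the directory is already there
  }

  def main(args: Array[String]): Unit = {
    val parent = Files.createTempDirectory("batch-output")
    val out = parent.resolve("batch-0001")

    // First processing of the batch: the directory is new, so this succeeds.
    save(out, validate = true)

    // Re-processing the same batch after recovery: validation rejects it.
    val retry = Try(save(out, validate = true))
    println(s"second save failed: ${retry.isFailure}")

    // With validation bypassed, the repeated save goes through.
    save(out, validate = false)
  }
}
```

The second call fails only because the first one already created the output, which is the situation a re-processed batch runs into during checkpoint recovery.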
This patch automatically disables output spec validation for jobs
submitted by the Spark Streaming scheduler and introduces a
`spark.streaming.hadoop.validateOutputSpecs` setting to re-enable the old
behavior. The bypass is implemented with Scala's `DynamicVariable`, which
propagates the setting to the save actions without having to mutate SparkConf
or introduce a global variable.
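As a rough illustration of the `DynamicVariable` approach, the sketch below uses only the Scala standard library. The object and member names (`OutputSpecValidation`, `disableValidation`, `shouldValidate`) are assumptions made for this example and may not match the names used in the patch.

```scala
import scala.util.DynamicVariable

object OutputSpecValidation {
  // Hypothetical flag; false means validation is not bypassed.
  val disableValidation = new DynamicVariable[Boolean](false)

  // A save action would validate only if the configuration asks for it
  // and no caller has requested a bypass for the current dynamic scope.
  def shouldValidate(validateOutputSpecsConf: Boolean): Boolean =
    validateOutputSpecsConf && !disableValidation.value
}

object DynamicVariableDemo {
  def main(args: Array[String]): Unit = {
    import OutputSpecValidation._

    // Outside any bypass scope, the configured value applies.
    println(shouldValidate(validateOutputSpecsConf = true))

    // A scheduler could wrap job execution like this; the bypass is
    // visible only to code running inside the block.
    disableValidation.withValue(true) {
      println(shouldValidate(validateOutputSpecsConf = true))
    }

    // The previous value is restored once the block exits.
    println(shouldValidate(validateOutputSpecsConf = true))
  }
}
```

Because `withValue` restores the previous value when the block exits, the bypass stays scoped to the wrapped job and does not leak into other jobs that share the same configuration.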
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark SPARK-4835
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3832.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3832
----
commit 762e473d3d2bd90110029006b06fb701825ecdde
Author: Josh Rosen <[email protected]>
Date: 2014-12-30T01:13:50Z
[SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.
----