GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/3832

    [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs

    This patch disables output spec validation for jobs launched through Spark 
Streaming, because that validation interferes with checkpoint recovery.
    
    
    Hadoop OutputFormats have a `checkOutputSpecs` method which performs 
certain checks prior to writing output, such as checking whether the output 
directory already exists.  SPARK-1100 added checks for FileOutputFormat, 
SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and 
SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just 
FileOutputFormat.
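
    For reference, the SparkConf setting added by SPARK-1677 is, to the best of 
my knowledge, `spark.hadoop.validateOutputSpecs`; the key name is not stated 
in this description, so treat it as an assumption and check the configuration 
docs for your Spark version.  A minimal sketch of disabling the checks for a 
non-streaming job, assuming spark-core is on the classpath:

```scala
import org.apache.spark.SparkConf

object DisableOutputSpecChecks {
  // Pass `false` so the conf does not pick up spark.* system properties,
  // which keeps this sketch deterministic.
  val conf: SparkConf = new SparkConf(false)
    .setAppName("output-spec-example") // hypothetical application name
    // Assumed key name (see the note above); "false" turns the checks off.
    .set("spark.hadoop.validateOutputSpecs", "false")

  // Reads the setting back; the default of `true` mirrors "checks enabled".
  val validationEnabled: Boolean =
    conf.getBoolean("spark.hadoop.validateOutputSpecs", true)
}
```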
    
    In Spark Streaming, a batch may have to be re-processed during checkpoint 
recovery, so `save` actions may be called more than once for the same batch.  
In addition to `DStream`'s own save actions, users might use `transform` or 
`foreachRDD` and call the `RDD` and `PairRDD` save actions.  When output spec 
validation is enabled, the repeated calls to these actions fail because the 
output already exists.
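
    The failure mode can be illustrated without Spark.  The sketch below is 
not Spark or Hadoop code: `ReplayedSave.save` is a hypothetical stand-in for a 
save action whose output check refuses to write into an existing directory, 
and it only shows why replaying the same batch fails while validation is on.

```scala
import java.nio.file.{Files, Path}

object ReplayedSave {
  // Hypothetical stand-in for a save action guarded by an output-spec check:
  // with validation on, an existing output directory is an error.
  def save(dir: Path, validateOutputSpecs: Boolean): Unit = {
    if (validateOutputSpecs && Files.exists(dir)) {
      throw new IllegalStateException(s"Output directory $dir already exists")
    }
    // Succeeds whether or not the directory is already there.
    Files.createDirectories(dir)
  }
}
```

    Calling `save` twice on the same path with `validateOutputSpecs = true` 
throws on the second call, which corresponds to a batch being replayed after 
recovery; with `validateOutputSpecs = false` the second call goes through.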
    
    This patch automatically disables output spec validation for jobs 
submitted by the Spark Streaming scheduler and introduces a 
`spark.streaming.hadoop.validateOutputSpecs` setting to re-enable the old 
behavior.  The bypass setting is propagated with Scala's `DynamicVariable`, 
which avoids having to mutate SparkConf or introduce a global variable.
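
    A minimal sketch of the `DynamicVariable` technique follows.  It is not 
the code in this patch: the object and member names are hypothetical, and it 
only shows how a scoped bypass flag can be set for the duration of a job 
without touching any shared configuration.

```scala
import scala.util.DynamicVariable

object OutputSpecValidation {
  // Hypothetical flag; the default `false` means "do not bypass validation".
  val disableOutputSpecValidation = new DynamicVariable[Boolean](false)

  // Validation runs only if the configuration enables it and no enclosing
  // scope has requested a bypass.
  def shouldValidate(enabledInConf: Boolean): Boolean =
    enabledInConf && !disableOutputSpecValidation.value

  // Hypothetical wrapper a streaming scheduler could put around a job body:
  // the flag is true inside `body` and restored afterwards.
  def runStreamingJob[T](body: => T): T =
    disableOutputSpecValidation.withValue(true)(body)
}
```

    Outside `runStreamingJob`, `shouldValidate(true)` is `true`; inside it, 
the same call returns `false`, and the previous value is restored once the 
body finishes, even if it throws.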

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-4835

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3832.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3832
    
----
commit 762e473d3d2bd90110029006b06fb701825ecdde
Author: Josh Rosen <[email protected]>
Date:   2014-12-30T01:13:50Z

    [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.

----
