> On 3 Apr 2018, at 11:19, cane <zhoukang199...@gmail.com> wrote:
> 
> Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may cause
> data loss.
> I checked the comment of this API:
> 
> We should make sure our tasks are idempotent when speculation is enabled,
> i.e. do not use output committer that writes data directly.
> There is an example in https://issues.apache.org/jira/browse/SPARK-10063
> to show the bad result of using direct output committer with speculation enabled.
> 
> But is this a rule we must follow?
> For example, for parquet it will get ParquetOutputCommitter.
> In this case, must speculation be disabled for parquet?
> 
> Does anyone know the history?
> Thanks very much!


If you are writing to HDFS or object stores other than S3, and you make sure
that you are using the v1 FileOutputCommitter commit algorithm, you can use
speculation without problems:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 1
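
As a minimal sketch (assuming a SparkSession built locally; the app name,
dataset and output path are purely illustrative), the v1 algorithm and
speculation can be set together like this:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("speculative-write-example")
    // v1: task output is only promoted to the final directory during job
    // commit, so a duplicate speculative attempt cannot corrupt the destination
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
    // let Spark launch speculative copies of slow tasks
    .config("spark.speculation", "true")
    .getOrCreate()

  // ParquetOutputCommitter extends FileOutputCommitter, so a parquet write
  // like this is safe to run with speculation enabled
  val df = spark.range(100).toDF("id")
  df.write.mode("overwrite").parquet("hdfs:///tmp/speculative-output")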

If you use the version 2 algorithm then you are vulnerable to a failure during
task commit, but only during task commit, and only if speculative/repeated tasks
generate output files with different names:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
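
If you are unsure which algorithm a cluster is actually using, here is a
minimal sketch for checking it from an existing SparkSession (assumed to be
called spark; the stock default is 1, but some distributions override it):

  // read the effective committer algorithm version from the Hadoop configuration
  val committerVersion = spark.sparkContext.hadoopConfiguration
    .getInt("mapreduce.fileoutputcommitter.algorithm.version", 1)
  println(s"FileOutputCommitter algorithm version: $committerVersion")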

If you are using S3 as a direct destination of work then, in the absence of a
consistency layer (S3mper, EMR consistent view, Hadoop 3.x + S3Guard) or an
S3-specific committer, you are always at risk of data loss. Don't do that.
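
For the Hadoop 3.x + S3Guard / S3A committer route, a rough sketch of the
relevant Spark settings (property and class names follow the Hadoop 3.1+ S3A
documentation; check the exact names and the DynamoDB table setup for your
Hadoop version before relying on them):

spark.hadoop.fs.s3a.metadatastore.impl org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore
spark.hadoop.fs.s3a.committer.name directory
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory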

Further reading

https://github.com/steveloughran/zero-rename-committer/releases/download/tag_draft_003/a_zero_rename_committer.pdf

