[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102837#comment-17102837 ]

Afroz Baig commented on SPARK-29037:
------------------------------------

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

Does this really prevent duplicate data from being committed to the final
location?

See the scenario below.
One of our Spark jobs failed on its first attempt with a FileNotFoundException
but succeeded on the second attempt. The problem this caused: the first attempt
had already written part of its data to the final location before failing.

The second attempt then wrote the same data again and completed successfully.
This caused business impact due to the duplicated data. The write mode was
"append".

The first attempt of the job failed with a FileNotFoundException:

2020-05-03 21:32:22,237 [Thread-10] ERROR
org.apache.spark.sql.execution.datasources.FileFormatWriter - Aborting job null.
java.io.FileNotFoundException: File
hdfs://nnproxies/insight_prod/rdf/output/forecast_revolution/uk/activation_fw_nws_calc/aggregate_expansion_data/_temporary/0/task_20200503213210_0012_m_000024/calendar_date=2020-05-03
does not exist.

Do you think setting this conf in the spark-submit command will help avoid the
duplication?
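
For context, a minimal sketch of how the setting would be passed when building
the session (the app name below is hypothetical; passing --conf
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 to spark-submit
is equivalent):

    import org.apache.spark.sql.SparkSession

    // Commit algorithm v2 moves each task's output into the destination
    // directory at task commit rather than at job commit. That speeds up the
    // job commit, but it does not make an "append" write idempotent: a rerun
    // after a partial failure can still duplicate rows the first attempt
    // already moved into place.
    val spark = SparkSession.builder()
      .appName("example")  // hypothetical app name
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

As far as I understand, v2 only changes when task output reaches the
destination, so by itself it would not prevent the duplication described above.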

> [Core] Spark gives duplicate result when an application was killed and rerun
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-29037
>                 URL: https://issues.apache.org/jira/browse/SPARK-29037
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.3.3
>            Reporter: feiwang
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> For InsertIntoHadoopFsRelation operations.
> Case A:
> Application appA runs an insert overwrite on table table_a with a static
> partition overwrite.
> But it was killed while committing tasks, because one task hung.
> Part of its committed task output is left under /path/table_a/_temporary/0/.
> Then we rerun appA. It reuses the staging dir /path/table_a/_temporary/0/.
> It executes successfully.
> But it also commits the data left behind by the killed application to the
> destination dir.
> Case B:
> Application appA runs an insert overwrite on table table_a.
> Application appB runs an insert overwrite on table table_a, too.
> They execute concurrently, and both may use /path/table_a/_temporary/0/ as
> the work path.
> Their results may be corrupted.
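
To make Case B concrete, two applications concurrently issuing an overwrite of
roughly this shape would both stage output under /path/table_a/_temporary/0/
(the partition column, select list, and source table are illustrative; assumes
a SparkSession named spark as in the sketch above):

    // Hypothetical static-partition overwrite, run from two applications at once.
    spark.sql(
      """INSERT OVERWRITE TABLE table_a PARTITION (dt = '2020-05-03')
        |SELECT id, value FROM source_table""".stripMargin)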


