[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102837#comment-17102837 ]
Afroz Baig commented on SPARK-29037:
------------------------------------

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

Does this setting really stop duplicate data from being committed to the final location?

Consider the scenario below. One of our Spark jobs failed on its first attempt with a file-not-found exception but succeeded on the second attempt. The problem was that the first attempt had already written data to the final location before it failed. The second attempt then wrote the same data again and completed successfully. The duplicated data had business impact. The write mode was "append".

The first attempt of the job failed with this exception:

2020-05-03 21:32:22,237 [Thread-10] ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter - Aborting job null.
java.io.FileNotFoundException: File hdfs://nnproxies/insight_prod/rdf/output/forecast_revolution/uk/activation_fw_nws_calc/aggregate_expansion_data/_temporary/0/task_20200503213210_0012_m_000024/calendar_date=2020-05-03 does not exist.

Do you think setting this conf in the spark-submit command will help avoid the duplication?

> [Core] Spark gives duplicate result when an application was killed and rerun
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-29037
>                 URL: https://issues.apache.org/jira/browse/SPARK-29037
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.3.3
>            Reporter: feiwang
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> For InsertIntoHadoopFsRelation operations.
> Case A:
> Application appA insert overwrite table table_a with static partition overwrite.
> But it was killed when committing tasks, because one task hung.
> And part of its committed task output is kept under /path/table_a/_temporary/0/.
> Then we rerun appA. It will reuse the staging dir /path/table_a/_temporary/0/.
> It executes successfully.
> But it also commits the data left behind by the killed application to the destination dir.
> Case B:
> Application appA insert overwrite table table_a.
> Application appB insert overwrite table table_a, too.
> They execute concurrently, and they may both use /path/table_a/_temporary/0/ as workPath.
> And their results may be corrupted.
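
Case A in the quoted issue description can be reproduced in miniature on a local filesystem. The sketch below assumes a deliberately simplified model of job commit: every file found under the shared `_temporary/0/` staging directory is moved into the table directory. It is not Hadoop's FileOutputCommitter code, and `commit_task`, `commit_job`, `read_table`, `simulate_kill_and_rerun`, the task ids, and the row values are all hypothetical names chosen for illustration.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

# Staging path shared by every run, as described in the issue (simplified model).
STAGING = Path("_temporary") / "0"


def commit_task(table_dir: Path, task_id: str, rows: List[str]) -> None:
    """Write one task's output under the shared staging directory."""
    task_dir = table_dir / STAGING / task_id
    task_dir.mkdir(parents=True, exist_ok=True)
    (task_dir / f"part-{task_id}.txt").write_text("\n".join(rows) + "\n")


def commit_job(table_dir: Path) -> None:
    """Move every staged file into the table directory, whichever run wrote it."""
    staging = table_dir / STAGING
    for part in sorted(staging.glob("*/part-*")):
        shutil.move(str(part), str(table_dir / part.name))
    shutil.rmtree(table_dir / "_temporary")


def read_table(table_dir: Path) -> List[str]:
    """Return all rows from the committed part files, in file-name order."""
    rows: List[str] = []
    for part in sorted(table_dir.glob("part-*")):
        rows.extend(part.read_text().splitlines())
    return rows


def simulate_kill_and_rerun() -> List[str]:
    with tempfile.TemporaryDirectory() as tmp:
        table_dir = Path(tmp) / "table_a"
        table_dir.mkdir()

        # First run: one task commits its output, then the application is
        # killed before commit_job runs, so the staged file is left behind.
        commit_task(table_dir, "run1_task0", ["row-1", "row-2"])

        # Rerun: reuses the same staging directory and writes the full output.
        commit_task(table_dir, "run2_task0", ["row-1", "row-2"])
        commit_task(table_dir, "run2_task1", ["row-3"])
        commit_job(table_dir)

        return read_table(table_dir)


if __name__ == "__main__":
    print(simulate_kill_and_rerun())
```

In this model the rerun's job commit also picks up the file staged by the killed run, so `row-1` and `row-2` end up in the table twice. A staging path that is unique per run would keep the leftover file out of the second commit. Whether a particular committer setting achieves that is not something this sketch can show.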