[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102837#comment-17102837 ]
Afroz Baig edited comment on SPARK-29037 at 5/8/20, 7:11 PM:
-------------------------------------------------------------

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

Does this really stop duplicate data from being committed to the final location? See the scenario below. Basically, I hit two issues: the one described in https://issues.apache.org/jira/browse/SPARK-18883, and the one being discussed here.

One of our Spark jobs failed on its first attempt with a file-not-found exception but succeeded on the second attempt. The problem was that the first attempt had already written data to the final location before failing; the second attempt then wrote the same data again and completed successfully. Because the write mode was "append", this duplicated the data, which caused business impact.

The first attempt of the job failed with:

2020-05-03 21:32:22,237 [Thread-10] ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter - Aborting job null.
java.io.FileNotFoundException: File hdfs://nnproxies/insight_prod/rdf/output/forecast_revolution/uk/activation_fw_nws_calc/aggregate_expansion_data/_temporary/0/task_20200503213210_0012_m_000024/calendar_date=2020-05-03 does not exist.

Do you think setting this conf in the spark-submit command will help avoid the duplication?
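For context on why that conf does not prevent this: algorithm version 2 commits each task's output directly into the final output directory at task-commit time instead of at job commit, so files committed by a failed first attempt stay in the destination, and an append-mode rerun adds a second copy beside them. Below is a minimal filesystem-level sketch of that failure mode — plain Python directory moves stand in for the HDFS rename operations, and all names (`part-*-attempt*`, the three-task job) are illustrative, not Spark's actual layout:

```python
import os
import shutil
import tempfile

def commit_task_v2(task_dir, final_dir):
    # Algorithm v2: task commit moves output straight into the final directory.
    os.makedirs(final_dir, exist_ok=True)
    for name in os.listdir(task_dir):
        shutil.move(os.path.join(task_dir, name), os.path.join(final_dir, name))

def run_job(attempt, final_dir, fail_at_task=None):
    # Each of three tasks writes one part file, then commits it (v2 style).
    for task in range(3):
        if task == fail_at_task:
            raise RuntimeError("simulated FileNotFoundException in a task")
        task_dir = tempfile.mkdtemp()
        # Part-file names embed the attempt id, so the rerun's files land
        # beside (not over) the first attempt's files, as in append mode.
        with open(os.path.join(task_dir, f"part-{task}-attempt{attempt}"), "w") as fh:
            fh.write("row data\n")
        commit_task_v2(task_dir, final_dir)

final_dir = tempfile.mkdtemp()
try:
    run_job(attempt=1, final_dir=final_dir, fail_at_task=2)  # first attempt dies
except RuntimeError:
    pass
run_job(attempt=2, final_dir=final_dir)                      # append-mode rerun
# Tasks 0 and 1 of attempt 1 were already committed to the destination,
# so it now holds their two files plus all three of attempt 2's.
print(sorted(os.listdir(final_dir)))
```

The sketch suggests v2 makes this duplication *more* likely, not less: with v1, nothing of the failed attempt would have reached the final directory before job commit.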
> [Core] Spark gives duplicate result when an application was killed and rerun
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-29037
>                 URL: https://issues.apache.org/jira/browse/SPARK-29037
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.3.3
>            Reporter: feiwang
>            Priority: Major
>         Attachments: screenshot-1.png
>
> For InsertIntoHadoopFsRelation operations.
>
> Case A:
> Application appA runs an insert overwrite of table table_a with static partition overwrite, but it is killed while committing tasks because one task hangs. Part of its committed task output is left under /path/table_a/_temporary/0/.
> When appA is rerun, it reuses the staging dir /path/table_a/_temporary/0/ and completes successfully, but its job commit also promotes the leftover output of the killed run to the destination dir.
>
> Case B:
> Applications appA and appB insert overwrite table table_a concurrently. They may both use /path/table_a/_temporary/0/ as their work path, and their results may be corrupted.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
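Case A above can also be reproduced at the filesystem level. In this sketch, plain Python directory operations stand in for HDFS, and the job commit mimics algorithm v1's behavior of promoting everything under the shared `_temporary/0` staging dir, with no record of which application run produced it. All names (`appA`, `rerun`, `task_*`, `part-*`) are illustrative:

```python
import os
import shutil
import tempfile

table = tempfile.mkdtemp()                        # stands in for /path/table_a
staging = os.path.join(table, "_temporary", "0")  # shared staging dir

def commit_task_v1(app, task):
    # v1 task commit: output stays under the staging dir until job commit.
    task_dir = os.path.join(staging, f"task_{task}")
    os.makedirs(task_dir, exist_ok=True)
    with open(os.path.join(task_dir, f"part-{task}-{app}"), "w") as fh:
        fh.write("rows\n")

def commit_job_v1():
    # v1 job commit: promote *everything* under _temporary/0 to the table,
    # including output left behind by a previous, killed run.
    for task_dir in os.listdir(staging):
        src = os.path.join(staging, task_dir)
        for name in os.listdir(src):
            shutil.move(os.path.join(src, name), os.path.join(table, name))
    shutil.rmtree(os.path.join(table, "_temporary"))

# appA commits two tasks, then is killed before its job commit runs:
commit_task_v1("appA", 0)
commit_task_v1("appA", 1)

# The rerun reuses the same _temporary/0; its job commit therefore also
# promotes the stale task output the killed run left behind:
commit_task_v1("rerun", 0)
commit_task_v1("rerun", 1)
commit_job_v1()
print(sorted(os.listdir(table)))
```

The same shared staging path is what makes Case B dangerous: two concurrent applications writing under one `_temporary/0` cannot tell their task output apart at job commit.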