[ 
https://issues.apache.org/jira/browse/HUDI-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Goenka resolved HUDI-7100.
---------------------------------

> Data loss when using insert_overwrite_table with insert.drop.duplicates
> -----------------------------------------------------------------------
>
>                 Key: HUDI-7100
>                 URL: https://issues.apache.org/jira/browse/HUDI-7100
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>            Reporter: Aditya Goenka
>            Assignee: sivabalan narayanan
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.12.4, 0.14.1, 0.13.2
>
>
> Code to reproduce -
> GitHub issue: [https://github.com/apache/hudi/issues/9967]
> ```python
> from pyspark.sql import Row
> from pyspark.sql.types import IntegerType, StringType, StructField, StructType
>
> schema = StructType(
>     [
>         StructField("id", IntegerType(), True),
>         StructField("name", StringType(), True),
>     ]
> )
> data = [
>     Row(1, "a"),
>     Row(2, "a"),
>     Row(3, "c"),
> ]
> hudi_configs = {
>     "hoodie.table.name": TABLE_NAME,
>     "hoodie.datasource.write.recordkey.field": "name",
>     "hoodie.datasource.write.precombine.field": "id",
>     "hoodie.datasource.write.operation": "insert_overwrite_table",
>     "hoodie.table.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
> }
> df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)
> spark.read.format("hudi").load(PATH).show()
>
> df.write.format("org.apache.hudi").options(**hudi_configs).option("hoodie.datasource.write.insert.drop.duplicates", "true").mode("append").save(PATH)
> spark.read.format("hudi").load(PATH).show()
> # -- Showing no records
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
