[
https://issues.apache.org/jira/browse/HUDI-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-7100:
---------------------------------
Labels: pull-request-available (was: )
> Data loss when using insert_overwrite_table with insert.drop.duplicates
> -----------------------------------------------------------------------
>
> Key: HUDI-7100
> URL: https://issues.apache.org/jira/browse/HUDI-7100
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: Aditya Goenka
> Assignee: sivabalan narayanan
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.12.4, 0.14.1, 0.13.2
>
>
> Code to reproduce -
> Github Issue - [https://github.com/apache/hudi/issues/9967]
> ```
> from pyspark.sql import Row
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
>
> schema = StructType(
>     [
>         StructField("id", IntegerType(), True),
>         StructField("name", StringType(), True),
>     ]
> )
> data = [
>     Row(1, "a"),
>     Row(2, "a"),
>     Row(3, "c"),
> ]
> hudi_configs = {
>     "hoodie.table.name": TABLE_NAME,
>     "hoodie.datasource.write.recordkey.field": "name",
>     "hoodie.datasource.write.precombine.field": "id",
>     "hoodie.datasource.write.operation": "insert_overwrite_table",
>     "hoodie.table.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
> }
> df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
>
> # First write, without insert.drop.duplicates - the table reads back fine
> df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)
> spark.read.format("hudi").load(PATH).show()
>
> # Second write, with insert.drop.duplicates enabled
> df.write.format("org.apache.hudi").options(**hudi_configs).option("hoodie.datasource.write.insert.drop.duplicates", "true").mode("append").save(PATH)
> spark.read.format("hudi").load(PATH).show()
> # -- Showing no records (data loss)
> ```
--
This message was sent by Atlassian Jira
(v8.20.10#820010)