Aditya Goenka created HUDI-7100:
-----------------------------------
Summary: Data loss when using insert_overwrite_table with insert.drop.duplicates
Key: HUDI-7100
URL: https://issues.apache.org/jira/browse/HUDI-7100
Project: Apache Hudi
Issue Type: Bug
Components: writer-core
Reporter: Aditya Goenka
Fix For: 0.12.4, 0.14.1, 0.13.2
Code to reproduce:
Github Issue - [https://github.com/apache/hudi/issues/9967]
```
from pyspark.sql import Row
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
    ]
)

data = [
    Row(1, "a"),
    Row(2, "a"),
    Row(3, "c"),
]

# TABLE_NAME and PATH are placeholders for the table name and base path.
hudi_configs = {
    "hoodie.table.name": TABLE_NAME,
    "hoodie.datasource.write.recordkey.field": "name",
    "hoodie.datasource.write.precombine.field": "id",
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    "hoodie.table.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}

df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

# First write, without insert.drop.duplicates: the table reads back with records.
df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)
spark.read.format("hudi").load(PATH).show()

# Second write, with insert.drop.duplicates enabled: all data is lost.
df.write.format("org.apache.hudi").options(**hudi_configs).option("hoodie.datasource.write.insert.drop.duplicates", "true").mode("append").save(PATH)
spark.read.format("hudi").load(PATH).show()
# -- Showing no records
```
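A minimal workaround sketch, assuming the intent is only to de-duplicate records within the incoming batch on the record key field ("name"): de-duplicate the DataFrame with Spark's own dropDuplicates before writing, instead of setting hoodie.datasource.write.insert.drop.duplicates. This is an assumption about the desired semantics, not a confirmed fix for the underlying bug.
```
# Assumed workaround: drop in-batch duplicates on the record key, then write
# with insert_overwrite_table as before, leaving insert.drop.duplicates unset.
deduped_df = df.dropDuplicates(["name"])

deduped_df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)
spark.read.format("hudi").load(PATH).show()
```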