[
https://issues.apache.org/jira/browse/HUDI-5839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
kazdy updated HUDI-5839:
------------------------
Affects Version/s: 0.13.0
> Insert in non-strict mode deduplicates dataset in "append" mode - spark
> -----------------------------------------------------------------------
>
> Key: HUDI-5839
> URL: https://issues.apache.org/jira/browse/HUDI-5839
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark, writer-core
> Affects Versions: 0.13.0
> Reporter: kazdy
> Priority: Major
>
> There seems to be a bug with non-strict insert mode when a precombine field
> is not defined (I have not checked the case where it is defined).
> With the Spark datasource, duplicates can be inserted only in overwrite mode,
> or in append mode when data is written to the table for the first time. If I
> insert in append mode a second time, the dataset is deduplicated as if the
> write were running in upsert mode.
> This appears to be a regression, because I rely on this functionality in
> Hudi 0.12.1.
> {code:python}
> from pyspark.sql.functions import expr
> opt_insert = {
>     'hoodie.table.name': 'huditbl',
>     'hoodie.datasource.write.recordkey.field': 'keyid',
>     'hoodie.datasource.write.table.name': 'huditbl',
>     'hoodie.datasource.write.operation': 'insert',
>     'hoodie.sql.insert.mode': 'non-strict',
>     'hoodie.upsert.shuffle.parallelism': 2,
>     'hoodie.insert.shuffle.parallelism': 2,
>     'hoodie.combine.before.upsert': 'false',
>     'hoodie.combine.before.insert': 'false',
>     'hoodie.datasource.write.insert.drop.duplicates': 'false'
> }
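> # 10 rows with unique keys 0..9
> df = spark.range(0, 10).toDF("keyid") \
>     .withColumn("age", expr("keyid + 1000"))
> # `path` is the base path of the Hudi table (not defined in this snippet)
> df.write.format("hudi"). \
>     options(**opt_insert). \
>     mode("overwrite"). \
>     save(path)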
> spark.read.format("hudi").load(path).count() # returns 10
> df = df.union(df)  # creates duplicates
> df.write.format("hudi"). \
>     options(**opt_insert). \
>     mode("append"). \
>     save(path)
> spark.read.format("hudi").load(path).count()  # returns 10 but should return 20
> {code}