kazdy created HUDI-5839:
---------------------------
Summary: Insert in non-strict mode deduplicates dataset in "append"
mode - spark
Key: HUDI-5839
URL: https://issues.apache.org/jira/browse/HUDI-5839
Project: Apache Hudi
Issue Type: Bug
Components: spark, writer-core
Reporter: kazdy
There seems to be a bug with non-strict insert mode when precombine is not
defined (I have not checked the case where it is defined).
With the Spark datasource, duplicates are written only in overwrite mode, or in
append mode when data is inserted into the table for the first time. If I
insert in append mode a second time, the dataset is deduplicated as if the
write were running in upsert mode.
It appears to be a regression, because I use this functionality in Hudi
0.12.1.
{code:python}
from pyspark.sql.functions import expr

# Hypothetical table location (not given in the report); any writable path works.
# `spark` is assumed to be the active SparkSession (e.g. from the pyspark shell).
path = "/tmp/huditbl"

opt_insert = {
    'hoodie.table.name': 'huditbl',
    'hoodie.datasource.write.recordkey.field': 'keyid',
    'hoodie.datasource.write.table.name': 'huditbl',
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.sql.insert.mode': 'non-strict',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.combine.before.upsert': 'false',
    'hoodie.combine.before.insert': 'false',
    'hoodie.datasource.write.insert.drop.duplicates': 'false'
}

# First write: 10 rows with unique keys.
df = spark.range(0, 10).toDF("keyid") \
    .withColumn("age", expr("keyid + 1000"))

df.write.format("hudi"). \
    options(**opt_insert). \
    mode("overwrite"). \
    save(path)

spark.read.format("hudi").load(path).count() # returns 10

# Second write: the same keys again, each one twice.
df = df.union(df) # creates duplicates

df.write.format("hudi"). \
    options(**opt_insert). \
    mode("append"). \
    save(path)

spark.read.format("hudi").load(path).count() # returns 10 but should return 20
{code}
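To make the difference between the behaviour I expect and the behaviour I see concrete, here is a small plain-Python model. It is an illustration only and does not reflect how Hudi is implemented; `append_insert` and `merge_by_key` are hypothetical helper names, and the sample keys are made up for the example.

```python
def append_insert(table, batch):
    """Insert semantics: keep every incoming row, duplicates included."""
    return table + batch


def merge_by_key(table, batch, key="keyid"):
    """Upsert-like semantics: one row per key, the last write wins."""
    merged = {row[key]: row for row in table}
    for row in batch:
        merged[row[key]] = row
    return list(merged.values())


# Existing table with three keys, then a batch that repeats key 2 and adds key 4.
table = [{"keyid": k, "age": k + 1000} for k in (1, 2, 3)]
batch = [{"keyid": k, "age": k + 1000} for k in (2, 2, 4)]

inserted = append_insert(table, batch)  # all rows kept
merged = merge_by_key(table, batch)     # collapsed to one row per key

print(len(inserted), len(merged))
```

The second append in the reproduction above behaves like `merge_by_key`, whereas with `insert` and the deduplication options disabled I would expect it to behave like `append_insert`.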