Jonathan Vexler created HUDI-5257:
-------------------------------------
Summary: Spark-Sql duplicates and re-uses record keys under
certain configs and use cases
Key: HUDI-5257
URL: https://issues.apache.org/jira/browse/HUDI-5257
Project: Apache Hudi
Issue Type: Bug
Components: bootstrap, spark-sql
Reporter: Jonathan Vexler
Attachments: bad_data.txt
On a new table with primary key _row_key and partitioned by partition_path,
if you do a bulk insert by
{code:java}
insertDf.createOrReplaceTempView("insert_temp_table")
spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
spark.sql("set hoodie.sql.bulk.insert.enable=true")
spark.sql("set hoodie.sql.insert.mode=non-strict")
spark.sql(s"insert into $tableName select * from insert_temp_table") {code}
you will get data like [^bad_data.txt], where multiple records share the same
record key even though they have different primary key values, and multiple
files are written even though there are only 10 records.
Changing hoodie.datasource.write.operation=bulk_insert to
hoodie.datasource.write.operation=insert causes the data to be inserted
correctly, though I do not know whether bulk insert is actually used with this
change.
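For illustration, the symptom can be checked with a minimal sketch (plain Scala, no Spark; the object and method names are hypothetical, not part of Hudi): group the rows read back from the table by their generated Hudi record key and flag any key that is shared by more than one distinct primary-key value.

{code:java}
// Hypothetical sketch (not from this report): given (_hoodie_record_key, _row_key)
// pairs read back from the table, return the record keys that are shared by
// more than one distinct primary-key value -- the symptom seen in bad_data.txt.
object DuplicateKeyCheck {
  def duplicatedKeys(rows: Seq[(String, String)]): Set[String] =
    rows.groupBy(_._1) // bucket rows by the generated record key
      .collect { case (recordKey, group) if group.map(_._2).distinct.size > 1 => recordKey }
      .toSet
}
{code}

For example, DuplicateKeyCheck.duplicatedKeys(Seq(("k1", "a1"), ("k1", "a2"), ("k2", "a3"))) returns Set("k1"); a correct write should always return an empty set.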
However, if you use bulk insert with raw data like
{code:java}
spark.sql(s"""
| insert into $tableName values
| $values
|""".stripMargin
){code}
where $values is something like
{code:java}
(1, 'a1', 10, 1000, "2021-01-05"), {code}
then hoodie.datasource.write.operation=bulk_insert works as expected.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)