[
https://issues.apache.org/jira/browse/HUDI-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Vexler updated HUDI-5257:
----------------------------------
Description:
On a new table with primary key _row_key and partitioned by partition_path, if
you do a bulk insert by:
{code:java}
insertDf.createOrReplaceTempView("insert_temp_table")
spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
spark.sql("set hoodie.sql.bulk.insert.enable=true")
spark.sql("set hoodie.sql.insert.mode=non-strict")
spark.sql(s"insert into $tableName select * from insert_temp_table") {code}
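For context, a minimal setup that can drive the repro above (the table name, schema, and the construction of insertDf below are assumptions for illustration, not taken from the original report):
{code:java}
// Hypothetical repro setup; table name, schema, and data are assumptions.
val tableName = "hudi_bulk_insert_repro"
spark.sql(s"""
  create table $tableName (
    _row_key string, name string, price int, ts long, partition_path string
  ) using hudi
  tblproperties (primaryKey = '_row_key', preCombineField = 'ts')
  partitioned by (partition_path)
""")
// 10 records with distinct primary keys, matching the 10-record case in the report.
val insertDf = spark.range(10).selectExpr(
  "cast(id as string) as _row_key",
  "concat('a', id) as name",
  "cast(id * 10 as int) as price",
  "1000 + id as ts",
  "'2021-01-05' as partition_path")
{code}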
you will get data like [^bad_data.txt], where multiple records share the same
record key even though they have different primary key values, and where
multiple files are written even though there are only 10 records.
Changing hoodie.datasource.write.operation=bulk_insert to
hoodie.datasource.write.operation=insert causes the data to be inserted
correctly, though I do not know whether bulk insert is actually used after this change.
However, if you use bulk insert with raw data like
{code:java}
spark.sql(s"""
| insert into $tableName values
| $values
|""".stripMargin
){code}
where $values is something like
{code:java}
(1, 'a1', 10, 1000, "2021-01-05"), {code}
then hoodie.datasource.write.operation=bulk_insert works as expected.
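A quick way to check for the duplicate-key symptom is to compare distinct Hudi record keys against the row count using Hudi's _hoodie_record_key metadata column (a sketch; $tableName as above):
{code:java}
// On a healthy 10-record table, both counts should be 10; when the bug
// reproduces, distinct _hoodie_record_key values will be fewer than the rows.
spark.sql(s"""
  select count(distinct _hoodie_record_key) as distinct_keys, count(*) as total_rows
  from $tableName
""").show()
{code}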
was:
On a new table with primary key _{_}row_key and partitioned by partition_path,
if you do a bulk insert by{_}
{code:java}
insertDf.createOrReplaceTempView("insert_temp_table")
spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
spark.sql("set hoodie.sql.bulk.insert.enable=true")
spark.sql("set hoodie.sql.insert.mode=non-strict")
spark.sql(s"insert into $tableName select * from insert_temp_table") {code}
you will get data with: [^bad_data.txt] where multiple records have the same
key even though they have different primary key values, and that there are
multiple files even though there are only 10 records
changing hoodie.datasource.write.operation=bulk_insert to
hoodie.datasource.write.operation=insert causes the data to be inserted
correctly. I do not know if it is using bulk insert with this change.
However, if you use bulk insert with raw data like
{code:java}
spark.sql(s"""
| insert into $tableName values
| $values
|""".stripMargin
){code}
where $values is something like
{code:java}
(1, 'a1', 10, 1000, "2021-01-05"), {code}
then hoodie.datasource.write.operation=bulk_insert works as expected
> Spark-Sql duplicates and re-uses record keys under certain configs and use
> cases
> --------------------------------------------------------------------------------
>
> Key: HUDI-5257
> URL: https://issues.apache.org/jira/browse/HUDI-5257
> Project: Apache Hudi
> Issue Type: Bug
> Components: bootstrap, spark-sql
> Reporter: Jonathan Vexler
> Priority: Major
> Attachments: bad_data.txt
>
>
> On a new table with primary key _row_key and partitioned by partition_path,
> if you do a bulk insert by:
> {code:java}
> insertDf.createOrReplaceTempView("insert_temp_table")
> spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
> spark.sql("set hoodie.sql.bulk.insert.enable=true")
> spark.sql("set hoodie.sql.insert.mode=non-strict")
> spark.sql(s"insert into $tableName select * from insert_temp_table") {code}
> you will get data like [^bad_data.txt], where multiple records share the same
> record key even though they have different primary key values, and where
> multiple files are written even though there are only 10 records.
> Changing hoodie.datasource.write.operation=bulk_insert to
> hoodie.datasource.write.operation=insert causes the data to be inserted
> correctly, though I do not know whether bulk insert is actually used after this change.
>
> However, if you use bulk insert with raw data like
> {code:java}
> spark.sql(s"""
> | insert into $tableName values
> | $values
> |""".stripMargin
> ){code}
> where $values is something like
> {code:java}
> (1, 'a1', 10, 1000, "2021-01-05"), {code}
> then hoodie.datasource.write.operation=bulk_insert works as expected.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)