[
https://issues.apache.org/jira/browse/HUDI-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Vexler updated HUDI-5257:
----------------------------------
Description:
On a new table with primary key _row_key and partitioned by partition_path, if
you do a bulk insert by:
{code:java}
insertDf.createOrReplaceTempView("insert_temp_table")
spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
spark.sql("set hoodie.sql.bulk.insert.enable=true")
spark.sql("set hoodie.sql.insert.mode=non-strict")
spark.sql(s"insert into $tableName select * from insert_temp_table") {code}
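For context, a minimal setup that can drive the repro above (the table name, schema, and the construction of insertDf below are assumptions for illustration, not taken from the original report):
{code:java}
// Hypothetical repro setup; table name, schema, and data are assumptions.
val tableName = "hudi_bulk_insert_repro"
spark.sql(s"""
  create table $tableName (
    _row_key string, name string, price int, ts long, partition_path string
  ) using hudi
  tblproperties (primaryKey = '_row_key', preCombineField = 'ts')
  partitioned by (partition_path)
""")
// 10 records with distinct primary keys, matching the 10-record case in the report.
val insertDf = spark.range(10).selectExpr(
  "cast(id as string) as _row_key",
  "concat('a', id) as name",
  "cast(id * 10 as int) as price",
  "1000 + id as ts",
  "'2021-01-05' as partition_path")
{code}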
you will get data like [^bad_data.txt], where multiple records share the same
record key even though they have different primary key values, and where
multiple files are written even though there are only 10 records.
Changing hoodie.datasource.write.operation=bulk_insert to
hoodie.datasource.write.operation=insert causes the data to be inserted
correctly, though I do not know whether bulk insert is actually used after this change.
However, if you use bulk insert with raw data like
{code:java}
spark.sql(s"""
| insert into $tableName values
| $values
|""".stripMargin
){code}
where $values is something like
{code:java}
(1, 'a1', 10, 1000, "2021-01-05"), {code}
then hoodie.datasource.write.operation=bulk_insert works as expected.
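A quick way to check for the duplicate-key symptom is to compare distinct Hudi record keys against the row count using Hudi's _hoodie_record_key metadata column (a sketch; $tableName as above):
{code:java}
// On a healthy 10-record table, both counts should be 10; when the bug
// reproduces, distinct _hoodie_record_key values will be fewer than the rows.
spark.sql(s"""
  select count(distinct _hoodie_record_key) as distinct_keys, count(*) as total_rows
  from $tableName
""").show()
{code}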
was:
On a new table with primary key _{_}row_key and partitioned by partition_path,
if you do a bulk insert by{_}
{code:java}
insertDf.createOrReplaceTempView("insert_temp_table")
spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
spark.sql("set hoodie.sql.bulk.insert.enable=true")
spark.sql("set hoodie.sql.insert.mode=non-strict")
spark.sql(s"insert into $tableName select * from insert_temp_table") {code}
you will get data with: [^bad_data.txt] where multiple records have the same
key even though they have different primary key values, and that there are
multiple files even though there are only 10 records
changing hoodie.datasource.write.operation=bulk_insert to
hoodie.datasource.write.operation=insert causes the data to be inserted
correctly. I do not know if it is using bulk insert with this change.
However, if you use bulk insert with raw data like
{code:java}
spark.sql(s"""
| insert into $tableName values
| $values
|""".stripMargin
){code}
where $values is something like
{code:java}
(1, 'a1', 10, 1000, "2021-01-05"), {code}
then hoodie.datasource.write.operation=bulk_insert works as expected
> Spark-Sql duplicates and re-uses record keys under certain configs and use
> cases
> --------------------------------------------------------------------------------
>
> Key: HUDI-5257
> URL: https://issues.apache.org/jira/browse/HUDI-5257
> Project: Apache Hudi
> Issue Type: Bug
> Components: bootstrap, spark-sql
> Reporter: Jonathan Vexler
> Priority: Major
> Attachments: bad_data.txt
>
>
> On a new table with primary key _row_key and partitioned by partition_path,
> if you do a bulk insert by:
> {code:java}
> insertDf.createOrReplaceTempView("insert_temp_table")
> spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
> spark.sql("set hoodie.sql.bulk.insert.enable=true")
> spark.sql("set hoodie.sql.insert.mode=non-strict")
> spark.sql(s"insert into $tableName select * from insert_temp_table") {code}
> you will get data like [^bad_data.txt], where multiple records share the same
> record key even though they have different primary key values, and where
> multiple files are written even though there are only 10 records.
> Changing hoodie.datasource.write.operation=bulk_insert to
> hoodie.datasource.write.operation=insert causes the data to be inserted
> correctly, though I do not know whether bulk insert is actually used after this change.
>
> However, if you use bulk insert with raw data like
> {code:java}
> spark.sql(s"""
> | insert into $tableName values
> | $values
> |""".stripMargin
> ){code}
> where $values is something like
> {code:java}
> (1, 'a1', 10, 1000, "2021-01-05"), {code}
> then hoodie.datasource.write.operation=bulk_insert works as expected.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)