Jonathan Vexler created HUDI-5257:
-------------------------------------

             Summary: Spark-Sql duplicates and re-uses record keys under 
certain configs and use cases
                 Key: HUDI-5257
                 URL: https://issues.apache.org/jira/browse/HUDI-5257
             Project: Apache Hudi
          Issue Type: Bug
          Components: bootstrap, spark-sql
            Reporter: Jonathan Vexler
         Attachments: bad_data.txt

On a new table with primary key {{_row_key}} and partitioned by {{partition_path}}, if you do a bulk insert via:

{code:java}
insertDf.createOrReplaceTempView("insert_temp_table")
spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
spark.sql("set hoodie.sql.bulk.insert.enable=true")
spark.sql("set hoodie.sql.insert.mode=non-strict")
spark.sql(s"insert into $tableName select * from insert_temp_table") {code}
you will get the data in [^bad_data.txt], where multiple records share the same key even though they have different primary key values, and multiple files are written even though there are only 10 records.
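For reference, a minimal setup along the following lines should reproduce the scenario above; the table name, schema, and contents of {{insertDf}} are assumptions inferred from the configs and the {{$values}} example later in this report, not taken verbatim from it:

{code:java}
// Hypothetical repro setup (table name, schema, and values are assumptions).
val tableName = "hudi_bulk_insert_repro"
spark.sql(
  s"""
     |create table $tableName (
     |  _row_key string,
     |  name string,
     |  price int,
     |  ts long,
     |  partition_path string
     |) using hudi
     |partitioned by (partition_path)
     |tblproperties (primaryKey = '_row_key')
     |""".stripMargin)

// 10 records, all with distinct primary keys.
val insertDf = spark.range(10).selectExpr(
  "cast(id as string) as _row_key",
  "concat('a', id) as name",
  "cast(id * 10 as int) as price",
  "1000L as ts",
  "'2021-01-05' as partition_path")
{code}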

Changing hoodie.datasource.write.operation=bulk_insert to hoodie.datasource.write.operation=insert causes the data to be inserted correctly, though I do not know whether a bulk insert is actually performed after this change.

 

However, if you use bulk insert with raw values, e.g.
{code:java}
spark.sql(s"""         
| insert into $tableName values         
| $values 
|""".stripMargin
){code}
where $values is something like
{code:java}
(1, 'a1', 10, 1000, "2021-01-05"), {code}
then hoodie.datasource.write.operation=bulk_insert works as expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)