parisni opened a new issue, #10508:
URL: https://github.com/apache/hudi/issues/10508

   The ComplexKeyGenerator does not produce the same record key in 0.14 as in 
previous versions. This leads to duplicate data when upserting.
   
   **To Reproduce**
   
   ```python
   tableName = 'test_hudi'
   basePath = "/tmp/{tableName}".format(tableName=tableName)
   df = (
       spark.sql("select 1 event_id, 2 event_date, 3 version")
   )
   hudi_options = {
       "hoodie.table.name": tableName,
       "hoodie.datasource.write.recordkey.field": "event_id",
       "hoodie.datasource.write.partitionpath.field": "event_date",
       "hoodie.datasource.write.table.name": tableName,
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.precombine.field": "version",
       "hoodie.datasource.hive_sync.enable": "false",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.metadata.enable": "true",
   }
   
   df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
   spark.read.format("hudi").load(basePath).show()
   ```
   
   0.13.1
   ```
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|event_id|version|event_date|
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------+
   |  20240116135635023|20240116135635023...|        event_id:1|                     2|a1e0e599-c09c-44d...|       1|      3|         2|
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------+
   ```
   
   0.14.1
   ```
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|event_id|version|event_date|
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------+
   |  20240116135503412|20240116135503412...|                 1|                     2|ce35287e-af94-48c...|       1|      3|         2|
   +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------+
   ```
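
To show why the differing `_hoodie_record_key` values end up as duplicates, here is a minimal pure-Python sketch. It is not Hudi's actual ComplexKeyGenerator code: `format_record_key` and `upsert` are hypothetical helpers that only mimic the two key strings seen in the outputs above (`event_id:1` and `1`). The comma used to join several key fields is an assumption; only the single-field case is shown in this issue.

```python
# Illustrative only: mimics the two key strings observed in this issue,
# it does not reproduce Hudi's real key generator implementation.

def format_record_key(record, key_fields, with_field_prefix):
    """Build a record key string from the given fields of a record."""
    if with_field_prefix:
        # Shape observed with 0.13.1: "event_id:1"
        return ",".join(f"{name}:{record[name]}" for name in key_fields)
    # Shape observed with 0.14.1: "1"
    return ",".join(str(record[name]) for name in key_fields)


def upsert(table, key, row):
    """Toy upsert: the table is a dict indexed by the record key string."""
    table[key] = row


record = {"event_id": 1, "event_date": 2, "version": 3}
old_key = format_record_key(record, ["event_id"], with_field_prefix=True)
new_key = format_record_key(record, ["event_id"], with_field_prefix=False)

table = {}
upsert(table, old_key, record)  # row written by the older version
upsert(table, new_key, record)  # same row upserted after the upgrade

# The keys differ, so the second upsert adds a row instead of updating one.
print(old_key, new_key, len(table))
```

Since an upsert matches rows by the key string, a row written with the old key shape and upserted again with the new one is stored twice.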
   
   **Expected behavior**
   
   The _hoodie_record_key should not change across versions, or the change should 
at least be documented in the migration guide.
   
   **Environment Description**
   
   * Hudi version : 0.14.1
   
   * Spark version : 3.2.2
   

