[I] [SUPPORT] Data duplicated in base file on updating record partition [hudi]

via GitHub Thu, 28 Mar 2024 01:06:27 -0700


pravin1406 opened a new issue, #10932:
URL: https://github.com/apache/hudi/issues/10932


   **Describe the problem you faced**
   When upserting records into an empty Hudi MOR table, updating a record's
   partition column to a new value and then back to the old value, with the
   precombine key and record key unchanged, produces duplicate rows on read.
   One copy comes from the log file, and an identical copy that differs only
   in its sequence number comes from the base file.
   
   Similarly, when the partition column is updated to a new value twice, the
   record ends up duplicated in the output table after the third update. In
   this case the record is duplicated in the base file itself.
   
   
   This behaviour does not occur with Spark SQL, which uses ExpressionPayload.
   I expected the same here: when the incoming precombine key is older, no
   partition update takes place, and the partition is updated only when the
   precombine key is greater. Instead, the precombine key is not honoured and
   the partition update itself is not applied correctly.
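
To make the expectation above concrete, here is a minimal pure-Python sketch of the semantics I would expect from an upsert against a global index. It is only a model, not Hudi's implementation: `Row`, `upsert` and `read_all` are hypothetical names, and resolving ties on the precombine value in favour of the incoming record is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: str        # record key
    ordering: int   # precombine value
    partition: str  # partition column value


def upsert(table, incoming):
    """Apply one record to `table`, a dict of partition -> {key: Row}.

    Models a global index: the key is looked up across all partitions.
    Assumption: a tie on the ordering value goes to the incoming record.
    """
    for rows in table.values():
        current = rows.get(incoming.key)
        if current is None:
            continue
        if incoming.ordering < current.ordering:
            return  # older write: keep the stored row and its partition
        del rows[incoming.key]  # newer or equal: drop the old copy first
        break
    table.setdefault(incoming.partition, {})[incoming.key] = incoming


def read_all(table):
    """Return every row in the table, across all partitions."""
    return [row for rows in table.values() for row in rows.values()]


# Move the record to a new partition and back, precombine value unchanged.
table = {}
for part in ["p1", "p2", "p1"]:
    upsert(table, Row("k1", 100, part))
print(read_all(table))  # a single row is expected, never two copies
```

Under this model the key exists in exactly one partition after every write, which is the behaviour I expected from the upserts described here.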
   
   
   
   
   **To Reproduce**
   Hudi Config Used: 
   `df.write.mode("append").format("hudi")
   .option("hoodie.datasource.write.operation","upsert")
   .option("hoodie.spark.sql.insert.into.operation","upsert")
   .option("hoodie.datasource.write.precombine.field", precombine)
   .option("hoodie.datasource.write.recordkey.field", recordkey)
   .option("hoodie.datasource.write.partitionpath.field",partitionby)
   .option("hoodie.datasource.write.table.type","MERGE_ON_READ")
   .option("hoodie.datasource.write.payload.class","org.apache.hudi.common.model.DefaultHoodieRecordPayload")
   .option("hoodie.enable.data.skipping", "true")
   .option("hoodie.datasource.write.reconcile.schema", "true")
   .option("hoodie.datasource.hive_sync.support_timestamp", "true")
   .option("hoodie.upsert.shuffle.parallelism","200")
   .option("hoodie.index.type","GLOBAL_SIMPLE")
   .option("hoodie.simple.index.update.partition.path","true")
   .option("hoodie.datasource.hive_sync.enable", "true")
   .option("hoodie.datasource.hive_sync.mode", "HMS")
   .option("hoodie.datasource.hive_sync.sync_comment", "true")
   .option("hoodie.datasource.hive_sync.database","default")
   .option("hoodie.datasource.hive_sync.table",tablename)
   .option("hoodie.table.name",tablename)
   .option("hoodie.datasource.hive_sync.partition_fields",partitionby)
   .option("hoodie.datasource.write.hive_style_partitioning","true")
   .save("file:///tmp/hudi/output/"+tablename)`
   
   Steps to reproduce the behavior:
   Case A)
   
     1. Upsert one record into a Hudi table.
     2. Update the partition column to a new value and upsert the record.
     3. Update the partition column back to the older value and upsert the record.
     4. Reading the Hudi table now returns a duplicate record, with one copy
        coming from the .log file and the other from the base file.
     
   Case B)
   
     1. Upsert one record into a Hudi table.
     2. Update the partition column to a new value and upsert the record.
     3. Update the partition column to another new value and upsert the record.
     4. Reading the Hudi table now returns a duplicate record, with both
        copies coming from the same base file.
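
The two cases above can be sketched as plain data plus a duplicate check. This is a Spark-free sketch: the column names `id`, `ts` and `part` and their values are hypothetical (my real schema is only visible in the screenshots), and `batches` and `duplicate_keys` are illustrative helpers, not Hudi or Spark APIs.

```python
from collections import Counter

# Hypothetical column names; the real schema is not given in text form.
RECORD_KEY, PRECOMBINE, PARTITION = "id", "ts", "part"


def batches(partition_values):
    """One single-row batch per upsert: same key and precombine value,
    only the partition column changes between batches."""
    return [[{RECORD_KEY: 1, PRECOMBINE: 100, PARTITION: p}]
            for p in partition_values]


CASE_A = batches(["p1", "p2", "p1"])  # move to a new partition, then back
CASE_B = batches(["p1", "p2", "p3"])  # move to a new partition twice


def duplicate_keys(rows):
    """Return the record keys that occur more than once in the rows read back."""
    counts = Counter(row[RECORD_KEY] for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)
```

Each batch would be turned into a DataFrame and written with the options listed above. Applying `duplicate_keys` to the rows collected from a read of the table after the last batch should give an empty list if the upserts behaved as expected; in both cases I see the key reported as a duplicate.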
     
   
   **Expected behavior**
   
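   The record should appear exactly once on read, in the partition set by the
   write with the highest precombine key. An upsert with an older precombine
   key should not move it to another partition.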
   
   **Environment Description**
   
   * Hudi version : 0.14.1
   
   * Spark version : 3.2.0
   
   * Hive version : 3.1.2_1
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : LocalFs
   
   * Running on Docker? (yes/no) : no
   
   Input data I used
   <img width="1398" alt="Screenshot 2024-03-28 at 1 30 03 PM" 
src="https://github.com/apache/hudi/assets/25177655/0412f11f-97e7-4b47-b5f2-1dcceb37570a">
   
   Case A with duplicate output
   <img width="1440" alt="Screenshot 2024-03-28 at 1 30 40 PM" 
src="https://github.com/apache/hudi/assets/25177655/dd8e2eb5-5a98-4300-8164-392d62933dbf">
   
   Case B with duplicate output
   
   <img width="1440" alt="Screenshot 2024-03-28 at 1 31 36 PM" 
src="https://github.com/apache/hudi/assets/25177655/c1391427-f771-440d-87e9-93133a92638d">
   
   Case B proof of duplicate in the actual base file
   
   <img width="1440" alt="Screenshot 2024-03-28 at 1 35 35 PM" 
src="https://github.com/apache/hudi/assets/25177655/f6b52325-6489-4fff-91b8-5ebfa2e8e271">
   
   
   


