soumilshah1995 opened a new issue, #9263:
URL: https://github.com/apache/hudi/issues/9263

   TO @nadine farah @shivNarayan 
   
   Subject : Hard deletes in Hudi 
   
   Explanation of problem 
   
   I had a conversation with @Howard Cho yesterday on Google Meet about using hard deletes to remove duplicates, and I decided to give it a shot. Here is the scenario:
   
   Ingested batch 1 using BULK_INSERT:
   ```
   +---------+----+-------+-------+-------------------+----------------------+
   |partition|uuid|precomb|message|_hoodie_commit_time|_hoodie_commit_seqno  |
   +---------+----+-------+-------+-------------------+----------------------+
   |2        |2   |222    |mes 2  |20230722082644003  |20230722082644003_15_2|
   |1        |1   |111    |mess 1 |20230722082644003  |20230722082644003_7_1 |
   +---------+----+-------+-------+-------------------+----------------------+
   ```
   
   
   
   
   Ingested batch 2 using BULK_INSERT:
   ```
   +---------+----+-------+--------+-------------------+----------------------+
   |partition|uuid|precomb|message |_hoodie_commit_time|_hoodie_commit_seqno  |
   +---------+----+-------+--------+-------------------+----------------------+
   |3        |3   |333    |insert 3|20230722082703914  |20230722082703914_15_4|
   |2        |2   |222    |mes 2   |20230722082644003  |20230722082644003_15_2|
   |1        |1   |111    |mess 1  |20230722082644003  |20230722082644003_7_1 |
   |2        |2   |222    |upt 2   |20230722082703914  |20230722082703914_7_3 |
   +---------+----+-------+--------+-------------------+----------------------+
   ```
   The ask was to use a hard delete in Hudi to remove the record with UUID 2 whose commit time is 20230722082644003.
   
   First, I identified the record to delete:
   ```
   # Register the Hudi snapshot as a temp view. createOrReplaceTempView returns
   # None, so its result is not assigned to a variable.
   spark.read.format("hudi").load(path).createOrReplaceTempView("hudi_snapshot")
   
   # Select the oldest copy of every uuid that appears more than once.
   query = """
   SELECT *
   FROM hudi_snapshot
   WHERE (uuid, _hoodie_commit_time) IN (
       SELECT uuid, MIN(_hoodie_commit_time) AS min_commit_time
       FROM hudi_snapshot
       GROUP BY uuid
       HAVING COUNT(*) > 1
   );
   """
   delete_df = spark.sql(query)
   delete_df.select(["partition", "uuid", "precomb", "message", "_hoodie_commit_time", "_hoodie_commit_seqno"]).show(truncate=False)
   ```
   
   
   Output:
   ```
   +---------+----+-------+-------+-------------------+----------------------+
   |partition|uuid|precomb|message|_hoodie_commit_time|_hoodie_commit_seqno  |
   +---------+----+-------+-------+-------------------+----------------------+
   |2        |2   |222    |mes 2  |20230722082644003  |20230722082644003_15_2|
   +---------+----+-------+-------+-------------------+----------------------+
   ```
   Great, that identifies the record to delete.
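
The hard delete write itself is not shown in this issue. A minimal sketch of what such a write might look like, assuming the table uses `uuid` as the record key, `partition` as the partition path field, and `precomb` as the precombine field (the issue does not state the table config; `hudi_table` and `build_hard_delete_options` are hypothetical names):

```python
def build_hard_delete_options(table_name, record_key, partition_field, precombine_field):
    """Build Hudi datasource write options for a hard delete.

    All argument values are assumptions; the issue does not show the table config.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "delete",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


hudi_options = build_hard_delete_options("hudi_table", "uuid", "partition", "precomb")

# With a live Spark session, the selected rows would be written back with the
# delete operation (not executed here):
# delete_df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```

Here `delete_df` and `path` are the variables from the snippet above; everything else is illustrative.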
   
   After the hard delete:
   ```
   +---------+----+-------+--------+-------------------+----------------------+
   |partition|uuid|precomb|message |_hoodie_commit_time|_hoodie_commit_seqno  |
   +---------+----+-------+--------+-------------------+----------------------+
   |3        |3   |333    |insert 3|20230722082703914  |20230722082703914_15_4|
   |1        |1   |111    |mess 1  |20230722082644003  |20230722082644003_7_1 |
   +---------+----+-------+--------+-------------------+----------------------+
   ```
   
   But it also removed this row:
   ```
   |2        |2   |222    |upt 2   |20230722082703914  |20230722082703914_7_3 |
   ```
   
   which it should not have.
   Any advice, feedback, or suggestions on how to handle hard deletes in this case?
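
One possible explanation, stated as an assumption rather than a confirmed diagnosis: Hudi's delete operation locates records by record key and partition path, not by `_hoodie_commit_time`, so both rows that share key `2` in partition `2` would match the delete. The plain-Python sketch below only models that behavior on the rows from this issue; `delete_by_key` is a hypothetical helper, not a Hudi API:

```python
# Rows after batch 2, as (partition, uuid, message, commit_time).
rows = [
    ("3", "3", "insert 3", "20230722082703914"),
    ("2", "2", "mes 2", "20230722082644003"),
    ("1", "1", "mess 1", "20230722082644003"),
    ("2", "2", "upt 2", "20230722082703914"),
]


def delete_by_key(table, keys):
    """Drop every row whose (partition, uuid) is in keys; commit time is ignored."""
    return [r for r in table if (r[0], r[1]) not in keys]


# The row picked for deletion contributes only its key: partition 2, uuid 2.
remaining = delete_by_key(rows, {("2", "2")})
print(remaining)
```

Under that assumption, the two remaining rows are the same ones shown in the post-delete table above.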
   
   Thank you for your time in advance,
   Shah

