soumilshah1995 opened a new issue, #9263:
URL: https://github.com/apache/hudi/issues/9263
To: @nadine farah @shivNarayan
Subject: Hard deletes in Hudi
Explanation of the problem
I spoke with @Howard Cho yesterday on Google Meet about using hard deletes to
remove duplicates, and I decided to give it a shot. Here is the scenario.
Ingested batch 1 using BULK_INSERT:
```
+---------+----+-------+-------+-------------------+----------------------+
|partition|uuid|precomb|message|_hoodie_commit_time|_hoodie_commit_seqno  |
+---------+----+-------+-------+-------------------+----------------------+
|2        |2   |222    |mes 2  |20230722082644003  |20230722082644003_15_2|
|1        |1   |111    |mess 1 |20230722082644003  |20230722082644003_7_1 |
+---------+----+-------+-------+-------------------+----------------------+
```
Ingested batch 2 using BULK_INSERT:
```
+---------+----+-------+--------+-------------------+----------------------+
|partition|uuid|precomb|message |_hoodie_commit_time|_hoodie_commit_seqno  |
+---------+----+-------+--------+-------------------+----------------------+
|3        |3   |333    |insert 3|20230722082703914  |20230722082703914_15_4|
|2        |2   |222    |mes 2   |20230722082644003  |20230722082644003_15_2|
|1        |1   |111    |mess 1  |20230722082644003  |20230722082644003_7_1 |
|2        |2   |222    |upt 2   |20230722082703914  |20230722082703914_7_3 |
+---------+----+-------+--------+-------------------+----------------------+
```
The ask was to use a hard delete in Hudi to remove only the record with UUID 2
whose commit time is 20230722082644003, keeping the newer record with the same
UUID.
```
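# Assumes an active SparkSession `spark` and the Hudi table base path `path`.
# createOrReplaceTempView returns None, so its result is not assigned.
spark.read.format("hudi").load(path).createOrReplaceTempView("hudi_snapshot")

# For every uuid that occurs more than once, select the row(s) with the
# earliest commit time.
query = """
SELECT *
FROM hudi_snapshot
WHERE (uuid, _hoodie_commit_time) IN (
    SELECT uuid, MIN(_hoodie_commit_time) AS min_commit_time
    FROM hudi_snapshot
    GROUP BY uuid
    HAVING COUNT(*) > 1
)
"""
delete_df = spark.sql(query)
delete_df.select(
    "partition", "uuid", "precomb", "message",
    "_hoodie_commit_time", "_hoodie_commit_seqno",
).show(truncate=False)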
```
OUTPUT
```
+---------+----+-------+-------+-------------------+----------------------+
|partition|uuid|precomb|message|_hoodie_commit_time|_hoodie_commit_seqno  |
+---------+----+-------+-------+-------------------+----------------------+
|2        |2   |222    |mes 2  |20230722082644003  |20230722082644003_15_2|
+---------+----+-------+-------+-------------------+----------------------+
```
Great, we identified the record to delete.
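The delete write itself is not shown in this report. Below is a minimal sketch of how such a hard delete is commonly issued from PySpark. It assumes `uuid` is the record key, `partition` is the partition path field, and `precomb` is the precombine field. Those field roles, the `table_name` argument, and both helper names are assumptions for illustration, not taken from the actual job.

```python
def build_delete_options(table_name):
    """Build Hudi write options for a delete.

    ASSUMPTION: the field roles below are guessed from the column names
    in the tables above; `table_name` is a hypothetical parameter.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "uuid",
        "hoodie.datasource.write.partitionpath.field": "partition",
        "hoodie.datasource.write.precombine.field": "precomb",
        # "delete" asks Hudi to remove the records matching the incoming keys.
        "hoodie.datasource.write.operation": "delete",
    }


def hard_delete(delete_df, path, table_name):
    """Write `delete_df` back to the table at `path` as a delete (hypothetical helper)."""
    (
        delete_df.write.format("hudi")
        .options(**build_delete_options(table_name))
        .mode("append")
        .save(path)
    )
```

Note that nothing in these options refers to `_hoodie_commit_time`, so the commit time selected by the query plays no part in the write.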
After the hard delete:
```
+---------+----+-------+--------+-------------------+----------------------+
|partition|uuid|precomb|message |_hoodie_commit_time|_hoodie_commit_seqno  |
+---------+----+-------+--------+-------------------+----------------------+
|3        |3   |333    |insert 3|20230722082703914  |20230722082703914_15_4|
|1        |1   |111    |mess 1  |20230722082644003  |20230722082644003_7_1 |
+---------+----+-------+--------+-------------------+----------------------+
```
But it also removed this record:
```
|2        |2   |222    |upt 2   |20230722082703914  |20230722082703914_7_3 |
```
which it should not have.
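One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the delete matches rows on record key and partition path only, so every row sharing `uuid` 2 in partition 2 is removed regardless of commit time. The pure-Python sketch below models that behavior on the four rows from batch 2. It does not call Hudi, and the function names are hypothetical.

```python
# Rows as they appear after batch 2 (values copied from the tables above).
rows = [
    {"partition": "3", "uuid": "3", "message": "insert 3", "commit_time": "20230722082703914"},
    {"partition": "2", "uuid": "2", "message": "mes 2", "commit_time": "20230722082644003"},
    {"partition": "1", "uuid": "1", "message": "mess 1", "commit_time": "20230722082644003"},
    {"partition": "2", "uuid": "2", "message": "upt 2", "commit_time": "20230722082703914"},
]


def oldest_duplicates(rows):
    """Roughly mirror the SQL query: for each uuid seen more than once,
    pick the row with the earliest commit time."""
    by_uuid = {}
    for row in rows:
        by_uuid.setdefault(row["uuid"], []).append(row)
    return [
        min(group, key=lambda r: r["commit_time"])
        for group in by_uuid.values()
        if len(group) > 1
    ]


def delete_by_key(rows, to_delete):
    """ASSUMPTION: the delete matches on (partition, uuid) only."""
    keys = {(r["partition"], r["uuid"]) for r in to_delete}
    return [r for r in rows if (r["partition"], r["uuid"]) not in keys]


def delete_by_key_and_commit(rows, to_delete):
    """The intended behavior: also match on the commit time."""
    keys = {(r["partition"], r["uuid"], r["commit_time"]) for r in to_delete}
    return [
        r for r in rows
        if (r["partition"], r["uuid"], r["commit_time"]) not in keys
    ]


to_delete = oldest_duplicates(rows)
print([r["message"] for r in delete_by_key(rows, to_delete)])
print([r["message"] for r in delete_by_key_and_commit(rows, to_delete)])
```

Under this assumption the key-only delete reproduces the table above (both `uuid` 2 rows gone), while the result I expected corresponds to the second function.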
Any advice, feedback, or suggestions on how to handle hard deletes in this case?
Thank you in advance for your time,
Shah