nsivabalan commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1474205280

   Probably, here is what you can do:
   1. Query the table to find all duplicates.
   2. Store the dupes to some staging location (e.g. df.write.parquet).
   3. Issue deletes for these records against the Hudi table.
   4. From the same staged batch, de-duplicate to pick one version of each record and
ingest it back into Hudi using upsert.
   
   If anything crashes in between, you always have the staging data. This is
just to ensure that if your process crashes after deleting from the Hudi table, you
do not lose track of the records, because a snapshot query will not return them.
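   To make step 4 concrete, here is a minimal plain-Python sketch of the de-duplication logic (in practice you would do this with a Spark window/row_number over the record key, then write with Hudi's upsert operation). The record shape is hypothetical: I assume each record has a `key` field (record key) and a `ts` field (precombine/ordering field), which are not from the original issue.

   ```python
   # Sketch of step 4: given the staged batch of duplicates, keep only the
   # latest version per record key, mimicking Hudi's precombine-on-upsert.
   def pick_latest(records):
       latest = {}
       for rec in records:
           key = rec["key"]
           # Keep the record with the highest ts for each key.
           if key not in latest or rec["ts"] > latest[key]["ts"]:
               latest[key] = rec
       return list(latest.values())

   # Hypothetical staged batch read back from the staging parquet.
   staged = [
       {"key": "a", "ts": 1, "val": "old"},
       {"key": "a", "ts": 3, "val": "new"},
       {"key": "b", "ts": 2, "val": "only"},
   ]
   deduped = pick_latest(staged)
   # deduped now holds one version per key; this is the batch you would
   # upsert back into the Hudi table after the deletes have committed.
   ```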
   

