vinothchandar commented on PR #18159:
URL: https://github.com/apache/hudi/pull/18159#issuecomment-3953318016

   Posting here, so it's not lost inside an inline comment.
   re: https://github.com/apache/hudi/pull/18159#discussion_r2835497856 
   
   IIUC, what you propose will keep the same memory overhead on the driver, i.e., 
`List<CleanInfo>`, which is file-scale. By left join, do you mean joining the 
`table_files` and `clean_files` records to find removal candidates? Is this 
meant to address blob reference URL uniqueness as well? Please clarify. 
   
   I thought of a few more tricky aspects:
   
   [1] There are too many blob references to write into the clean metadata on 
the timeline. It's the same issue: this is record-scale data that would be 
serialized into an Avro file and read by the driver, which will OOM. So, do we 
need some special metadata? 
   
   [2] Should we do this in an idempotent manner during execution, not planning? 
If so, does it open up issues with concurrent actions making it unsafe? (I 
think it works if we assume 1:1 mappings from record key to blob reference.)
   
   [3] Regardless, do we ensure idempotent behavior for failed clean execution 
and retry by first deleting the blob reference, then deleting the file slice, 
such that we never lose the "pointer" to the blob reference?
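
   To make the ordering in [3] concrete, here is a rough sketch. It is not Hudi 
code: `FakeStorage`, `deleteIfExists`, and `cleanFileSlice` are hypothetical 
names, the storage is an in-memory set, and the blob references are passed in 
directly rather than read from the file slice. The only point is that deleting 
blobs before the file slice, with deletes that tolerate missing paths, lets a 
failed clean be retried safely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IdempotentBlobClean {

  /** Hypothetical in-memory stand-in for table storage; not a Hudi API. */
  static class FakeStorage {
    final Set<String> paths = new HashSet<>();
    final List<String> deleteLog = new ArrayList<>();

    // Deleting a path that is already gone is a no-op, so retries are harmless.
    boolean deleteIfExists(String path) {
      boolean removed = paths.remove(path);
      if (removed) {
        deleteLog.add(path);
      }
      return removed;
    }
  }

  /**
   * Deletes the referenced blobs first and the file slice last. If execution
   * fails part-way, the file slice (the "pointer" to the blobs) still exists,
   * so a retry can find and delete whatever blobs remain.
   */
  static void cleanFileSlice(FakeStorage storage, String fileSlicePath, List<String> blobRefs) {
    for (String blob : blobRefs) {
      storage.deleteIfExists(blob);
    }
    storage.deleteIfExists(fileSlicePath);
  }

  public static void main(String[] args) {
    FakeStorage storage = new FakeStorage();
    storage.paths.addAll(Arrays.asList("blob/1", "blob/2", "slice/f1"));
    List<String> refs = Arrays.asList("blob/1", "blob/2");

    cleanFileSlice(storage, "slice/f1", refs);
    // A retry of the same clean finds nothing left to delete.
    cleanFileSlice(storage, "slice/f1", refs);

    System.out.println("delete order: " + storage.deleteLog);
    System.out.println("remaining paths: " + storage.paths);
  }
}
```

   This says nothing about the concurrency question in [2]; it assumes a single 
cleaner and a 1:1 mapping from file slice to its blob references.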
   
   Can we write up a mini design that addresses these? We can then proceed 
accordingly.


