vinothchandar commented on PR #18159: URL: https://github.com/apache/hudi/pull/18159#issuecomment-3953318016
Posting here so it's not lost inside an inline comment. Re: https://github.com/apache/hudi/pull/18159#discussion_r2835497856

IIUC, what you propose will keep the same memory overhead on the driver, i.e. `List<CleanInfo>`, which is file-scale. By left join, do you mean joining the `table_files` and `clean_files` records to find removal candidates? Is this meant to address blob reference URL uniqueness as well? Please clarify.

I thought of a few more tricky aspects:

[1] The blob references are too many to be written into the clean metadata on the timeline. Same issue: it's record-scale data that will be serialized into an Avro file and read by the driver, which will OOM. So do we need some special metadata?

[2] Should we do this in an idempotent manner during execution, not planning? If so, does it open up issues with concurrent actions making it unsafe? (I think it works if we assume 1:1 mappings from record key to blob reference.)

[3] Regardless, do we ensure idempotent behavior for failed clean execution and retry by first deleting the blob reference and then deleting the file slice, so that we never lose the "pointer" to the blob reference?

Can we write up a mini design that addresses these? We can then proceed accordingly.
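
To make the ordering in [3] concrete, here is a rough sketch of one way it could behave. It is not Hudi code: `InMemoryStore`, `execute_clean`, and the `fail_before_slice` crash hook are hypothetical names made up for illustration, and it assumes each blob is referenced by exactly one file slice (the 1:1 assumption from [2]).

```python
class InMemoryStore:
    """Hypothetical stand-in for blob storage plus table file slices."""

    def __init__(self, slices):
        # file slice id -> blob reference URLs that the slice points to
        self.slices = {sid: list(refs) for sid, refs in slices.items()}
        self.blobs = {url for refs in slices.values() for url in refs}

    def delete_blob(self, url):
        # Deleting a blob that is already gone is a no-op (idempotent).
        self.blobs.discard(url)

    def delete_slice(self, slice_id):
        # Deleting a slice that is already gone is a no-op (idempotent).
        self.slices.pop(slice_id, None)


def execute_clean(store, slice_ids, fail_before_slice=None):
    """Delete blobs first, then the file slice that points to them.

    If the run dies between the two steps, the slice (the "pointer")
    still exists, so a retry can rediscover and re-delete its blobs.
    """
    for slice_id in slice_ids:
        refs = store.slices.get(slice_id)
        if refs is None:
            continue  # slice already cleaned by an earlier attempt
        for url in refs:
            store.delete_blob(url)
        if slice_id == fail_before_slice:
            # Simulated crash after blob deletion, before slice deletion.
            raise RuntimeError("simulated crash")
        store.delete_slice(slice_id)
```

In this sketch a failed attempt leaves the file slice in place with its blobs already gone, and rerunning `execute_clean` with the same slice ids finishes the job. The reverse order (slice first, blobs second) would orphan the blobs on a crash, since nothing would point to them anymore. Concurrent writers and blobs shared across slices are out of scope here.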
