kbuci commented on issue #17844: URL: https://github.com/apache/hudi/issues/17844#issuecomment-3820542082
Thanks for the discussion @vinothchandar

> empty clean for now, without any additional format changes or complexity (stop gap)

As requested, here is a PR for supporting an empty clean plan: https://github.com/apache/hudi/pull/11605/changes . Surya had created this back in HUDI 0.x, but I can rebase it on latest master if it helps.

> To ponder; if we can maintain the table service state also in the MDT..

For our use cases we want to support using "restore" to restore a dataset to an earlier instant on the timeline. I believe this approach would satisfy that requirement, since the MDT would also be restored to said instant. But thinking out loud:

- Even if this lets us drop the need for empty clean, we would still need some "operation" that doesn't actually clean anything but writes a new ECTR to the MDT, plus some configuration/logic to determine who (clean? archival?) attempts this operation and how often. So even though this prevents us from having to block archival on ECTR, it looks to me like we would still have to orchestrate some operation at a regular cadence, the same way we do for empty clean.
- In our org's 0.14 HUDI workloads, we sometimes have to delete and re-create the MDT to mitigate bugs/issues that emerge (since re-creating the MDT is much cheaper for us than re-creating the entire dataset/data files). If we want to keep this flexibility/mitigation option available in 1.2 without forcing our workloads to do a full-scan clean after such an MDT re-bootstrap event, we might want to consider also duplicating the latest ECTR somewhere outside the proposed MDT partition, such as some [dot] hoodie metafile (a rough sketch of that is at the end of this comment).

> Orthogonally, love to see some numbers at which we see this issues i.e number of files and such..

Sure, let me try to find some logs/numbers on our end. But I should mention upfront that, in addition to using HUDI 0.14, we:

- Do not use the timeline server. Back in 0.10 we noticed issues (memory pressure, IIRC) for workloads where we process multiple datasets in a single "ingestion" Spark job, and we have not yet validated internally whether this is still an issue on 0.14/1.x.
- Use the in-memory filesystem view. We don't use the spillable filesystem view due to a correctness issue caused by serialization that we noticed: https://github.com/apache/hudi/issues/17957

@danny0405

> in this commit, the fs view request from clean planner goes though a set of new stateless APIs(that basically does not cache the file group on fs view), so that to midigate the memory pressure of driver.

Thanks for sharing this context. We actually backported some recent changes to clean, like https://github.com/apache/hudi/pull/10928, to our 0.14 build to try to mitigate these issues. I have not done any proper profiling, but my initial hunch is that:

- The `shouldUseBatchLookup` option in CleanPlanActionExecutor keeps adding partitions from every batch to the FS view without "unloading" the partitions from the prior batch, so driver memory usage keeps growing. And when the view is then copied over to the Spark task (which "looks for" all files to clean in a given partition), that adds more memory usage as well. (Note that, as mentioned above, although we use the MDT, we do not use the timeline server or the spillable FS view.) A sketch of what I mean by "unloading" is right below.
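To make the "unloading" idea concrete, here is a minimal, hypothetical sketch of a batched clean planner that evicts each batch of partitions from the view before loading the next one, so the driver-side cache is bounded by the batch size rather than the total partition count. `PartitionFileView` and its methods are stand-ins I made up for illustration; the real `CleanPlanActionExecutor` / `HoodieTableFileSystemView` APIs look different.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the table file-system view used during clean planning.
interface PartitionFileView {
  List<String> getFilesToClean(String partitionPath); // assumed per-partition lookup
  void loadPartitions(List<String> partitionPaths);   // assumed bulk/batch load
  void unloadPartitions(List<String> partitionPaths); // assumed eviction hook
}

public class BatchedCleanPlanSketch {

  /**
   * Plans files to clean in fixed-size partition batches, explicitly unloading each
   * batch from the view before loading the next, so the cached state does not grow
   * with the total number of partitions (the behavior described above).
   */
  static Map<String, List<String>> planClean(PartitionFileView view,
                                             List<String> partitions,
                                             int batchSize) {
    Map<String, List<String>> plan = new HashMap<>();
    for (int start = 0; start < partitions.size(); start += batchSize) {
      List<String> batch =
          partitions.subList(start, Math.min(start + batchSize, partitions.size()));
      view.loadPartitions(batch);                       // populate only this batch
      for (String partition : batch) {
        plan.put(partition, new ArrayList<>(view.getFilesToClean(partition)));
      }
      view.unloadPartitions(batch);                     // drop it before the next batch
    }
    return plan;
  }
}
```

This is only meant to show the load/use/unload pattern; whether the existing batched lookup can evict partitions this way without re-listing them later is exactly the part I have not profiled.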
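And on the earlier "thinking out loud" point about duplicating the latest ECTR outside the MDT: below is a rough, hypothetical sketch of what such a [dot] hoodie metafile could look like, written against the plain Hadoop `FileSystem` API. The file name `ectr.properties`, the property key, and the helper class are all made up; a real version would presumably go through Hudi's own storage/metafile utilities and would need to handle concurrent writers.

```java
import java.io.IOException;
import java.util.Optional;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical helper that mirrors the latest "earliest commit to retain" (ECTR)
 * into a small metafile under .hoodie/, so a clean planner could fall back to it
 * after the MDT has been deleted and re-bootstrapped, instead of doing a full-scan clean.
 */
public class EctrMetafileSketch {
  private static final String ECTR_FILE = "ectr.properties";        // assumed file name
  private static final String ECTR_KEY = "earliestCommitToRetain";  // assumed property key

  public static void writeLatestEctr(Configuration conf, String basePath, String ectrInstant)
      throws IOException {
    Path metaPath = new Path(basePath, ".hoodie/" + ECTR_FILE);
    FileSystem fs = metaPath.getFileSystem(conf);
    Properties props = new Properties();
    props.setProperty(ECTR_KEY, ectrInstant);
    try (FSDataOutputStream out = fs.create(metaPath, true)) {      // overwrite with the newest value
      props.store(out, "latest earliest-commit-to-retain, mirrored outside the MDT");
    }
  }

  public static Optional<String> readLatestEctr(Configuration conf, String basePath)
      throws IOException {
    Path metaPath = new Path(basePath, ".hoodie/" + ECTR_FILE);
    FileSystem fs = metaPath.getFileSystem(conf);
    if (!fs.exists(metaPath)) {
      return Optional.empty();                                      // no mirror yet; caller falls back to existing behavior
    }
    Properties props = new Properties();
    try (FSDataInputStream in = fs.open(metaPath)) {
      props.load(in);
    }
    return Optional.ofNullable(props.getProperty(ECTR_KEY));
  }
}
```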
