kbuci commented on issue #17844:
URL: https://github.com/apache/hudi/issues/17844#issuecomment-3775629090
> Thanks [@kbuci](https://github.com/kbuci) for the feature ask.
>
> * May I know if we are looking for a fix in latest master (1.x), or 0.x as well?

Just for 1.x, latest master.

> * Clarification on the 2nd fix within your organization: I understand the config controls the "num instants to clean" in every clean operation. But did you guys also consider a "num files to clean" config? With num files, we could set it to 100k or something, which would work for all datasets, right? Just that with "num instants to clean", as you pointed out, we might need to manually configure different values for different streams.

We did consider a "num files to clean" config in internal design discussions, for the reasons you raised. Essentially I was thinking of having two types of "knobs":

- A "num files to clean" limit.
- A limit on the storage footprint of the clean planner itself. For example, even if there is nothing to clean, if there are thousands of partitions with thousands of files each, the in-memory file system view built during clean planning may cause memory pressure. For such datasets/workloads we would probably want some limit on `partition count scanned x files per partition`. For incremental clean this may mean that, in the worst case, we only move the earliest commit to retain (ECTR) forward by one instant. For full-scan clean it's a bit trickier: we would have to "partially" clean a subset of the partitions per run of clean (see the first sketch at the end of this comment).

Also, even if we were to add these configs, the next issue would be how users know what values to set. It would be ideal if a user could pass in one general config for clean dictating "how much memory should clean use at most", and Hudi could infer the values of the aforementioned configs from it (see the second sketch below).

> Was just curious on any brainstorming you did before going ahead w/ the "num instants to clean" based solution.

Back when we implemented that config internally, it was done urgently so that we could quickly mitigate production issues with upsert-heavy datasets: if clean was paused for a while and resumed later, the next clean would have to process a large enough number of files to cause an OOM. So regardless of whether this config makes it into the final design here, we don't expect this internal config to be sufficient for all the workloads/scenarios discussed above.
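To make the full-scan "partial clean" idea concrete, here is a minimal sketch in plain Java. None of the class, method, or parameter names below are real Hudi APIs; they are hypothetical, and the sketch only illustrates stopping the partition scan once a file budget is exhausted and deferring the remaining partitions to the next clean run:

```java
// Hypothetical sketch only -- these classes/configs do not exist in Hudi today.
// Bounds a full-scan clean plan by total files the planner would load in memory,
// deferring the remaining partitions to the next clean run.
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

public class BudgetedCleanPlanSketch {

  /** Result of one bounded planning pass. */
  static class PartialPlan {
    final List<String> partitionsToClean;   // partitions included in this clean plan
    final List<String> deferredPartitions;  // partitions picked up on the next run
    PartialPlan(List<String> toClean, List<String> deferred) {
      this.partitionsToClean = toClean;
      this.deferredPartitions = deferred;
    }
  }

  /**
   * Scan partitions in order, accumulating the number of file slices the planner
   * would have to load, and stop once the budget is exhausted.
   *
   * @param allPartitions     all partition paths eligible for cleaning
   * @param filesPerPartition hypothetical callback returning the file count per partition
   * @param maxFilesPerPlan   budget on total files the planner may hold in memory
   */
  static PartialPlan plan(List<String> allPartitions,
                          ToLongFunction<String> filesPerPartition,
                          long maxFilesPerPlan) {
    List<String> toClean = new ArrayList<>();
    long filesSeen = 0;
    int i = 0;
    for (; i < allPartitions.size(); i++) {
      long files = filesPerPartition.applyAsLong(allPartitions.get(i));
      // Always admit at least one partition so clean keeps making progress,
      // even if that single partition alone exceeds the budget.
      if (!toClean.isEmpty() && filesSeen + files > maxFilesPerPlan) {
        break; // budget exhausted; defer the rest to the next clean
      }
      filesSeen += files;
      toClean.add(allPartitions.get(i));
    }
    return new PartialPlan(toClean, allPartitions.subList(i, allPartitions.size()));
  }

  public static void main(String[] args) {
    // 3 partitions of 40k files each against a 100k budget:
    // cleans the first two now, defers the third.
    PartialPlan p = plan(List.of("2024/01/01", "2024/01/02", "2024/01/03"),
        partition -> 40_000L, 100_000L);
    System.out.println(p.partitionsToClean + " now, " + p.deferredPartitions + " deferred");
  }
}
```

The "admit at least one partition" rule is the key design choice: it guarantees forward progress, at the cost of letting one unusually large partition briefly exceed the budget.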

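And a back-of-the-envelope version of the "infer knobs from a memory budget" idea: given one user-facing memory cap, the file budget above could be derived instead of hand-tuned per stream. The per-file-slice byte estimate here is invented for illustration, not a measured Hudi number:

```java
// Hypothetical: derive the file budget from a single user-facing memory cap.
public class CleanBudgetSketch {
  // Assumed average in-memory footprint of one file slice's metadata (made up).
  static final long EST_BYTES_PER_FILE_SLICE = 2_048;

  static long inferMaxFilesPerPlan(long maxPlannerMemoryBytes) {
    return Math.max(1, maxPlannerMemoryBytes / EST_BYTES_PER_FILE_SLICE);
  }

  public static void main(String[] args) {
    // e.g. a 512 MiB cap yields a budget of ~262k file slices per clean plan
    System.out.println(inferMaxFilesPerPlan(512L * 1024 * 1024));
  }
}
```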