Akeron-Zhu opened a new issue, #4255: URL: https://github.com/apache/amoro/issues/4255
### Search before asking

- [x] I have searched in the [issues](https://github.com/apache/amoro/issues?q=is%3Aissue) and found no similar issues.

### What would you like to be improved?

Currently, the planning phase consumes excessive memory on tables that contain a very large number of small delete files.

The root cause lies in the inefficient storage of file relationships: a single DeleteFile is often associated with multiple DataFiles. In the current implementation, these associations are likely stored as explicit lists or object references. When a table has a large volume of data files referencing the same delete files, the memory needed to maintain these references grows with the number of associations rather than with the number of files.

This redundancy causes the planning index to consume significantly more heap memory than necessary, which can lead to Out-Of-Memory (OOM) errors and degraded performance during query planning.

### How should we improve?

I propose optimizing the memory layout of the planning index by introducing RoaringBitmap to compress the association between DeleteFile and DataFile.

Instead of storing explicit lists of file IDs or object references, we can use a RoaringBitmap to represent the set of DataFile IDs associated with each DeleteFile. RoaringBitmap compresses sets of integers (here, file IDs) efficiently, which significantly reduces the memory footprint required to store these many-to-many relationships.

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Subtasks

_No response_

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
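
The idea under "How should we improve?" could be sketched roughly as follows. This is only an illustration, not Amoro's actual planning code: the class `DeleteFileIndex` and all of its methods are hypothetical, file paths stand in for the real DataFile and DeleteFile objects, and `java.util.BitSet` is used as an uncompressed stand-in so that the example has no external dependencies. In a real change, `org.roaringbitmap.RoaringBitmap` would take the place of `BitSet`; it offers comparable operations (`add`, `contains`, `getCardinality`) and adds compression.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical index: maps each delete file to a bitmap of data-file IDs. */
final class DeleteFileIndex {
    // Assumption: each data file gets a dense integer ID in registration order.
    private final Map<String, Integer> dataFileIds = new HashMap<>();
    private final List<String> dataFilePaths = new ArrayList<>();
    // One bitmap per delete file; bit i is set when data file i is associated with it.
    private final Map<String, BitSet> dataFilesByDeleteFile = new HashMap<>();

    /** Returns the ID of a data file, assigning the next free ID on first use. */
    private int idOf(String dataFilePath) {
        Integer id = dataFileIds.get(dataFilePath);
        if (id == null) {
            id = dataFilePaths.size();
            dataFileIds.put(dataFilePath, id);
            dataFilePaths.add(dataFilePath);
        }
        return id;
    }

    /** Records that the delete file applies to the data file (idempotent). */
    void associate(String deleteFilePath, String dataFilePath) {
        int id = idOf(dataFilePath);
        dataFilesByDeleteFile
            .computeIfAbsent(deleteFilePath, k -> new BitSet())
            .set(id);
    }

    /** Checks a single association without materializing any list. */
    boolean isAssociated(String deleteFilePath, String dataFilePath) {
        Integer id = dataFileIds.get(dataFilePath);
        BitSet bits = dataFilesByDeleteFile.get(deleteFilePath);
        return id != null && bits != null && bits.get(id);
    }

    /** Resolves the bitmap back to data-file paths, in ID order. */
    List<String> dataFilesFor(String deleteFilePath) {
        List<String> result = new ArrayList<>();
        BitSet bits = dataFilesByDeleteFile.get(deleteFilePath);
        if (bits != null) {
            for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
                result.add(dataFilePaths.get(i));
            }
        }
        return result;
    }

    /** Number of data files associated with the delete file. */
    int referenceCount(String deleteFilePath) {
        BitSet bits = dataFilesByDeleteFile.get(deleteFilePath);
        return bits == null ? 0 : bits.cardinality();
    }
}
```

With this layout, each association costs one bit in a bitmap (or less once a compressed bitmap is used) instead of one list entry or object reference, and every data-file path is stored only once, in the ID table.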
