Akeron-Zhu opened a new issue, #4255:
URL: https://github.com/apache/amoro/issues/4255

   ### Search before asking
   
   - [x] I have searched in the 
[issues](https://github.com/apache/amoro/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### What would you like to be improved?
   
   Currently, the planning phase consumes excessive memory on tables that 
contain a very large number of small delete files. The root cause is the 
inefficient storage of file relationships: a single DeleteFile is often 
associated with multiple DataFiles.
   In the current implementation, these associations are likely stored as 
explicit lists or object references. When a large number of data files 
reference the same delete files, the memory needed to hold these references 
grows with the number of DataFile-to-DeleteFile pairs rather than with the 
number of files. This redundancy makes the planning index consume far more heap 
memory than necessary, which can lead to Out-Of-Memory (OOM) errors and 
degraded performance during query planning.
   
   ### How should we improve?
   
   I propose optimizing the memory layout of the planning index by using 
RoaringBitmap to compress the association between DeleteFile and DataFile. 
Instead of storing explicit lists of file IDs or object references, each 
DeleteFile would hold a RoaringBitmap representing the set of DataFile IDs 
associated with it. RoaringBitmap compresses sets of integers (here, file IDs) 
efficiently, which should significantly reduce the memory needed to store 
these many-to-many relationships.
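   
   To illustrate the idea, here is a minimal sketch, not the actual Amoro 
code. It assumes each data file can be given a dense integer ordinal, and it 
uses a plain Python `int` as an uncompressed bitset standing in for 
RoaringBitmap. The class name `DeleteIndexSketch`, its methods, and the file 
names are all hypothetical.

```python
from collections import defaultdict


class DeleteIndexSketch:
    """Hypothetical index: delete file -> bitset of data file ordinals.

    A Python int is used as an uncompressed bitset here; a real
    implementation would use a compressed bitmap such as RoaringBitmap.
    """

    def __init__(self):
        self._ordinal_of = {}               # data file path -> ordinal
        self._path_of = []                  # ordinal -> data file path
        self._bitmap_of = defaultdict(int)  # delete file path -> bitset

    def _ordinal(self, data_file):
        # Assign the next dense ordinal the first time a data file is seen.
        ordinal = self._ordinal_of.get(data_file)
        if ordinal is None:
            ordinal = len(self._path_of)
            self._ordinal_of[data_file] = ordinal
            self._path_of.append(data_file)
        return ordinal

    def link(self, delete_file, data_file):
        # Record the association by setting one bit, not storing a reference.
        self._bitmap_of[delete_file] |= 1 << self._ordinal(data_file)

    def applies_to(self, delete_file, data_file):
        ordinal = self._ordinal_of.get(data_file)
        if ordinal is None:
            return False
        return bool((self._bitmap_of.get(delete_file, 0) >> ordinal) & 1)

    def data_files_for(self, delete_file):
        # Decode the set bits back into data file paths, in ordinal order.
        bits = self._bitmap_of.get(delete_file, 0)
        result = []
        ordinal = 0
        while bits:
            if bits & 1:
                result.append(self._path_of[ordinal])
            bits >>= 1
            ordinal += 1
        return result


index = DeleteIndexSketch()
index.link("delete-1", "data-a")
index.link("delete-1", "data-b")
index.link("delete-2", "data-b")
```

   Each data file path is stored once, and every association costs one bit 
in this stand-in. With RoaringBitmap the cost per association would depend on 
how the IDs cluster, so the actual savings would need to be measured.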
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Subtasks
   
   _No response_
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
