JayjeetAtGithub opened a new issue, #7200:
URL: https://github.com/apache/arrow-datafusion/issues/7200

   ### Describe the bug
   
   On executing `SortPreservingMerge` on multiple streams of record batches and 
using a high-cardinality dictionary field as the sort key, the `RowConverter` 
instance used to merge the multiple `RowCursorStreams` keeps growing in memory 
(as it keeps accumulating the dict mappings internally in the 
`OrderPreservingInterner` structure). This unbounded memory growth eventually 
causes data fusion to get killed by the OOM killer. 
   
   
   
   
   ### To Reproduce
   
   Detailed steps to reproduce this issue is given 
[here](https://github.com/JayjeetAtGithub/iox_observe_bench/blob/main/docs/oom_kill.md#reproducing-the-issue).
 
   
   
   ### Expected behavior
   
   `SortPreservingMerge` on streams of record batches with high-cardinality 
dictionary-encoded sort keys should be memory aware and keep memory usage 
within a user-defined limit. 
   
   ### Additional context
   
   **Possible solution:**
   
   1. Keep track of the memory usage for the `RowConverter` using the `size()` 
method which in the case of `Dictionary` fields returns the size of the 
`OrderPreservingInterner`.
   2. If the size of the `RowConverter` grows more than a user-defined memory 
limit, take note of the `RowCursorStream` that are still getting converted, 
delete the converter, create a new one, and re-do the aborted conversions.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to