JayjeetAtGithub opened a new issue, #7200: URL: https://github.com/apache/arrow-datafusion/issues/7200
### Describe the bug

When `SortPreservingMerge` runs over multiple streams of record batches with a high-cardinality dictionary field as the sort key, the `RowConverter` instance used to merge the `RowCursorStream` instances keeps growing in memory, because it accumulates the dictionary mappings internally in the `OrderPreservingInterner` structure. This unbounded memory growth eventually causes DataFusion to be killed by the OOM killer.

### To Reproduce

Detailed steps to reproduce this issue are given [here](https://github.com/JayjeetAtGithub/iox_observe_bench/blob/main/docs/oom_kill.md#reproducing-the-issue).

### Expected behavior

`SortPreservingMerge` on streams of record batches with high-cardinality dictionary-encoded sort keys should be memory aware and keep memory usage within a user-defined limit.

### Additional context

**Possible solution:**

1. Keep track of the memory usage of the `RowConverter` using the `size()` method, which for `Dictionary` fields returns the size of the `OrderPreservingInterner`.
2. If the size of the `RowConverter` grows beyond a user-defined memory limit, take note of the `RowCursorStream` instances that are still being converted, delete the converter, create a new one, and redo the aborted conversions.
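The two steps of the possible solution can be sketched as follows. This is a minimal illustration under stated assumptions, not DataFusion or arrow-rs code: `ToyConverter` is a hypothetical stand-in that only mimics the property that every distinct key stays interned until the converter is dropped (it does not preserve sort order), and `BoundedConverter`, `limit_bytes`, and `generation` are names invented for this example.

```rust
use std::collections::HashMap;
use std::mem::size_of;

/// Hypothetical stand-in for a converter that interns dictionary keys.
/// NOT the arrow-rs `RowConverter`: ids here are not order-preserving.
struct ToyConverter {
    interned: HashMap<String, u32>,
}

impl ToyConverter {
    fn new() -> Self {
        Self { interned: HashMap::new() }
    }

    /// Map each key to an id, interning keys that have not been seen before.
    fn convert(&mut self, keys: &[&str]) -> Vec<u32> {
        let mut ids = Vec::with_capacity(keys.len());
        for key in keys {
            let next = self.interned.len() as u32;
            let id = *self.interned.entry(key.to_string()).or_insert(next);
            ids.push(id);
        }
        ids
    }

    /// Rough estimate of retained bytes: key bytes plus one id per key.
    fn size(&self) -> usize {
        self.interned.keys().map(|k| k.len() + size_of::<u32>()).sum()
    }
}

/// Hypothetical wrapper: step 1 tracks `size()`, step 2 resets on overflow.
struct BoundedConverter {
    inner: ToyConverter,
    limit_bytes: usize,
    /// Incremented on every reset so callers can detect stale conversions.
    generation: u64,
}

impl BoundedConverter {
    fn new(limit_bytes: usize) -> Self {
        Self { inner: ToyConverter::new(), limit_bytes, generation: 0 }
    }

    /// Convert one batch and return (generation, ids).
    /// If the converter exceeds the limit, it is replaced by a fresh one and
    /// the batch is converted again. A single batch larger than the limit
    /// still exceeds it; this sketch does not handle that case.
    fn convert(&mut self, keys: &[&str]) -> (u64, Vec<u32>) {
        let mut ids = self.inner.convert(keys);
        if self.inner.size() > self.limit_bytes {
            self.inner = ToyConverter::new(); // drops all accumulated mappings
            self.generation += 1;
            ids = self.inner.convert(keys); // redo the aborted conversion
        }
        (self.generation, ids)
    }
}
```

Output from different generations is not comparable, so a caller that sees the generation change would need to reconvert every batch still in flight on the other streams before continuing the merge. That reconversion is the cost of this approach.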
