nuno-faria commented on issue #17597: URL: https://github.com/apache/datafusion/issues/17597#issuecomment-3302639721
> This should be a memory leak, and would be great to fix. > > BTW, regarding the `limit 20000` being slower than limit: I think TopK operator's heap implementation will be inevitably slower than `SortExec` + `LimitExec` when the k is large, due to constant factors in the implementation. IIRC now the planner always opt to `TopK` path when there is a `LIMIT`, it can be better to use some heuristic to make a better decision (for very large K, opt to `SortExec` + `LimitExec` instead of `TopK`); and also maybe add an option to disable `TopK` Thanks for feedback. I think the issue was that the record batches in TopK were being referenced multiple times unnecessarily, causing huge allocations. I pushed a fix to solve it in #17622. This also solved the execution time problem, now it is faster with the `LIMIT` than without it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
