nuno-faria commented on issue #17597:
URL: https://github.com/apache/datafusion/issues/17597#issuecomment-3302639721

   > This should be a memory leak, and would be great to fix.
   > 
   > BTW, regarding the `limit 20000` being slower than limit: I think TopK 
operator's heap implementation will be inevitably slower than `SortExec` + 
`LimitExec` when the k is large, due to constant factors in the implementation. 
IIRC now the planner always opt to `TopK` path when there is a `LIMIT`, it can 
be better to use some heuristic to make a better decision (for very large K, 
opt to `SortExec` + `LimitExec` instead of `TopK`); and also maybe add an 
option to disable `TopK`
   
   Thanks for feedback. I think the issue was that the record batches in TopK 
were being referenced multiple times unnecessarily, causing huge allocations. I 
pushed a fix to solve it in #17622. This also solved the execution time 
problem, now it is faster with the `LIMIT` than without it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to