ctsk commented on issue #13729: URL: https://github.com/apache/datafusion/issues/13729#issuecomment-2594124886
I checked out your second point and was able to replicate the difference in speed between the two queries. I don't yet see how a TopK operator would help the non-ordered query. When comparing the query plans, it seems like the "limit" is not propagated to AggregateExec: Query with OrderBy: | SortPreservingMergeExec: [min(hits.AdvEngineID)@1 ASC NULLS LAST], fetch=10 | SortExec: TopK(fetch=10), expr=[min(hits.AdvEngineID)@1 ASC NULLS LAST], preserve_partitioning=[true] | AggregateExec: mode=FinalPartitioned, gby=[UserID@0 as UserID], aggr=[min(hits.AdvEngineID)], lim=[10] | CoalesceBatchesExec: target_batch_size=8192 | RepartitionExec: partitioning=Hash([UserID@0], 8), input_partitions=8 | AggregateExec: mode=Partial, gby=[UserID@0 as UserID], aggr=[min(hits.AdvEngineID)], lim=[10] Query without OrderBy: | GlobalLimitExec: skip=0, fetch=10 | CoalescePartitionsExec | AggregateExec: mode=FinalPartitioned, gby=[UserID@0 as UserID], aggr=[min(hits.AdvEngineID)] | CoalesceBatchesExec: target_batch_size=8192 | RepartitionExec: partitioning=Hash([UserID@0], 8), input_partitions=8 | AggregateExec: mode=Partial, gby=[UserID@0 as UserID], aggr=[min(hits.AdvEngineID)] After hackily adding the limit to the aggregation, the query was faster without the orderby (1.4 seconds vs 1.55 seconds). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org