avantgardnerio commented on PR #7192: URL: https://github.com/apache/arrow-datafusion/pull/7192#issuecomment-1672072928
TLDR: with the naive, unoptimized version in place, it looks to be 2X slower according to a test with [realistic data](https://github.com/avantgardnerio/arrow-datafusion/blob/deb23674b7fc45a40d74e241b0504c5067d83751/datafusion/core/src/physical_plan/aggregates/priority_queue.rs#L399):

<img width="371" alt="Screenshot 2023-08-09 at 1 58 03 PM" src="https://github.com/apache/arrow-datafusion/assets/3855243/c725e323-5082-4f8d-9197-34ff6774975e">

This is because currently the normal aggregation runs twice, whereas with the rule enabled we get one of each: one limited and one normal (note `lim=[10]` appears only on the final `AggregateExec`):

```
GlobalLimitExec: skip=0, fetch=10
  SortPreservingMergeExec: [MAX(traces.timestamp_ms)@1 DESC], fetch=10
    SortExec: fetch=10, expr=[MAX(traces.timestamp_ms)@1 DESC]
      AggregateExec: mode=FinalPartitioned, gby=[trace_id@0 as trace_id], aggr=[MAX(traces.timestamp_ms)], lim=[10]
        CoalesceBatchesExec: target_batch_size=8192
          RepartitionExec: partitioning=Hash([trace_id@0], 10), input_partitions=10
            AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], aggr=[MAX(traces.timestamp_ms)]
              RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1
                MemoryExec: partitions=1, partition_sizes=[1]
```

I'm not super worried because:

1. it just validates the concerns we all had about tree balancing and heap allocations
2. when the new rule runs twice, the second invocation should be on negligible data, so I predict it'll be back on par with the unlimited aggregation

No matter what, this rule is much more memory efficient. I'll pass the limit down the tree and we'll see if I'm right and we match speed.
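For anyone following along, here is a minimal, self-contained sketch of the limited-aggregation idea under discussion. It is an illustration only, not the code in this PR: `TopKMaxAccumulator` and its API are hypothetical. The point it demonstrates is the memory claim above: retain only the top k groups by running MAX and prune the rest, so memory stays proportional to k rather than to the number of distinct groups.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of a limited ("top k") MAX aggregation: only the k
/// largest groups are kept, so memory is bounded by k rather than by the
/// number of distinct groups. Not the actual implementation in this PR.
struct TopKMaxAccumulator {
    k: usize,
    /// group key -> running max for that group
    maxes: HashMap<String, i64>,
}

impl TopKMaxAccumulator {
    fn new(k: usize) -> Self {
        Self { k, maxes: HashMap::new() }
    }

    /// Fold one (group, value) pair into the accumulator.
    fn update(&mut self, group: &str, value: i64) {
        let max = self.maxes.entry(group.to_string()).or_insert(i64::MIN);
        if value > *max {
            *max = value;
        }
        // Amortize pruning: only shrink back to k once we hold 2k groups.
        // A real implementation would use a heap instead of a full sort.
        if self.maxes.len() > 2 * self.k {
            self.prune();
        }
    }

    /// Keep only the k groups with the largest running max.
    fn prune(&mut self) {
        let mut entries: Vec<(String, i64)> = self.maxes.drain().collect();
        entries.sort_unstable_by(|a, b| b.1.cmp(&a.1)); // descending by max
        entries.truncate(self.k);
        self.maxes = entries.into_iter().collect();
    }

    /// Emit the final top-k groups, ordered by max descending.
    fn finish(mut self) -> Vec<(String, i64)> {
        self.prune();
        let mut out: Vec<(String, i64)> = self.maxes.into_iter().collect();
        out.sort_unstable_by(|a, b| b.1.cmp(&a.1));
        out
    }
}

fn main() {
    let rows = [("a", 3), ("b", 9), ("a", 7), ("c", 1), ("d", 5)];
    let mut acc = TopKMaxAccumulator::new(2);
    for (group, value) in rows {
        acc.update(group, value);
    }
    // The two groups with the largest MAX values: [("b", 9), ("a", 7)]
    println!("{:?}", acc.finish());
}
```

Pruning is lossless for `MAX` ordered descending: the kth retained max only ever grows, so values dropped with an evicted group were already at or below it and (ties aside) could never have made the final top k. That same monotonicity is what should make it safe to pass the limit further down the tree to the partial aggregate.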
