Re: [I] Use pipeline aggregation when data is implicitly sorted by group-by keys [datafusion]

via GitHub Sun, 11 Jan 2026 18:02:56 -0800


NGA-TRAN commented on issue #19655:
URL: https://github.com/apache/datafusion/issues/19655#issuecomment-3736644707


   **Some hints:** I suggest we start by running some benchmarks or 
performance tests to determine whether this is worth pursuing.
   - Create two different schemas on the same dataset: one explicitly sorted on 
`(a, b)` and another sorted only on `a`. The first should produce a group‑by 
pipeline, while the second will fall back to a group‑by hash.
   - We may also want to run these tests on a more realistic query, such as 
the one below (the data is explicitly sorted on `f_dkey, timestamp` but 
implicitly sorted by `f_dkey, date_bin, env`):
   
   ```SQL
   SELECT env, time_bin, AVG(max_bin_value) AS avg_max_value
   FROM
   (
       SELECT  f_dkey,
               date_bin(INTERVAL '30 seconds', timestamp) AS time_bin,
               env,
               MAX(value) AS max_bin_value
       FROM
           (
           SELECT
               f.f_dkey,
               d.env,
               d.service,
               d.host,
               f.timestamp,
               f.value
           FROM dimension_table d
           INNER JOIN fact_table_ordered f ON d.d_dkey = f.f_dkey
           WHERE service = 'log'
           ) AS j
       GROUP BY f_dkey, time_bin, env
   ) AS a
   GROUP BY env, time_bin
   ORDER BY env, time_bin;
   ```
   
   - Also, these 2 PRs have some related properties and tests:
      - https://github.com/apache/datafusion/pull/19124
      - https://github.com/apache/datafusion/pull/19304
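   
   As a rough sketch of the first benchmark setup, the same file could be 
registered twice with different declared orderings via DataFusion's 
`WITH ORDER` clause. The table names, columns, and path here are hypothetical 
placeholders, and the file itself must really be sorted on `(a, b)`, since 
the declared order is not verified:
   
   ```SQL
   -- Hypothetical names and path; both tables point at the same data,
   -- which is assumed to be physically sorted on (a, b).
   CREATE EXTERNAL TABLE t_sorted_ab (a INT, b INT, v DOUBLE)
   STORED AS CSV
   WITH ORDER (a ASC, b ASC)
   LOCATION '/path/to/data.csv'
   OPTIONS ('format.has_header' 'true');
   
   -- Same file, but only the ordering on `a` is declared.
   CREATE EXTERNAL TABLE t_sorted_a (a INT, b INT, v DOUBLE)
   STORED AS CSV
   WITH ORDER (a ASC)
   LOCATION '/path/to/data.csv'
   OPTIONS ('format.has_header' 'true');
   
   -- Compare the AggregateExec node in each physical plan, then time the
   -- queries without EXPLAIN.
   EXPLAIN SELECT a, b, MAX(v) FROM t_sorted_ab GROUP BY a, b;
   EXPLAIN SELECT a, b, MAX(v) FROM t_sorted_a GROUP BY a, b;
   ```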
      
   @xavlee: When you start working on this, let's chat with @gene-bordegaray 
first. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


