NGA-TRAN commented on issue #19655:
URL: https://github.com/apache/datafusion/issues/19655#issuecomment-3736644707
**Some hints:** I suggest we start by running some benchmarks or
performance tests to determine whether this is worth pursuing.
- Create two different schemas on the same dataset: one explicitly sorted on
`(a, b)` and another sorted only on `a`. The first should produce a group‑by
pipeline, while the second will fall back to a group‑by hash.
- We may also want to run these tests on a more realistic query, such as the
one below (the data is explicitly sorted on `f_dkey, timestamp` but implicitly
sorted on `f_dkey, date_bin, env`):
```SQL
SELECT env, time_bin, AVG(max_bin_value) AS avg_max_value
FROM
(
    SELECT f_dkey,
           date_bin(INTERVAL '30 seconds', timestamp) AS time_bin,
           env,
           MAX(value) AS max_bin_value
    FROM
    (
        SELECT
            f.f_dkey,
            d.env,
            d.service,
            d.host,
            f.timestamp,
            f.value
        FROM dimension_table d
        INNER JOIN fact_table_ordered f ON d.d_dkey = f.f_dkey
        WHERE service = 'log'
    ) AS j
    GROUP BY f_dkey, time_bin, env
) AS a
GROUP BY env, time_bin
ORDER BY env, time_bin;
```
- Also, these two PRs have some related properties and tests:
  - https://github.com/apache/datafusion/pull/19124
  - https://github.com/apache/datafusion/pull/19304
@xavlee: When you start working on this, let's chat with @gene-bordegaray
first.
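
To make the distinction in the first hint concrete, here is a toy Python sketch of the two aggregation strategies. It is not DataFusion code: the row layout `(a, b, value)`, the `MAX` aggregate, and the function names `streaming_max` and `hash_max` are all assumptions chosen for illustration. It only shows why input sorted on `(a, b)` lets a group-by emit results as it goes, while unsorted input needs per-group state held until the end.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical toy rows of (a, b, value), already sorted on (a, b).
ROWS = [
    (1, "x", 10),
    (1, "x", 30),
    (1, "y", 5),
    (2, "x", 7),
    (2, "x", 9),
]


def streaming_max(rows):
    """Sorted ("pipeline") aggregation: requires rows sorted on (a, b).

    Every group is contiguous, so its result is emitted as soon as the
    key changes and only one group's state is held at a time.
    """
    for key, group in groupby(rows, key=itemgetter(0, 1)):
        yield key, max(value for _, _, value in group)


def hash_max(rows):
    """Hash aggregation: works for any input order.

    State for every group is kept until the input is exhausted.
    """
    state = {}
    for a, b, value in rows:
        key = (a, b)
        state[key] = max(state.get(key, value), value)
    return state


streamed = list(streaming_max(ROWS))
hashed = hash_max(ROWS)
```

On sorted input both strategies produce the same groups; only the hash variant stays correct if the rows arrive in a different order, which is the trade-off the benchmark would measure.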
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]