Jacques Nadeau created DRILL-3910:
-------------------------------------
Summary: Leverage Calcite's Clustered Collation
Key: DRILL-3910
URL: https://issues.apache.org/jira/browse/DRILL-3910
Project: Apache Drill
Issue Type: Improvement
Components: Query Planning & Optimization
Reporter: Jacques Nadeau
Right now the streaming aggregate requires a full collation, i.e. input that is
fully sorted on the grouping keys. I was just talking to [~julianhyde], and he
pointed out that Calcite has a clustered version of collation (similar to what
MSSQL calls Segment): rows with equal keys are adjacent, but the groups
themselves need not be in any order. Realistically, the streaming aggregate
only requires a clustered collation, and we should switch to requiring that. We
should also go through the existing operators and track whether each one
maintains a clustered collation. We should then be able to have flatten produce
output that is clustered on the carry-through fields, which will let us take
better advantage of the clustered-ness of data in downstream operations.
Flatten should also expose the distribution trait on the carry-through fields.
This means that a query like this:
select a, count(b) from (
  select a, flatten(x) as b from t
) x
group by a
should be executed without redistribution of data.
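
To illustrate why clustering is enough, here is a minimal Python sketch (not
Drill or Calcite code). The helper names flatten and streaming_count and the
sample table t are hypothetical. It assumes that rows of t sharing a value of
a are already adjacent; the values of a are deliberately left unsorted. The
flatten step emits the rows derived from one input row back to back, so its
output stays clustered on the carry-through field a, and a one-pass aggregate
can count each group without a sort.

```python
from itertools import groupby


def flatten(rows, list_field, out_field):
    """Emit one row per element of list_field, carrying other fields through.

    Rows derived from the same input row are emitted consecutively, so any
    clustering of the input on the carried fields is preserved.
    """
    for row in rows:
        carried = {k: v for k, v in row.items() if k != list_field}
        for item in row[list_field]:
            yield {**carried, out_field: item}


def streaming_count(rows, key_field, value_field):
    """One-pass count(value_field) per key_field.

    Only requires that equal keys are adjacent (clustered); the keys do not
    have to arrive in sorted order.
    """
    for key, group in groupby(rows, key=lambda r: r[key_field]):
        yield key, sum(1 for r in group if r[value_field] is not None)


# Hypothetical table t: clustered on a, but not sorted on a.
t = [
    {"a": "m", "x": [1, 2]},
    {"a": "c", "x": [3]},
    {"a": "z", "x": [4, 5, 6]},
]

result = list(streaming_count(flatten(t, "x", "b"), "a", "b"))
print(result)  # [('m', 2), ('c', 1), ('z', 3)]
```

If equal values of a were scattered across non-adjacent rows of t, this
one-pass approach would emit the same key more than once, which is why the
planner would need to track the clustered trait through each operator.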