[
https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andy Grove closed ARROW-7480.
-----------------------------
Resolution: Fixed
Fixed by https://github.com/apache/arrow/pull/6625
> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns
> don't match the selected columns
> ------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-7480
> URL: https://issues.apache.org/jira/browse/ARROW-7480
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust, Rust - DataFusion
> Reporter: Kyle McCarthy
> Assignee: Andy Grove
> Priority: Major
> Fix For: 1.0.0
>
>
> There are two scenarios that cause problems but are related to the queries
> with aggregate expressions and the SQL planner. The aggregate_test_100
> dataset is used for both of the queries.
> At a high level, the issue is basically that queries containing aggregate
> expressions may generate the wrong schema.
>
> *Scenario 1*
> Columns are grouped by but not selected.
> Query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
> Error:
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of
> fields in schema")){noformat}
> While the error is an ArrowError, it actually looks like it is because the
> wrong schema is generated. In the src/sql/planner.rs file the impl for
> SqlToRel is defined. In the sql_to_rel method, it checks if the query
> contains aggregate expressions, and if it does it generates the schema from
> the columns included in group expressions and aggregate expressions.
> This in turn generates the following schema:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c1",
> data_type: Utf8,
> nullable: false,
> },
> Field {
> name: "c13",
> data_type: Utf8,
> nullable: false,
> },
> Field {
> name: "MIN",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> }{code}
> I am not super familiar with how DataFusion works under the hood, but I would
> assume that this schema is actually correct for the Aggregate logical plan,
> but isn't projecting the data correctly thus resulting in the wrong query
> result schema?
>
> *Senario 2*
> Columns are selected, but not grouped or part of an aggregate function. This
> query actually will run, but the wrong schema is produced.
> Query:
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
> fields: [
> Field {
> name: "c0",
> data_type: Utf8,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> Field {
> name: "c1",
> data_type: Float64,
> nullable: true,
> },
> ],
> metadata: {},
> } {code}
> This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MAX,
> Float64).
>
> ----
> Schema 2 is questionable since some DBMS will run the query (ex MySQL) but
> others (Postgres) will require that all the columns must be in the GROUP BY
> to be used in an aggregate function.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)