[ 
https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006516#comment-17006516
 ] 

Kyle McCarthy commented on ARROW-7480:
--------------------------------------

I can work on a fix for this, I would like some feedback from some of the other 
contributors on what to do with the second scenario since different DBs handle 
the case differently. 

> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns 
> don't match the selected columns
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7480
>                 URL: https://issues.apache.org/jira/browse/ARROW-7480
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust, Rust - DataFusion
>            Reporter: Kyle McCarthy
>            Priority: Major
>
> There are two scenarios that cause problems but are related to the queries 
> with aggregate expressions and the SQL planner. The aggregate_test_100 
> dataset is used for both of the queries. 
> At a high level, the issue is basically that queries containing aggregate 
> expressions may generate the wrong schema.
>  
> *Scenario 1*
> Columns are grouped by but not selected.
> Query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
> Error:
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of 
> fields in schema")){noformat}
> While the error is an ArrowError, it actually looks like it is because the 
> wrong schema is generated. In the src/sql/planner.rs file the impl for 
> SqlToRel is defined. In the sql_to_rel method, it checks if the query 
> contains aggregate expressions, and if it does it generates the schema from 
> the columns included in group expressions and aggregate expressions.
> This in turn generates the following schema:
> {code:java}
> Schema {
>     fields: [
>         Field {
>             name: "c1",
>             data_type: Utf8,
>             nullable: false,
>         },
>         Field {
>             name: "c13",
>             data_type: Utf8,
>             nullable: false,
>         },
>         Field {
>             name: "MIN",
>             data_type: Float64,
>             nullable: true,
>         },
>     ],
>     metadata: {},
> }{code}
> I am not super familiar with how DataFusion works under the hood, but I would 
> assume that this schema is actually correct for the Aggregate logical plan, 
> but isn't projecting the data correctly thus resulting in the wrong query 
> result schema? 
>  
> *Senario 2*
> Columns are selected, but not grouped or part of an aggregate function. This 
> query actually will run, but the wrong schema is produced.
> Query: 
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
>     fields: [
>         Field {
>             name: "c0",
>             data_type: Utf8,
>             nullable: true,
>         },
>         Field {
>             name: "c1",
>             data_type: Float64,
>             nullable: true,
>         },
>         Field {
>             name: "c1",
>             data_type: Float64,
>             nullable: true,
>         },
>     ],
>     metadata: {},
> } {code}
> This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MAX, 
> Float64).
>  
> ----
> Schema 2 is questionable since some DBMS will run the query (ex MySQL) but 
> others (Postgres) will require that all the columns must be in the GROUP BY 
> to be used in an aggregate function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to