[jira] [Commented] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

Kyle McCarthy (Jira) Sun, 19 Jan 2020 14:31:17 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019047#comment-17019047
 ]


Kyle McCarthy commented on ARROW-7480:
--------------------------------------

@[~andygrove] sorry for tagging you directly, but I was looking for some 
clarification wasn't sure the right way to ask.

In the SQL planner, I am a little confused how the column numbers are derived 
for group by and order by expressions. Using this statement:

 {{SELECT first_name FROM person ORDER BY first_name}}

It produces the SQL plan "Sort: #0 ASC\n Projection: #1\n TableScan: person 
projection=None". I was expecting it to produce "Sort: #1...". Are the column 
numbers for sorts and groups relative to the column position in the projection 
rather than the table?

Thanks!

> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns 
> don't match the selected columns
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7480
>                 URL: https://issues.apache.org/jira/browse/ARROW-7480
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust, Rust - DataFusion
>            Reporter: Kyle McCarthy
>            Priority: Major
>
> There are two scenarios that cause problems but are related to the queries 
> with aggregate expressions and the SQL planner. The aggregate_test_100 
> dataset is used for both of the queries. 
> At a high level, the issue is basically that queries containing aggregate 
> expressions may generate the wrong schema.
>  
> *Scenario 1*
> Columns are grouped by but not selected.
> Query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
> Error:
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of 
> fields in schema")){noformat}
> While the error is an ArrowError, it actually looks like it is because the 
> wrong schema is generated. In the src/sql/planner.rs file the impl for 
> SqlToRel is defined. In the sql_to_rel method, it checks if the query 
> contains aggregate expressions, and if it does it generates the schema from 
> the columns included in group expressions and aggregate expressions.
> This in turn generates the following schema:
> {code:java}
> Schema {
>     fields: [
>         Field {
>             name: "c1",
>             data_type: Utf8,
>             nullable: false,
>         },
>         Field {
>             name: "c13",
>             data_type: Utf8,
>             nullable: false,
>         },
>         Field {
>             name: "MIN",
>             data_type: Float64,
>             nullable: true,
>         },
>     ],
>     metadata: {},
> }{code}
> I am not super familiar with how DataFusion works under the hood, but I would 
> assume that this schema is actually correct for the Aggregate logical plan, 
> but isn't projecting the data correctly thus resulting in the wrong query 
> result schema? 
>  
> *Senario 2*
> Columns are selected, but not grouped or part of an aggregate function. This 
> query actually will run, but the wrong schema is produced.
> Query: 
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
>     fields: [
>         Field {
>             name: "c0",
>             data_type: Utf8,
>             nullable: true,
>         },
>         Field {
>             name: "c1",
>             data_type: Float64,
>             nullable: true,
>         },
>         Field {
>             name: "c1",
>             data_type: Float64,
>             nullable: true,
>         },
>     ],
>     metadata: {},
> } {code}
> This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MAX, 
> Float64).
>  
> ----
> Schema 2 is questionable since some DBMS will run the query (ex MySQL) but 
> others (Postgres) will require that all the columns must be in the GROUP BY 
> to be used in an aggregate function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

Reply via email to