[ https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019047#comment-17019047 ]
Kyle McCarthy commented on ARROW-7480: -------------------------------------- @[~andygrove] sorry for tagging you directly, but I was looking for some clarification wasn't sure the right way to ask. In the SQL planner, I am a little confused how the column numbers are derived for group by and order by expressions. Using this statement: {{SELECT first_name FROM person ORDER BY first_name}} It produces the SQL plan "Sort: #0 ASC\n Projection: #1\n TableScan: person projection=None". I was expecting it to produce "Sort: #1...". Are the column numbers for sorts and groups relative to the column position in the projection rather than the table? Thanks! > [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns > don't match the selected columns > ------------------------------------------------------------------------------------------------------------ > > Key: ARROW-7480 > URL: https://issues.apache.org/jira/browse/ARROW-7480 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion > Reporter: Kyle McCarthy > Priority: Major > > There are two scenarios that cause problems but are related to the queries > with aggregate expressions and the SQL planner. The aggregate_test_100 > dataset is used for both of the queries. > At a high level, the issue is basically that queries containing aggregate > expressions may generate the wrong schema. > > *Scenario 1* > Columns are grouped by but not selected. > Query: > {code:java} > SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code} > Error: > {noformat} > ArrowError(InvalidArgumentError("number of columns must match number of > fields in schema")){noformat} > While the error is an ArrowError, it actually looks like it is because the > wrong schema is generated. In the src/sql/planner.rs file the impl for > SqlToRel is defined. In the sql_to_rel method, it checks if the query > contains aggregate expressions, and if it does it generates the schema from > the columns included in group expressions and aggregate expressions. > This in turn generates the following schema: > {code:java} > Schema { > fields: [ > Field { > name: "c1", > data_type: Utf8, > nullable: false, > }, > Field { > name: "c13", > data_type: Utf8, > nullable: false, > }, > Field { > name: "MIN", > data_type: Float64, > nullable: true, > }, > ], > metadata: {}, > }{code} > I am not super familiar with how DataFusion works under the hood, but I would > assume that this schema is actually correct for the Aggregate logical plan, > but isn't projecting the data correctly thus resulting in the wrong query > result schema? > > *Senario 2* > Columns are selected, but not grouped or part of an aggregate function. This > query actually will run, but the wrong schema is produced. > Query: > {code:java} > SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code} > Schema generated: > {code:java} > Schema { > fields: [ > Field { > name: "c0", > data_type: Utf8, > nullable: true, > }, > Field { > name: "c1", > data_type: Float64, > nullable: true, > }, > Field { > name: "c1", > data_type: Float64, > nullable: true, > }, > ], > metadata: {}, > } {code} > This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MAX, > Float64). > > ---- > Schema 2 is questionable since some DBMS will run the query (ex MySQL) but > others (Postgres) will require that all the columns must be in the GROUP BY > to be used in an aggregate function. -- This message was sent by Atlassian Jira (v8.3.4#803005)