alamb commented on code in PR #12466: URL: https://github.com/apache/datafusion/pull/12466#discussion_r1768774745
########## datafusion/core/src/datasource/listing_table_factory.rs: ########## @@ -113,19 +114,39 @@ impl TableProviderFactory for ListingTableFactory { .with_collect_stat(state.config().collect_statistics()) .with_file_extension(file_extension) .with_target_partitions(state.config().target_partitions()) - .with_table_partition_cols(table_partition_cols) - .with_file_sort_order(cmd.order_exprs.clone()); + .with_table_partition_cols(table_partition_cols); options .validate_partitions(session_state, &table_path) .await?; let resolved_schema = match provided_schema { - None => options.infer_schema(session_state, &table_path).await?, + // We will need to check the table columns against the schema + // this is done so that we can do an ORDER BY for external table creation + // specifically for parquet file format. + // See: https://github.com/apache/datafusion/issues/7317 + None => { + let schema = options.infer_schema(session_state, &table_path).await?; Review Comment: This looks great. Thank you @devanbenz -- perfect ########## datafusion/sql/src/statement.rs: ########## @@ -1136,14 +1136,29 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { schema: &DFSchemaRef, planner_context: &mut PlannerContext, ) -> Result<Vec<Vec<SortExpr>>> { - // Ask user to provide a schema if schema is empty. + let mut all_results = vec![]; if !order_exprs.is_empty() && schema.fields().is_empty() { - return plan_err!( - "Provide a schema before specifying the order while creating a table." - ); + let mut results = vec![]; + for expr in order_exprs { + for ordered_expr in expr { Review Comment: I think it would be nice to use `map` / `collect` to follow the same conventions, but I don't think it is required It also took me a while to get used to the map/collect pattern. At first I thought it was just functional language hipster stuff, but then I realized that it is often a key optimization (When possible, `collect` can figure out how but the target container is and do a single allocation rather than having to grow) ########## datafusion/sqllogictest/test_files/create_external_table.slt: ########## @@ -228,3 +228,13 @@ OPTIONS ( format.delimiter '|', has_header false, compression gzip); + +# Create an external parquet table and infer schema to order by + +# query should succeed +statement ok +CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (id); + +# query should fail with bad column +statement error +CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (foo); Review Comment: Another reason this will fail is that there is already a table named `t` -- so it is probably good to check the actual error ########## datafusion/sqllogictest/test_files/create_external_table.slt: ########## @@ -228,3 +228,13 @@ OPTIONS ( format.delimiter '|', has_header false, compression gzip); + +# Create an external parquet table and infer schema to order by + +# query should succeed +statement ok +CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (id); Review Comment: Can you also add a test that shows the table is actually ordered correctly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org