Re: [PR] feat(planner): Allowing setting sort order of parquet files without specifying the schema [datafusion]

via GitHub Fri, 20 Sep 2024 08:08:05 -0700


alamb commented on code in PR #12466:
URL: https://github.com/apache/datafusion/pull/12466#discussion_r1768774745



##########
datafusion/core/src/datasource/listing_table_factory.rs:
##########
@@ -113,19 +114,39 @@ impl TableProviderFactory for ListingTableFactory {
             .with_collect_stat(state.config().collect_statistics())
             .with_file_extension(file_extension)
             .with_target_partitions(state.config().target_partitions())
-            .with_table_partition_cols(table_partition_cols)
-            .with_file_sort_order(cmd.order_exprs.clone());
+            .with_table_partition_cols(table_partition_cols);
 
         options
             .validate_partitions(session_state, &table_path)
             .await?;
 
         let resolved_schema = match provided_schema {
-            None => options.infer_schema(session_state, &table_path).await?,
+            // We will need to check the table columns against the schema
+            // this is done so that we can do an ORDER BY for external table 
creation
+            // specifically for parquet file format.
+            // See: https://github.com/apache/datafusion/issues/7317
+            None => {
+                let schema = options.infer_schema(session_state, 
&table_path).await?;

Review Comment:
   This looks great. Thank you @devanbenz -- perfect



##########
datafusion/sql/src/statement.rs:
##########
@@ -1136,14 +1136,29 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         schema: &DFSchemaRef,
         planner_context: &mut PlannerContext,
     ) -> Result<Vec<Vec<SortExpr>>> {
-        // Ask user to provide a schema if schema is empty.
+        let mut all_results = vec![];
         if !order_exprs.is_empty() && schema.fields().is_empty() {
-            return plan_err!(
-                "Provide a schema before specifying the order while creating a 
table."
-            );
+            let mut results = vec![];
+            for expr in order_exprs {
+                for ordered_expr in expr {

Review Comment:
   I think it would be nice to use `map` / `collect` to follow the same 
conventions, but I don't think it is required
   
   It also took me a while to get used to the map/collect pattern. At first I 
thought it was just functional language hipster stuff, but then I realized that 
it is often a key optimization (When possible, `collect` can figure out how but 
the target container is and do a single allocation rather than having to grow)



##########
datafusion/sqllogictest/test_files/create_external_table.slt:
##########
@@ -228,3 +228,13 @@ OPTIONS (
         format.delimiter '|',
         has_header false,
         compression gzip);
+
+# Create an external parquet table and infer schema to order by
+
+# query should succeed
+statement ok
+CREATE EXTERNAL TABLE t STORED AS parquet LOCATION 
'../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (id);
+
+# query should fail with bad column
+statement error
+CREATE EXTERNAL TABLE t STORED AS parquet LOCATION 
'../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (foo);

Review Comment:
   Another reason this will fail is that there is already a table named `t` -- 
so it is probably good to check the actual error
   
   



##########
datafusion/sqllogictest/test_files/create_external_table.slt:
##########
@@ -228,3 +228,13 @@ OPTIONS (
         format.delimiter '|',
         has_header false,
         compression gzip);
+
+# Create an external parquet table and infer schema to order by
+
+# query should succeed
+statement ok
+CREATE EXTERNAL TABLE t STORED AS parquet LOCATION 
'../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (id);

Review Comment:
   Can you also add a test that shows the table is actually ordered correctly? 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] feat(planner): Allowing setting sort order of parquet files without specifying the schema [datafusion]

Reply via email to