hiltontj commented on code in PR #8560:
URL: https://github.com/apache/arrow-datafusion/pull/8560#discussion_r1428691774


##########
datafusion/sqllogictest/test_files/parquet.slt:
##########
@@ -0,0 +1,151 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+
+#   http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# TESTS FOR PARQUET FILES
+
+# Setup basic alltypes_plain table:
+
+statement ok
+CREATE EXTERNAL TABLE alltypes_plain (
+  id  INT NOT NULL,
+  bool_col BOOLEAN NOT NULL,
+  tinyint_col TINYINT NOT NULL,
+  smallint_col SMALLINT NOT NULL,
+  int_col INT NOT NULL,
+  bigint_col BIGINT NOT NULL,
+  float_col FLOAT NOT NULL,
+  double_col DOUBLE NOT NULL,
+  date_string_col BYTEA NOT NULL,
+  string_col VARCHAR NOT NULL,
+  timestamp_col TIMESTAMP NOT NULL,
+)
+STORED AS PARQUET
+WITH HEADER ROW
+LOCATION '../../parquet-testing/data/alltypes_plain.parquet'
+
+# Test a basic query:
+
+query IT
+SELECT id, CAST(string_col AS varchar) FROM alltypes_plain
+----
+4 0
+5 1
+6 0
+7 1
+2 0
+3 1
+0 0
+1 1
+
+# Explain query on the un-ordered table, expect no "output_ordering" clause in 
physical_plan -> ParquetExec:
+
+query TT
+EXPLAIN SELECT int_col, string_col
+FROM alltypes_plain
+ORDER BY string_col, int_col
+LIMIT 10
+----
+logical_plan
+Limit: skip=0, fetch=10
+--Sort: alltypes_plain.string_col ASC NULLS LAST, alltypes_plain.int_col ASC 
NULLS LAST, fetch=10
+----TableScan: alltypes_plain projection=[int_col, string_col]
+physical_plan
+GlobalLimitExec: skip=0, fetch=10
+--SortExec: TopK(fetch=10), expr=[string_col@1 ASC NULLS LAST,int_col@0 ASC 
NULLS LAST]
+----ParquetExec: file_groups={1 group: 
[[WORKSPACE_ROOT/parquet-testing/data/alltypes_plain.parquet]]}, 
projection=[int_col, string_col]
+
+# Setup alltypes_plain, with an order clause:
+
+statement ok
+CREATE EXTERNAL TABLE alltypes_plain_with_order (
+  id  INT NOT NULL,
+  bool_col BOOLEAN NOT NULL,
+  tinyint_col TINYINT NOT NULL,
+  smallint_col SMALLINT NOT NULL,
+  int_col INT NOT NULL,
+  bigint_col BIGINT NOT NULL,
+  float_col FLOAT NOT NULL,
+  double_col DOUBLE NOT NULL,
+  date_string_col BYTEA NOT NULL,
+  string_col VARCHAR NOT NULL,
+  timestamp_col TIMESTAMP NOT NULL,
+)
+STORED AS PARQUET
+WITH HEADER ROW
+WITH ORDER (string_col ASC NULLS LAST, int_col NULLS LAST)
+LOCATION '../../parquet-testing/data/alltypes_plain.parquet'
+
+# Explain query on the ordered table, expect to see the "output_ordering" 
clause in physical_plan -> ParquetExec:
+
+query TT
+EXPLAIN SELECT int_col, string_col
+FROM alltypes_plain_with_order
+ORDER BY string_col, int_col
+LIMIT 10
+----
+logical_plan
+Limit: skip=0, fetch=10
+--Sort: alltypes_plain_with_order.string_col ASC NULLS LAST, 
alltypes_plain_with_order.int_col ASC NULLS LAST, fetch=10
+----TableScan: alltypes_plain_with_order projection=[int_col, string_col]
+physical_plan
+GlobalLimitExec: skip=0, fetch=10
+--ParquetExec: file_groups={1 group: 
[[WORKSPACE_ROOT/parquet-testing/data/alltypes_plain.parquet]]}, 
projection=[int_col, string_col], output_ordering=[string_col@1 ASC NULLS LAST, 
int_col@0 ASC NULLS LAST]
+
+# Setup alltypes_plain, from the directory, with ordering clause:
+
+statement ok
+CREATE EXTERNAL TABLE alltypes_plain_from_dir (
+  id  INT NOT NULL,
+  bool_col BOOLEAN NOT NULL,
+  tinyint_col TINYINT NOT NULL,
+  smallint_col SMALLINT NOT NULL,
+  int_col INT NOT NULL,
+  bigint_col BIGINT NOT NULL,
+  float_col FLOAT NOT NULL,
+  double_col DOUBLE NOT NULL,
+  date_string_col BYTEA NOT NULL,
+  string_col VARCHAR NOT NULL,
+  timestamp_col TIMESTAMP NOT NULL,
+)
+STORED AS PARQUET
+WITH HEADER ROW
+WITH ORDER (string_col ASC NULLS LAST, int_col NULLS LAST)
+PARTITIONED BY (string_col, int_col)
+LOCATION '../../parquet-testing/data/alltypes_dir'

Review Comment:
   > Hmm, I suspect that a better approach may be to manually register the 
tables/temp files as was done for CSV 
[here](https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/src/test_context.rs#L192-L212).
   > 
   > I will spend some time with this to see if I can do something similar for 
Parquet.
   
   I had a crack at this. Changes can be seen 
[here](https://github.com/hiltontj/arrow-datafusion/commit/0987f7ea95f31b04f15b018cded658e204356d3d).
   
   I based the approach off of how the `avro.slt` tests are setup. The issue in 
this case, is that trying to use temp files to test for the `EXPLAIN` output 
will not work. The temp files have different paths on each run, so the test 
will fail on every subsequent run. An example of the resulting error:
   
   ```
   External error: query result mismatch:
   [SQL] EXPLAIN SELECT int_col, string_col
   FROM alltypes_plain_multi_files
   ORDER BY string_col, int_col
   LIMIT 10
   [Diff] (-expected|+actual)
       logical_plan
       Limit: skip=0, fetch=10
       --Sort: alltypes_plain_multi_files.string_col ASC NULLS LAST, 
alltypes_plain_multi_files.int_col ASC NULLS LAST, fetch=10
       ----TableScan: alltypes_plain_multi_files projection=[int_col, 
string_col]
       physical_plan
       GlobalLimitExec: skip=0, fetch=10
       --SortPreservingMergeExec: [string_col@1 ASC NULLS LAST,int_col@0 ASC 
NULLS LAST], fetch=10
   -   ----ParquetExec: file_groups={4 groups: 
[[var/folders/3_/_nl7ywwd0l389b58dnksk8fm0000gn/T/.tmpVf7EaN/parquet/multi/alltypes_plain0.parquet],
 
[var/folders/3_/_nl7ywwd0l389b58dnksk8fm0000gn/T/.tmpVf7EaN/parquet/multi/alltypes_plain1.parquet],
 
[var/folders/3_/_nl7ywwd0l389b58dnksk8fm0000gn/T/.tmpVf7EaN/parquet/multi/alltypes_plain2.parquet],
 
[var/folders/3_/_nl7ywwd0l389b58dnksk8fm0000gn/T/.tmpVf7EaN/parquet/multi/alltypes_plain3.parquet]]},
 projection=[int_col, string_col], output_ordering=[string_col@1 ASC NULLS 
LAST, int_col@0 ASC NULLS LAST]
   +   ----ParquetExec: file_groups={4 groups: 
[[var/folders/3_/_nl7ywwd0l389b58dnksk8fm0000gn/T/.tmpFrLBcF/parquet/multi/alltypes_plain0.parquet],
 
[var/folders/3_/_nl7ywwd0l389b58dnksk8fm0000gn/T/.tmpFrLBcF/parquet/multi/alltypes_plain1.parquet],
 
[var/folders/3_/_nl7ywwd0l389b58dnksk8fm0000gn/T/.tmpFrLBcF/parquet/multi/alltypes_plain2.parquet],
 
[var/folders/3_/_nl7ywwd0l389b58dnksk8fm0000gn/T/.tmpFrLBcF/parquet/multi/alltypes_plain3.parquet]]},
 projection=[int_col, string_col], output_ordering=[string_col@1 ASC NULLS 
LAST, int_col@0 ASC NULLS LAST]
   ```
   
   If the aim is to just test that multi-file Parquet tables work, then we can 
use something similar to the approach used in `avro.slt`, but if we still want 
to test characteristics of the execution plan, then my original approach (or 
another solition, e.g., leaving it in Rust) might be necessary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to