simicd commented on code in PR #8857:
URL: https://github.com/apache/arrow-datafusion/pull/8857#discussion_r1452298161


##########
datafusion/sqllogictest/test_files/order.slt:
##########
@@ -578,3 +578,192 @@ SortPreservingMergeExec: [log_c12_base_c11@0 DESC]
 
 statement ok
 drop table aggregate_test_100;
+
+
+# Sort with lots of repetition values
+# Test sorting a parquet file with 2 million records that has lots of values 
that are repeated
+statement ok
+CREATE EXTERNAL TABLE repeat_much STORED AS PARQUET LOCATION 
'../../parquet-testing/data/repeat_much.snappy.parquet';
+
+query I
+SELECT a FROM repeat_much ORDER BY a LIMIT 20;
+----
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+2450962
+
+
+# Create external table with optional pre-known sort order
+# DataFusion may take advantage of this ordering to omit sorts or use more 
efficient algorithms.
+statement ok
+set datafusion.catalog.information_schema = true;
+
+statement ok
+CREATE EXTERNAL TABLE dt (a_id integer, a_str string, a_bool boolean) STORED 
AS CSV WITH ORDER (a_id ASC) LOCATION 'file://path/to/table';
+
+#TODO: How to check for order in sqllogictest?

Review Comment:
   Thanks a lot for the fast review and the tip @alamb! 
   
   I had to switch the table to an existing query since the old unit test with 
the inexistent `LOCATION 'file://path/to/table'` would only generate a logical 
plan, not the physical plan.
   
   So I changed to a TPCH table, without a `WITH ORDER (o_orderkey ASC)` clause 
the physical plan shows a SortExec:
   ```
   query TT
   EXPLAIN SELECT o_orderkey, o_orderstatus FROM orders ORDER BY o_orderkey ASC
   ----
   logical_plan
   Sort: orders.o_orderkey ASC NULLS LAST
   --TableScan: orders projection=[o_orderkey, o_orderstatus]
   physical_plan
   SortExec: expr=[o_orderkey@0 ASC NULLS LAST]
   --CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/tpch-csv/orders.csv]]}, 
projection=[o_orderkey, o_orderstatus], has_header=true
   ```
   
   Adding it drops it from the physical plan and pushes it into output_ordering 
in the CsvExec:
   ```
   query TT
   EXPLAIN SELECT o_orderkey, o_orderstatus FROM orders ORDER BY o_orderkey ASC
   ----
   logical_plan
   Sort: orders.o_orderkey ASC NULLS LAST
   --TableScan: orders projection=[o_orderkey, o_orderstatus]
   physical_plan CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/tpch-csv/orders.csv]]}, 
projection=[o_orderkey, o_orderstatus], output_ordering=[o_orderkey@0 ASC NULLS 
LAST], has_header=true
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to