pmcgleenon commented on issue #6983: URL: https://github.com/apache/arrow-datafusion/issues/6983#issuecomment-1953000796
I ran the reproducer https://github.com/apache/arrow-datafusion/issues/6983#issuecomment-1662556865 and didn't see this issue.

1. Generated the benchmark data:

```
cd benchmarks
./bench.sh data tpch10
```

2. Ran the CLI with the filter (3.2 seconds) and without it (3.5 seconds):

```
DataFusion CLI v36.0.0
❯ create external table test stored as parquet location '/Users/pmcgleen/work/arrow-datafusion/datafusion-cli/part-0.parquet';
0 rows in set. Query took 0.115 seconds.

❯ create table t as select * from test;
0 rows in set. Query took 3.527 seconds.
```

```
DataFusion CLI v36.0.0
❯ create external table test stored as parquet location '/Users/pmcgleen/work/arrow-datafusion/datafusion-cli/part-0.parquet';
0 rows in set. Query took 0.006 seconds.

❯ create table t as (select * from test where l_linenumber > 0);
0 rows in set. Query took 3.216 seconds.
```

3. Ran the Rust program with the filter (3.1 seconds) and without it (3 seconds):

```
let _df = _ctx
    .read_parquet(FILENAME, _read_options)
    .await
    .unwrap();
    // .filter(col("l_orderkey").gt(lit(0)))
    // .unwrap();
```

4. Checked the physical plan for the presence of `file_groups`, which should make the scan parallel:

```
❯ explain select * from test;
+---------------+------------------------------------------------------------+
| plan_type     | plan
+---------------+------------------------------------------------------------+
| logical_plan  | TableScan: test projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment]
| physical_plan | ParquetExec: file_groups={4 groups: [[Users/pmcgleen/work/arrow-datafusion/datafusion-cli/part-0.parquet:0..10165445], [Users/pmcgleen/work/arrow-datafusion/datafusion-cli/part-0.parquet:10165445..20330890], [Users/pmcgleen/work/arrow-datafusion/datafusion-cli/part-0.parquet:20330890..30496335], [Users/pmcgleen/work/arrow-datafusion/datafusion-cli/part-0.parquet:30496335..40661778]]}, projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment]
|               |
+---------------+------------------------------------------------------------+
2 rows in set. Query took 0.010 seconds.
```

```
❯ explain select * from test where l_orderkey > 0;
+---------------+------------------------------------------------------------+
| plan_type     | plan
+---------------+------------------------------------------------------------+
| logical_plan  | Filter: test.l_orderkey > Int64(0)
|               |   TableScan: test projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment], partial_filters=[test.l_orderkey > Int64(0)]
| physical_plan | CoalesceBatchesExec: target_batch_size=8192
|               |   FilterExec: l_orderkey@0 > 0
|               |     ParquetExec: file_groups={4 groups: [[Users/pmcgleen/work/arrow-datafusion/datafusion-cli/part-0.parquet:0..10165445], [Users/pmcgleen/work/arrow-datafusion/datafusion-cli/part-0.parquet:10165445..20330890], [Users/pmcgleen/work/arrow-datafusion/datafusion-cli/part-0.parquet:20330890..30496335], [Users/pmcgleen/work/arrow-datafusion/datafusion-cli/part-0.parquet:30496335..40661778]]}, projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment], predicate=l_orderkey@0 > 0, pruning_predicate=l_orderkey_max@0 > 0, required_guarantees=[]
|               |
+---------------+------------------------------------------------------------+
2 rows in set. Query took 0.012 seconds.
```

@alamb this looks ok to me (unless I've missed something). Does `file_groups = 4` mean the file is loaded in parallel on each of the 4 available CPUs?
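
The four byte ranges in the `file_groups` output are consistent with an even split of the 40661778-byte file into 4 chunks, with the chunk size rounded up. The sketch below reproduces that arithmetic using only the standard library. The function name `split_byte_ranges` and the ceiling-division rule are assumptions inferred from the numbers in the plan, not DataFusion's actual implementation.

```rust
use std::ops::Range;

/// Split `file_size` bytes into at most `partitions` contiguous byte ranges.
/// Assumption: the chunk size is ceil(file_size / partitions), which happens
/// to match the ranges printed in the plan; this is not DataFusion's code.
fn split_byte_ranges(file_size: u64, partitions: u64) -> Vec<Range<u64>> {
    if file_size == 0 || partitions == 0 {
        return Vec::new();
    }
    // Ceiling division so the last chunk absorbs the remainder.
    let chunk = (file_size + partitions - 1) / partitions;
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < file_size {
        let end = (start + chunk).min(file_size);
        ranges.push(start..end);
        start = end;
    }
    ranges
}

fn main() {
    // File size and group count are taken from the explain output.
    for r in split_byte_ranges(40_661_778, 4) {
        println!("part-0.parquet:{}..{}", r.start, r.end);
    }
}
```

With the file size and group count from the plan, this prints the same four ranges: `0..10165445`, `10165445..20330890`, `20330890..30496335` and `30496335..40661778`.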
