yjshen commented on a change in pull request #2000: URL: https://github.com/apache/arrow-datafusion/pull/2000#discussion_r828721274
##########
File path: datafusion/src/physical_plan/file_format/parquet.rs
##########
@@ -236,32 +237,56 @@ impl ExecutionPlan for ParquetExec {
let adapter = SchemaAdapter::new(self.base_config.file_schema.clone());
- let join_handle = task::spawn_blocking(move || {
- if let Err(e) = read_partition(
- object_store.as_ref(),
- adapter,
- partition_index,
- &partition,
- metrics,
- &projection,
- &pruning_predicate,
- batch_size,
- response_tx.clone(),
- limit,
- partition_col_proj,
- ) {
- println!(
+ let join_handle = if projection.is_empty() {
+ task::spawn_blocking(move || {
+ if let Err(e) = read_partition_no_file_columns(
+ object_store.as_ref(),
+ &partition,
+ batch_size,
+ response_tx.clone(),
Review comment:
For `SELECT year,month,day FROM t where id > 0`, I think the id>0 filter
would pass down to parquet scan even if it's not in the "select" list?
Here's the output of `explain verbose ....`, I make changes to print pushed
down filters in ParquetExec as well.
```
+-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| initial_logical_plan | Projection:
#t.year, #t.month, #t.day
|
| | Filter: #t.id >
Int64(0)
|
| | TableScan: t
projection=None
|
| logical_plan after simplify_expressions | SAME TEXT AS ABOVE
|
| logical_plan after common_sub_expression_eliminate | SAME TEXT AS ABOVE
|
| logical_plan after eliminate_limit | SAME TEXT AS ABOVE
|
| logical_plan after projection_push_down | Projection:
#t.year, #t.month, #t.day
|
| | Filter: #t.id >
Int64(0)
|
| | TableScan: t
projection=Some([0, 11, 12, 13])
|
| logical_plan after filter_push_down | Projection:
#t.year, #t.month, #t.day
|
| | Filter: #t.id >
Int64(0)
|
| | TableScan: t
projection=Some([0, 11, 12, 13]), filters=[#t.id > Int64(0)]
|
| logical_plan after limit_push_down | SAME TEXT AS ABOVE
|
| logical_plan after SingleDistinctAggregationToGroupBy | SAME TEXT AS ABOVE
|
| logical_plan after ToApproxPerc | SAME TEXT AS ABOVE
|
| logical_plan after simplify_expressions | SAME TEXT AS ABOVE
|
| logical_plan after common_sub_expression_eliminate | SAME TEXT AS ABOVE
|
| logical_plan after eliminate_limit | SAME TEXT AS ABOVE
|
| logical_plan after projection_push_down | SAME TEXT AS ABOVE
|
| logical_plan after filter_push_down | SAME TEXT AS ABOVE
|
| logical_plan after limit_push_down | SAME TEXT AS ABOVE
|
| logical_plan after SingleDistinctAggregationToGroupBy | SAME TEXT AS ABOVE
|
| logical_plan after ToApproxPerc | SAME TEXT AS ABOVE
|
| logical_plan | Projection:
#t.year, #t.month, #t.day
|
| | Filter: #t.id >
Int64(0)
|
| | TableScan: t
projection=Some([0, 11, 12, 13]), filters=[#t.id > Int64(0)]
|
| initial_physical_plan | ProjectionExec:
expr=[year@1 as year, month@2 as month, day@3 as day]
|
| | FilterExec:
CAST(id@0 AS Int64) > 0
|
| | ParquetExec:
limit=None, partitions=[year=2021/month=09/day=09/file.parquet,
year=2021/month=10/day=09/file.parquet,
year=2021/month=10/day=28/file.parquet], filters=CAST(id_max@0 AS Int64) > 0
|
| |
|
| physical_plan after aggregate_statistics | SAME TEXT AS ABOVE
|
| physical_plan after hash_build_probe_order | SAME TEXT AS ABOVE
|
| physical_plan after coalesce_batches | ProjectionExec:
expr=[year@1 as year, month@2 as month, day@3 as day]
|
| |
CoalesceBatchesExec: target_batch_size=4096
|
| | FilterExec:
CAST(id@0 AS Int64) > 0
|
| | ParquetExec:
limit=None, partitions=[year=2021/month=09/day=09/file.parquet,
year=2021/month=10/day=09/file.parquet,
year=2021/month=10/day=28/file.parquet], filters=CAST(id_max@0 AS Int64) > 0 |
| |
|
| physical_plan after repartition | ProjectionExec:
expr=[year@1 as year, month@2 as month, day@3 as day]
|
| |
CoalesceBatchesExec: target_batch_size=4096
|
| | FilterExec:
CAST(id@0 AS Int64) > 0
|
| |
RepartitionExec: partitioning=RoundRobinBatch(12)
|
| |
ParquetExec: limit=None, partitions=[year=2021/month=09/day=09/file.parquet,
year=2021/month=10/day=09/file.parquet,
year=2021/month=10/day=28/file.parquet], filters=CAST(id_max@0 AS Int64) > 0 |
| |
|
| physical_plan after add_merge_exec | SAME TEXT AS ABOVE
|
| physical_plan | ProjectionExec:
expr=[year@1 as year, month@2 as month, day@3 as day]
|
| |
CoalesceBatchesExec: target_batch_size=4096
|
| | FilterExec:
CAST(id@0 AS Int64) > 0
|
| |
RepartitionExec: partitioning=RoundRobinBatch(12)
|
| |
ParquetExec: limit=None, partitions=[year=2021/month=09/day=09/file.parquet,
year=2021/month=10/day=09/file.parquet,
year=2021/month=10/day=28/file.parquet], filters=CAST(id_max@0 AS Int64) > 0 |
| |
|
+-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
