zhuqi-lucas commented on code in PR #13788: URL: https://github.com/apache/datafusion/pull/13788#discussion_r1885717612
##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -843,8 +843,16 @@ impl TableProvider for ListingTable {
});
// TODO (https://github.com/apache/datafusion/issues/11600) remove downcast_ref from here?
let session_state = state.as_any().downcast_ref::<SessionState>().unwrap();
+
+ // We should not limit the number of partitioned files to scan if there are filters and limit
+ // at the same time. This is because the limit should be applied after the filters are applied.
+ let mut statistic_file_limit = limit;
Review Comment:
Thank you @korowa for the review. I agree that fixing this in the planner or in
an optimizer rule would be more reasonable. I tried that approach first, but I
ran into some problems, and I need more investigation and help to find out
whether it can be fixed that way.
The bug only happens with **datafusion.execution.parquet.pushdown_filters = true**,
so I started from there and compared the plans with and without this config.
**The following plans are without this config, and the result is correct:**
**The final logical plan:**
```text
logical_plan
  Limit: skip=0, fetch=1
    Filter: ?table?.a = Utf8View("GAS2")
      TableScan: ?table? projection=[a], partial_filters=[?table?.a = Utf8View("GAS2")]
```
**The initial physical plan:**
```text
initial_physical_plan
  GlobalLimitExec: skip=0, fetch=1
    LocalLimitExec: fetch=1
      FilterExec: a@0 = GAS2
        ParquetExec: file_groups={10 groups:
```
**And the following is the wrong case**, with
datafusion.execution.parquet.pushdown_filters = true:
**The final logical plan:**
```text
logical_plan
  Limit: skip=0, fetch=1
    TableScan: ?table? projection=[a], full_filters=[?table?.a = Utf8View("GAS2")], fetch=1
```
**The initial physical plan:**
```text
initial_physical_plan
  GlobalLimitExec: skip=0, fetch=1
    ParquetExec: file_groups={1 group: [[data/part-00000-87b62c7b-3844-4507-99af-c9bc3d829c70-c000.snappy.parquet]]}, projection=[a], limit=1, predicate=a@0 = GAS2, pruning_predicate=CASE WHEN a_null_count@2 = a_row_count@3 THEN false ELSE a_min@0 <= GAS2 AND GAS2 <= a_max@1 END, required_guarantees=[a in (GAS2)]
```
**The only difference I can see is that the physical plan skips many files when
the partitioned file list is built**; the planner and the optimizer seem to work well.
Moreover, **statistic_file_limit = limit** does not change the limit itself; it
is only the value I pass to the partitioned file selection.
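
To make the file-skipping effect concrete, here is a minimal, self-contained sketch. It does not use DataFusion's types: `FileStats`, `select_files`, `scan`, `file_limit` and `sample_files` are hypothetical names invented for illustration, and the selection rule (stop adding files once their row counts reach the limit) is an assumption about how statistics-based pruning behaves, not a copy of the real code.

```rust
// Hypothetical model of a data file; not a DataFusion type.
#[derive(Clone, Debug)]
struct FileStats {
    name: &'static str,
    values: Vec<&'static str>, // rows of column `a` stored in this file
}

/// Keep files, in order, until their combined row count reaches `limit`.
/// With `None`, every file is kept.
fn select_files(files: &[FileStats], limit: Option<usize>) -> Vec<FileStats> {
    let mut selected = Vec::new();
    let mut rows = 0;
    for f in files {
        if let Some(n) = limit {
            if rows >= n {
                break; // enough rows according to statistics alone
            }
        }
        rows += f.values.len();
        selected.push(f.clone());
    }
    selected
}

/// Scan the selected files, apply the filter first, then the limit.
fn scan(files: &[FileStats], needle: &str, limit: Option<usize>) -> Vec<&'static str> {
    let matching = files
        .iter()
        .flat_map(|f| f.values.iter().copied())
        .filter(|v| *v == needle);
    match limit {
        Some(n) => matching.take(n).collect(),
        None => matching.collect(),
    }
}

/// Assumed guard: only let the limit prune files when there are no filters.
fn file_limit(limit: Option<usize>, has_filters: bool) -> Option<usize> {
    if has_filters { None } else { limit }
}

/// Two files; the only matching row lives in the second one.
fn sample_files() -> Vec<FileStats> {
    vec![
        FileStats { name: "part-0", values: vec!["GAS1", "GAS1"] },
        FileStats { name: "part-1", values: vec!["GAS2"] },
    ]
}

fn main() {
    let files = sample_files();
    let limit = Some(1);

    // Limit prunes files before the filter runs: "part-1" is never read.
    let pruned = select_files(&files, limit);
    println!("pruned scan reads {:?}", pruned.iter().map(|f| f.name).collect::<Vec<_>>());
    println!("rows: {:?}", scan(&pruned, "GAS2", limit));

    // With the guard, the filter sees every file and the limit is applied afterwards.
    let all = select_files(&files, file_limit(limit, true));
    println!("guarded scan reads {:?}", all.iter().map(|f| f.name).collect::<Vec<_>>());
    println!("rows: {:?}", scan(&all, "GAS2", limit));
}
```

In this toy model the limit passed to `scan` is the same in both runs; only the value passed to file selection differs, which is the distinction the `statistic_file_limit` variable is meant to capture.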
Please correct me if I am investigating in the wrong direction. Thanks a lot!
