alamb commented on code in PR #21057:
URL: https://github.com/apache/datafusion/pull/21057#discussion_r2969462639
##########
datafusion/optimizer/src/push_down_filter.rs:
##########
@@ -1130,6 +1137,13 @@ impl OptimizerRule for PushDownFilter {
}
LogicalPlan::Join(join) => push_down_join(join, Some(&filter.predicate)),
LogicalPlan::TableScan(scan) => {
+ // If the scan has a fetch (limit), pushing filters into it
+ // would change semantics: the limit should apply before the
+ // filter, not after.
+ if scan.fetch.is_some() {
+ filter.input = Arc::new(LogicalPlan::TableScan(scan));
+ return Ok(Transformed::no(LogicalPlan::Filter(filter)));
+ }
Review Comment:
> After PushDownLimit folds the limit, we get FILTER -> TABLE_SCAN(fetch=50). This PR prevents pushing the filter into scan.filters when fetch is set.

This is an excellent point.

> So the filter can be moved past it or run after it — both are semantically correct

I am not sure they are both semantically correct. I think the limit is (or should be) applied after filters in table providers, and there are some places where that already happens. For example:
- https://github.com/apache/datafusion/blob/ee24f3c3cd320b88c5ea6a985cbc17d3a5b6b37b/datafusion/catalog-listing/src/table.rs#L511-L514
Also in the parquet data source I know the limit is applied *after* filters
as well
However, that does not appear to be explicitly documented anywhere I could find -- I will make a PR to clarify the documentation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]