zhuqi-lucas commented on PR #19446:
URL: https://github.com/apache/datafusion/pull/19446#issuecomment-3690779474
> I'm comfortable with `ProjectionExec` and `CooperativeExec` but I think
`FilterExec` needs some more discussion since pushing ordering below filters
might have a negative impact on performance for some queries / file sources.
@adriangb Happy holidays! Thank you for the feedback! I completely
understand your concern about FilterExec.
If i makes sense right, i've updated the implementation to address the
performance issue you mentioned. Now `FilterExec::try_pushdown_sort` only
pushes down the sort when the child node already has an output ordering:
```rust
fn try_pushdown_sort(
&self,
order: &[PhysicalSortExpr],
) -> Result<SortOrderPushdownResult<Arc<dyn ExecutionPlan>>> {
if let Some(_child_ordering) = self.input.output_ordering() {
match self.input.try_pushdown_sort(order)? {
SortOrderPushdownResult::Exact { inner } => {
let new_exec =
Arc::new(self.clone()).with_new_children(vec![inner])?;
Ok(SortOrderPushdownResult::Exact { inner: new_exec })
}
SortOrderPushdownResult::Inexact { inner } => {
let new_exec =
Arc::new(self.clone()).with_new_children(vec![inner])?;
Ok(SortOrderPushdownResult::Inexact { inner: new_exec })
}
SortOrderPushdownResult::Unsupported => {
Ok(SortOrderPushdownResult::Unsupported)
}
}
} else {
Ok(SortOrderPushdownResult::Unsupported)
}
}
```
This way:
- When the data source can optimize to use pushdown sort, so we pushdown to
leverage the existing ordering.
- When the data source is unsorted, we don't pushdown, so sorting happens
after filtering on the smaller filtered dataset
This should avoid the performance degradation you were concerned about while
still enabling the optimization for sorted data sources. The new tests validate
this with the `WHERE timeframe = 'quarterly'` cases, which require pushdown
through FilterExec to utilize the sorted Parquet file's ordering.
What do you think about this approach?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]