zhuqi-lucas commented on PR #19446:
URL: https://github.com/apache/datafusion/pull/19446#issuecomment-3690779474

   > I'm comfortable with `ProjectionExec` and `CooperativeExec` but I think 
`FilterExec` needs some more discussion since pushing ordering below filters 
might have a negative impact on performance for some queries / file sources.
   
   @adriangb Happy holidays! Thank you for the feedback! I completely 
understand your concern about FilterExec.
   
   If i makes sense right, i've updated the implementation to address the 
performance issue you mentioned. Now `FilterExec::try_pushdown_sort` only 
pushes down the sort when the child node already has an output ordering:
   
   ```rust
   fn try_pushdown_sort(
           &self,
           order: &[PhysicalSortExpr],
       ) -> Result<SortOrderPushdownResult<Arc<dyn ExecutionPlan>>> {
           if let Some(_child_ordering) = self.input.output_ordering() {
               match self.input.try_pushdown_sort(order)? {
                   SortOrderPushdownResult::Exact { inner } => {
                       let new_exec = 
Arc::new(self.clone()).with_new_children(vec![inner])?;
                       Ok(SortOrderPushdownResult::Exact { inner: new_exec })
                   }
                   SortOrderPushdownResult::Inexact { inner } => {
                       let new_exec = 
Arc::new(self.clone()).with_new_children(vec![inner])?;
                       Ok(SortOrderPushdownResult::Inexact { inner: new_exec })
                   }
                   SortOrderPushdownResult::Unsupported => {
                       Ok(SortOrderPushdownResult::Unsupported)
                   }
               }
           } else {
               Ok(SortOrderPushdownResult::Unsupported)
           }
       }
   ```
   
   This way:
   - When the data source can optimize to use pushdown sort, so we pushdown to 
leverage the existing ordering.
   - When the data source is unsorted, we don't pushdown, so sorting happens 
after filtering on the smaller filtered dataset
   
   This should avoid the performance degradation you were concerned about while 
still enabling the optimization for sorted data sources. The new tests validate 
this with the `WHERE timeframe = 'quarterly'` cases, which require pushdown 
through FilterExec to utilize the sorted Parquet file's ordering.
   
   
   What do you think about this approach?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to