returnString commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543527000



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],
     ) -> Result<Arc<dyn ExecutionPlan>>;
 
     /// Returns the table Statistics
     /// Statistics should be optional because not all data sources can provide 
statistics.
     fn statistics(&self) -> Statistics;
+
+    /// Tests whether the table provider can make use of a filter expression
+    /// to optimise data retrieval.
+    fn test_filter_pushdown(

Review comment:
       > Maybe `supports_filter_pushdown`?
   
   That's definitely better, other people seem to like it as well, sold :p
   
   > I feel this makes us evaluate the expression twice: once for testing 
filter support and once at scan time. Moreover, it is likely that different 
partitions will have different `TableProviderFilterPushDown` values. It would 
be great optimize out filters on that finer granularity 🚀 But that would imply 
doing that optimization in the physical plan.
   
   Yeah, I think this depends on whether it's preferable to:
   - offer extensibility for providers, allowing them to interpret filters 
however they like
   - promote some notion of external data chunk into an actual DataFusion 
concept (I need to read through your catalogue proposal doc properly!) and get 
providers to return a list of available chunks with their bounds, then having 
DataFusion perform the high-level filtering where possible, and pass that more 
selective list of chunk IDs back for actual data retrieval
   
   And personally I see pros and cons for both of those options (flexibility is 
cool, but sometimes dangerous).
   
   I'm not familiar enough with partitioning in DataFusion yet to quite 
understand what you mean for the different pushdown support return values 
across partitions; I would expect that the filter pushdown status is usually a 
property of the column in question? e.g. depending on whether a column tracks 
sufficient statistics and the provider storage mechanism allows for it




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to