[GitHub] [arrow] alamb commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

GitBox Tue, 15 Dec 2020 15:01:23 -0800


alamb commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543730441




##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       If we don't like passing around `&[]` all over the place, we could name 
`scan` --> `scan_with_filter` and add a default implementation of `scan` 
   
   That would also have the nice benefit of keeping the signatures the same for 
existing TableProvider implementations. I do not feel strongly about this

##########
File path: rust/datafusion/src/logical_plan/plan.rs
##########
@@ -477,17 +479,29 @@ impl LogicalPlan {
                     LogicalPlan::TableScan {
                         ref source,
                         ref projection,
+                        ref filters,
                         ..
-                    } => match source {
-                        TableSource::FromContext(table_name) => write!(
-                            f,
-                            "TableScan: {} projection={:?}",
-                            table_name, projection
-                        ),
-                        TableSource::FromProvider(_) => {
-                            write!(f, "TableScan: projection={:?}", projection)
+                    } => {
+                        match source {
+                            TableSource::FromContext(table_name) => write!(
+                                f,
+                                "TableScan: {} projection={:?}",
+                                table_name, projection
+                            )?,
+                            TableSource::FromProvider(_) => {
+                                write!(f, "TableScan: projection={:?}", 
projection)?
+                            }
+                        };
+
+                        // TODO: this probably shouldn't be conditional on a 
non-empty list

Review comment:
       I think it is fine to not print out filters if they aren't present

##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -65,6 +66,7 @@ impl TableProvider for ParquetTable {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        _filters: &[Expr],

Review comment:
       > I think it makes sense to separate the generic support for predicate 
push-down to the data source from the implementation for various data sources 
such as parquet because each change will be fairly big so makes sense to split 
into smaller changes;
   
   I think this approach makes a lot of sense. 
   
   I have also been thinking about pushing predicates down into custom table 
providers, and FWIW I think this interface would work well for my use. 

##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       I think that these names make sense as does the behavior described

##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       I think if filtering requires IO it could be done as part of the actual 
Execution. 
   
   It seems like the main benefit of removing plan nodes during planning is to 
avoid potentially expensive I/O that could be ruled out with Metadata queries 
(as @returnString  describes is his usecase) 
   
   Thus if the system has to go and do I/O anyways to figure out if it rule out 
a plan node, much of the benefit of pruning such nodes from the plan seem to be 
lost. In that case it seems better to me to do do such pruning as part of 
Execution. 
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] alamb commented on a change in pull request #8917: ARROW-9828: [Rust] [DataFusion] Support filter pushdown optimisation for TableProvider implementations

Reply via email to