alamb commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543730441
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
&self,
projection: &Option<Vec<usize>>,
batch_size: usize,
+ filters: &[Expr],
Review comment:
If we don't like passing around `&[]` all over the place, we could name
`scan` --> `scan_with_filter` and add a default implementation of `scan`
That would also have the nice benefit of keeping the signatures the same for
existing TableProvider implementations. I do not feel strongly about this
##########
File path: rust/datafusion/src/logical_plan/plan.rs
##########
@@ -477,17 +479,29 @@ impl LogicalPlan {
LogicalPlan::TableScan {
ref source,
ref projection,
+ ref filters,
..
- } => match source {
- TableSource::FromContext(table_name) => write!(
- f,
- "TableScan: {} projection={:?}",
- table_name, projection
- ),
- TableSource::FromProvider(_) => {
- write!(f, "TableScan: projection={:?}", projection)
+ } => {
+ match source {
+ TableSource::FromContext(table_name) => write!(
+ f,
+ "TableScan: {} projection={:?}",
+ table_name, projection
+ )?,
+ TableSource::FromProvider(_) => {
+ write!(f, "TableScan: projection={:?}",
projection)?
+ }
+ };
+
+ // TODO: this probably shouldn't be conditional on a
non-empty list
Review comment:
I think it is fine to not print out filters if they aren't present
##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -65,6 +66,7 @@ impl TableProvider for ParquetTable {
&self,
projection: &Option<Vec<usize>>,
batch_size: usize,
+ _filters: &[Expr],
Review comment:
> I think it makes sense to separate the generic support for predicate
push-down to the data source from the implementation for various data sources
such as parquet because each change will be fairly big so makes sense to split
into smaller changes;
I think this approach makes a lot of sense.
I have also been thinking about pushing predicates down into custom table
providers, and FWIW I think this interface would work well for my use.
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
pub total_byte_size: Option<usize>,
}
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+ /// The expression cannot be used by the provider.
+ Unsupported,
+ /// The expression can be used to help minimise the data retrieved,
+ /// but the provider cannot guarantee that all returned tuples
Review comment:
I think that these names make sense as does the behavior described
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
&self,
projection: &Option<Vec<usize>>,
batch_size: usize,
+ filters: &[Expr],
Review comment:
I think if filtering requires IO it could be done as part of the actual
Execution.
It seems like the main benefit of removing plan nodes during planning is to
avoid potentially expensive I/O that could be ruled out with Metadata queries
(as @returnString describes is his usecase)
Thus if the system has to go and do I/O anyways to figure out if it rule out
a plan node, much of the benefit of pruning such nodes from the plan seem to be
lost. In that case it seems better to me to do do such pruning as part of
Execution.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]