alamb commented on code in PR #15769:
URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070243952


##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -982,18 +980,6 @@ impl TableProvider for ListingTable {
                     return Ok(TableProviderFilterPushDown::Exact);
                 }
 
-                // if we can't push it down completely with only the filename-based/path-based

Review Comment:
   This change makes sense to me -- when @itsjunetime originally implemented this code, there was some complexity because there was no way to do filter pushdown in ExecutionPlans, so in my mind this approach was a (clever) workaround.
   
   The comments even hint that this is a parquet-specific special case.
   
   I think the new pattern of handling predicates more generally in this PR is cleaner and will support more cases. Since this code is only currently executed 
   
   Perhaps @cisaacson has some other thoughts



##########
datafusion/datasource-parquet/src/mod.rs:
##########
@@ -244,7 +242,7 @@ impl ParquetExecBuilder {
             inner: DataSourceExec::new(Arc::new(base_config.clone())),
             base_config,
             predicate,
-            pruning_predicate: parquet.pruning_predicate,
+            pruning_predicate: None, // for backwards compat since `ParquetExec` is only for backwards compat anyway

Review Comment:
   Yeah, this is fine too in my opinion. It is almost time to remove ParquetExec anyway -- maybe we should just do it in this release 🤔 



##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -982,18 +980,6 @@ impl TableProvider for ListingTable {
                     return Ok(TableProviderFilterPushDown::Exact);
                 }
 
-                // if we can't push it down completely with only the filename-based/path-based
-                // column names, then we should check if we can do parquet predicate pushdown
-                let supports_pushdown = self.options.format.supports_filters_pushdown(
-                    &self.file_schema,
-                    &self.table_schema,
-                    &[filter],
-                )?;
-
-                if supports_pushdown == FilePushdownSupport::Supported {
-                    return Ok(TableProviderFilterPushDown::Exact);
-                }

Review Comment:
   > We can justify implementing other TableProviders for Parquet, but still I cannot understand why we need to degrade the capabilities of our ListingTable. Isn't it always better to prune/simplify things at as high a level as possible?
   
   I don't think this degrades the capabilities of the current listing table. I think the only implications are for anyone who used a custom `FileFormat` and implemented `supports_filters_pushdown` -- I suspect this is not very common, and we can likely avoid consternation by mentioning it in the upgrade guide (see comment below)



##########
datafusion/datasource-parquet/src/source.rs:
##########
@@ -559,25 +549,8 @@ impl FileSource for ParquetSource {
                     .predicate()
                     .map(|p| format!(", predicate={p}"))
                     .unwrap_or_default();
-                let pruning_predicate_string = self
-                    .pruning_predicate
-                    .as_ref()
-                    .map(|pre| {
-                        let mut guarantees = pre
-                            .literal_guarantees()
-                            .iter()
-                            .map(|item| format!("{}", item))
-                            .collect_vec();
-                        guarantees.sort();
-                        format!(
-                            ", pruning_predicate={}, required_guarantees=[{}]",
-                            pre.predicate_expr(),
-                            guarantees.join(", ")
-                        )
-                    })
-                    .unwrap_or_default();

Review Comment:
   I think it is important to keep these in the physical plans -- in particular, it should be possible to check from the explain plan whether pruning is happening.
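   
   For example (a rough sketch, not part of this PR -- the `data.parquet` file and the `id` column are made-up placeholders), one way to verify this from user code is to run `EXPLAIN` and look for `predicate=` / `pruning_predicate=` / `required_guarantees=[...]` on the parquet data source node, which is exactly what the removed formatting code used to print:
   
   ```rust
   use datafusion::error::Result;
   use datafusion::prelude::{ParquetReadOptions, SessionContext};
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let ctx = SessionContext::new();
   
       // "data.parquet" and the `id` column are placeholders for this sketch
       ctx.register_parquet("t", "data.parquet", ParquetReadOptions::default())
           .await?;
   
       // Look for the parquet DataSourceExec line in the output: it should keep
       // showing `predicate=...` and, per this discussion, ideally also
       // `pruning_predicate=...` with `required_guarantees=[...]`
       ctx.sql("EXPLAIN SELECT * FROM t WHERE id = 42")
           .await?
           .show()
           .await?;
   
       Ok(())
   }
   ```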



##########
datafusion/datasource/src/file_format.rs:
##########
@@ -109,37 +108,10 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
         not_impl_err!("Writer not implemented for this format")
     }
 
-    /// Check if the specified file format has support for pushing down the provided filters within
-    /// the given schemas. Added initially to support the Parquet file format's ability to do this.
-    fn supports_filters_pushdown(

Review Comment:
   Yes, I agree.
   
   Since [`FileFormat`](https://docs.rs/datafusion/latest/datafusion/datasource/file_format/trait.FileFormat.html) is a pub trait, this is technically a breaking API change, but I do think it was a parquet-specific optimization.
   
   I recommend we mark this PR as an API change and add a note to the upgrade guide: https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md
   
   I think it should basically say that if you implemented `FileFormat` (which probably no one did) and used it with ListingTable, you will now have to implement the newly added `ExecutionPlan::try_pushdown_filter` method on your execution plan directly if you want the filters to be pushed down.



##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -982,18 +980,6 @@ impl TableProvider for ListingTable {
                     return Ok(TableProviderFilterPushDown::Exact);
                 }
 
-                // if we can't push it down completely with only the filename-based/path-based
-                // column names, then we should check if we can do parquet predicate pushdown
-                let supports_pushdown = self.options.format.supports_filters_pushdown(
-                    &self.file_schema,
-                    &self.table_schema,
-                    &[filter],
-                )?;
-
-                if supports_pushdown == FilePushdownSupport::Supported {
-                    return Ok(TableProviderFilterPushDown::Exact);
-                }

Review Comment:
   > I have one question: aren't we expecting/preparing for people to use ListingTable if they read Parquet files? Are we eventually planning to remove all format-specific handling? Or is this a case only for filter pushdown?
   
   For what it is worth, we (InfluxData) don't use ListingTable to read parquet files; instead we provide our own equivalent and create the DataSourceExec's directly.
   
   > I would keep supports_filters_pushdown so that TableProviders can do Exact pruning of filters, e.g. using partition columns.
   
   Yes, I think that is important too -- I don't think we should be removing any APIs from ListingTable.
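   
   As a concrete illustration (a minimal sketch, not from this PR -- `pushdown_for_filter` and the `date` partition column are hypothetical), a custom `TableProvider::supports_filters_pushdown` could keep answering `Exact` for filters that reference only partition columns along these lines:
   
   ```rust
   use std::collections::HashSet;
   
   use datafusion::common::{Column, Result};
   use datafusion::logical_expr::utils::expr_to_columns;
   use datafusion::logical_expr::{Expr, TableProviderFilterPushDown};
   use datafusion::prelude::{col, lit};
   
   /// Hypothetical helper: report `Exact` for a filter that references only
   /// partition (path-based) columns, `Inexact` otherwise, so the plan keeps
   /// the filter whenever we cannot prune it completely ourselves.
   fn pushdown_for_filter(
       filter: &Expr,
       partition_cols: &[&str],
   ) -> Result<TableProviderFilterPushDown> {
       let mut cols: HashSet<Column> = HashSet::new();
       expr_to_columns(filter, &mut cols)?;
   
       let only_partition_cols = cols
           .iter()
           .all(|c| partition_cols.contains(&c.name.as_str()));
   
       Ok(if only_partition_cols {
           TableProviderFilterPushDown::Exact
       } else {
           TableProviderFilterPushDown::Inexact
       })
   }
   
   fn main() -> Result<()> {
       // A filter that touches only the `date` partition column -> Exact
       let filter = col("date").eq(lit("2024-01-01"));
       assert_eq!(
           pushdown_for_filter(&filter, &["date"])?,
           TableProviderFilterPushDown::Exact
       );
       Ok(())
   }
   ```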



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
