alamb commented on code in PR #15769: URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070243952
########## datafusion/core/src/datasource/listing/table.rs: ########## @@ -982,18 +980,6 @@ impl TableProvider for ListingTable { return Ok(TableProviderFilterPushDown::Exact); } - // if we can't push it down completely with only the filename-based/path-based Review Comment: This change makes sense to me -- when @itsjunetime originally implemented this code, there was some complexity because there was no way to do filter pushdown in ExecutionPlans so in my mind this approach was a (clever) workaround The comments even hint that this is a parquet specific special case I think the new pattern of handling predicates more generally in this PR is cleaner and will support more cases. Since this code is only currently executed Perhaps @cisaacson has some other thoughts ########## datafusion/datasource-parquet/src/mod.rs: ########## @@ -244,7 +242,7 @@ impl ParquetExecBuilder { inner: DataSourceExec::new(Arc::new(base_config.clone())), base_config, predicate, - pruning_predicate: parquet.pruning_predicate, + pruning_predicate: None, // for backwards compat since `ParquetExec` is only for backwards compat anyway Review Comment: yeah this is fine too in my opinion. It is almost time to remove ParquetExec anyways -- maybe we should just do it in this release 🤔 ########## datafusion/core/src/datasource/listing/table.rs: ########## @@ -982,18 +980,6 @@ impl TableProvider for ListingTable { return Ok(TableProviderFilterPushDown::Exact); } - // if we can't push it down completely with only the filename-based/path-based - // column names, then we should check if we can do parquet predicate pushdown - let supports_pushdown = self.options.format.supports_filters_pushdown( - &self.file_schema, - &self.table_schema, - &[filter], - )?; - - if supports_pushdown == FilePushdownSupport::Supported { - return Ok(TableProviderFilterPushDown::Exact); - } Review Comment: > We can justify implementing other TableProviders for Parquet, but still I cannot understand why we need to degrade the capabilities of our ListingTable. Is't it always better pruning/simplifying things at the higher levels as possible? I don't think this degrades the capabilities of the current listing table. I think the only implications are for anyone who used a custom `FileFormat` and impleented `supports_filters_pushdown` -- I suspect this is not very common and we can likely avoid consternation by mentioning it in the upgrade guide (see comment below) ########## datafusion/datasource-parquet/src/source.rs: ########## @@ -559,25 +549,8 @@ impl FileSource for ParquetSource { .predicate() .map(|p| format!(", predicate={p}")) .unwrap_or_default(); - let pruning_predicate_string = self - .pruning_predicate - .as_ref() - .map(|pre| { - let mut guarantees = pre - .literal_guarantees() - .iter() - .map(|item| format!("{}", item)) - .collect_vec(); - guarantees.sort(); - format!( - ", pruning_predicate={}, required_guarantees=[{}]", - pre.predicate_expr(), - guarantees.join(", ") - ) - }) - .unwrap_or_default(); Review Comment: I think it is important to keep these in the physical plans -- in particular what I think is important is to be able to check via the explain plan if pruning is happening by looking at the explain plan ########## datafusion/datasource/src/file_format.rs: ########## @@ -109,37 +108,10 @@ pub trait FileFormat: Send + Sync + fmt::Debug { not_impl_err!("Writer not implemented for this format") } - /// Check if the specified file format has support for pushing down the provided filters within - /// the given schemas. Added initially to support the Parquet file format's ability to do this. - fn supports_filters_pushdown( Review Comment: yes I agree Since [`FileFormat`](https://docs.rs/datafusion/latest/datafusion/datasource/file_format/trait.FileFormat.html) is a pub trait, this is technically a breaking API change, but I do think it was a parquet specific optimization I recommend we mark this PR as an API change and add a note to the upgrade guide https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md I think it should basically say if you implemented `FileFormat` (which probably no one did) and ListingTable you will have to implement the newly added `ExecutionPlan::try_pushdown_filter` method into your execution plan directly if you want the filters to be pushed down ########## datafusion/core/src/datasource/listing/table.rs: ########## @@ -982,18 +980,6 @@ impl TableProvider for ListingTable { return Ok(TableProviderFilterPushDown::Exact); } - // if we can't push it down completely with only the filename-based/path-based - // column names, then we should check if we can do parquet predicate pushdown - let supports_pushdown = self.options.format.supports_filters_pushdown( - &self.file_schema, - &self.table_schema, - &[filter], - )?; - - if supports_pushdown == FilePushdownSupport::Supported { - return Ok(TableProviderFilterPushDown::Exact); - } Review Comment: > I have one question: aren't we expecting/preparing for, people to use ListingTable if they read Parquet files? Are we eventually planning to remove all format-specific handlings? Or this is a case only for filter pushdown? For what it is worth, we (InfluxData) doesn't use ListingTable to read parquet files, instead we provide our own equivalent and create the DataSourceExec's directly > I would keep supports_filters_pushdown so that TableProviders can do Exact pruning of filters, e.g. using partition columns. Yes I think that is important too -- I don't think we should be removing any APIs from ListingTable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org