alamb commented on code in PR #16791: URL: https://github.com/apache/datafusion/pull/16791#discussion_r2210136013
##########
docs/source/library-user-guide/upgrading.md:
##########

@@ -120,6 +120,17 @@
 SET datafusion.execution.spill_compression = 'zstd';

 For more details about this configuration option, including performance trade-offs between different compression codecs, see the [Configuration Settings](../user-guide/configs.md) documentation.

+### Custom `SchemaAdapterFactory` will no longer be used for predicate pushdown
+
+We are moving away from converting data (using `SchemaAdapter`) to converting the expressions themselves (which is more efficient and flexible).

Review Comment:
   I think we should also link to a ticket that describes the plan and backstory behind moving away from `SchemaAdapter` to rewriting the expressions. This both provides an avenue for feedback and gives additional information to people upgrading.

##########
datafusion/datasource-parquet/src/row_filter.rs:
##########

@@ -106,6 +106,8 @@ pub(crate) struct DatafusionArrowPredicate {
     rows_matched: metrics::Count,
     /// how long was spent evaluating this predicate
     time: metrics::Time,
+    /// used to perform type coercion while filtering rows

Review Comment:
   I think it is a bit unclear how the schema mapper and expression rewriter work together -- I believe the schema is mapped first, and the simplified physical expression is then evaluated against the mapped schema rather than the file schema. Maybe we can add a comment explaining how this works.

##########
docs/source/library-user-guide/upgrading.md:
##########

@@ -120,6 +120,17 @@
 SET datafusion.execution.spill_compression = 'zstd';

 For more details about this configuration option, including performance trade-offs between different compression codecs, see the [Configuration Settings](../user-guide/configs.md) documentation.

+### Custom `SchemaAdapterFactory` will no longer be used for predicate pushdown
+
+We are moving away from converting data (using `SchemaAdapter`) to converting the expressions themselves (which is more efficient and flexible).
+The first place this change has taken place is in predicate pushdown for Parquet.
+By default if you do not use a custom `SchemaAdapterFactory` we will use expression conversion instead.
+If you do set a custom `SchemaAdapterFactory` we will continue to use it but emit a warning about that code path being deprecated.
+
+To resolve this you need to implement a custom `PhysicalExprAdapterFactory` and use that instead of a `SchemaAdapterFactory`.
+See the [default values](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/default_column_values.rs) for an example of how to do this.
+Opting into the new APIs will set you up for future changes since we plan to expand use of `PhysicalExprAdapterFactory` to other areas of DataFusion.

Review Comment:
   A link to some description of the future plan would be super helpful here.

##########
datafusion/datasource-parquet/src/opener.rs:
##########

@@ -1095,4 +1124,167 @@ mod test {
         assert_eq!(num_batches, 0);
         assert_eq!(num_rows, 0);
     }
+
+    fn get_value(metrics: &MetricsSet, metric_name: &str) -> usize {
+        match metrics.sum_by_name(metric_name) {
+            Some(v) => v.as_usize(),
+            _ => {
+                panic!(
+                    "Expected metric not found. Looking for '{metric_name}' in\n\n{metrics:#?}"
+                );
+            }
+        }
+    }
+
+    #[tokio::test]
+    async fn test_custom_schema_adapter_no_rewriter() {

Review Comment:
   If possible, I think this test should be of the "end to end" variety (in `core_integration`) that shows how these APIs interact to rewrite predicates / schemas correctly.
   It is not super clear to me how these low-level APIs would be used by users, so I am not sure this test covers the cases correctly.

##########
datafusion/datasource-parquet/src/row_filter.rs:
##########

@@ -140,6 +143,8 @@ impl ArrowPredicate for DatafusionArrowPredicate {
     }

     fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
+        let batch = self.schema_mapper.map_batch(batch)?;

Review Comment:
   Applying the schema mapper first means the predicate is applied to batches that have been mapped to the table schema, but wasn't the predicate rewritten to be in terms of the file schema?

##########
docs/source/library-user-guide/upgrading.md:
##########

@@ -120,6 +120,17 @@
 SET datafusion.execution.spill_compression = 'zstd';

 For more details about this configuration option, including performance trade-offs between different compression codecs, see the [Configuration Settings](../user-guide/configs.md) documentation.

+### Custom `SchemaAdapterFactory` will no longer be used for predicate pushdown

Review Comment:
   Is the plan to avoid using Schema Adapters for all schema conversion, or just predicate pushdown?

   ```suggestion
   ### Deprecating `SchemaAdapterFactory` and `SchemaAdapter`
   ```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
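The review discussion above turns on the difference between adapting *data* to the table schema (the `SchemaAdapter` approach) and adapting the *expression* to the file schema (the `PhysicalExprAdapterFactory` approach). The following is a deliberately simplified, dependency-free Rust sketch of that difference. It does not use DataFusion's real types: `FileRow`, `B_DEFAULT`, and the `filter_by_*` helpers are invented for illustration, modeling a file that lacks a column `b` which the table schema defaults to 10.

```rust
// Toy model only: NOT DataFusion's API. A file that contains column `a`
// but is missing table column `b` (which defaults to 10).
#[derive(Clone, Copy)]
struct FileRow {
    a: i64,
}

const B_DEFAULT: i64 = 10;

// Predicate written against the *table* schema: a + b > 15.
fn table_predicate(a: i64, b: i64) -> bool {
    a + b > 15
}

// `SchemaAdapter`-style: materialize the missing column into every row
// (data conversion), then evaluate the original predicate.
fn filter_by_adapting_data(rows: &[FileRow]) -> Vec<i64> {
    rows.iter()
        .map(|r| (r.a, B_DEFAULT)) // per-batch copy/fill work happens here
        .filter(|&(a, b)| table_predicate(a, b))
        .map(|(a, _)| a)
        .collect()
}

// `PhysicalExprAdapter`-style: rewrite the predicate once so the missing
// column becomes a literal, then evaluate it directly on the file rows.
fn filter_by_rewriting_expr(rows: &[FileRow]) -> Vec<i64> {
    let rewritten = |a: i64| a + B_DEFAULT > 15; // a + 10 > 15
    rows.iter().map(|r| r.a).filter(|&a| rewritten(a)).collect()
}

fn main() {
    let rows = [FileRow { a: 3 }, FileRow { a: 6 }, FileRow { a: 9 }];
    // Both strategies must select the same rows; only the second one
    // avoids touching the data itself.
    assert_eq!(filter_by_adapting_data(&rows), vec![6, 9]);
    assert_eq!(filter_by_rewriting_expr(&rows), vec![6, 9]);
    println!("both approaches select the same rows");
}
```

In this reading, the rewrite happens once per file rather than once per batch, which is one way to understand the "more efficient" claim in the upgrade note; it also sidesteps the ordering question raised about `DatafusionArrowPredicate::evaluate`, because there is no mapped batch for the predicate to run against.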