mbutrovich opened a new pull request, #4730: URL: https://github.com/apache/datafusion-comet/pull/4730
## Which issue does this PR close? Closes #. ## Rationale for this change DataFusion's default `PhysicalExprAdapter` inserts a `CastExpr` around every Column reference whenever the logical and physical Arrow Fields differ in any attribute, including metadata-only or nullability-only mismatches. DataFusion itself absorbs this because its `PruningPredicate` analyzer recognizes its own `CastExpr` and peels it to resolve the column against parquet statistics. Comet's `SparkPhysicalExprAdapter::replace_with_spark_cast` then swaps that `CastExpr` for a `datafusion_comet_spark_expr::Cast` because Spark cast semantics diverge from arrow-cast for overflow, null handling, ANSI mode, etc. The Spark `Cast` is a different `PhysicalExpr` type that DataFusion's pruning analyzer does not understand, so `build_pruning_predicates` returns `None` at file open time and no row groups are pruned. With Spark range-derived schemas (non-nullable logical) read from Parquet (nullable physical), this fires on every column reference, silently disabling row-group and page-index stats pruning. For identical source and target data types there is no Spark-specific cast semantics to preserve, so the swap costs us pruning for no benefit. ## What changes are included in this PR? `SparkPhysicalExprAdapter::replace_with_spark_cast` now skips the Spark `Cast` wrap when the physical and target data types are equal. Unwrapping is safe because a `Cast` with equal source and target types is a value-level identity (it does not null-strip or enforce non-null), and Arrow field nullability and metadata are informational, not computational. ## How are these changes tested? `CometNativeReaderSuite` adds a regression test that writes a 1000-row Parquet file with `parquet.block.size=1024`, asserts more than one row group via `ParquetFileReader`, then runs `SELECT ... WHERE c1 > 500` and asserts the scan's `numOutputRows` is strictly less than 1000. Without the fix the scan reads all rows; with the fix row groups whose max is at most 500 are pruned. I will run TPC-DS SF 1000. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
