mbutrovich opened a new pull request, #4730:
URL: https://github.com/apache/datafusion-comet/pull/4730

   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   DataFusion's default `PhysicalExprAdapter` inserts a `CastExpr` around every 
Column reference whenever the logical and physical Arrow Fields differ in any 
attribute, including metadata-only or nullability-only mismatches. DataFusion 
itself absorbs this because its `PruningPredicate` analyzer recognizes its own 
`CastExpr` and peels it to resolve the column against parquet statistics.
   
   Comet's `SparkPhysicalExprAdapter::replace_with_spark_cast` then swaps that 
`CastExpr` for a `datafusion_comet_spark_expr::Cast` because Spark cast 
semantics diverge from arrow-cast for overflow, null handling, ANSI mode, etc. 
The Spark `Cast` is a different `PhysicalExpr` type that DataFusion's pruning 
analyzer does not understand, so `build_pruning_predicates` returns `None` at 
file open time and no row groups are pruned. With Spark range-derived schemas 
(non-nullable logical) read from Parquet (nullable physical), this fires on 
every column reference, silently disabling row-group and page-index stats 
pruning.
   
   For identical source and target data types there is no Spark-specific cast 
semantics to preserve, so the swap costs us pruning for no benefit.
   
   ## What changes are included in this PR?
   
   `SparkPhysicalExprAdapter::replace_with_spark_cast` now skips the Spark 
`Cast` wrap when the physical and target data types are equal. Unwrapping is 
safe because a `Cast` with equal source and target types is a value-level 
identity (it does not null-strip or enforce non-null), and Arrow field 
nullability and metadata are informational, not computational.
   
   ## How are these changes tested?
   
   `CometNativeReaderSuite` adds a regression test that writes a 1000-row 
Parquet file with `parquet.block.size=1024`, asserts more than one row group 
via `ParquetFileReader`, then runs `SELECT ... WHERE c1 > 500` and asserts the 
scan's `numOutputRows` is strictly less than 1000. Without the fix the scan 
reads all rows; with the fix row groups whose max is at most 500 are pruned.
   
   I will run TPC-DS SF 1000.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to