felipepessoto opened a new pull request, #12292: URL: https://github.com/apache/gluten/pull/12292
## What changes are proposed in this pull request?

Uncovered by the Delta Spark UT pipeline (#12278).

`GlutenDeltaJobStatsTracker` builds the per-file statistics aggregation as a `SortAggregateExec -> ProjectExec` plan, runs Gluten's `HeuristicTransform`, then unconditionally casts the result to a `WholeStageTransformer`. When the statistics aggregation cannot be offloaded to Velox -- for example `min`/`max` over a `TIMESTAMP_NTZ` column, as exercised by Delta's `DataSkippingDeltaV1Suite` "data skipping on TIMESTAMP_NTZ near Long.MaxValue" -- the projection stays a vanilla `ProjectExec` and the cast throws:

```
java.lang.ClassCastException: org.apache.spark.sql.execution.ProjectExec cannot be cast to org.apache.gluten.execution.WholeStageTransformer
```

in the per-task tracker constructor (`GlutenDeltaJobStatsTracker.scala`), failing the write.

This PR decides on the **driver** whether the aggregation actually offloads: a new `canOffloadStats()` dry-runs the same transform pipeline once and checks whether it collapses into a `WholeStageTransformer`. If it does not, the `DeltaJobStatisticsTracker` is routed to the existing `GlutenDeltaJobStatsFallbackTracker` (columnar-to-row + the original Delta tracker, which produces correct statistics for any type) instead of the native tracker. Evaluating this on the driver also avoids the per-task constructor allocating a single-thread executor and a `NativePlanEvaluator` before the cast.

The fix is applied to both the Delta 3.x (`src-delta33`) and Delta 4.x (`src-delta40`) copies.

## How was this patch tested?

Added `GlutenDeltaStatsSuite`, which writes a Delta table whose `TIMESTAMP_NTZ` min/max statistics cannot be offloaded to Velox. Before this change the write crashes with the `ClassCastException` above; after it, the write succeeds via the row-based fallback tracker.
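
For readers unfamiliar with the pattern, the routing that the suite exercises can be sketched with toy types. This is only an illustration under stated assumptions: `Plan`, `VanillaProject`, `WholeStage`, `Tracker`, `StatsRouting`, `transform` and `chooseTracker` are hypothetical stand-ins rather than Gluten or Spark classes, the `canOffloadStats` signature shown here is invented for the sketch, and the rule "`TIMESTAMP_NTZ` does not offload" is hard-coded purely to mirror the failing case described above.

```scala
// Toy model only: every type here is a hypothetical stand-in, not a Gluten or Spark class.
sealed trait Plan
final case class VanillaProject(columnType: String) extends Plan // stands in for ProjectExec
final case class WholeStage(columnType: String) extends Plan     // stands in for WholeStageTransformer

sealed trait Tracker
case object NativeTracker extends Tracker   // stands in for the native statistics tracker
case object FallbackTracker extends Tracker // stands in for the row-based fallback tracker

object StatsRouting {
  // Hypothetical transform: pretends that aggregation over TIMESTAMP_NTZ cannot be offloaded.
  def transform(columnType: String): Plan =
    if (columnType == "TIMESTAMP_NTZ") VanillaProject(columnType)
    else WholeStage(columnType)

  // Dry run: inspect the transformed plan with a pattern match instead of an unchecked cast.
  def canOffloadStats(columnType: String): Boolean =
    transform(columnType) match {
      case _: WholeStage => true
      case _             => false
    }

  // Decide once, up front, which tracker to use.
  def chooseTracker(columnType: String): Tracker =
    if (canOffloadStats(columnType)) NativeTracker else FallbackTracker
}
```

In this sketch the decision is a total function of the transformed plan, so an unsupported type selects the fallback rather than surfacing as a `ClassCastException` later in per-task code.
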
Locally verified (Spark 3.5, Scala 2.12): the new suite fails without the fix (`Tests: succeeded 0, failed 1`, ClassCastException) and passes with it (`succeeded 1, failed 0`). A companion test-only PR demonstrates the same red/green contrast on CI. Also confirmed end-to-end against Delta's `DataSkippingDeltaV1Suite` "TIMESTAMP_NTZ near Long.MaxValue" (succeeded 2, failed 0 with the fix). `scalafmt`/spotless report no changes.

## Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI (claude-opus-4.8)
