MaxGekk opened a new pull request, #56846: URL: https://github.com/apache/spark/pull/56846
### What changes were proposed in this pull request?

Enable MIN/MAX/COUNT aggregate push-down over `TIME` columns for the Parquet and ORC data sources, computed from file-footer statistics.

- Add `TimeType` to the MIN/MAX type allow-list in `AggregatePushDownUtils.getSchemaForPushedAggregation`. This is shared, engine-agnostic code called by both `ParquetScanBuilder` and `OrcScanBuilder`, so the single change enables push-down eligibility for both engines at once.
- Add a `TimeType` case to `OrcUtils.getMinMaxFromColumnStatistics`. ORC stores `TIME` as a `LONG`, so its statistics are `IntegerColumnStatistics`; the min/max value is wrapped in a `LongWritable` and converted back to the Spark `TimeType` by `OrcDeserializer`.
- No Parquet reader change is needed: `TIME` is stored as Parquet `INT64`, so the existing `INT64` branch in `ParquetUtils.createAggInternalRowFromFooter` feeds the footer stat into a `ParquetRowConverter` built from the footer `PrimitiveType`, which carries the `TIME(MICROS)`/`TIME(NANOS)` logical annotation and maps to `TimeType`.

The columnar conversion (`AggregatePushDownUtils.convertAggregatesRowToBatch` via `RowToColumnConverter`) already supports `TimeType` (SPARK-54203), so the columnar path works for `TIME`.

### Why are the changes needed?

This is a sub-task of SPARK-57550 (extending support for the `TIME` data type). Aggregate push-down lets Parquet/ORC answer `MIN`/`MAX`/`COUNT` from footer statistics without reading and aggregating `TIME` data at the Spark layer.

### Does this PR introduce _any_ user-facing change?

No. This is an internal optimization on the aggregate push-down path; query results are unchanged.

### How was this patch tested?
Added tests to the shared `FileSourceAggregatePushDownSuite` trait, which is extended by `ParquetV1/V2AggregatePushDownSuite` and `OrcV1/V2AggregatePushDownSuite`, so each test runs in all four suites:

- Positive: `MIN`/`MAX`/`COUNT(col)`/`COUNT(*)` push-down over a `TIME` column at precisions 0, 6, 7, and 9, covering both the Parquet micros (precision <= 6) and nanos (precision >= 7) storage paths, with a null row so `COUNT(col)` and `COUNT(*)` differ.
- Negative: a data filter on the `TIME` column, an aggregate over a non-column expression, and push-down disabled by config; all three assert that the aggregate is not pushed.

Ran:

```
build/sbt 'sql/testOnly *ParquetV1AggregatePushDownSuite *ParquetV2AggregatePushDownSuite *OrcV1AggregatePushDownSuite *OrcV2AggregatePushDownSuite'
```

All 92 tests pass.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4.8)
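For readers unfamiliar with aggregate push-down, the idea of answering `MIN`/`MAX`/`COUNT` from footer statistics can be sketched in a few lines of plain Python. This is only an illustrative model, not Spark's implementation: `FooterStats`, `micros_to_time`, and `aggregate_from_footers` are hypothetical names, and the sketch assumes each file exposes a row count, a null count, and min/max values stored as microseconds since midnight. Only the microsecond path is modeled, because `datetime.time` cannot represent nanoseconds.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional, Sequence


@dataclass(frozen=True)
class FooterStats:
    """Hypothetical per-file statistics for one TIME column (not a Spark class)."""
    row_count: int             # rows in the file, nulls included
    null_count: int            # null values in the TIME column
    min_micros: Optional[int]  # None when every value in the file is null
    max_micros: Optional[int]


def micros_to_time(micros: int) -> time:
    """Convert microseconds since midnight to a datetime.time."""
    seconds, micro = divmod(micros, 1_000_000)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, micro)


def aggregate_from_footers(files: Sequence[FooterStats]) -> dict:
    """Combine per-file statistics into MIN, MAX, COUNT(col), and COUNT(*)."""
    mins = [f.min_micros for f in files if f.min_micros is not None]
    maxs = [f.max_micros for f in files if f.max_micros is not None]
    return {
        "min": micros_to_time(min(mins)) if mins else None,
        "max": micros_to_time(max(maxs)) if maxs else None,
        # COUNT(col) skips nulls; COUNT(*) counts every row.
        "count_col": sum(f.row_count - f.null_count for f in files),
        "count_star": sum(f.row_count for f in files),
    }
```

In this model, a file with a null row contributes to `count_star` but not to `count_col`, which is the same distinction the positive tests rely on when they include a null row.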
