MaxGekk opened a new pull request, #56846:
URL: https://github.com/apache/spark/pull/56846

   ### What changes were proposed in this pull request?
   Enable MIN/MAX/COUNT aggregate push-down over `TIME` columns for the Parquet 
and ORC data sources, computed from file-footer statistics.
   
   - Add `TimeType` to the MIN/MAX type allow-list in 
`AggregatePushDownUtils.getSchemaForPushedAggregation`. This is shared, 
engine-agnostic code called by both `ParquetScanBuilder` and `OrcScanBuilder`, 
so the single change enables push-down eligibility for both engines at once.
   - Add a `TimeType` case to `OrcUtils.getMinMaxFromColumnStatistics`. ORC 
stores `TIME` as a `LONG`, so its statistics are `IntegerColumnStatistics`; the 
min/max value is wrapped in a `LongWritable` and converted back to the Spark 
`TimeType` by `OrcDeserializer`.
   - No Parquet reader change is needed: `TIME` is stored as Parquet `INT64`, 
so the existing `INT64` branch in `ParquetUtils.createAggInternalRowFromFooter` 
feeds the footer stat into a `ParquetRowConverter` built from the footer 
`PrimitiveType`, which carries the `TIME(MICROS)`/`TIME(NANOS)` logical 
annotation and maps to `TimeType`.
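
The eligibility change in the first bullet can be pictured as a type allow-list gate. The following is a minimal sketch under stated assumptions, not the actual `AggregatePushDownUtils` code: `ColType`, `PushDownGate`, `canPushMinMax`, and the particular members of the set are hypothetical stand-ins chosen for illustration.

```scala
// Hypothetical stand-ins for Spark data types; these are NOT the real
// org.apache.spark.sql.types classes.
sealed trait ColType
case object LongT extends ColType
case object DateT extends ColType
case object TimeT extends ColType
case object StringT extends ColType

object PushDownGate {
  // Illustrative allow-list of types whose footer MIN/MAX statistics are used.
  // Including TimeT models the eligibility change described above.
  private val minMaxAllowed: Set[ColType] = Set(LongT, DateT, TimeT)

  // Returns true when a MIN/MAX aggregate over a column of this type
  // may be answered from file statistics in this simplified model.
  def canPushMinMax(t: ColType): Boolean = minMaxAllowed.contains(t)
}
```

Because the gate is shared, a single added entry makes the type eligible for every caller, which mirrors why one change covers both Parquet and ORC.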
   
   The columnar conversion 
(`AggregatePushDownUtils.convertAggregatesRowToBatch` via 
`RowToColumnConverter`) already supports `TimeType` (SPARK-54203), so the 
columnar path works for `TIME`.
   
   ### Why are the changes needed?
   This is a sub-task of SPARK-57550 (extending support for the `TIME` data 
type). Aggregate push-down lets Parquet/ORC answer `MIN`/`MAX`/`COUNT` from 
footer statistics without reading and aggregating `TIME` data at the Spark 
layer.
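
To illustrate the idea of answering `MIN`/`MAX`/`COUNT` from footer statistics, here is a minimal sketch that combines per-file statistics without touching row data. `FooterStats`, `AggResult`, and `FooterAggregation.combine` are hypothetical names, and the statistics layout is an assumption for illustration; it does not reproduce the Parquet or ORC reader code.

```scala
// Hypothetical per-file statistics for one TIME column; min/max are raw
// long values (for example, microseconds since midnight). A file whose
// column is entirely null carries no min/max.
final case class FooterStats(
    min: Option[Long],
    max: Option[Long],
    numRows: Long,
    numNulls: Long)

final case class AggResult(
    min: Option[Long],
    max: Option[Long],
    countCol: Long,   // COUNT(col): non-null values only
    countStar: Long)  // COUNT(*): all rows

object FooterAggregation {
  // Folds the statistics of all files into one result row.
  def combine(files: Seq[FooterStats]): AggResult = {
    val mins = files.flatMap(_.min)
    val maxs = files.flatMap(_.max)
    AggResult(
      min = if (mins.isEmpty) None else Some(mins.min),
      max = if (maxs.isEmpty) None else Some(maxs.max),
      countCol = files.map(f => f.numRows - f.numNulls).sum,
      countStar = files.map(_.numRows).sum)
  }
}
```

In this model, a file containing only nulls contributes to `COUNT(*)` but not to `COUNT(col)`, `MIN`, or `MAX`.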
   
   ### Does this PR introduce _any_ user-facing change?
   No. This is an internal optimization on the aggregate push-down path; query 
results are unchanged.
   
   ### How was this patch tested?
   Added tests to the shared `FileSourceAggregatePushDownSuite` trait, which is 
extended by `ParquetV1/V2AggregatePushDownSuite` and 
`OrcV1/V2AggregatePushDownSuite`, so each test runs in all four suites (Parquet 
and ORC, each on the V1 and V2 data source paths):
   - Positive: `MIN`/`MAX`/`COUNT(col)`/`COUNT(*)` push-down over a `TIME` 
column at precisions 0, 6, 7, and 9, covering both the Parquet micros 
(precision <= 6) and nanos (precision >= 7) storage paths, with a null row so 
`COUNT(col)` and `COUNT(*)` differ.
   - Negative: a data filter on the `TIME` column, an aggregate over a 
non-column expression, and push-down disabled by config -- all asserting the 
aggregate is not pushed.
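
The precision split covered by the positive tests can be sketched as follows. This is an illustrative model only: `TimeStorage`, `StorageUnit`, `unitFor`, and `decode` are hypothetical names, and the 0 to 9 precision bound is an assumption inferred from the tested precisions rather than taken from Spark's code.

```scala
import java.time.LocalTime

object TimeStorage {
  sealed trait StorageUnit
  case object Micros extends StorageUnit
  case object Nanos extends StorageUnit

  // Split as stated in the test description: precision <= 6 is stored
  // as micros, precision >= 7 as nanos. The 0..9 bound is an assumption.
  def unitFor(precision: Int): StorageUnit = {
    require(precision >= 0 && precision <= 9, s"unsupported precision: $precision")
    if (precision <= 6) Micros else Nanos
  }

  // Decodes a raw INT64 value (time since midnight in the given unit)
  // into a java.time.LocalTime.
  def decode(raw: Long, unit: StorageUnit): LocalTime = unit match {
    case Micros => LocalTime.ofNanoOfDay(Math.multiplyExact(raw, 1000L))
    case Nanos  => LocalTime.ofNanoOfDay(raw)
  }
}
```

Testing precisions 6 and 7 exercises both sides of the boundary, while 0 and 9 cover the extremes of the assumed range.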
   
   Ran:
   ```
   build/sbt 'sql/testOnly *ParquetV1AggregatePushDownSuite 
*ParquetV2AggregatePushDownSuite *OrcV1AggregatePushDownSuite 
*OrcV2AggregatePushDownSuite'
   ```
   All 92 tests pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Generated-by: Cursor (Claude Opus 4.8)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

