[ 
https://issues.apache.org/jira/browse/SPARK-57568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-57568:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support TimeType in Parquet/ORC aggregate push-down
> ---------------------------------------------------
>
>                 Key: SPARK-57568
>                 URL: https://issues.apache.org/jira/browse/SPARK-57568
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.1.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>
> h2. What
> Support TimeType in Parquet/ORC aggregate push-down (MIN/MAX/COUNT computed
> from file-footer statistics).
> h2. Where
> * Shared eligibility gate: 
> {{AggregatePushDownUtils.getSchemaForPushedAggregation}} 
> (sql/core/.../execution/datasources/AggregatePushDownUtils.scala) -- the 
> MIN/MAX type allow-list that currently permits 
> Boolean/Byte/Short/Integer/Long/Float/Double/Date.
> * Shared columnar conversion: 
> {{AggregatePushDownUtils.convertAggregatesRowToBatch}} (uses 
> {{RowToColumnConverter}}, now TimeType-capable via SPARK-54203).
> * Engine-specific footer-stat readers: 
> {{ParquetUtils.createAggInternalRowFromFooter}} / {{getPushedDownAggResult}} 
> (Parquet INT64 logical TIME) and {{OrcUtils.getMinMaxFromColumnStatistics}} 
> (ORC stores TIME as LONG).
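
The footer-stat idea in the last bullet can be sketched in plain Scala. This is
a simplified model, not Spark's reader code: {{ColumnChunkStats}}, {{PushedAgg}}
and {{FooterAggSketch}} are hypothetical names, and treating the stored long as
nanoseconds since midnight is an assumption made only for the rendering helper.
It shows how per-row-group (or per-stripe) statistics could be folded into one
MIN/MAX/COUNT answer without reading any data pages.

```scala
import java.time.LocalTime

// Hypothetical stand-in for one row group's (Parquet) or stripe's (ORC)
// statistics for a single column. Real footers expose richer objects.
final case class ColumnChunkStats(
    min: Option[Long],   // None when the chunk has no non-null values
    max: Option[Long],
    rowCount: Long,
    nullCount: Long)

// Result of answering MIN(col), MAX(col), COUNT(col) from statistics alone.
final case class PushedAgg(min: Option[Long], max: Option[Long], count: Long)

object FooterAggSketch {
  def aggregate(chunks: Seq[ColumnChunkStats]): PushedAgg = {
    val mins = chunks.flatMap(_.min)
    val maxs = chunks.flatMap(_.max)
    PushedAgg(
      // Global MIN is the smallest per-chunk minimum.
      min = if (mins.isEmpty) None else Some(mins.min),
      // Global MAX is the largest per-chunk maximum.
      max = if (maxs.isEmpty) None else Some(maxs.max),
      // COUNT(col) skips nulls, so subtract each chunk's null count.
      count = chunks.map(c => c.rowCount - c.nullCount).sum)
  }

  // ASSUMPTION: the long encodes nanoseconds since midnight.
  def render(nanosOfDay: Long): LocalTime = LocalTime.ofNanoOfDay(nanosOfDay)
}
```

Because a TIME value is carried as a single long in this model, the fold is the
same as it would be for a LONG column; only the interpretation of the result
differs.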
> h2. Relationship to SPARK-54203
> The row-to-columnar conversion was blocked on SPARK-54203 
> ({{RowToColumnConverter.getConverterForType}}), which is now resolved, so the 
> columnar path supports TimeType. This sub-task adds the two remaining pieces 
> for TIME: push-down eligibility and footer-stat reading.
> h2. Scope: why one task for both ORC and Parquet (not split per engine)
> The eligibility decision is made in *shared, engine-agnostic* code: 
> {{getSchemaForPushedAggregation}} holds a single MIN/MAX type allow-list and 
> is called by both {{ParquetScanBuilder}} and {{OrcScanBuilder}}. Adding 
> {{TimeType}} there turns push-down on for *both* engines at once. Likewise, 
> the push-down test cases live in a *shared trait* 
> {{FileSourceAggregatePushDownSuite}} (extended by Parquet V1/V2 and ORC 
> V1/V2), so one TIME test exercises both engines.
> Only the small footer-stat readers differ per engine. Splitting the work into 
> separate ORC and Parquet tasks would force an artificial engine-aware refactor 
> of the shared gate (so that one engine can be enabled independently of the 
> other). It would also risk breaking the not-yet-updated engine while the 
> shared gate is flipped on: the un-updated reader would hit its 
> {{createAggInternalRowFromFooter}} fallback error for TIME aggregates. The 
> only benefit of splitting (separate review) does not justify that 
> coordination cost, so both engines are handled together here.
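
The scope argument above can be illustrated with a small self-contained model.
Everything here is hypothetical ({{SketchType}}, {{FooterReader}},
{{PushDownGateSketch}}); it does not use Spark's {{DataType}} classes or the
real {{getSchemaForPushedAggregation}} signature. It only shows why a single
shared allow-list enables both engines at once, and why a reader that has not
been taught TIME fails once the gate admits it.

```scala
// Hypothetical mini type hierarchy standing in for Spark's data types.
sealed trait SketchType
case object BooleanT extends SketchType
case object ByteT extends SketchType
case object ShortT extends SketchType
case object IntegerT extends SketchType
case object LongT extends SketchType
case object FloatT extends SketchType
case object DoubleT extends SketchType
case object DateT extends SketchType
case object TimeT extends SketchType
case object StringT extends SketchType // example of a type outside the list

// Hypothetical engine-specific reader: knows which types it can decode
// from footer statistics.
final case class FooterReader(engine: String, readable: Set[SketchType])

object PushDownGateSketch {
  // Shared, engine-agnostic gate: the types listed in the issue, plus TimeT
  // as the proposed addition.
  def minMaxAllowed(t: SketchType): Boolean = t match {
    case BooleanT | ByteT | ShortT | IntegerT | LongT | FloatT | DoubleT | DateT => true
    case TimeT => true
    case _ => false
  }

  // The gate is consulted first; the reader is only reached if it passes.
  def plan(reader: FooterReader, t: SketchType): Either[String, String] =
    if (!minMaxAllowed(t)) Right(s"${reader.engine}: no push-down, scan data")
    else if (reader.readable.contains(t)) Right(s"${reader.engine}: pushed down")
    else Left(s"${reader.engine}: reader cannot decode $t from footer stats")
}
```

With an updated reader, {{plan}} reports a push-down for {{TimeT}}; with a
reader whose {{readable}} set lacks {{TimeT}}, the same call returns a {{Left}},
which is the failure mode the paragraph above warns about.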
> h2. Acceptance criteria
> * MIN/MAX/COUNT over a TIME column can be pushed down to Parquet and ORC and 
> return correct results.
> * Tests are added to the Parquet/ORC aggregate push-down suites (shared 
> {{FileSourceAggregatePushDownSuite}}).
