[
https://issues.apache.org/jira/browse/SPARK-57568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-57568:
-----------------------------------
Labels: pull-request-available (was: )
> Support TimeType in Parquet/ORC aggregate push-down
> ---------------------------------------------------
>
> Key: SPARK-57568
> URL: https://issues.apache.org/jira/browse/SPARK-57568
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.1.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: pull-request-available
>
> h2. What
> Support TimeType in Parquet/ORC aggregate push-down (MIN/MAX/COUNT computed
> from file-footer statistics).
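
As a rough illustration of what "computed from file-footer statistics" means, the sketch below combines per-file statistics into MIN/MAX/COUNT without reading any rows. `FooterStats` and `FooterAgg` are hypothetical names invented for this example, not Spark, Parquet, or ORC classes, and the TIME values are modeled as plain `Long`s.

```scala
// Hypothetical model of the statistics a file footer might hold for one column.
// These names are illustrative only; they are not Spark, Parquet, or ORC APIs.
final case class FooterStats(
    rowCount: Long,    // total rows in the file
    nullCount: Long,   // rows where the column is null
    min: Option[Long], // None when the file holds only nulls for this column
    max: Option[Long])

object FooterAgg {
  // MIN over all files is the smallest per-file minimum.
  def pushedMin(files: Seq[FooterStats]): Option[Long] =
    files.flatMap(_.min).reduceOption(_ min _)

  // MAX over all files is the largest per-file maximum.
  def pushedMax(files: Seq[FooterStats]): Option[Long] =
    files.flatMap(_.max).reduceOption(_ max _)

  // COUNT(col) ignores nulls, so subtract each file's null count from its row count.
  def pushedCount(files: Seq[FooterStats]): Long =
    files.map(f => f.rowCount - f.nullCount).sum
}
```

For three files with (rows, nulls, min, max) of (3, 1, 100, 500), (2, 0, 50, 300) and (2, 2, none, none), this sketch yields MIN 50, MAX 500 and COUNT 4.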
> h2. Where
> * Shared eligibility gate:
> {{AggregatePushDownUtils.getSchemaForPushedAggregation}}
> (sql/core/.../execution/datasources/AggregatePushDownUtils.scala) -- the
> MIN/MAX type allow-list that currently permits
> Boolean/Byte/Short/Integer/Long/Float/Double/Date.
> * Shared columnar conversion:
> {{AggregatePushDownUtils.convertAggregatesRowToBatch}} (uses
> {{RowToColumnConverter}}, now TimeType-capable via SPARK-54203).
> * Engine-specific footer-stat readers:
> {{ParquetUtils.createAggInternalRowFromFooter}} / {{getPushedDownAggResult}}
> (Parquet INT64 logical TIME) and {{OrcUtils.getMinMaxFromColumnStatistics}}
> (ORC stores TIME as LONG).
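
Since both engines expose the TIME statistic as a 64-bit integer, a reader has to interpret that raw value as a time of day. The sketch below shows one way this could look. The ticket does not state which unit each format uses, so the `Micros`/`Nanos` choice is an assumption, and `TimeStat` is a hypothetical helper rather than part of `ParquetUtils` or `OrcUtils`.

```scala
import java.time.LocalTime

// Hypothetical unit tag; which unit a given footer statistic uses is an assumption here.
sealed trait StatUnit
case object Micros extends StatUnit
case object Nanos extends StatUnit

object TimeStat {
  // Interpret a raw long statistic as a time of day counted from midnight.
  def toLocalTime(raw: Long, unit: StatUnit): LocalTime = unit match {
    // multiplyExact throws on overflow instead of silently wrapping.
    case Micros => LocalTime.ofNanoOfDay(Math.multiplyExact(raw, 1000L))
    case Nanos  => LocalTime.ofNanoOfDay(raw)
  }
}
```

`LocalTime.ofNanoOfDay` rejects values outside a single day, so a corrupt or misinterpreted statistic fails loudly rather than producing a wrong MIN/MAX.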
> h2. Relationship to SPARK-54203
> The row-to-columnar conversion was blocked on SPARK-54203
> ({{RowToColumnConverter.getConverterForType}}), which is now Resolved, so the
> columnar path supports TimeType. This sub-task adds the push-down eligibility
> and the footer-stat reading for TIME.
> h2. Scope: why one task for both ORC and Parquet (not split per engine)
> The eligibility decision is made in *shared, engine-agnostic* code:
> {{getSchemaForPushedAggregation}} holds a single MIN/MAX type allow-list and
> is called by both {{ParquetScanBuilder}} and {{OrcScanBuilder}}. Adding
> {{TimeType}} there turns push-down on for *both* engines at once. Likewise,
> the push-down test cases live in a *shared trait*
> {{FileSourceAggregatePushDownSuite}} (extended by Parquet V1/V2 and ORC
> V1/V2), so one TIME test exercises both engines.
> Only the small footer-stat readers differ per engine. Splitting into separate
> ORC/Parquet tasks would force an artificial engine-aware refactor of the
> shared gate (so one engine can be enabled independently of the other) and
> would risk breaking the not-yet-updated engine while the shared gate is
> flipped on -- the un-updated reader would hit its
> {{createAggInternalRowFromFooter}} fallback error for TIME aggregates. The
> only benefit of splitting (separate reviews) does not justify that
> coordination cost, so both engines are handled together here.
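
The shared-gate argument above can be sketched in a few lines. This is a simplified model, not the code in `AggregatePushDownUtils`: the type and aggregate classes are hypothetical stand-ins, `ParquetBuilderSketch` and `OrcBuilderSketch` only mimic the two scan builders calling one function, and treating COUNT as always eligible is an assumption made for brevity.

```scala
// Hypothetical stand-ins for column types; only the names mirror the ticket.
sealed trait ColType
case object BooleanT extends ColType
case object ByteT extends ColType
case object ShortT extends ColType
case object IntegerT extends ColType
case object LongT extends ColType
case object FloatT extends ColType
case object DoubleT extends ColType
case object DateT extends ColType
case object TimeT extends ColType
case object StringT extends ColType

// Hypothetical aggregate descriptions.
sealed trait Agg
final case class Min(t: ColType) extends Agg
final case class Max(t: ColType) extends Agg
final case class Count(t: ColType) extends Agg

object PushDownGate {
  // One MIN/MAX allow-list for every engine; TimeT is the single addition.
  val minMaxAllowed: Set[ColType] =
    Set(BooleanT, ByteT, ShortT, IntegerT, LongT, FloatT, DoubleT, DateT, TimeT)

  def canPushDown(aggs: Seq[Agg]): Boolean = aggs.forall {
    case Min(t)   => minMaxAllowed(t)
    case Max(t)   => minMaxAllowed(t)
    case Count(_) => true // assumption: COUNT needs no type check in this sketch
  }
}

// Both builders delegate to the same gate, so neither can be enabled alone.
object ParquetBuilderSketch {
  def accepts(aggs: Seq[Agg]): Boolean = PushDownGate.canPushDown(aggs)
}
object OrcBuilderSketch {
  def accepts(aggs: Seq[Agg]): Boolean = PushDownGate.canPushDown(aggs)
}
```

In this model, enabling TIME for only one engine would require threading an engine parameter through `canPushDown`, which is the engine-aware refactor the ticket wants to avoid.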
> h2. Acceptance criteria
> * MIN/MAX/COUNT over a TIME column can be pushed down to Parquet and ORC and
> returns correct results.
> * Tests added in the Parquet/ORC aggregate push-down suites (shared
> {{FileSourceAggregatePushDownSuite}}).