[
https://issues.apache.org/jira/browse/SPARK-57568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-57568:
-----------------------------
Description:
h2. What
Support TimeType in Parquet/ORC aggregate push-down (MIN/MAX/COUNT computed from
file-footer statistics).
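To make the idea concrete, the following is a minimal sketch of answering MIN/MAX/COUNT from per-file footer statistics instead of scanning rows. {{FooterStats}} and {{FooterAggregates}} are hypothetical names invented for this illustration, not Spark classes, and TIME values are modelled as plain {{Long}} values.

```scala
// Hypothetical stand-in for the column statistics stored in one file footer.
// TIME values are modelled as Longs (a count of time units since midnight).
final case class FooterStats(
    min: Option[Long],   // None when the file has no non-null values
    max: Option[Long],
    rowCount: Long,      // total rows in the file
    nullCount: Long)     // rows where the column is null

object FooterAggregates {
  // MIN over all files is the smallest per-file minimum.
  def min(files: Seq[FooterStats]): Option[Long] = files.flatMap(_.min).minOption

  // MAX over all files is the largest per-file maximum.
  def max(files: Seq[FooterStats]): Option[Long] = files.flatMap(_.max).maxOption

  // COUNT(*) only needs the row counts.
  def countStar(files: Seq[FooterStats]): Long = files.map(_.rowCount).sum

  // COUNT(col) excludes nulls.
  def countColumn(files: Seq[FooterStats]): Long =
    files.map(f => f.rowCount - f.nullCount).sum
}
```

No data pages are read in this model: every result comes from the per-file statistics, which is what makes the push-down cheap.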
h2. Where
* Shared eligibility gate:
{{AggregatePushDownUtils.getSchemaForPushedAggregation}}
(sql/core/.../execution/datasources/AggregatePushDownUtils.scala) -- the
MIN/MAX type allow-list that currently permits
Boolean/Byte/Short/Integer/Long/Float/Double/Date.
* Shared columnar conversion:
{{AggregatePushDownUtils.convertAggregatesRowToBatch}} (uses
{{RowToColumnConverter}}, now TimeType-capable via SPARK-54203).
* Engine-specific footer-stat readers:
{{ParquetUtils.createAggInternalRowFromFooter}} / {{getPushedDownAggResult}}
(Parquet INT64 logical TIME) and {{OrcUtils.getMinMaxFromColumnStatistics}}
(ORC stores TIME as LONG).
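As an illustration of the shared eligibility gate, here is a minimal sketch of a MIN/MAX type allow-list with a TIME case added. The types below are hypothetical stand-ins defined only for this example, not the real {{org.apache.spark.sql.types}} classes, and {{PushDownGate.minMaxEligible}} is an invented name; the actual change would go into {{getSchemaForPushedAggregation}}.

```scala
// Hypothetical stand-ins for Spark SQL data types (not the real classes).
sealed trait SketchType
case object BooleanT extends SketchType
case object ByteT extends SketchType
case object ShortT extends SketchType
case object IntegerT extends SketchType
case object LongT extends SketchType
case object FloatT extends SketchType
case object DoubleT extends SketchType
case object DateT extends SketchType
case object TimeT extends SketchType
case object StringT extends SketchType

object PushDownGate {
  // Returns true when MIN/MAX over a column of this type may be pushed down.
  def minMaxEligible(t: SketchType): Boolean = t match {
    // The allow-list as listed in this description.
    case BooleanT | ByteT | ShortT | IntegerT | LongT | FloatT | DoubleT | DateT => true
    // The addition proposed by this sub-task.
    case TimeT => true
    // Everything else falls back to a regular scan.
    case _ => false
  }
}
```

Because both scan builders consult the same gate, the single {{TimeT}} case in this model enables the type for every engine that calls it.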
h2. Relationship to SPARK-54203
The row-to-columnar conversion was blocked on SPARK-54203
({{RowToColumnConverter.getConverterForType}}), which is now Resolved, so the
columnar path supports TimeType. This sub-task adds the two remaining pieces:
push-down eligibility and footer-stat reading for TIME.
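The footer-stat reading step can be sketched as a unit conversion from the stored statistic to the in-memory value. The units here are assumptions made purely for illustration (microseconds since midnight in the footer, nanoseconds since midnight in memory) and should be checked against the actual Parquet and ORC readers. {{TimeStatConversion}} is a hypothetical helper, not an existing Spark utility.

```scala
object TimeStatConversion {
  val NanosPerMicro: Long = 1000L
  // Number of nanoseconds in one day: 24h * 60m * 60s * 10^9.
  val NanosPerDay: Long = 24L * 60L * 60L * 1000L * 1000L * 1000L

  // Converts a footer MIN/MAX statistic (assumed: micros since midnight)
  // into an in-memory TIME value (assumed: nanos since midnight).
  def footerMicrosToTimeNanos(micros: Long): Long = {
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    val nanos = Math.multiplyExact(micros, NanosPerMicro)
    // A time of day must fall within [00:00:00, 24:00:00).
    require(nanos >= 0L && nanos < NanosPerDay, s"TIME statistic out of range: $micros")
    nanos
  }
}
```

For example, under these assumptions a footer minimum of 43200000000 (12:00:00 in microseconds) becomes 43200000000000 nanoseconds.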
h2. Scope: why one task for both ORC and Parquet (not split per engine)
The eligibility decision is made in *shared, engine-agnostic* code:
{{getSchemaForPushedAggregation}} holds a single MIN/MAX type allow-list and is
called by both {{ParquetScanBuilder}} and {{OrcScanBuilder}}. Adding
{{TimeType}} there turns push-down on for *both* engines at once. Likewise, the
push-down test cases live in a *shared trait*
{{FileSourceAggregatePushDownSuite}} (extended by Parquet V1/V2 and ORC V1/V2),
so one TIME test exercises both engines.
Only the small footer-stat readers differ per engine. Splitting into separate
ORC/Parquet tasks would force an artificial engine-aware refactor of the shared
gate (so one engine can be enabled independently of the other) and would risk
breaking the not-yet-updated engine while the shared gate is flipped on -- the
un-updated reader would hit its {{createAggInternalRowFromFooter}} fallback
error for TIME aggregates. The only benefit of splitting (separate review) does
not justify that coordination cost, so both engines are handled together here.
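The failure mode described above can be modelled in a few lines. All names below ({{SharedGate}}, {{UpdatedReader}}, {{StaleReader}}, and the column types) are hypothetical and exist only for this sketch; the error message is likewise invented and is not the text Spark emits.

```scala
// Hypothetical column types for this sketch.
sealed trait ColType
case object LongCol extends ColType
case object TimeCol extends ColType

// The shared gate after TIME has been added to the allow-list.
object SharedGate {
  def eligible(t: ColType): Boolean = t match {
    case LongCol | TimeCol => true
  }
}

// A reader that has been taught to handle TIME footer statistics.
object UpdatedReader {
  def readMin(t: ColType, stat: Long): Long = t match {
    case LongCol | TimeCol => stat
  }
}

// A reader that has not been updated: TIME falls through to the fallback error.
object StaleReader {
  def readMin(t: ColType, stat: Long): Long = t match {
    case LongCol => stat
    case other =>
      throw new UnsupportedOperationException(s"Unsupported type for push-down: $other")
  }
}
```

In this model the gate accepts {{TimeCol}} for every engine, so a TIME aggregate routed to {{StaleReader}} fails at read time rather than falling back to a normal scan, which is why the gate and both readers are changed in one task.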
h2. Acceptance criteria
* MIN/MAX/COUNT over a TIME column can be pushed down to Parquet and ORC and
return correct results.
* Tests added in the Parquet/ORC aggregate push-down suites (shared
{{FileSourceAggregatePushDownSuite}}).
was:
h2. What
Support {{TimeType}} in Parquet/ORC aggregate push-down (MIN/MAX/COUNT computed
from file-footer statistics).
h2. Where
{{AggregatePushDownUtils}}
(sql/core/.../execution/datasources/AggregatePushDownUtils.scala, around line
153) builds a columnar batch from file-footer statistics to answer pushed-down
aggregates.
h2. Relationship to SPARK-54203
This area is gated at the row-to-columnar converter level by SPARK-54203
({{RowToColumnConverter.getConverterForType}} in
sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala), which
currently has no {{TimeType}} case. Once SPARK-54203 lands, the work below
becomes possible. It may also require TIME support in this component's own
layer (Arrow type mapping, Parquet/ORC logical types, or Variant encoding).
h2. Acceptance criteria
* MIN/MAX/COUNT over a TIME column can be pushed down to Parquet and ORC and
returns correct results.
* Tests added in the Parquet/ORC aggregate push-down suites.
> Support TimeType in Parquet/ORC aggregate push-down
> ---------------------------------------------------
>
> Key: SPARK-57568
> URL: https://issues.apache.org/jira/browse/SPARK-57568
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.1.0
> Reporter: Max Gekk
> Priority: Major