[ 
https://issues.apache.org/jira/browse/SPARK-57568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-57568:
-----------------------------
    Description: 
h2. What

Support TimeType in Parquet/ORC aggregate push-down (MIN/MAX/COUNT computed from
file-footer statistics).
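
As a rough illustration of what "computed from file-footer statistics" means, the sketch below folds per-chunk statistics into file-level MIN/MAX/COUNT without reading any rows. {{ChunkStats}} and {{FooterAgg}} are hypothetical names, not Spark, Parquet, or ORC APIs, and TIME values are modeled as plain {{Long}} with an unspecified unit.

```scala
// Hypothetical model of the statistics a footer keeps per row group (Parquet)
// or per stripe (ORC) for one column. Not a real Spark/Parquet/ORC class.
// TIME values are modeled as Long; the unit is an assumption of this sketch.
final case class ChunkStats(
    min: Option[Long],  // None when the chunk holds only nulls
    max: Option[Long],
    rowCount: Long,
    nullCount: Long)

object FooterAgg {
  // MIN over the file is the smallest per-chunk minimum.
  def min(chunks: Seq[ChunkStats]): Option[Long] =
    chunks.flatMap(_.min).minOption

  // MAX over the file is the largest per-chunk maximum.
  def max(chunks: Seq[ChunkStats]): Option[Long] =
    chunks.flatMap(_.max).maxOption

  // COUNT(col) skips nulls, so subtract the null count from the row count.
  def count(chunks: Seq[ChunkStats]): Long =
    chunks.map(c => c.rowCount - c.nullCount).sum
}
```

The real readers also have to handle missing or truncated statistics, which this sketch leaves out.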

h2. Where

* Shared eligibility gate: 
{{AggregatePushDownUtils.getSchemaForPushedAggregation}} 
(sql/core/.../execution/datasources/AggregatePushDownUtils.scala) -- the 
MIN/MAX type allow-list that currently permits 
Boolean/Byte/Short/Integer/Long/Float/Double/Date.
* Shared columnar conversion: 
{{AggregatePushDownUtils.convertAggregatesRowToBatch}} (uses 
{{RowToColumnConverter}}, now TimeType-capable via SPARK-54203).
* Engine-specific footer-stat readers: 
{{ParquetUtils.createAggInternalRowFromFooter}} / {{getPushedDownAggResult}} 
(Parquet INT64 logical TIME) and {{OrcUtils.getMinMaxFromColumnStatistics}} 
(ORC stores TIME as LONG).
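
The shared eligibility gate listed above can be pictured as a pattern match over the column type. This is a minimal, self-contained sketch and not the actual body of {{getSchemaForPushedAggregation}}: the {{DataType}} hierarchy is a toy stand-in for Spark's, {{minMaxPushDownAllowed}} is a hypothetical name, and modeling {{TimeType}} with a precision field is an assumption.

```scala
object GateSketch {
  // Toy stand-ins for Spark SQL data types (assumption: simplified for illustration).
  sealed trait DataType
  case object BooleanType extends DataType
  case object ByteType extends DataType
  case object ShortType extends DataType
  case object IntegerType extends DataType
  case object LongType extends DataType
  case object FloatType extends DataType
  case object DoubleType extends DataType
  case object DateType extends DataType
  case object StringType extends DataType
  final case class TimeType(precision: Int) extends DataType

  // Hypothetical MIN/MAX allow-list: true when the type may be pushed down.
  def minMaxPushDownAllowed(dt: DataType): Boolean = dt match {
    case BooleanType | ByteType | ShortType | IntegerType |
         LongType | FloatType | DoubleType | DateType => true
    case _: TimeType => true  // the case this sub-task proposes to add
    case _ => false           // everything else stays off the list
  }
}
```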

h2. Relationship to SPARK-54203

The row-to-columnar conversion was blocked on SPARK-54203 
({{RowToColumnConverter.getConverterForType}}), which is now Resolved, so the 
columnar path supports TimeType. This sub-task adds the remaining two pieces: 
push-down eligibility and footer-stat reading for TIME.

h2. Scope: why one task for both ORC and Parquet (not split per engine)

The eligibility decision is made in *shared, engine-agnostic* code: 
{{getSchemaForPushedAggregation}} holds a single MIN/MAX type allow-list and is 
called by both {{ParquetScanBuilder}} and {{OrcScanBuilder}}. Adding 
{{TimeType}} there turns push-down on for *both* engines at once. Likewise, the 
push-down test cases live in a *shared trait* 
{{FileSourceAggregatePushDownSuite}} (extended by Parquet V1/V2 and ORC V1/V2), 
so one TIME test exercises both engines.

Only the small footer-stat readers differ per engine. Splitting into separate 
ORC/Parquet tasks would force an artificial engine-aware refactor of the shared 
gate (so one engine can be enabled independently of the other) and would risk 
breaking the not-yet-updated engine while the shared gate is flipped on -- the 
un-updated reader would hit its {{createAggInternalRowFromFooter}} fallback 
error for TIME aggregates. The isolated benefit (separate review) does not 
justify that coordination cost, so both engines are handled together here.
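
The risk described above can be reproduced in miniature: once the shared gate accepts TIME, a reader that has no TIME case falls through to its error branch. All names here ({{eligible}}, {{updatedReader}}, {{staleReader}}, {{ColType}}) are hypothetical, and the exception type and message are placeholders rather than the errors Spark actually raises.

```scala
object SplitRolloutSketch {
  // Toy column types (assumption: only two are needed to show the failure mode).
  sealed trait ColType
  case object LongCol extends ColType
  case object TimeCol extends ColType

  // Shared gate, already flipped on for TIME; both engines consult it.
  def eligible(t: ColType): Boolean = t match {
    case LongCol | TimeCol => true
  }

  // Reader of the engine that was updated together with the gate.
  def updatedReader(t: ColType, footerStat: Long): Long = t match {
    case LongCol | TimeCol => footerStat
  }

  // Reader of the engine that was left for a later task: no TIME case,
  // so a TIME aggregate that passed the gate lands in the fallback error.
  def staleReader(t: ColType, footerStat: Long): Long = t match {
    case LongCol => footerStat
    case other =>
      throw new UnsupportedOperationException(
        s"Unexpected type for pushed-down aggregate: $other")
  }
}
```

Updating the gate and both readers in one change avoids the window in which {{eligible}} and {{staleReader}} disagree.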

h2. Acceptance criteria

* MIN/MAX/COUNT over a TIME column can be pushed down to Parquet and ORC and 
return correct results.
* Tests are added to the Parquet/ORC aggregate push-down suites (shared 
{{FileSourceAggregatePushDownSuite}}).

  was:
h2. What

Support {{TimeType}} in Parquet/ORC aggregate push-down (MIN/MAX/COUNT computed 
from file-footer statistics).

h2. Where

{{AggregatePushDownUtils}} 
(sql/core/.../execution/datasources/AggregatePushDownUtils.scala, around line 
153) builds a columnar batch from file-footer statistics to answer pushed-down 
aggregates.

h2. Relationship to SPARK-54203

This area is gated at the row-to-columnar converter level by SPARK-54203 
({{RowToColumnConverter.getConverterForType}} in 
sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala), which 
currently has no {{TimeType}} case. Once SPARK-54203 lands, the work below 
becomes possible. It may also require TIME support in this component's own 
layer (Arrow type mapping, Parquet/ORC logical types, or Variant encoding).

h2. Acceptance criteria

* MIN/MAX/COUNT over a TIME column can be pushed down to Parquet and ORC and 
returns correct results.
* Tests added in the Parquet/ORC aggregate push-down suites.


> Support TimeType in Parquet/ORC aggregate push-down
> ---------------------------------------------------
>
>                 Key: SPARK-57568
>                 URL: https://issues.apache.org/jira/browse/SPARK-57568
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.1.0
>            Reporter: Max Gekk
>            Priority: Major


