[
https://issues.apache.org/jira/browse/SPARK-57582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-57582:
-----------------------------
Description:
h2. Background
{{TimeType}} Parquet I/O is handled by the types framework:
{{ParquetTypeOps(dt)}} returns {{Some(TimeTypeParquetOps(...))}} for
{{TimeType}}, and the three Parquet entry points dispatch framework-first
({{ParquetTypeOps(dt).map(...).getOrElse(default)}}):
* schema (write): {{SparkToParquetSchemaConverter.convertField}} ->
{{TimeTypeParquetOps.convertToParquetType}}
* value (write): {{ParquetWriteSupport.makeWriter}} ->
{{TimeTypeParquetOps.makeWriter}}
* value (row read): {{ParquetRowConverter.newConverter}} ->
{{TimeTypeParquetOps.newConverter}}
Each entry point still carries an inline {{case _: TimeType}} branch in its
{{*Default}} fallback. Because {{ParquetTypeOps(TimeType)}} always returns
{{Some(...)}}, the framework op always wins and these fallbacks never execute.
h2. Problem
The fallbacks are dead code that nonetheless duplicates real logic. SPARK-57551
extended {{TimeType}} to nanosecond precision and added the {{TIME(MICROS)}}
(precision 0..6) vs {{TIME(NANOS)}} (precision 7..9) bifurcation, plus
read-time precision truncation, to *both* the live framework op and the three
shadowed fallbacks. The duplicated copies are a maintenance hazard:
* an edit to a fallback has no runtime effect (silently ineffective);
* an edit to the live op that misses the fallbacks leaves stale copies that
would become live (and wrong) if the framework dispatch were ever removed for
{{TimeType}}.
Dead, duplicated branches (master after SPARK-57551):
* {{ParquetSchemaConverter.scala}}: {{convertFieldDefault}}, {{case t:
TimeType}}
* {{ParquetWriteSupport.scala}}: {{makeWriterDefault}}, {{case t: TimeType if
t.precision > MICROS_PRECISION}} and {{case _: TimeType}}
* {{ParquetRowConverter.scala}}: {{newConverterDefault}}, {{case t: TimeType}}
h2. Proposal
Remove the shadowed {{TimeType}} fallback branches so {{TimeTypeParquetOps}} is
the single source of truth, or document that the fallbacks are intentionally
retained during the types-framework migration. More broadly, any type fully
migrated to {{ParquetTypeOps}} leaves a dead {{*Default}} fallback case; audit
and drop those consistently as part of the migration.
h2. Notes
Found during code review of SPARK-57551 (PR
https://github.com/apache/spark/pull/56622). No user-facing behavior change;
cleanup and maintainability only.
> Remove dead TimeType fallback branches shadowed by the Parquet types
> framework dispatch
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-57582
> URL: https://issues.apache.org/jira/browse/SPARK-57582
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Priority: Major
>
> h2. Background
> {{TimeType}} Parquet I/O is handled by the types framework:
> {{ParquetTypeOps(dt)}} returns {{Some(TimeTypeParquetOps(...))}} for
> {{TimeType}}, and the three Parquet entry points dispatch framework-first
> ({{ParquetTypeOps(dt).map(...).getOrElse(default)}}):
> * schema (write): {{SparkToParquetSchemaConverter.convertField}} ->
> {{TimeTypeParquetOps.convertToParquetType}}
> * value (write): {{ParquetWriteSupport.makeWriter}} ->
> {{TimeTypeParquetOps.makeWriter}}
> * value (row read): {{ParquetRowConverter.newConverter}} ->
> {{TimeTypeParquetOps.newConverter}}
> Each entry point still carries an inline {{case _: TimeType}} branch in its
> {{*Default}} fallback. Because {{ParquetTypeOps(TimeType)}} always returns
> {{Some(...)}}, the framework op always wins and these fallbacks never execute.
> h2. Problem
> The fallbacks are dead code that nonetheless duplicates real logic.
> SPARK-57551 extended {{TimeType}} to nanosecond precision and added the
> {{TIME(MICROS)}} (precision 0..6) vs {{TIME(NANOS)}} (precision 7..9)
> bifurcation, plus read-time precision truncation, to *both* the live
> framework op and the three shadowed fallbacks. The duplicated copies are a
> maintenance hazard:
> * an edit to a fallback has no runtime effect (silently ineffective);
> * an edit to the live op that misses the fallbacks leaves stale copies that
> would become live (and wrong) if the framework dispatch were ever removed for
> {{TimeType}}.
> Dead, duplicated branches (master after SPARK-57551):
> * {{ParquetSchemaConverter.scala}}: {{convertFieldDefault}}, {{case t:
> TimeType}}
> * {{ParquetWriteSupport.scala}}: {{makeWriterDefault}}, {{case t: TimeType if
> t.precision > MICROS_PRECISION}} and {{case _: TimeType}}
> * {{ParquetRowConverter.scala}}: {{newConverterDefault}}, {{case t: TimeType}}
> h2. Proposal
> Remove the shadowed {{TimeType}} fallback branches so {{TimeTypeParquetOps}}
> is the single source of truth, or document that the fallbacks are
> intentionally retained during the types-framework migration. More broadly,
> any type fully migrated to {{ParquetTypeOps}} leaves a dead {{*Default}}
> fallback case; audit and drop those consistently as part of the migration.
> h2. Notes
> Found during code review of SPARK-57551 (PR
> https://github.com/apache/spark/pull/56622). No user-facing behavior change;
> cleanup and maintainability only.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]