[
https://issues.apache.org/jira/browse/SPARK-57416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089286#comment-18089286
]
Anupam Yadav commented on SPARK-57416:
--------------------------------------
Thanks for letting me take this, [~maxgekk]. I reproduced it at the file level
(wrote a raw Parquet file with message root { optional int64 t
(TIME(MICROS,true)); } via parquet-mr) on current master, and the behavior is:
* Schema inference fails: spark.read.parquet(path) produces
[PARQUET_TYPE_ILLEGAL] Illegal Parquet type: INT64 (TIME(MICROS,true)). The
guard is in ParquetToSparkSchemaConverter (time.getUnit == MICROS &&
!time.isAdjustedToUTC).
* Explicit-schema read succeeds: spark.read.schema("t TIME(6)").parquet(path)
returns the correct 23:59:59.123456, because ParquetRowConverter and the
vectorized updater only check the unit, not isAdjustedToUTC.
One note: the ticket frames this as a framework ON/OFF divergence, but since
spark.sql.types.framework.enabled was removed in SPARK-57372 (framework
always-on), what remains is an inconsistency within the always-on path; the
schema converter rejects isAdjustedToUTC=true while the read converters accept
it and return correct values.
My proposed direction is to relax the schema-converter guard to accept
isAdjustedToUTC true or false, making inference consistent with the
explicit-read path that already reads these files correctly (the stored value
is microseconds-since-midnight regardless of the flag, and TimeType has no
timezone). The alternative (tightening the read path to also reject) would
break reads that work today.
Does relaxing inference match your intent, or do you prefer rejecting
isAdjustedToUTC=true consistently across both paths? Happy to go either way.
> Types Framework - Resolve TimeType Parquet read-path guard ON/OFF divergence
> for TIME(MICROS, isAdjustedToUTC=true)
> -------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-57416
> URL: https://issues.apache.org/jira/browse/SPARK-57416
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Priority: Major
>
> *Parent:* SPARK-55444 (Types Framework), follow-up of PR apache/spark#55326
> (Phase 3a - Storage Formats - Parquet).
> h3. Problem
> When the Types Framework integrates TimeType into the Parquet read path,
> TimeTypeParquetOps.requireCompatibleParquetType enforces a stricter guard than
> the inline guard it replaces in ParquetRowConverter:
> * New guard (framework ON): INT64 && unit == MICROS && !isAdjustedToUTC
> * Original guard (framework OFF): TimeLogicalTypeAnnotation && unit == MICROS
> (does NOT check isAdjustedToUTC)
> The new guard is strictly tighter, even though the code comment claims it
> "mirrors the inline guard that existed in ParquetRowConverter before the
> framework dispatch".
> h3. Observed behavior difference
> Reading a TIME(MICROS, isAdjustedToUTC=true) column as TimeType via an
> explicit
> read schema (reachable because schema inference already maps
> isAdjustedToUTC=true
> to illegalType()):
> * Framework OFF -> succeeds (returns e.g. 23:59:59.123456)
> * Framework ON -> fails with FAILED_READ_FILE /
> cannotCreateParquetConverterForDataTypeError
> This contradicts the "behavior is identical in both cases" guarantee of the
> framework refactor. It is an edge case (only via explicit read schema) and is
> orthogonal to the framework wiring done in #55326, hence deferred to this
> follow-up.
> h3. Scope / required decision
> Decide the intended behavior and align both paths:
> # If ON/OFF equivalence is the intent: relax the framework guard to mirror the
> original (drop the !isAdjustedToUTC / INT64-specific checks) so framework ON
> and OFF behave identically.
> # If the stricter check is intentional: apply the same tightening to the
> *Default (non-framework) path, add a test documenting the (now intended)
> behavior change, and update the misleading "mirrors the inline guard"
> comment.
> h3. Test cleanup
> Update TimeTypeParquetOpsSuite: its scaladoc and the comment around the
> isAdjustedToUTC case currently describe behavior in terms of the #55326 review
> history ("whichever resolution lands..."). Reword to state the chosen
> invariant
> as intended behavior and reference this ticket instead of the PR thread.
> TimeTypeParquetOpsSuite already pins this case as a regression hook.
> h3. Does this introduce a user-facing change?
> Potentially yes, depending on the chosen resolution (reading
> TIME(MICROS, isAdjustedToUTC=true) as TimeType via explicit read schema). The
> resolution should make ON/OFF behavior consistent and document the final
> semantics.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]