Max Gekk created SPARK-57661:
--------------------------------

             Summary: Preserve TIME precision in the Spark <-> Arrow type 
mapping
                 Key: SPARK-57661
                 URL: https://issues.apache.org/jira/browse/SPARK-57661
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.3.0
            Reporter: Max Gekk


h2. What

Carry the {{TimeType(p)}} fractional-second precision {{p}} (in [0, 9]) across 
the Spark <-> Arrow type mapping so that a {{TIME(p)}} column round-trips back 
to the same {{TIME(p)}}, instead of collapsing to the canonical {{TIME(6)}}.

h2. Why

{{ArrowUtils}} / the Types Framework currently map every {{TimeType(p)}} to 
{{ArrowType.Time(TimeUnit.NANOSECOND, 64)}} (no precision field), and 
{{TypeApiOps.fromArrowType}} maps {{ArrowType.Time(NANOSECOND, 64)}} back to a 
fixed {{TimeType(TimeType.MICROS_PRECISION)}} (= 6). As a result the declared 
precision is lost on any Arrow round-trip ({{TIME(0)}}, {{TIME(3)}}, 
{{TIME(9)}}, ... all read back as {{TIME(6)}}), so Arrow-based schema transfer 
(Connect schema/results, createDataFrame from Arrow, mapInArrow, etc.) silently 
widens or narrows the type label. The stored value is already 
nanosecond-resolution and is unaffected; this is purely a type-fidelity gap.

Arrow's {{Time}} logical type only encodes (unit, bitWidth) and has no 
fractional-precision field, so the precision cannot live in the {{ArrowType}} 
itself. It can, however, be carried in the Arrow {{Field}} metadata, the same 
channel Spark already uses to reconstruct parameterized logical types 
(Geometry/Geography recover {{srid}}; the nanosecond timestamp types carry 
their precision under {{SPARK::timestampNanosPrecision}} per SPARK-57159).

h2. Scope

* {{sql/api/.../util/ArrowUtils.scala}}: in {{toArrowField}}, tag {{TimeType(p)}} 
fields with a precision metadata key (e.g. {{SPARK::timePrecision}}), merged 
with the column metadata; in {{fromArrowField}}, read that key to reconstruct 
{{TimeType(p)}}.
* {{sql/api/.../types/ops/TimeTypeApiOps.scala}} and 
{{TypeApiOps.fromArrowType}}: keep {{toArrowType}} producing {{Time(NANOSECOND, 
64)}}; keep the metadata-less {{fromArrowType}} as the canonical fallback.
* Reuse the precision-in-field-metadata pattern introduced for the nanosecond 
timestamp types (SPARK-57159) for consistency.

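The field-level tagging described in this section can be modelled roughly as follows. This is a minimal, language-neutral sketch in Python, not the Spark implementation (which lives in Scala): {{TimeType}}, {{ArrowTime}}, {{ArrowField}}, {{to_arrow_field}} and {{from_arrow_field}} are hypothetical stand-ins for the real classes and methods, and the key name is only the example suggested above. Range validation is deliberately left out here.

```python
from dataclasses import dataclass, field

# Example key name from the issue text; the final name is an assumption.
TIME_PRECISION_KEY = "SPARK::timePrecision"
MICROS_PRECISION = 6  # canonical fallback precision


@dataclass(frozen=True)
class TimeType:
    """Stand-in for Spark's TimeType(p)."""
    precision: int = MICROS_PRECISION


@dataclass(frozen=True)
class ArrowTime:
    """Stand-in for ArrowType.Time: only (unit, bitWidth), no precision."""
    unit: str = "NANOSECOND"
    bit_width: int = 64


@dataclass
class ArrowField:
    """Stand-in for an Arrow Field: name, type, string-to-string metadata."""
    name: str
    arrow_type: ArrowTime
    metadata: dict = field(default_factory=dict)


def to_arrow_field(name, dt, column_metadata=None):
    # The Arrow type stays Time(NANOSECOND, 64); the precision travels
    # in the field metadata, merged with any existing column metadata.
    merged = dict(column_metadata or {})
    merged[TIME_PRECISION_KEY] = str(dt.precision)
    return ArrowField(name, ArrowTime(), merged)


def from_arrow_field(arrow_field):
    # Pop the precision key so it does not leak into the column metadata.
    column_metadata = dict(arrow_field.metadata)
    raw = column_metadata.pop(TIME_PRECISION_KEY, None)
    precision = int(raw) if raw is not None else MICROS_PRECISION
    return TimeType(precision), column_metadata
```

In this model a {{TIME(3)}} column with a {{comment}} entry comes back as {{TimeType(3)}} with only the {{comment}} entry in its metadata, while the Arrow type itself is unchanged.
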
h2. Behavior on read-back

* Metadata present: reconstruct the exact {{TimeType(p)}}.
* Metadata absent (foreign Arrow data) or out of [0, 9]: fall back to the current 
canonical {{TimeType(MICROS_PRECISION)}} (= 6), preserving today's behavior for 
non-Spark producers.

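The read-back rule above could be sketched as a small resolver. This is a hedged Python model, not Spark code: {{resolve_time_precision}} is a hypothetical helper, the key name is the example from the Scope section, and treating an unparsable value like an out-of-range one is an assumption that the issue does not spell out.

```python
MICROS_PRECISION = 6                      # canonical fallback, TIME(6)
MIN_PRECISION, MAX_PRECISION = 0, 9       # valid TimeType precision range
TIME_PRECISION_KEY = "SPARK::timePrecision"  # example key name (assumption)


def resolve_time_precision(field_metadata):
    """Return the precision to use for a Time(NANOSECOND, 64) field."""
    raw = field_metadata.get(TIME_PRECISION_KEY)
    if raw is None:
        # Foreign Arrow data: no Spark precision tag present.
        return MICROS_PRECISION
    try:
        p = int(raw)
    except (TypeError, ValueError):
        # Assumption: unparsable values are treated like out-of-range ones.
        return MICROS_PRECISION
    if MIN_PRECISION <= p <= MAX_PRECISION:
        return p
    return MICROS_PRECISION
```

Falling back instead of raising keeps schemas from non-Spark producers readable exactly as they are today.
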
h2. Out of scope

* Value semantics / rounding: values are carried verbatim at nanosecond 
resolution; no change to how {{TIME(p)}} values are truncated (that already 
happens upstream).
* PySpark Arrow/pandas conversion and Spark Connect proto/converters (separate 
sub-tasks), beyond what the shared {{ArrowUtils}} mapping provides.

h2. How tested

{{ArrowUtilsSuite}}: round-trip {{TIME(p)}} for {{p}} in {0, 3, 6, 9} preserves 
{{p}}; a {{Time(NANOSECOND)}} field with no precision metadata falls back to 
{{TIME(6)}}; the precision key does not leak into the reconstructed column 
{{Metadata}}.
h2. Does this introduce any user-facing change

Yes (minor): a {{TIME(p)}} column transferred over Arrow now retains its 
declared precision instead of always reading back as {{TIME(6)}}. No change to 
stored values.



