[
https://issues.apache.org/jira/browse/SPARK-57661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk reassigned SPARK-57661:
--------------------------------
Assignee: Max Gekk
> Preserve TIME precision in the Spark <-> Arrow type mapping
> -----------------------------------------------------------
>
> Key: SPARK-57661
> URL: https://issues.apache.org/jira/browse/SPARK-57661
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: pull-request-available
>
> h2. What
> Carry the {{TimeType(p)}} fractional-second precision {{p}} (in [0, 9])
> across the Spark <-> Arrow type mapping so that a {{TIME(p)}} column
> round-trips back to the same {{TIME(p)}}, instead of collapsing to the
> canonical {{TIME(6)}}.
> h2. Why
> {{ArrowUtils}} / the Types Framework currently map every {{TimeType(p)}} to
> {{ArrowType.Time(TimeUnit.NANOSECOND, 64)}} (no precision field), and
> {{TypeApiOps.fromArrowType}} maps {{ArrowType.Time(NANOSECOND, 64)}} back to
> a fixed {{TimeType(TimeType.MICROS_PRECISION)}} (= 6). As a result the
> declared precision is lost on any Arrow round-trip ({{TIME(0)}}, {{TIME(3)}},
> {{TIME(9)}}, ... all read back as {{TIME(6)}}), so Arrow-based schema
> transfer (Connect schema/results, createDataFrame from Arrow, mapInArrow,
> etc.) silently widens or narrows the type label. The stored value is already
> nanosecond-resolution and is unaffected; this is purely a type-fidelity gap.
> Arrow's {{Time}} logical type only encodes (unit, bitWidth) and has no
> fractional-precision field, so the precision cannot live in the {{ArrowType}}
> itself. It can, however, be carried in the Arrow {{Field}} metadata, the same
> channel Spark already uses to reconstruct parameterized logical types
> (Geometry/Geography recover {{srid}}; the nanosecond timestamp types carry
> their precision under {{SPARK::timestampNanos::precision}} per SPARK-57159).
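
To make the gap concrete, here is a minimal Scala model of a type-only mapping. The case classes and the `TypeOnlyMapping` object are hypothetical stand-ins written for this illustration; they are not the Spark or Arrow classes, and only mirror the fact that the Arrow time type carries (unit, bitWidth) and nothing else.

```scala
// Simplified stand-ins for illustration only; NOT the Spark or Arrow classes.
final case class TimeType(precision: Int)

// Mirrors the shape of Arrow's Time type: (unit, bitWidth), no precision field.
final case class ArrowTime(unit: String, bitWidth: Int)

object TypeOnlyMapping {
  val MicrosPrecision = 6

  // Every precision maps to the same Arrow type, so p is dropped here.
  def toArrowType(t: TimeType): ArrowTime = ArrowTime("NANOSECOND", 64)

  // With no precision available, the reverse mapping has to pick a fixed one.
  def fromArrowType(a: ArrowTime): TimeType = TimeType(MicrosPrecision)

  def roundTrip(t: TimeType): TimeType = fromArrowType(toArrowType(t))
}
```

In this model, `TypeOnlyMapping.roundTrip(TimeType(3))` yields `TimeType(6)`: the type label changes even though no value was touched.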
> h2. Scope
> * {{sql/api/.../util/ArrowUtils.scala}}: in {{toArrowField}}, tag
>   {{TimeType(p)}} fields with the precision metadata key
>   {{SPARK::time::precision}}, merged with the column metadata; in
>   {{fromArrowField}}, read that key to reconstruct {{TimeType(p)}}.
> * {{sql/api/.../types/ops/TimeTypeApiOps.scala}} and
>   {{TypeApiOps.fromArrowType}}: keep {{toArrowType}} producing
>   {{Time(NANOSECOND, 64)}}; keep the metadata-less {{fromArrowType}} as the
>   canonical fallback.
> * Reuse the precision-in-field-metadata pattern introduced for the nanosecond
>   timestamp types (SPARK-57159) for consistency.
> h2. Behavior on read-back
> * Metadata present with a precision in [0, 9]: reconstruct the exact
>   {{TimeType(p)}}.
> * Metadata absent (foreign Arrow data) or precision outside [0, 9]: fall back
>   to the current canonical {{TimeType(MICROS_PRECISION)}} (= 6), preserving
>   today's behavior for non-Spark producers.
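
The scope and read-back rules above can be sketched as follows. This is a minimal model under stated assumptions, not the actual `ArrowUtils` code: `ArrowField`, `TimePrecisionMetadata`, and the method signatures are hypothetical stand-ins, and plain Scala maps play the role of Arrow `Field` metadata. Only the key name `SPARK::time::precision`, the [0, 9] range, and the fallback to 6 come from the issue text. Treating an unparseable value like an out-of-range one is an assumption of this sketch.

```scala
import scala.util.Try

// Simplified stand-ins for illustration only; NOT the Spark or Arrow classes.
final case class TimeType(precision: Int)
final case class ArrowField(
    name: String,
    arrowType: String,             // e.g. "Time(NANOSECOND, 64)"
    metadata: Map[String, String]) // models Arrow Field metadata

object TimePrecisionMetadata {
  val PrecisionKey = "SPARK::time::precision" // key named in the issue
  val MicrosPrecision = 6                     // canonical fallback
  val TimeNanos = "Time(NANOSECOND, 64)"

  // Tag the field with the precision, merged with the column metadata.
  def toArrowField(
      name: String,
      t: TimeType,
      columnMetadata: Map[String, String]): ArrowField =
    ArrowField(name, TimeNanos, columnMetadata + (PrecisionKey -> t.precision.toString))

  // Rebuild TimeType(p) from the key; fall back to 6 when the key is absent,
  // out of [0, 9], or (assumption of this sketch) not an integer.
  // The key is stripped so it does not leak into the column metadata.
  def fromArrowField(f: ArrowField): (TimeType, Map[String, String]) = {
    val precision = f.metadata
      .get(PrecisionKey)
      .flatMap(s => Try(s.toInt).toOption)
      .filter(p => p >= 0 && p <= 9)
      .getOrElse(MicrosPrecision)
    (TimeType(precision), f.metadata - PrecisionKey)
  }
}
```

In this model a field written from `TimeType(3)` reads back as `TimeType(3)` with its column metadata unchanged, while a field with no precision key, as a non-Spark producer would emit, reads back as `TimeType(6)`.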
> h2. Out of scope
> * Value semantics / rounding: values are carried verbatim at nanosecond
>   resolution; no change to how {{TIME(p)}} values are truncated (that already
>   happens upstream).
> * PySpark Arrow/pandas conversion and Spark Connect proto/converters (separate
>   sub-tasks), beyond what the shared {{ArrowUtils}} mapping provides.
> h2. How tested
> {{ArrowUtilsSuite}}: round-trip {{TIME(p)}} for {{p}} in {0, 3, 6, 9}
> preserves {{p}}; a {{Time(NANOSECOND)}} field with no precision metadata
> falls back to {{TIME(6)}}; the precision key does not leak into the
> reconstructed column {{Metadata}}.
> h2. Does this introduce any user-facing change?
> Yes (minor): a {{TIME(p)}} column transferred over Arrow now retains its
> declared precision instead of always reading back as {{TIME(6)}}. No change
> to stored values.