Max Gekk created SPARK-57159:
--------------------------------
Summary: Add Arrow type mapping for nanosecond-capable timestamp types
Key: SPARK-57159
URL: https://issues.apache.org/jira/browse/SPARK-57159
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.3.0
Reporter: Max Gekk
h2. What
Wire {{TimestampNTZNanosType(p)}} and {{TimestampLTZNanosType(p)}} (p in [7, 9]) into the
Spark <-> Arrow type mapping and the InternalRow -> Arrow vector writer path, so that
nanosecond-capable timestamps can be carried over Arrow.
This is the shared Arrow prerequisite for Spark Connect (parent: SPARK-56822).
It also
benefits the classic Arrow paths ({{Dataset.collectAsArrowToPython}}-style
transfer,
{{createDataFrame}} from Arrow, {{mapInArrow}}, etc.).
h2. Why
Spark Connect transfers query results and local relations as Arrow IPC batches.
The
mapping between Spark types and Arrow types ({{ArrowUtils}}) and the
row-to-vector writers
({{ArrowWriter}}) currently have no support for the nanosecond timestamp types,
so any plan
whose schema contains them fails to serialize. Until this mapping exists, no
Connect path
(schema response, result batches, local relations) can handle the new types,
regardless of
the protocol/converter work.
The internal value is {{org.apache.spark.unsafe.types.TimestampNanosVal}}
({{epochMicros: Long}} + {{nanosWithinMicro: Short}} in [0, 999]). Arrow's
nanosecond
timestamp unit is the natural target.
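As a rough illustration of why the nanosecond unit fits, the two fields can be packed into a single signed 64-bit nanosecond count and unpacked again. This is a minimal sketch and not the actual implementation: {{NanosVal}} is a hypothetical stand-in for {{TimestampNanosVal}} that models only the two fields named above, and the method names are illustrative.

```scala
object TimestampNanosSketch {
  // Hypothetical stand-in for org.apache.spark.unsafe.types.TimestampNanosVal;
  // only the two fields named in this ticket are modelled.
  final case class NanosVal(epochMicros: Long, nanosWithinMicro: Short) {
    require(nanosWithinMicro >= 0 && nanosWithinMicro <= 999,
      "nanosWithinMicro must be in [0, 999]")
  }

  // Pack both fields into nanoseconds since the epoch, the quantity a
  // nanosecond-unit Arrow timestamp slot stores. The arithmetic is done in
  // BigInt so that values outside the Long range are detected, not wrapped.
  def toEpochNanos(v: NanosVal): Long = {
    val wide = BigInt(v.epochMicros) * BigInt(1000) + BigInt(v.nanosWithinMicro.toInt)
    if (!wide.isValidLong) {
      throw new ArithmeticException(s"$v does not fit into 64-bit epoch nanoseconds")
    }
    wide.toLong
  }

  // Unpack epoch nanoseconds. floorDiv/floorMod keep nanosWithinMicro in
  // [0, 999] for pre-epoch (negative) instants as well.
  def fromEpochNanos(nanos: Long): NanosVal =
    NanosVal(Math.floorDiv(nanos, 1000L), Math.floorMod(nanos, 1000L).toShort)
}
```

One consequence worth keeping in mind: a signed 64-bit nanosecond count covers a much narrower range than a 64-bit microsecond count, so in this sketch an {{epochMicros}} near {{Long.MinValue}} or {{Long.MaxValue}} is rejected. How the real writer should treat such values is not specified here.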
h2. Scope
* {{sql/api/.../util/ArrowUtils.scala}}: extend {{toArrowType}} /
{{fromArrowType}}
** {{TimestampNTZNanosType(p)}} -> {{ArrowType.Timestamp(TimeUnit.NANOSECOND,
null)}}
** {{TimestampLTZNanosType(p)}} -> {{ArrowType.Timestamp(TimeUnit.NANOSECOND,
sessionTimeZoneId)}}
** Reverse mapping for {{ArrowType.Timestamp(NANOSECOND, tz)}} -> NTZ/LTZ nanos
type
* {{sql/api/.../types/ops/TypeApiOps.scala}}: register the two types; add
{{TimestampNTZNanosTypeApiOps}} / {{TimestampLTZNanosTypeApiOps}} (mirroring
{{TimeTypeApiOps}}).
* {{sql/catalyst/.../types/ops/TypeOps.scala}}: register server-side ops; add
{{TimestampNTZNanosTypeOps}} / {{TimestampLTZNanosTypeOps}} (mirroring
{{TimeTypeOps}}).
* {{sql/catalyst/.../execution/arrow/ArrowWriter.scala}}: field writers that
read via
{{InternalRow.getTimestampNTZNanos}} / {{getTimestampLTZNanos}} and write into
{{TimeStampNanoVector}} / {{TimeStampNanoTZVector}}.
* Arrow column reader/accessor for reading those vectors back into
{{InternalRow}}.
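The type mapping listed for {{ArrowUtils}} can be sketched without Spark or Arrow on the classpath. Everything below is a simplified, hypothetical model: the case classes stand in for the Spark types and for {{ArrowType.Timestamp(TimeUnit.NANOSECOND, tz)}}, with {{None}} playing the role of the {{null}} time zone. The choice of precision 9 in the reverse direction is an assumption of this sketch, made because the Arrow type carries no precision parameter.

```scala
object ArrowMappingSketch {
  // Hypothetical models of the two Spark types; precision p is in [7, 9].
  sealed trait NanosType { def precision: Int }
  final case class TimestampNTZNanos(precision: Int) extends NanosType
  final case class TimestampLTZNanos(precision: Int) extends NanosType

  // Hypothetical model of a nanosecond-unit Arrow timestamp type.
  // timezone = None corresponds to a null Arrow time zone.
  final case class ArrowNanosTimestamp(timezone: Option[String])

  // Forward mapping: NTZ drops the time zone, LTZ carries the session zone.
  def toArrowType(dt: NanosType, sessionTimeZoneId: String): ArrowNanosTimestamp = {
    require(dt.precision >= 7 && dt.precision <= 9, "precision must be in [7, 9]")
    dt match {
      case _: TimestampNTZNanos => ArrowNanosTimestamp(None)
      case _: TimestampLTZNanos => ArrowNanosTimestamp(Some(sessionTimeZoneId))
    }
  }

  // Reverse mapping: the presence of a time zone selects LTZ vs NTZ.
  // Assumption: precision 9 is used because Arrow does not record p.
  def fromArrowType(at: ArrowNanosTimestamp): NanosType = at.timezone match {
    case None    => TimestampNTZNanos(9)
    case Some(_) => TimestampLTZNanos(9)
  }
}
```

Under this model a type with p = 7 or p = 8 does not round-trip to itself, so the real reverse mapping needs some rule for recovering or defaulting the precision.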
h2. Out of scope
* Spark Connect proto definitions and proto<->Catalyst converters (separate
sub-task).
* PySpark Arrow/pandas conversion (separate sub-task).
* {{ColumnVector}} / vectorized Parquet reader (tracked in SPARK-57100).
* Any rounding/truncation semantics for precision < 9; this ticket preserves
the full
{{TimestampNanosVal}} value (nanosecond Arrow unit).
h2. Design notes
* Follow the {{TimeType}} "Types Framework" pattern end to end (Ops classes +
registration) rather than scattering one-off match cases; coordinate
registration with
SPARK-57101 so the types are not registered twice.
* LTZ and NTZ share the physical value but must map to distinct Arrow timezone
settings
(LTZ uses the session time zone; NTZ uses {{null}}).
* Gate behind {{spark.sql.timestampNanosTypes.enabled}} consistent with the
rest of the
feature.
h2. How should the changes be tested
* {{ArrowUtilsSuite}}: round-trip {{toArrowType}}/{{fromArrowType}} for NTZ and
LTZ nanos
types (including session time zone for LTZ).
* {{ArrowWriter}} / {{ArrowConverters}} round-trip: write rows containing nanos
timestamps
to Arrow and read them back, asserting {{epochMicros}} and {{nanosWithinMicro}}
are
preserved, including boundary values (Long.MinValue/MaxValue micros,
nanosWithinMicro 0 and
999) and pre-epoch instants.
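A boundary-oriented round-trip check along these lines might look as follows. It is only a sketch of the value arithmetic, not of the real suites: a plain {{(Long, Short)}} tuple stands in for {{TimestampNanosVal}}, the names are hypothetical, and no Arrow vectors are involved.

```scala
object NanosBoundarySketch {
  // (epochMicros, nanosWithinMicro); a tuple stands in for TimestampNanosVal.
  type Parts = (Long, Short)

  // Split epoch nanoseconds with floor semantics so that pre-epoch instants
  // still yield nanosWithinMicro in [0, 999].
  def split(epochNanos: Long): Parts =
    (Math.floorDiv(epochNanos, 1000L), Math.floorMod(epochNanos, 1000L).toShort)

  // Rejoin the parts; None means the pair does not fit into a signed
  // 64-bit nanosecond count.
  def join(p: Parts): Option[Long] = {
    val wide = BigInt(p._1) * BigInt(1000) + BigInt(p._2.toInt)
    if (wide.isValidLong) Some(wide.toLong) else None
  }

  // Boundary instants in epoch nanoseconds: both ends of the 64-bit range,
  // values around microsecond edges, and pre-epoch values.
  val boundaryNanos: Seq[Long] =
    Seq(Long.MinValue, -1001L, -1000L, -1L, 0L, 1L, 999L, 1000L, Long.MaxValue)

  def roundTrips(epochNanos: Long): Boolean =
    join(split(epochNanos)).contains(epochNanos)
}
```

In this model, {{join((Long.MaxValue, 0))}} and {{join((Long.MinValue, 999))}} both return {{None}}, which suggests the extreme microsecond boundaries need an explicitly defined expectation in the real tests.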
h2. Does this introduce any user-facing change
No. The types remain gated behind {{spark.sql.timestampNanosTypes.enabled}}.