Max Gekk created SPARK-57159:
--------------------------------

             Summary: Add Arrow type mapping for nanosecond-capable timestamp types
                 Key: SPARK-57159
                 URL: https://issues.apache.org/jira/browse/SPARK-57159
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.3.0
            Reporter: Max Gekk


h2. What

Wire {{TimestampNTZNanosType(p)}} and {{TimestampLTZNanosType(p)}} (p in [7, 9]) into the
Spark <-> Arrow type mapping and the InternalRow -> Arrow vector writer path, so that
nanosecond-capable timestamps can be carried over Arrow.

This is the shared Arrow prerequisite for Spark Connect (parent: SPARK-56822). It also
benefits the classic Arrow paths ({{Dataset.collectAsArrowToPython}}-style transfer,
{{createDataFrame}} from Arrow, {{mapInArrow}}, etc.).

h2. Why

Spark Connect transfers query results and local relations as Arrow IPC batches. The
mapping between Spark types and Arrow types ({{ArrowUtils}}) and the row-to-vector writers
({{ArrowWriter}}) currently have no support for the nanosecond timestamp types, so any plan
whose schema contains them fails to serialize. Until this mapping exists, no Connect path
(schema response, result batches, local relations) can handle the new types, regardless of
the protocol/converter work.

The internal value is {{org.apache.spark.unsafe.types.TimestampNanosVal}}
({{epochMicros: Long}} + {{nanosWithinMicro: Short}} in [0, 999]). Arrow's nanosecond
timestamp unit is the natural target.
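
To make the target representation concrete: an Arrow nanosecond timestamp is a single signed 64-bit count of nanoseconds since the epoch, so the two-field value has to be flattened on write and split again on read. The following is a minimal sketch of that arithmetic, not the ticket's implementation. {{NanosVal}} is a hypothetical stand-in for {{TimestampNanosVal}}, and the helper names are illustrative only.

```scala
object NanosArrowSketch {
  // Hypothetical stand-in for TimestampNanosVal; nanosWithinMicro is expected in [0, 999].
  final case class NanosVal(epochMicros: Long, nanosWithinMicro: Short)

  // Flatten to nanoseconds since the epoch.
  // Throws ArithmeticException if the intermediate product or the sum overflows a Long.
  def toEpochNanos(v: NanosVal): Long =
    java.lang.Math.addExact(
      java.lang.Math.multiplyExact(v.epochMicros, 1000L),
      v.nanosWithinMicro.toLong)

  // Split nanoseconds since the epoch back into the two fields.
  // Floor division keeps the remainder in [0, 999] for negative (pre-epoch) inputs.
  def fromEpochNanos(nanos: Long): NanosVal =
    NanosVal(
      java.lang.Math.floorDiv(nanos, 1000L),
      java.lang.Math.floorMod(nanos, 1000L).toShort)
}
```

With floor semantics, the instant one nanosecond before the epoch splits into {{epochMicros = -1}} and {{nanosWithinMicro = 999}}, which keeps the second field inside its documented range.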

h2. Scope

* {{sql/api/.../util/ArrowUtils.scala}}: extend {{toArrowType}} / {{fromArrowType}}
** {{TimestampNTZNanosType(p)}} -> {{ArrowType.Timestamp(TimeUnit.NANOSECOND, null)}}
** {{TimestampLTZNanosType(p)}} -> {{ArrowType.Timestamp(TimeUnit.NANOSECOND, sessionTimeZoneId)}}
** Reverse mapping for {{ArrowType.Timestamp(NANOSECOND, tz)}} -> NTZ/LTZ nanos type
* {{sql/api/.../types/ops/TypeApiOps.scala}}: register the two types; add
{{TimestampNTZNanosTypeApiOps}} / {{TimestampLTZNanosTypeApiOps}} (mirroring
{{TimeTypeApiOps}}).
* {{sql/catalyst/.../types/ops/TypeOps.scala}}: register server-side ops; add
{{TimestampNTZNanosTypeOps}} / {{TimestampLTZNanosTypeOps}} (mirroring {{TimeTypeOps}}).
* {{sql/catalyst/.../execution/arrow/ArrowWriter.scala}}: field writers that read via
{{InternalRow.getTimestampNTZNanos}} / {{getTimestampLTZNanos}} and write into
{{TimeStampNanoVector}} / {{TimeStampNanoTZVector}}.
* Arrow column reader/accessor for reading those vectors back into {{InternalRow}}.
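
The type mapping in the first bullet can be sketched as a pair of pattern matches. This is an illustration under stated assumptions, not the actual {{ArrowUtils}} code: the case classes below are hypothetical stand-ins for the Spark and Arrow types so the snippet runs without either library. One assumption deserves attention: an Arrow timestamp type carries only a unit and an optional time zone, not a precision, so the reverse mapping has to pick a value for p. The sketch assumes 9; the ticket does not say which precision the real reverse mapping uses.

```scala
object ArrowMappingSketch {
  // Hypothetical stand-ins for the Spark nanosecond timestamp types.
  sealed trait SparkType
  final case class TimestampNTZNanos(p: Int) extends SparkType
  final case class TimestampLTZNanos(p: Int) extends SparkType

  // Hypothetical stand-ins for Arrow's TimeUnit and ArrowType.Timestamp.
  sealed trait TimeUnit
  case object Microsecond extends TimeUnit
  case object Nanosecond extends TimeUnit
  final case class ArrowTimestamp(unit: TimeUnit, timezone: Option[String])

  // Forward mapping: NTZ carries no time zone, LTZ carries the session time zone.
  def toArrow(t: SparkType, sessionTimeZoneId: String): ArrowTimestamp = t match {
    case TimestampNTZNanos(_) => ArrowTimestamp(Nanosecond, None)
    case TimestampLTZNanos(_) => ArrowTimestamp(Nanosecond, Some(sessionTimeZoneId))
  }

  // Reverse mapping: the presence of a time zone selects LTZ vs NTZ.
  // defaultPrecision = 9 is an assumption of this sketch, since Arrow stores no precision.
  def fromArrow(a: ArrowTimestamp, defaultPrecision: Int = 9): Option[SparkType] = a match {
    case ArrowTimestamp(Nanosecond, None)    => Some(TimestampNTZNanos(defaultPrecision))
    case ArrowTimestamp(Nanosecond, Some(_)) => Some(TimestampLTZNanos(defaultPrecision))
    case _                                   => None // other units are outside this sketch
  }
}
```

Under that assumption, a type with p = 7 or p = 8 does not round-trip to the same precision through the type mapping alone, which is something a round-trip test of the mapping would need to account for.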

h2. Out of scope

* Spark Connect proto definitions and proto<->Catalyst converters (separate sub-task).
* PySpark Arrow/pandas conversion (separate sub-task).
* {{ColumnVector}} / vectorized Parquet reader (tracked in SPARK-57100).
* Any rounding/truncation semantics for precision < 9; this ticket preserves the full
{{TimestampNanosVal}} value (nanosecond Arrow unit).

h2. Design notes

* Follow the {{TimeType}} "Types Framework" pattern end to end (Ops classes +
registration) rather than scattering one-off match cases; coordinate registration with
SPARK-57101 so the types are not registered twice.
* LTZ and NTZ share the physical value but must map to distinct Arrow timezone settings
(LTZ uses the session time zone; NTZ uses {{null}}).
* Gate behind {{spark.sql.timestampNanosTypes.enabled}} consistent with the rest of the
feature.

h2. How should the changes be tested

* {{ArrowUtilsSuite}}: round-trip {{toArrowType}}/{{fromArrowType}} for NTZ and LTZ nanos
types (including session time zone for LTZ).
* {{ArrowWriter}} / {{ArrowConverters}} round-trip: write rows containing nanos timestamps
to Arrow and read them back, asserting {{epochMicros}} and {{nanosWithinMicro}} are
preserved, including boundary values (Long.MinValue/MaxValue micros, nanosWithinMicro 0 and
999) and pre-epoch instants.
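
When choosing boundary values, it may help to check which ones are representable in Arrow's nanosecond unit at all, since that unit is a signed 64-bit nanosecond count. The sketch below is an assumption-laden illustration rather than part of the planned suites: {{fitsInInt64Nanos}} is a hypothetical helper, and it uses {{BigInt}} so the check itself cannot overflow.

```scala
object NanosBoundarySketch {
  // True if epochMicros * 1000 + nanosWithinMicro fits in a signed 64-bit integer.
  // Computed in BigInt so that the check is exact at the extremes.
  def fitsInInt64Nanos(epochMicros: Long, nanosWithinMicro: Short): Boolean = {
    val nanos = BigInt(epochMicros) * 1000 + BigInt(nanosWithinMicro.toInt)
    nanos.isValidLong
  }

  // Candidate boundary inputs (epochMicros, nanosWithinMicro), each paired with the check result.
  val cases: Seq[(Long, Short, Boolean)] = Seq(
    (0L, 0.toShort),
    (0L, 999.toShort),
    (-1L, 999.toShort),           // one nanosecond before the epoch
    (Long.MaxValue, 0.toShort),
    (Long.MinValue, 999.toShort)
  ).map { case (micros, nanos) => (micros, nanos, fitsInInt64Nanos(micros, nanos)) }
}
```

By this check, {{Long.MaxValue}} and {{Long.MinValue}} micros do not fit in a 64-bit nanosecond count, so a round-trip test at those extremes would presumably assert whatever overflow behavior the implementation defines. That behavior is not stated in this ticket.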

h2. Does this introduce any user-facing change

No. The types remain gated behind {{spark.sql.timestampNanosTypes.enabled}}.



