[ https://issues.apache.org/jira/browse/SPARK-57159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-57159:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add Arrow type mapping for nanosecond-capable timestamp types
> -------------------------------------------------------------
>
>                 Key: SPARK-57159
>                 URL: https://issues.apache.org/jira/browse/SPARK-57159
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>
> h2. What
> Wire {{TimestampNTZNanosType(p)}} and {{TimestampLTZNanosType(p)}} (p in [7, 9]) into the
> Spark <-> Arrow type mapping and the InternalRow -> Arrow vector writer path, so that
> nanosecond-capable timestamps can be carried over Arrow.
> This is the shared Arrow prerequisite for Spark Connect (parent: SPARK-56822). It also
> benefits the classic Arrow paths ({{Dataset.collectAsArrowToPython}}-style transfer,
> {{createDataFrame}} from Arrow, {{mapInArrow}}, etc.).
> h2. Why
> Spark Connect transfers query results and local relations as Arrow IPC batches. The
> mapping between Spark types and Arrow types ({{ArrowUtils}}) and the row-to-vector writers
> ({{ArrowWriter}}) currently have no support for the nanosecond timestamp types, so any plan
> whose schema contains them fails to serialize. Until this mapping exists, no Connect path
> (schema response, result batches, local relations) can handle the new types, regardless of
> the protocol/converter work.
> The internal value is {{org.apache.spark.unsafe.types.TimestampNanosVal}}
> ({{epochMicros: Long}} + {{nanosWithinMicro: Short}} in [0, 999]). Arrow's nanosecond
> timestamp unit is the natural target.
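
To make the value layout above concrete, here is a minimal sketch of how the two fields could be folded into a single nanosecond count and split back out. It assumes the Arrow side holds one signed 64-bit count of nanoseconds since the epoch; the function names are hypothetical and are not taken from Spark or Arrow.

```python
from typing import Tuple

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_MICRO = 1_000


def to_epoch_nanos(epoch_micros: int, nanos_within_micro: int) -> int:
    """Fold (epochMicros, nanosWithinMicro) into one nanosecond count (hypothetical helper)."""
    if not 0 <= nanos_within_micro <= 999:
        raise ValueError("nanos_within_micro must be in [0, 999]")
    nanos = epoch_micros * NANOS_PER_MICRO + nanos_within_micro
    # Assumption: the target is a signed 64-bit integer, so reject anything wider.
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("value does not fit in a 64-bit nanosecond count")
    return nanos


def from_epoch_nanos(epoch_nanos: int) -> Tuple[int, int]:
    """Split a nanosecond count back into (epochMicros, nanosWithinMicro)."""
    # divmod floors toward negative infinity, so the remainder stays in
    # [0, 999] for pre-epoch (negative) instants as well.
    return divmod(epoch_nanos, NANOS_PER_MICRO)
```

Under these assumptions, the instant one nanosecond before the epoch is `(-1, 999)` and maps to `-1`. Microsecond values close to the limits of a 64-bit integer do not fit once multiplied by 1000, which is why the sketch raises instead of wrapping silently.
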
> h2. Scope
> * {{sql/api/.../util/ArrowUtils.scala}}: extend {{toArrowType}} / {{fromArrowType}}
> ** {{TimestampNTZNanosType(p)}} -> {{ArrowType.Timestamp(TimeUnit.NANOSECOND, null)}}
> ** {{TimestampLTZNanosType(p)}} -> {{ArrowType.Timestamp(TimeUnit.NANOSECOND, sessionTimeZoneId)}}
> ** Reverse mapping for {{ArrowType.Timestamp(NANOSECOND, tz)}} -> NTZ/LTZ nanos type
> * {{sql/api/.../types/ops/TypeApiOps.scala}}: register the two types; add
> {{TimestampNTZNanosTypeApiOps}} / {{TimestampLTZNanosTypeApiOps}} (mirroring
> {{TimeTypeApiOps}}).
> * {{sql/catalyst/.../types/ops/TypeOps.scala}}: register server-side ops; add
> {{TimestampNTZNanosTypeOps}} / {{TimestampLTZNanosTypeOps}} (mirroring {{TimeTypeOps}}).
> * {{sql/catalyst/.../execution/arrow/ArrowWriter.scala}}: field writers that read via
> {{InternalRow.getTimestampNTZNanos}} / {{getTimestampLTZNanos}} and write into
> {{TimeStampNanoVector}} / {{TimeStampNanoTZVector}}.
> * Arrow column reader/accessor for reading those vectors back into {{InternalRow}}.
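
The forward and reverse type mapping in the first bullet can be modelled in a few lines. This is only an illustrative sketch: the classes below are hypothetical stand-ins, not the Spark or Arrow classes, and the precision used on the reverse path is an assumption, since an Arrow timestamp type carries only a unit and a time zone and the ticket does not say how p is recovered.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class TimestampNTZNanos:  # stand-in for TimestampNTZNanosType(p)
    precision: int


@dataclass(frozen=True)
class TimestampLTZNanos:  # stand-in for TimestampLTZNanosType(p)
    precision: int


@dataclass(frozen=True)
class ArrowTimestamp:  # stand-in for ArrowType.Timestamp(unit, timezone)
    unit: str
    timezone: Optional[str]


NanosType = Union[TimestampNTZNanos, TimestampLTZNanos]


def to_arrow_type(dt: NanosType, session_tz: str) -> ArrowTimestamp:
    """NTZ maps to a timestamp without a zone, LTZ to one with the session zone."""
    if isinstance(dt, TimestampNTZNanos):
        return ArrowTimestamp("NANOSECOND", None)
    if isinstance(dt, TimestampLTZNanos):
        return ArrowTimestamp("NANOSECOND", session_tz)
    raise TypeError(f"unsupported type: {dt!r}")


def from_arrow_type(at: ArrowTimestamp, precision: int = 9) -> NanosType:
    """Reverse mapping; the default precision of 9 is an assumption of this sketch."""
    if at.unit != "NANOSECOND":
        raise ValueError(f"unsupported unit: {at.unit}")
    if at.timezone is None:
        return TimestampNTZNanos(precision)
    return TimestampLTZNanos(precision)
```

In this model the presence or absence of a time zone is the only thing that distinguishes LTZ from NTZ on the Arrow side, so a type with p = 7 or p = 8 round-trips only if the caller supplies the precision again.
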
> h2. Out of scope
> * Spark Connect proto definitions and proto<->Catalyst converters (separate sub-task).
> * PySpark Arrow/pandas conversion (separate sub-task).
> * {{ColumnVector}} / vectorized Parquet reader (tracked in SPARK-57100).
> * Any rounding/truncation semantics for precision < 9; this ticket preserves the full
> {{TimestampNanosVal}} value (nanosecond Arrow unit).
> h2. Design notes
> * Follow the {{TimeType}} "Types Framework" pattern end to end (Ops classes +
> registration) rather than scattering one-off match cases; coordinate registration with
> SPARK-57101 so the types are not registered twice.
> * LTZ and NTZ share the physical value but must map to distinct Arrow timezone settings
> (LTZ uses the session time zone; NTZ uses {{null}}).
> * Gate behind {{spark.sql.timestampNanosTypes.enabled}}, consistent with the rest of the
> feature.
> h2. How should the changes be tested
> * {{ArrowUtilsSuite}}: round-trip {{toArrowType}}/{{fromArrowType}} for NTZ and LTZ nanos
> types (including session time zone for LTZ).
> * {{ArrowWriter}} / {{ArrowConverters}} round-trip: write rows containing nanos timestamps
> to Arrow and read them back, asserting {{epochMicros}} and {{nanosWithinMicro}} are
> preserved, including boundary values (Long.MinValue/MaxValue micros, nanosWithinMicro 0 and
> 999) and pre-epoch instants.
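
The shape of the write-and-read-back test can be sketched without Spark or Arrow. The sketch below uses a signed 64-bit `array` from the Python standard library as a stand-in for a nanosecond timestamp vector; `write_column` and `read_column` are hypothetical names, and the single-int64 encoding is an assumption of this sketch rather than something the ticket specifies.

```python
from array import array
from typing import Iterable, List, Tuple

Row = Tuple[int, int]  # (epochMicros, nanosWithinMicro)


def write_column(rows: Iterable[Row]) -> array:
    """Encode rows as signed 64-bit nanosecond counts (stand-in for a vector writer)."""
    # array("q") raises OverflowError if a value does not fit in 64 bits.
    return array("q", (micros * 1000 + nanos for micros, nanos in rows))


def read_column(buf: array) -> List[Row]:
    """Decode the buffer back into (epochMicros, nanosWithinMicro) pairs."""
    # Floor division keeps nanosWithinMicro in [0, 999] for pre-epoch values.
    return [divmod(value, 1000) for value in buf]


# Boundary rows: both ends of nanosWithinMicro, pre-epoch instants, and the
# smallest and largest instants that fit in a 64-bit nanosecond count.
BOUNDARY_ROWS: List[Row] = [
    (0, 0),
    (0, 999),
    (-1, 999),            # one nanosecond before the epoch
    (-1_000_000, 0),      # one second before the epoch
    divmod(-(2**63), 1000),
    divmod(2**63 - 1, 1000),
]

assert read_column(write_column(BOUNDARY_ROWS)) == BOUNDARY_ROWS
```

With this encoding, an `epochMicros` of `Long.MaxValue` or `Long.MinValue` cannot be written at all, so a real test would need to pin down what the writer is expected to do with such values.
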
> h2. Does this introduce any user-facing change
> No. The types remain gated behind {{spark.sql.timestampNanosTypes.enabled}}.


