[jira] [Commented] (SPARK-57160) Add Spark Connect protocol support for nanosecond-capable timestamp types and literals

Max Gekk (Jira) Thu, 25 Jun 2026 00:06:09 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-57160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091424#comment-18091424
 ]


Max Gekk commented on SPARK-57160:
----------------------------------

Follow-up from the PR review for SPARK-57159 
(https://github.com/apache/spark/pull/56739): once the Spark Connect / 
DataFrame wiring for the nanosecond timestamp types lands, add an end-to-end 
Arrow batch test for TimestampNTZNanosType / TimestampLTZNanosType covering 
Dataset.toArrowBatchRdd and the DataFrame-level ArrowConverters path 
(ArrowConvertersSuite).

Already covered under SPARK-57159: the low-level Spark<->Arrow type mapping 
(ArrowUtilsSuite), the row<->vector writers and ArrowColumnVector accessors 
with an end-to-end ArrowConverters IPC batch round-trip at the InternalRow 
level (ArrowConvertersSuite, ArrowWriterSuite), and the shared 
DateTimeUtils.timestampNanosToEpochNanos helper (DateTimeUtilsSuite). This note 
tracks the remaining DataFrame/Dataset-level e2e coverage that depends on the 
Connect result-transfer path.
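
For readers who have not seen the helper named above, the conversion it 
presumably performs can be sketched in Python. This is a minimal illustration 
only: the name to_epoch_nanos, the range check, and the overflow behaviour are 
assumptions, and the real signature of 
DateTimeUtils.timestampNanosToEpochNanos is not shown in this thread.

```python
# Hypothetical sketch; not the Spark implementation.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def to_epoch_nanos(epoch_micros: int, nanos_within_micro: int) -> int:
    """Collapse (epochMicros, nanosWithinMicro) into one epoch-nanosecond count."""
    if not 0 <= nanos_within_micro <= 999:
        raise ValueError("nanos_within_micro must be in [0, 999]")
    nanos = epoch_micros * 1000 + nanos_within_micro
    # Assumed behaviour: reject values that do not fit a signed 64-bit integer.
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("value does not fit in int64 nanoseconds")
    return nanos


print(to_epoch_nanos(1, 5))     # one microsecond plus five nanoseconds
print(to_epoch_nanos(-1, 999))  # one nanosecond before the epoch
```

A signed 64-bit nanosecond count spans only about 292 years on either side of 
the epoch, so a round-trip test would likely want to cover values near both 
ends of that range.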

> Add Spark Connect protocol support for nanosecond-capable timestamp types and 
> literals
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-57160
>                 URL: https://issues.apache.org/jira/browse/SPARK-57160
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> h2. What
> Extend the Spark Connect protobuf protocol to represent
> {{TimestampNTZNanosType(p)}} and {{TimestampLTZNanosType(p)}} (p in [7, 9])
> both as data types (in schemas, casts, UDF I/O, etc.) and as literal values,
> and regenerate the language stubs.
> This is the protocol foundation for Spark Connect support of nanosecond
> timestamps (parent: SPARK-56822). The Scala/Python converters that consume
> these proto messages are separate sub-tasks and depend on this one.
> h2. Why
> Today the Connect {{DataType}} message has only microsecond timestamp kinds
> ({{timestamp}}, {{timestamp_ntz}}) with no precision field, and the
> {{Expression.Literal}} message encodes timestamp literals as a single
> {{int64}} of microseconds. There is no way to express a nanosecond-capable
> timestamp type or a sub-microsecond literal over the wire, so no Connect
> client/server path can carry the new types. The protocol must be extended
> before any converter, Arrow, or client work can proceed.
> The Catalyst physical value is
> {{org.apache.spark.unsafe.types.TimestampNanosVal}} ({{epochMicros: Long}} +
> {{nanosWithinMicro: Short}} in [0, 999]), so a literal cannot be represented
> by a single {{int64}}; the proto literal must carry both components plus the
> declared precision.
> h2. Scope
> {{sql/connect/common/src/main/protobuf/spark/connect/types.proto}}
> * Add two nested messages and {{oneof kind}} arms (next free field numbers, 
> e.g. 29/30):
> {code}
> message TimestampNTZNanos {
>   optional int32 precision = 1;          // 7..9
>   uint32 type_variation_reference = 2;
> }
> message TimestampLTZNanos {
>   optional int32 precision = 1;          // 7..9
>   uint32 type_variation_reference = 2;
> }
> // in oneof kind:
> TimestampNTZNanos timestamp_ntz_nanos = 29;
> TimestampLTZNanos timestamp_ltz_nanos = 30;
> {code}
> * NTZ and LTZ are distinct kinds even though the physical value is identical,
> mirroring {{timestamp}} vs {{timestamp_ntz}} and the parameterized {{Time}}
> message.
> {{sql/connect/common/src/main/protobuf/spark/connect/expressions.proto}}
> * Add literal support carrying the full physical value plus precision. Use
> separate NTZ and LTZ arms so the literal kind is self-describing (do NOT
> reuse the single-{{int64}} {{timestamp}}/{{timestamp_ntz}} fields):
> {code}
> message TimestampNTZNanos {
>   int64 epoch_micros = 1;
>   int32 nanos_within_micro = 2;          // 0..999
>   optional int32 precision = 3;          // 7..9
> }
> message TimestampLTZNanos {
>   int64 epoch_micros = 1;
>   int32 nanos_within_micro = 2;          // 0..999
>   optional int32 precision = 3;          // 7..9
> }
> // new oneof literal_type arms (after time = 26; note 27/28 reserved for
> // geospatial)
> {code}
> * Keep the existing {{DataType data_type}} root field usable for
> disambiguation where the literal kind alone is insufficient.
> Stub regeneration
> * Run {{./dev/connect-gen-protos.sh}} and commit the regenerated Python stubs
> under {{python/pyspark/sql/connect/proto/}} ({{types_pb2.py}},
> {{types_pb2.pyi}}, {{expressions_pb2.py}}, {{expressions_pb2.pyi}}).
> * Java/Scala {{org.apache.spark.connect.proto.*}} classes are generated at
> {{connect-common}} compile time; no checked-in change.
> * Ensure {{dev/check-protos.py}} passes (Python stubs in sync).
> * Follow {{sql/connect/common/src/main/protobuf/README.md}}.
> h2. Out of scope
> * proto <-> Catalyst conversion in {{DataTypeProtoConverter}} /
> {{LiteralValueProtoConverter}} and the {{ConnectTypeOps}} registration
> (separate sub-task).
> * PySpark Connect Python-side conversion ({{connect/types.py}},
> {{connect/expressions.py}}) beyond committing the regenerated stubs
> (separate sub-task).
> * Arrow IPC type mapping (separate sub-task).
> h2. Design notes
> * Field numbers must be appended (never reuse/renumber existing fields) to
> preserve wire compatibility.
> * Final field numbers and the exact literal message shape should be confirmed
> against the {{Time}} precedent during review.
> * No behavior change until the consuming converters land; this PR only adds
> proto surface.
> h2. How should this be tested?
> * {{build/sbt connect-common/compile}} succeeds (Java/Scala proto classes
> generate).
> * {{./dev/connect-gen-protos.sh}} produces the committed Python stubs;
> {{dev/check-protos.py}} reports no drift.
> * No functional tests in this PR (no consumers yet); covered by the converter
> and end-to-end sub-tasks.
> h2. Does this introduce any user-facing change?
> No. This only adds protobuf message definitions; the new types remain gated
> behind {{spark.sql.timestampNanosTypes.enabled}} once the consuming paths are
> implemented.
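
The two-component literal described in the quoted issue can be illustrated 
with a short Python sketch. It is not generated protobuf code: the 
TimestampNanosLiteral dataclass and the from_epoch_nanos function are 
hypothetical names that only mirror the proposed fields (epoch_micros, 
nanos_within_micro, precision) and their documented ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimestampNanosLiteral:
    """Hypothetical stand-in mirroring the proposed proto literal fields."""
    epoch_micros: int        # whole microseconds since the epoch
    nanos_within_micro: int  # remaining nanoseconds, 0..999
    precision: int           # declared fractional-second digits, 7..9


def from_epoch_nanos(epoch_nanos: int, precision: int = 9) -> TimestampNanosLiteral:
    if not 7 <= precision <= 9:
        raise ValueError("precision must be in [7, 9]")
    # divmod uses floor division, so the remainder stays in [0, 999]
    # even for instants before the epoch.
    epoch_micros, nanos_within_micro = divmod(epoch_nanos, 1000)
    return TimestampNanosLiteral(epoch_micros, nanos_within_micro, precision)


print(from_epoch_nanos(1_500))
print(from_epoch_nanos(-1, precision=7))
```

Floor division matters for pre-epoch values: one nanosecond before the epoch 
splits into epoch_micros = -1 and nanos_within_micro = 999, which keeps the 
second component inside the [0, 999] range stated in the issue.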



