Max Gekk created SPARK-57160:
--------------------------------

             Summary: Add Spark Connect protocol support for nanosecond-capable 
timestamp types and literals
                 Key: SPARK-57160
                 URL: https://issues.apache.org/jira/browse/SPARK-57160
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 4.3.0
            Reporter: Max Gekk


h2. What

Extend the Spark Connect protobuf protocol to represent 
{{TimestampNTZNanosType(p)}} and
{{TimestampLTZNanosType(p)}} (p in [7, 9]) both as data types (in schemas, 
casts, UDF I/O,
etc.) and as literal values, and regenerate the language stubs.

This is the protocol foundation for Spark Connect support of nanosecond 
timestamps
(parent: SPARK-56822). The Scala/Python converters that consume these proto 
messages are
separate sub-tasks and depend on this one.

h2. Why

Today the Connect {{DataType}} message has only microsecond timestamp kinds
({{timestamp}}, {{timestamp_ntz}}) with no precision field, and the 
{{Expression.Literal}}
message encodes timestamp literals as a single {{int64}} of microseconds. There 
is no way
to express a nanosecond-capable timestamp type or a sub-microsecond literal 
over the wire,
so no Connect client/server path can carry the new types. The protocol must be 
extended
before any converter, Arrow, or client work can proceed.

The Catalyst physical value is 
{{org.apache.spark.unsafe.types.TimestampNanosVal}}
({{epochMicros: Long}} + {{nanosWithinMicro: Short}} in [0, 999]), so a literal 
cannot be
represented by a single {{int64}}; the proto literal must carry both components 
plus the
declared precision.

h2. Scope

{{sql/connect/common/src/main/protobuf/spark/connect/types.proto}}
* Add two nested messages and {{oneof kind}} arms (next free field numbers, 
e.g. 29/30):
{code}
message TimestampNTZNanos {
  optional int32 precision = 1;          // 7..9
  uint32 type_variation_reference = 2;
}
message TimestampLTZNanos {
  optional int32 precision = 1;          // 7..9
  uint32 type_variation_reference = 2;
}
// in oneof kind:
TimestampNTZNanos timestamp_ntz_nanos = 29;
TimestampLTZNanos timestamp_ltz_nanos = 30;
{code}
* NTZ and LTZ are distinct kinds even though the physical value is identical, 
mirroring
{{timestamp}} vs {{timestamp_ntz}} and the parameterized {{Time}} message.

{{sql/connect/common/src/main/protobuf/spark/connect/expressions.proto}}
* Add literal support carrying the full physical value plus precision. Use 
separate NTZ and
LTZ arms so the literal kind is self-describing (do NOT reuse the 
single-{{int64}}
{{timestamp}}/{{timestamp_ntz}} fields):
{code}
message TimestampNTZNanos {
  int64 epoch_micros = 1;
  int32 nanos_within_micro = 2;          // 0..999
  optional int32 precision = 3;          // 7..9
}
message TimestampLTZNanos {
  int64 epoch_micros = 1;
  int32 nanos_within_micro = 2;          // 0..999
  optional int32 precision = 3;          // 7..9
}
// new oneof literal_type arms (after time = 26; note 27/28 reserved for 
geospatial)
{code}
* Keep the existing {{DataType data_type}} root field usable for disambiguation 
where the
literal kind alone is insufficient.

Stub regeneration
* Run {{./dev/connect-gen-protos.sh}} and commit the regenerated Python stubs 
under
{{python/pyspark/sql/connect/proto/}} ({{types_pb2.py}}, {{types_pb2.pyi}},
{{expressions_pb2.py}}, {{expressions_pb2.pyi}}).
* Java/Scala {{org.apache.spark.connect.proto.*}} classes are generated at
{{connect-common}} compile time; no checked-in change.
* Ensure {{dev/check-protos.py}} passes (Python stubs in sync).
* Follow {{sql/connect/common/src/main/protobuf/README.md}}.

h2. Out of scope

* proto <-> Catalyst conversion in {{DataTypeProtoConverter}} / 
{{LiteralValueProtoConverter}}
and the {{ConnectTypeOps}} registration (separate sub-task).
* PySpark Connect Python-side conversion ({{connect/types.py}}, 
{{connect/expressions.py}})
beyond committing the regenerated stubs (separate sub-task).
* Arrow IPC type mapping (separate sub-task).

h2. Design notes

* Field numbers must be appended (never reuse/renumber existing fields) to 
preserve wire
compatibility.
* Final field numbers and the exact literal message shape should be confirmed 
against the
{{Time}} precedent during review.
* No behavior change until the consuming converters land; this PR only adds 
proto surface.

h2. How should this tested

* {{build/sbt connect-common/compile}} succeeds (Java/Scala proto classes 
generate).
* {{./dev/connect-gen-protos.sh}} produces the committed Python stubs; 
{{dev/check-protos.py}}
reports no drift.
* No functional tests in this PR (no consumers yet); covered by the converter 
and
end-to-end sub-tasks.

h2. Does this introduce any user-facing change

No. This only adds protobuf message definitions; the new types remain gated 
behind
{{spark.sql.timestampNanosTypes.enabled}} once the consuming paths are 
implemented.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to