Max Gekk created SPARK-57160:
--------------------------------
Summary: Add Spark Connect protocol support for nanosecond-capable
timestamp types and literals
Key: SPARK-57160
URL: https://issues.apache.org/jira/browse/SPARK-57160
Project: Spark
Issue Type: Sub-task
Components: Connect
Affects Versions: 4.3.0
Reporter: Max Gekk
h2. What
Extend the Spark Connect protobuf protocol to represent
{{TimestampNTZNanosType(p)}} and
{{TimestampLTZNanosType(p)}} (p in [7, 9]) both as data types (in schemas,
casts, UDF I/O,
etc.) and as literal values, and regenerate the language stubs.
This is the protocol foundation for Spark Connect support of nanosecond
timestamps
(parent: SPARK-56822). The Scala/Python converters that consume these proto
messages are
separate sub-tasks and depend on this one.
h2. Why
Today the Connect {{DataType}} message has only microsecond timestamp kinds
({{timestamp}}, {{timestamp_ntz}}) with no precision field, and the
{{Expression.Literal}}
message encodes timestamp literals as a single {{int64}} of microseconds. There
is no way
to express a nanosecond-capable timestamp type or a sub-microsecond literal
over the wire,
so no Connect client/server path can carry the new types. The protocol must be
extended
before any converter, Arrow, or client work can proceed.
The Catalyst physical value is
{{org.apache.spark.unsafe.types.TimestampNanosVal}}
({{epochMicros: Long}} + {{nanosWithinMicro: Short}} in [0, 999]), so a literal
cannot be
represented by a single {{int64}}; the proto literal must carry both components
plus the
declared precision.
h2. Scope
{{sql/connect/common/src/main/protobuf/spark/connect/types.proto}}
* Add two nested messages and {{oneof kind}} arms (next free field numbers,
e.g. 29/30):
{code}
message TimestampNTZNanos {
optional int32 precision = 1; // 7..9
uint32 type_variation_reference = 2;
}
message TimestampLTZNanos {
optional int32 precision = 1; // 7..9
uint32 type_variation_reference = 2;
}
// in oneof kind:
TimestampNTZNanos timestamp_ntz_nanos = 29;
TimestampLTZNanos timestamp_ltz_nanos = 30;
{code}
* NTZ and LTZ are distinct kinds even though the physical value is identical,
mirroring
{{timestamp}} vs {{timestamp_ntz}} and the parameterized {{Time}} message.
{{sql/connect/common/src/main/protobuf/spark/connect/expressions.proto}}
* Add literal support carrying the full physical value plus precision. Use
separate NTZ and
LTZ arms so the literal kind is self-describing (do NOT reuse the
single-{{int64}}
{{timestamp}}/{{timestamp_ntz}} fields):
{code}
message TimestampNTZNanos {
int64 epoch_micros = 1;
int32 nanos_within_micro = 2; // 0..999
optional int32 precision = 3; // 7..9
}
message TimestampLTZNanos {
int64 epoch_micros = 1;
int32 nanos_within_micro = 2; // 0..999
optional int32 precision = 3; // 7..9
}
// new oneof literal_type arms (after time = 26; note 27/28 reserved for
geospatial)
{code}
* Keep the existing {{DataType data_type}} root field usable for disambiguation
where the
literal kind alone is insufficient.
Stub regeneration
* Run {{./dev/connect-gen-protos.sh}} and commit the regenerated Python stubs
under
{{python/pyspark/sql/connect/proto/}} ({{types_pb2.py}}, {{types_pb2.pyi}},
{{expressions_pb2.py}}, {{expressions_pb2.pyi}}).
* Java/Scala {{org.apache.spark.connect.proto.*}} classes are generated at
{{connect-common}} compile time; no checked-in change.
* Ensure {{dev/check-protos.py}} passes (Python stubs in sync).
* Follow {{sql/connect/common/src/main/protobuf/README.md}}.
h2. Out of scope
* proto <-> Catalyst conversion in {{DataTypeProtoConverter}} /
{{LiteralValueProtoConverter}}
and the {{ConnectTypeOps}} registration (separate sub-task).
* PySpark Connect Python-side conversion ({{connect/types.py}},
{{connect/expressions.py}})
beyond committing the regenerated stubs (separate sub-task).
* Arrow IPC type mapping (separate sub-task).
h2. Design notes
* Field numbers must be appended (never reuse/renumber existing fields) to
preserve wire
compatibility.
* Final field numbers and the exact literal message shape should be confirmed
against the
{{Time}} precedent during review.
* No behavior change until the consuming converters land; this PR only adds
proto surface.
h2. How should this tested
* {{build/sbt connect-common/compile}} succeeds (Java/Scala proto classes
generate).
* {{./dev/connect-gen-protos.sh}} produces the committed Python stubs;
{{dev/check-protos.py}}
reports no drift.
* No functional tests in this PR (no consumers yet); covered by the converter
and
end-to-end sub-tasks.
h2. Does this introduce any user-facing change
No. This only adds protobuf message definitions; the new types remain gated
behind
{{spark.sql.timestampNanosTypes.enabled}} once the consuming paths are
implemented.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]