[
https://issues.apache.org/jira/browse/SPARK-57160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091424#comment-18091424
]
Max Gekk commented on SPARK-57160:
----------------------------------
Follow-up from the PR review for SPARK-57159
(https://github.com/apache/spark/pull/56739): once the Spark Connect /
DataFrame wiring for the nanosecond timestamp types lands, add an end-to-end
Arrow batch test for TimestampNTZNanosType / TimestampLTZNanosType covering
Dataset.toArrowBatchRdd and the DataFrame-level ArrowConverters path
(ArrowConvertersSuite).
Already covered under SPARK-57159: the low-level Spark<->Arrow type mapping
(ArrowUtilsSuite); the row<->vector writers and ArrowColumnVector accessors,
together with an end-to-end ArrowConverters IPC batch round-trip at the
InternalRow level (ArrowConvertersSuite, ArrowWriterSuite); and the shared
DateTimeUtils.timestampNanosToEpochNanos helper (DateTimeUtilsSuite). This note
tracks the remaining DataFrame/Dataset-level end-to-end coverage, which depends
on the Connect result-transfer path.
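For orientation, here is a minimal Python sketch of the value such a round-trip test would need to preserve. It models the two-component physical value described in the issue below (epoch microseconds plus 0..999 nanoseconds within the microsecond) and flattens it to epoch nanoseconds. The class and function names are hypothetical. The arithmetic is only what a helper such as DateTimeUtils.timestampNanosToEpochNanos presumably computes, and the overflow handling is an assumption rather than the helper's documented behavior.

```python
from dataclasses import dataclass

NANOS_PER_MICRO = 1_000
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


@dataclass(frozen=True)
class TimestampNanos:
    """Hypothetical stand-in for the two-component physical value."""

    epoch_micros: int        # signed 64-bit microseconds since the epoch
    nanos_within_micro: int  # sub-microsecond remainder, 0..999

    def __post_init__(self):
        if not INT64_MIN <= self.epoch_micros <= INT64_MAX:
            raise ValueError("epoch_micros does not fit in int64")
        if not 0 <= self.nanos_within_micro < NANOS_PER_MICRO:
            raise ValueError("nanos_within_micro must be in [0, 999]")


def to_epoch_nanos(ts: TimestampNanos) -> int:
    """Flatten to nanoseconds since the epoch (assumed to fail on int64 overflow)."""
    nanos = ts.epoch_micros * NANOS_PER_MICRO + ts.nanos_within_micro
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("value is outside the int64 epoch-nanosecond range")
    return nanos
```

A round-trip assertion at the DataFrame level would then compare both components, not just the microsecond part, since two values that differ only in nanos_within_micro share the same epoch_micros.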
> Add Spark Connect protocol support for nanosecond-capable timestamp types and
> literals
> --------------------------------------------------------------------------------------
>
> Key: SPARK-57160
> URL: https://issues.apache.org/jira/browse/SPARK-57160
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Priority: Major
>
> h2. What
> Extend the Spark Connect protobuf protocol to represent
> {{TimestampNTZNanosType(p)}} and {{TimestampLTZNanosType(p)}} (p in [7, 9])
> both as data types (in schemas, casts, UDF I/O, etc.) and as literal values,
> and regenerate the language stubs.
> This is the protocol foundation for Spark Connect support of nanosecond
> timestamps (parent: SPARK-56822). The Scala/Python converters that consume
> these proto messages are separate sub-tasks and depend on this one.
> h2. Why
> Today the Connect {{DataType}} message has only microsecond timestamp kinds
> ({{timestamp}}, {{timestamp_ntz}}) with no precision field, and the
> {{Expression.Literal}} message encodes timestamp literals as a single
> {{int64}} of microseconds. There is no way to express a nanosecond-capable
> timestamp type or a sub-microsecond literal over the wire, so no Connect
> client/server path can carry the new types. The protocol must be extended
> before any converter, Arrow, or client work can proceed.
> The Catalyst physical value is
> {{org.apache.spark.unsafe.types.TimestampNanosVal}}
> ({{epochMicros: Long}} + {{nanosWithinMicro: Short}} in [0, 999]), so a
> literal cannot be represented by a single {{int64}}; the proto literal must
> carry both components plus the declared precision.
> h2. Scope
> {{sql/connect/common/src/main/protobuf/spark/connect/types.proto}}
> * Add two nested messages and {{oneof kind}} arms (next free field numbers,
> e.g. 29/30):
> {code}
> message TimestampNTZNanos {
>   optional int32 precision = 1; // 7..9
>   uint32 type_variation_reference = 2;
> }
> message TimestampLTZNanos {
>   optional int32 precision = 1; // 7..9
>   uint32 type_variation_reference = 2;
> }
> // in oneof kind:
> TimestampNTZNanos timestamp_ntz_nanos = 29;
> TimestampLTZNanos timestamp_ltz_nanos = 30;
> {code}
> * NTZ and LTZ are distinct kinds even though the physical value is identical,
> mirroring {{timestamp}} vs {{timestamp_ntz}} and the parameterized {{Time}}
> message.
> {{sql/connect/common/src/main/protobuf/spark/connect/expressions.proto}}
> * Add literal support carrying the full physical value plus precision. Use
> separate NTZ and LTZ arms so the literal kind is self-describing (do NOT
> reuse the single-{{int64}} {{timestamp}}/{{timestamp_ntz}} fields):
> {code}
> message TimestampNTZNanos {
>   int64 epoch_micros = 1;
>   int32 nanos_within_micro = 2; // 0..999
>   optional int32 precision = 3; // 7..9
> }
> message TimestampLTZNanos {
>   int64 epoch_micros = 1;
>   int32 nanos_within_micro = 2; // 0..999
>   optional int32 precision = 3; // 7..9
> }
> // new oneof literal_type arms (after time = 26; note 27/28 reserved for
> // geospatial)
> {code}
> * Keep the existing {{DataType data_type}} root field usable for
> disambiguation where the literal kind alone is insufficient.
> Stub regeneration
> * Run {{./dev/connect-gen-protos.sh}} and commit the regenerated Python stubs
> under {{python/pyspark/sql/connect/proto/}} ({{types_pb2.py}},
> {{types_pb2.pyi}}, {{expressions_pb2.py}}, {{expressions_pb2.pyi}}).
> * Java/Scala {{org.apache.spark.connect.proto.*}} classes are generated at
> {{connect-common}} compile time; no checked-in change.
> * Ensure {{dev/check-protos.py}} passes (Python stubs in sync).
> * Follow {{sql/connect/common/src/main/protobuf/README.md}}.
> h2. Out of scope
> * proto <-> Catalyst conversion in {{DataTypeProtoConverter}} /
> {{LiteralValueProtoConverter}} and the {{ConnectTypeOps}} registration
> (separate sub-task).
> * PySpark Connect Python-side conversion ({{connect/types.py}},
> {{connect/expressions.py}}) beyond committing the regenerated stubs
> (separate sub-task).
> * Arrow IPC type mapping (separate sub-task).
> h2. Design notes
> * Field numbers must be appended (never reuse/renumber existing fields) to
> preserve wire compatibility.
> * Final field numbers and the exact literal message shape should be confirmed
> against the {{Time}} precedent during review.
> * No behavior change until the consuming converters land; this PR only adds
> proto surface.
> h2. How should this be tested
> * {{build/sbt connect-common/compile}} succeeds (Java/Scala proto classes
> are generated).
> * {{./dev/connect-gen-protos.sh}} produces the committed Python stubs;
> {{dev/check-protos.py}} reports no drift.
> * No functional tests in this PR (no consumers yet); coverage comes from the
> converter and end-to-end sub-tasks.
> h2. Does this introduce any user-facing change
> No. This only adds protobuf message definitions; the new types remain gated
> behind {{spark.sql.timestampNanosTypes.enabled}} once the consuming paths are
> implemented.
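The two-field literal shape proposed under Scope can be illustrated with a small Python sketch. It splits a nanosecond epoch value into the (epoch_micros, nanos_within_micro) pair and checks the 7..9 precision range. The function names and the plain dict standing in for the proto message are hypothetical; the real conversion belongs to the converter sub-tasks and may differ, for example in whether it truncates the value to the declared precision, which is not modeled here.

```python
from typing import Dict, Tuple

NANOS_PER_MICRO = 1_000
VALID_PRECISIONS = (7, 8, 9)


def split_epoch_nanos(epoch_nanos: int) -> Tuple[int, int]:
    """Split epoch nanoseconds into (epoch_micros, nanos_within_micro)."""
    # divmod floors toward negative infinity, so the remainder stays in
    # 0..999 for timestamps before the epoch as well.
    return divmod(epoch_nanos, NANOS_PER_MICRO)


def make_literal(epoch_nanos: int, precision: int) -> Dict[str, int]:
    """Build a dict mirroring the proposed literal fields (illustrative only)."""
    if precision not in VALID_PRECISIONS:
        raise ValueError("precision must be in [7, 9]")
    epoch_micros, nanos_within_micro = split_epoch_nanos(epoch_nanos)
    return {
        "epoch_micros": epoch_micros,
        "nanos_within_micro": nanos_within_micro,
        "precision": precision,
    }
```

Flooring rather than truncating toward zero is the design point worth noting: it keeps nanos_within_micro non-negative, matching the [0, 999] range stated for the physical value.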