Max Gekk created SPARK-57100:
--------------------------------
Summary: Add columnar (ColumnVector) support for nanosecond
timestamp types
Key: SPARK-57100
URL: https://issues.apache.org/jira/browse/SPARK-57100
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.3.0
Reporter: Max Gekk
h3. Summary
SPARK-56981 added physical row storage for TimestampNTZNanosType(p) and
TimestampLTZNanosType(p) (p in [7, 9]) via TimestampNanosVal and UnsafeRow.
Columnar execution still cannot hold or move these values:
ColumnVector.getTimestampNTZNanos / getTimestampLTZNanos throw
SparkUnsupportedOperationException, and RowToColumnConverter /
ColumnVectorUtils have no support for these types.
This issue implements the columnar layer so ColumnarBatch can store nanosecond
timestamps and interoperate with InternalRow / UnsafeRow (ColumnarToRow,
RowToColumnar, whole-stage codegen paths that read column vectors).
Parquet vectorized decode (ParquetVectorUpdaterFactory, TIMESTAMP(NANOS) pages)
is a separate follow-up that depends on this issue.
h3. Background
* Logical types and parser: SPARK-56876, SPARK-56965
* Physical / UnsafeRow layer: SPARK-56981 (merged, PR #56059)
* SPIP composite value: epochMicros (long) + nanosWithinMicro (short, 0-999)
* UnsafeRow stores the value as a 16-byte payload in its variable-length
region; column batches should use a fixed struct-like layout (see below), not
the UnsafeRow blob layout.
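The composite value above can be illustrated with a minimal sketch. It assumes epochMicros is floored toward negative infinity, so that nanosWithinMicro stays in 0-999 for pre-epoch instants as well. NanosSplitSketch and its methods are hypothetical names for illustration, not Spark API.

```java
import java.time.Instant;

/** Hypothetical helper; class and method names are illustrative, not Spark API. */
final class NanosSplitSketch {
    private NanosSplitSketch() {}

    /** Microseconds since the epoch, floored toward negative infinity (assumption). */
    static long epochMicros(Instant t) {
        // Instant.getNano() is always in [0, 999_999_999], so adding the
        // truncated microsecond part floors correctly for pre-epoch instants.
        return Math.addExact(
            Math.multiplyExact(t.getEpochSecond(), 1_000_000L),
            t.getNano() / 1_000);
    }

    /** Remaining nanoseconds inside the microsecond, always in [0, 999]. */
    static short nanosWithinMicro(Instant t) {
        return (short) (t.getNano() % 1_000);
    }

    /** Inverse mapping from the composite value back to an Instant. */
    static Instant toInstant(long epochMicros, short nanosWithinMicro) {
        long seconds = Math.floorDiv(epochMicros, 1_000_000L);
        long microsOfSecond = Math.floorMod(epochMicros, 1_000_000L);
        return Instant.ofEpochSecond(seconds, microsOfSecond * 1_000 + nanosWithinMicro);
    }
}
```

With this split, an instant one nanosecond before the epoch maps to epochMicros = -1 and nanosWithinMicro = 999.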
h3. Recommended column layout
Mirror CalendarInterval (multi-child column), not a single primitive column:
|| Child || Spark type || Field ||
| 0 | LongType | epochMicros |
| 1 | IntegerType | nanosWithinMicro (0-999) |
NTZ and LTZ share the same physical column layout; SQL semantics stay on the
logical type (same pattern as row layer).
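A minimal sketch of this layout, with plain Java arrays standing in for the two child vectors. NanosColumnLayoutSketch is a hypothetical class, not Spark code, and keeping the null flags on the parent column is an assumption borrowed from struct-like columns.

```java
/** Hypothetical two-child column; plain arrays stand in for Spark child vectors. */
final class NanosColumnLayoutSketch {
    private final long[] epochMicros;      // child 0 (LongType)
    private final int[] nanosWithinMicro;  // child 1 (IntegerType), 0-999
    private final boolean[] nulls;         // validity kept on the parent (assumption)

    NanosColumnLayoutSketch(int capacity) {
        epochMicros = new long[capacity];
        nanosWithinMicro = new int[capacity];
        nulls = new boolean[capacity];
    }

    /** Writes both children for one row and clears its null flag. */
    void put(int rowId, long micros, int nanos) {
        if (nanos < 0 || nanos > 999) {
            throw new IllegalArgumentException("nanosWithinMicro out of range: " + nanos);
        }
        epochMicros[rowId] = micros;
        nanosWithinMicro[rowId] = nanos;
        nulls[rowId] = false;
    }

    void putNull(int rowId) { nulls[rowId] = true; }

    boolean isNullAt(int rowId) { return nulls[rowId]; }

    long epochMicrosAt(int rowId) { return epochMicros[rowId]; }

    int nanosWithinMicroAt(int rowId) { return nanosWithinMicro[rowId]; }
}
```

Nothing in the sketch distinguishes NTZ from LTZ, which matches the point that both share one physical layout.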
h3. What to do
*ColumnVector API (sql/catalyst)*
* Implement default getTimestampNTZNanos / getTimestampLTZNanos on ColumnVector
using getChild(0).getLong + getChild(1).getInt (remove throw).
* WritableColumnVector: allocate two child columns for TimestampNTZNanosType /
TimestampLTZNanosType in the constructor (like CalendarIntervalType).
* Add putTimestampNanos (or separate putTimestampNTZNanos /
putTimestampLTZNanos) and matching append paths that write both children.
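The getter and append items above can be sketched on a heavily reduced vector hierarchy. All class and method names below (VectorSketch, NanosVectorSketch, NanosValSketch, appendTimestampNanos and so on) are hypothetical stand-ins; only the getChild(0).getLong + getChild(1).getInt composition is taken from this ticket. Returning null for a null row is an assumption.

```java
import java.util.Arrays;

/** Hypothetical value holder standing in for TimestampNanosVal. */
final class NanosValSketch {
    final long epochMicros;
    final int nanosWithinMicro;

    NanosValSketch(long epochMicros, int nanosWithinMicro) {
        this.epochMicros = epochMicros;
        this.nanosWithinMicro = nanosWithinMicro;
    }
}

/** Hypothetical, heavily reduced base class; not Spark's ColumnVector. */
abstract class VectorSketch {
    abstract boolean isNullAt(int rowId);
    long getLong(int rowId) { throw new UnsupportedOperationException(); }
    int getInt(int rowId) { throw new UnsupportedOperationException(); }
    VectorSketch getChild(int ordinal) { throw new UnsupportedOperationException(); }

    /** Default getter composed from the two children instead of throwing. */
    NanosValSketch getTimestampNanos(int rowId) {
        if (isNullAt(rowId)) return null;
        return new NanosValSketch(getChild(0).getLong(rowId), getChild(1).getInt(rowId));
    }
}

final class LongVectorSketch extends VectorSketch {
    private long[] data = new long[4];
    private int size;

    void append(long v) {
        if (size == data.length) data = Arrays.copyOf(data, size * 2);
        data[size++] = v;
    }
    @Override boolean isNullAt(int rowId) { return false; }
    @Override long getLong(int rowId) { return data[rowId]; }
}

final class IntVectorSketch extends VectorSketch {
    private int[] data = new int[4];
    private int size;

    void append(int v) {
        if (size == data.length) data = Arrays.copyOf(data, size * 2);
        data[size++] = v;
    }
    @Override boolean isNullAt(int rowId) { return false; }
    @Override int getInt(int rowId) { return data[rowId]; }
}

/** Parent vector that allocates its two children up front. */
final class NanosVectorSketch extends VectorSketch {
    private final LongVectorSketch micros = new LongVectorSketch();
    private final IntVectorSketch nanos = new IntVectorSketch();
    private boolean[] nulls = new boolean[4];
    private int size;

    /** Append path: writes both children and returns the new row id. */
    int appendTimestampNanos(long epochMicros, int nanosWithinMicro) {
        micros.append(epochMicros);
        nanos.append(nanosWithinMicro);
        return appendValidity(false);
    }

    /** Null rows still advance both children so row ids stay aligned. */
    int appendNull() {
        micros.append(0L);
        nanos.append(0);
        return appendValidity(true);
    }

    private int appendValidity(boolean isNull) {
        if (size == nulls.length) nulls = Arrays.copyOf(nulls, size * 2);
        nulls[size] = isNull;
        return size++;
    }

    @Override boolean isNullAt(int rowId) { return nulls[rowId]; }
    @Override VectorSketch getChild(int ordinal) { return ordinal == 0 ? micros : nanos; }
}
```

Padding the children on null rows is one possible design; it keeps child row ids equal to parent row ids, so the default getter needs no offset bookkeeping.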
*On-heap / off-heap vectors (sql/core)*
* OnHeapColumnVector / OffHeapColumnVector: read/write/append for nanos columns.
* ConstantColumnVector: set/get for constant nanos values.
* MutableColumnarRow: ensure setters write through to WritableColumnVector
(getters already delegate).
*Row <-> column bridges*
* RowToColumnConverter (Columnar.scala): TimestampNanosConverter (like
CalendarConverter) using row.getTimestampNTZNanos / getTimestampLTZNanos.
* ColumnVectorUtils: populate and appendValue for PhysicalTimestampNTZNanosType
/ PhysicalTimestampLTZNanosType.
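The row-to-column bridge can be sketched as a per-type converter that appends one row field into the two-child column, plus the reverse read. Every name below is hypothetical (NanosCellSketch stands in for TimestampNanosVal, an array of cells stands in for a row, and lists stand in for child vectors); this is not the Columnar.scala code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical composite value standing in for TimestampNanosVal. */
final class NanosCellSketch {
    final long epochMicros;
    final int nanosWithinMicro;

    NanosCellSketch(long epochMicros, int nanosWithinMicro) {
        this.epochMicros = epochMicros;
        this.nanosWithinMicro = nanosWithinMicro;
    }
}

/** Hypothetical two-child column sink; lists stand in for child vectors. */
final class NanosSinkSketch {
    final List<Long> epochMicros = new ArrayList<>();
    final List<Integer> nanosWithinMicro = new ArrayList<>();
    final List<Boolean> nulls = new ArrayList<>();
}

/** Hypothetical converter shaped like a per-type row-to-column converter. */
final class TimestampNanosConverterSketch {
    private TimestampNanosConverterSketch() {}

    /** Appends field `ordinal` of `row` to `out`; a null entry means SQL NULL. */
    static void append(NanosCellSketch[] row, int ordinal, NanosSinkSketch out) {
        NanosCellSketch v = row[ordinal];
        boolean isNull = v == null;
        out.nulls.add(isNull);
        // Children are padded on NULL so all three lists stay row-aligned.
        out.epochMicros.add(isNull ? 0L : v.epochMicros);
        out.nanosWithinMicro.add(isNull ? 0 : v.nanosWithinMicro);
    }

    /** Column-to-row direction: rebuilds the composite value for one row. */
    static NanosCellSketch read(NanosSinkSketch in, int rowId) {
        if (in.nulls.get(rowId)) return null;
        return new NanosCellSketch(in.epochMicros.get(rowId), in.nanosWithinMicro.get(rowId));
    }
}
```

Appending a null and a non-null row and then reading both back is the same shape as the roundtrip requested in the Tests section.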
*Columnar surface stubs*
* ColumnVector / ColumnarRow / ColumnarArray / ColumnarBatchRow: getters
already delegate to ColumnVector; verify them once the base implementation is
in place.
* ColumnVector stubs that must keep throwing UnsupportedOperationException
until vectorized Parquet / columnar writers land may remain, provided they are
documented; this ticket focuses on read/get/put/append and the row roundtrip.
*Codegen*
* CodeGenerator already emits getTimestampNTZNanos / getTimestampLTZNanos for
columnar inputs; no change expected once ColumnVector implements getters.
h3. Tests
* Unit tests: write/read/append/null handling on OnHeapColumnVector (and
OffHeapColumnVector where the test suite enables it).
* RowToColumnar -> ColumnarToRow -> UnsafeProjection roundtrip for NTZ and LTZ
nanos types (null and non-null).
* Regression: behavior of microsecond TimestampType / TimestampNTZType column
vectors is unchanged.
h3. Acceptance criteria
* ColumnarBatch can be built from InternalRow rows containing TimestampNanosVal
for nanos timestamp columns.
* ColumnarBatch.rowIterator() + UnsafeProjection produces UnsafeRow values
equal to the source row for nanos columns.
* getTimestampNTZNanos / getTimestampLTZNanos on column vectors return correct
TimestampNanosVal for batch rows.
* RowToColumnConverter no longer throws unsupportedDataTypeError for
TimestampNTZNanosType / TimestampLTZNanosType.
h3. Unblocks
* Parquet vectorized read of TIMESTAMP(NANOS) into ColumnarBatch.
* Vectorized scan performance for nanos columns; RowToColumnarExec /
ColumnarToRowExec in nanos pipelines.
h3. References
* Parent: SPARK-56822 (SPIP: Timestamps with nanosecond precision)
* Precedent: CalendarInterval column layout in WritableColumnVector and
Columnar.scala
* Physical value: org.apache.spark.unsafe.types.TimestampNanosVal