Max Gekk created SPARK-57102:
--------------------------------
Summary: Read Parquet TIMESTAMP(NANOS) via non-vectorized reader
for NTZ and LTZ nanosecond types
Key: SPARK-57102
URL: https://issues.apache.org/jira/browse/SPARK-57102
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.3.0
Reporter: Max Gekk
h3. Summary
Enable reading Parquet files that store timestamps as INT64 with logical type
TIMESTAMP(NANOS), produced by external tools (e.g. PyArrow or pandas), into
Spark's nanosecond timestamp types TimestampNTZNanosType and
TimestampLTZNanosType. Implementation is limited to the non-vectorized
(row-based) read path (ParquetRowConverter). TIMESTAMP_LTZ reuses the existing
Parquet datetime rebasing; TIMESTAMP_NTZ values are not rebased.
Today, TIMESTAMP(NANOS) is either rejected (PARQUET_TYPE_ILLEGAL) or mapped to
LongType when spark.sql.legacy.parquet.nanosAsLong is true (SPARK-40819). This
issue adds a native read into the nanosecond types, so that files written by
other tools can be consumed without the LongType workaround.
h3. Background
* Parent: SPARK-56822 (SPIP: Timestamps with nanosecond precision)
* Physical row layer: SPARK-56981 (TimestampNanosVal, InternalRow / UnsafeRow
accessors)
* Logical types: SPARK-56876 (TimestampNTZNanosType / TimestampLTZNanosType,
precision p in [7, 9])
* On-wire Parquet: epoch nanoseconds as INT64; Spark internal value is
(epochMicros, nanosWithinMicro) in TimestampNanosVal
* Existing test resource: test-data/timestamp-nanos.parquet (TIMESTAMP(NANOS,
true) only; nanosAsLong path)
h3. What to do
h4. 1. External Parquet test fixtures
* Add committed Parquet file(s) under sql/core/src/test/resources (alongside or
extending existing timestamp-nanos.parquet).
* Generate with an external tool (PyArrow recommended), not Spark
df.write.parquet.
* Include at least:
** ts_ltz — INT64 TIMESTAMP(NANOS, isAdjustedToUTC=true) ->
TimestampLTZNanosType(9)
** ts_ntz — INT64 TIMESTAMP(NANOS, isAdjustedToUTC=false) ->
TimestampNTZNanosType(9)
* Row values should cover: sub-micro fractional part (non-zero
nanosWithinMicro), negative epoch-nanos, and at least one LTZ instant that
differs under LEGACY vs CORRECTED datetime rebase (same class of dates as
existing Parquet microsecond rebase tests).
* Set Parquet file metadata keys Spark already uses for datetime rebase (e.g.
spark.sql.parquet.datetimeRebaseMode) so RebaseSpec is exercised.
* Provide a small Python regeneration script (documented header; checked-in
files are the source of truth for CI).
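A minimal sketch of what such a regeneration script could look like, assuming PyArrow is installed. The names to_epoch_nanos, FIXTURE_EPOCH_NANOS and write_fixture are hypothetical, and the metadata keys are passed in by the caller rather than guessed here. Note that INT64 epoch-nanos only spans roughly 1677 to 2262, so the rebase-sensitive row has to fall in that range; the 1800 value below is an assumption that pre-1900 instants can differ under LEGACY vs CORRECTED in some time zones and should be checked against the existing microsecond rebase tests.

```python
from datetime import datetime, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_nanos(year, month, day, hour=0, minute=0, second=0, nanos=0):
    """Epoch nanoseconds for a proleptic Gregorian UTC date-time plus a nanos fraction."""
    delta = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc) - _EPOCH
    # timedelta keeps 0 <= seconds < 86400, so this sum is exact for negative values too.
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + nanos


# Hypothetical row values; they double as the test oracle on the Spark side.
FIXTURE_EPOCH_NANOS = [
    to_epoch_nanos(2020, 1, 1, nanos=123_456_789),             # non-zero sub-micro part
    to_epoch_nanos(1969, 12, 31, 23, 59, 59, nanos=999_999_999),  # negative epoch-nanos
    to_epoch_nanos(1800, 1, 1, nanos=1),                       # assumed rebase-sensitive
]


def write_fixture(path, file_metadata):
    """Write ts_ltz / ts_ntz columns as INT64 TIMESTAMP(NANOS) with PyArrow."""
    # Imported lazily so the oracle values above stay usable without PyArrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    raw = pa.array(FIXTURE_EPOCH_NANOS, type=pa.int64())
    table = pa.table({
        # A tz-aware Arrow timestamp is written with isAdjustedToUTC=true.
        "ts_ltz": raw.cast(pa.timestamp("ns", tz="UTC")),
        # A tz-naive Arrow timestamp is written with isAdjustedToUTC=false.
        "ts_ntz": raw.cast(pa.timestamp("ns")),
    })
    # Schema-level metadata ends up in the Parquet file's key-value metadata.
    table = table.replace_schema_metadata(file_metadata)
    # Format version 2.6 keeps nanosecond precision instead of coercing to micros.
    pq.write_table(table, path, version="2.6")
```

Calling write_fixture(path, {...}) with the rebase-related keys Spark reads would produce the committed file; the keys are left to the caller on purpose, since this sketch does not try to name them.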
h4. 2. Epoch-nanos conversion helpers
* Package-private helpers, e.g. epochNanosToTimestampNanosVal(epochNanos:
Long): TimestampNanosVal and inverse for test oracles.
* Use Math.floorDiv / floorMod for negative timestamps; nanosWithinMicro in [0,
999].
* Unit tests without Parquet I/O.
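The floor-based decomposition can be illustrated as follows. This is a Python sketch of the arithmetic only, usable as an oracle in the regeneration script; the TimestampNanosVal tuple here is a stand-in for the Spark class, and the function names are hypothetical. Python's divmod floors toward negative infinity, which is the behavior Math.floorDiv / Math.floorMod give on the JVM; Long overflow is not modeled.

```python
from typing import NamedTuple

NANOS_PER_MICRO = 1_000


class TimestampNanosVal(NamedTuple):
    """Illustrative stand-in for Spark's TimestampNanosVal, not the real class."""
    epoch_micros: int
    nanos_within_micro: int  # always in [0, 999]


def epoch_nanos_to_val(epoch_nanos: int) -> TimestampNanosVal:
    # divmod floors, so a negative input still yields a remainder in [0, 999].
    epoch_micros, nanos_within_micro = divmod(epoch_nanos, NANOS_PER_MICRO)
    return TimestampNanosVal(epoch_micros, nanos_within_micro)


def val_to_epoch_nanos(val: TimestampNanosVal) -> int:
    # Inverse of the decomposition above, intended for test oracles.
    return val.epoch_micros * NANOS_PER_MICRO + val.nanos_within_micro
```

With truncating division, -1 ns would decompose to (0, -1); with floor semantics it becomes (-1, 999), which keeps nanosWithinMicro non-negative.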
h4. 3. Schema mapping (ParquetSchemaConverter)
When spark.sql.legacy.parquet.nanosAsLong is false (default):
|| Parquet logical type || Spark type (schema inference) ||
| TIMESTAMP(NANOS, isAdjustedToUTC=true) | TimestampLTZNanosType (default precision 9) |
| TIMESTAMP(NANOS, isAdjustedToUTC=false) | TimestampNTZNanosType (default precision 9) |
* Keep nanosAsLong=true -> LongType behavior (SPARK-40819).
* Apply preview / SQLConf gating from SPARK-56969 if required for user-facing
analysis; tests may enable the conf explicitly.
* Update Parquet schema inference tests accordingly.
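The intended inference rule for TIMESTAMP(NANOS) columns can be summarized as a small decision function. This is only a sketch of the mapping described above: infer_spark_type is a hypothetical name, type names are returned as plain strings, and the preview gating from SPARK-56969 is not modeled.

```python
def infer_spark_type(is_adjusted_to_utc: bool, nanos_as_long: bool = False) -> str:
    """Spark type name inferred for an INT64 TIMESTAMP(NANOS) Parquet column."""
    if nanos_as_long:
        # Legacy behavior from SPARK-40819 is kept as-is.
        return "LongType"
    if is_adjusted_to_utc:
        return "TimestampLTZNanosType(9)"
    return "TimestampNTZNanosType(9)"
```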
h4. 4. Non-vectorized read (ParquetRowConverter)
* Add ParquetPrimitiveConverter branches for TimeUnit.NANOS only on the row
converter path (not ParquetVectorUpdaterFactory / vectorized reader).
* TimestampNTZNanosType: addLong(epochNanos) -> convert to TimestampNanosVal ->
updater; no timestampRebaseFunc (same policy as TimestampNTZType + MICROS).
* TimestampLTZNanosType: addLong(epochNanos) -> decompose to epochMicros +
nanosWithinMicro -> apply existing timestampRebaseFunc from ParquetRowConverter
(DataSourceUtils.createTimestampRebaseFuncInRead / datetimeRebaseSpec) on
epochMicros -> reassemble TimestampNanosVal. Do not add a separate rebase
implementation.
* Wire updaters for nanos types in nested converters as needed.
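The difference between the two branches can be sketched as below. This is a Python illustration of the data flow, not converter code: convert_ntz_nanos and convert_ltz_nanos are hypothetical names, and rebase_micros stands in for the existing timestampRebaseFunc, which operates on epoch microseconds.

```python
from typing import Callable, Tuple

NANOS_PER_MICRO = 1_000


def convert_ntz_nanos(epoch_nanos: int) -> Tuple[int, int]:
    """NTZ branch: decompose only, no rebase."""
    return divmod(epoch_nanos, NANOS_PER_MICRO)


def convert_ltz_nanos(epoch_nanos: int,
                      rebase_micros: Callable[[int], int]) -> Tuple[int, int]:
    """LTZ branch: rebase the microsecond part, carry the sub-micro part through."""
    epoch_micros, nanos_within_micro = divmod(epoch_nanos, NANOS_PER_MICRO)
    # The rebase function never sees the sub-micro digits, so they are preserved.
    return rebase_micros(epoch_micros), nanos_within_micro
```

Because the rebase shifts whole microseconds, applying it to epochMicros alone and reattaching nanosWithinMicro afterwards gives the same result as a nanosecond-aware rebase would, which is why no separate implementation is needed.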
h4. 5. Integration tests
* Force non-vectorized read: spark.sql.parquet.enableVectorizedReader=false
(and spark.sql.legacy.parquet.nanosAsLong=false).
* Read LTZ and NTZ columns from fixtures; assert TimestampNanosVal matches
precomputed oracle.
* Rebase: same file with datetimeRebaseMode LEGACY vs CORRECTED for LTZ column;
behavior aligned with microsecond LTZ Parquet rebase tests.
* SPARK-40819: with nanosAsLong=false, read succeeds and returns nanos types;
with nanosAsLong=true, schema remains LongType.
* Values readable via getTimestampNTZNanos / getTimestampLTZNanos on collected
rows.
h3. Acceptance criteria
* spark.read.parquet on committed external fixtures returns
TimestampNTZNanosType and TimestampLTZNanosType columns when nanosAsLong is
false.
* Non-vectorized path populates TimestampNanosVal with correct (epochMicros,
nanosWithinMicro).
* LTZ columns use existing Parquet datetime rebase spec; NTZ columns do not
rebase.
* Tests pass with the vectorized reader disabled; supporting the vectorized
reader is not required in this issue.
* spark.sql.legacy.parquet.nanosAsLong=true unchanged (LongType).
* Microsecond TimestampType / TimestampNTZType Parquet behavior unchanged.
h3. Out of scope
* Parquet vectorized reader (ParquetVectorUpdaterFactory,
VectorizedParquetRecordReader) — follow-up after columnar ColumnVector support
for nanos types
* Parquet write of TIMESTAMP(NANOS) native types
* Cast matrix, string parsing, Dataset java.time encoders (SPARK-57032,
SPARK-57033)
* INT96-as-timestamp nanos carrier (focus on TIMESTAMP(NANOS) INT64)
* Changing UnsafeRow 16-byte payload layout
h3. Depends on
* SPARK-56981 (physical row storage and TimestampNanosVal)
h3. Related
* SPARK-56969 (preview SQLConf gating, if analysis must be enabled for ad-hoc
reads)
* SPARK-40819 (existing timestamp-nanos.parquet and nanosAsLong behavior)
h3. References
* ParquetSchemaConverter — TIMESTAMP(NANOS) handling today
* ParquetRowConverter — timestampRebaseFunc, TimestampNTZType / TimestampType
MICROS precedents
* org.apache.spark.unsafe.types.TimestampNanosVal