Max Gekk created SPARK-57102:
--------------------------------

             Summary: Read Parquet TIMESTAMP(NANOS) via non-vectorized reader 
for NTZ and LTZ nanosecond types
                 Key: SPARK-57102
                 URL: https://issues.apache.org/jira/browse/SPARK-57102
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.3.0
            Reporter: Max Gekk


h3. Summary

Enable reading Parquet files that store timestamps as INT64 with logical type 
TIMESTAMP(NANOS), produced by external tools (e.g. PyArrow or pandas), into 
Spark's nanosecond timestamp types TimestampNTZNanosType and 
TimestampLTZNanosType. The implementation is limited to the non-vectorized 
(row-based) read path (ParquetRowConverter). The LTZ path reuses the existing 
Parquet datetime rebasing; the NTZ path does not rebase.

Today, TIMESTAMP(NANOS) is either rejected (PARQUET_TYPE_ILLEGAL) or mapped to 
LongType when spark.sql.legacy.parquet.nanosAsLong is true (SPARK-40819). This 
issue adds native reads into the nanosecond types for real-world interop files.

h3. Background

* Parent: SPARK-56822 (SPIP: Timestamps with nanosecond precision)
* Physical row layer: SPARK-56981 (TimestampNanosVal, InternalRow / UnsafeRow 
accessors)
* Logical types: SPARK-56876 (TimestampNTZNanosType / TimestampLTZNanosType, 
precision p in [7, 9])
* On-wire Parquet: epoch nanoseconds as INT64; Spark internal value is 
(epochMicros, nanosWithinMicro) in TimestampNanosVal
* Existing test resource: test-data/timestamp-nanos.parquet (TIMESTAMP(NANOS, 
true) only; nanosAsLong path)

h3. What to do

h4. 1. External Parquet test fixtures

* Add committed Parquet file(s) under sql/core/src/test/resources (alongside or 
extending existing timestamp-nanos.parquet).
* Generate with an external tool (PyArrow recommended), not Spark 
df.write.parquet.
* Include at least:
** ts_ltz — INT64 TIMESTAMP(NANOS, isAdjustedToUTC=true) -> 
TimestampLTZNanosType(9)
** ts_ntz — INT64 TIMESTAMP(NANOS, isAdjustedToUTC=false) -> 
TimestampNTZNanosType(9)
* Row values should cover: sub-micro fractional part (non-zero 
nanosWithinMicro), negative epoch-nanos, and at least one LTZ instant that 
differs under LEGACY vs CORRECTED datetime rebase (same class of dates as 
existing Parquet microsecond rebase tests).
* Set Parquet file metadata keys Spark already uses for datetime rebase (e.g. 
spark.sql.parquet.datetimeRebaseMode) so RebaseSpec is exercised.
* Provide a small Python regeneration script (documented header; checked-in 
files are the source of truth for CI).
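
A minimal sketch of such a regeneration script follows. The output file name, the row values, and the footer_metadata parameter are illustrative assumptions, not names from this issue. The footer keys that Spark's rebase logic reads are assumed to be org.apache.spark.version and org.apache.spark.legacyDateTime and should be verified against DataSourceUtils before use. The pre-1900 row is only a candidate for the rebase case: whether it differs under LEGACY vs CORRECTED depends on the time zone used for rebasing.

```python
"""Regenerates a TIMESTAMP(NANOS) Parquet fixture (sketch; names are illustrative)."""
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_nanos(dt: datetime, extra_nanos: int = 0) -> int:
    """Epoch nanoseconds for a UTC datetime plus a sub-microsecond remainder."""
    delta = dt - EPOCH
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * 1_000 + extra_nanos


ROWS = [
    # Non-zero nanosWithinMicro (789 ns past the microsecond).
    epoch_nanos(datetime(2020, 1, 1, 12, 0, 0, 123456, tzinfo=timezone.utc), 789),
    # Negative epoch-nanos: one nanosecond before the epoch.
    epoch_nanos(datetime(1969, 12, 31, 23, 59, 59, 999999, tzinfo=timezone.utc), 999),
    # Pre-1900 instant, a candidate for the LEGACY vs CORRECTED rebase case.
    epoch_nanos(datetime(1800, 1, 1, tzinfo=timezone.utc)),
]


def write_fixture(path="timestamp-nanos-ltz-ntz.parquet", footer_metadata=None):
    """Writes ts_ltz / ts_ntz columns as INT64 TIMESTAMP(NANOS) using PyArrow."""
    import pyarrow as pa  # imported lazily so ROWS can be inspected without PyArrow
    import pyarrow.parquet as pq

    table = pa.table({
        # tz-aware type -> isAdjustedToUTC=true; naive type -> isAdjustedToUTC=false
        "ts_ltz": pa.array(ROWS, type=pa.timestamp("ns", tz="UTC")),
        "ts_ntz": pa.array(ROWS, type=pa.timestamp("ns")),
    })
    if footer_metadata:
        # Assumption: PyArrow copies schema metadata into the Parquet footer
        # key-value metadata, which is where Spark looks for rebase hints.
        table = table.replace_schema_metadata(footer_metadata)
    # Format version 2.6 keeps nanosecond precision instead of coercing to micros.
    pq.write_table(table, path, version="2.6")
```

Keeping ROWS free of PyArrow types lets the same list serve as the test oracle for the expected epoch-nanos values.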

h4. 2. Epoch-nanos conversion helpers

* Package-private helpers, e.g. epochNanosToTimestampNanosVal(epochNanos: 
Long): TimestampNanosVal and inverse for test oracles.
* Use Math.floorDiv / floorMod for negative timestamps; nanosWithinMicro in [0, 
999].
* Unit tests without Parquet I/O.
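
The floor-based split can be sketched in Python as a test oracle. Python's // and % already round toward negative infinity, which matches Math.floorDiv / Math.floorMod on the JVM. The TimestampNanosVal class below is a stand-in tuple for illustration, not Spark's class, and the function names are hypothetical.

```python
from typing import NamedTuple

NANOS_PER_MICRO = 1_000


class TimestampNanosVal(NamedTuple):
    """Stand-in for Spark's TimestampNanosVal (illustration only)."""
    epoch_micros: int
    nanos_within_micro: int


def epoch_nanos_to_timestamp_nanos_val(epoch_nanos: int) -> TimestampNanosVal:
    # Floor division keeps nanos_within_micro in [0, 999] for negative inputs.
    return TimestampNanosVal(
        epoch_nanos // NANOS_PER_MICRO,
        epoch_nanos % NANOS_PER_MICRO,
    )


def timestamp_nanos_val_to_epoch_nanos(value: TimestampNanosVal) -> int:
    # Inverse of the split above, used to check round trips.
    return value.epoch_micros * NANOS_PER_MICRO + value.nanos_within_micro
```

With truncating division, an input of -1 would give (0, -1); the floor variant gives (-1, 999), which keeps the remainder non-negative.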

h4. 3. Schema mapping (ParquetSchemaConverter)

When spark.sql.legacy.parquet.nanosAsLong is false (default):

|| Parquet logical type || Spark type (schema inference) ||
| TIMESTAMP(NANOS, isAdjustedToUTC=true) | TimestampLTZNanosType (default 
precision 9) |
| TIMESTAMP(NANOS, isAdjustedToUTC=false) | TimestampNTZNanosType (default 
precision 9) |

* Keep nanosAsLong=true -> LongType behavior (SPARK-40819).
* Apply preview / SQLConf gating from SPARK-56969 if required for user-facing 
analysis; tests may enable the conf explicitly.
* Update Parquet schema inference tests accordingly.
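
The intended mapping, including the nanosAsLong override, can be summarized as a small decision function. This is a sketch of the rule only; the function name is hypothetical and the real change belongs in ParquetSchemaConverter.

```python
def spark_type_for_parquet_nanos(is_adjusted_to_utc: bool, nanos_as_long: bool) -> str:
    """Returns the Spark type name expected for INT64 TIMESTAMP(NANOS)."""
    if nanos_as_long:
        # Legacy behavior from SPARK-40819 wins regardless of isAdjustedToUTC.
        return "LongType"
    if is_adjusted_to_utc:
        return "TimestampLTZNanosType(9)"
    return "TimestampNTZNanosType(9)"
```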

h4. 4. Non-vectorized read (ParquetRowConverter)

* Add ParquetPrimitiveConverter branches for TimeUnit.NANOS only on the row 
converter path (not ParquetVectorUpdaterFactory / vectorized reader).
* TimestampNTZNanosType: addLong(epochNanos) -> convert to TimestampNanosVal -> 
updater; no timestampRebaseFunc (same policy as TimestampNTZType + MICROS).
* TimestampLTZNanosType: addLong(epochNanos) -> decompose to epochMicros + 
nanosWithinMicro -> apply existing timestampRebaseFunc from ParquetRowConverter 
(DataSourceUtils.createTimestampRebaseFuncInRead / datetimeRebaseSpec) on 
epochMicros -> reassemble TimestampNanosVal. Do not add a separate rebase 
implementation.
* Wire updaters for nanos types in nested converters as needed.
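
The two read paths differ only in whether the microsecond part passes through the rebase function. A minimal Python sketch of that data flow is shown below; rebase_micros stands in for the existing timestampRebaseFunc and the function names are hypothetical.

```python
from typing import Callable, Tuple


def read_ntz_nanos(epoch_nanos: int) -> Tuple[int, int]:
    """NTZ path: split into (epochMicros, nanosWithinMicro), no rebase."""
    # divmod floors toward negative infinity, like Math.floorDiv / floorMod.
    return divmod(epoch_nanos, 1_000)


def read_ltz_nanos(epoch_nanos: int,
                   rebase_micros: Callable[[int], int]) -> Tuple[int, int]:
    """LTZ path: rebase only the microsecond part, then reassemble."""
    epoch_micros, nanos_within_micro = divmod(epoch_nanos, 1_000)
    # The sub-microsecond remainder is carried through unchanged.
    return rebase_micros(epoch_micros), nanos_within_micro
```

Because the rebase function operates on whole microseconds, the sub-microsecond remainder is unaffected by it, which is why no separate nanosecond rebase implementation is needed.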

h4. 5. Integration tests

* Force non-vectorized read: spark.sql.parquet.enableVectorizedReader=false 
(and legacy.parquet.nanosAsLong=false).
* Read LTZ and NTZ columns from fixtures; assert TimestampNanosVal matches 
precomputed oracle.
* Rebase: read the same file with datetimeRebaseMode LEGACY and CORRECTED for 
the LTZ column; behavior must match the microsecond LTZ Parquet rebase tests.
* SPARK-40819: with nanosAsLong=false, read succeeds and returns nanos types; 
with nanosAsLong=true, schema remains LongType.
* Values readable via getTimestampNTZNanos / getTimestampLTZNanos on collected 
rows.

h3. Acceptance criteria

* spark.read.parquet on committed external fixtures returns 
TimestampNTZNanosType and TimestampLTZNanosType columns when nanosAsLong is 
false.
* Non-vectorized path populates TimestampNanosVal with correct (epochMicros, 
nanosWithinMicro).
* LTZ columns use existing Parquet datetime rebase spec; NTZ columns do not 
rebase.
* Tests pass with the vectorized reader disabled; supporting the vectorized 
reader is not required in this issue.
* Behavior with spark.sql.legacy.parquet.nanosAsLong=true is unchanged (LongType).
* Microsecond TimestampType / TimestampNTZType Parquet behavior unchanged.

h3. Out of scope

* Parquet vectorized reader (ParquetVectorUpdaterFactory, 
VectorizedParquetRecordReader) — follow-up after columnar ColumnVector support 
for nanos types
* Parquet write of TIMESTAMP(NANOS) native types
* Cast matrix, string parsing, Dataset java.time encoders (SPARK-57032, 
SPARK-57033)
* INT96-as-timestamp nanos carrier (focus on TIMESTAMP(NANOS) INT64)
* Changing UnsafeRow 16-byte payload layout

h3. Depends on

* SPARK-56981 (physical row storage and TimestampNanosVal)

h3. Related

* SPARK-56969 (preview SQLConf gating, if analysis must be enabled for ad-hoc 
reads)
* SPARK-40819 (existing timestamp-nanos.parquet and nanosAsLong behavior)

h3. References

* ParquetSchemaConverter — TIMESTAMP(NANOS) handling today
* ParquetRowConverter — timestampRebaseFunc, TimestampNTZType / TimestampType 
MICROS precedents
* org.apache.spark.unsafe.types.TimestampNanosVal



