Xiaoxuan Li created SPARK-56159:
-----------------------------------
Summary: Support Nano Second Timestamp Data Types
Key: SPARK-56159
URL: https://issues.apache.org/jira/browse/SPARK-56159
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.2.0
Reporter: Xiaoxuan Li
Add two new nanosecond-precision timestamp types to Spark SQL:
{{TimestampNSType}} (with local timezone) and {{TimestampNTZNSType}} (without
timezone), following the same pattern as {{TimestampNTZType}} (SPARK-35662).
Both types store epoch nanoseconds as a single {{Long}} (INT64). This directly
matches Parquet {{TIMESTAMP(NANOS)}}, Arrow {{Timestamp(NANOSECOND)}},
Iceberg V3 {{timestamp_ns}}, DuckDB {{TIMESTAMP_NS}}, and ClickHouse
{{DateTime64(9)}}, so values can be read and written without unit conversion.
The INT64 representation fits in UnsafeRow's 8-byte fixed-width slot, requiring
no changes to the Tungsten memory layout or CodeGen.
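
As an illustration of the storage model described above, here is a minimal Python sketch of encoding a timestamp as epoch nanoseconds in one signed 64-bit value. The helper names ({{to_epoch_nanos}}, {{from_epoch_nanos}}, {{pack_int64}}) are hypothetical and are not part of Spark; the sketch only mirrors the "single Long" idea and is not the proposed implementation.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_nanos(dt, extra_nanos=0):
    """Hypothetical helper: tz-aware datetime plus 0..999 extra nanoseconds
    (Python datetimes only carry microseconds) -> epoch nanoseconds."""
    delta = dt - EPOCH
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * 1_000 + extra_nanos


def from_epoch_nanos(nanos):
    """Split epoch nanoseconds into a microsecond-precision datetime and the
    leftover nanoseconds. Floor division keeps the leftover in 0..999, also
    for instants before 1970."""
    micros, extra_nanos = divmod(nanos, 1_000)
    return EPOCH + timedelta(microseconds=micros), extra_nanos


def pack_int64(nanos):
    """Pack the value into 8 bytes (little-endian signed 64-bit), analogous
    to an 8-byte fixed-width slot. Raises struct.error if out of range."""
    return struct.pack("<q", nanos)
```

For example, 2024-01-01T00:00:00.123456 UTC with 789 extra nanoseconds encodes to 1704067200123456789, which occupies exactly 8 bytes.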
The representable range is roughly 1677 to 2262 (the same as the Parquet INT64
nanos specification). Users who need a wider range should use the existing
microsecond types (0001-9999).
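
The range quoted above follows from the bounds of a signed 64-bit integer counted in nanoseconds from 1970-01-01 UTC. A small Python sketch of that arithmetic (the function name is hypothetical; the result is floored to microseconds because Python's {{datetime}} has no nanosecond field):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def nanos_to_datetime_floor(nanos):
    """Hypothetical helper: epoch nanoseconds -> datetime, floored to microseconds."""
    return EPOCH + timedelta(microseconds=nanos // 1_000)


earliest = nanos_to_datetime_floor(INT64_MIN)
latest = nanos_to_datetime_floor(INT64_MAX)

# Prints the two endpoints of the representable range.
print(earliest.isoformat())
print(latest.isoformat())
```

The earliest instant falls in the year 1677 and the latest in 2262, which is where the ~1677-2262 figure comes from.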
We also support {{TIMESTAMP(p)}} parameterized SQL syntax: p=0..6 maps to
existing microsecond types, p=7..9 maps to the new nanosecond types.
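
The precision mapping can be sketched as a small lookup. This is only an illustration in Python; the function name and the returned labels are hypothetical, and the handling of out-of-range {{p}} (raising an error here) is an assumption rather than something this issue specifies.

```python
def timestamp_unit_for_precision(p):
    """Hypothetical helper: map the p in TIMESTAMP(p) to the storage unit.

    p = 0..6 -> existing microsecond types
    p = 7..9 -> new nanosecond types
    """
    if isinstance(p, bool) or not isinstance(p, int) or not 0 <= p <= 9:
        # Assumption: precisions outside 0..9 are rejected.
        raise ValueError(f"unsupported timestamp precision: {p!r}")
    return "micros" if p <= 6 else "nanos"
```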
h3. *Milestone 1 -- Core type system (TimestampNSType / TimestampNTZNSType
match or exceed all functionality of the existing TimestampType /
TimestampNTZType):*
* Add new DataType implementations for TimestampNSType and TimestampNTZNSType
* Support {{TIMESTAMP(p)}} parameterized SQL syntax (p=0..6 -> micros, p=7..9
-> nanos)
* TimestampNSType / TimestampNTZNSType literals
* TimestampNSType arithmetic (e.g. TimestampNSType - TimestampNSType,
TimestampNSType - Date)
* Datetime functions/operators: dayofweek, weekofyear, year, etc.
* Cast to and from TimestampNSType / TimestampNTZNSType (Long, String,
TimestampType, TimestampNTZType, DateType), with the SQL syntax to specify the
types
* Support sorting and hashing TimestampNSType / TimestampNTZNSType
* Type coercion rules for mixed-precision expressions
* {{timestamp_nanos()}} and {{unix_nanos()}} functions
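
Casting between the microsecond and nanosecond types comes down to scaling by 1000 with a range check. The Python sketch below shows that arithmetic under stated assumptions: the helper names are hypothetical, and this issue does not say whether a narrowing cast truncates or rounds, or whether an out-of-range value raises an error or yields NULL. Here narrowing floors and widening raises.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def micros_to_nanos(micros):
    """Hypothetical widening cast: epoch microseconds -> epoch nanoseconds.
    Assumption: values outside the INT64 nanosecond range raise an error."""
    nanos = micros * 1_000
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("timestamp is outside the nanosecond range")
    return nanos


def nanos_to_micros(nanos):
    """Hypothetical narrowing cast: epoch nanoseconds -> epoch microseconds.
    Assumption: sub-microsecond digits are dropped by flooring, so
    pre-1970 values move toward the earlier microsecond."""
    return nanos // 1_000
```

A microsecond timestamp in the year 9999 is valid for the existing types but overflows in {{micros_to_nanos}}, which is the practical consequence of the narrower nanosecond range.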
h3. *Milestone 2 -- Persistence:*
* Ability to create tables of type TimestampNSType / TimestampNTZNSType
* Ability to read/write Parquet files with {{TIMESTAMP(NANOS, true/false)}}
columns
* Ability to read/write ORC, CSV, JSON, Avro files with nanosecond timestamps
* Arrow type mapping: {{Timestamp(NANOSECOND)}} <-> TimestampNSType /
TimestampNTZNSType
* INSERT, SELECT, UPDATE, MERGE
* Configuration: {{spark.sql.parquet.inferTimestampNS.enabled}}
* Iceberg V3 {{TIMESTAMP_NANO}} type support
* Precision enforcement for {{TIMESTAMP(p)}} using CHAR/VARCHAR metadata
pattern
* Discovery (schema inference)
h3. *Milestone 3 -- Client support*
* JDBC support
* Hive Thrift server
* Spark Connect protocol support
h3. *Milestone 4 -- PySpark integration*
* Python UDF can take and return TimestampNSType / TimestampNTZNSType
* PySpark {{toPandas()}} / {{createDataFrame()}} nanosecond support
* Pandas UDF support
* DataFrame support
* Dataset/Encoder support