Max Gekk created SPARK-57166:
--------------------------------
Summary: Reject nanosecond-capable timestamp types in built-in
datasources and JDBC
Key: SPARK-57166
URL: https://issues.apache.org/jira/browse/SPARK-57166
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.3.0
Reporter: Max Gekk
h2. What
Until each datasource implements real read/write support for the
nanosecond-capable timestamp types ({{TimestampNTZNanosType}} /
{{TimestampLTZNanosType}}), make all built-in file datasources and JDBC
explicitly *reject* these types on both the read and write paths, with the
existing {{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}} error.
h2. Why
This is a sub-task of SPARK-56822 (SPIP: Timestamps with nanosecond precision).
The preview flag {{spark.sql.timestampNanosTypes.enabled}} is gated centrally in
{{TypeUtils.failUnsupportedDataType}} (throws {{FEATURE_NOT_ENABLED}} when off).
But when the flag is *on*, the nanos types extend {{DatetimeType}}, which itself
extends {{AtomicType}}, so every file source's {{case _: AtomicType => true}}
catch-all silently *accepts* them. The source then misbehaves at read/write time
because no real support exists yet, and users get confusing downstream failures
or silent precision issues instead of a clear, actionable error.
h2. Approach
Add an explicit arm before the {{AtomicType}} catch-all in each datasource's
{{supportDataType}} / {{supportsDataType}}:
{code}
// Nanosecond-capable timestamps are not yet supported by this datasource.
case _: TimestampNTZNanosType | _: TimestampLTZNanosType => false
{code}
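To illustrate why the arm has to come before the catch-all, here is a minimal, self-contained sketch. All types below are simplified stand-ins declared locally for the example (they are *not* the real Spark classes), and {{ArmOrderSketch}} and its two methods are hypothetical names:
{code:scala}
// Simplified stand-ins for the Spark type hierarchy (assumption: illustration only).
trait DataType
abstract class AtomicType extends DataType
abstract class DatetimeType extends AtomicType
case object IntegerType extends AtomicType
final case class TimestampNTZNanosType(precision: Int) extends DatetimeType
final case class TimestampLTZNanosType(precision: Int) extends DatetimeType
final case class ArrayType(elementType: DataType) extends DataType

object ArmOrderSketch {
  // Without the guard: the AtomicType catch-all matches the nanos types too.
  def supportDataTypeBefore(dt: DataType): Boolean = dt match {
    case _: AtomicType => true
    case ArrayType(elementType) => supportDataTypeBefore(elementType)
    case _ => false
  }

  // With the guard: the explicit arm is tried first, so nanos types are
  // rejected, including when nested inside a supported container type.
  def supportDataTypeAfter(dt: DataType): Boolean = dt match {
    case _: TimestampNTZNanosType | _: TimestampLTZNanosType => false
    case _: AtomicType => true
    case ArrayType(elementType) => supportDataTypeAfter(elementType)
    case _ => false
  }
}
{code}
In this sketch {{supportDataTypeBefore(TimestampNTZNanosType(9))}} is {{true}}, while {{supportDataTypeAfter}} returns {{false}} for the same type and for {{ArrayType(TimestampLTZNanosType(9))}}; {{IntegerType}} is still accepted by both.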
Read and write are both covered:
* V1 {{FileFormat.supportReadDataType}} defaults to {{supportDataType}}, so one
edit blocks both paths.
* V2 {{FileTable}} validates via a single {{supportsDataType}} for read and
write.
Rejection is *unconditional* (not flag-dependent): these sources do not support
nanos regardless of the preview flag; the flag only governs whether the type can
exist. As each source adds support later (e.g. Parquet read via SPARK-57102), it
carves out its own exception (e.g. by overriding {{supportReadDataType}}),
without conflicting with this guardrail.
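The interplay between the guardrail and a later per-source carve-out could look roughly like the sketch below. The {{FileFormat}} trait here is a two-method stand-in written for this example, and {{GuardedFormat}} / {{ReadCapableFormat}} are hypothetical names; the real trait has many more members and the real carve-out may be shaped differently:
{code:scala}
// Simplified stand-ins (assumption: illustration only, not the Spark classes).
trait DataType
case object IntegerType extends DataType
final case class TimestampNTZNanosType(precision: Int) extends DataType

trait FileFormat {
  def supportDataType(dt: DataType): Boolean = true
  // Read validation falls back to supportDataType unless a source overrides it.
  def supportReadDataType(dt: DataType): Boolean = supportDataType(dt)
}

// Guardrail only: a single edit to supportDataType blocks read and write.
object GuardedFormat extends FileFormat {
  override def supportDataType(dt: DataType): Boolean = dt match {
    case _: TimestampNTZNanosType => false
    case _ => true
  }
}

// A source that later gains read support overrides only the read check;
// the write path still goes through the unchanged guardrail.
object ReadCapableFormat extends FileFormat {
  override def supportDataType(dt: DataType): Boolean = dt match {
    case _: TimestampNTZNanosType => false
    case _ => true
  }
  override def supportReadDataType(dt: DataType): Boolean = dt match {
    case _: TimestampNTZNanosType => true
    case other => supportDataType(other)
  }
}
{code}
With these definitions, {{GuardedFormat}} rejects the nanos type through both methods, while {{ReadCapableFormat}} accepts it only via {{supportReadDataType}}.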
h2. Files to change
V1 {{FileFormat.supportDataType}}:
* {{ParquetFileFormat}}, {{OrcFileFormat}}, {{JsonFileFormat}},
{{XmlFileFormat}}
* {{CSVFileFormat}} - add to the private
{{supportDataType(dataType, allowVariant)}} (covers both {{supportDataType}}
and {{supportReadDataType}})
* {{AvroUtils.supportsDataType}} - single edit covers V1 {{AvroFileFormat}} and
V2 {{AvroTable}} (both delegate to it)
* {{sql/hive}} {{OrcFileFormat.supportDataType}} (Hive ORC serde)
V2 {{FileTable.supportsDataType}}:
* {{ParquetTable}}, {{OrcTable}}, {{JsonTable}}, {{CSVTable}}
No change needed:
* {{TextFileFormat}} / {{TextTable}} ({{StringType}} only) already reject nanos.
* JDBC read never yields nanos ({{getCatalystType}} maps {{TIMESTAMP}} to
micros).
JDBC write:
* Already fails fast - {{JdbcUtils.getCommonJDBCType}} returns {{None}} for
nanos, so {{getJdbcType}} throws {{cannotGetJdbcTypeError}}. No code change is
strictly required; add a test to lock the behavior. (Optional: add an explicit
nanos case for a clearer message - decide in review.)
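The fail-fast behavior on the JDBC write path can be pictured with the following simplified model. It only mirrors the {{Option}}-then-throw shape described above: the dialect lookup is omitted, the exception type and message are placeholders for {{cannotGetJdbcTypeError}}, and {{JdbcWriteSketch}} and the local types are stand-ins rather than the real {{JdbcUtils}} code:
{code:scala}
import java.sql.Types

// Simplified stand-ins (assumption: illustration only, not the Spark classes).
trait DataType
case object IntegerType extends DataType
final case class TimestampNTZNanosType(precision: Int) extends DataType
final case class JdbcType(databaseTypeDefinition: String, jdbcNullType: Int)

object JdbcWriteSketch {
  // Types without a common JDBC mapping yield None; nanos has no arm here.
  def getCommonJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case IntegerType => Some(JdbcType("INTEGER", Types.INTEGER))
    case _ => None
  }

  // The caller turns a missing mapping into an immediate error.
  // Placeholder exception standing in for cannotGetJdbcTypeError.
  def getJdbcType(dt: DataType): JdbcType =
    getCommonJDBCType(dt).getOrElse(
      throw new IllegalArgumentException(s"Can't get JDBC type for $dt"))
}
{code}
In this model, {{getJdbcType(TimestampNTZNanosType(9))}} throws before any row is written, which is the behavior the proposed test would lock in.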
h2. To verify during implementation
* {{XmlTable}} (V2) has no {{supportsDataType}} override and inherits
{{FileTable}}'s default {{true}}; confirm whether a writable V2 XML path
exists, and add a rejection if so.
* Confirm there is no separate Hive *Parquet* serde {{FileFormat}} with its own
{{supportDataType}} (only {{hive/orc/OrcFileFormat}} was found; native
{{ParquetFileFormat}} is reused for Hive Parquet).
* Ensure {{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} are imported in
each edited file.
h2. Tests
* Extend {{FileBasedDataSourceSuite}} mirroring the existing "Geospatial types
are not supported in file data sources other than Parquet" test: with
{{TIMESTAMP_NANOS_TYPES_ENABLED=true}}, iterate v1 and v2
({{USE_V1_SOURCE_LIST}}) over the built-in formats and assert both write and
read of a {{TIMESTAMP_NTZ(9)}} / {{TIMESTAMP_LTZ(9)}} column fail with
{{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}} ({{columnType}} rendered as
{{"TIMESTAMP_NTZ(9)"}} etc.).
** Build the nanos-typed column via a nanos {{Literal}}
({{Literal.create(new TimestampNanosVal(0L, 0.toShort), TimestampNTZNanosType(9))}})
rather than relying on {{CAST}} (cast support for nanos may be incomplete).
* Add an equivalent assertion in the Avro test suite (Avro is not in
{{allFileBasedDataSources}}).
* Add a JDBC write test (e.g. {{JDBCWriteSuite}}) asserting nanos columns are
rejected.
h2. Acceptance criteria
* With the preview flag enabled, writing or reading a column of
{{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} through Parquet, ORC,
Avro, JSON, CSV, XML (v1 and v2) and Hive ORC fails with
{{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}}.
* JDBC write of such a column fails with a clear error.
* Existing supported-type behavior is unchanged for all other types.