Max Gekk created SPARK-57166:
--------------------------------

             Summary: Reject nanosecond-capable timestamp types in built-in datasources and JDBC
                 Key: SPARK-57166
                 URL: https://issues.apache.org/jira/browse/SPARK-57166
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.3.0
            Reporter: Max Gekk


h2. What

Until each datasource implements real read/write support for the
nanosecond-capable timestamp types ({{TimestampNTZNanosType}} /
{{TimestampLTZNanosType}}), make all built-in file datasources and JDBC
explicitly *reject* these types on both the read and write paths, with the
existing {{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}} error.

h2. Why

This is a sub-task of SPARK-56822 (SPIP: Timestamps with nanosecond precision).

The preview flag {{spark.sql.timestampNanosTypes.enabled}} is gated centrally in
{{TypeUtils.failUnsupportedDataType}} (throws {{FEATURE_NOT_ENABLED}} when off).
But when the flag is *on*, nothing stops the nanos types at the datasource
level: they extend {{DatetimeType}}, which extends {{AtomicType}}, so every file
source's {{case _: AtomicType => true}} catch-all silently *accepts* them and
then misbehaves at read/write time (no real support exists yet). Users get
confusing downstream failures or silent precision issues instead of a clear,
actionable error.

h2. Approach

Add an explicit arm before the {{AtomicType}} catch-all in each datasource's
{{supportDataType}} / {{supportsDataType}}:

{code}
// Nanosecond-capable timestamps are not yet supported by this datasource.
case _: TimestampNTZNanosType | _: TimestampLTZNanosType => false
{code}

Read and write are both covered:
* V1 {{FileFormat.supportReadDataType}} defaults to {{supportDataType}}, so one
  edit blocks both paths.
* V2 {{FileTable}} validates via a single {{supportsDataType}} for read and
  write.

Rejection is *unconditional* (not flag-dependent): these sources do not support
nanos regardless of the preview flag; the flag only governs whether the type can
exist. As each source adds support later (e.g. Parquet read via SPARK-57102), it
carves out its own exception (e.g. by overriding {{supportReadDataType}}),
without conflicting with this guardrail.
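
To illustrate the arm ordering and the later carve-out, here is a minimal, self-contained sketch. It uses toy stand-ins for the type hierarchy and for {{FileFormat}} (the names {{ToyFileFormat}}, {{GuardedFormat}} and {{ReadCapableFormat}} are hypothetical and exist only for this example); it is not the actual Spark code, only the shape of the pattern match under those assumptions.

```scala
// Toy stand-ins for the type hierarchy; these are NOT the real Catalyst classes.
sealed trait DataType
trait AtomicType extends DataType
case object IntegerType extends AtomicType
case object StringType extends AtomicType
final case class TimestampNTZNanosType(precision: Int) extends AtomicType
final case class TimestampLTZNanosType(precision: Int) extends AtomicType
final case class ArrayType(elementType: DataType) extends DataType

// Simplified V1-style format: the read check defaults to the write check,
// so a single edit in supportDataType blocks both paths.
trait ToyFileFormat {
  def supportDataType(dataType: DataType): Boolean
  def supportReadDataType(dataType: DataType): Boolean = supportDataType(dataType)
}

object GuardedFormat extends ToyFileFormat {
  override def supportDataType(dataType: DataType): Boolean = dataType match {
    // Must come before the AtomicType catch-all; placed after it, this arm
    // would never be reached and nanos would be accepted.
    case _: TimestampNTZNanosType | _: TimestampLTZNanosType => false
    case _: AtomicType => true
    case ArrayType(elementType) => supportDataType(elementType)
  }
}

// Hypothetical later carve-out: reads accept nanos, writes stay rejected.
object ReadCapableFormat extends ToyFileFormat {
  override def supportDataType(dataType: DataType): Boolean =
    GuardedFormat.supportDataType(dataType)

  override def supportReadDataType(dataType: DataType): Boolean = dataType match {
    case _: TimestampNTZNanosType | _: TimestampLTZNanosType => true
    case ArrayType(elementType) => supportReadDataType(elementType)
    case other => supportDataType(other)
  }
}
```

In this sketch {{GuardedFormat}} rejects nanos on both paths through the inherited default, including when the type is nested inside an array, while {{ReadCapableFormat}} relaxes only the read path and leaves the write-side guardrail untouched.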

h2. Files to change

V1 {{FileFormat.supportDataType}}:
* {{ParquetFileFormat}}, {{OrcFileFormat}}, {{JsonFileFormat}},
  {{XmlFileFormat}}
* {{CSVFileFormat}} - add to the private
  {{supportDataType(dataType, allowVariant)}}
  (covers both {{supportDataType}} and {{supportReadDataType}})
* {{AvroUtils.supportsDataType}} - single edit covers V1 {{AvroFileFormat}} and
  V2 {{AvroTable}} (both delegate to it)
* {{sql/hive}} {{OrcFileFormat.supportDataType}} (Hive ORC serde)

V2 {{FileTable.supportsDataType}}:
* {{ParquetTable}}, {{OrcTable}}, {{JsonTable}}, {{CSVTable}}

No change needed:
* {{TextFileFormat}} / {{TextTable}} ({{StringType}} only) already reject nanos.
* JDBC read never yields nanos ({{getCatalystType}} maps {{TIMESTAMP}} to
  micros).

JDBC write:
* Already fails fast - {{JdbcUtils.getCommonJDBCType}} returns {{None}} for
  nanos, so {{getJdbcType}} throws {{cannotGetJdbcTypeError}}. No code change
  is strictly required; add a test to lock in the behavior. (Optional: add an
  explicit nanos case for a clearer message - decide in review.)
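
The fail-fast behavior described above follows an "{{Option}} lookup, then throw on {{None}}" shape. The sketch below reproduces only that shape with toy types; {{ToyJdbc}}, {{commonJdbcType}} and {{jdbcTypeOrFail}} are hypothetical names for this example, and the exception and message are placeholders rather than Spark's actual {{cannotGetJdbcTypeError}}.

```scala
object ToyJdbc {
  // Toy stand-ins; NOT the real Catalyst types.
  sealed trait DataType
  case object TimestampType extends DataType
  final case class TimestampNTZNanosType(precision: Int) extends DataType

  // Returns None when the type has no common JDBC equivalent.
  def commonJdbcType(dt: DataType): Option[String] = dt match {
    case TimestampType => Some("TIMESTAMP")
    case _ => None // nanos fall through here: no mapping exists
  }

  // Fails fast when no mapping exists (placeholder exception type and message).
  def jdbcTypeOrFail(dt: DataType): String =
    commonJdbcType(dt).getOrElse(
      throw new IllegalArgumentException(s"Can't get JDBC type for $dt"))
}
```

Because the nanos type has no entry in the mapping, the write is rejected before any statement is built; an explicit nanos arm would change only the error message, not the outcome.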

h2. To verify during implementation

* {{XmlTable}} (V2) has no {{supportsDataType}} override and inherits
  {{FileTable}}'s default {{true}}; confirm whether a writable V2 XML path
  exists and add a rejection if so.
* Confirm there is no separate Hive *Parquet* serde {{FileFormat}} with its own
  {{supportDataType}} (only {{hive/orc/OrcFileFormat}} was found; native
  {{ParquetFileFormat}} is reused for Hive Parquet).
* Ensure {{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} are imported in
  each edited file.

h2. Tests

* Extend {{FileBasedDataSourceSuite}} mirroring the existing "Geospatial types
  are not supported in file data sources other than Parquet" test: with
  {{TIMESTAMP_NANOS_TYPES_ENABLED=true}}, iterate v1 and v2
  ({{USE_V1_SOURCE_LIST}}) over the built-in formats and assert both write and
  read of a {{TIMESTAMP_NTZ(9)}} / {{TIMESTAMP_LTZ(9)}} column fail with
  {{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}} ({{columnType}} rendered as
  {{"TIMESTAMP_NTZ(9)"}} etc.).
** Build the nanos-typed column via a nanos {{Literal}}
   ({{Literal.create(new TimestampNanosVal(0L, 0.toShort), TimestampNTZNanosType(9))}})
   rather than relying on {{CAST}} (cast support for nanos may be incomplete).
* Add an equivalent assertion in the Avro test suite, since Avro is not in
  {{allFileBasedDataSources}}.
* Add a JDBC write test (e.g. {{JDBCWriteSuite}}) asserting nanos columns are
  rejected.

h2. Acceptance criteria

* With the preview flag enabled, writing or reading a column of
  {{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} through Parquet, ORC,
  Avro, JSON, CSV, XML (v1 and v2) and Hive ORC fails with
  {{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}}.
* JDBC write of such a column fails with a clear error.
* Existing supported-type behavior is unchanged for all other types.


