[
https://issues.apache.org/jira/browse/SPARK-57166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk reassigned SPARK-57166:
--------------------------------
Assignee: Max Gekk
> Reject nanosecond-capable timestamp types in built-in datasources and JDBC
> --------------------------------------------------------------------------
>
> Key: SPARK-57166
> URL: https://issues.apache.org/jira/browse/SPARK-57166
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: pull-request-available
>
> h2. What
> Until each datasource implements real read/write support for the
> nanosecond-capable timestamp types ({{TimestampNTZNanosType}} /
> {{TimestampLTZNanosType}}), make all built-in file datasources and JDBC
> explicitly *reject* these types on both the read and write paths, with the
> existing {{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}} error.
> h2. Why
> This is a sub-task of SPARK-56822 (SPIP: Timestamps with nanosecond
> precision).
> The preview flag {{spark.sql.timestampNanosTypes.enabled}} is gated centrally
> in {{TypeUtils.failUnsupportedDataType}}, which throws
> {{FEATURE_NOT_ENABLED}} when the flag is off.
> But when the flag is *on*, the nanos types extend {{DatetimeType}}, which in
> turn extends {{AtomicType}}, so every file source's
> {{case _: AtomicType => true}} catch-all silently *accepts* them and then
> misbehaves at read/write time (no real support exists yet). Users get
> confusing downstream failures or silent precision issues instead of a clear,
> actionable error.
> h2. Approach
> Add an explicit arm before the {{AtomicType}} catch-all in each datasource's
> {{supportDataType}} / {{supportsDataType}}:
> {code}
> // Nanosecond-capable timestamps are not yet supported by this datasource.
> case _: TimestampNTZNanosType | _: TimestampLTZNanosType => false
> {code}
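To illustrate why the position of the new arm matters, here is a minimal, self-contained sketch. The types below are simplified stand-ins defined only for this example (they are not Spark's real `org.apache.spark.sql.types` classes), and `supportDataTypeBefore` / `supportDataTypeAfter` are hypothetical names. It assumes only what the description states: the nanos types are `AtomicType`s, and the existing check has an `AtomicType` catch-all.

```scala
object NanosGuardSketch {
  // Simplified stand-ins for the type hierarchy (assumption, not Spark's classes).
  sealed trait DataType
  sealed trait AtomicType extends DataType
  sealed trait DatetimeType extends AtomicType
  case object StringType extends AtomicType
  case object TimestampType extends DatetimeType
  final case class TimestampNTZNanosType(precision: Int) extends DatetimeType
  final case class TimestampLTZNanosType(precision: Int) extends DatetimeType
  final case class ArrayType(elementType: DataType) extends DataType

  // Before: nanos types fall into the AtomicType catch-all and are accepted.
  def supportDataTypeBefore(dt: DataType): Boolean = dt match {
    case _: AtomicType => true
    case ArrayType(elementType) => supportDataTypeBefore(elementType)
  }

  // After: the explicit arm comes first, so it wins over the catch-all.
  def supportDataTypeAfter(dt: DataType): Boolean = dt match {
    case _: TimestampNTZNanosType | _: TimestampLTZNanosType => false
    case _: AtomicType => true
    case ArrayType(elementType) => supportDataTypeAfter(elementType)
  }
}
```

Because the check recurses into element types in this sketch, a nanos type nested inside an array is rejected as well; whether each real datasource recurses the same way should be confirmed per file.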
> Read and write are both covered:
> * V1 {{FileFormat.supportReadDataType}} defaults to {{supportDataType}}, so
> one edit blocks both paths.
> * V2 {{FileTable}} validates via a single {{supportsDataType}} for read and
> write.
> Rejection is *unconditional* (not flag-dependent): these sources do not
> support nanos regardless of the preview flag; the flag only governs whether
> the type can exist. As each source adds support later (e.g. Parquet read via
> SPARK-57102), it carves out its own exception (e.g. by overriding
> {{supportReadDataType}}), without conflicting with this guardrail.
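The delegation and the later carve-out described above can be sketched as follows. This is a toy model under stated assumptions: the `FileFormat` trait here is a stand-in that only mirrors the described default (the read check falls back to the write check), `NanosTimestamp` stands in for both nanos types, and `GuardedFormat` / `ReadCapableFormat` are hypothetical names.

```scala
object ReadWriteSketch {
  sealed trait DataType
  case object StringType extends DataType
  case object NanosTimestamp extends DataType // stand-in for both nanos types

  // Stand-in trait: the read check defaults to the general check.
  trait FileFormat {
    def supportDataType(dt: DataType): Boolean
    def supportReadDataType(dt: DataType): Boolean = supportDataType(dt)
  }

  // Guardrail only: a single edit rejects nanos on both read and write.
  object GuardedFormat extends FileFormat {
    def supportDataType(dt: DataType): Boolean = dt match {
      case NanosTimestamp => false
      case _ => true
    }
  }

  // Later carve-out: a source that gains read support overrides only the
  // read check, leaving the write-side rejection in place.
  object ReadCapableFormat extends FileFormat {
    def supportDataType(dt: DataType): Boolean =
      GuardedFormat.supportDataType(dt)
    override def supportReadDataType(dt: DataType): Boolean = dt match {
      case NanosTimestamp => true
      case other => supportDataType(other)
    }
  }
}
```

In this model the carve-out never touches the guardrail arm itself, which is the property the description relies on when it says later support does not conflict with this change.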
> h2. Files to change
> V1 {{FileFormat.supportDataType}}:
> * {{ParquetFileFormat}}, {{OrcFileFormat}}, {{JsonFileFormat}},
> {{XmlFileFormat}}
> * {{CSVFileFormat}} - add to the private
> {{supportDataType(dataType, allowVariant)}}
> (covers both {{supportDataType}} and {{supportReadDataType}})
> * {{AvroUtils.supportsDataType}} - single edit covers V1 {{AvroFileFormat}}
> and V2 {{AvroTable}} (both delegate to it)
> * {{sql/hive}} {{OrcFileFormat.supportDataType}} (Hive ORC serde)
> V2 {{FileTable.supportsDataType}}:
> * {{ParquetTable}}, {{OrcTable}}, {{JsonTable}}, {{CSVTable}}
> No change needed:
> * {{TextFileFormat}} / {{TextTable}} ({{StringType}} only) already reject
> nanos.
> * JDBC read never yields nanos ({{getCatalystType}} maps {{TIMESTAMP}} to
> micros).
> JDBC write:
> * Already fails fast - {{JdbcUtils.getCommonJDBCType}} returns {{None}} for
> nanos, so {{getJdbcType}} throws {{cannotGetJdbcTypeError}}. No code change
> is strictly required; add a test to lock in the behavior. (Optional: add an
> explicit nanos case for a clearer message - decide in review.)
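The fail-fast behavior on the JDBC write path can be pictured with the following self-contained sketch. It is a simplified model, not the actual `JdbcUtils` code: the functions reuse the names from the description for readability, but they return a plain `String` type name instead of Spark's JDBC type object, ignore dialects, and throw `IllegalArgumentException` as a placeholder for `cannotGetJdbcTypeError`.

```scala
object JdbcWriteSketch {
  sealed trait DataType
  case object StringType extends DataType
  case object TimestampType extends DataType
  case object NanosTimestamp extends DataType // stand-in for both nanos types

  // Stand-in for the common mapping: None when no JDBC equivalent is known.
  def getCommonJDBCType(dt: DataType): Option[String] = dt match {
    case StringType => Some("TEXT")
    case TimestampType => Some("TIMESTAMP")
    case _ => None // nanos types have no mapping, so they end up here
  }

  // A missing mapping is turned into an immediate error (placeholder exception).
  def getJdbcType(dt: DataType): String =
    getCommonJDBCType(dt).getOrElse(
      throw new IllegalArgumentException(s"Can't get JDBC type for $dt"))
}
```

An explicit nanos arm, as floated in the optional note, would only change the message; the failure would still occur before any data is written.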
> h2. To verify during implementation
> * {{XmlTable}} (V2) has no {{supportsDataType}} override and inherits
> {{FileTable}}'s default {{true}}; confirm whether a writable V2 XML path
> exists and add a rejection if so.
> * Confirm there is no separate Hive *Parquet* serde {{FileFormat}} with its
> own {{supportDataType}} (only {{hive/orc/OrcFileFormat}} was found; native
> {{ParquetFileFormat}} is reused for Hive Parquet).
> * Ensure {{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} are imported in
> each edited file.
> h2. Tests
> * Extend {{FileBasedDataSourceSuite}} mirroring the existing "Geospatial types
> are not supported in file data sources other than Parquet" test: with
> {{TIMESTAMP_NANOS_TYPES_ENABLED=true}}, iterate v1 and v2
> ({{USE_V1_SOURCE_LIST}}) over the built-in formats and assert both write and
> read of a {{TIMESTAMP_NTZ(9)}} / {{TIMESTAMP_LTZ(9)}} column fail with
> {{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}} ({{columnType}} rendered as
> {{"TIMESTAMP_NTZ(9)"}} etc.).
> ** Build the nanos-typed column via a nanos {{Literal}}
> ({{Literal.create(new TimestampNanosVal(0L, 0.toShort), TimestampNTZNanosType(9))}})
> rather than relying on {{CAST}} (cast support for nanos may be incomplete).
> * Add an equivalent assertion in the Avro test suite (Avro is not part of
> {{allFileBasedDataSources}}).
> * Add a JDBC write test (e.g. {{JDBCWriteSuite}}) asserting nanos columns are
> rejected.
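As a rough aid for sizing the shared test, the scenario matrix can be enumerated as below. This is only a sketch: the format list is an assumption taken from the acceptance criteria (Avro excluded because it is tested separately), and `Scenario`, `failures`, and the `run` hook are hypothetical names, not part of any Spark test utility.

```scala
object TestMatrixSketch {
  // Assumed formats for the shared suite; Avro is covered by its own suite.
  val formats: Seq[String] = Seq("parquet", "orc", "json", "csv", "xml")
  val columnTypes: Seq[String] = Seq("TIMESTAMP_NTZ(9)", "TIMESTAMP_LTZ(9)")
  val expectedCondition: String = "UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE"

  sealed trait Path
  case object Write extends Path
  case object Read extends Path

  final case class Scenario(
      format: String, useV1: Boolean, columnType: String, path: Path)

  // Every combination of format, v1/v2, nanos type, and read/write path.
  val scenarios: Seq[Scenario] = for {
    format <- formats
    useV1 <- Seq(true, false)
    columnType <- columnTypes
    path <- Seq[Path](Write, Read)
  } yield Scenario(format, useV1, columnType, path)

  // `run` is a hypothetical hook returning the error condition raised, if any.
  // The result lists the scenarios that did NOT fail with the expected error.
  def failures(run: Scenario => Option[String]): Seq[Scenario] =
    scenarios.filterNot(s => run(s).contains(expectedCondition))
}
```

In the real suite, the role of `run` would be played by the actual write or read attempt under the relevant configuration, with the error condition taken from the thrown exception.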
> h2. Acceptance criteria
> * With the preview flag enabled, writing or reading a column of
> {{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} through Parquet, ORC,
> Avro, JSON, CSV, XML (v1 and v2) and Hive ORC fails with
> {{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}}.
> * JDBC write of such a column fails with a clear error.
> * Existing supported-type behavior is unchanged for all other types.