[ 
https://issues.apache.org/jira/browse/SPARK-57166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-57166:
--------------------------------

    Assignee: Max Gekk

> Reject nanosecond-capable timestamp types in built-in datasources and JDBC
> --------------------------------------------------------------------------
>
>                 Key: SPARK-57166
>                 URL: https://issues.apache.org/jira/browse/SPARK-57166
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>
> h2. What
> Until each datasource implements real read/write support for the
> nanosecond-capable timestamp types ({{TimestampNTZNanosType}} /
> {{TimestampLTZNanosType}}), make all built-in file datasources and JDBC
> explicitly *reject* these types on both the read and write paths, with the
> existing {{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}} error.
> h2. Why
> This is a sub-task of SPARK-56822 (SPIP: Timestamps with nanosecond
> precision).
> The preview flag {{spark.sql.timestampNanosTypes.enabled}} is gated centrally
> in {{TypeUtils.failUnsupportedDataType}} (throws {{FEATURE_NOT_ENABLED}} when
> off).
> But when the flag is *on*, the nanos types extend {{DatetimeType}}, which
> itself extends {{AtomicType}}, so every file source's
> {{case _: AtomicType => true}} catch-all silently *accepts* them and then
> misbehaves at read/write time (no real support exists yet). Users get
> confusing downstream failures or silent precision issues instead of a clear,
> actionable error.
> h2. Approach
> Add an explicit arm before the {{AtomicType}} catch-all in each datasource's
> {{supportDataType}} / {{supportsDataType}}:
> {code}
> // Nanosecond-capable timestamps are not yet supported by this datasource.
> case _: TimestampNTZNanosType | _: TimestampLTZNanosType => false
> {code}
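
To illustrate why the new arm has to come before the catch-all, here is a minimal, self-contained sketch. The type hierarchy below is a stand-in written for this example (it only mirrors the shape described in the issue and is not Spark's real class hierarchy); the method names supportDataTypeBefore and supportDataTypeAfter are hypothetical.

```scala
// Stand-in types for illustration only; these are NOT Spark's classes.
object NanosGuardSketch {
  trait DataType
  trait AtomicType extends DataType
  trait DatetimeType extends AtomicType
  case object StringType extends AtomicType
  case object TimestampType extends DatetimeType
  final case class TimestampNTZNanosType(precision: Int) extends DatetimeType
  final case class TimestampLTZNanosType(precision: Int) extends DatetimeType
  final case class ArrayType(elementType: DataType) extends DataType

  // Shape of the check before the change: the AtomicType catch-all
  // also matches the nanos types, so they are accepted.
  def supportDataTypeBefore(dt: DataType): Boolean = dt match {
    case _: AtomicType => true
    case ArrayType(elementType) => supportDataTypeBefore(elementType)
    case _ => false
  }

  // Shape of the check after the change: the explicit arm is tried first,
  // so the nanos types never reach the catch-all.
  def supportDataTypeAfter(dt: DataType): Boolean = dt match {
    case _: TimestampNTZNanosType | _: TimestampLTZNanosType => false
    case _: AtomicType => true
    case ArrayType(elementType) => supportDataTypeAfter(elementType)
    case _ => false
  }
}
```

In this sketch the rejection also applies to nanos types nested inside an array, because the recursive call goes through the same match; whether each real datasource recurses the same way is something to check per file.
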
> Read and write are both covered:
> * V1 {{FileFormat.supportReadDataType}} defaults to {{supportDataType}}, so
>   one edit blocks both paths.
> * V2 {{FileTable}} validates via a single {{supportsDataType}} for read and
>   write.
> Rejection is *unconditional* (not flag-dependent): these sources do not
> support nanos regardless of the preview flag; the flag only governs whether
> the type can exist. As each source adds support later (e.g. Parquet read via
> SPARK-57102), it carves out its own exception (e.g. by overriding
> {{supportReadDataType}}), without conflicting with this guardrail.
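
The delegation and the later carve-out described above could look roughly like the sketch below. FileFormatSketch, GuardedFormat and ReadCapableFormat are hypothetical names invented for this example, and the types are stand-ins rather than Spark's classes; only the relationship between supportDataType and supportReadDataType is taken from the issue text.

```scala
// Stand-in types and trait for illustration only; NOT Spark's classes.
object ReadWriteGuardSketch {
  trait DataType
  trait AtomicType extends DataType
  case object StringType extends AtomicType
  final case class TimestampNTZNanosType(precision: Int) extends AtomicType

  // Hypothetical stand-in for the V1 contract: the read-side check
  // falls back to the shared check unless a format overrides it.
  trait FileFormatSketch {
    def supportDataType(dt: DataType): Boolean
    def supportReadDataType(dt: DataType): Boolean = supportDataType(dt)
  }

  // A format with only the guardrail: one edit rejects read and write.
  object GuardedFormat extends FileFormatSketch {
    override def supportDataType(dt: DataType): Boolean = dt match {
      case _: TimestampNTZNanosType => false
      case _: AtomicType => true
      case _ => false
    }
  }

  // A format that later gains read support: it overrides only the
  // read-side check, so the write path stays rejected.
  object ReadCapableFormat extends FileFormatSketch {
    override def supportDataType(dt: DataType): Boolean =
      GuardedFormat.supportDataType(dt)

    override def supportReadDataType(dt: DataType): Boolean = dt match {
      case _: TimestampNTZNanosType => true
      case other => supportDataType(other)
    }
  }
}
```

The point of the sketch is that the carve-out lives in the read-side override, so the shared guardrail arm does not need to be removed when a single path gains support.
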
> h2. Files to change
> V1 {{FileFormat.supportDataType}}:
> * {{ParquetFileFormat}}, {{OrcFileFormat}}, {{JsonFileFormat}},
>   {{XmlFileFormat}}
> * {{CSVFileFormat}} - add to the private
>   {{supportDataType(dataType, allowVariant)}}
>   (covers both {{supportDataType}} and {{supportReadDataType}})
> * {{AvroUtils.supportsDataType}} - single edit covers V1 {{AvroFileFormat}}
>   and V2 {{AvroTable}} (both delegate to it)
> * {{sql/hive}} {{OrcFileFormat.supportDataType}} (Hive ORC serde)
> V2 {{FileTable.supportsDataType}}:
> * {{ParquetTable}}, {{OrcTable}}, {{JsonTable}}, {{CSVTable}}
> No change needed:
> * {{TextFileFormat}} / {{TextTable}} ({{StringType}} only) already reject
>   nanos.
> * JDBC read never yields nanos ({{getCatalystType}} maps {{TIMESTAMP}} to
>   micros).
> JDBC write:
> * Already fails fast - {{JdbcUtils.getCommonJDBCType}} returns {{None}} for
>   nanos, so {{getJdbcType}} throws {{cannotGetJdbcTypeError}}. No code change
>   strictly required; add a test to lock the behavior. (Optional: add an
>   explicit nanos case for a clearer message - decide in review.)
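
The fail-fast behavior on the JDBC write path can be pictured with the following sketch. It is a simplified model under stated assumptions: the types, the JdbcType case class and both functions are stand-ins defined here, any dialect-specific lookup is left out, and the exception type and message are placeholders rather than what cannotGetJdbcTypeError actually produces.

```scala
import java.sql.Types

// Stand-in types and functions for illustration only; NOT Spark's JdbcUtils.
object JdbcFailFastSketch {
  trait DataType
  case object IntegerType extends DataType
  case object TimestampType extends DataType
  final case class TimestampNTZNanosType(precision: Int) extends DataType

  final case class JdbcType(databaseTypeDefinition: String, jdbcNullType: Int)

  // Known types map to a JDBC type; anything unlisted, including the
  // nanos types, falls through to None.
  def getCommonJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case IntegerType => Some(JdbcType("INTEGER", Types.INTEGER))
    case TimestampType => Some(JdbcType("TIMESTAMP", Types.TIMESTAMP))
    case _ => None
  }

  // The lookup turns None into an immediate error, so a nanos column
  // is rejected before any row is written. The exception is a placeholder.
  def getJdbcType(dt: DataType): JdbcType =
    getCommonJDBCType(dt).getOrElse(
      throw new IllegalArgumentException(s"Cannot get JDBC type for $dt"))
}
```

An explicit nanos arm, as floated in the optional note above, would only change which error is raised; the rejection itself already follows from the None fall-through.
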
> h2. To verify during implementation
> * {{XmlTable}} (V2) has no {{supportsDataType}} override and inherits
>   {{FileTable}}'s default {{true}}; confirm whether a writable V2 XML path
>   exists and add a rejection if so.
> * Confirm there is no separate Hive *Parquet* serde {{FileFormat}} with its
>   own {{supportDataType}} (only {{hive/orc/OrcFileFormat}} was found; native
>   {{ParquetFileFormat}} is reused for Hive Parquet).
> * Ensure {{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} are imported in
>   each edited file.
> h2. Tests
> * Extend {{FileBasedDataSourceSuite}} mirroring the existing "Geospatial types
>   are not supported in file data sources other than Parquet" test: with
>   {{TIMESTAMP_NANOS_TYPES_ENABLED=true}}, iterate v1 and v2
>   ({{USE_V1_SOURCE_LIST}}) over the built-in formats and assert both write and
>   read of a {{TIMESTAMP_NTZ(9)}} / {{TIMESTAMP_LTZ(9)}} column fail with
>   {{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}} ({{columnType}} rendered as
>   {{"TIMESTAMP_NTZ(9)"}} etc.).
> ** Build the nanos-typed column via a nanos {{Literal}}
>    ({{Literal.create(new TimestampNanosVal(0L, 0.toShort), TimestampNTZNanosType(9))}})
>    rather than relying on {{CAST}} (cast support for nanos may be incomplete).
> * Add an equivalent assertion in the Avro test suite (not in
>   {{allFileBasedDataSources}}).
> * Add a JDBC write test (e.g. {{JDBCWriteSuite}}) asserting nanos columns are
>   rejected.
> h2. Acceptance criteria
> * With the preview flag enabled, writing or reading a column of
>   {{TimestampNTZNanosType}} / {{TimestampLTZNanosType}} through Parquet, ORC,
>   Avro, JSON, CSV, XML (v1 and v2) and Hive ORC fails with
>   {{UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE}}.
> * JDBC write of such a column fails with a clear error.
> * Existing supported-type behavior is unchanged for all other types.


