[ 
https://issues.apache.org/jira/browse/SPARK-57164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-57164:
-----------------------------
    Description: 
h2. What

Add focused test coverage asserting that the nanosecond-capable timestamp
spellings ({{TIMESTAMP_NTZ(p)}}, {{TIMESTAMP_LTZ(p)}}, and the
{{TIMESTAMP(p) WITH[OUT] [LOCAL] TIME ZONE}} aliases, p in [7, 9]) parse
consistently across every public string-to-DataType entry point, and that
out-of-range precisions are rejected identically everywhere.

h2. Why

This is a sub-task of SPARK-56822 (SPIP: Timestamps with nanosecond precision).

Spark parses data-type strings through two independent parser families:

* *Family A - ANTLR {{DataTypeAstBuilder}}*: the bare/zoned {{TIMESTAMP(p)}}
  handling lives in one place, but it is reached through many distinct public
  surfaces (see below). Each surface is a separate user-facing contract.
* *Family B - JSON {{nameToType}}* in {{DataType.scala}}: a second,
  hand-maintained parser with its own {{TIMESTAMP_LTZ_NANOS_TYPE}} /
  {{TIMESTAMP_NTZ_NANOS_TYPE}} regex branches. This is where precision/error
  semantics can silently drift from Family A.

Today the nanos parsing is exercised mainly via
{{CatalystSqlParser.parseDataType}} in {{DataTypeParserSuite}}. The other public
entry points have no explicit assertions, so a regression on any one of them
(or drift between Family A and Family B) would go unnoticed.
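
To make the two families concrete, here is a minimal sketch that sends the same type through each parser. It deliberately uses the already-released {{TimestampNTZType}} rather than a nanos type, because the nanos classes sit behind the preview flag. It assumes {{spark-catalyst}} (Spark 3.4 or later) is on the classpath, and the object name {{ParserFamiliesSketch}} is hypothetical.

{code:scala}
import org.apache.spark.sql.types.{DataType, TimestampNTZType}

// Hypothetical holder object, for illustration only.
object ParserFamiliesSketch {
  // Family A: DDL text, handled by the ANTLR-based data type parser.
  val fromDdl: DataType = DataType.fromDDL("TIMESTAMP_NTZ")

  // Family B: the JSON form, handled by the name-based JSON parser.
  val fromJson: DataType = DataType.fromJson(TimestampNTZType.json)

  // The two families agree when they return the same DataType.
  val agree: Boolean = fromDdl == fromJson && fromDdl == TimestampNTZType
}
{code}

The tests requested here would presumably have the same shape, with the nanos spellings as input and the preview flag enabled.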

h2. Entry points to cover

Family A (ANTLR {{DataTypeAstBuilder}}):
* {{DataType.fromDDL}} and {{StructType.fromDDL}}
* {{StructType.add(name, "TIMESTAMP_NTZ(9)")}}
* {{Column.cast(String)}} and {{Column.try_cast(String)}}
* {{DataFrameReader.schema(String)}} (and {{DataStreamReader.schema(String)}})
* {{SparkSession.sessionState.sqlParser.parseDataType(String)}} - the
  programmatic catalog-string entry point
* DDL/SQL schema strings passed to {{from_json}} / {{from_csv}} / {{from_xml}}
  (XML is a built-in datasource; {{from_xml}} takes a schema string just like
  the other two)
* SQL via the full {{AstBuilder}}: {{CAST(x AS TIMESTAMP_NTZ(9))}},
  {{TRY_CAST(x AS TIMESTAMP_LTZ(9))}}, {{CREATE TABLE ... c TIMESTAMP_LTZ(7)}},
  {{ALTER TABLE ... ADD COLUMNS (c TIMESTAMP_NTZ(9))}},
  {{ALTER TABLE ... ALTER COLUMN}}, and a column {{DEFAULT}} declared with a
  nanos type

Shared wrapper (bridges Family A and Family B):
* {{DataType.parseTypeWithFallback}} - the DDL-then-JSON fallback used by
  {{DataFrameReader.schema(String)}} and the {{from_*}} expressions. Asserting
  it directly is the single best guard against Family A and Family B drifting.

Family B (JSON):
* {{DataType.fromJson}} / {{DataTypeJsonUtils}} round-trip
  ({{typeName}}/{{json}} <-> {{DataType}})
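
As a rough illustration of the shared wrapper listed above, the sketch below calls {{DataType.parseTypeWithFallback}} with a DDL parser first and a JSON parser as the fallback. It again uses the released {{TimestampNTZType}} as a stand-in for the nanos types. It assumes {{spark-catalyst}} is on the classpath and that the method is accessible from the calling package. The parser order is chosen here to match the "DDL-then-JSON" wording and may differ from what individual call sites pass. {{FallbackSketch}} and {{parse}} are hypothetical names.

{code:scala}
import org.apache.spark.sql.types.{DataType, TimestampNTZType}

// Hypothetical holder object, for illustration only.
object FallbackSketch {
  // Try the DDL parser first; if it throws, try the JSON parser.
  def parse(schema: String): DataType =
    DataType.parseTypeWithFallback(
      schema,
      DataType.fromDDL,
      fallbackParser = DataType.fromJson)

  // Resolved by the primary (DDL) parser.
  val viaDdl: DataType = parse("TIMESTAMP_NTZ")

  // A JSON-encoded type name is not valid DDL, so this goes through the fallback.
  val viaJson: DataType = parse(TimestampNTZType.json)
}
{code}

A direct assertion that both paths return the same type is what would catch drift between the two families.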

h2. Acceptance criteria

* For p in {7, 8, 9}, every entry point above resolves:
** {{TIMESTAMP_NTZ(p)}} -> {{TimestampNTZNanosType(p)}}
** {{TIMESTAMP_LTZ(p)}} -> {{TimestampLTZNanosType(p)}}
** {{TIMESTAMP(p) WITHOUT TIME ZONE}} -> {{TimestampNTZNanosType(p)}}
** {{TIMESTAMP(p) WITH LOCAL TIME ZONE}} -> {{TimestampLTZNanosType(p)}}
** {{TIMESTAMP(p)}} (bare) -> {{TimestampLTZNanosType(p)}} or
   {{TimestampNTZNanosType(p)}}, depending on {{spark.sql.timestampType}}
   (assert both config values)
* All entry points reject out-of-range precision (e.g. {{(6)}}, {{(10)}})
  with {{INVALID_TIMESTAMP_PRECISION}}, with identical parameters across
  Family A and Family B. (If the separate {{TIMESTAMP_*(6)}} mapping task has
  landed, update the {{(6)}} expectations to the microsecond types instead.)
* All entry points reject the spellings with {{FEATURE_NOT_ENABLED}} when
  {{spark.sql.timestampNanosTypes.enabled = false}}.
* A round-trip test confirms Family B agrees with Family A:
  {{DataType.fromJson(t.json)}} == {{t}} for the nanos types, and the
  {{typeName}} of a nanos type re-parses to the same type.
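
One way to keep these criteria identical across entry points is to generate the inputs from a single table and reuse it in every suite. The sketch below is plain Scala with no Spark dependency. It only builds the input strings and the expected type names as text, so it makes no assumption about the constructors of the nanos classes. The object and value names are hypothetical, and the bare {{TIMESTAMP(p)}} spelling is left out because its result depends on {{spark.sql.timestampType}}.

{code:scala}
// Hypothetical helper, for illustration only.
object NanosSpellingMatrix {
  // (spelling template, expected type name) pairs from the acceptance criteria.
  private val explicitSpellings: Seq[(String, String)] = Seq(
    "TIMESTAMP_NTZ(%d)" -> "TimestampNTZNanosType",
    "TIMESTAMP_LTZ(%d)" -> "TimestampLTZNanosType",
    "TIMESTAMP(%d) WITHOUT TIME ZONE" -> "TimestampNTZNanosType",
    "TIMESTAMP(%d) WITH LOCAL TIME ZONE" -> "TimestampLTZNanosType")

  val validPrecisions: Seq[Int] = 7 to 9

  // Per the criteria, (6) moves out of this list if the separate
  // TIMESTAMP_*(6) mapping task has landed.
  val invalidPrecisions: Seq[Int] = Seq(6, 10)

  // (input string, expected type rendered as text) for each valid precision.
  val accepted: Seq[(String, String)] = for {
    p <- validPrecisions
    (template, typeName) <- explicitSpellings
  } yield (template.format(p), s"$typeName($p)")

  // Inputs expected to fail with INVALID_TIMESTAMP_PRECISION.
  val rejected: Seq[String] = for {
    p <- invalidPrecisions
    (template, _) <- explicitSpellings
  } yield template.format(p)
}
{code}

Each suite could then loop over {{accepted}} and {{rejected}} and call its own entry point, so that adding a spelling or precision updates every entry point's coverage at once.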

h2. Where to add tests

* {{sql/catalyst/.../parser/DataTypeParserSuite.scala}} - {{fromDDL}},
  {{StructType.fromDDL}}, {{StructType.add(String)}}.
* {{sql/catalyst/.../types/DataTypeSuite.scala}} - {{fromJson}}/{{json}}
  round-trip (Family B).
* {{Column.cast(String)}} / {{DataFrameReader.schema(String)}} /
  {{from_json}} / {{from_csv}} / {{from_xml}} DDL-schema cases in the
  appropriate {{sql/core}} suite (gated by the preview flag via {{withSQLConf}}).
* {{DataType.parseTypeWithFallback}} direct assertions (DDL path and JSON
  fallback).

h2. Out of scope

* Behavior changes. This task only adds assertions for the current contract
  (any intended behavior change for {{p}} = 6 is handled by its own task).
* Spark Connect proto conversion (tracked separately under SPARK-57160 /
  SPARK-57161).
* Related parse entry points that flow through the same parser but whose
  datasources reject nanos today; covered by their own tasks: JDBC
  {{customSchema}} option (SPARK-57460), ORC catalyst-type attribute
  round-trip (SPARK-57455), and Hive metastore type strings.

h2. Notes for first-time contributors

Good first issue - test-only, no production code changes. Enable the preview
flag in tests with:

{code}
withSQLConf(SQLConf.TIMESTAMP_NANOS_TYPES_ENABLED.key -> "true") { ... }
{code}

Run the affected catalyst suites with SBT:

{code}
build/sbt 'catalyst/testOnly *DataTypeParserSuite *DataTypeSuite'
{code}

  was:
h2. What

Add focused test coverage asserting that the nanosecond-capable timestamp
spellings ({{TIMESTAMP_NTZ(p)}}, {{TIMESTAMP_LTZ(p)}}, and the
{{TIMESTAMP(p) WITH[OUT] [LOCAL] TIME ZONE}} aliases, p in [7, 9]) parse
consistently across every public string-to-DataType entry point, and that
out-of-range precisions are rejected identically everywhere.

h2. Why

This is a sub-task of SPARK-56822 (SPIP: Timestamps with nanosecond precision).

Spark parses data-type strings through two independent parser families:

* *Family A - ANTLR {{DataTypeAstBuilder}}*: the bare/zoned {{TIMESTAMP(p)}}
  handling lives in one place, but it is reached through many distinct public
  surfaces (see below). Each surface is a separate user-facing contract.
* *Family B - JSON {{nameToType}}* in {{DataType.scala}}: a second,
  hand-maintained parser with its own {{TIMESTAMP_LTZ_NANOS_TYPE}} /
  {{TIMESTAMP_NTZ_NANOS_TYPE}} regex branches. This is where precision/error
  semantics can silently drift from Family A.

Today the nanos parsing is exercised mainly via
{{CatalystSqlParser.parseDataType}} in {{DataTypeParserSuite}}. The other public
entry points have no explicit assertions, so a regression on any one of them
(or drift between Family A and Family B) would go unnoticed.

h2. Entry points to cover

Family A (ANTLR {{DataTypeAstBuilder}}):
* {{DataType.fromDDL}} and {{StructType.fromDDL}}
* {{StructType.add(name, "TIMESTAMP_NTZ(9)")}}
* {{Column.cast(String)}} and {{Column.try_cast(String)}}
* {{DataFrameReader.schema(String)}} (and {{DataStreamReader.schema(String)}})
* DDL/SQL schema strings passed to {{from_json}} / {{from_csv}}
* SQL via the full {{AstBuilder}}: {{CAST(x AS TIMESTAMP_NTZ(9))}},
  {{CREATE TABLE ... c TIMESTAMP_LTZ(7)}}

Family B (JSON):
* {{DataType.fromJson}} / {{DataTypeJsonUtils}} round-trip
  ({{typeName}}/{{json}} <-> {{DataType}})

h2. Acceptance criteria

* For p in {7, 8, 9}, every entry point above resolves:
** {{TIMESTAMP_NTZ(p)}} -> {{TimestampNTZNanosType(p)}}
** {{TIMESTAMP_LTZ(p)}} -> {{TimestampLTZNanosType(p)}}
** {{TIMESTAMP(p) WITHOUT TIME ZONE}} -> {{TimestampNTZNanosType(p)}}
** {{TIMESTAMP(p) WITH LOCAL TIME ZONE}} -> {{TimestampLTZNanosType(p)}}
* All entry points reject out-of-range precision (e.g. {{(6)}}, {{(10)}})
  with {{INVALID_TIMESTAMP_PRECISION}}, with identical parameters across
  Family A and Family B. (If the separate {{TIMESTAMP_*(6)}} mapping task has
  landed, update the {{(6)}} expectations to the microsecond types instead.)
* All entry points reject the spellings with {{FEATURE_NOT_ENABLED}} when
  {{spark.sql.timestampNanosTypes.enabled = false}}.
* A round-trip test confirms Family B agrees with Family A:
  {{DataType.fromJson(t.json)}} == {{t}} for the nanos types, and the
  {{typeName}} of a nanos type re-parses to the same type.

h2. Where to add tests

* {{sql/catalyst/.../parser/DataTypeParserSuite.scala}} - {{fromDDL}},
  {{StructType.fromDDL}}, {{StructType.add(String)}}.
* {{sql/catalyst/.../types/DataTypeSuite.scala}} - {{fromJson}}/{{json}}
  round-trip (Family B).
* {{Column.cast(String)}} / {{DataFrameReader.schema(String)}} /
  {{from_json}} DDL-schema cases in the appropriate {{sql/core}} suite
  (gated by the preview flag via {{withSQLConf}}).

h2. Out of scope

* Behavior changes. This task only adds assertions for the current contract
  (any intended behavior change for {{p}} = 6 is handled by its own task).
* Spark Connect proto conversion (tracked separately under SPARK-57160 /
  SPARK-57161).

h2. Notes for first-time contributors

Good first issue - test-only, no production code changes. Enable the preview
flag in tests with:

{code}
withSQLConf(SQLConf.TIMESTAMP_NANOS_TYPES_ENABLED.key -> "true") { ... }
{code}

Run an affected suite with SBT:

{code}
build/sbt 'catalyst/testOnly *DataTypeParserSuite *DataTypeSuite'
{code}


> Add parser test coverage for nanosecond-capable timestamp types across all 
> data-type string entry points
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-57164
>                 URL: https://issues.apache.org/jira/browse/SPARK-57164
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL, Tests
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Priority: Minor
>              Labels: starter


