[
https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maxim Gekk updated SPARK-27790:
-------------------------------
Description:
Spark has an INTERVAL data type, but it is “broken”:
# It cannot be persisted.
# It is not comparable because it crosses the month-day line. That is, there is
no telling whether an interval such as “1 Month” is equal to “30 Days”, since
not all months have the same number of days.
I propose to introduce the two flavors of INTERVAL described in the ANSI SQL
standard and to deprecate Spark's existing interval type.
* ANSI describes two non-overlapping “classes”:
** YEAR-MONTH
** DAY-SECOND
* Members within each class can be compared and sorted.
* Each class supports datetime arithmetic.
* Each class can be persisted.
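A minimal sketch of why the two classes are comparable (hypothetical Python types, not Spark's implementation): each class normalizes to a single unit — months, or microseconds — so values within a class have a total order, while cross-class comparison is rejected.

```python
from dataclasses import dataclass

# Hypothetical models of the two ANSI interval classes. Normalizing to one
# unit per class makes values within a class totally ordered and sortable.

@dataclass(frozen=True, order=True)
class YearMonthInterval:
    months: int  # years * 12 + months

@dataclass(frozen=True, order=True)
class DayTimeInterval:
    micros: int  # days/hours/minutes/seconds normalized to microseconds

# Within a class, comparison and sorting are well defined:
print(YearMonthInterval(14) > YearMonthInterval(13))  # True
print(sorted([YearMonthInterval(14), YearMonthInterval(2)]))
```

Comparing a `YearMonthInterval` against a `DayTimeInterval` raises a `TypeError`, mirroring the “non-overlapping classes” rule.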
The old and new flavors of INTERVAL can coexist until the Spark INTERVAL type is
eventually retired, and any semantic “breakage” can be controlled via legacy
config settings.
*Milestone 1* -- Spark interval equivalency (the new interval types meet or
exceed all functionality of the existing Spark interval):
* Add two new DataType implementations for interval year-month and day-second,
including the JSON format and DDL string.
* Infra support: check the caller sides of DateType/TimestampType
* Support the two new interval types in Dataset/UDF.
* Interval literals (with a legacy config to still allow mixed year-month and
day-second fields and return legacy interval values)
* Interval arithmetic (interval * num, interval / num, interval +/- interval)
* Datetime functions/operators: Datetime - Datetime (to days or day-second),
Datetime +/- interval
* Casts to and from the two new interval types, cast string to interval, cast
interval to string (pretty printing), with the SQL syntax to specify the types
* Support sorting intervals.
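The literal-parsing, pretty-printing, and arithmetic items above can be roughly illustrated with plain Python (the helper names here are hypothetical, not Spark's API):

```python
import re
from datetime import datetime

def parse_year_month(s: str) -> int:
    """Parse the body of an ANSI 'Y-M' year-month literal into total months."""
    m = re.fullmatch(r"(-?)(\d+)-(\d+)", s)
    if not m:
        raise ValueError(f"bad year-month interval: {s!r}")
    sign = -1 if m.group(1) else 1
    return sign * (int(m.group(2)) * 12 + int(m.group(3)))

def pretty_year_month(months: int) -> str:
    """Pretty-print total months back to the 'Y-M' form."""
    sign = "-" if months < 0 else ""
    years, mos = divmod(abs(months), 12)
    return f"{sign}{years}-{mos}"

months = parse_year_month("1-2")       # e.g. INTERVAL '1-2' YEAR TO MONTH
print(months)                          # 14
print(pretty_year_month(months * 3))   # interval * num, pretty-printed: 3-6

# Datetime - Datetime yields a day-second quantity (a timedelta here):
delta = datetime(2021, 3, 1) - datetime(2021, 1, 1)
print(delta.days)                      # 59
```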
*Milestone 2* -- Persistence:
* Ability to create tables with columns of the new interval types
* Ability to write to common file formats such as Parquet and JSON.
* INSERT, SELECT, UPDATE, MERGE
* Discovery
*Milestone 3* -- Client support
* JDBC support
* Hive Thrift server
*Milestone 4* -- PySpark and SparkR integration
* Python UDF can take and return intervals
* DataFrame support
> Support ANSI SQL INTERVAL types
> -------------------------------
>
> Key: SPARK-27790
> URL: https://issues.apache.org/jira/browse/SPARK-27790
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Maxim Gekk
> Assignee: Apache Spark
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)