[
https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuanjian Li updated SPARK-31030:
--------------------------------
Description:
*Background*
In Spark version 2.4 and earlier, datetime parsing, formatting and conversion
are performed using the hybrid calendar ([Julian +
Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]).
Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as
well as the calendar chosen by the ANSI SQL standard, Spark 3.0 switches to it
by using the Java 8 API classes (the java.time packages, which are based on
[ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]).
The switch was completed in SPARK-26651.
*Problem*
Switching to the Java 8 datetime API breaks backward compatibility with Spark
2.4 and earlier when parsing datetimes. Moreover, for built-in SQL expressions
such as to_date and to_timestamp, the existing implementation of Spark 3.0
catches all exceptions and returns `null` on parsing errors. This causes
silent result changes, which are hard to debug for end users when the data
volume is huge and the business logic is complex.
*Solution*
To avoid unexpected result changes after the underlying datetime API switch, we
propose the following solution.
* Introduce a fallback mechanism: when the Java 8-based parser fails, detect
the behavior difference by falling back to the legacy parser, and fail with a
user-friendly error message that tells users what changed and how to fix the
pattern (see the sketch after the table below).
* Document Spark’s datetime patterns: Spark’s date-time formatter is
decoupled from the Java patterns. Spark’s patterns are mainly based on the
[Java 7 patterns|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]
for better backward compatibility, with customized logic to handle the
breaking changes between the
[Java 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]
and
[Java 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]
pattern strings. Below are the customized rules:
||Pattern||Java 7||Java 8||Example||Rule||
|u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (unlike y, u accepts a negative value to represent BC, while y must be used together with G to do the same)|!image-2020-03-04-10-54-05-208.png! |Substitute ‘u’ with ‘e’ and parse the string with the Java 8 parser. If parsable, return the result; otherwise, fall back to ‘u’ and parse with the legacy Java 7 parser. If that succeeds, throw an exception asking the user to change the pattern string or turn on the legacy mode; otherwise, return NULL as Spark 2.4 does.|
|z|General time zone, which also accepts [RFC 822 time zones|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#rfc822timezone]|Only accepts time-zone names, e.g. Pacific Standard Time; PST|!image-2020-03-04-10-54-13-238.png! |The semantics of ‘z’ differ between Java 7 and Java 8; Spark 3.0 follows the Java 8 semantics. Parse the string with the Java 8 parser. If parsable, return the result; otherwise, parse with the legacy Java 7 parser. If that succeeds, throw an exception asking the user to change the pattern string or turn on the legacy mode; otherwise, return NULL as Spark 2.4 does.|
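For the semantic difference behind the ‘u’ rule above, a small demonstration
against the plain JDK APIs (the sample date is made up for illustration):
{code:scala}
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Java 7 API (SimpleDateFormat): 'u' is the day number of the week,
// 1 = Monday, ..., 7 = Sunday. 2020-03-04 is a Wednesday, so this prints "3".
val legacyDate = new SimpleDateFormat("yyyy-MM-dd").parse("2020-03-04")
println(new SimpleDateFormat("u").format(legacyDate))

// Java 8 API (DateTimeFormatter): 'u' is the (proleptic, possibly negative)
// year, so formatting the same date prints "2020".
val date = LocalDate.parse("2020-03-04", DateTimeFormatter.ofPattern("yyyy-MM-dd"))
println(date.format(DateTimeFormatter.ofPattern("u")))
{code}
And a minimal sketch of the proposed fallback mechanism itself, referenced
from the solution bullet above. This is only an illustration of the rule
column, not the actual Spark implementation; `parseWithJava8` and
`parseWithLegacy` are hypothetical stand-ins for the two parsers:
{code:scala}
import java.util.Date
import scala.util.Try

// Hypothetical helpers standing in for the real Java 8 and legacy parsers.
def parseWithJava8(s: String, pattern: String): Date = ???
def parseWithLegacy(s: String, pattern: String): Date = ???

def parseDatetime(s: String, pattern: String): Option[Date] = {
  // Rule for 'u': substitute it with 'e' before trying the Java 8 parser.
  // (The real logic would skip quoted literals; this sketch does not.)
  val java8Pattern = pattern.replace("u", "e")
  Try(parseWithJava8(s, java8Pattern)).toOption.orElse {
    // The Java 8 parser failed: try the legacy Java 7 parser with the
    // original pattern to detect a behavior difference.
    if (Try(parseWithLegacy(s, pattern)).isSuccess) {
      // The legacy parser succeeds where the new one fails, so the result
      // silently changed between 2.4 and 3.0: fail with an actionable message.
      throw new IllegalArgumentException(
        s"Pattern '$pattern' parses differently in Spark 3.0; " +
          "fix the pattern or turn on the legacy parser mode.")
    }
    // Neither parser can parse the string: return NULL as Spark 2.4 does.
    None
  }
}
{code}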
> Backward Compatibility for Parsing Datetime
> -------------------------------------------
>
> Key: SPARK-31030
> URL: https://issues.apache.org/jira/browse/SPARK-31030
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuanjian Li
> Priority: Major
> Attachments: image-2020-03-04-10-54-05-208.png,
> image-2020-03-04-10-54-13-238.png
>
>