MaxGekk commented on a change in pull request #24181: [SPARK-27242][SQL] Make 
formatting TIMESTAMP/DATE literals independent from the default time zone
URL: https://github.com/apache/spark/pull/24181#discussion_r269092290
 
 

 ##########
 File path: docs/sql-migration-guide-upgrade.md
 ##########
 @@ -96,13 +96,17 @@ displayTitle: Spark SQL Upgrading Guide
    - The `weekofyear`, `weekday`, `dayofweek`, `date_trunc`, `from_utc_timestamp`, `to_utc_timestamp`, and `unix_timestamp` functions use the java.time API for calculating the week number of the year and the day number of the week, as well as for conversion from/to TimestampType values in the UTC time zone.
 
    - The JDBC options `lowerBound` and `upperBound` are converted to TimestampType/DateType values in the same way as casting strings to TimestampType/DateType values. The conversion is based on the Proleptic Gregorian calendar and the time zone defined by the SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and earlier, the conversion is based on the hybrid calendar (Julian + Gregorian) and on the default system time zone.
+    
+    - Formatting of `TIMESTAMP` and `DATE` literals.
 
  - In Spark version 2.4 and earlier, invalid time zone ids are silently ignored and replaced by the GMT time zone, for example, in the `from_utc_timestamp` function. Since Spark 3.0, such time zone ids are rejected, and Spark throws `java.time.DateTimeException`.
 
   - In Spark version 2.4 and earlier, the `current_timestamp` function returns 
a timestamp with millisecond resolution only. Since Spark 3.0, the function can 
return the result with microsecond resolution if the underlying clock available 
on the system offers such resolution.
 
  - In Spark version 2.4 and earlier, when reading a Hive Serde table with Spark native data sources (parquet/orc), Spark infers the actual file schema and updates the table schema in the metastore. Since Spark 3.0, Spark doesn't infer the schema anymore. This should not cause any problems for end users, but if it does, please set `spark.sql.hive.caseSensitiveInferenceMode` to `INFER_AND_SAVE`.
 
+  - Since Spark 3.0, `TIMESTAMP` literals are converted to strings using the 
SQL config `spark.sql.session.timeZone`, and `DATE` literals are formatted 
using the UTC time zone. In Spark version 2.4 and earlier, both conversions use 
the default time zone of the Java virtual machine.
 
 Review comment:
   In this PR, I made rendering of the `DATE` type consistent with other places in Spark. If we look at how dates are parsed, there are two mechanisms (a rough sketch follows this list):
   - `stringToDate` - parses dates from `UTF8String` using the UTC time zone: https://github.com/apache/spark/blob/a529be2930b1d69015f1ac8f85e590f197cf53cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L414 . For example, `CAST` and the `DATE` type literal use it.
   - `DateFormatter` - parses dates from `String` using the UTC time zone: https://github.com/apache/spark/blob/60be6d2ea3560143515c1ce9d0a7da416f8f595a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L38 . It is used in the CSV/JSON datasources and in other places.
   - There was another mechanism via `Date.valueOf`, which uses the system default time zone of the JVM, but I removed it in most places except some test suites.
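   To make these UTC-based paths concrete, here is a minimal `java.time` sketch of what both mechanisms boil down to. It is only an illustration, not the actual `DateTimeUtils`/`DateFormatter` code, and the date value is arbitrary:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// ISO-formatted input, roughly what stringToDate / CAST accepts:
val fromCast: LocalDate = LocalDate.parse("2019-03-26")

// Pattern-based input, roughly what DateFormatter does for CSV/JSON:
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val fromFormatter: LocalDate = LocalDate.parse("2019-03-26", fmt)

// Both end up as the same number of days since 1970-01-01 (the value
// stored in DateType), with no time zone involved in the arithmetic:
assert(fromCast.toEpochDay == fromFormatter.toEpochDay) // 17981
```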
   
   So, dates always come into Spark from a textual representation in the UTC time zone, and Spark stores them as the number of days since the epoch as values of `DateType`.
   
   The rendering of dates is time zone independent too (again, sketched after this list):
   - `DateFormatter` uses the UTC time zone to convert `DateType` to `String`: https://github.com/apache/spark/blob/60be6d2ea3560143515c1ce9d0a7da416f8f595a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L45 . The `DateFormatter` is used in the JSON/CSV/JDBC datasources, Hive results, and partition discovery.
   - `CAST` to `UTF8String` also uses `DateFormatter` and the UTC time zone: https://github.com/apache/spark/blob/8efc5ec72e2f5899547941010e22c023d6cb86b3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L242
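   And the reverse direction, again only a rough sketch of UTC-based rendering rather than Spark's actual code (the day count is the one from the parsing example above):

```scala
import java.time.LocalDate

// DateType is stored as days since the epoch, so rendering it as a string
// is pure epoch-day arithmetic and does not depend on any time zone:
val days: Int = 17981
val rendered: String = LocalDate.ofEpochDay(days.toLong).toString // "2019-03-26"
```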
   
   So, Spark always renders `DATE` values in the UTC time zone, except for this `DATE` typed literal, and this PR addresses that inconsistency.
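   For completeness, the `TIMESTAMP` half of the doc change above: a timestamp is stored as an instant (microseconds since the epoch), so turning it into a string does need a time zone, and the PR takes it from `spark.sql.session.timeZone` instead of the JVM default. A rough, self-contained sketch of why the chosen zone matters (the zones and values are only examples):

```scala
import java.time.{Instant, ZoneId}

// The same instant renders differently depending on the zone used for formatting:
val instant = Instant.parse("2019-03-26T00:00:00Z")
val inUtc = instant.atZone(ZoneId.of("UTC")).toLocalDateTime
// 2019-03-26T00:00
val inLa  = instant.atZone(ZoneId.of("America/Los_Angeles")).toLocalDateTime
// 2019-03-25T17:00 (PDT, UTC-7)
```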
