MaxGekk commented on a change in pull request #24181: [SPARK-27242][SQL] Make 
formatting TIMESTAMP/DATE literals independent from the default time zone
URL: https://github.com/apache/spark/pull/24181#discussion_r269092290
 
 

 ##########
 File path: docs/sql-migration-guide-upgrade.md
 ##########
 @@ -96,13 +96,17 @@ displayTitle: Spark SQL Upgrading Guide
    - The `weekofyear`, `weekday`, `dayofweek`, `date_trunc`, `from_utc_timestamp`, `to_utc_timestamp`, and `unix_timestamp` functions use the java.time API for calculating the week number of the year and the day number of the week, as well as for conversion from/to TimestampType values in the UTC time zone.
 
    - The JDBC options `lowerBound` and `upperBound` are converted to TimestampType/DateType values in the same way as casting strings to TimestampType/DateType values. The conversion is based on the Proleptic Gregorian calendar and the time zone defined by the SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and earlier, the conversion is based on the hybrid calendar (Julian + Gregorian) and on the default system time zone.
+    
+    - Formatting of `TIMESTAMP` and `DATE` literals.
 
  - In Spark version 2.4 and earlier, invalid time zone ids are silently ignored and replaced by the GMT time zone, for example, in the `from_utc_timestamp` function. Since Spark 3.0, such time zone ids are rejected, and Spark throws `java.time.DateTimeException`.
 
   - In Spark version 2.4 and earlier, the `current_timestamp` function returns 
a timestamp with millisecond resolution only. Since Spark 3.0, the function can 
return the result with microsecond resolution if the underlying clock available 
on the system offers such resolution.
 
  - In Spark version 2.4 and earlier, when reading a Hive Serde table with Spark native data sources (parquet/orc), Spark infers the actual file schema and updates the table schema in the metastore. Since Spark 3.0, Spark doesn't infer the schema anymore. This should not cause any problems for end users, but if it does, please set `spark.sql.hive.caseSensitiveInferenceMode` to `INFER_AND_SAVE`.
 
+  - Since Spark 3.0, `TIMESTAMP` literals are converted to strings using the 
SQL config `spark.sql.session.timeZone`, and `DATE` literals are formatted 
using the UTC time zone. In Spark version 2.4 and earlier, both conversions use 
the default time zone of the Java virtual machine.
 
 Review comment:
   In this PR, I made rendering of the `DATE` type consistent with other places in Spark. If we look at how dates are parsed, there are two mechanisms (a rough sketch follows this list):
   - `stringToDate` - parses dates from `UTF8String` using the UTC time zone: https://github.com/apache/spark/blob/a529be2930b1d69015f1ac8f85e590f197cf53cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L414 . For example, `CAST` and the `DATE` type literal use it.
   - `DateFormatter` - parses dates from `String` using the UTC time zone: https://github.com/apache/spark/blob/60be6d2ea3560143515c1ce9d0a7da416f8f595a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L38 . It is used in the CSV/JSON datasources and in other places.
   - There was another mechanism via `Date.valueOf`, which uses the system default time zone of the JVM, but I removed it in most places except some test suites.
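   To make these UTC-based paths concrete, here is a minimal `java.time` sketch of what both mechanisms boil down to. It is only an illustration, not the actual `DateTimeUtils`/`DateFormatter` code, and the date value is arbitrary:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// ISO-formatted input, roughly what stringToDate / CAST accepts:
val fromCast: LocalDate = LocalDate.parse("2019-03-26")

// Pattern-based input, roughly what DateFormatter does for CSV/JSON:
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val fromFormatter: LocalDate = LocalDate.parse("2019-03-26", fmt)

// Both end up as the same number of days since 1970-01-01 (the value
// stored in DateType), with no time zone involved in the arithmetic:
assert(fromCast.toEpochDay == fromFormatter.toEpochDay) // 17981
```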
   
   So, dates always come into Spark from a textual representation in the UTC time zone, and Spark stores them as the number of days since the epoch as values of `DateType`.
   
   The rendering of dates is time zone independent too (again, sketched after this list):
   - `DateFormatter` uses the UTC time zone to convert `DateType` to `String`: https://github.com/apache/spark/blob/60be6d2ea3560143515c1ce9d0a7da416f8f595a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L45 . The `DateFormatter` is used in the JSON/CSV/JDBC datasources, Hive results, and partition discovery.
   - `CAST` to `UTF8String` also uses `DateFormatter` and the UTC time zone: https://github.com/apache/spark/blob/8efc5ec72e2f5899547941010e22c023d6cb86b3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L242
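   And the reverse direction, again only a rough sketch of UTC-based rendering rather than Spark's actual code (the day count is the one from the parsing example above):

```scala
import java.time.LocalDate

// DateType is stored as days since the epoch, so rendering it as a string
// is pure epoch-day arithmetic and does not depend on any time zone:
val days: Int = 17981
val rendered: String = LocalDate.ofEpochDay(days.toLong).toString // "2019-03-26"
```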
   
   So, Spark always renders `DATE` values in the UTC time zone, except for this `DATE` typed literal, and this PR addresses that inconsistency.
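   For completeness, the `TIMESTAMP` half of the doc change above: a timestamp is stored as an instant (microseconds since the epoch), so turning it into a string does need a time zone, and the PR takes it from `spark.sql.session.timeZone` instead of the JVM default. A rough, self-contained sketch of why the chosen zone matters (the zones and values are only examples):

```scala
import java.time.{Instant, ZoneId}

// The same instant renders differently depending on the zone used for formatting:
val instant = Instant.parse("2019-03-26T00:00:00Z")
val inUtc = instant.atZone(ZoneId.of("UTC")).toLocalDateTime
// 2019-03-26T00:00
val inLa  = instant.atZone(ZoneId.of("America/Los_Angeles")).toLocalDateTime
// 2019-03-25T17:00 (PDT, UTC-7)
```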
