bersprockets commented on pull request #33588:
URL: https://github.com/apache/spark/pull/33588#issuecomment-978719988
@beliefer
Sorry, doing a post-commit review...
I don't think this is working quite as you expected.
For ORC, the display value of a TIMESTAMPNTZ column is affected by the time
zone of the reader, which should not be the case.
For example, run this code in local mode:
```
import java.util.TimeZone
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
sql("set spark.sql.session.timeZone=America/Los_Angeles")
val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp '2021-06-01 00:00:00' ts")
df.write.mode("overwrite").orc("ts_ntz_orc")
df.write.mode("overwrite").parquet("ts_ntz_parquet")
df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
val query = """
select 'orc', *
from `orc`.`ts_ntz_orc`
union all
select 'parquet', *
from `parquet`.`ts_ntz_parquet`
union all
select 'avro', *
from `avro`.`ts_ntz_avro`
"""
val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
for (tz <- tzs) {
  TimeZone.setDefault(TimeZone.getTimeZone(tz))
  sql(s"set spark.sql.session.timeZone=$tz")
  println(s"Time zone is ${TimeZone.getDefault.getID}")
  sql(query).show(false)
}
```
You will see:
```
Time zone is America/Los_Angeles
+-------+-------------------+-------------------+
|orc    |ts_ntz             |ts                 |
+-------+-------------------+-------------------+
|orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
|parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
|avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
+-------+-------------------+-------------------+
Time zone is UTC
+-------+-------------------+-------------------+
|orc    |ts_ntz             |ts                 |
+-------+-------------------+-------------------+
|orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
|parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
|avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
+-------+-------------------+-------------------+
Time zone is Europe/Amsterdam
+-------+-------------------+-------------------+
|orc    |ts_ntz             |ts                 |
+-------+-------------------+-------------------+
|orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
|parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
|avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
+-------+-------------------+-------------------+
```
Note how the display value of ts_ntz varies as the reader's time zone
changes, but only for ORC.
This is due to [this code in
ORC](https://github.com/apache/orc/blob/rel/release-1.7.1/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1290).
For TIMESTAMPNTZ, Spark treats all values as though UTC is the only time
zone, regardless of the actual time zone. However, ORC cares about the actual
time zones of the reader and the writer (because we don't set the
`useUTCTimestamp` option). ORC remembers the writer's time zone. When the
reader's time zone differs from the writer's, ORC "adjusts" the value
accordingly (which is why the ts column above behaves more like a
TIMESTAMPNTZ type than the ts_ntz column).
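To make the arithmetic concrete, here is a minimal sketch in plain `java.time` that models the adjustment described above. It is not ORC's actual code: the object `OrcNtzShift` and the helper `orcAdjusted` are hypothetical names, and the model simply assumes that ORC preserves the wall-clock time as seen in the writer's zone.

```scala
import java.time.{LocalDateTime, ZoneId, ZoneOffset}

object OrcNtzShift {
  // Hypothetical model of the behavior described above (not ORC's real code).
  def orcAdjusted(ntz: LocalDateTime, writerTz: String, readerTz: String): LocalDateTime = {
    // Spark hands the NTZ value to ORC as though it were an instant in UTC.
    val written = ntz.toInstant(ZoneOffset.UTC)
    // Assumption: ORC views that instant as a wall-clock time in the writer's zone...
    val wallClock = LocalDateTime.ofInstant(written, ZoneId.of(writerTz))
    // ...and keeps the same wall-clock time for a reader in another zone.
    val read = wallClock.atZone(ZoneId.of(readerTz)).toInstant
    // Spark decodes the returned instant as though it were UTC again.
    LocalDateTime.ofInstant(read, ZoneOffset.UTC)
  }

  def main(args: Array[String]): Unit = {
    val ntz = LocalDateTime.parse("2021-06-01T00:00:00")
    for (tz <- Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")) {
      println(s"$tz -> ${orcAdjusted(ntz, "America/Los_Angeles", tz)}")
    }
  }
}
```

Under this model, a writer in America/Los_Angeles and a reader in UTC or Europe/Amsterdam yield 2021-05-31 17:00 and 2021-05-31 15:00 respectively, matching the ORC rows in the output above.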
To confirm that this is the issue, [I added some
code](https://github.com/apache/spark/compare/master...bersprockets:orc_ntz_issue_play)
to shift the value to local time on write, and to re-shift to UTC on read.
With that code, the ts_ntz column's display value does not vary across time
zones. This is not necessarily a proposed fix, just a way to confirm the issue.
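The idea behind that experiment can be sketched roughly as follows. This is not the code from the linked branch: `NtzShiftSketch`, `shiftOnWrite`, `shiftOnRead`, and `orcRoundTrip` are hypothetical names, and `orcRoundTrip` is only an assumed model of ORC preserving the writer's wall-clock time for the reader.

```scala
import java.time.{Instant, LocalDateTime, ZoneId}

object NtzShiftSketch {
  // Assumed model: ORC keeps the writer's wall-clock time for the reader.
  def orcRoundTrip(stored: Instant, writerTz: ZoneId, readerTz: ZoneId): Instant =
    LocalDateTime.ofInstant(stored, writerTz).atZone(readerTz).toInstant

  // On write: anchor the NTZ wall-clock value in the writer's local zone
  // instead of UTC, so ORC sees the intended wall-clock time.
  def shiftOnWrite(ntz: LocalDateTime, writerTz: ZoneId): Instant =
    ntz.atZone(writerTz).toInstant

  // On read: recover the wall-clock value from the reader's local zone.
  def shiftOnRead(read: Instant, readerTz: ZoneId): LocalDateTime =
    LocalDateTime.ofInstant(read, readerTz)

  // Full write -> ORC -> read path under the assumptions above.
  def roundTrip(ntz: LocalDateTime, writerTz: ZoneId, readerTz: ZoneId): LocalDateTime =
    shiftOnRead(orcRoundTrip(shiftOnWrite(ntz, writerTz), writerTz, readerTz), readerTz)

  def main(args: Array[String]): Unit = {
    val ntz = LocalDateTime.parse("2021-06-01T00:00:00")
    val writer = ZoneId.of("America/Los_Angeles")
    for (tz <- Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")) {
      println(s"$tz -> ${roundTrip(ntz, writer, ZoneId.of(tz))}")
    }
  }
}
```

With the two shifts in place, the modeled value comes back as 2021-06-01 00:00 for all three reader time zones.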