[GitHub] [spark] bersprockets commented on pull request #33588: [SPARK-36346][SQL] Support TimestampNTZ type in Orc file source

GitBox Wed, 24 Nov 2021 17:29:01 -0800


bersprockets commented on pull request #33588:
URL: https://github.com/apache/spark/pull/33588#issuecomment-978719988



   @beliefer 
   
   Sorry, doing a post-commit review...
   
   I don't think this is working quite as you expected.
   
   The displayed value of a TimestampNTZ column read from ORC changes with the
   time zone of the reader, which should not be the case.
   
   For example, run this code in local mode:
   
   ```
   import java.util.TimeZone
   
   TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
   sql("set spark.sql.session.timeZone=America/Los_Angeles")
   
   val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp '2021-06-01 00:00:00' ts")
   
   df.write.mode("overwrite").orc("ts_ntz_orc")
   df.write.mode("overwrite").parquet("ts_ntz_parquet")
   // The avro format requires the external spark-avro module on the classpath.
   df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
   
   val query = """
     select 'orc', *
     from `orc`.`ts_ntz_orc`
     union all
     select 'parquet', *
     from `parquet`.`ts_ntz_parquet`
     union all
     select 'avro', *
     from `avro`.`ts_ntz_avro`
   """
   
   val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
   for (tz <- tzs) {
     TimeZone.setDefault(TimeZone.getTimeZone(tz))
     sql(s"set spark.sql.session.timeZone=$tz")
   
     println(s"Time zone is ${TimeZone.getDefault.getID}")
     sql(query).show(false)
   }
   ```
   
   You will see:
   
   ```
   Time zone is America/Los_Angeles
   +-------+-------------------+-------------------+
   |orc    |ts_ntz             |ts                 |
   +-------+-------------------+-------------------+
   |orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
   |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
   |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
   +-------+-------------------+-------------------+
   
   Time zone is UTC
   +-------+-------------------+-------------------+
   |orc    |ts_ntz             |ts                 |
   +-------+-------------------+-------------------+
   |orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
   |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
   |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
   +-------+-------------------+-------------------+
   
   Time zone is Europe/Amsterdam
   +-------+-------------------+-------------------+
   |orc    |ts_ntz             |ts                 |
   +-------+-------------------+-------------------+
   |orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
   |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
   |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
   +-------+-------------------+-------------------+
   ```
   
   Note how the display value of ts_ntz varies as the reader's time zone 
changes, but only for ORC.
   
   This is due to [this code in ORC](https://github.com/apache/orc/blob/rel/release-1.7.1/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1290).
   
   
   For TIMESTAMPNTZ, Spark treats all values as though UTC is the only time
   zone, regardless of the actual time zone. However, ORC cares about the
   reader's and writer's actual time zones (because we don't set the
   `useUTCTimestamp` option). ORC remembers the writer's time zone. When the
   reader's time zone differs from the writer's, ORC "adjusts" the value
   accordingly (which is why the ts column above behaves more like a
   TIMESTAMPNTZ type than the ts_ntz column does).
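   
   As a rough illustration, the adjustment can be modeled with plain 
   `java.time` arithmetic. This is a simplified sketch, not ORC's or Spark's 
   actual code: the names `OrcNtzModel` and `displayedNtz` are made up here, and 
   the model assumes ORC preserves the wall-clock reading in the writer's zone 
   and re-anchors it in the reader's zone.
   
   ```scala
   import java.time.{LocalDateTime, ZoneId, ZoneOffset}

   object OrcNtzModel {
     // Hypothetical model of the round trip; names are illustrative only.
     def displayedNtz(value: LocalDateTime,
                      writerZone: ZoneId,
                      readerZone: ZoneId): LocalDateTime = {
       // Spark encodes a TIMESTAMP_NTZ value as if its wall clock were UTC.
       val written = value.toInstant(ZoneOffset.UTC)
       // Assumption: without useUTCTimestamp, ORC keeps the wall-clock
       // reading of that instant in the writer's time zone...
       val wallClock = LocalDateTime.ofInstant(written, writerZone)
       // ...and the reader re-anchors that wall clock in its own zone.
       val readBack = wallClock.atZone(readerZone).toInstant
       // Spark then decodes the instant as if UTC were the only zone.
       LocalDateTime.ofInstant(readBack, ZoneOffset.UTC)
     }

     def main(args: Array[String]): Unit = {
       val value = LocalDateTime.parse("2021-06-01T00:00:00")
       val writer = ZoneId.of("America/Los_Angeles")
       for (tz <- Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")) {
         println(s"$tz -> ${displayedNtz(value, writer, ZoneId.of(tz))}")
       }
     }
   }
   ```
   
   Under that assumption, the model gives 2021-06-01 00:00, 2021-05-31 17:00 
   and 2021-05-31 15:00 for the three reader zones, matching the ORC rows above.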
   
   To confirm that this is the issue, [I added some code](https://github.com/apache/spark/compare/master...bersprockets:orc_ntz_issue_play)
   to shift the value to local time on write, and to re-shift it to UTC on
   read. With that code, the ts_ntz column's display value does not vary across
   time zones. This is not necessarily a proposed fix; it only confirms the
   diagnosis.
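   
   The idea of shifting on write and re-shifting on read can be sketched with 
   `java.time` alone. This is not the code in the linked branch: every name 
   below is hypothetical, `orcRoundTrip` is only an assumed stand-in for ORC's 
   behavior without `useUTCTimestamp`, and DST gaps and overlaps are ignored.
   
   ```scala
   import java.time.{Instant, LocalDateTime, ZoneId}

   object NtzShiftRoundTrip {
     // Assumed stand-in for ORC: keep the wall-clock reading in the writer's
     // zone and re-anchor it in the reader's zone.
     def orcRoundTrip(written: Instant, writerZone: ZoneId, readerZone: ZoneId): Instant =
       LocalDateTime.ofInstant(written, writerZone).atZone(readerZone).toInstant

     // Write side: pick the instant whose wall clock in the writer's zone
     // equals the zone-free value.
     def shiftOnWrite(value: LocalDateTime, writerZone: ZoneId): Instant =
       value.atZone(writerZone).toInstant

     // Read side: recover the zone-free value from the reader's local wall clock.
     def shiftOnRead(read: Instant, readerZone: ZoneId): LocalDateTime =
       LocalDateTime.ofInstant(read, readerZone)

     def roundTrip(value: LocalDateTime, writerZone: ZoneId, readerZone: ZoneId): LocalDateTime =
       shiftOnRead(orcRoundTrip(shiftOnWrite(value, writerZone), writerZone, readerZone), readerZone)

     def main(args: Array[String]): Unit = {
       val value = LocalDateTime.parse("2021-06-01T00:00:00")
       val writer = ZoneId.of("America/Los_Angeles")
       for (tz <- Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")) {
         println(s"$tz -> ${roundTrip(value, writer, ZoneId.of(tz))}")
       }
     }
   }
   ```
   
   In this model the value read back is the same for every reader zone, which 
   is the behavior expected of ts_ntz.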




