bersprockets commented on a change in pull request #34712:
URL: https://github.com/apache/spark/pull/34712#discussion_r757083285



##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -214,6 +213,21 @@ class OrcFileFormat
     }
   }
 
+  // This method references createRecordReader of OrcInputFormat.
+  // Just for call useUTCTimestamp of OrcFile.ReaderOptions.
+  private def createRecordReader[V <: WritableComparable[_]](
+      inputSplit: InputSplit,
+      taskAttemptContext: TaskAttemptContext): 
mapreduce.OrcMapreduceRecordReader[V] = {
+    val split = inputSplit.asInstanceOf[FileSplit]
+    val conf = taskAttemptContext.getConfiguration()
+    val readOptions = OrcFile.readerOptions(conf)
+      .maxLength(OrcConf.MAX_FILE_LENGTH.getLong(conf)).useUTCTimestamp(true)

Review comment:
       While `useUTCTimestamp(true)` makes TIMESTAMP_NTZ consistent between ORC 
and Parquet/Avro, it also makes TIMESTAMP consistent between ORC and 
Parquet/Avro. While that seems like a good thing, it's a behavior change 
between 3.2 (which is consistent with earlier versions, including 2.x) and the 
upcoming 3.3
   
   (Up to and including 3.2, TIMESTAMP with ORC has behaved more like 
TIMESTAMP_NTZ).
   
   This also seems to break reading ORC files written by earlier Spark versions.
   
   For example, In Spark 3.2.0, in the America/Los_Angeles timezone, or some 
other non-UTC timezone, write a timestamp to ORC:
   ```
   sql("select timestamp '2021-06-01 00:00:00' 
ts").write.mode("overwrite").format("orc").save("/tmp/testdata/ts_orc_spark32")
   ```
   In Spark with this PR, running in the _same_ timezone as above, read the 
timestamp:
   ```
   scala> sql("select * from `orc`.`/tmp/testdata/ts_orc_spark32`").show(false)
   +-------------------+
   |ts                 |
   +-------------------+
   |2021-05-31 17:00:00| 
   +-------------------+
   
   scala> 
   ```
   (Note that the above will appear to work correctly if your local timezone is 
set to UTC).
   
   I think the fix to TIMESTAMP might need to be done with a feature flag, and 
the fix for TIMESTAMP_NTZ might need to be more targeted to just TIMESTAMP_NTZ.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to