Did you re-create your df after you updated the timezone conf?

On Wed, Apr 24, 2019 at 9:18 PM Shubham Chaurasia <shubh.chaura...@gmail.com> wrote:
> Writing:
> scala> df.write.orc("<some_path>")
>
> For looking into contents, I used orc-tools-X.Y.Z-uber.jar
> (https://orc.apache.org/docs/java-tools.html)
>
> On Wed, Apr 24, 2019 at 6:24 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> How did you read/write the timestamp value from/to ORC file?
>>
>> On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia <shubh.chaura...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Consider the following (Spark v2.4.0):
>>>
>>> Basically, I change the value of `spark.sql.session.timeZone` and perform an
>>> ORC write. Here are 3 samples:
>>>
>>> 1)
>>> scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
>>>
>>> scala> val df = sc.parallelize(Seq("2019-04-23 09:15:04.0")).toDF("ts").withColumn("ts", col("ts").cast("timestamp"))
>>> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
>>>
>>> df.show() Output        ORC File Contents
>>> -------------------------------------------------------------
>>> 2019-04-23 09:15:04     {"ts":"2019-04-23 09:15:04.0"}
>>>
>>> 2)
>>> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
>>>
>>> df.show() Output        ORC File Contents
>>> -------------------------------------------------------------
>>> 2019-04-23 03:45:04     {"ts":"2019-04-23 09:15:04.0"}
>>>
>>> 3)
>>> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>>
>>> df.show() Output        ORC File Contents
>>> -------------------------------------------------------------
>>> 2019-04-22 20:45:04     {"ts":"2019-04-23 09:15:04.0"}
>>>
>>> It can be seen that in all three cases it stores
>>> {"ts":"2019-04-23 09:15:04.0"} in the ORC file. I understand that the ORC
>>> file also contains the writer timezone, with respect to which Spark is able
>>> to convert back to the actual time when it reads the ORC file (and that is
>>> equal to df.show()).
>>>
>>> But it's problematic in the sense that it is not adjusting (plus/minus) the
>>> timezone (spark.sql.session.timeZone) offsets for
>>> {"ts":"2019-04-23 09:15:04.0"} in the ORC file. I mean loading data to any
>>> system other than Spark would be a problem.
>>>
>>> Any ideas/suggestions on that?
>>>
>>> PS: For CSV files, it stores exactly what we see as the output of
>>> df.show()
>>>
>>> Thanks,
>>> Shubham
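
For reference, the three df.show() values quoted above are consistent with a single instant being rendered in three different zones. This assumes the string was cast to a timestamp once, while the session time zone was Asia/Kolkata, and the same df was then reused after each conf change. Here is a minimal sketch of that arithmetic using only java.time. It is not Spark code and says nothing about how the ORC writer works internally; the names SessionTimeZoneSketch and render are made up for illustration.

```scala
import java.time.{LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of Spark or ORC.
object SessionTimeZoneSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Interpret a wall-clock string in `parseZone` (the zone assumed to be
  // active when the cast happened), then render that same instant in
  // `displayZone` (the zone assumed to be active when show() is called).
  def render(wallClock: String, parseZone: String, displayZone: String): String = {
    val instant = LocalDateTime.parse(wallClock, fmt)
      .atZone(ZoneId.of(parseZone))
      .toInstant
    fmt.format(instant.atZone(ZoneId.of(displayZone)))
  }

  def main(args: Array[String]): Unit = {
    val parsedIn = "Asia/Kolkata" // assumption: df was created under this zone
    Seq("Asia/Kolkata", "UTC", "America/Los_Angeles").foreach { zone =>
      println(s"$zone -> ${render("2019-04-23 09:15:04", parsedIn, zone)}")
    }
  }
}
```

If that assumption holds, only the rendering changes between the three samples, not the stored value, which would match the identical ORC contents in all three cases.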