Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

Wenchen Fan Wed, 24 Apr 2019 08:33:33 -0700

Did you re-create your df after updating the timezone conf?

On Wed, Apr 24, 2019 at 9:18 PM Shubham Chaurasia <shubh.chaura...@gmail.com>
wrote:


> Writing:
> scala> df.write.orc("<some_path>")
>
> For looking into contents, I used orc-tools-X.Y.Z-uber.jar (
> https://orc.apache.org/docs/java-tools.html)
>
> On Wed, Apr 24, 2019 at 6:24 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> How did you read/write the timestamp value from/to ORC file?
>>
>> On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia <
>> shubh.chaura...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Consider the following (Spark v2.4.0):
>>>
>>> I change the value of `spark.sql.session.timeZone` and then perform an
>>> ORC write. Here are three samples:
>>>
>>> 1)
>>> scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
>>>
>>> scala> val df = sc.parallelize(Seq("2019-04-23 09:15:04.0")).toDF("ts").
>>>      |   withColumn("ts", col("ts").cast("timestamp"))
>>> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
>>>
>>> df.show() Output                  ORC File Contents
>>> -------------------------------------------------------------
>>> 2019-04-23 09:15:04           {"ts":"2019-04-23 09:15:04.0"}
>>>
>>> 2)
>>> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
>>>
>>> df.show() Output                  ORC File Contents
>>> -------------------------------------------------------------
>>> 2019-04-23 03:45:04           {"ts":"2019-04-23 09:15:04.0"}
>>>
>>> 3)
>>> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>>
>>> df.show() Output                  ORC File Contents
>>> -------------------------------------------------------------
>>> 2019-04-22 20:45:04           {"ts":"2019-04-23 09:15:04.0"}
>>>
>>> In all three cases, the ORC file stores {"ts":"2019-04-23 09:15:04.0"}.
>>> I understand that the ORC file also records the writer timezone, which
>>> lets Spark convert the value back to the actual time when it reads the
>>> file (and that matches the df.show() output).
>>>
>>> But this is problematic because the value stored in the ORC file,
>>> {"ts":"2019-04-23 09:15:04.0"}, is not adjusted (plus/minus) by the
>>> spark.sql.session.timeZone offset. Loading the data into any system
>>> other than Spark would therefore be a problem.
>>>
>>> Any ideas/suggestions on that?
>>>
>>> PS: For CSV files, Spark stores exactly what df.show() prints.
>>>
>>> Thanks,
>>> Shubham
>>>
>>>
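
The three df.show() values quoted above are consistent with a single
instant (09:15:04 in Asia/Kolkata) being rendered in three different
zones. The sketch below reproduces only that arithmetic with java.time.
It does not use Spark or ORC and makes no claim about how the ORC writer
stores the value; the names SessionTimeZoneSketch and render are
hypothetical and exist only for this illustration.

```scala
import java.time.{LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter

object SessionTimeZoneSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Interpret the wall-clock string in zone `parsedIn`, then render the
  // same instant as a wall-clock string in zone `shownIn`.
  def render(wallClock: String, parsedIn: String, shownIn: String): String = {
    val instant = LocalDateTime.parse(wallClock, fmt)
      .atZone(ZoneId.of(parsedIn))
      .toInstant
    fmt.format(instant.atZone(ZoneId.of(shownIn)))
  }

  def main(args: Array[String]): Unit = {
    val ts = "2019-04-23 09:15:04"
    val parsedIn = "Asia/Kolkata" // zone in effect when the string was cast
    Seq("Asia/Kolkata", "UTC", "America/Los_Angeles").foreach { zone =>
      val shown = render(ts, parsedIn, zone)
      println(zone + " -> " + shown)
    }
  }
}
```

Running main prints one line per zone. The rendered values can be
compared against the df.show() column in samples 1) to 3).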
