Hi,

I've got a number of tables that I'm loading in from a SQL Server database. The
timestamps in SQL Server are stored like 2003-11-24T09:02:32. I get these as
Parquet files in our raw storage location and pick them up in Databricks.
When I load the data in Databricks, Spark assumes UTC (+0000) on the
timestamp, e.g. 2003-11-24T09:02:32+0000. The date and time are the same as
what's in SQL Server, but the offset is incorrect.
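For context, the load itself is just a plain Parquet read, roughly like the
sketch below (the path and column name are placeholders, not the real ones):

# 'spark' is the session Databricks provides; path/column are made up for illustration
df = spark.read.parquet('/mnt/raw/dbo/SomeTable')
df.select('ModifiedDate').show(1, truncate=False)
# displays 2003-11-24 09:02:32 -- same wall-clock value as SQL Server,
# but Spark treats the underlying instant as UTC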

I've tried various methods, like the code below, to set the JVM timezone to
my local timezone, but when viewing the data it seems to just subtract the
offset from the timestamp and add it to the offset part, e.g.
2003-11-24T09:02:32+0000
-> 2003-11-23T20:02:32+1300 (NZ is at +13 during daylight saving)

from pyspark.sql import SparkSession

# Set the driver/executor JVM default timezone and the Spark SQL session timezone
spark = SparkSession \
    .builder \
    .appName('test') \
    .master('local') \
    .config('spark.driver.extraJavaOptions', '-Duser.timezone=Pacific/Auckland') \
    .config('spark.executor.extraJavaOptions', '-Duser.timezone=Pacific/Auckland') \
    .config('spark.sql.session.timeZone', 'Pacific/Auckland') \
    .getOrCreate()



I understand that in Parquet these are stored as UNIX time and aren't
timezone aware. However, are there any settings I can set in Spark that
would implicitly convert/assume the timestamp from 2003-11-24T09:02:32+0000
to 2003-11-24T09:02:32+1300? I know this can be done with transformations,
but I'm trying to avoid doing transformations for every timestamp column on
hundreds of tables.
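
To be clear about what I mean by transformations, the per-column workaround
I'm trying to avoid maintaining looks roughly like this (just a sketch;
shift_to_nz is a name I made up):

from pyspark.sql import functions as F

def shift_to_nz(df):
    # Reinterpret every timestamp column: treat the wall-clock value that
    # Spark loaded as UTC as if it were actually Pacific/Auckland local time
    for name, dtype in df.dtypes:
        if dtype == 'timestamp':
            df = df.withColumn(name, F.to_utc_timestamp(F.col(name), 'Pacific/Auckland'))
    return df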

Any help much appreciated, thanks,

Jack
