Nasir Ali created SPARK-33863:
---------------------------------
Summary: Pyspark UDF changes timestamps to UTC
Key: SPARK-33863
URL: https://issues.apache.org/jira/browse/SPARK-33863
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.0.1
Environment: MAC/Linux
Standalone cluster / local machine
Reporter: Nasir Ali
*Problem*:
If I create a new column using a UDF, the PySpark UDF changes the timestamps to UTC
time. I have used the following configs to tell Spark that the timestamps are in UTC:
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime
spark = SparkSession.builder \
    .config("spark.sql.session.timeZone", "UTC") \
    .getOrCreate()

df = spark.createDataFrame([("usr1", 17.00, "2018-03-10T15:27:18+00:00"),
                            ("usr1", 13.00, "2018-03-11T12:27:18+00:00"),
                            ("usr1", 25.00, "2018-03-12T11:27:18+00:00"),
                            ("usr1", 20.00, "2018-03-13T15:27:18+00:00"),
                            ("usr1", 17.00, "2018-03-14T12:27:18+00:00"),
                            ("usr2", 99.00, "2018-03-15T11:27:18+00:00"),
                            ("usr2", 156.00, "2018-03-22T11:27:18+00:00"),
                            ("usr2", 17.00, "2018-03-31T11:27:18+00:00"),
                            ("usr2", 25.00, "2018-03-15T11:27:18+00:00"),
                            ("usr2", 25.00, "2018-03-16T11:27:18+00:00")
                            ],
                           ["user", "id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))
df.show(truncate=False)
def some_time_udf(i):
    tmp = ""
    if datetime.time(5, 0) <= i.time() < datetime.time(12, 0):
        tmp = "Morning -> " + str(i)
    elif datetime.time(12, 0) <= i.time() < datetime.time(17, 0):
        tmp = "Afternoon -> " + str(i)
    elif datetime.time(17, 0) <= i.time() < datetime.time(21, 0):
        tmp = "Evening -> " + str(i)
    elif i.time() >= datetime.time(21, 0):
        tmp = "Night -> " + str(i)
    elif datetime.time(0, 0) <= i.time() < datetime.time(5, 0):
        tmp = "Night -> " + str(i)
    return tmp

sometimeudf = F.udf(some_time_udf, StringType())

df.withColumn("day_part", sometimeudf("ts")).show(truncate=False)
{code}
I have concatenated the timestamps with a label string to show that PySpark passes
the timestamps to the UDF as UTC.
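To make the expected output concrete, here is the same bucketing logic run on plain Python datetimes, outside Spark (a standalone sketch; the `day_part` helper name is mine, not part of the repro). If Spark hands the UDF the UTC wall-clock time, a `15:27:18` timestamp should land in the Afternoon bucket:

```python
import datetime

def day_part(ts):
    # Same time-of-day bucketing as the UDF above, on a naive datetime.
    t = ts.time()
    if datetime.time(5, 0) <= t < datetime.time(12, 0):
        return "Morning -> " + str(ts)
    elif datetime.time(12, 0) <= t < datetime.time(17, 0):
        return "Afternoon -> " + str(ts)
    elif datetime.time(17, 0) <= t < datetime.time(21, 0):
        return "Evening -> " + str(ts)
    else:
        # Covers both 21:00-24:00 and 00:00-05:00.
        return "Night -> " + str(ts)

# 2018-03-10T15:27:18 UTC -> "Afternoon -> 2018-03-10 15:27:18"
print(day_part(datetime.datetime(2018, 3, 10, 15, 27, 18)))
```

Any timezone shift applied by Spark before the UDF runs would move values across these bucket boundaries, which is how the concatenated string exposes the conversion.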
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]