[ 
https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Daitche updated SPARK-25244:
----------------------------------
    Description: 
The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
 However, when timestamps are converted directly to Python `datetime` objects, 
it is ignored and the system timezone is used instead.

This can be checked with the following code snippet:
{code:python}
import pyspark.sql

spark = (pyspark
         .sql
         .SparkSession
         .builder
         .master('local[1]')
         .config("spark.sql.session.timeZone", "UTC")
         .getOrCreate()
        )

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
For me this prints (the exact output depends on your system timezone; mine is 
Europe/Berlin):
{code:none}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respects the timezone setting (UTC), but the 
method `collect` ignores it and converts the timestamp to the system timezone.
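The difference can be reproduced without Spark: Python's `datetime.fromtimestamp` converts to the system timezone unless an explicit `tz` argument is passed. The snippet below is a minimal sketch of that mechanism; the literal `micros` value is the internal microsecond representation of the example timestamp above, computed by hand.

```python
from datetime import datetime, timezone

# 2018-06-01 01:00:00 UTC expressed as microseconds since the epoch,
# which is how Spark stores timestamps internally
micros = 1527814800000000

# Without a tz argument, fromtimestamp converts to the *system* timezone
# (on a Europe/Berlin machine this yields 2018-06-01 03:00:00)
local_dt = datetime.fromtimestamp(micros // 1_000_000)

# Passing tz=timezone.utc keeps the value in UTC, matching the session setting
utc_dt = datetime.fromtimestamp(micros // 1_000_000, tz=timezone.utc)
print(utc_dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2018-06-01 01:00:00
```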

The cause of this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.
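One possible direction for a fix, shown here only as a sketch (the function name and the use of the stdlib `zoneinfo` module are my own choices, not PySpark's actual implementation), would be to make the conversion use the configured session timezone rather than the system one:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

def from_internal_session_tz(micros, session_tz):
    """Convert Spark's internal microsecond timestamp to a datetime in the
    given session timezone (sketch of a possible fromInternal fix)."""
    tz = ZoneInfo(session_tz)
    dt = datetime.fromtimestamp(micros // 1_000_000, tz=tz)
    return dt.replace(microsecond=micros % 1_000_000)

# With session timezone UTC the example timestamp stays at 01:00:00;
# with Europe/Berlin it becomes 03:00:00, matching the output above.
```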

If the maintainers agree that this should be fixed, I would try to come up with 
a patch. 

 

 


> [Python] Setting `spark.sql.session.timeZone` only partially respected
> ----------------------------------------------------------------------
>
>                 Key: SPARK-25244
>                 URL: https://issues.apache.org/jira/browse/SPARK-25244
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>            Reporter: Anton Daitche
>            Priority: Major
>


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
