[
https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anton Daitche updated SPARK-25244:
----------------------------------
Description:
The setting `spark.sql.session.timeZone` is respected by PySpark when
converting from and to Pandas, as described
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
However, when timestamps are converted directly to Python `datetime` objects,
the setting is ignored and the system timezone is used instead.
This can be verified with the following code snippet:
{code:python}
import pyspark.sql
spark = (pyspark
.sql
.SparkSession
.builder
.master('local[1]')
.config("spark.sql.session.timeZone", "UTC")
.getOrCreate()
)
df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))
print(df.toPandas().iloc[0, 0])  # respects spark.sql.session.timeZone
print(df.collect()[0][0])        # uses the system timezone instead
{code}
For me this prints the following (the exact output depends on your system's
timezone; mine is Europe/Berlin):
{code}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respects the timezone setting (UTC), whereas the
method `collect` ignores it and converts the timestamp to my system's timezone.
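Until this is fixed, the mismatch can be worked around on the driver side. The sketch below is my own illustration (not part of PySpark): it re-interprets the naive `datetime` returned by `collect`, which is expressed in the system timezone, as a value in the session timezone (here assumed to be UTC):
{code:python}
from datetime import datetime, timezone

def to_session_tz(naive_local_dt):
    """Re-interpret a naive datetime (expressed in the system timezone)
    as the session timezone, assumed here to be UTC."""
    # astimezone() on a naive datetime assumes the system's local timezone,
    # so this converts local wall-clock time to UTC and then drops tzinfo
    # to match the naive datetimes PySpark returns.
    return naive_local_dt.astimezone(timezone.utc).replace(tzinfo=None)
{code}
With the Europe/Berlin system timezone from above, `to_session_tz(datetime(2018, 6, 1, 3, 0))` yields `2018-06-01 01:00:00`, matching the `toPandas` result.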
The cause of this behaviour is that the methods `toInternal` and
`fromInternal` of PySpark's `TimestampType` class don't take the setting
`spark.sql.session.timeZone` into account and instead use the system timezone.
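For illustration, a fix could look roughly like the following. This is my own sketch under the assumption that `fromInternal` receives microseconds since the epoch and that the session timezone could be passed in as a parameter; the function name and signature are hypothetical, not Spark's actual implementation:
{code:python}
from datetime import datetime, timezone

def from_internal_with_tz(micros, session_tz=timezone.utc):
    """Sketch of a fromInternal that honours a session timezone
    (hypothetical signature; the real method only takes the value)."""
    # Interpret the internal epoch-microseconds value in the session
    # timezone instead of the system timezone, then drop tzinfo to keep
    # the naive datetimes PySpark currently returns.
    return datetime.fromtimestamp(micros / 10**6, tz=session_tz).replace(tzinfo=None)
{code}
With this, the value from the snippet above would come back as `2018-06-01 01:00:00` regardless of the system timezone.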
If the maintainers agree that this should be fixed, I would be happy to
contribute a patch.
> [Python] Setting `spark.sql.session.timeZone` only partially respected
> ----------------------------------------------------------------------
>
> Key: SPARK-25244
> URL: https://issues.apache.org/jira/browse/SPARK-25244
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.1
> Reporter: Anton Daitche
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]