Sebastian Eckweiler created SPARK-35367:
-------------------------------------------
Summary: Window group-key and pandas inconsistent
Key: SPARK-35367
URL: https://issues.apache.org/jira/browse/SPARK-35367
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.0.1
Reporter: Sebastian Eckweiler
Not completely sure whether this is a bug or a configuration/usage issue:
We are seeing inconsistent timezone treatment when using a windowed group-by
aggregation in combination with a pandas UDF.
A minimal example:
{code:python}
import datetime

import pandas as pd
from pyspark.sql.functions import window


def a_udf(group_key, pdf: pd.DataFrame) -> pd.DataFrame:
    # The window struct is the first (and only) element of the group key.
    w_start = group_key[0]["start"]
    w_end = group_key[0]["end"]
    print(f"Pandas : {pdf['window_start'].iloc[0]} to {pdf['window_end'].iloc[0]}")
    print(f"Group key: {w_start} to {w_end}")
    print(f"Data : {pdf['time'].min()} to {pdf['time'].max()}")
    assert (pdf["time"] >= w_start).all()
    assert (pdf["time"] < w_end).all()
    # some result
    return pd.DataFrame.from_records([{"result": 1}])


# `spark` is an existing SparkSession.
df = spark.createDataFrame(
    [(datetime.datetime(2020, 1, 1, 12, 30, 0),)], schema=["time"]
)
w = window("time", "60 minutes")
(
    df.withColumn("window_start", w.start)
    .withColumn("window_end", w.end)
    .groupby(w)
    .applyInPandas(a_udf, schema="result int")
    .show()
)
{code}
Produces:
{code:java}
Pandas : 2020-01-01 12:00:00 to 2020-01-01 13:00:00
Group key: 2020-01-01 11:00:00 to 2020-01-01 12:00:00
Data : 2020-01-01 12:30:00 to 2020-01-01 12:30:00
{code}
And the assertions fail. It seems the group key goes through some timezone
(and DST?) conversion that ends up being one hour off.
This is without any specific timezone configuration and with CEST as the local
timezone.
Is this working as expected?
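For illustration only, the same one-hour shift can be reproduced outside Spark with plain Python datetimes. This is a sketch of one way such an offset could arise (a naive local timestamp being converted to UTC and then treated as naive again), not a statement about what Spark does internally. The fixed UTC+1 offset and the helper name `shift_like_group_key` are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the local zone is one hour ahead of UTC (CET in January).
LOCAL = timezone(timedelta(hours=1), "CET")


def shift_like_group_key(naive_local: datetime) -> datetime:
    """Interpret a naive timestamp as local time, convert it to UTC,
    and drop the zone information again (hypothetical helper)."""
    aware = naive_local.replace(tzinfo=LOCAL)
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


# Values taken from the output above.
window_start = datetime(2020, 1, 1, 12, 0, 0)
window_end = datetime(2020, 1, 1, 13, 0, 0)
event_time = datetime(2020, 1, 1, 12, 30, 0)

key_start = shift_like_group_key(window_start)
key_end = shift_like_group_key(window_end)

print(f"Pandas : {window_start} to {window_end}")
print(f"Group key: {key_start} to {key_end}")
# The event no longer falls inside the shifted window.
print(f"In window: {key_start <= event_time < key_end}")
```

With these inputs the shifted bounds are 11:00 to 12:00, matching the group key printed above, so the 12:30 row falls outside them.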
Setting
{code:java}
"spark.sql.session.timeZone": "UTC"
"spark.driver.extraJavaOptions": "-Duser.timezone=UTC"
"spark.executor.extraJavaOptions": "-Duser.timezone=UTC"
{code}
seems to be a workaround.
Using "Europe/Berlin" for all three timezone settings however reproduces the
inconsistent behaviour.
I would assume, though, that running this in a non-UTC timezone should generally
be possible?
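For reference, a minimal sketch of how the three UTC settings could be applied when the session is created from Python. The helper name `build_utc_session` and the application name are hypothetical, and the sketch assumes no JVM has been started yet, since JVM options such as `-Duser.timezone` cannot change an already-running driver.

```python
# Settings quoted from the workaround above.
UTC_CONF = {
    "spark.sql.session.timeZone": "UTC",
    "spark.driver.extraJavaOptions": "-Duser.timezone=UTC",
    "spark.executor.extraJavaOptions": "-Duser.timezone=UTC",
}


def build_utc_session(app_name="tz-workaround"):
    """Create a SparkSession with the three UTC settings applied
    (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in UTC_CONF.items():
        builder = builder.config(key, value)
    # Assumption: this call starts the JVM; an existing session would be
    # returned as-is and would keep its original JVM timezone.
    return builder.getOrCreate()
```

Calling `spark = build_utc_session()` before running the example would then use UTC throughout.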