asddfl created SPARK-54372:
------------------------------
Summary: PySpark: incorrect `avg(<timestamp>)` query result
Key: SPARK-54372
URL: https://issues.apache.org/jira/browse/SPARK-54372
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 4.0.1
Environment: Platform: Ubuntu 24.04
Linux-6.14.0-35-generic-x86_64-with-glibc2.39
Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10) [GCC 14.3.0]
openjdk version "17.0.17-internal" 2025-10-21
OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode, sharing)
pyspark 4.0.1
duckdb 1.4.2
pandas 2.3.3
pyarrow 22.0.0
Reporter: asddfl
PySpark returns an incorrect result for `avg(<timestamp>)`: the result is a double, not a timestamp.
I expect PySpark to return the same result as DuckDB Spark, which returns a timestamp value.
{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from duckdb.experimental.spark.sql import SparkSession as DuckdbSparkSession

# Query under test: average of a string column cast to TIMESTAMP.
sql_text = "SELECT AVG(CAST(t0.c0 AS TIMESTAMP)) FROM t0"

# Single-row input, so the average should equal the input value.
pd_df = pd.DataFrame({
    'c0': ['1969-12-21'],
})

# PySpark: run the query through SQL and through the DataFrame API.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("t0")

print("PySpark SQL result:")
pyspark_result = spark.sql(sql_text)
pyspark_result.show()

print("PySpark API result:")
pyspark_result = spark.table("t0").select(F.avg(F.col("c0").cast("timestamp")))
pyspark_result.show()

# DuckDB Spark: run the same SQL for comparison.
duckdb_spark = DuckdbSparkSession.builder.getOrCreate()
df = duckdb_spark.createDataFrame(pd_df)
df.createOrReplaceTempView("t0")

print("Duckdb Spark SQL result: ")
duckdb_spark_result = duckdb_spark.sql(sql_text)
duckdb_spark_result.show()
{code}
{code:bash}
PySpark SQL result:
+--------------------------+
|avg(CAST(c0 AS TIMESTAMP))|
+--------------------------+
| -979200.0|
+--------------------------+
PySpark API result:
+--------------------------+
|avg(CAST(c0 AS TIMESTAMP))|
+--------------------------+
| -979200.0|
+--------------------------+
Duckdb Spark SQL result:
┌───────────────────────────────┐
│ avg(CAST(t0.c0 AS TIMESTAMP)) │
│           timestamp           │
├───────────────────────────────┤
│      1969-12-21 00:00:00      │
└───────────────────────────────┘
{code}
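For reference, the value -979200.0 looks like the average expressed as seconds since the Unix epoch rather than an arbitrary number. The following is a minimal sketch using only the Python standard library. It assumes a session time zone of UTC+8, which is not stated in this report and is inferred only because it makes the numbers line up. The names `avg_seconds` and `assumed_session_tz` are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Value printed by PySpark in the output above.
avg_seconds = -979200.0

# Assumption (not stated in the report): the session time zone is UTC+8.
assumed_session_tz = timezone(timedelta(hours=8))

# Interpret the double as seconds since 1970-01-01 00:00:00 UTC.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
as_utc = epoch + timedelta(seconds=avg_seconds)

# Render the same instant in the assumed session time zone.
as_local = as_utc.astimezone(assumed_session_tz)

print(as_utc.isoformat())
print(as_local.isoformat())
```

Under that assumption, the local rendering matches the timestamp returned by DuckDB Spark, which suggests the two engines agree on the instant and differ only in the result type.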