asddfl created SPARK-54372:
------------------------------

             Summary: PySpark: incorrect `avg(<timestamp>)` query result
                 Key: SPARK-54372
                 URL: https://issues.apache.org/jira/browse/SPARK-54372
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 4.0.1
         Environment: Platform: Ubuntu 24.04 Linux-6.14.0-35-generic-x86_64-with-glibc2.39
Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10) [GCC 14.3.0]
openjdk version "17.0.17-internal" 2025-10-21
OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode, sharing)
pyspark 4.0.1
duckdb 1.4.2
pandas 2.3.3
pyarrow 22.0.0

            Reporter: asddfl


PySpark returns an incorrect result for `avg(<timestamp>)`: the query yields a double (-979200.0 in the reproduction below) instead of a timestamp.
I expect PySpark to return the same result as DuckDB's Spark API (`duckdb.experimental.spark`), i.e. a timestamp value.

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from duckdb.experimental.spark.sql import SparkSession as DuckdbSparkSession

sql_text = "SELECT AVG(CAST(t0.c0 AS TIMESTAMP)) FROM t0"
pd_df = pd.DataFrame({
    'c0': ['1969-12-21'],
})

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("t0")

print("PySpark SQL result:")
pyspark_result = spark.sql(sql_text)
pyspark_result.show()

print("PySpark API result:")
pyspark_result = spark.table("t0").select(F.avg(F.col("c0").cast("timestamp")))
pyspark_result.show()

duckdb_spark = DuckdbSparkSession.builder.getOrCreate()
df = duckdb_spark.createDataFrame(pd_df)
df.createOrReplaceTempView("t0")

print("Duckdb Spark SQL result: ")
duckdb_spark_result = duckdb_spark.sql(sql_text)
duckdb_spark_result.show()
{code}


{code:bash}
PySpark SQL result:
+--------------------------+                                                    
|avg(CAST(c0 AS TIMESTAMP))|
+--------------------------+
|                 -979200.0|
+--------------------------+

PySpark API result:
+--------------------------+
|avg(CAST(c0 AS TIMESTAMP))|
+--------------------------+
|                 -979200.0|
+--------------------------+

Duckdb Spark SQL result: 
┌────────────────────────────────┐
│ avg(CAST(t0.c0 AS TIMESTAMP))  │
│           timestamp            │
├────────────────────────────────┤
│ 1969-12-21 00:00:00            │
└────────────────────────────────┘
{code}
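
The value -979200.0 looks like the timestamp expressed as seconds since the Unix epoch, which suggests that `avg` converts the timestamp to a double before averaging and does not convert the result back. The sketch below checks that arithmetic with the Python standard library only. It assumes the Spark session time zone was UTC+8, which is not stated in this report; `ASSUMED_TZ`, `epoch_seconds`, and `from_epoch_seconds` are hypothetical names used for illustration.

```python
from datetime import datetime, timedelta, timezone

# Value printed by PySpark in the output above.
SPARK_AVG = -979200.0

# ASSUMPTION: the session time zone was UTC+8; the report does not state it.
ASSUMED_TZ = timezone(timedelta(hours=8))

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_seconds(date_text: str, tz: timezone) -> float:
    """Seconds since the Unix epoch for midnight of a 'YYYY-MM-DD' date in tz."""
    local_midnight = datetime.strptime(date_text, "%Y-%m-%d").replace(tzinfo=tz)
    return (local_midnight - EPOCH).total_seconds()


def from_epoch_seconds(seconds: float, tz: timezone) -> datetime:
    """Convert seconds since the Unix epoch back to a datetime in tz."""
    return (EPOCH + timedelta(seconds=seconds)).astimezone(tz)


# Midnight on 1969-12-21 at UTC+8 is 11 days and 8 hours before the epoch.
print(epoch_seconds("1969-12-21", ASSUMED_TZ))
# Converting the double back recovers the timestamp that DuckDB returns.
print(from_epoch_seconds(SPARK_AVG, ASSUMED_TZ).isoformat())
```

Under that time zone assumption, the double from PySpark and the timestamp from DuckDB encode the same instant, so the discrepancy would be in the result type rather than in the averaged value.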





--
This message was sent by Atlassian Jira
(v8.20.10#820010)
