[ https://issues.apache.org/jira/browse/SPARK-54372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042979#comment-18042979 ]

Vindhya G edited comment on SPARK-54372 at 12/5/25 6:19 AM:
------------------------------------------------------------

I see that Scala Spark behaves the same as PySpark: both fall back to the epoch representation (long/double) when the column is cast to timestamp and aggregated. As per the code, PySpark internally casts the timestamp to a Java "double" (epoch format), and Scala does the same. DuckDB Spark, on the other hand, appears to use Python's native datetime object, which causes the difference in behaviour.
[https://github.com/duckdb/duckdb/blob/v1.3-ossivalis/tools/pythonpkg/duckdb/experimental/spark/sql/type_utils.py#L104]
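
To make the epoch-format behaviour visible, here is a minimal sketch (assuming the {{spark}} session and {{t0}} view from the reproduction below): casting the timestamp to double explicitly should yield the same double value that {{avg}} already returns.

{code:python}
from pyspark.sql import functions as F

# Both aggregations are computed over doubles (seconds since the Unix epoch),
# so both should return the same numeric value instead of a timestamp.
spark.table("t0").agg(
    F.avg(F.col("c0").cast("timestamp")).alias("avg_of_timestamp"),
    F.avg(F.col("c0").cast("timestamp").cast("double")).alias("avg_of_epoch_seconds"),
).show(truncate=False)
{code}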

from_unixtime can be used in PySpark to convert the epoch value back to a timestamp representation:

{code:python}
.agg(F.from_unixtime(F.avg(F.col("c0").cast("timestamp"))))
{code}
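
For reference, an end-to-end sketch of the same workaround against the {{t0}} view from the reproduction (the {{avg_ts}} alias is illustrative only); note that {{from_unixtime}} formats the result as a string in the session time zone:

{code:python}
from pyspark.sql import functions as F

# Average the timestamp as epoch seconds, then render it back as a
# timestamp-formatted string with from_unixtime.
(
    spark.table("t0")
    .groupBy("c0")
    .agg(F.from_unixtime(F.avg(F.col("c0").cast("timestamp"))).alias("avg_ts"))
    .show(truncate=False)
)
{code}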



> PySpark: incorrect `avg(<timestamp>)` query result
> --------------------------------------------------
>
>                 Key: SPARK-54372
>                 URL: https://issues.apache.org/jira/browse/SPARK-54372
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 4.0.1
>         Environment: Platform:            Ubuntu 24.04 
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python:              3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 
> 22:29:10) [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode, 
> sharing)
> pyspark                  4.0.1
> duckdb                   1.4.2
> pandas                   2.3.3
> pyarrow                  22.0.0
>            Reporter: asddfl
>            Priority: Critical
>
> The `avg(<timestamp>)` query result of PySpark is incorrect.
> I believe the query results from PySpark should be the same as those from
> DuckDB Spark, i.e., returning a timestamp value.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from duckdb.experimental.spark.sql import SparkSession as DuckdbSparkSession
> pd_df = pd.DataFrame({
>     'c0': ['1969-12-21'],
> })
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> print("PySpark result:")
> pyspark_result = spark.table("t0").groupBy("c0").agg(F.avg(F.col("c0").cast("timestamp")))
> pyspark_result.show()
> duckdb_spark = DuckdbSparkSession.builder.getOrCreate()
> df = duckdb_spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> from duckdb.experimental.spark.sql import functions as F
> print("Duckdb Spark result: ")
> duckdb_spark_result = duckdb_spark.table("t0").groupBy("c0").agg(F.avg(F.col("c0").cast("timestamp")))
> duckdb_spark_result.show()
> {code}
> {code:bash}
> PySpark result:
> +----------+--------------------------+
> |        c0|avg(CAST(c0 AS TIMESTAMP))|
> +----------+--------------------------+
> |1969-12-21|                 -979200.0|
> +----------+--------------------------+
> Duckdb Spark result: 
> ┌────────────┬────────────────────────────┐
> │     c0     │ avg(CAST(c0 AS TIMESTAMP)) │
> │  varchar   │         timestamp          │
> ├────────────┼────────────────────────────┤
> │ 1969-12-21 │ 1969-12-21 00:00:00        │
> └────────────┴────────────────────────────┘
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
