[
https://issues.apache.org/jira/browse/SPARK-54372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043307#comment-18043307
]
Ashrith Bandla edited comment on SPARK-54372 at 12/7/25 6:16 AM:
-----------------------------------------------------------------
I’m investigating. Current behavior: the analyzer implicitly casts the
timestamp to double, so `avg(ts)` returns epoch seconds. In ANSI mode this is
surprising. I'm in favor of making this more visible: in ANSI mode, reject
`avg` on timestamp/date (no implicit cast to numeric) and raise a clear
analysis error, while keeping non-ANSI behavior unchanged. I implemented this
in my PR:
https://github.com/apache/spark/pull/53373
was (Author: JIRAUSER311521):
I’m investigating. Current behavior: the analyzer implicitly casts the
timestamp to double, so `avg(ts)` returns epoch seconds. In ANSI mode this is
surprising. I'm in favor of making this more visible: in ANSI mode, reject
`avg` on timestamp/date (no implicit cast to numeric) and raise a clear
analysis error, while keeping non-ANSI behavior unchanged. I’ll add regression
tests to SQLQueryTestSuite.
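
To make the coercion concrete, here is a minimal pure-Python sketch (no Spark involved) of the arithmetic behind the `-979200.0` in the quoted report: midnight on 1969-12-21 expressed as seconds since the Unix epoch. The UTC+8 session time zone is an assumption inferred from the reported number, not something the issue states, and the helper names (`to_epoch_seconds`, `from_epoch_seconds`, `avg_timestamp`) are hypothetical. The last helper only illustrates what a timestamp-typed average could look like; it is not how Spark or DuckDB implement `avg`.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumption: the reporter's session time zone was UTC+8. This is inferred
# from the reported value, not stated in the issue.
SESSION_TZ = timezone(timedelta(hours=8))


def to_epoch_seconds(ts: datetime) -> float:
    """Seconds since the Unix epoch, analogous to a timestamp-to-double cast."""
    return (ts - EPOCH).total_seconds()


def from_epoch_seconds(seconds: float, tz: timezone) -> datetime:
    """Turn epoch seconds back into a timestamp in the given time zone."""
    return (EPOCH + timedelta(seconds=seconds)).astimezone(tz)


def avg_timestamp(values: list[datetime], tz: timezone) -> datetime:
    """Average timestamps by averaging epoch seconds, then converting back."""
    mean_seconds = sum(to_epoch_seconds(v) for v in values) / len(values)
    return from_epoch_seconds(mean_seconds, tz)


midnight = datetime(1969, 12, 21, tzinfo=SESSION_TZ)
as_double = to_epoch_seconds(midnight)
print(as_double)  # -979200.0
print(avg_timestamp([midnight], SESSION_TZ))  # 1969-12-21 00:00:00+08:00
```

Under that assumption, the double in the report and the timestamp DuckDB prints describe the same instant; the disagreement is about the result type, not the underlying value.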
> PySpark: incorrect `avg(<timestamp>)` query result
> --------------------------------------------------
>
> Key: SPARK-54372
> URL: https://issues.apache.org/jira/browse/SPARK-54372
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.1
> Environment: Platform: Ubuntu 24.04
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025,
> 22:29:10) [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode,
> sharing)
> pyspark 4.0.1
> duckdb 1.4.2
> pandas 2.3.3
> pyarrow 22.0.0
> Reporter: asddfl
> Priority: Critical
> Labels: pull-request-available
>
> The result of the `avg(<timestamp>)` query in PySpark is incorrect.
> I think PySpark should return the same result as DuckDB's Spark API
> (`duckdb.experimental.spark`), namely a timestamp value.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from duckdb.experimental.spark.sql import SparkSession as DuckdbSparkSession
> pd_df = pd.DataFrame({
> 'c0': ['1969-12-21'],
> })
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> print("PySpark result:")
> pyspark_result = (
>     spark.table("t0").groupBy("c0").agg(F.avg(F.col("c0").cast("timestamp")))
> )
> pyspark_result.show()
> duckdb_spark = DuckdbSparkSession.builder.getOrCreate()
> df = duckdb_spark.createDataFrame(pd_df)
> df.createOrReplaceTempView("t0")
> from duckdb.experimental.spark.sql import functions as F
> print("Duckdb Spark result: ")
> duckdb_spark_result = (
>     duckdb_spark.table("t0").groupBy("c0").agg(F.avg(F.col("c0").cast("timestamp")))
> )
> duckdb_spark_result.show()
> {code}
> {code:bash}
> PySpark result:
> +----------+--------------------------+
> |        c0|avg(CAST(c0 AS TIMESTAMP))|
> +----------+--------------------------+
> |1969-12-21|                 -979200.0|
> +----------+--------------------------+
> Duckdb Spark result:
> ┌────────────┬────────────────────────────┐
> │     c0     │ avg(CAST(c0 AS TIMESTAMP)) │
> │  varchar   │         timestamp          │
> ├────────────┼────────────────────────────┤
> │ 1969-12-21 │ 1969-12-21 00:00:00        │
> └────────────┴────────────────────────────┘
> {code}