Matthias Roels created SPARK-44947:
--------------------------------------
Summary: Taking sum of two columns behaves differently from sum aggregation function
Key: SPARK-44947
URL: https://issues.apache.org/jira/browse/SPARK-44947
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.4.1
Environment: * Docker container: python:3.10-slim-bullseye
* Java: openjdk-17-jre-headless
* Spark 3.4.1
Reporter: Matthias Roels
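Adding two columns row by row (foo + bar) handles NULL values differently
from the SUM aggregation over a single column: SUM skips NULLs, whereas the
addition returns NULL for every row in which either operand is NULL. This is
odd and confusing for users.
Reproducible example: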
{code:python}
$ from pyspark.sql import SparkSession
$ from pyspark.sql import functions as f
$ spark = SparkSession.builder.getOrCreate()
$ df = spark.createDataFrame([(1, 2), (2, None)], ["foo", "bar"])
$ df.show()
>
+---+----+
|foo| bar|
+---+----+
|  1|   2|
|  2|null|
+---+----+
$ df.select(f.sum("foo"), f.sum("bar")).show()
>
+--------+--------+
|sum(foo)|sum(bar)|
+--------+--------+
|       3|       2|
+--------+--------+
$ df.select((f.col("foo") + f.col("bar")).alias("sum(foobar)")).show()
>
+-----------+
|sum(foobar)|
+-----------+
|          3|
|       null|
+-----------+{code}
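For comparison, both results are consistent with standard SQL NULL semantics: an aggregate SUM ignores NULL inputs, while the binary + operator yields NULL as soon as either operand is NULL. The following is only a minimal pure-Python sketch of those two rules, using None in place of NULL. The helper names sql_sum and sql_add are hypothetical illustrations, not Spark APIs, and the sketch does not show how Spark implements either operation.

```python
from typing import Iterable, Optional


def sql_sum(values: Iterable[Optional[int]]) -> Optional[int]:
    """Model of aggregate SUM: NULLs are skipped; all-NULL input gives NULL."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


def sql_add(a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Model of the binary + operator: NULL if either operand is NULL."""
    if a is None or b is None:
        return None
    return a + b


# Same data as the reproducible example: columns foo and bar.
rows = [(1, 2), (2, None)]

foo_total = sql_sum(foo for foo, _ in rows)   # aggregate over foo
bar_total = sql_sum(bar for _, bar in rows)   # aggregate over bar, NULL skipped
row_sums = [sql_add(foo, bar) for foo, bar in rows]  # row-wise addition

# Treating NULL as 0 before adding, which is what a coalesce to 0 would do.
row_sums_null_as_zero = [
    sql_add(foo, bar if bar is not None else 0) for foo, bar in rows
]

print(foo_total, bar_total)
print(row_sums)
print(row_sums_null_as_zero)
```

If a row-wise sum that treats NULL as 0 is what is wanted in PySpark, one option is to wrap the nullable column before adding, for example f.col("foo") + f.coalesce(f.col("bar"), f.lit(0)).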