Matthias Roels created SPARK-44947:
--------------------------------------

             Summary: Taking sum of two columns behaves differently from sum 
aggregation function
                 Key: SPARK-44947
                 URL: https://issues.apache.org/jira/browse/SPARK-44947
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.4.1
         Environment: * Docker container: python:3.10-slim-bullseye
 * Java: openjdk-17-jre-headless
 * Spark 3.4.1
            Reporter: Matthias Roels


Adding two columns with + handles NULL values differently from the SUM 
aggregation function: f.sum skips NULLs, whereas foo + bar returns NULL 
whenever either operand is NULL. This is odd and confusing for users.

Reproducible example: 
{code:python}
$ from pyspark.sql import SparkSession
$ from pyspark.sql import functions as f
$ spark = SparkSession.builder.getOrCreate()

$ df = spark.createDataFrame([(1, 2), (2, None)], ["foo", "bar"])
$ df.show()
> 
+---+----+
|foo| bar|
+---+----+
|  1|   2|
|  2|null|
+---+----+

$ df.select(f.sum("foo"), f.sum("bar")).show()
>
+--------+--------+
|sum(foo)|sum(bar)|
+--------+--------+
|       3|       2|
+--------+--------+

$ df.select((f.col("foo") + f.col("bar")).alias("sum(foobar)")).show()
> 
+-----------+
|sum(foobar)|
+-----------+
|          3|
|       null|
+-----------+{code}
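
The difference above matches standard SQL NULL semantics, which can be sketched in plain Python without Spark. This is only an illustrative model, not Spark's implementation: the helper names sql_add and sql_sum are hypothetical, and None stands in for NULL.

```python
def sql_add(a, b):
    """Model of the binary + operator: NULL if either operand is NULL."""
    if a is None or b is None:
        return None
    return a + b


def sql_sum(values):
    """Model of the SUM aggregate: skips NULLs, NULL if nothing is left."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


# Same data as the DataFrame in the example: columns (foo, bar).
rows = [(1, 2), (2, None)]

# Row-wise addition propagates the NULL in the second row.
row_sums = [sql_add(foo, bar) for foo, bar in rows]

# Column-wise aggregation ignores the NULL in bar.
col_sums = (sql_sum(r[0] for r in rows), sql_sum(r[1] for r in rows))

# Replacing NULL with 0 before adding gives a NULL-safe row-wise sum.
null_safe = [sql_add(foo, 0 if bar is None else bar) for foo, bar in rows]

print(row_sums, col_sums, null_safe)
```

In PySpark, the NULL-safe variant would correspond to something like f.col("foo") + f.coalesce(f.col("bar"), f.lit(0)), assuming that treating NULL as 0 is the intended behaviour.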



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
