[ https://issues.apache.org/jira/browse/SPARK-44912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brady Bickel resolved SPARK-44912.
----------------------------------
    Resolution: Fixed

Verified that a build containing the linked issue's fix solved the problem.

> Spark 3.4 multi-column sum slows with many columns
> --------------------------------------------------
>
>                 Key: SPARK-44912
>                 URL: https://issues.apache.org/jira/browse/SPARK-44912
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.0, 3.4.1
>            Reporter: Brady Bickel
>            Priority: Major
>
> The code below is a minimal reproducible example of an issue I discovered
> with PySpark 3.4.x. I want to sum the values of multiple columns and put the
> per-row sum of those columns into a new column. This code works and returns
> in a reasonable amount of time in PySpark 3.3.x, but becomes extremely slow
> in PySpark 3.4.x as the number of columns grows. See below for a summary of
> execution time as N varies.
> {code:python}
> import pyspark.sql.functions as F
> import random
> import string
> from functools import reduce
> from operator import add
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
>
> # Generate a dataframe with N columns and M rows, using random 8-character
> # column names and random integers in [-5, 10].
> N = 30
> M = 100
> columns = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
>            for _ in range(N)]
> data = [tuple([random.randint(-5, 10) for _ in range(N)])
>         for _ in range(M)]
> df = spark.sparkContext.parallelize(data).toDF(columns)
>
> # Three ways to add a sum column; all of them are slow for large N in Spark 3.4.
> df = df.withColumn("col_sum1", sum(df[col] for col in columns))
> df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
> df = df.withColumn("col_sum3", F.expr("+".join(columns)))
> {code}
> Timing results for Spark 3.3:
> ||N||Execution Time (s)||
> |5|0.514|
> |10|0.248|
> |15|0.327|
> |20|0.403|
> |25|0.279|
> |30|0.322|
> |50|0.430|
> Timing results for Spark 3.4:
> ||N||Execution Time (s)||
> |5|0.379|
> |10|0.318|
> |15|0.405|
> |20|1.32|
> |25|28.8|
> |30|448|
> |50|>10000 (did not finish)|
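The report does not show how the per-N timings above were collected. A minimal harness along the following lines reproduces the same random-data setup and measures wall-clock time for one of the reported sum variants; the helper name time_sum_columns, the default row count, and the use of df.count() to force evaluation are assumptions, not taken from the issue.

{code:python}
import random
import string
import time
from functools import reduce
from operator import add

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def time_sum_columns(n_cols, n_rows=100):
    """Build the random dataframe from the report and time one sum-column variant."""
    # Random 8-character column names and random integers in [-5, 10],
    # matching the reproduction in the issue description.
    columns = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
               for _ in range(n_cols)]
    data = [tuple(random.randint(-5, 10) for _ in range(n_cols))
            for _ in range(n_rows)]
    df = spark.sparkContext.parallelize(data).toDF(columns)

    start = time.perf_counter()
    df = df.withColumn("col_sum", reduce(add, [F.col(c) for c in columns]))
    df.count()  # force evaluation so analysis and planning time is included
    return time.perf_counter() - start

for n in (5, 10, 15, 20, 25, 30, 50):
    print(f"N={n}: {time_sum_columns(n):.3f}s")
{code}

Absolute numbers will vary by machine, but on an affected 3.4.x build the growth pattern with N should match the second table above.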