[ 
https://issues.apache.org/jira/browse/SPARK-45745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781539#comment-17781539
 ] 

Javier commented on SPARK-45745:
--------------------------------

I originally posted a comment in StackOverflow asking for feedback on this 
([https://stackoverflow.com/questions/77391731/extremely-slow-execution-in-spark-3-4-1-when-computing-the-sum-of-pyspark-datafr])
 and a user there pointed me to a problem to a never ending UT reported here 
https://issues.apache.org/jira/browse/SPARK-43972 It is for the same Spark 
version, but I honestly don't know if this can be related.

> Extremely slow execution of sum of columns in Spark 3.4.1
> ---------------------------------------------------------
>
>                 Key: SPARK-45745
>                 URL: https://issues.apache.org/jira/browse/SPARK-45745
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.1
>            Reporter: Javier
>            Priority: Major
>
> We are in the process of upgrading some pySpark jobs from Spark 3.1.2 to 
> Spark 3.4.1 and some code that was running fine is now basically never ending 
> even for small dataframes.
> We have simplified the problematic piece of code and the minimum pySpark 
> example below shows the issue:
> {code:java}
> n_cols = 50
> data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)]
> df_data = sql_context.createDataFrame(data)
> df_data = df_data.withColumn(
>     "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)])
> )
> df_data.show(10, False) {code}
> Basically, this code with Spark 3.1.2 runs fine but with 3.4.1 the 
> computation time seems to explode when the value of `n_cols` is bigger than 
> about 25 columns. A colleague suggested that it could be related to the limit 
> of 22 elements in a tuple in Scala 2.13, since the 25 columns are 
> suspiciously close to this. Is there any known defect in the logical plan 
> optimization in 3.4.1? Or is this kind of operations (sum of multiple 
> columns) supposed to be implemented differently?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to