[
https://issues.apache.org/jira/browse/SPARK-45745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781539#comment-17781539
]
Javier commented on SPARK-45745:
--------------------------------
I originally posted a comment in StackOverflow asking for feedback on this
([https://stackoverflow.com/questions/77391731/extremely-slow-execution-in-spark-3-4-1-when-computing-the-sum-of-pyspark-datafr])
and a user there pointed me to a problem to a never ending UT reported here
https://issues.apache.org/jira/browse/SPARK-43972 It is for the same Spark
version, but I honestly don't know if this can be related.
> Extremely slow execution of sum of columns in Spark 3.4.1
> ---------------------------------------------------------
>
> Key: SPARK-45745
> URL: https://issues.apache.org/jira/browse/SPARK-45745
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.4.1
> Reporter: Javier
> Priority: Major
>
> We are in the process of upgrading some pySpark jobs from Spark 3.1.2 to
> Spark 3.4.1 and some code that was running fine is now basically never ending
> even for small dataframes.
> We have simplified the problematic piece of code and the minimum pySpark
> example below shows the issue:
> {code:java}
> n_cols = 50
> data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)]
> df_data = sql_context.createDataFrame(data)
> df_data = df_data.withColumn(
> "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)])
> )
> df_data.show(10, False) {code}
> Basically, this code with Spark 3.1.2 runs fine but with 3.4.1 the
> computation time seems to explode when the value of `n_cols` is bigger than
> about 25 columns. A colleague suggested that it could be related to the limit
> of 22 elements in a tuple in Scala 2.13, since the 25 columns are
> suspiciously close to this. Is there any known defect in the logical plan
> optimization in 3.4.1? Or is this kind of operations (sum of multiple
> columns) supposed to be implemented differently?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]