[
https://issues.apache.org/jira/browse/SPARK-45745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Javier updated SPARK-45745:
---------------------------
Description:
We are in the process of upgrading some PySpark jobs from Spark 3.1.2 to Spark
3.4.1, and code that used to run fine now effectively never finishes, even for
small DataFrames.
We have reduced the problematic code to the minimal PySpark example below,
which reproduces the issue:
{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

n_cols = 50
# 5 rows, each with columns col0..col{n_cols-1}
data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)]
df_data = spark.createDataFrame(data)
# Sum all n_cols columns using Python's built-in sum over Column objects
df_data = df_data.withColumn(
    "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)])
)
df_data.show(10, False)
{code}
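For reference, Python's built-in `sum` folds the list left to right starting
from the integer 0, so the resulting Column is a single deeply nested chain of
binary additions rather than a flat n-ary sum. A rough sketch of the equivalent
expression, for illustration only:
{code:python}
# sum([F.col("col0"), ..., F.col("col49")]) builds the same expression as:
#   ((((0 + col0) + col1) + col2) + ... + col49)
# i.e. one Column wrapping ~n_cols nested binary "+" expressions.
col_sum = F.lit(0)
for i in range(n_cols):
    col_sum = col_sum + F.col(f"col{i}")
{code}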
This code runs fine on Spark 3.1.2, but on 3.4.1 the computation time explodes
once `n_cols` grows beyond roughly 25 columns. A colleague suggested it could be
related to the 22-element limit on tuples in Scala 2.13
(https://www.scala-lang.org/api/current/scala/Tuple22.html), since 25 columns is
suspiciously close to that limit. Is there a known defect in the logical plan
optimization in 3.4.1, or should this kind of operation (summing many columns)
be implemented differently?
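In case it is useful, here is a sketch of an alternative formulation that avoids
the nested chain of binary additions by folding over an array column with
`F.aggregate` (available in PySpark since 3.1). This is only a possible
workaround under the assumption that the nested expression tree is the culprit,
not a confirmed fix:
{code:python}
# Sketch of a possible workaround: fold over an array column instead of
# building a deeply nested chain of "+" expressions. Not a confirmed fix.
cols = [F.col(f"col{i}") for i in range(n_cols)]
df_data = df_data.withColumn(
    "col_sum",
    F.aggregate(F.array(*cols), F.lit(0), lambda acc, x: acc + x),
)
{code}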
> Extremely slow execution of sum of columns in Spark 3.4.1
> ---------------------------------------------------------
>
> Key: SPARK-45745
> URL: https://issues.apache.org/jira/browse/SPARK-45745
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.4.1
> Reporter: Javier
> Priority: Major
>