zzzzming95 created SPARK-45071:
----------------------------------
Summary: Optimize the processing speed of
`BinaryArithmetic#dataType` when processing multi-column data
Key: SPARK-45071
URL: https://issues.apache.org/jira/browse/SPARK-45071
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.0, 3.5.0
Reporter: zzzzming95
Since `BinaryArithmetic#dataType` will recursively process the datatype of each
node, the driver will be very slow when multiple columns are processed.
For example, the following code:
```
import spark.implicits._
import scala.util.Random
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.\{StructType, StructField, IntegerType}
val N = 30
val M = 100
val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))
val schema = StructType(columns.map(StructField(_, IntegerType)))
val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
val df = spark.createDataFrame(rdd, schema)
val colExprs = columns.map(sum(_))
// gen a new column , and add the other 30 column
df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
```
This code will take a few minutes for the driver to execute in the spark3.4
version, but only takes a few seconds to execute in the spark3.2 version.
Related issue: SPARK-39316
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]