zzzzming95 created SPARK-45071:
----------------------------------

             Summary: Optimize the processing speed of 
`BinaryArithmetic#dataType` when processing multi-column data
                 Key: SPARK-45071
                 URL: https://issues.apache.org/jira/browse/SPARK-45071
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.0, 3.5.0
            Reporter: zzzzming95


Since `BinaryArithmetic#dataType` recursively computes the data type of each 
child node, the driver becomes very slow when an expression involves many columns.

For example, the following code:
```
    import spark.implicits._
    import scala.util.Random
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{expr, sum}
    import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

    val N = 30
    val M = 100

    val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
    val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))

    val schema = StructType(columns.map(StructField(_, IntegerType)))
    val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
    val df = spark.createDataFrame(rdd, schema)
    val colExprs = columns.map(sum(_))

    // generate a new column by adding up the other 30 columns
    df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
```

On Spark 3.4 this code takes a few minutes for the driver to execute, while on 
Spark 3.2 it takes only a few seconds. 
Related issue: SPARK-39316
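
To illustrate the shape of the problem, here is a minimal standalone sketch (not Spark code; all names are hypothetical) of why an uncached, recursive `dataType` blows up on a left-deep chain such as `c1 + c2 + ... + cN`: every node recomputes its whole subtree, so the total work grows quadratically in the number of columns.

```scala
// Minimal model of a binary expression tree whose dataType recurses into
// its children on every call, with no caching.
object DataTypeRecursionSketch {
  var calls = 0

  sealed trait Expr { def dataType: String }

  case class Leaf(name: String) extends Expr {
    def dataType: String = { calls += 1; "int" }
  }

  case class Add(left: Expr, right: Expr) extends Expr {
    // Recomputes both children's types on every call.
    def dataType: String = { calls += 1; left.dataType; right.dataType; "int" }
  }

  // Build the left-deep chain c1 + c2 + ... + cN, as produced by
  // expr(columns.mkString(" + ")).
  def chain(n: Int): Expr =
    (2 to n).foldLeft(Leaf("c1"): Expr)((acc, i) => Add(acc, Leaf(s"c$i")))

  // An analyzer that asks every node for its dataType re-traverses that
  // node's entire subtree each time, so total calls are quadratic in n.
  def countCalls(n: Int): Int = {
    calls = 0
    def visit(e: Expr): Unit = {
      e.dataType
      e match {
        case Add(l, r) => visit(l); visit(r)
        case _         => ()
      }
    }
    visit(chain(n))
    calls
  }

  def main(args: Array[String]): Unit = {
    println(countCalls(10))
    println(countCalls(30)) // grows roughly quadratically, not linearly
  }
}
```

Caching the computed data type per node (as older versions effectively avoided this repeated work) would bring the traversal back to linear cost.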



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
