[jira] [Updated] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data

zzzzming95 (Jira) Mon, 04 Sep 2023 06:24:05 -0700


     [ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


zzzzming95 updated SPARK-45071:
-------------------------------
    Description: 
Because `BinaryArithmetic#dataType` recursively computes the data type of each 
node, the driver becomes very slow when an expression involves many columns.

For example, the following code:
{code:scala}
```
    import spark.implicits._
    import scala.util.Random
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.expr
    import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
    val N = 30   // number of columns
    val M = 100  // number of rows
    // Random 8-character column names and small random integer values
    val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
    val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))
    val schema = StructType(columns.map(StructField(_, IntegerType)))
    val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
    val df = spark.createDataFrame(rdd, schema)
    // Add a new column that is the sum of the other 30 columns
    df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
```
{code}
 

In Spark 3.4 the driver takes a few minutes to run this code, whereas in 
Spark 3.2 it takes only a few seconds. 
Related issue: SPARK-39316
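
To illustrate why recursive data type resolution can get this expensive, here 
is a toy sketch in Scala. It is not Spark's code: `Expr`, `Leaf`, `SlowAdd`, 
`CachedAdd` and `Counter` are hypothetical names, and the assumption that a 
binary node asks its left child for its type twice per call is made purely for 
illustration. Under that assumption, a left-deep chain such as `a + b + c + ...` 
roughly doubles the number of `dataType` calls with every added column, while 
caching the result per node (here with a `lazy val`) keeps the count linear.

```scala
// Toy model only: hypothetical classes, not Spark's Expression API.
object DataTypeCostSketch {
  // Counts every dataType evaluation so the two variants can be compared.
  object Counter { var calls: Long = 0L }

  sealed trait Expr { def dataType: String }

  final case class Leaf(name: String) extends Expr {
    def dataType: String = { Counter.calls += 1; "int" }
  }

  // Assumed shape: recomputed on every call, and the left child is asked twice.
  final case class SlowAdd(left: Expr, right: Expr) extends Expr {
    def dataType: String = {
      Counter.calls += 1
      (left.dataType, right.dataType) match {
        case ("decimal", "decimal") => "decimal"
        case _                      => left.dataType
      }
    }
  }

  // Same logic, but each node computes its type once and then reuses it.
  final case class CachedAdd(left: Expr, right: Expr) extends Expr {
    lazy val dataType: String = {
      Counter.calls += 1
      (left.dataType, right.dataType) match {
        case ("decimal", "decimal") => "decimal"
        case _                      => left.dataType
      }
    }
  }

  // Builds the left-deep tree ((c0 + c1) + c2) + ... over n leaf columns.
  def chain(n: Int, add: (Expr, Expr) => Expr): Expr =
    (1 until n).foldLeft(Leaf("c0"): Expr)((acc, i) => add(acc, Leaf(s"c$i")))

  // Resets the counter, resolves the root type once, returns the call count.
  def countCalls(e: Expr): Long = {
    Counter.calls = 0L
    val resolved = e.dataType
    require(resolved == "int")
    Counter.calls
  }

  def main(args: Array[String]): Unit = {
    val n = 11 // kept small on purpose; the uncached count doubles per column
    val uncached = countCalls(chain(n, SlowAdd(_, _)))
    val cached = countCalls(chain(n, CachedAdd(_, _)))
    println(s"uncached: $uncached calls, cached: $cached calls")
  }
}
```

With 11 columns (10 additions) the uncached chain in this toy model makes 3070 
`dataType` calls and the cached one makes 22. Whether Spark's own fix takes the 
same caching approach is not stated in this issue.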

  was:
Because `BinaryArithmetic#dataType` recursively computes the data type of each 
node, the driver becomes very slow when an expression involves many columns.

For example, the following code:
```
    import spark.implicits._
    import scala.util.Random
    import org.apache.spark.sql.functions.sum
    import org.apache.spark.sql.types.\{StructType, StructField, IntegerType}

    val N = 30
    val M = 100

    val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
    val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))

    val schema = StructType(columns.map(StructField(_, IntegerType)))
    val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
    val df = spark.createDataFrame(rdd, schema)
    val colExprs = columns.map(sum(_))

    // gen a new column , and add the other 30 column
    df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
```

In Spark 3.4 the driver takes a few minutes to run this code, whereas in 
Spark 3.2 it takes only a few seconds. 
Related issue: SPARK-39316


> Optimize the processing speed of `BinaryArithmetic#dataType` when processing 
> multi-column data
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-45071
>                 URL: https://issues.apache.org/jira/browse/SPARK-45071
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: zzzzming95
>            Priority: Major


