Yang Jie created SPARK-49178:
--------------------------------

             Summary: `Row#getSeq` exhibits a performance regression between master and 3.5.
                 Key: SPARK-49178
                 URL: https://issues.apache.org/jira/browse/SPARK-49178
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yang Jie


{code:java}
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark

object GetSeqBenchmark extends SqlBasedBenchmark {
  import spark.implicits._

  def testRowGetSeq(valuesPerIteration: Int, arraySize: Int): Unit = {
    // Build a single Row holding one array column of `arraySize` ints.
    val data = (0 until arraySize).toArray
    val row = Seq(data).toDF().collect().head

    val benchmark = new Benchmark(
      s"Test get seq with $arraySize from row",
      valuesPerIteration,
      output = output)

    benchmark.addCase("Get Seq") { _: Int =>
      // Call getSeq repeatedly on the same Row; the result is discarded.
      for (_ <- 0L until valuesPerIteration) {
        val ret = row.getSeq(0)
      }
    }

    benchmark.run()
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val valuesPerIteration = 100000
    // Array sizes grow by 10x per case to expose how latency scales.
    testRowGetSeq(valuesPerIteration, 10)
    testRowGetSeq(valuesPerIteration, 100)
    testRowGetSeq(valuesPerIteration, 1000)
    testRowGetSeq(valuesPerIteration, 10000)
    testRowGetSeq(valuesPerIteration, 100000)
  }
}
{code}
 

branch-3.5
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 10 from row:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               1              1           0        194.8           5.1       1.0X

OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 100 from row:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               1              1           0         96.8          10.3       1.0X

OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 1000 from row:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               1              1           0         97.0          10.3       1.0X

OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 10000 from row:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               1              1           0         96.8          10.3       1.0X

OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 100000 from row:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               1              1           0         96.9          10.3       1.0X
{code}
master
{code:java}
OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 10 from row:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               9             10           0         10.5          94.8       1.0X

OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 100 from row:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                              65             65           1          1.5         646.4       1.0X

OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 1000 from row:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                             614            615           1          0.2        6140.2       1.0X

OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 10000 from row:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                            6122           6128           8          0.0       61223.1       1.0X

OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 100000 from row:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                           61247          61268          30          0.0      612468.1       1.0X
{code}
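In branch-3.5, the latency of `Row#getSeq` stays roughly constant (about 5 to
10 ns per call) regardless of array length. In master, it grows linearly with
the length of the array: from 94.8 ns per call at 10 elements to 612468.1 ns
per call at 100000 elements, which is roughly 6 ns per array element.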
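
The scaling can be checked directly from the "Per Row(ns)" column of the two
outputs. The following is only a small arithmetic sketch over the reported
numbers; the object and method names (`GetSeqScaling`, `stepRatios`,
`nsPerElement`) are hypothetical helpers made up for illustration, not part of
Spark or of the benchmark.

```scala
// Hypothetical helper, not part of Spark: arithmetic over the reported numbers.
object GetSeqScaling {
  // Array sizes used by the benchmark; each case is 10x larger than the last.
  val arraySizes: Seq[Int] = Seq(10, 100, 1000, 10000, 100000)

  // "Per Row(ns)" values copied from the benchmark outputs in this report.
  val branch35Ns: Seq[Double] = Seq(5.1, 10.3, 10.3, 10.3, 10.3)
  val masterNs: Seq[Double] = Seq(94.8, 646.4, 6140.2, 61223.1, 612468.1)

  // Ratio between consecutive measurements. A ratio near 1 means constant
  // latency; a ratio near 10 means latency proportional to array length.
  def stepRatios(ns: Seq[Double]): Seq[Double] =
    ns.zip(ns.tail).map { case (prev, next) => next / prev }

  // Latency divided by array length, i.e. cost per element in ns.
  def nsPerElement(ns: Seq[Double]): Seq[Double] =
    ns.zip(arraySizes).map { case (t, n) => t / n }

  def main(args: Array[String]): Unit = {
    println("branch-3.5 step ratios: " + stepRatios(branch35Ns).mkString(", "))
    println("master step ratios:     " + stepRatios(masterNs).mkString(", "))
    println("master ns per element:  " + nsPerElement(masterNs).mkString(", "))
  }
}
```

From 100 elements upward, the master ratios approach 10 for every 10x increase
in array length while the branch-3.5 ratios are 1, which matches the
linear-versus-constant reading of the tables.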



