[
https://issues.apache.org/jira/browse/SPARK-49178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-49178:
-----------------------------------
Labels: pull-request-available (was: )
> `Row#getSeq` exhibits a performance regression between master and 3.5.
> ----------------------------------------------------------------------
>
> Key: SPARK-49178
> URL: https://issues.apache.org/jira/browse/SPARK-49178
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yang Jie
> Priority: Major
> Labels: pull-request-available
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
>
> object GetSeqBenchmark extends SqlBasedBenchmark {
>   import spark.implicits._
>
>   // Times repeated `Row#getSeq` calls on one row holding an array of `arraySize` ints.
>   def testRowGetSeq(valuesPerIteration: Int, arraySize: Int): Unit = {
>     val data = (0 until arraySize).toArray
>     val row = Seq(data).toDF().collect().head
>     val benchmark = new Benchmark(
>       s"Test get seq with $arraySize from row",
>       valuesPerIteration,
>       output = output)
>     benchmark.addCase("Get Seq") { _: Int =>
>       for (_ <- 0L until valuesPerIteration) {
>         val ret = row.getSeq(0)
>       }
>     }
>     benchmark.run()
>   }
>
>   // Runs the benchmark for array sizes from 10 to 100000.
>   override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
>     val valuesPerIteration = 100000
>     testRowGetSeq(valuesPerIteration, 10)
>     testRowGetSeq(valuesPerIteration, 100)
>     testRowGetSeq(valuesPerIteration, 1000)
>     testRowGetSeq(valuesPerIteration, 10000)
>     testRowGetSeq(valuesPerIteration, 100000)
>   }
> }
> {code}
>
> branch-3.5
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
> AMD EPYC 7763 64-Core Processor
> Test get seq with 10 from row:       Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -------------------------------------------------------------------------------------------------------------
> Get Seq                                          1             1          0      194.8          5.1      1.0X
>
> OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
> AMD EPYC 7763 64-Core Processor
> Test get seq with 100 from row:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -------------------------------------------------------------------------------------------------------------
> Get Seq                                          1             1          0       96.8         10.3      1.0X
>
> OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
> AMD EPYC 7763 64-Core Processor
> Test get seq with 1000 from row:     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -------------------------------------------------------------------------------------------------------------
> Get Seq                                          1             1          0       97.0         10.3      1.0X
>
> OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
> AMD EPYC 7763 64-Core Processor
> Test get seq with 10000 from row:    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -------------------------------------------------------------------------------------------------------------
> Get Seq                                          1             1          0       96.8         10.3      1.0X
>
> OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
> AMD EPYC 7763 64-Core Processor
> Test get seq with 100000 from row:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -------------------------------------------------------------------------------------------------------------
> Get Seq                                          1             1          0       96.9         10.3      1.0X
> {code}
> master
> {code:java}
> OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
> AMD EPYC 7763 64-Core Processor
> Test get seq with 10 from row:       Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -------------------------------------------------------------------------------------------------------------
> Get Seq                                          9            10          0       10.5         94.8      1.0X
>
> OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
> AMD EPYC 7763 64-Core Processor
> Test get seq with 100 from row:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -------------------------------------------------------------------------------------------------------------
> Get Seq                                         65            65          1        1.5        646.4      1.0X
>
> OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
> AMD EPYC 7763 64-Core Processor
> Test get seq with 1000 from row:     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -------------------------------------------------------------------------------------------------------------
> Get Seq                                        614           615          1        0.2       6140.2      1.0X
>
> OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
> AMD EPYC 7763 64-Core Processor
> Test get seq with 10000 from row:    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -------------------------------------------------------------------------------------------------------------
> Get Seq                                       6122          6128          8        0.0      61223.1      1.0X
>
> OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
> AMD EPYC 7763 64-Core Processor
> Test get seq with 100000 from row:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -------------------------------------------------------------------------------------------------------------
> Get Seq                                      61247         61268         30        0.0     612468.1      1.0X
> {code}
> In branch-3.5, the latency of `Row#getSeq` is effectively constant: about 5 to 10 ns
> per call regardless of array size. On master, it grows linearly with the array length,
> from 94.8 ns per call at 10 elements to 612468.1 ns at 100000 elements (roughly 6 ns
> per element).
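
The issue text above does not name the cause of the regression. As general background only, one common way a sequence accessor moves from constant to linear cost is by copying the backing array instead of wrapping it. The following is a minimal sketch of that difference, assuming Scala 2.13; `SeqWrapSketch`, `wrapNoCopy`, and `copyToSeq` are hypothetical names for illustration, not Spark APIs, and this is not a claim about what `Row#getSeq` actually does.

```scala
import scala.collection.immutable.ArraySeq

// Hypothetical illustration (not Spark code): two ways to expose an array as a Seq.
object SeqWrapSketch {
  // O(1): wraps the existing array without copying.
  // The result shares storage with `arr`, so `arr` must not be mutated afterwards.
  def wrapNoCopy(arr: Array[Int]): Seq[Int] = ArraySeq.unsafeWrapArray(arr)

  // O(n): copies every element into a new immutable collection.
  def copyToSeq(arr: Array[Int]): Seq[Int] = arr.toVector

  def main(args: Array[String]): Unit = {
    val arr = Array(1, 2, 3)
    val wrapped = wrapNoCopy(arr)
    val copied = copyToSeq(arr)

    // Mutating the array afterwards reveals which result shares its storage.
    arr(0) = 42
    println(wrapped.head) // prints 42: the wrapper sees the mutation
    println(copied.head)  // prints 1: the copy is independent
  }
}
```

The wrapping variant does the same amount of work for any array length, while the copying variant touches every element, which is the kind of constant-versus-linear pattern reported in the benchmark numbers.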