Yang Jie created SPARK-49178:
--------------------------------
Summary: `Row#getSeq` exhibits a performance regression between master and 3.5.
Key: SPARK-49178
URL: https://issues.apache.org/jira/browse/SPARK-49178
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Yang Jie
{code:java}
object GetSeqBenchmark extends SqlBasedBenchmark {
  import spark.implicits._

  def testRowGetSeq(valuesPerIteration: Int, arraySize: Int): Unit = {
    val data = (0 until arraySize).toArray
    val row = Seq(data).toDF().collect().head

    val benchmark = new Benchmark(
      s"Test get seq with $arraySize from row",
      valuesPerIteration,
      output = output)

    benchmark.addCase("Get Seq") { _: Int =>
      for (_ <- 0L until valuesPerIteration) {
        val ret = row.getSeq(0)
      }
    }

    benchmark.run()
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val valuesPerIteration = 100000
    testRowGetSeq(valuesPerIteration, 10)
    testRowGetSeq(valuesPerIteration, 100)
    testRowGetSeq(valuesPerIteration, 1000)
    testRowGetSeq(valuesPerIteration, 10000)
    testRowGetSeq(valuesPerIteration, 100000)
  }
}
{code}
branch-3.5
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 10 from row:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               1              1           0        194.8           5.1       1.0X

OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 100 from row:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               1              1           0         96.8          10.3       1.0X

OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 1000 from row:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               1              1           0         97.0          10.3       1.0X

OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 10000 from row:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               1              1           0         96.8          10.3       1.0X

OpenJDK 64-Bit Server VM 1.8.0_422-b05 on Linux 5.15.0-1068-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 100000 from row:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               1              1           0         96.9          10.3       1.0X
{code}
master
{code:java}
OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 10 from row:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                               9             10           0         10.5          94.8       1.0X

OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 100 from row:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                              65             65           1          1.5         646.4       1.0X

OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 1000 from row:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                             614            615           1          0.2        6140.2       1.0X

OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 10000 from row:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                            6122           6128           8          0.0       61223.1       1.0X

OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
Test get seq with 100000 from row:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Get Seq                                           61247          61268          30          0.0      612468.1       1.0X
{code}
On branch-3.5, the per-call latency of `Row#getSeq` is effectively constant
(about 5 to 10 ns) regardless of array length, whereas on master it grows
linearly with the array length, from 94.8 ns at 10 elements to about 612 µs
at 100000 elements.
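
The report does not state the cause of the regression. One general way an accessor can go from constant to linear cost is by switching from wrapping the backing array to copying it into a new collection. The following is a minimal sketch of that difference in plain Scala 2.13, not Spark's implementation; `SeqWrapVsCopy`, `wrap`, and `copy` are hypothetical names introduced only for illustration.

```scala
import scala.collection.immutable.ArraySeq

// Illustration only (assumes Scala 2.13): two ways to expose an Array as a Seq.
object SeqWrapVsCopy {
  // O(1): wraps the existing array without copying its elements.
  // "Unsafe" because later writes to `arr` are visible through the Seq.
  def wrap(arr: Array[Int]): Seq[Int] = ArraySeq.unsafeWrapArray(arr)

  // O(n): in Scala 2.13, Array#toSeq copies the elements into a new
  // immutable sequence, so the cost grows with the array length.
  def copy(arr: Array[Int]): Seq[Int] = arr.toSeq

  def main(args: Array[String]): Unit = {
    val arr = Array(1, 2, 3)
    val wrapped = wrap(arr)
    val copied = copy(arr)

    // Mutating the source array shows which Seq shares its storage.
    arr(0) = 99
    println(wrapped.head) // prints 99: the wrapper shares the array
    println(copied.head)  // prints 1: the copy is independent
  }
}
```

If something similar applies here, a constant-time wrapper would be expected to give a flat latency curve, while a per-call copy would scale with array length as in the master numbers above.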