LuciferYang opened a new pull request, #38533: URL: https://github.com/apache/spark/pull/38533
### What changes were proposed in this pull request? This pr change to use a more appropriate constructor when the input is `ArrayBuffer` or `Empty Collection` to improve the construction performance of `GenericArrayData` with Scala 2.13. ### Why are the changes needed? Minor performance improvement. `GenericArrayData ` has the following constructor https://github.com/apache/spark/blob/57d492556768eb341f525ce7eb5c934089fa9e7e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L30 When the input type is `ArrayBuffer`, the following code is similar in Spark ``` new GenericArrayData(arrayBuffer.toSeq) ``` For Scala 2.12, there will be no performance difference. However, when Scala 2.13 is used, there will be a performance gap, because 'toSeq' will cause a redundant memory copy. For the following test case: ```scala val benchmark = new Benchmark(s"constructor with buffer size = $bufferSize", valuesPerIteration, output = output) benchmark.addCase("toSeq and construct") { _ => var n = 0 while (n < valuesPerIteration) { new GenericArrayData(buffer.toSeq) n += 1 } } benchmark.addCase("construct directly") { _ => var n = 0 while (n < valuesPerIteration) { new GenericArrayData(buffer) n += 1 } } ``` When bufferSize=10, there is a performance gap of more than 5 times between a and b, and the performance gap increases almost linearly with the increase of bufferSize There will be more than 5 times performance gap between `new GenericArrayData(buffer.toSeq)` and `new GenericArrayData(buffer)` when `bufferSize = 10` and the performance gap will increase with the increase of bufferSize. ``` OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz constructor with buffer size = 10: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ toSeq and construct 2617 2622 7 3.8 261.7 1.0X construct directly 399 406 11 25.1 39.9 6.6X OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz constructor with buffer size = 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ toSeq and construct 12512 12554 60 0.8 1251.2 1.0X construct directly 779 781 2 12.8 77.9 16.1X OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz constructor with buffer size = 1000: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ toSeq and construct 108882 109400 732 0.1 10888.2 1.0X construct directly 5717 5731 20 1.7 571.7 19.0X ``` We can safely change `new GenericArrayData(buffer.toSeq)` to `new GenericArrayData(buffer)` due to `ArrayBuffer` is still `scala.collection.Seq` in Scala 2.13. On the other hand, when input is an empty set, using `Array.empty` is 10% faster than using `Seq.empty`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manually test the `catalyst` and `sql` module with Scala 2.13 **will add result** -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
