[GitHub] [spark] LuciferYang opened a new pull request, #38533: [SPARK-41029][SQL] Optimize the use of `GenericArrayData` constructor for Scala 2.13

GitBox Mon, 07 Nov 2022 04:17:23 -0800


LuciferYang opened a new pull request, #38533:
URL: https://github.com/apache/spark/pull/38533


   ### What changes were proposed in this pull request?
   This pr change to use a more appropriate constructor when the input is 
`ArrayBuffer` or `Empty Collection` to improve the construction performance of 
`GenericArrayData` with Scala 2.13.
   
   
   
   
   ### Why are the changes needed?
   Minor performance improvement.
   
   `GenericArrayData ` has the following constructor
   
   
https://github.com/apache/spark/blob/57d492556768eb341f525ce7eb5c934089fa9e7e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L30
   
   When the input type is `ArrayBuffer`, the following code is similar in Spark
   
   ```
   new GenericArrayData(arrayBuffer.toSeq)
   ```
   
   For Scala 2.12, there will be no performance difference.
   
   However, when Scala 2.13 is used, there will be a performance gap, because 
'toSeq' will cause a redundant memory copy.
   
   For the following test case:
   
   ```scala
     val benchmark = new Benchmark(s"constructor with buffer size = 
$bufferSize",
       valuesPerIteration, output = output)
     benchmark.addCase("toSeq and construct") { _ =>
       var n = 0
       while (n < valuesPerIteration) {
         new GenericArrayData(buffer.toSeq)
         n += 1
       }
     }
   
     benchmark.addCase("construct directly") { _ =>
       var n = 0
       while (n < valuesPerIteration) {
         new GenericArrayData(buffer)
         n += 1
       }
     }
   ```
   
   When bufferSize=10, there is a performance gap of more than 5 times between 
a and b, and the performance gap increases almost linearly with the increase of 
bufferSize
   
   There will be more than 5 times performance gap between `new 
GenericArrayData(buffer.toSeq)` and `new GenericArrayData(buffer)` when 
`bufferSize = 10` and the performance gap will increase with the increase of 
bufferSize.
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   constructor with buffer size = 10:        Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   
------------------------------------------------------------------------------------------------------------------------
   toSeq and construct                                2617           2622       
    7          3.8         261.7       1.0X
   construct directly                                  399            406       
   11         25.1          39.9       6.6X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   constructor with buffer size = 100:       Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   
------------------------------------------------------------------------------------------------------------------------
   toSeq and construct                               12512          12554       
   60          0.8        1251.2       1.0X
   construct directly                                  779            781       
    2         12.8          77.9      16.1X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   constructor with buffer size = 1000:      Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   
------------------------------------------------------------------------------------------------------------------------
   toSeq and construct                              108882         109400       
  732          0.1       10888.2       1.0X
   construct directly                                 5717           5731       
   20          1.7         571.7      19.0X
   ```
   
   
   We can safely change `new GenericArrayData(buffer.toSeq)` to `new 
GenericArrayData(buffer)` due to `ArrayBuffer` is still `scala.collection.Seq` 
in Scala 2.13.
   
   
   On the other hand, when input is an empty set, using `Array.empty` is 10% 
faster than using `Seq.empty`.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   - Pass GitHub Actions
   - Manually test the `catalyst` and `sql` module with Scala 2.13
   
   **will add result**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] LuciferYang opened a new pull request, #38533: [SPARK-41029][SQL] Optimize the use of `GenericArrayData` constructor for Scala 2.13

Reply via email to