[PR] [CELEBORN-1982] Slot Selection Perf Improvements [celeborn]

via GitHub Thu, 24 Apr 2025 00:17:27 -0700


akpatnam25 opened a new pull request, #3228:
URL: https://github.com/apache/celeborn/pull/3228


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     - Make sure the PR title start w/ a JIRA ticket, e.g. '[CELEBORN-XXXX] 
Your PR title ...'.
     - Be sure to keep the PR description updated to reflect all changes.
     - Please write your PR title to summarize what this PR proposes.
     - If possible, provide a concise example to reproduce the issue for a 
faster review.
   -->
   
   ### What changes were proposed in this pull request?
   After profiling to see where the hotspots are for slot selection, we 
identified 2 main areas: 
   - iter.remove 
([link](https://github.com/apache/celeborn/blob/main/master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java#L447))
 is a major hotspot, especially if partitionIdList is massive - since it is an 
ArrayList and we are removing from the begining - resulting in O(n) deletion 
costs. 
   - `haveDisk` is computed per partitionId, iterated across all workers.  We 
precompute this and store it as a field in `WorkerInfo`. 
   
   See the below flamegraph for the hotspot of `iter.remove` 
(`oop_disjoint_arraycopy`). 
   ![Screenshot 2025-04-21 at 3 39 35 
PM](https://github.com/user-attachments/assets/7c02b853-6f43-4ca7-9e7e-d33211a20824)
   
   
   ### Why are the changes needed?
   speed up slot selection performance in the case of large partitionIds
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   After applying the above changes, we can see the hotspot is removed in the 
flamegraph. 
   
   ![Screenshot 2025-04-21 at 3 40 51 
PM](https://github.com/user-attachments/assets/a5c021ad-731d-4ac7-ac08-8798dd00d64f)
   
   ## Benchmarks:
   Without changes: 
   ```
   # Detecting actual CPU count: 12 detected
   # JMH version: 1.37
   # VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
   # VM invoker: 
/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
   # VM options: -javaagent:/Applications/LI IntelliJ IDEA 2023.3 Apple 
Silicon.app/Contents/lib/idea_rt.jar=54704:/Applications/LI IntelliJ IDEA 
2023.3 Apple Silicon.app/Contents/bin -Dfile.encoding=UTF-8
   # Blackhole mode: full + dont-inline hint (auto-detected, use 
-Djmh.blackhole.autoDetect=false to disable)
   # Warmup: 5 iterations, 5 s each
   # Measurement: 5 iterations, 60 s each
   # Timeout: 10 min per iteration
   # Threads: 12 threads, will synchronize iterations
   # Benchmark mode: Average time, time/op
   # Benchmark: 
org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection
   
   # Run progress: 0.00% complete, ETA 00:05:25
   # Fork: 1 of 1
   # Warmup Iteration   1: 2060198.745 ±(99.9%) 306976.270 us/op
   # Warmup Iteration   2: 1137534.950 ±(99.9%) 72065.776 us/op
   # Warmup Iteration   3: 1032434.221 ±(99.9%) 59585.256 us/op
   # Warmup Iteration   4: 903621.382 ±(99.9%) 41542.172 us/op
   # Warmup Iteration   5: 921816.398 ±(99.9%) 44025.884 us/op
   Iteration   1: 853276.360 ±(99.9%) 13285.688 us/op
   Iteration   2: 865183.111 ±(99.9%) 9691.856 us/op
   Iteration   3: 909971.254 ±(99.9%) 10201.037 us/op
   Iteration   4: 874154.240 ±(99.9%) 11287.538 us/op
   Iteration   5: 907655.363 ±(99.9%) 11893.789 us/op
   
   
   Result 
"org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
     882048.066 ±(99.9%) 98360.936 us/op [Average]
     (min, avg, max) = (853276.360, 882048.066, 909971.254), stdev = 25544.023
     CI (99.9%): [783687.130, 980409.001] (assumes normal distribution)
   
   
   # Run complete. Total time: 00:05:43
   
   REMEMBER: The numbers below are just data. To gain reusable insights, you 
need to follow up on
   why the numbers are the way they are. Use profilers (see -prof, -lprof), 
design factorial
   experiments, perform baseline and negative tests that provide experimental 
control, make sure
   the benchmarking environment is safe on JVM/OS/HW level, ask for reviews 
from the domain experts.
   Do not assume the numbers tell you what you want them to tell.
   
   Benchmark                                       Mode  Cnt       Score       
Error  Units
   SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  882048.066 ± 
98360.936  us/op
   
   Process finished with exit code 0
   ```
   With changes: 
   ```
   # Detecting actual CPU count: 12 detected
   # JMH version: 1.37
   # VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
   # VM invoker: 
/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
   # VM options: -javaagent:/Applications/LI IntelliJ IDEA 2023.3 Apple 
Silicon.app/Contents/lib/idea_rt.jar=54585:/Applications/LI IntelliJ IDEA 
2023.3 Apple Silicon.app/Contents/bin -Dfile.encoding=UTF-8
   # Blackhole mode: full + dont-inline hint (auto-detected, use 
-Djmh.blackhole.autoDetect=false to disable)
   # Warmup: 5 iterations, 5 s each
   # Measurement: 5 iterations, 60 s each
   # Timeout: 10 min per iteration
   # Threads: 12 threads, will synchronize iterations
   # Benchmark mode: Average time, time/op
   # Benchmark: 
org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection
   
   # Run progress: 0.00% complete, ETA 00:05:25
   # Fork: 1 of 1
   # Warmup Iteration   1: 305437.719 ±(99.9%) 81860.733 us/op
   # Warmup Iteration   2: 137498.811 ±(99.9%) 7669.102 us/op
   # Warmup Iteration   3: 129355.869 ±(99.9%) 5030.972 us/op
   # Warmup Iteration   4: 135311.734 ±(99.9%) 6964.080 us/op
   # Warmup Iteration   5: 131013.323 ±(99.9%) 8560.232 us/op
   Iteration   1: 133695.396 ±(99.9%) 3713.684 us/op
   Iteration   2: 143735.961 ±(99.9%) 5858.078 us/op
   Iteration   3: 135619.704 ±(99.9%) 5257.352 us/op
   Iteration   4: 128806.160 ±(99.9%) 4541.790 us/op
   Iteration   5: 134179.546 ±(99.9%) 5137.425 us/op
   
   
   Result 
"org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
     135207.354 ±(99.9%) 20845.544 us/op [Average]
     (min, avg, max) = (128806.160, 135207.354, 143735.961), stdev = 5413.522
     CI (99.9%): [114361.809, 156052.898] (assumes normal distribution)
   
   
   # Run complete. Total time: 00:05:29
   
   REMEMBER: The numbers below are just data. To gain reusable insights, you 
need to follow up on
   why the numbers are the way they are. Use profilers (see -prof, -lprof), 
design factorial
   experiments, perform baseline and negative tests that provide experimental 
control, make sure
   the benchmarking environment is safe on JVM/OS/HW level, ask for reviews 
from the domain experts.
   Do not assume the numbers tell you what you want them to tell.
   
   Benchmark                                       Mode  Cnt       Score       
Error  Units
   SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  135207.354 ± 
20845.544  us/op
   
   Process finished with exit code 0
   ```
   
   882048.066 us/ops without changes vs 135207.354 us/op with changes. That is 
about 6.5x improvement. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] [CELEBORN-1982] Slot Selection Perf Improvements [celeborn]

Reply via email to