akpatnam25 opened a new pull request, #3228:
URL: https://github.com/apache/celeborn/pull/3228
### What changes were proposed in this pull request?
Profiling slot selection identified two main hotspots:
- `iter.remove`
([link](https://github.com/apache/celeborn/blob/main/master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java#L447))
is a major hotspot, especially when `partitionIdList` is very large: it is an
`ArrayList` and we remove from the beginning, so each deletion costs O(n).
- `haveDisk` is computed per partitionId while iterating across all workers. We
precompute it and store it as a field in `WorkerInfo`.
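
To illustrate the second point, here is a minimal sketch of caching a "has disk" flag on a worker object so that the selection loop reads a field instead of rescanning disks for every partition. The `Worker` class, `updateDisks`, and the "usable space > 0" rule are hypothetical stand-ins for illustration only; they are not Celeborn's `WorkerInfo` API or the exact condition used in this PR.

```java
import java.util.ArrayList;
import java.util.List;

public class HaveDiskSketch {

  // Hypothetical stand-in for a worker; NOT Celeborn's WorkerInfo.
  public static final class Worker {
    private final List<Long> diskUsableSpaces = new ArrayList<>();
    // Cached flag, refreshed only when the disk list changes.
    private boolean haveDisk;

    public void updateDisks(List<Long> usableSpaces) {
      diskUsableSpaces.clear();
      diskUsableSpaces.addAll(usableSpaces);
      haveDisk = computeHaveDisk();
    }

    // The scan that would otherwise run once per partitionId.
    // Assumption: a disk counts as usable when its free space is positive.
    private boolean computeHaveDisk() {
      for (long space : diskUsableSpaces) {
        if (space > 0) {
          return true;
        }
      }
      return false;
    }

    // O(1) read inside the hot loop.
    public boolean haveDisk() {
      return haveDisk;
    }
  }

  // Example hot-path use: count workers that can take a disk-backed slot.
  public static int countWorkersWithDisk(List<Worker> workers) {
    int count = 0;
    for (Worker worker : workers) {
      if (worker.haveDisk()) {
        count++;
      }
    }
    return count;
  }
}
```

The trade-off in a design like this is that every code path that mutates disk state has to refresh the cached flag, otherwise the field goes stale.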
See the flamegraph below for the `iter.remove` hotspot
(`oop_disjoint_arraycopy`).
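
The cost pattern behind that hotspot can be reproduced in isolation. The sketch below is not the code in this PR: it assumes a plain `ArrayList<Integer>` of partition IDs and compares draining it from the front with `Iterator.remove()` against two O(1)-per-element alternatives (removing from the tail, and polling from an `ArrayDeque`). Class and method names are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class FrontRemovalSketch {

  // Drains from the front via Iterator.remove(). Each removal shifts all
  // remaining elements left by one (an array copy), so the drain is O(n^2).
  public static long drainFromFront(List<Integer> partitionIds) {
    long sum = 0;
    Iterator<Integer> iter = partitionIds.iterator();
    while (iter.hasNext()) {
      sum += iter.next();
      iter.remove();
    }
    return sum;
  }

  // Drains from the tail. Removing the last element needs no shifting, so
  // each removal is O(1). Note that this visits IDs in reverse order.
  public static long drainFromTail(List<Integer> partitionIds) {
    long sum = 0;
    for (int i = partitionIds.size() - 1; i >= 0; i--) {
      sum += partitionIds.remove(i); // remove(int index), not remove(Object)
    }
    return sum;
  }

  // Copies the IDs into an ArrayDeque and polls from the head: O(1) per
  // element and the original order is kept. The input list is left untouched.
  public static long drainDeque(List<Integer> partitionIds) {
    ArrayDeque<Integer> queue = new ArrayDeque<>(partitionIds);
    long sum = 0;
    Integer id;
    while ((id = queue.pollFirst()) != null) {
      sum += id;
    }
    return sum;
  }

  private static List<Integer> ids(int n) {
    List<Integer> list = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      list.add(i);
    }
    return list;
  }

  public static void main(String[] args) {
    int n = 50_000; // arbitrary size, large enough to make the gap visible

    long t0 = System.nanoTime();
    drainFromFront(ids(n));
    long t1 = System.nanoTime();
    drainFromTail(ids(n));
    long t2 = System.nanoTime();
    drainDeque(ids(n));
    long t3 = System.nanoTime();

    System.out.printf("front: %d ms%n", (t1 - t0) / 1_000_000);
    System.out.printf("tail:  %d ms%n", (t2 - t1) / 1_000_000);
    System.out.printf("deque: %d ms%n", (t3 - t2) / 1_000_000);
  }
}
```

This is a rough wall-clock comparison with no warmup, so the absolute numbers are not meaningful; only the growth of the front-removal variant relative to the other two is.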

### Why are the changes needed?
To speed up slot selection when the number of partitionIds is large.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
After applying the above changes, we can see the hotspot is removed in the
flamegraph.

## Benchmarks:
Without changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker:
/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Applications/LI IntelliJ IDEA 2023.3 Apple
Silicon.app/Contents/lib/idea_rt.jar=54704:/Applications/LI IntelliJ IDEA
2023.3 Apple Silicon.app/Contents/bin -Dfile.encoding=UTF-8
# Blackhole mode: full + dont-inline hint (auto-detected, use
-Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark:
org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection
# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration 1: 2060198.745 ±(99.9%) 306976.270 us/op
# Warmup Iteration 2: 1137534.950 ±(99.9%) 72065.776 us/op
# Warmup Iteration 3: 1032434.221 ±(99.9%) 59585.256 us/op
# Warmup Iteration 4: 903621.382 ±(99.9%) 41542.172 us/op
# Warmup Iteration 5: 921816.398 ±(99.9%) 44025.884 us/op
Iteration 1: 853276.360 ±(99.9%) 13285.688 us/op
Iteration 2: 865183.111 ±(99.9%) 9691.856 us/op
Iteration 3: 909971.254 ±(99.9%) 10201.037 us/op
Iteration 4: 874154.240 ±(99.9%) 11287.538 us/op
Iteration 5: 907655.363 ±(99.9%) 11893.789 us/op
Result
"org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
882048.066 ±(99.9%) 98360.936 us/op [Average]
(min, avg, max) = (853276.360, 882048.066, 909971.254), stdev = 25544.023
CI (99.9%): [783687.130, 980409.001] (assumes normal distribution)
# Run complete. Total time: 00:05:43
REMEMBER: The numbers below are just data. To gain reusable insights, you
need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof),
design factorial
experiments, perform baseline and negative tests that provide experimental
control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews
from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark                                       Mode  Cnt       Score       Error  Units
SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  882048.066 ± 98360.936  us/op
Process finished with exit code 0
```
With changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker:
/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Applications/LI IntelliJ IDEA 2023.3 Apple
Silicon.app/Contents/lib/idea_rt.jar=54585:/Applications/LI IntelliJ IDEA
2023.3 Apple Silicon.app/Contents/bin -Dfile.encoding=UTF-8
# Blackhole mode: full + dont-inline hint (auto-detected, use
-Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark:
org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection
# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration 1: 305437.719 ±(99.9%) 81860.733 us/op
# Warmup Iteration 2: 137498.811 ±(99.9%) 7669.102 us/op
# Warmup Iteration 3: 129355.869 ±(99.9%) 5030.972 us/op
# Warmup Iteration 4: 135311.734 ±(99.9%) 6964.080 us/op
# Warmup Iteration 5: 131013.323 ±(99.9%) 8560.232 us/op
Iteration 1: 133695.396 ±(99.9%) 3713.684 us/op
Iteration 2: 143735.961 ±(99.9%) 5858.078 us/op
Iteration 3: 135619.704 ±(99.9%) 5257.352 us/op
Iteration 4: 128806.160 ±(99.9%) 4541.790 us/op
Iteration 5: 134179.546 ±(99.9%) 5137.425 us/op
Result
"org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
135207.354 ±(99.9%) 20845.544 us/op [Average]
(min, avg, max) = (128806.160, 135207.354, 143735.961), stdev = 5413.522
CI (99.9%): [114361.809, 156052.898] (assumes normal distribution)
# Run complete. Total time: 00:05:29
REMEMBER: The numbers below are just data. To gain reusable insights, you
need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof),
design factorial
experiments, perform baseline and negative tests that provide experimental
control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews
from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark                                       Mode  Cnt       Score       Error  Units
SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  135207.354 ± 20845.544  us/op
Process finished with exit code 0
```
882048.066 us/op without changes vs. 135207.354 us/op with changes, which is
about a 6.5x improvement.
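
For reference, the ratio can be recomputed from the two average scores reported in the JMH summaries. The helper names below are made up for the example; the only inputs are the two scores from the runs.

```java
public class SpeedupSketch {

  // How many times faster the optimized run is than the baseline.
  public static double speedup(double baselineUsPerOp, double optimizedUsPerOp) {
    return baselineUsPerOp / optimizedUsPerOp;
  }

  // Percentage of time per operation saved relative to the baseline.
  public static double reductionPercent(double baselineUsPerOp, double optimizedUsPerOp) {
    return (1.0 - optimizedUsPerOp / baselineUsPerOp) * 100.0;
  }

  public static void main(String[] args) {
    double before = 882048.066; // us/op, without changes
    double after = 135207.354;  // us/op, with changes
    System.out.printf("speedup: %.2fx%n", speedup(before, after));
    System.out.printf("time reduction: %.1f%%%n", reductionPercent(before, after));
  }
}
```

The 99.9% confidence intervals of the two runs, [783687.130, 980409.001] and [114361.809, 156052.898] us/op, do not overlap, although each run used a single fork.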