heesung-sn opened a new pull request, #19343:
URL: https://github.com/apache/pulsar/pull/19343
<!--
### Contribution Checklist
- PR title format should be *[type][component] summary*. For details, see
*[Guideline - Pulsar PR Naming
Convention](https://docs.google.com/document/d/1d8Pw6ZbWk-_pCKdOmdvx9rnhPiyuxwq60_TrD68d7BA/edit#heading=h.trs9rsex3xom)*.
- Fill out the template below to describe the changes contributed by the
pull request. That will give reviewers the context they need to do the review.
- Each pull request should address only one issue, not mix up code from
multiple issues.
- Each commit in the pull request has a meaningful commit message
- Once all items of the checklist are addressed, remove the above text and
this checklist, leaving only the filled out template below.
-->
### Motivation
Related to https://github.com/apache/pulsar/pull/19341,
- When there are many concurrent subscriptions in a topic(10k> ), brokers'
performance degrades because `checkBackloggedCursors()` iterates all
subscriptions for each subscribe call.
```
"pulsar-io-4-2" #77 prio=5 os_prio=0 cpu=17289.90ms elapsed=170.09s
tid=0x00007f85b0085800 nid=0xa2 runnable [0x00007f859d05e000]
java.lang.Thread.State: RUNNABLE
at
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.deactivateCursor(ManagedLedgerImpl.java:3526)
- locked <0x00007f8921b24c58> (a
org.apache.bookkeeper.mledger.impl.ManagedCursorContainer)
at
org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setInactive(ManagedCursorImpl.java:1116)
at
org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$checkBackloggedCursors$95(PersistentTopic.java:2363)
at
org.apache.pulsar.broker.service.persistent.PersistentTopic$$Lambda$1346/0x00007f8465c23cb0.accept(Unknown
Source)
at
org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:544)
at
org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:272)
--
at java.lang.Thread.run([email protected]/Thread.java:829)
"pulsar-io-4-3" #78 prio=5 os_prio=0 cpu=17224.94ms elapsed=170.09s
tid=0x00007f85b0087000 nid=0xa3 waiting for monitor entry [0x00007f859cf5d000]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.deactivateCursor(ManagedLedgerImpl.java:3526)
- waiting to lock <0x00007f8921b24c58> (a
org.apache.bookkeeper.mledger.impl.ManagedCursorContainer)
at
org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setInactive(ManagedCursorImpl.java:1116)
at
org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$checkBackloggedCursors$95(PersistentTopic.java:2363)
at
org.apache.pulsar.broker.service.persistent.PersistentTopic$$Lambda$1346/0x00007f8465c23cb0.accept(Unknown
Source)
at
org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:544)
at
org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:272)
--
at java.lang.Thread.run([email protected]/Thread.java:829)
```
- Running `checkBackloggedCursors()` in the subscription path is redundant
as `checkBackloggedCursors()` is called in `updateStats()` every
`statsUpdateFrequencyInSecs`(by default 1min).
### Modifications
Removed `checkBackloggedCursors()` from `subscribe()` call.
### Verifying this change
- [x] Make sure that the change passes the CI checks.
This change can be verified as follows:
**simulate many concurrent subscriptions**
```
#!/bin/sh
while :
do
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss a &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss b &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss c &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss d &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss e &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss f &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss g &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss h &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss i &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss j &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss k &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss l &
bin/pulsar-perf consume my-topic -r 10 -ns 10000 -st Exclusive -sp
Earliest -time 3 -ss m &
for job in `jobs -p`
do
echo $job
wait $job
done
done
```
**before this PR**
```
bin/pulsar-perf produce my-topic -r 10 -m 1000
2023-01-27T18:47:30,430-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 0
msg --- 0.0 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 0.000 ms - med: 0.000 - 95pct: 0.000 - 99pct: 0.000 -
99.9pct: 0.000 - 99.99pct: 0.000 - Max: 0.000
2023-01-27T18:47:40,510-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 119
msg --- 11.8 msg/s --- 0.1 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 6834.356 ms - med: 6834.815 - 95pct: 12128.831 - 99pct:
12629.887 - 99.9pct: 12728.447 - 99.99pct: 12728.447 - Max: 12728.447
2023-01-27T18:47:50,531-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 119
msg --- 0.0 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 0.000 ms - med: 0.000 - 95pct: 0.000 - 99pct: 0.000 -
99.9pct: 0.000 - 99.99pct: 0.000 - Max: 0.000
2023-01-27T18:48:00,555-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 119
msg --- 0.0 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 0.000 ms - med: 0.000 - 95pct: 0.000 - 99pct: 0.000 -
99.9pct: 0.000 - 99.99pct: 0.000 - Max: 0.000
2023-01-27T18:48:10,583-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 119
msg --- 0.0 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 0.000 ms - med: 0.000 - 95pct: 0.000 - 99pct: 0.000 -
99.9pct: 0.000 - 99.99pct: 0.000 - Max: 0.000
2023-01-27T18:48:20,616-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 119
msg --- 0.0 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 0.000 ms - med: 0.000 - 95pct: 0.000 - 99pct: 0.000 -
99.9pct: 0.000 - 99.99pct: 0.000 - Max: 0.000
2023-01-27T18:48:30,656-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 119
msg --- 0.0 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 0.000 ms - med: 0.000 - 95pct: 0.000 - 99pct: 0.000 -
99.9pct: 0.000 - 99.99pct: 0.000 - Max: 0.000
2023-01-27T18:48:40,683-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 119
msg --- 0.0 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 0.000 ms - med: 0.000 - 95pct: 0.000 - 99pct: 0.000 -
99.9pct: 0.000 - 99.99pct: 0.000 - Max: 0.000
2023-01-27T18:48:50,711-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 119
msg --- 0.0 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 0.000 ms - med: 0.000 - 95pct: 0.000 - 99pct: 0.000 -
99.9pct: 0.000 - 99.99pct: 0.000 - Max: 0.000
2023-01-27T18:49:00,734-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 119
msg --- 0.0 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 0.000 ms - med: 0.000 - 95pct: 0.000 - 99pct: 0.000 -
99.9pct: 0.000 - 99.99pct: 0.000 - Max: 0.000
2023-01-27T18:49:01,300-0800 [pulsar-perf-producer-exec-1-1] INFO
org.apache.pulsar.testclient.PerformanceProducer - ------------- DONE (reached
the maximum number: 1000 of production) --------------
2023-01-27T18:49:06,306-0800 [Thread-0] INFO
org.apache.pulsar.testclient.PerformanceProducer - Aggregated throughput stats
--- 119 records sent --- 1.123 msg/s --- 0.009 Mbit/s
2023-01-27T18:49:06,337-0800 [Thread-0] INFO
org.apache.pulsar.testclient.PerformanceProducer - Aggregated latency stats ---
Latency: mean: 6834.356 ms - med: 6834.815 - 95pct: 12128.831 - 99pct:
12629.887 - 99.9pct: 12728.447 - 99.99pct: 12728.447 - 99.999pct: 12728.447 -
Max: 12728.447
```
**after this PR**
```
bin/pulsar-perf produce my-topic -r 10 -m 1000
2023-01-27T21:26:43,991-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 15
msg --- 1.5 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 5465.266 ms - med: 6249.727 - 95pct: 7482.431 - 99pct: 7584.639
- 99.9pct: 7584.639 - 99.99pct: 7584.639 - Max: 7584.639
2023-01-27T21:26:54,025-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 62
msg --- 4.6 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 9959.775 ms - med: 10424.703 - 95pct: 12068.927 - 99pct:
12823.935 - 99.9pct: 12823.935 - 99.99pct: 12823.935 - Max: 12823.935
2023-01-27T21:27:04,048-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 97
msg --- 3.5 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 16149.440 ms - med: 16641.855 - 95pct: 19121.023 - 99pct:
19414.143 - 99.9pct: 19414.143 - 99.99pct: 19414.143 - Max: 19414.143
2023-01-27T21:27:14,070-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 131
msg --- 3.4 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 22474.266 ms - med: 22154.495 - 95pct: 24488.959 - 99pct:
25234.559 - 99.9pct: 25234.559 - 99.99pct: 25234.559 - Max: 25234.559
2023-01-27T21:27:24,106-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 192
msg --- 6.1 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 28824.695 ms - med: 28932.479 - 95pct: 30471.679 - 99pct:
30507.007 - 99.9pct: 30516.479 - 99.99pct: 30516.479 - Max: 30516.479
2023-01-27T21:27:34,143-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 269
msg --- 7.7 msg/s --- 0.1 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 31316.609 ms - med: 30886.527 - 95pct: 32863.231 - 99pct:
32963.199 - 99.9pct: 33045.759 - 99.99pct: 33045.759 - Max: 33045.759
2023-01-27T21:27:44,178-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 418
msg --- 14.9 msg/s --- 0.1 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 30904.110 ms - med: 31308.287 - 95pct: 33457.919 - 99pct:
33696.511 - 99.9pct: 33730.303 - 99.99pct: 33730.303 - Max: 33730.303
2023-01-27T21:27:54,205-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 792
msg --- 37.3 msg/s --- 0.3 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 12566.494 ms - med: 11749.119 - 95pct: 25876.223 - 99pct:
27032.319 - 99.9pct: 27429.375 - 99.99pct: 27429.375 - Max: 27429.375
2023-01-27T21:28:04,231-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 892
msg --- 10.0 msg/s --- 0.1 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 158.367 ms - med: 103.553 - 95pct: 512.813 - 99pct: 626.371 -
99.9pct: 729.767 - 99.99pct: 729.767 - Max: 729.767
2023-01-27T21:28:14,254-0800 [main] INFO
org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 992
msg --- 10.0 msg/s --- 0.1 Mbit/s --- failure 0.0 msg/s ---
Latency: mean: 113.023 ms - med: 87.656 - 95pct: 275.551 - 99pct: 362.147 -
99.9pct: 384.981 - 99.99pct: 384.981 - Max: 384.981
2023-01-27T21:28:14,871-0800 [pulsar-perf-producer-exec-1-1] INFO
org.apache.pulsar.testclient.PerformanceProducer - ------------- DONE (reached
the maximum number: 1000 of production) --------------
2023-01-27T21:28:19,879-0800 [Thread-0] INFO
org.apache.pulsar.testclient.PerformanceProducer - Aggregated throughput stats
--- 1000 records sent --- 9.434 msg/s --- 0.074 Mbit/s
2023-01-27T21:28:19,942-0800 [Thread-0] INFO
org.apache.pulsar.testclient.PerformanceProducer - Aggregated latency stats ---
Latency: mean: 15404.022 ms - med: 14796.287 - 95pct: 32638.591 - 99pct:
33425.023 - 99.9pct: 33696.511 - 99.99pct: 33730.303 - 99.999pct: 33730.303 -
Max: 33730.303
```
### Does this pull request potentially affect one of the following parts:
<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
*If the box was checked, please highlight the changes*
- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] The metrics
- [ ] Anything that affects deployment
### Documentation
<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
- [ ] `doc` <!-- Your PR contains doc changes. Please attach the local
preview screenshots (run `sh start.sh` at `pulsar/site2/website`) to your PR
description, or else your PR might not get merged. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update
later -->
- [x] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->
### Matching PR in forked repository
PR in forked repository: https://github.com/heesung-sn/pulsar/pull/24
<!--
After opening this PR, the build in apache/pulsar will fail and instructions
will
be provided for opening a PR in the PR author's forked repository.
apache/pulsar pull requests should be first tested in your own fork since
the
apache/pulsar CI based on GitHub Actions has constrained resources and quota.
GitHub Actions provides separate quota for pull requests that are executed
in
a forked repository.
The tests will be run in the forked repository until all PR review comments
have
been handled, the tests pass and the PR is approved by a reviewer.
-->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]