heesung-sn opened a new issue, #15792:
URL: https://github.com/apache/pulsar/issues/15792
**Describe the bug**
Intermittent latency spikes have been observed from the current master
branch, 2.11. When running OpenMessaging Benchmarks long enough for hours,
sometimes Pub latency abnormally spikes and comes back to the normal. I suspect
this could be due to some random network issues in the test environment(AWS) or
some edge-case performance regressions introduced in the latest code.
Unfortunately, 2.9 branch didn't show this issue so far.
**To Reproduce**
Steps to reproduce the behavior:
- Run OpenMessaging Benchmark Test(1-topic-100-partitions-1kb-4p-4c-500k)
with the current pulsar master branch(2.11) several times(at least 3 times)
with duration 15mins for each run. (skip journaling)
- Observe the max latency spikes > 200ms in the logs like the followings.
```
######## Observation 1:
[ec2-user@ip-10-0-0-236 benchmark]$ sudo bin/benchmark \
> --drivers driver-pulsar/pulsar.yaml \
> workloads/1-topic-100-partitions-1kb-4p-4c-500k.yaml
...
05:24:37.011 [main] INFO - Workloads: {
"1-topic-100-partitions-1kb-4p-4c-500k" : {
"name" : "500k rate 4 producers and 4 consumers on 1 topic / 100
partition",
"topics" : 1,
"partitionsPerTopic" : 100,
"keyDistributor" : "NO_KEY",
"messageSize" : 1024,
"useRandomizedPayloads" : true,
"randomBytesRatio" : 0.5,
"randomizedPayloadPoolSize" : 1000,
"payloadFile" : null,
"subscriptionsPerTopic" : 1,
"producersPerTopic" : 4,
"consumerPerSubscription" : 4,
"producerRate" : 500000,
"consumerBacklogSizeGB" : 0,
"testDurationMinutes" : 5,
"warmupDurationMinutes" : 1
}
}
05:24:37.286 [main] INFO - Workers list - producers:
[http://10.0.0.236:8080, http://10.0.0.165:8080]
05:24:37.287 [main] INFO - Workers list - consumers: [http://10.0.0.73:8080,
http://10.0.0.170:8080]
05:24:37.291 [main] INFO - --------------- WORKLOAD : 500k rate 4 producers
and 4 consumers on 1 topic / 100 partition --- DRIVER : Pulsar---------------
05:24:39.028 [main] INFO - Created 1 topics in 1083.131706 ms
05:24:42.706 [main] INFO - Created 4 consumers in 3672.372545 ms
05:24:42.857 [main] INFO - Created 4 producers in 150.298553 ms
05:24:42.857 [main] INFO - Waiting for consumers to be ready
05:24:42.873 [main] INFO - All consumers are ready
05:24:43.008 [main] INFO - ----- Starting warm-up traffic (1m) ------
...
05:26:24.471 [main] INFO - Pub rate 500638.7 msg/s / 488.9 MB/s | Cons rate
500687.5 msg/s / 489.0 MB/s | Backlog: 0.0 K | Pub Latency (ms) avg: 1.6 -
50%: 1.1 - 99%: 1.7 - 99.9%: 138.4 - Max: 511.8 | Pub Delay Latency (us) avg:
26.3 - 50%: 24.0 - 99%: 55.0 - 99.9%: 63.0 - Max: 946.0
...
05:30:26.759 [main] INFO - Pub rate 499597.5 msg/s / 487.9 MB/s | Cons rate
499626.2 msg/s / 487.9 MB/s | Backlog: -5.0 K | Pub Latency (ms) avg: 1.1 -
50%: 1.1 - 99%: 1.6 - 99.9%: 6.1 - Max: 25.9 | Pub Delay Latency (us) avg:
26.2 - 50%: 24.0 - 99%: 55.0 - 99.9%: 62.0 - Max: 163.0
05:30:36.847 [main] INFO - Pub rate 500577.0 msg/s / 488.8 MB/s | Cons rate
500527.0 msg/s / 488.8 MB/s | Backlog: -4.5 K | Pub Latency (ms) avg: 1.1 -
50%: 1.1 - 99%: 1.6 - 99.9%: 1.8 - Max: 70.6 | Pub Delay Latency (us) avg:
26.3 - 50%: 24.0 - 99%: 55.0 - 99.9%: 63.0 - Max: 6422.0
05:30:46.942 [main] INFO - Pub rate 499666.5 msg/s / 488.0 MB/s | Cons rate
499692.0 msg/s / 488.0 MB/s | Backlog: -4.8 K | Pub Latency (ms) avg: 1.1 -
50%: 1.1 - 99%: 1.6 - 99.9%: 4.9 - Max: 64.9 | Pub Delay Latency (us) avg:
26.2 - 50%: 24.0 - 99%: 55.0 - 99.9%: 62.0 - Max: 259.0
05:30:47.044 [main] INFO - ----- Aggregated Pub Latency (ms) avg: 1.2 -
50%: 1.1 - 95%: 1.5 - 99%: 1.6 - 99.9%: 1.9 - 99.99%: 233.5 - Max: 511.8 |
Pub Delay (us) avg: 26.2 - 50%: 24.0 - 95%: 55.0 - 99%: 55.0 - 99.9%: 62.0 -
99.99%: 92.0 - Max: 6422.0
05:30:49.357 [main] INFO - Writing test result into
1-topic-100-partitions-1kb-4p-4c-500k-Pulsar-2022-05-17-05-30-49.json
######## Observation 2:
00:37:32.044 [main] INFO - Workloads: {
"1-topic-100-partitions-1kb-4p-4c-500k" : {
"name" : "500k rate 4 producers and 4 consumers on 1 topic / 100
partition",
"topics" : 1,
"partitionsPerTopic" : 100,
"keyDistributor" : "NO_KEY",
"messageSize" : 1024,
"useRandomizedPayloads" : true,
"randomBytesRatio" : 0.5,
"randomizedPayloadPoolSize" : 1000,
"payloadFile" : null,
"subscriptionsPerTopic" : 1,
"producersPerTopic" : 4,
"consumerPerSubscription" : 4,
"producerRate" : 500000,
"consumerBacklogSizeGB" : 0,
"testDurationMinutes" : 15,
"warmupDurationMinutes" : 1
}
}
...
00:47:42.172 [main] INFO - Pub rate 499434.4 msg/s / 487.7 MB/s | Cons rate
499439.6 msg/s / 487.7 MB/s | Backlog: -0.1 K | Pub Latency (ms) avg: 3.6 -
50%: 1.2 - 99%: 108.2 - 99.9%: 159.3 - Max: 220.4 | Pub Delay Latency (us)
avg: 26.3 - 50%: 24.0 - 99%: 55.0 - 99.9%: 61.0 - Max: 150.0
...
00:53:35.406 [main] INFO - Pub rate 500426.2 msg/s / 488.7 MB/s | Cons rate
500456.2 msg/s / 488.7 MB/s | Backlog: -0.0 K | Pub Latency (ms) avg: 1.2 -
50%: 1.2 - 99%: 1.8 - 99.9%: 1.9 - Max: 2.4 | Pub Delay Latency (us) avg:
26.2 - 50%: 24.0 - 99%: 55.0 - 99.9%: 62.0 - Max: 131.0
00:53:45.499 [main] INFO - Pub rate 499718.5 msg/s / 488.0 MB/s | Cons rate
499726.6 msg/s / 488.0 MB/s | Backlog: -0.1 K | Pub Latency (ms) avg: 1.2 -
50%: 1.2 - 99%: 1.8 - 99.9%: 1.9 - Max: 2.9 | Pub Delay Latency (us) avg:
26.2 - 50%: 24.0 - 99%: 55.0 - 99.9%: 62.0 - Max: 131.0
00:53:45.612 [main] INFO - ----- Aggregated Pub Latency (ms) avg: 1.2 -
50%: 1.2 - 95%: 1.6 - 99%: 1.8 - 99.9%: 2.0 - 99.99%: 111.2 - Max: 220.4 |
Pub Delay (us) avg: 26.4 - 50%: 24.0 - 95%: 55.0 - 99%: 55.0 - 99.9%: 62.0 -
99.99%: 91.0 - Max: 1027.0
```
**Expected behavior**
We expect max Pub latency < 100ms from
`1-topic-100-partitions-1kb-4p-4c-500k`. (Please correct me if I am wrong here)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]