MichalKoziorowski-TomTom commented on issue #21082:
URL: https://github.com/apache/pulsar/issues/21082#issuecomment-1699184180
Hi.
I'll describe my case here because it might be related:
Previously I ran server and client version 2.8.3, and everything worked fine.
After upgrading the server and the Java client to version 3.0.1 or 3.1.0,
problems appeared.
My server uses almost out-of-the-box settings, with the following properties changed:
```
managedLedgerDefaultEnsembleSize: "3"
managedLedgerDefaultWriteQuorum: "3"
managedLedgerDefaultAckQuorum: "2"
brokerDeduplicationEnabled: "true"
# bookkeeperClientTimeoutInSeconds changed from default 30
# It allows to catch bookkeeper problems earlier and in case of problematic bookies,
# be able to retry sendMessage within standard 30 seconds.
# After the change, when bookie does not ack message in 5 seconds
# A new ensemble is created, and sendMessage is retried in about 15 seconds.
bookkeeperClientTimeoutInSeconds: "5"
# bookkeeperClientHealthCheckErrorThresholdPerInterval changed from default 5
# It allows to react to bookie timeouts faster.
# By default, the health check interval is 60 seconds, and we are not changing that.
# bookkeeperClientHealthCheckErrorThresholdPerInterval=3 means that if
# there will be >= 3 timeouts within 60 seconds,
# bookkeeper will be quarantined, and the ensemble will be recreated on
# different bookkeepers.
bookkeeperClientHealthCheckErrorThresholdPerInterval: "3"
# bookkeeperClientHealthCheckQuarantineTimeInSeconds changed from default 1800 seconds.
# Bookkeeper is quarantined when broker detects addEntry timeouts.
# We are lowering this value because we lowered bookkeeperClientTimeoutInSeconds,
# and in case of transient issues
# we don't want to have all bookies quarantined in a short time.
bookkeeperClientHealthCheckQuarantineTimeInSeconds: "600"
# Needed to set custom policies per topic
# (https://jira.tomtomgroup.com/browse/NAV-103543)
systemTopicEnabled: "true"
topicLevelPoliciesEnabled: "true"
```
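To make the intent of the threshold and quarantine settings concrete, here is a minimal sketch in plain Java. It is a simplified model, not BookKeeper's actual implementation: the class `BookieHealthModel` and its methods are hypothetical, and it uses a sliding window for simplicity, whereas the real client may evaluate the error counter differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified model of per-bookie timeout counting and quarantine. */
class BookieHealthModel {
    private final int errorThreshold;    // e.g. 3 (bookkeeperClientHealthCheckErrorThresholdPerInterval)
    private final long intervalMillis;   // e.g. 60 s (default health check interval)
    private final long quarantineMillis; // e.g. 600 s (bookkeeperClientHealthCheckQuarantineTimeInSeconds)
    private final Deque<Long> timeouts = new ArrayDeque<>();
    private long quarantinedUntil = Long.MIN_VALUE;

    BookieHealthModel(int errorThreshold, long intervalSeconds, long quarantineSeconds) {
        this.errorThreshold = errorThreshold;
        this.intervalMillis = intervalSeconds * 1000L;
        this.quarantineMillis = quarantineSeconds * 1000L;
    }

    /** Records one addEntry timeout; returns true if the bookie is quarantined afterwards. */
    boolean recordTimeout(long nowMillis) {
        // Forget timeouts that fell out of the observation window.
        while (!timeouts.isEmpty() && nowMillis - timeouts.peekFirst() >= intervalMillis) {
            timeouts.pollFirst();
        }
        timeouts.addLast(nowMillis);
        if (timeouts.size() >= errorThreshold) {
            // Threshold reached: quarantine and start counting from scratch.
            quarantinedUntil = nowMillis + quarantineMillis;
            timeouts.clear();
        }
        return isQuarantined(nowMillis);
    }

    boolean isQuarantined(long nowMillis) {
        return nowMillis < quarantinedUntil;
    }

    public static void main(String[] args) {
        BookieHealthModel bookie = new BookieHealthModel(3, 60, 600);
        System.out.println(bookie.recordTimeout(0));
        System.out.println(bookie.recordTimeout(10_000));
        System.out.println(bookie.recordTimeout(20_000));
    }
}
```

With the values from the configuration above, three timeouts within 60 seconds put the modelled bookie into quarantine for 600 seconds, while timeouts spaced further apart never reach the threshold.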
Our tenant and namespaces are created with:
```
bin/pulsar-admin --admin-url "${ADMIN_URL}" tenants create "${TENANT}" \
  --allowed-clusters pulsar
bin/pulsar-admin --admin-url "${ADMIN_URL}" namespaces create \
  "${TENANT}/batch"
bin/pulsar-admin --admin-url "${ADMIN_URL}" namespaces \
  set-max-unacked-messages-per-consumer -c 10 "${TENANT}/batch"
bin/pulsar-admin --admin-url "${ADMIN_URL}" namespaces \
  set-max-unacked-messages-per-subscription -c 20 "${TENANT}/batch"
```
We are using MultiTopicsConsumer to fetch messages from all queues in the
batch namespace. Below is how the client and consumer are configured:
```
// PULSAR CLIENT:
PulsarClient.builder()
    .ioThreads(1)
    .listenerThreads(1)
    .enableTlsHostnameVerification(false)
    .serviceUrl(<URL>)
    .keepAliveInterval(10_000, TimeUnit.MILLISECONDS)
    .connectionTimeout(10_111, TimeUnit.MILLISECONDS)
    .operationTimeout(30_000, TimeUnit.MILLISECONDS)
    .startingBackoffInterval(100, TimeUnit.MILLISECONDS)
    .maxBackoffInterval(10_000, TimeUnit.MILLISECONDS)
    .build();

// CONSUMER:
return pulsarClient
    .newConsumer(<SCHEMA>)
    .subscriptionName(<RANDOM_SUBSCRIPTION_NAME>)
    .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
    .subscriptionType(SubscriptionType.Shared)
    .topicsPattern(<PATTERN_CATCHING_ALL_BATCH_NAMESPACE_QUEUES>)
    .negativeAckRedeliveryDelay(100, TimeUnit.MILLISECONDS)
    .patternAutoDiscoveryPeriod(60, TimeUnit.SECONDS)
    .receiverQueueSize(1);
```
Messages are usually acked after 100 to 1000 ms.
Our publisher tries to have a constant number of messages in the queue
(about 100) and adds more after the previous message is processed.
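The publishing pattern just described (keep roughly 100 messages queued, add one more only after a previous one has been processed) could be modelled as follows. This is only an illustrative sketch in plain Java: `BacklogLimiter` and its methods are hypothetical names, the actual Pulsar send call is left out, and our real publisher may be implemented differently.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical helper that caps the number of messages waiting in the queue. */
class BacklogLimiter {
    private final int target;      // desired number of queued messages, e.g. 100
    private final Semaphore slots; // free slots left before the target is reached

    BacklogLimiter(int target) {
        this.target = target;
        this.slots = new Semaphore(target);
    }

    /** Returns true if the caller may publish one more message right now. */
    boolean tryPublish() {
        return slots.tryAcquire();
    }

    /** Call once per message that was published and has since been processed. */
    void onProcessed() {
        slots.release();
    }

    /** Number of messages currently counted as queued. */
    int inQueue() {
        return target - slots.availablePermits();
    }

    public static void main(String[] args) {
        BacklogLimiter limiter = new BacklogLimiter(100);
        int published = 0;
        while (limiter.tryPublish()) {
            published++; // a real publisher would send a message here
        }
        System.out.println("published=" + published + ", inQueue=" + limiter.inQueue());
    }
}
```

The limiter only tracks counts; it assumes `onProcessed()` is called exactly once for every message that was successfully published.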
With the Pulsar server and client at version 3.0.1 or 3.1.0, we see the
following values in Prometheus: