[GitHub] [pulsar] MichalKoziorowski-TomTom commented on issue #21082: [Bug] one topic suddenly cannot be consumed,others is ok

via GitHub Wed, 30 Aug 2023 06:34:46 -0700


MichalKoziorowski-TomTom commented on issue #21082:
URL: https://github.com/apache/pulsar/issues/21082#issuecomment-1699184180


   Hi.
   
   I will write here my case because it might be related:
   Previously I ran server and client version 2.8.3, and everything worked fine. After upgrading the server and Java client to version 3.0.1 or 3.1.0, problems appeared.
   
   My server runs almost out-of-the-box settings, with only the following properties changed:
   ```
       managedLedgerDefaultEnsembleSize: "3"
       managedLedgerDefaultWriteQuorum: "3"
       managedLedgerDefaultAckQuorum: "2"
       brokerDeduplicationEnabled: "true"
       # bookkeeperClientTimeoutInSeconds changed from default 30.
       # It allows us to catch bookkeeper problems earlier and, in case of problematic bookies,
       # to retry sendMessage within the standard 30 seconds. After the change, when a bookie
       # does not ack a message within 5 seconds, a new ensemble is created, and sendMessage
       # is retried in about 15 seconds.
       bookkeeperClientTimeoutInSeconds: "5"
       # bookkeeperClientHealthCheckErrorThresholdPerInterval changed from default 5.
       # It allows us to react to bookie timeouts faster.
       # By default, the health check interval is 60 seconds, and we are not changing that.
       # bookkeeperClientHealthCheckErrorThresholdPerInterval=3 means that if there are
       # >= 3 timeouts within 60 seconds, the bookie is quarantined, and the ensemble is
       # recreated on different bookies.
       bookkeeperClientHealthCheckErrorThresholdPerInterval: "3"
       # bookkeeperClientHealthCheckQuarantineTimeInSeconds changed from default 1800 seconds.
       # A bookie is quarantined when the broker detects addEntry timeouts.
       # We are lowering this value because we lowered bookkeeperClientTimeoutInSeconds,
       # and in case of transient issues we don't want to have all bookies quarantined
       # in a short time.
       bookkeeperClientHealthCheckQuarantineTimeInSeconds: "600"
       # Needed to set custom policies per topic (https://jira.tomtomgroup.com/browse/NAV-103543)
       systemTopicEnabled: "true"
       topicLevelPoliciesEnabled: "true"
   ```
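
As a rough illustration of the quarantine rule described in the comments above, here is a minimal sketch. It is not BookKeeper's actual health-check code: the class `QuarantineSketch` and its methods are hypothetical, and it assumes a simple sliding window, whereas the real check runs periodically. It only shows how the three values (threshold 3, interval 60 s, quarantine 600 s) interact.

```java
/**
 * Illustrative model only, NOT BookKeeper's implementation.
 * Assumption: a bookie is quarantined once it accumulates at least
 * `errorThreshold` add-entry timeouts within `intervalSeconds`.
 */
final class QuarantineSketch {
    private final int errorThreshold;     // bookkeeperClientHealthCheckErrorThresholdPerInterval
    private final long intervalSeconds;   // health check interval (60 s per the config comment)
    private final long quarantineSeconds; // bookkeeperClientHealthCheckQuarantineTimeInSeconds
    private final java.util.ArrayDeque<Long> timeouts = new java.util.ArrayDeque<>();
    private long quarantinedUntil = Long.MIN_VALUE;

    QuarantineSketch(int errorThreshold, long intervalSeconds, long quarantineSeconds) {
        this.errorThreshold = errorThreshold;
        this.intervalSeconds = intervalSeconds;
        this.quarantineSeconds = quarantineSeconds;
    }

    /** Records one add-entry timeout observed at the given time (in seconds). */
    void recordTimeout(long nowSeconds) {
        timeouts.addLast(nowSeconds);
        // Forget timeouts that are older than one interval.
        while (!timeouts.isEmpty() && timeouts.peekFirst() <= nowSeconds - intervalSeconds) {
            timeouts.removeFirst();
        }
        if (timeouts.size() >= errorThreshold) {
            quarantinedUntil = nowSeconds + quarantineSeconds;
            timeouts.clear();
        }
    }

    /** True while the quarantine period started by recordTimeout is still running. */
    boolean isQuarantined(long nowSeconds) {
        return nowSeconds < quarantinedUntil;
    }

    public static void main(String[] args) {
        QuarantineSketch bookie = new QuarantineSketch(3, 60, 600);
        bookie.recordTimeout(0);
        bookie.recordTimeout(5);
        bookie.recordTimeout(10); // third timeout inside 60 s
        System.out.println("quarantined at t=10: " + bookie.isQuarantined(10));
        System.out.println("quarantined at t=610: " + bookie.isQuarantined(610));
    }
}
```

With a 5-second client timeout, three slow writes to the same bookie in one minute are enough to trigger the rule in this model, which is why the quarantine time was lowered along with the timeout.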
   
   Our tenant and namespaces are created with:
   ```
   bin/pulsar-admin --admin-url "${ADMIN_URL}" tenants create "${TENANT}" --allowed-clusters pulsar
   bin/pulsar-admin --admin-url "${ADMIN_URL}" namespaces create "${TENANT}/batch"
   bin/pulsar-admin --admin-url "${ADMIN_URL}" namespaces set-max-unacked-messages-per-consumer -c 10 "${TENANT}/batch"
   bin/pulsar-admin --admin-url "${ADMIN_URL}" namespaces set-max-unacked-messages-per-subscription -c 20 "${TENANT}/batch"
   ```
   
   We are using MultiTopicsConsumer to fetch messages from all queues in the 
batch namespace. Below is how the client and consumer are configured:
   ```
   // Pulsar client:
   
       PulsarClient.builder()
           .ioThreads(1)
           .listenerThreads(1)
           .enableTlsHostnameVerification(false)
           .serviceUrl(<URL>)
           .keepAliveInterval(10_000, TimeUnit.MILLISECONDS)
           .connectionTimeout(10_111, TimeUnit.MILLISECONDS)
           .operationTimeout(30_000, TimeUnit.MILLISECONDS)
           .startingBackoffInterval(100, TimeUnit.MILLISECONDS)
           .maxBackoffInterval(10_000, TimeUnit.MILLISECONDS)
           .build();
   
   // Consumer:
   
   return pulsarClient
           .newConsumer(<SCHEMA>)
           .subscriptionName(<RANDOM_SUBSCRIPTION_NAME>)
           .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
           .subscriptionType(SubscriptionType.Shared)
           .topicsPattern(<PATTERN_CATCHING_ALL_BATCH_NAMESPACE_QUEUES>)
           .negativeAckRedeliveryDelay(100, TimeUnit.MILLISECONDS)
           .patternAutoDiscoveryPeriod(60, TimeUnit.SECONDS)
           .receiverQueueSize(1);
   
   ```
   Messages are usually acked after 100-1000 ms.
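
To make the effect of the unacked-message limits concrete, here is a back-of-the-envelope sketch. It assumes the namespace limits (10 per consumer, 20 per subscription) are enforced exactly and are both positive, applies them to one subscription on one topic, and ignores `receiverQueueSize`. The class `UnackedBudget` and the consumer count are hypothetical and only serve the arithmetic.

```java
/** Rough throughput bound implied by unacked-message limits (illustrative only). */
final class UnackedBudget {

    /** Upper bound on unacked messages in flight for one subscription on one topic. */
    static int maxInFlight(int consumers, int perConsumerLimit, int perSubscriptionLimit) {
        return Math.min(consumers * perConsumerLimit, perSubscriptionLimit);
    }

    /** Little's law: throughput <= messages in flight / time each message stays unacked. */
    static double maxMessagesPerSecond(int inFlight, long ackLatencyMillis) {
        return inFlight * 1000.0 / ackLatencyMillis;
    }

    public static void main(String[] args) {
        int consumers = 1; // assumption: a single consumer on the subscription
        int inFlight = maxInFlight(consumers, 10, 20);
        System.out.println("max in flight: " + inFlight);
        System.out.println("bound at 1000 ms ack latency: " + maxMessagesPerSecond(inFlight, 1000));
        System.out.println("bound at 100 ms ack latency: " + maxMessagesPerSecond(inFlight, 100));
    }
}
```

Under these assumptions a single consumer is capped at 10 messages in flight, so dispatch to it stops whenever acks stall, regardless of how many messages are waiting in the topic.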
   
   Our publisher tries to have a constant number of messages in the queue 
(about 100) and adds more after the previous message is processed. 
   
   With the Pulsar server and client at version 3.0.1 or 3.1.0, we see the following values in Prometheus:
   
   


