rdhabalia opened a new pull request #4138: [pulsar-broker]Fix: client-producer 
can't reconnect due to failed producer-future on cnx cache
URL: https://github.com/apache/pulsar/pull/4138
 
 
   ### Motivation
   
   Frequently having issue where broker doesn't allow client to create producer 
and client keep receiving error: `Producer is already present on the connection`
   
   - It happens when same client-host is connected to one broker for two 
different topics which are part of different bundles. So, client shares same 
cnx-C1 with broker-B1 for the two different topics which are part of different 
bundles. 
   - Now, one of the bundle and topic-T1 gets unloaded and goes to different 
broker-B2. while unloading, client try to reconnect to broker B1 and 
producer-creation fails due to unloading. However, in the unloading process, 
broker sometimes couldn't clear this failed-producer-future from the cache.
   - After sometime, the same bundle and topic T1 again gets unloaded from 
broker-B2 and comes back to Broker-B1.
   - Now, client already has cnx-C1 with broker-B1 (because bundle-2 topics are 
still connected) and try to create producer (with same producer-id which was 
created initially at client) for topic-T1 on the same cnx-C1.
   - Some how, broker couldn't clear old producer-result from the cnx-C1 cache 
and it thinks that producer is already connected so, client keeps failing and 
couldn't create the producer.
   
   **Broker-log**
   ```
   1#(1) Bundle started unloading
   10:11:18.980 [pulsar-web-29-18] INFO  
org.apache.pulsar.broker.namespace.OwnedBundle - Disabling ownership: 
my-prop/my-cluster/my-ns/0x00000000_0x80000000
   :
   10:11:19.026 [pulsar-web-29-18] INFO  
org.apache.pulsar.broker.service.BrokerService - 
[persistent://my-prop/my-cluster/my-ns/topic-partition-0] Unloading topic
   :
   10:11:19.029 [pulsar-web-29-18] INFO  
org.apache.pulsar.broker.service.Producer - Disconnecting producer: 
Producer{topic=PersistentTopic{topic=persistent://my-prop/my-cluster/my-ns/topic-partition-0},
 client=/121.121.121.121:58042, producerName=my-cluster-2897-13224, 
producerId=1069}
   :
   #(2) While unloading topic: producer disconnected and client tried again to 
create producer on same cnx: 121.121.121.121:58042 .. but I think 
producerFuture is not removed
   10:11:19.133 [pulsar-io-21-48] INFO  
org.apache.pulsar.broker.service.ServerCnx - 
[/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0]
 Creating producer. producerId=1069
   10:11:19.134 [pulsar-io-21-42] INFO  
org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup 
yahoo.dht.set.prod-tenzing.su for topic 
persistent://my-prop/my-cluster/my-ns/topic-partition-0 with error Namespace 
bundle my-prop/my-cluster/my-ns/0x00000000_0x80000000 is being unloaded
   :
   10:11:19.134 [pulsar-io-21-42] INFO  
org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup 
yahoo.dht.set.prod-tenzing.su for topic 
persistent://my-prop/my-cluster/my-ns/topic-partition-0 with error Namespace 
bundle my-prop/my-cluster/my-ns/0x00000000_0x80000000 is being unloaded
   10:11:19.204 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO  
org.apache.bookkeeper.mledger.impl.MetaStoreImplZookeeper - 
[my-prop/my-cluster/my-ns/persistent/topic-partition-0] 
[ContactsBackend.BlendedContacts_Index_subscription] Updating cursor info 
ledgerId=-1 mark-delete=10548168989:116241
   :
   10:11:19.208 [bookkeeper-ml-workers-OrderedExecutor-19-0] INFO  
org.apache.pulsar.broker.service.persistent.PersistentTopic - 
[persistent://my-prop/my-cluster/my-ns/topic-partition-0] Topic closed
   :
   10:11:19.211 [pulsar-io-21-27] WARN  
org.apache.pulsar.broker.service.BrokerService - 
[persistent://my-prop/my-cluster/my-ns/topic-partition-0] Service unit is not 
ready when loading the topic
   #(3) Topic again assigned to same broker and stale and failed 
producer-future already exists which doesn't allow producer to reconnect and 
keep giving error: Producer already connected
   :
   16:46:19.116 [bookkeeper-ml-workers-OrderedExecutor-19-0] INFO  
org.apache.pulsar.broker.service.persistent.DispatchRateLimiter - 
[persistent://my-prop/my-cluster/my-ns/topic-partition-0] [null] setting 
message-dispatch-rate DispatchRate{dispatchThrottlingRateInMsg=0, 
dispatchThrottlingRateInByte=0, ratePeriodInSecond=1}
   16:46:19.155 [pulsar-io-21-48] WARN  
org.apache.pulsar.broker.service.ServerCnx - 
[/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0]
 Producer is already present on the connection
   :
   16:46:19.155 [pulsar-io-21-48] WARN  
org.apache.pulsar.broker.service.ServerCnx - 
[/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0]
 Producer is already present on the connection
   16:46:19.350 [pulsar-io-21-48] WARN  
org.apache.pulsar.broker.service.ServerCnx - 
[/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0]
 Producer is already present on the connection
   16:46:19.751 [pulsar-io-21-48] WARN  
org.apache.pulsar.broker.service.ServerCnx - 
[/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0]
 Producer is already present on the connection
   16:46:20.528 [pulsar-io-21-48] WARN  
org.apache.pulsar.broker.service.ServerCnx - 
[/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0]
 Producer is already present on the connection
   ```
   
   **client-log**
   CPP-Client log shows unknown error
   ```
   20190425 16:46:19.155 : WARN  
cms./var/builds/workspace/260010-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/pulsar/pulsar-client-cpp/lib/ProducerImpl.cc
 - 
[persistent://sherpa-prodhd/prod2-sg3/prodhd-NOTIFICATION-ContactsBackend.BlendedContacts_Index/DIFF-partition-0,
 prod2-sg3-2897-13224] Failed to reconnect producer: UnknownError
   ```
   
   I couldn't reproduce it in my local env but it happens very frequently to my 
prod env.
   
   ### Modification
   
   - Clean up producer-future from the cache if future is already completed 
with failure. 
   - added log once schema-future completes for future troubleshooting to make 
sure schema-future doesn't get stuck.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to