michaeljmarshall opened a new pull request, #19446:
URL: https://github.com/apache/pulsar/pull/19446

   ### Motivation
   
   The intention of https://github.com/apache/pulsar/pull/12780 was to close 
the whole connection when the client sends an unexpected `Send` command. The 
goal of that change was to make the broker more defensive to prevent incorrect 
implementations in the client (see https://github.com/apache/pulsar/pull/12779).
   
   However, the change in #12780 was too broad. It closes the connection in a 
very expected case. Specifically, when the server disconnects a producer due to 
load balancing or unloading event, the broker sends the client a 
`CloseProducer` command. If the broker receives any additional messages for 
that producer, the broker closes the whole connection. This is an expensive 
interruption for clients with many producers/consumers. Because the Pulsar 
Producer is expected to pipeline `Send` commands, there is no current way to 
know if the client sent the messages before or after receiving the send 
command, and because the goal of #12780 was to increase stability, I think we 
should ignore the messages when they are received in these conditions.
   
   I propose that we keep a map of recently closed producers, and use that to 
limit how long we keep around the producer's tombstone.
   
   For reference, when connections are closed due to this weakness in the 
implementation, the broker logs:
   
   
https://github.com/apache/pulsar/blob/fd3ce8b5786baf0b76f301bd9597cd0b99a412f1/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1625-L1626
   
   I observed this log line more than 17,000 in the past 7 days. As such, I 
plan to cherry pick this to active release branches.
   
   ### Modifications
   
   * Add a `HashMap` to track recently closed producers. This map is only ever 
updated on the `ServerCnx`'s event loop. The one downside is that the `HashMap` 
will box the `long` keys and values. However, it is likely faster than the 
`ConcurrentLongHashMap` since it does not have any synchronization.
   * Add logic to ignore `Send` commands if they come recently after the broker 
sent a `CloseProducer` command.
   * Use the keep alive interval as the length of time the producer is 
considered "recently" closed. It is possible that some will want a new 
configuration here. I do not think we need one because we are really just 
waiting for the client to receive the `CloseProducer` request, and the keep 
alive interval should be sufficient for a full round trip from broker to client.
   * Remove the recently closed producer from the map if the client attempts to 
recreate the producer. In this condition, the client has already received the 
close command and is attempting to create a new producer.
   
   ### Verifying this change
   
   New tests are added.
   
   ### Does this pull request potentially affect one of the following parts:
   
   - [x] The binary protocol
   
   This affects the protocol in a sense, but it does not change the protocol in 
any negative way.
   
   ### Documentation
   
   - [x] `doc-not-needed`
   
   ### Matching PR in forked repository
   
   PR in forked repository: https://github.com/michaeljmarshall/pulsar/pull/24
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to