michaeljmarshall opened a new pull request, #19446: URL: https://github.com/apache/pulsar/pull/19446
### Motivation The intention of https://github.com/apache/pulsar/pull/12780 was to close the whole connection when the client sends an unexpected `Send` command. The goal of that change was to make the broker more defensive to prevent incorrect implementations in the client (see https://github.com/apache/pulsar/pull/12779). However, the change in #12780 was too broad. It closes the connection in a very expected case. Specifically, when the server disconnects a producer due to load balancing or unloading event, the broker sends the client a `CloseProducer` command. If the broker receives any additional messages for that producer, the broker closes the whole connection. This is an expensive interruption for clients with many producers/consumers. Because the Pulsar Producer is expected to pipeline `Send` commands, there is no current way to know if the client sent the messages before or after receiving the send command, and because the goal of #12780 was to increase stability, I think we should ignore the messages when they are received in these conditions. I propose that we keep a map of recently closed producers, and use that to limit how long we keep around the producer's tombstone. For reference, when connections are closed due to this weakness in the implementation, the broker logs: https://github.com/apache/pulsar/blob/fd3ce8b5786baf0b76f301bd9597cd0b99a412f1/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1625-L1626 I observed this log line more than 17,000 in the past 7 days. As such, I plan to cherry pick this to active release branches. ### Modifications * Add a `HashMap` to track recently closed producers. This map is only ever updated on the `ServerCnx`'s event loop. The one downside is that the `HashMap` will box the `long` keys and values. However, it is likely faster than the `ConcurrentLongHashMap` since it does not have any synchronization. * Add logic to ignore `Send` commands if they come recently after the broker sent a `CloseProducer` command. * Use the keep alive interval as the length of time the producer is considered "recently" closed. It is possible that some will want a new configuration here. I do not think we need one because we are really just waiting for the client to receive the `CloseProducer` request, and the keep alive interval should be sufficient for a full round trip from broker to client. * Remove the recently closed producer from the map if the client attempts to recreate the producer. In this condition, the client has already received the close command and is attempting to create a new producer. ### Verifying this change New tests are added. ### Does this pull request potentially affect one of the following parts: - [x] The binary protocol This affects the protocol in a sense, but it does not change the protocol in any negative way. ### Documentation - [x] `doc-not-needed` ### Matching PR in forked repository PR in forked repository: https://github.com/michaeljmarshall/pulsar/pull/24 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
