michaeljmarshall commented on code in PR #20748: URL: https://github.com/apache/pulsar/pull/20748#discussion_r1303801386
########## pip/pip-281.md: ##########
@@ -0,0 +1,268 @@
+<!--
+RULES
+* Never place a link to an external site like Google Doc. The proposal should be in this issue entirely.
+* Use a spelling and grammar checker tool if one is available to you (there are plenty of free ones).
+
+PROPOSAL HEALTH CHECK
+I can read the design document and understand the problem statement and what you plan to change *without* resorting to a couple of hours of code reading just to start having a high-level understanding of the change.
+
+THESE COMMENTS
+Please remove them when done.
+-->
+
+# Background knowledge
+
+- The Pulsar broker load balancer periodically unloads bundles from overloaded brokers. During this unload process, the previous owner broker closes topic sessions (e.g. producers, subscriptions (consumers), and managed ledgers). When the bundle is re-assigned, the new owner broker recreates the topic sessions.
+
+- Pulsar clients send `CommandLookupTopic` requests to look up or assign owner brokers for topics and then connect to them.
+
+- PIP-192, the extensible load balancer, introduced the bundle state channel, which event-sources this unloading process in a state-machine manner, in the order `releasing`, `assigned`, `owned`. In the `releasing` state, the owner broker "releases" the bundle ownership (closes topic sessions).
+
+- PIP-192 also introduced TransferShedder, a new shedding strategy, which pre-assigns new owner brokers beforehand.
+
+
+# Motivation
+
+- When unloading closes many topic sessions, many clients need to send `CommandLookupTopic` requests at the same time, which can cause a flood of lookup requests on brokers. This unloading process can be further optimized if we let the client connect directly to the new owner broker without the follow-up `CommandLookupTopic` requests.
+- In the new load balancer (PIP-192), since the owner broker is already known, we can modify the close command protocol to pass the new destination broker URL and skip the lookup requests.
+- Also, when unloading, we can gracefully shut down ledgers: we always close the old managed ledgers first and then recreate them on the new owner without conflicts.
+
+# Goals
+- Remove clients' lookup requests from the unload protocol to reduce the publish latency spike and e2e latency spike during
+unloading, and to resolve bottlenecks (from thundering lookups) when there are a large number of topics in a cluster.
+- Gracefully shut down managed ledgers before the new owners create them, to reduce possible race conditions between ledger closes and ledger creations during unloading.
+
+## In Scope
+
+<!--
+What this PIP intends to achieve once it's integrated into Pulsar.
+Why it benefits Pulsar.
+-->
+
+- This change will be added to the extensible load balancer.
+
+## Out of Scope
+
+<!--
+Describe what you have decided to keep out of scope, perhaps left for a different PIP/s.
+-->
+
+- This won't change the existing load balancer behavior (modular load manager).
+
+
+
+# High Level Design
+
+<!--
+Describe the design of your solution at a *high level*.
+Describe the solution end to end, from a bird's-eye view.
+Don't go into implementation details in this section.
+
+I should be able to finish reading from the beginning of the PIP to here (inclusive) and understand the feature and
+how you intend to solve it, end to end.
+
+DON'T
+* Avoid code snippets, unless they're essential to explain your intent.
+-->
+
+To achieve the goals above, we could modify the bundle transfer protocol as follows.
+The proposed protocol change is based on the bundle states from PIP-192.
+
+Basically, we could close the ledgers only in the `releasing` state and finally disconnect clients in the `owned` state with destination broker URLs. The clients will directly connect to the pre-assigned destination broker URL without lookups. Meanwhile, during this transfer, any produced messages will be ignored by the source broker.
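The `releasing` → `assigned` → `owned` ordering described above can be pictured as a small state machine. Here is a minimal, illustrative sketch of those transitions; the class, enum, and method names are assumptions for illustration, not Pulsar's actual bundle state channel code:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch (not Pulsar's real classes) of the unload portion of the
// PIP-192 bundle state channel: OWNED -> RELEASING -> ASSIGNED -> OWNED.
public class BundleStateSketch {
    enum State { OWNED, RELEASING, ASSIGNED }

    // Allowed transitions during an unload/transfer.
    static final Map<State, Set<State>> TRANSITIONS = new EnumMap<>(State.class);
    static {
        TRANSITIONS.put(State.OWNED, EnumSet.of(State.RELEASING));
        TRANSITIONS.put(State.RELEASING, EnumSet.of(State.ASSIGNED));
        TRANSITIONS.put(State.ASSIGNED, EnumSet.of(State.OWNED));
    }

    static State transition(State from, State to) {
        if (!TRANSITIONS.get(from).contains(to)) {
            throw new IllegalStateException(from + " -> " + to + " not allowed");
        }
        return to;
    }

    public static void main(String[] args) {
        State s = State.OWNED;
        s = transition(s, State.RELEASING); // source broker releases ownership (closes ledgers)
        s = transition(s, State.ASSIGNED);  // new owner broker is assigned
        s = transition(s, State.OWNED);     // new owner acks ownership; clients are redirected
        System.out.println("final=" + s);
    }
}
```

The point of the proposal is what happens inside each transition: ledgers close at `releasing`, and clients are disconnected (with the destination URL) only once the channel reaches `owned`.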
+
+Current Unload and Lookup Sequence in Extensible Load Balancer
+```mermaid
+sequenceDiagram
+    participant Clients
+    participant Owner Broker
+    participant New Owner Broker
+    participant Leader Broker
+    Leader Broker ->> Owner Broker: "state:Releasing:" close topic
+    Owner Broker ->> Owner Broker: close broker topic sessions
+    Owner Broker ->> Clients: close producers and consumers
+    Clients ->> Clients: reconnecting (initial delay 100ms)
+    Owner Broker ->> New Owner Broker: "state:Assign:" assign new ownership
+    New Owner Broker ->> Owner Broker: "state:Owned:" ack new ownership
+    Clients ->> Owner Broker: lookup
+    Owner Broker ->> Clients: redirect
+    Clients ->> New Owner Broker: lookup
+    New Owner Broker ->> Clients: return(connected)
+```
+
+Proposed Unload Sequence in Extensible Load Balancer without Lookup
+```mermaid
+sequenceDiagram
+    participant Clients
+    participant Owner Broker
+    participant New Owner Broker
+    participant Leader Broker
+    Leader Broker ->> Owner Broker: "state:Releasing:" close topic
+    Owner Broker ->> Owner Broker: close broker topic sessions (e.g. ledgers) without disconnecting producers/consumers (fenced)
+    Clients -->> Owner Broker: message pubs are ignored
+    Owner Broker ->> New Owner Broker: "state:Assign:" assign new ownership
+    New Owner Broker ->> Owner Broker: "state:Owned:" ack new ownership
+    Owner Broker ->> Owner Broker: close the fenced broker topic sessions
+    Owner Broker ->> Clients: close producers and consumers (with newOwnerBrokerUrl)
+    Clients ->> New Owner Broker: immediately connect

Review Comment:
   > Still any inflight messages need to be ignored at source brokers.

   I agree that there will still be a window of messages that are received but not persisted. That is already part of the protocol.

   > It needs brokers to additionally send the prepare command to the clients, adding client-broker protocol complexity.

   I really think the protocol is designed to handle this complexity.
   The resulting code might have multiple conditionals, but lifecycle changes are complex, and decoupling the events (telling a producer to stop sending messages, and telling the producer which broker to connect to) seems like a more robust design, in my opinion.

   Another potential issue with ignoring protocol messages from clients during unloading is that it might increase the number of duplicate messages sent to consumers, because `ACK` protocol messages will be sent to the source broker but not persisted to the consumer's ledger. This is always a possibility (we work with at-least-once delivery, after all), but our goal is to make the protocol work in a way that minimizes these kinds of dupes. The current protocol disconnects a consumer, and that disconnect triggers behaviors that should decrease dupes (though I haven't looked at the source code in a while). This new protocol leaves the gap open a little longer. My primary concerns about the pausing are in cases when the system is already degraded. It's possible that the latency is so negligible in most cases that the dupes and extra network usage are not a problem.

   > * It adds the complexity of adding `paused` states in the clients. We have java, c++, golang code to maintain, so this will increase the scope of the work and complexity.

   These clients already handle the "pausing" when they get disconnected from one broker and reconnect to the next. As such, I think they should all already have this state built in. There will still need to be additional logic to handle the new protocol message or protocol field, but the states should be represented there already.

   > I believe this "prepare" command can be added in the next iteration (probably optimization phase 2), too, if we see the strong need here.

   This is a valid perspective. My primary counterpoint is that having a phase-two implementation would mean three hand-off mechanisms, which increases complexity more than switching to the "prepare" solution now.
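To make the trade-off under discussion concrete: the proposed protocol folds "stop sending" and "where to reconnect" into a single close command that may carry the destination broker URL. Below is a hypothetical sketch of the client-side handling; `CloseProducer`, `assignedBrokerUrl`, and `nextServiceUrl` are illustrative names, not Pulsar's actual protocol fields:

```java
import java.util.Optional;

// Hypothetical sketch of client-side handling for a close command that may
// carry the pre-assigned destination broker URL (field name assumed).
public class CloseCommandSketch {
    record CloseProducer(Optional<String> assignedBrokerUrl) {}

    static String nextServiceUrl(CloseProducer cmd, String lookupServiceUrl) {
        // If the broker told us where the bundle moved, connect there directly;
        // otherwise fall back to the usual CommandLookupTopic flow.
        return cmd.assignedBrokerUrl().orElse(lookupServiceUrl);
    }

    public static void main(String[] args) {
        CloseProducer withUrl = new CloseProducer(Optional.of("pulsar://new-owner:6650"));
        CloseProducer withoutUrl = new CloseProducer(Optional.empty());
        System.out.println(nextServiceUrl(withUrl, "pulsar://lookup:6650"));
        System.out.println(nextServiceUrl(withoutUrl, "pulsar://lookup:6650"));
    }
}
```

A separate "prepare" command, as discussed above, would split this into two client-visible events: one that pauses publishing, and a later one that supplies the reconnect target.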
