michaeljmarshall commented on code in PR #20748:
URL: https://github.com/apache/pulsar/pull/20748#discussion_r1294191843


##########
pip/pip-281.md:
##########
@@ -0,0 +1,268 @@
+<!--
+RULES
+* Never place a link to an external site like Google Doc. The proposal should 
be in this issue entirely.
+* Use a spelling and grammar checker tools if available for you (there are 
plenty of free ones).
+
+PROPOSAL HEALTH CHECK
+I can read the design document and understand the problem statement and what 
you plan to change *without* resorting to a couple of hours of code reading 
just to start having a high level understanding of the change.
+
+THIS COMMENTS
+Please remove them when done.
+-->
+
+# Background knowledge
+
+- Pulsar broker load balancer periodically unloads bundles from overloaded 
brokers. During this unload process, previous owner brokers close topic 
sessions(e.g. producers, subscriptions(consumers), managed ledgers). When 
re-assigned, new owner brokers recreate the topic sessions.
+
+- Pulsar clients request `CommandLookupTopic` to lookup or assign owner 
brokers for topics and connect to them.
+
+- PIP-192, the extensible load balancer introduced the bundle state channel 
that event-sources this unloading process in a state machine manner, from 
`releasing,` `assigned`, to `owned` state order. At `releasing,` the owner 
broker "releases" the bundle ownership(close topic sessions).
+
+- PIP-192, the extensible load balancer introduced TransferShedder, a new 
shedding strategy, which pre-assigns new owner brokers beforehand.
+
+
+# Motivation
+
+- When unloading closes many topic sessions, then many clients need to request 
CommandLookupTopic at the same time, which could cause many lookup requests on 
brokers. This unloading process can be further optimized if we can let the 
client directly connect to the new owner broker without following 
`CommandLookupTopic` requests.
+- In the new load balancer(pip-192), since the owner broker is already known, 
we can modify the close command protocol to pass the new destination broker URL 
and skip the lookup requests.
+- Also, when unloading, we can gracefully shutdown ledgers -- we always close 
old managed ledgers first and then recreate it on the new owner without 
conflicts.
+
+# Goals
+- Remove clients' lookup requests in the unload protocol to reduce the publish 
latency spike and e2e latency spike during
+unloading and also to resolve bottlenecks (of thundering lookups) when there 
are a large number of topics in a cluster.
+- Gracefully shutdown managed ledgers before new owners create them to reduce 
possible race-conditions between ledger close and ledger creations during 
unloading.
+
+## In Scope
+
+<!--
+What this PIP intend to achieve once It's integrated into Pulsar.
+Why does it benefit Pulsar.
+-->
+
+- This change will be added in the extensible load balancer.
+
+## Out of Scope
+
+<!--
+Describe what you have decided to keep out of scope, perhaps left for a 
different PIP/s.
+-->
+
+- This won't change the existing load balancer behavior(modular load manager).
+
+
+
+# High Level Design
+
+<!--
+Describe the design of your solution in *high level*.
+Describe the solution end to end, from a birds-eye view.
+Don't go into implementation details in this section.
+
+I should be able to finish reading from beginning of the PIP to here 
(including) and understand the feature and 
+how you intend to solve it, end to end.
+
+DON'T
+* Avoid code snippets, unless it's essential to explain your intent.
+-->
+
+To achieve the goals above, we could modify the bundle transfer protocol by 
the following.
+The proposed protocol change is based on the bundle states from PIP-192. 
+
+Basically, we could close the ledgers only in the releasing state and finally 
disconnect clients in the owned state with destination broker urls. The clients 
will directly connect to the pre-assigned destination broker url without 
lookups.  Meanwhile, during this transfer, any produced messages will be 
ignored by the source broker.
+
+Current Unload and Lookup Sequence in Extensible Load Balancer
+```mermaid
+sequenceDiagram
+    participant Clients
+    participant Owner Broker
+    participant New Owner Broker
+    participant Leader Broker
+    Leader Broker ->> Owner Broker: "state:Releasing:" close topic
+    Owner Broker ->> Owner Broker: close broker topic sessions
+    Owner Broker ->> Clients: close producers and consumers
+    Clients ->> Clients: reconnecting (inital delay 100ms)
+    Owner Broker ->> New Owner Broker: "state:Assign:" assign new ownership
+    New Owner Broker ->> Owner Broker: "state:Owned:" ack new ownership
+    Clients ->> Owner Broker: lookup
+    Owner Broker ->> Clients: redirect
+    Clients ->> New Owner Broker: lookup
+    New Owner Broker ->> Clients: return(connected)
+```
+
+Proposed Unload Sequence in Extensible Load Balancer without Lookup
+```mermaid
+sequenceDiagram
+    participant Clients
+    participant Owner Broker
+    participant New Owner Broker
+    participant Leader Broker
+    Leader Broker ->> Owner Broker: "state:Releasing:" close topic
+    Owner Broker ->> Owner Broker: close broker topic sessions(e.g ledgers) 
without disconnecting producers/consumers(fenced)
+    Clients -->> Owner Broker: message pubs are ignored
+    Owner Broker ->> New Owner Broker: "state:Assign:" assign new ownership
+    New Owner Broker ->> Owner Broker: "state:Owned:" ack new ownership
+    Owner Broker ->> Owner Broker: close the fenced broker topic sessions
+    Owner Broker ->> Clients: close producers and consumers (with 
newOwnerBrokerUrl)
+    Clients ->> New Owner Broker: immediately connect

Review Comment:
   Based on what you've written, I understand you think solution 2 is too 
complex. I agree that it is a large change and in open source, we focus on 
concrete improvements. As such, I would like to discuss my first solution more, 
which I do not think introduces nearly as much complexity. At it's heart, my 
first diagram is focused on removing "ignore messages" from the pulsar hand-off 
protocol. This is where my question on the other thread comes into play. What 
is the expected duration of the state transition from `Assign` to `Owned`?
   
   Even though we likely won't go with 2, here are my in line responses to your 
comment:
   
   > We probably should consider the complexity of `sending additional 
"prepare" commands in the protocol` because the complexity could add up when 
there are many clients.
   
   Well said. I agree that increased complexity should be accompanied by 
appropriate analysis of the tradeoffs. And it's especially true when adding 
logic to the Pulsar clients, which are already quite "thick" and lacking in 
appropriate documentation.
   
   > Also, in the multi-producer/consumer client env, the connections to the 
new brokers could already be there; in this case, the latency optimization 
impact could be less.
   
   Yes, in certain configurations, there would be diminishing returns for 
certain attempts to eagerly hand off the connection. However, this is only one 
style of deployment. It is also common for many applications to have just a few 
producers/consumers.
   
   > Although I believe this parallel ledger creation can help reduce latency 
in most cases, in the worst case, this could cause excessive ledger creations 
because the dst broker needs to list all of the topics for that namespace, 
filter them by the bundle mapping, and create ledgers for them.
   
   I agree that we should not naively pre-create ledgers. How hard would it be 
for the owner broker to tell the destination broker which topics it should 
pre-load?
   
   > We probably need to limit the number of pre-created ledgers or let the 
clients tell the new broker the list of topics, which will increase the 
protocol complexity and make the brokers additionally rely on clients.
   
   This could easily work if we let the client's `PRODUCER` and `SUBSCRIBE` 
protocol commands take on a special meaning when the "destination broker" is in 
the state where it is "acquiring" the bundle but hasn't taken ownership yet. In 
this case, a client connects and then attempts to create a producer or a 
consumer. The broker takes this as the queue to start loading the topic. 
However, it will delay responding to the client's command until the topic is 
fully loaded.
   
   Additionally, because the destination broker would not accept ownership of 
the topic until the broker is 99% ready to write data (there will still need to 
be final updates to zookeeper to add the pre-loaded ledger to the topic's list 
of ledgers), we minimize the risk of intermittent networking issues to 
zookeeper/bookkeeper because any issues that lead to a back off retry is out of 
the hot path, assuming the clients are not disconnected from the original 
broker eagerly.
   
   > This PIP can't optimize for all cases, but it is trying to reduce the 
latency with minimal complexity.
   
   I agree we must limit what we're optimizing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to