dwang-qm commented on issue #23451:
URL: https://github.com/apache/pulsar/issues/23451#issuecomment-2417744931

   Thank you for the explanation and source code snippets! After reviewing 
them, I think I understand.
   
   Suppose there's a partitioned topic called `xxxx`, originally with only 
one partition (so only `xxxx-partition-0` exists). When an admin adds 
another partition, the topic metadata gets updated. The Pulsar client rechecks 
the topic metadata on an interval, and when it discovers the new partition, it 
tries to create a producer for the new partition 
(`xxxx-partition-1`). It first needs to look up the broker service 
URL, so it contacts a broker it is already connected to, say `pulsar-broker-0`, 
which eventually calls `findBrokerServiceUrl`. Through 
`searchForCandidateBroker`, that broker finds that `pulsar-broker-35` is the 
best candidate. That's not `pulsar-broker-0`, so it returns a redirect 
response to the client. The Pulsar client then contacts `pulsar-broker-35` and 
repeats the lookup. `pulsar-broker-35` eventually calls 
`findBrokerServiceUrl`, finds itself the best candidate with 
`searchForCandidateBroker`, and takes ownership of the topic. All good so far.
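   
   To check that I'm describing the flow correctly, here is a toy model of 
it. This is not Pulsar code: `LookupSketch`, `bestCandidate`, `owners`, 
`lookupOn`, and `resolve` are names I made up for illustration, and the real 
lookup path is asynchronous and far more involved. The point is only that 
nothing in this path consults the partitioned topic metadata.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the lookup/redirect flow described above; not Pulsar code.
class LookupSketch {
    // Hypothetical stand-in for the load manager's pick of the best broker per topic.
    static final Map<String, String> bestCandidate = new HashMap<>();
    // Hypothetical ownership table: topic name -> broker that currently owns it.
    static final Map<String, String> owners = new HashMap<>();

    // One lookup request handled by `broker`. Returns the broker the client
    // should talk to next: `broker` itself, or a redirect target.
    static String lookupOn(String broker, String topic) {
        String owner = owners.get(topic);
        if (owner != null) {
            return owner; // already owned: answer with (or redirect to) the owner
        }
        String candidate = bestCandidate.getOrDefault(topic, broker);
        if (candidate.equals(broker)) {
            owners.put(topic, broker); // the best candidate takes ownership
        }
        return candidate;
    }

    // Client side: follow redirects until a broker answers for itself.
    static String resolve(String firstBroker, String topic, int maxRedirects) {
        String current = firstBroker;
        for (int i = 0; i <= maxRedirects; i++) {
            String next = lookupOn(current, topic);
            if (next.equals(current)) {
                return current;
            }
            current = next;
        }
        throw new IllegalStateException("too many redirects for " + topic);
    }
}
```

   With `bestCandidate` mapping `xxxx-partition-1` to `pulsar-broker-35`, 
calling `resolve("pulsar-broker-0", "xxxx-partition-1", 5)` follows one 
redirect and leaves `pulsar-broker-35` as the owner.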
   
   Then the client sends the `PRODUCER` command to `pulsar-broker-35`, which 
does own the topic and would confirm that near the top of 
`BrokerService::loadOrCreatePersistentTopic`. However, before it gets there, 
it performs the `fetchPartitionedTopicMetadataAsync` call and then checks 
whether `topicName.getPartitionIndex() < metadata.partitions`. As you say, 
topic ownership and partitioned topic metadata are separate systems, so 
`pulsar-broker-35` may in fact have acquired ownership of the topic without 
having an updated view of the partitioned topic metadata from ZooKeeper. 
`findBrokerServiceUrl` would not even be aware that `xxxx-partition-1` is a 
partition of a partitioned topic, and would never have had to check the 
partitioned topic metadata of `xxxx` to acquire ownership of 
`xxxx-partition-1`.
   
   Do you think this could happen? I think just changing 
   
   
https://github.com/apache/pulsar/blob/9d2606d73b94b4a5e1b2ffcb1e9a3c25cf71edd4/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1091
   
   to `return fetchPartitionedTopicMetadataAsync(topicNameEntity, true)` would 
solve the issue. Do you agree?
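   
   To illustrate what I mean, here is a minimal sketch of the suspected 
stale read and of what forcing a refresh would change. It is a toy model, not 
the actual `BrokerService` code: `PartitionCheckSketch`, `store`, `cache`, 
`fetchPartitions`, and `partitionExists` are hypothetical names, and the 
boolean argument is only assumed to behave like a "refresh the cache before 
reading" flag.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a partition-index check against possibly stale metadata; not Pulsar code.
class PartitionCheckSketch {
    // Hypothetical authoritative partition counts (the role of the metadata store).
    static final Map<String, Integer> store = new HashMap<>();
    // Hypothetical per-broker cache that can lag behind the store.
    final Map<String, Integer> cache = new HashMap<>();

    // Returns the partition count this broker sees. With refresh == false a
    // previously cached value is returned even if the store has since changed.
    int fetchPartitions(String topic, boolean refresh) {
        if (refresh || !cache.containsKey(topic)) {
            cache.put(topic, store.getOrDefault(topic, 0));
        }
        return cache.get(topic);
    }

    // Mirrors the shape of: topicName.getPartitionIndex() < metadata.partitions
    boolean partitionExists(String topic, int partitionIndex, boolean refresh) {
        return partitionIndex < fetchPartitions(topic, refresh);
    }
}
```

   In this model, if a broker cached `xxxx` with 1 partition and the store is 
then updated to 2, `partitionExists("xxxx", 1, false)` is false while 
`partitionExists("xxxx", 1, true)` is true.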
   
   Thank you again for your continued attention and responses!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
