dwang-qm commented on issue #23451: URL: https://github.com/apache/pulsar/issues/23451#issuecomment-2417744931
Thank you for the explanation and source code snippets! After reviewing them, I think I understand.

Suppose there's a partitioned topic called `xxxx`, originally with only one partition (so only `xxxx-partition-0` exists). When an admin adds another partition, the topic metadata gets updated. The Pulsar client rechecks the topic metadata on an interval, and when it discovers the new partition, it tries to create a producer for it (`xxxx-partition-1`). It first needs to look up the broker service URL, so it contacts a broker it is already connected to, say `pulsar-broker-0`, which eventually calls `findBrokerServiceUrl`. Through `searchForCandidateBroker`, that broker finds that `pulsar-broker-35` is actually the best suited. That's not `pulsar-broker-0`, so it returns a redirect response to the client. The Pulsar client then contacts `pulsar-broker-35` and repeats the lookup. `pulsar-broker-35` eventually calls `findBrokerServiceUrl`, finds itself the best candidate with `searchForCandidateBroker`, and takes ownership of the topic. All good so far.

Then the client sends the `PRODUCER` command to `pulsar-broker-35`, which does have ownership of the topic and will check that near the top of `BrokerService::loadOrCreatePersistentTopic`. However, before it can do that, it performs that `fetchPartitionedTopicMetadataAsync` call and then checks whether `topicName.getPartitionIndex() < metadata.partitions`. As you say, they're separate systems, and `pulsar-broker-35` may in fact have acquired ownership of the topic without having an updated view of the partitioned topic metadata from ZooKeeper. As you say, `findBrokerServiceUrl` would not even be aware that `xxxx-partition-1` is a partition of a partitioned topic, and would never have had to check the partitioned topic metadata of `xxxx` to acquire ownership of `xxxx-partition-1`. Do you think this could happen?
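To make the suspected sequence concrete, here is a minimal, self-contained Java sketch of it. It is not Pulsar code: `MetadataStore`, `Broker`, `fetchPartitions`, and `acceptsProducer` are hypothetical stand-ins, and the `refresh` flag only models the assumption that a broker can either answer from a cached partition count or re-read it from the metadata store.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class StaleMetadataSketch {

    /** Hypothetical stand-in for the metadata store (ZooKeeper in this discussion). */
    static final class MetadataStore {
        private final AtomicInteger partitions = new AtomicInteger(1);

        int readPartitions() {
            return partitions.get();
        }

        void updatePartitions(int newCount) {
            partitions.set(newCount);
        }
    }

    /** Hypothetical broker that caches the partition count it last read. */
    static final class Broker {
        private final MetadataStore store;
        private int cachedPartitions;

        Broker(MetadataStore store) {
            this.store = store;
            this.cachedPartitions = store.readPartitions();
        }

        /** Returns the cached count, re-reading the store first only when refresh is true. */
        int fetchPartitions(boolean refresh) {
            if (refresh) {
                cachedPartitions = store.readPartitions();
            }
            return cachedPartitions;
        }

        /** Models the check: partitionIndex < metadata.partitions. */
        boolean acceptsProducer(int partitionIndex, boolean refresh) {
            return partitionIndex < fetchPartitions(refresh);
        }
    }

    public static void main(String[] args) {
        MetadataStore store = new MetadataStore();
        Broker broker35 = new Broker(store); // cache holds 1 partition

        store.updatePartitions(2);           // admin adds xxxx-partition-1

        // The broker may own xxxx-partition-1 and still hold the old count.
        System.out.println("stale check accepts partition 1: "
                + broker35.acceptsProducer(1, false));
        System.out.println("refreshed check accepts partition 1: "
                + broker35.acceptsProducer(1, true));
    }
}
```

In this model the stale check rejects partition index 1 even though the store already reports two partitions, while the refreshed check accepts it. Whether the real broker cache behaves this way is exactly the open question.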
I think just changing https://github.com/apache/pulsar/blob/9d2606d73b94b4a5e1b2ffcb1e9a3c25cf71edd4/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1091 to `return fetchPartitionedTopicMetadataAsync(topicNameEntity, true)` would solve the issue. Do you agree? Thank you again for your continued attention and responses!
