Haoze Wu created KAFKA-14882:
--------------------------------
Summary: Uncoordinated states about topic in ZooKeeper nodes and
Kafka brokers cause TopicExistException at client
Key: KAFKA-14882
URL: https://issues.apache.org/jira/browse/KAFKA-14882
Project: Kafka
Issue Type: Improvement
Affects Versions: 2.8.0
Reporter: Haoze Wu
We have been testing Kafka 2.8.0 and found scenarios in which
TopicExistsException is thrown unexpectedly. We feel the design of the topic
creation process in Kafka can confuse users.
When a client sends a topic creation request to a Kafka broker, the following
steps happen:
# AdminManager checks the topic path in zkNodes and throws TopicExistsException
if the topic exists (Kafka sends a request to ZooKeeper)
# AdminManager adds the topic path in zkNodes (Kafka sends a request to ZooKeeper)
# Controller’s ZookperRequestWatcher detects the change and enqueues the
corresponding event (ZooKeeper Watcher sends a message to Kafka)
# The event is taken off the queue and executed (the Kafka broker acting as
controller sends LeaderAndIsrRequest to Kafka brokers, possibly including itself)
# The broker handles the request, and we are back at step #1
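The steps above can be sketched as a toy in-memory model. This is only an illustration under stated assumptions: TopicCreateModel, TopicExistsError, and every field and method name are hypothetical stand-ins, not Kafka or ZooKeeper classes, and the real flow is asynchronous and distributed rather than a few method calls.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model of the create flow. All names are hypothetical, not Kafka classes.
class TopicCreateModel {
    // Stand-in for the exception the client sees in step 1.
    static class TopicExistsError extends RuntimeException {
        TopicExistsError(String topic) {
            super("Topic '" + topic + "' already exists.");
        }
    }

    // Stand-in for the topic paths stored in ZooKeeper.
    final Set<String> zkTopicPaths = new HashSet<>();
    // Stand-in for the controller's event queue.
    final Queue<String> controllerEvents = new ArrayDeque<>();
    // Topics for which brokers have handled the leader/ISR request.
    final Set<String> readyOnBrokers = new HashSet<>();

    // Steps 1-3: check the path, add it, and enqueue the watcher event.
    void createTopic(String topic) {
        if (zkTopicPaths.contains(topic)) {
            throw new TopicExistsError(topic); // step 1
        }
        zkTopicPaths.add(topic);               // step 2
        controllerEvents.add(topic);           // step 3
    }

    // Steps 4-5: take one event off the queue and let the brokers handle it.
    // Returns false when there was nothing to process.
    boolean processNextEvent() {
        String topic = controllerEvents.poll();
        if (topic == null) {
            return false;
        }
        readyOnBrokers.add(topic);
        return true;
    }

    boolean isReady(String topic) {
        return readyOnBrokers.contains(topic);
    }
}
```

In this model, calling createTopic("t") twice without an intervening processNextEvent() makes the second call throw TopicExistsError while isReady("t") is still false, because the path check in step 1 only consults the ZooKeeper-side state.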
One symptom we observed: when step #4 is delayed (stuck for some reason), the
client may retry by sending the topic creation request again, which triggers
TopicExistsException in step #1. However, topic creation should behave as a
kind of “transaction”: it should have some atomicity and also be robust under
concurrent topic creation.
After some inspection, we found that it is not easy for us to implement such a
feature in Kafka given the current implementation. Our complaint remains,
though: the client gets TopicExistsException when the topic does not actually
exist or is not yet ready.
We suggest at least providing a utility that helps users mitigate this issue,
for example a tool that helps users clean the ZooKeeper data and ensures that
the topic metadata is consistent.
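As a rough idea of what the comparison step of such a tool could look like, here is a minimal sketch. It assumes the operator has already collected two sets of topic names, one from the topic paths in ZooKeeper and one from what the brokers report; how those sets are gathered is out of scope here. TopicConsistencyCheck and its method names are hypothetical, not an existing Kafka utility.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: compares two views of the topic list.
// It only reports differences and does not modify ZooKeeper or brokers.
class TopicConsistencyCheck {
    // Topics that have a path in ZooKeeper but are not reported by brokers.
    // These are candidates for inspection (for example, a stuck creation).
    static Set<String> registeredButNotServed(Set<String> zkTopics,
                                              Set<String> brokerTopics) {
        Set<String> diff = new TreeSet<>(zkTopics); // sorted copy, inputs untouched
        diff.removeAll(brokerTopics);
        return diff;
    }

    // Topics reported by brokers that have no path in ZooKeeper.
    static Set<String> servedButNotRegistered(Set<String> zkTopics,
                                              Set<String> brokerTopics) {
        Set<String> diff = new TreeSet<>(brokerTopics);
        diff.removeAll(zkTopics);
        return diff;
    }

    // The two views agree only when neither difference has any entries.
    static boolean isConsistent(Set<String> zkTopics, Set<String> brokerTopics) {
        return zkTopics.equals(brokerTopics);
    }
}
```

A real tool would still need to decide whether a topic in the first set is merely being created slowly or is genuinely orphaned before cleaning anything up; this sketch leaves that decision to the operator.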
We are waiting for feedback from the community. We can provide concrete cases,
reproduction scripts, and an analysis of the workload if needed.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)