becketqin commented on pull request #12255: URL: https://github.com/apache/flink/pull/12255#issuecomment-631167059
@aljoscha Thanks for the review. I went through all the reported failures in the Jira ticket. It looks people are reporting two different issues. 1. The "producer already closed" error caused by the race condition in `StreamOperatorWrapper` that we are trying to fix in FLINK-16383. 2. The "does not host this topic-partition" error we are trying to fix here. All the failures caused by issue 2 are either from 0.10 or 0.11 Kafka connectors, while the first issues are reported on all the versions. I think people are just adding comments to the same ticket once they see this test failed, regardless of the failure cause. That being said, theoretically speaking, using `KafkaAdminClient` to create topic does not 100% guarantee that a producer will not see the "does not host this topic-partition" error. This is because when the `AdminClient` can only guarantee the topic metadata information has existed in the broker to which it sent the `CreateTopicRequest`. When a producer comes at a later point, it might send `TopicMetdataRequest` to a different broker and that broker may have not received the updated topic metadata yet. But this is much unlikely to happen given the broker usually receives the metadata update at the same time. Having retries configured on the producer side should be sufficient to handle such cases. We can also do that for 0.10 and 0.11 producers. But given that we have the producer properties scattered over the places (which is something we probably should avoid to begin with), it would be simpler to just make sure the topic has been created successfully before we start the tests. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
