[
https://issues.apache.org/jira/browse/FLINK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834581#comment-17834581
]
yansuopeng commented on FLINK-34995:
------------------------------------
[~martijnvisser] Apologies for the confusion; the issue described is not a bug
within Flink itself. Instead, it pertains to Flink's implementation when
utilizing Kafka's blocking API, which may lead to the problem mentioned. This
can be addressed by using the invalid leader filter and discovery partition
interval.
> flink kafka connector source stuck when partition leader invalid
> ----------------------------------------------------------------
>
> Key: FLINK-34995
> URL: https://issues.apache.org/jira/browse/FLINK-34995
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / Kafka
> Affects Versions: 1.17.0, 1.19.0, 1.18.1
> Reporter: yansuopeng
> Priority: Major
>
> when partition leader invalid(leader=-1), the flink streaming job using
> KafkaSource can't restart or start a new instance with a new groupid, it
> will stuck and got following exception:
> "{*}org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms
> expired before the position for partition aaa-1 could be determined{*}"
> when leader=-1, kafka api like KafkaConsumer.position() will block until
> either the position could be determined or an unrecoverable error is
> encountered
> infact, leader=-1 not easy to avoid, even replica=3, three disk offline
> together will trigger the problem, especially when the cluster size is
> relatively large. it rely on kafka administrator to fix in time, but it
> take risk when in kafka cluster peak period.
> I have solve this problem, and want to create a PR.
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)