[
https://issues.apache.org/jira/browse/FLINK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834629#comment-17834629
]
yansuopeng commented on FLINK-34995:
------------------------------------
https://github.com/apache/flink-connector-kafka/pull/91
> flink kafka connector source stuck when partition leader invalid
> ----------------------------------------------------------------
>
> Key: FLINK-34995
> URL: https://issues.apache.org/jira/browse/FLINK-34995
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / Kafka
> Affects Versions: 1.17.0, 1.19.0, 1.18.1
> Reporter: yansuopeng
> Priority: Major
> Labels: pull-request-available
>
> when partition leader invalid(leader=-1), the flink streaming job using
> KafkaSource can't restart or start a new instance with a new groupid, it
> will stuck and got following exception:
> "{*}org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms
> expired before the position for partition aaa-1 could be determined{*}"
> when leader=-1, kafka api like KafkaConsumer.position() will block until
> either the position could be determined or an unrecoverable error is
> encountered
> infact, leader=-1 not easy to avoid, even replica=3, three disk offline
> together will trigger the problem, especially when the cluster size is
> relatively large. it rely on kafka administrator to fix in time, but it
> take risk when in kafka cluster peak period.
> I have solve this problem, and want to create a PR.
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)