[jira] [Commented] (FLINK-28060) Kafka Commit on checkpointing fails repeatedly after a broker restart

Mason Chen (Jira) Thu, 16 Jun 2022 05:30:05 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-28060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555057#comment-17555057
 ]


Mason Chen commented on FLINK-28060:
------------------------------------

+1 on [~peter.schrott]'s assessment. We also need this for metrics, and in 
case Flink state is lost or we need to do a migration where it is 
operationally easier to throw away Flink state.

In addition, Flink hasn't removed support for FlinkKafkaConsumer, right? The 
group offsets are essential for migrating to the FLIP-27 Kafka Source, since 
users will have operational issues moving without committed offsets in Kafka.

[~Christian.Lorenz77] I can look at the reproduction code next week. Does the 
commit eventually succeed, e.g. after the 5th checkpoint, the nth checkpoint, 
etc.?
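
One way to answer that from a job log is to collect the checkpoint IDs that 
appear in the "Failed to commit consumer offsets for checkpoint N" warning 
quoted in the issue below. The following is only a minimal sketch: the class 
CommitFailureScan and its method names are hypothetical and not part of Flink, 
and it assumes each warning appears on a single log line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Flink): finds failed offset commits in a log. */
class CommitFailureScan {

    // Matches the KafkaSourceReader warning text quoted in the issue description.
    private static final Pattern FAILED_COMMIT =
            Pattern.compile("Failed to commit consumer offsets for checkpoint (\\d+)");

    /** Returns the checkpoint IDs whose offset commit was logged as failed, in log order. */
    static List<Long> failedCheckpoints(List<String> logLines) {
        List<Long> failed = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = FAILED_COMMIT.matcher(line);
            if (m.find()) {
                failed.add(Long.parseLong(m.group(1)));
            }
        }
        return failed;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java CommitFailureScan <path-to-log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("Checkpoints with failed offset commits: "
                + failedCheckpoints(lines));
    }
}
```

If the reported IDs run all the way up to the most recent checkpoint, the 
commit has not recovered. A missing warning only shows that no failure was 
logged, so the committed group offsets in Kafka should still be checked 
separately.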

> Kafka Commit on checkpointing fails repeatedly after a broker restart
> ---------------------------------------------------------------------
>
>                 Key: FLINK-28060
>                 URL: https://issues.apache.org/jira/browse/FLINK-28060
>             Project: Flink
>          Issue Type: Bug
>          Components: API / DataStream, Connectors / Kafka
>    Affects Versions: 1.15.0
>         Environment: Reproduced on macOS and Linux.
> Using Java 8, Flink 1.15.0, Kafka 2.8.1.
>            Reporter: Christian Lorenz
>            Priority: Major
>         Attachments: flink-kafka-testjob.zip
>
>
> When Kafka offset committing is enabled and performed on Flink's 
> checkpointing, an error might occur if a Kafka broker is shut down that 
> happens to be the leader of that partition in Kafka's internal 
> __consumer_offsets topic.
> This is expected behaviour. But once the broker is started up again, the 
> next checkpoint issued by Flink should commit the offsets processed in the 
> meantime back to Kafka. In Flink 1.15.0 this no longer seems to happen 
> reliably, and offset committing stays broken. A warning like the following 
> is logged on each checkpoint:
> {code}
> [info] 14:33:13.684 WARN  [Source Data Fetcher for Source: input-kafka-source 
> -> Sink: output-stdout-sink (1/1)#1] o.a.f.c.k.s.reader.KafkaSourceReader - 
> Failed to commit consumer offsets for checkpoint 35
> [info] org.apache.kafka.clients.consumer.RetriableCommitFailedException: 
> Offset commit failed with a retriable exception. You should retry committing 
> the latest consumed offsets.
> [info] Caused by: 
> org.apache.kafka.common.errors.CoordinatorNotAvailableException: The 
> coordinator is not available.
> {code}
> To reproduce this I've attached a small Flink job. To execute it, 
> Java 8, Scala sbt and docker / docker-compose are required. Also see 
> readme.md for more details.
> The job can be run with `sbt run`; the Kafka cluster is started with 
> `docker-compose up`. If the Kafka brokers are then restarted gracefully, 
> e.g. with `docker-compose stop kafka1` and `docker-compose start kafka1`, 
> followed by kafka2 and kafka3, this warning occurs and no offsets are 
> committed to Kafka.
> This is not reproducible in Flink 1.14.4.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
