[jira] [Updated] (FLINK-39837) Make dynamic Kafka source resilient for Kafka topology instability

ASF GitHub Bot (Jira) Thu, 04 Jun 2026 16:02:13 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-39837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated FLINK-39837:
-----------------------------------
    Labels: pull-request-available  (was: )

> Make dynamic Kafka source resilient for Kafka topology instability 
> -------------------------------------------------------------------
>
>                 Key: FLINK-39837
>                 URL: https://issues.apache.org/jira/browse/FLINK-39837
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / Kafka
>            Reporter: Xin Gao
>            Priority: Critical
>              Labels: pull-request-available
>
> In Flink dynamic Kafka source, the Kafka metadata service could consume and 
> update the Kafka source topology change dynamically. This is great to reduce 
> the operation overhead in the happy path but is also making the system 
> vulnerable under instability (non-happy path).
> The Kafka topology metadata service/config could have flakiness and slight 
> inconsistency in the distributed system. For example, during a rollout or 
> restart, version of configs could be inconsistent between primary - standby 
> or across replicas. In such scenarios, the Flink Kafka metadata service might
>  # Get config version `n` at time `t`
>  # Get config version `n - 1` at time `t + 1`
>  # Get config version `n` at time `t + 2`
> As a result, the dynamic source might *drop a cluster* at step 2 then 
> *instantly and permanently* remove it from the checkpoint. Then at step 3, 
> even if we eventually add it back, the progress would be lost (as cluster 
> removed from checkpoint) and it would ingest from the EARLIEST.
> Hence, to make it resilient under the distributed systems, I'd propose to 
> expose an option (keep default same as today) so that we can retain the 
> removed cluster/partition in the checkpoint for x days (to be configured) so 
> that such flakiness would not cause the progress lost.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-39837) Make dynamic Kafka source resilient for Kafka topology instability

Reply via email to