[ https://issues.apache.org/jira/browse/KAFKA-13269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411606#comment-17411606 ]
A. Sophie Blee-Goldman commented on KAFKA-13269:
------------------------------------------------

Hey [~rohitbobade], thanks for the report. Unfortunately I don't think the acceptable recovery lag is directly responsible, as that config is only used within the assignor to figure out the placement of tasks. Assigning a task as "Active" just means that the instance should try to process it; the task still has to go through restoration if it's anything less than 100% caught up with the end of the changelog.

Wonder if this might be due to https://issues.apache.org/jira/browse/KAFKA-13249?

> Kafka Streams Aggregation data loss between instance restarts and rebalances
> ----------------------------------------------------------------------------
>
>                 Key: KAFKA-13269
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13269
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.6.2
>            Reporter: Rohit Bobade
>            Priority: Major
>
> Using Kafka Streams 2.6.2 and doing count based aggregation of messages. Also
> setting Processing Guarantee - EXACTLY_ONCE_BETA and
> NUM_STANDBY_REPLICAS_CONFIG = 1. Sending some messages and restarting
> instances in middle while processing to test fault tolerance. The output
> count is incorrect because of data loss while restoring state.
> It looks like the streams task becomes active and starts processing even when
> the state is not fully restored but is within the acceptable recovery lag
> (default is 10000). This results in data loss.
> {quote}A stateful active task is assigned to an instance only when its state
> is within the configured acceptable.recovery.lag, if one exists
> {quote}
> [https://docs.confluent.io/platform/current/streams/developer-guide/running-app.html?_ga=2.33073014.912824567.1630441414-1598368976.1615841473#state-restoration-during-workload-rebalance]
> [https://docs.confluent.io/platform/current/installation/configuration/streams-configs.html#streamsconfigs_acceptable.recovery.lag]
> Setting acceptable.recovery.lag to 0 and re-running the chaos tests gives the
> correct result.
> Related KIP:
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams#KIP441:SmoothScalingOutforKafkaStreams-Computingthemost-caught-upinstances]
> Just want to get some thoughts on this use case from the Kafka team or if
> anyone has encountered a similar issue
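
For anyone trying to reproduce this, the setup described in the report (EXACTLY_ONCE_BETA, one standby replica, and acceptable.recovery.lag set to 0 for the re-run) might look roughly like the sketch below. It uses plain string keys so it compiles without the Kafka jars; the application id and bootstrap servers are placeholders that do not come from the report, and the class and method names are made up for illustration. This only shows the configuration and does not imply that the recovery lag is the root cause.

```java
import java.util.Properties;

public class RecoveryLagConfigSketch {

    // Collects the Streams settings named in the report under their string keys.
    static Properties chaosTestConfig() {
        Properties props = new Properties();
        // Placeholders (assumptions, not taken from the report).
        props.setProperty("application.id", "count-aggregation-app");
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Settings mentioned in the report.
        props.setProperty("processing.guarantee", "exactly_once_beta");
        props.setProperty("num.standby.replicas", "1");
        // Value the reporter used when re-running the chaos tests (default is 10000).
        props.setProperty("acceptable.recovery.lag", "0");
        return props;
    }

    public static void main(String[] args) {
        Properties props = chaosTestConfig();
        // Print the settings in sorted key order for inspection.
        props.stringPropertyNames().stream().sorted()
             .forEach(k -> System.out.println(k + "=" + props.getProperty(k)));
    }
}
```

In a real application these properties would be passed to the KafkaStreams constructor together with the topology.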