[ 
https://issues.apache.org/jira/browse/KAFKA-13269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411606#comment-17411606
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13269:
------------------------------------------------

Hey [~rohitbobade], thanks for the report. Unfortunately I don't think the 
acceptable recovery lag is directly responsible, as that config is only used 
within the assignor to figure out the placement of tasks. Assigning a task as 
"Active" just means that the instance should try to process it, the task still 
has to go through restoration if it's anything less than 100% caught up with 
the end of the changelog. 

Wonder if this might be due to 
https://issues.apache.org/jira/browse/KAFKA-13249?

> Kafka Streams Aggregation data loss between instance restarts and rebalances
> ----------------------------------------------------------------------------
>
>                 Key: KAFKA-13269
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13269
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.6.2
>            Reporter: Rohit Bobade
>            Priority: Major
>
> Using Kafka Streams 2.6.2 and doing count based aggregation of messages. Also 
> setting Processing Guarantee - EXACTLY_ONCE_BETA and 
> NUM_STANDBY_REPLICAS_CONFIG = 1. Sending some messages and restarting 
> instances in middle while processing to test fault tolerance. The output 
> count is incorrect because of data loss while restoring state.
> It looks like the streams task becomes active and starts processing even when 
> the state is not fully restored but is within the acceptable recovery lag 
> (default is 10000) This results in data loss
> {quote}A stateful active task is assigned to an instance only when its state 
> is within the configured acceptable.recovery.lag, if one exists
> {quote}
> [https://docs.confluent.io/platform/current/streams/developer-guide/running-app.html?_ga=2.33073014.912824567.1630441414-1598368976.1615841473#state-restoration-during-workload-rebalance]
> [https://docs.confluent.io/platform/current/installation/configuration/streams-configs.html#streamsconfigs_acceptable.recovery.lag]
> Setting acceptable.recovery.lag to 0 and re-running the chaos tests gives the 
> correct result.
> Related KIP: 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams#KIP441:SmoothScalingOutforKafkaStreams-Computingthemost-caught-upinstances]
> Just want to get some thoughts on this use case from the Kafka team or if 
> anyone has encountered similar issue



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to