[jira] [Commented] (KAFKA-9846) Race condition can lead to severe lag underestimate for active tasks

Sophie Blee-Goldman (Jira) Wed, 24 Jun 2020 14:11:25 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144398#comment-17144398
 ]


Sophie Blee-Goldman commented on KAFKA-9846:
--------------------------------------------

This is definitely a limitation of the current Affects Version/Fix Version 
system: the bug is actually fixed in 2.6.0, but has not been fixed in 2.5.0 
(hence the ticket is unresolved).

That said, to avoid interfering with the release process, I think we can leave 
it as is for now and then put 2.6.0 back in the Fix Version field once it's 
released, so that users know this doesn't affect 2.6+.

> Race condition can lead to severe lag underestimate for active tasks
> --------------------------------------------------------------------
>
>                 Key: KAFKA-9846
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9846
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.5.0
>            Reporter: Sophie Blee-Goldman
>            Priority: Critical
>             Fix For: 2.7.0
>
>
> In KIP-535 we added the ability to query still-restoring and standby tasks. 
> To give users control over how out of date the data they fetch can be, we 
> added an API to KafkaStreams that fetches the end offsets for all changelog 
> partitions and computes the lag for each local state store.
> During this lag computation, we check whether an active task is in RESTORING 
> and calculate the actual lag if so. If not, we assume it's in RUNNING and 
> return a lag of zero. However, tasks may be in other states besides RUNNING 
> and RESTORING; notably, they first pass through the CREATED state before 
> getting to RESTORING. A CREATED task may happen to be caught up to the end 
> offset, but in many cases it is likely to be lagging or even completely 
> uninitialized.
> This introduces a race condition where users may be led to believe that a 
> task has zero lag and is "safe" to query even with the strictest correctness 
> guarantees, while the task is actually lagging by some unknown amount. 
> During transfer of ownership of the task between different threads on the 
> same machine, tasks can actually spend a while in CREATED while the new owner 
> waits to acquire the task directory lock. So, this race condition may not be 
> particularly rare in multi-threaded Streams applications.
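
The lag check described in the issue above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Kafka Streams code: the class `LagSketch`, the reduced `TaskState` enum, and both method names are hypothetical, and the second method is only one possible way to avoid the underestimate, not necessarily the fix that was merged.

```java
// Hypothetical sketch; none of these names come from the Kafka code base.
final class LagSketch {

    // Reduced stand-in for the task lifecycle: only the states named in the issue.
    enum TaskState { CREATED, RESTORING, RUNNING }

    // Behaviour described in the issue: only RESTORING gets a real lag,
    // every other state is assumed to be RUNNING and reports zero.
    static long lagAssumingRunning(TaskState state, long endOffset, long restoredOffset) {
        if (state == TaskState.RESTORING) {
            return Math.max(0L, endOffset - restoredOffset);
        }
        return 0L; // a CREATED task lands here and looks fully caught up
    }

    // One possible alternative (an assumption): report zero only when the task
    // is known to be RUNNING, and compute the real lag in every other state.
    static long lagCheckingRunning(TaskState state, long endOffset, long restoredOffset) {
        if (state == TaskState.RUNNING) {
            return 0L;
        }
        return Math.max(0L, endOffset - restoredOffset);
    }
}
```

For a CREATED task with an end offset of 100 and a restored offset of 40, the first method returns 0 while the second returns 60, which is the underestimate the issue describes.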



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
