[
https://issues.apache.org/jira/browse/KAFKA-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143275#comment-17143275
]
Vinoth Chandar commented on KAFKA-9846:
---------------------------------------
Sounds good. Close this as "Wont fix" then?
> Race condition can lead to severe lag underestimate for active tasks
> --------------------------------------------------------------------
>
> Key: KAFKA-9846
> URL: https://issues.apache.org/jira/browse/KAFKA-9846
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 2.5.0
> Reporter: Sophie Blee-Goldman
> Assignee: Vinoth Chandar
> Priority: Critical
> Fix For: 2.6.0
>
>
> In KIP-535 we added the ability to query still-restoring and standby tasks.
> To give users control over how out of date the data they fetch can be, we
> added an API to KafkaStreams that fetches the end offsets for all changelog
> partitions and computes the lag for each local state store.
> During this lag computation, we check whether an active task is in RESTORING
> and calculate the actual lag if so. If not, we assume it's in RUNNING and
> return a lag of zero. However, tasks may be in other states besides running
> and restoring; notably they first pass through the CREATED state before
> getting to RESTORING. A CREATED task may happen to be caught-up to the end
> offset, but in many cases it is likely to be lagging or even completely
> uninitialized.
> This introduces a race condition where users may be led to believe that a
> task has zero lag and is "safe" to query even with the strictest correctness
> guarantees, while the task is actually lagging by some unknown amount.
> During transfer of ownership of the task between different threads on the
> same machine, tasks can actually spend a while in CREATED while the new owner
> waits to acquire the task directory lock. So, this race condition may not be
> particularly rare in multi-threaded Streams applications
--
This message was sent by Atlassian Jira
(v8.3.4#803005)