[ https://issues.apache.org/jira/browse/KAFKA-10166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144562#comment-17144562 ]

Sophie Blee-Goldman commented on KAFKA-10166:
---------------------------------------------

I think this should be a blocker, yes. It's a regression in 2.6 and causes 
Streams to unnecessarily rebuild state from the changelog, which can mean a 
very long stall.

One root cause occurred to me just now while looking at some related code; 
I'll open a PR for it right away. I'm not sure it's the _only_ root cause, but 
I'll start testing to see whether it fixes the majority of the problem.

[~cadonna] do you want to split up this ticket? There are two kinds of 
TaskCorruptedException, both of which we see more often than expected. It 
probably makes sense to look at these individually and in parallel. Can you 
look into the TaskCorruptedException thrown in StoreChangelogReader#restore? 
I'll investigate my theory about the exceptions thrown in 
ProcessorStateManager.
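
To illustrate why spurious TaskCorruptedExceptions are so costly, here is a minimal sketch of the recovery path described above: on corruption, local state is discarded and rebuilt by replaying the changelog from scratch. The class names (LocalStore, ChangelogRecord, CorruptedTaskRecoverySketch) are hypothetical stand-ins for illustration, not Kafka internals.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CorruptedTaskRecoverySketch {
    // Hypothetical stand-in for a task's local state store.
    static class LocalStore {
        final Map<String, String> kv = new HashMap<>();
        void wipe() { kv.clear(); }
    }

    // Hypothetical stand-in for a changelog record: the full write history
    // of the store lives in the changelog topic.
    record ChangelogRecord(String key, String value) {}

    // When a task is flagged corrupted, local state is wiped and the entire
    // changelog is replayed. For a large store this replay is the
    // "very long stall" -- which is why unnecessary corruption flags hurt.
    static LocalStore recover(LocalStore store, List<ChangelogRecord> changelog) {
        store.wipe();
        for (ChangelogRecord r : changelog) {
            store.kv.put(r.key(), r.value());
        }
        return store;
    }

    public static void main(String[] args) {
        LocalStore store = new LocalStore();
        store.kv.put("a", "stale"); // pre-corruption local state, to be discarded
        List<ChangelogRecord> changelog = List.of(
                new ChangelogRecord("a", "1"),
                new ChangelogRecord("b", "2"));
        recover(store, changelog);
        System.out.println(store.kv.get("a") + "," + store.kv.get("b")); // prints "1,2"
    }
}
```

The sketch only models the cost structure: recovery time scales with changelog length, not with how wrong the local state actually was, so a false-positive corruption flag pays the full restore price.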

> Excessive TaskCorruptedException seen in testing
> ------------------------------------------------
>
>                 Key: KAFKA-10166
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10166
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Sophie Blee-Goldman
>            Assignee: Bruno Cadonna
>            Priority: Blocker
>             Fix For: 2.6.0
>
>
> As the title indicates, long-running test applications with injected network 
> "outages" seem to hit TaskCorruptedException more than expected.
> Seen occasionally on the ALOS application (~20 times in two days in one case, 
> for example), and very frequently with EOS (many times per day).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
