[ https://issues.apache.org/jira/browse/KAFKA-12486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman updated KAFKA-12486:
-------------------------------------------
    Description: 
In KIP-441, we added the HighAvailabilityTaskAssignor to address certain common 
scenarios which tend to lead to heavy downtime for tasks, such as scaling out. 
The new assignor will always place an active task on a client which has a 
"caught-up" copy of that task's state, if any exists, while the intended 
recipient will instead get a standby task to warm up the state in the 
background. This way we keep tasks live as much as possible, and avoid the long 
downtime imposed by state restoration on active tasks.

We can actually expand on this to reduce downtime due to restoring state: 
specifically, we may throw a TaskCorruptedException on an active task which 
leads to wiping out the state stores of that task and restoring from scratch. 
There are a few cases where this may be thrown:
 # No checkpoint found with EOS
 # TimeoutException when processing a StreamTask
 # TimeoutException when committing offsets under EOS
 # RetriableException in RecordCollectorImpl

(There is also the case of OffsetOutOfRangeException, but that is excluded here 
since it only applies to standby tasks).

We should consider triggering a rebalance when we hit TaskCorruptedException on 
an active task, after we've wiped out the corrupted state stores. This will 
allow the assignor to temporarily redirect this task to another client who can 
resume work on the task while the original owner works on restoring the state 
from scratch.

  was:
In KIP-441, we added the HighAvailabilityTaskAssignor to address certain common 
scenarios which tend to lead to heavy downtime for tasks, such as scaling out. 
The new assignor will always place an active task on a client which has a 
"caught-up" copy of that task's state, if any exists, while the intended 
recipient will instead get a standby task to warm up the state in the 
background. This way we keep tasks live as much as possible, and avoid the long 
downtime imposed by state restoration on active tasks.

We can actually expand on this to reduce downtime due to restoring state: 
specifically, we may throw a TaskCorruptedException on an active task which 
leads to wiping out the state stores of that task and restoring from scratch. 
There are a few cases where this may be thrown:
 # No checkpoint found with EOS
 # TimeoutException when processing a StreamTask
 # TimeoutException when committing offsets under EOS
 # RetriableException in RecordCollectorImpl

(There is also the case of OffsetOutOfRangeException, but that is excluded here 
since it only applies to standby tasks).

We should consider triggering a rebalance when we hit TaskCorruptedException on 
an active task so that the assignor has the chance to redirect this to another 
client who can resume work on the task while the original owner works on 
restoring the state from scratch.


> Utilize HighAvailabilityTaskAssignor to avoid downtime on corrupted task
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-12486
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12486
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Critical
>
> In KIP-441, we added the HighAvailabilityTaskAssignor to address certain 
> common scenarios which tend to lead to heavy downtime for tasks, such as 
> scaling out. The new assignor will always place an active task on a client 
> which has a "caught-up" copy of that task's state, if any exists, while the 
> intended recipient will instead get a standby task to warm up the state in 
> the background. This way we keep tasks live as much as possible, and avoid 
> the long downtime imposed by state restoration on active tasks.
> We can actually expand on this to reduce downtime due to restoring state: 
> specifically, we may throw a TaskCorruptedException on an active task which 
> leads to wiping out the state stores of that task and restoring from scratch. 
> There are a few cases where this may be thrown:
>  # No checkpoint found with EOS
>  # TimeoutException when processing a StreamTask
>  # TimeoutException when committing offsets under EOS
>  # RetriableException in RecordCollectorImpl
> (There is also the case of OffsetOutOfRangeException, but that is excluded 
> here since it only applies to standby tasks).
> We should consider triggering a rebalance when we hit TaskCorruptedException 
> on an active task, after we've wiped out the corrupted state stores. This 
> will allow the assignor to temporarily redirect this task to another client 
> who can resume work on the task while the original owner works on restoring 
> the state from scratch.
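
To make the proposed behavior concrete, here is a minimal sketch of the
placement decision, assuming a toy model rather than the real Kafka Streams
internals. Every name in it (CorruptedTaskReassignmentSketch, Placement, place,
onTaskCorrupted) is hypothetical, and the acceptableLag parameter only loosely
mirrors the idea behind KIP-441's acceptable.recovery.lag config. It is not the
HighAvailabilityTaskAssignor implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model. None of these types are Kafka Streams classes.
final class CorruptedTaskReassignmentSketch {

    /** Where one task ends up: the active client, plus an optional warm-up standby. */
    static final class Placement {
        final String activeClient;
        final Optional<String> standbyClient;

        Placement(String activeClient, Optional<String> standbyClient) {
            this.activeClient = activeClient;
            this.standbyClient = standbyClient;
        }
    }

    /**
     * Places a single task.
     *
     * @param intendedOwner client that should own the task in a balanced assignment
     * @param lagByClient   per-client changelog lag for this task's state
     * @param acceptableLag lag at or below which a client counts as "caught-up"
     */
    static Placement place(String intendedOwner, Map<String, Long> lagByClient, long acceptableLag) {
        Long ownerLag = lagByClient.get(intendedOwner);
        if (ownerLag != null && ownerLag <= acceptableLag) {
            // The intended owner is caught up, so nothing needs to be redirected.
            return new Placement(intendedOwner, Optional.empty());
        }

        // Otherwise look for the most caught-up other client.
        String best = null;
        long bestLag = Long.MAX_VALUE;
        for (Map.Entry<String, Long> entry : lagByClient.entrySet()) {
            boolean isOther = !entry.getKey().equals(intendedOwner);
            if (isOther && entry.getValue() <= acceptableLag && entry.getValue() < bestLag) {
                best = entry.getKey();
                bestLag = entry.getValue();
            }
        }

        if (best == null) {
            // Nobody is caught up: the owner keeps the active task and restores it.
            return new Placement(intendedOwner, Optional.empty());
        }
        // Redirect the active task for now; the owner warms up state as a standby.
        return new Placement(best, Optional.of(intendedOwner));
    }

    /** Models the rebalance proposed here, run after the owner has wiped its state. */
    static Placement onTaskCorrupted(String owner, Map<String, Long> lagByClient, long acceptableLag) {
        Map<String, Long> afterWipe = new LinkedHashMap<>(lagByClient);
        afterWipe.put(owner, Long.MAX_VALUE); // wiped stores are treated as maximally behind
        return place(owner, afterWipe, acceptableLag);
    }
}
```

In this model, if client-A hits the exception while client-B holds a copy of
the state within the acceptable lag, the active task moves to client-B and
client-A is left with a standby to restore in the background. If no other
client is caught up, the rebalance changes nothing and client-A restores as it
does today.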



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
