[ https://issues.apache.org/jira/browse/KAFKA-12486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303793#comment-17303793 ]
A. Sophie Blee-Goldman commented on KAFKA-12486:
------------------------------------------------

Ideally we would only kick off a rebalance under certain conditions in which we can infer that this will help: for example, if the active task was in RUNNING, or if it was in CREATED or RESTORING but within the acceptable.recovery.lag.

The reasoning here is that an active task in CREATED/RESTORING with more than the acceptable.recovery.lag left to restore would only have been assigned to this client if there were no other clients available that were considered caught up. At the moment, the assignor takes a yes/no view of the total lag, and won't take into account whether there is another client that is not completely caught up but has some amount of state (and is therefore preferable to restoring from scratch).

There are some other heuristics we could consider, such as whether the application has standbys configured; if no standbys are used, the odds of another client maintaining an up-to-date copy of this state are lower (but not zero). I don't think we necessarily need to add that much complexity from the get-go, but it's something to think about.

> Utilize HighAvailabilityTaskAssignor to avoid downtime on corrupted task
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-12486
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12486
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Critical
>
> In KIP-441, we added the HighAvailabilityTaskAssignor to address certain
> common scenarios which tend to lead to heavy downtime for tasks, such as
> scaling out. The new assignor will always place an active task on a client
> which has a "caught-up" copy of that task's state, if any exists, while the
> intended recipient will instead get a standby task to warm up the state in
> the background. This way we keep tasks live as much as possible, and avoid
> the long downtime imposed by state restoration on active tasks.
>
> We can actually expand on this to reduce downtime due to restoring state:
> specifically, we may throw a TaskCorruptedException on an active task, which
> leads to wiping out the state stores of that task and restoring from scratch.
> There are a few cases where this may be thrown:
> # No checkpoint found with EOS
> # TimeoutException when processing a StreamTask
> # TimeoutException when committing offsets under EOS
> # RetriableException in RecordCollectorImpl
>
> (There is also the case of OffsetOutOfRangeException, but that is excluded
> here since it only applies to standby tasks.)
>
> We should consider triggering a rebalance when we hit TaskCorruptedException
> on an active task, after we've wiped out the corrupted state stores. This
> will allow the assignor to temporarily redirect this task to another client
> which can resume work on the task while the original owner works on restoring
> the state from scratch.
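To make the heuristic in the comment above concrete, here is a minimal sketch of the proposed gate, assuming illustrative names: CorruptedTaskRebalanceHeuristic, shouldTriggerRebalance, and the state/lag parameters are hypothetical, not actual Kafka Streams internals. The idea it encodes is: only request a rebalance after wiping a corrupted active task if the task was in RUNNING, or if it was in CREATED/RESTORING and its remaining restoration lag was within acceptable.recovery.lag.

{code:java}
// Minimal sketch of the proposed "should we rebalance after wiping a corrupted
// task" check. All names here are illustrative, not actual Streams internals.
public class CorruptedTaskRebalanceHeuristic {

    // Mirrors the relevant subset of the task lifecycle states.
    enum TaskState { CREATED, RESTORING, RUNNING }

    // Value of the acceptable.recovery.lag config (10000 by default).
    private final long acceptableRecoveryLag;

    CorruptedTaskRebalanceHeuristic(final long acceptableRecoveryLag) {
        this.acceptableRecoveryLag = acceptableRecoveryLag;
    }

    /**
     * Decide whether wiping this corrupted active task should also trigger a
     * rebalance: only do so when some other client is plausibly caught up
     * (or nearly caught up) on the task's state.
     */
    boolean shouldTriggerRebalance(final TaskState stateWhenCorrupted,
                                   final long remainingRestorationLag) {
        if (stateWhenCorrupted == TaskState.RUNNING) {
            // The task was fully caught up here, so handing it off is cheap
            // relative to restoring its state from scratch locally.
            return true;
        }
        // CREATED/RESTORING: if the task had more than acceptable.recovery.lag
        // left to restore, it was likely assigned here only because no
        // caught-up client existed, so a rebalance is unlikely to help.
        return remainingRestorationLag <= acceptableRecoveryLag;
    }

    public static void main(final String[] args) {
        final CorruptedTaskRebalanceHeuristic heuristic =
            new CorruptedTaskRebalanceHeuristic(10_000L);

        System.out.println(heuristic.shouldTriggerRebalance(TaskState.RUNNING, 0L));        // true
        System.out.println(heuristic.shouldTriggerRebalance(TaskState.RESTORING, 5_000L));  // true
        System.out.println(heuristic.shouldTriggerRebalance(TaskState.RESTORING, 50_000L)); // false
    }
}
{code}

Note that in the actual KIP-441 flow the caught-up decision belongs to the assignor, which works from the per-task lags each client reports in its subscription; the sketch above only captures the local yes/no check that would gate whether to request a rebalance at all.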