[ https://issues.apache.org/jira/browse/KAFKA-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sudesh Wasnik reassigned KAFKA-14091: ------------------------------------- Assignee: Sudesh Wasnik (was: Sagar Rao) > Suddenly-killed tasks can leave hanging transactions open > --------------------------------------------------------- > > Key: KAFKA-14091 > URL: https://issues.apache.org/jira/browse/KAFKA-14091 > Project: Kafka > Issue Type: Improvement > Components: KafkaConnect > Reporter: Chris Egerton > Assignee: Sudesh Wasnik > Priority: Major > > Right now, if a task running with exactly-once support is killed > ungracefully, it may leave a hanging transaction open. If the transaction > included writes to the global offsets topic, then startup for future workers > becomes blocked on that transaction expiring. > Ideally, we could identify these kinds of hanging transactions and > proactively abort them. > Unfortunately, there are a few facts that make this fairly complicated: > # Workers read to the end of the offsets topic during startup, before > joining the cluster > # Workers do not know which tasks they are assigned until they join the > cluster > The result of these facts is that we cannot trust workers that are restarted > shortly after being ungracefully shut down to fence out their own hanging > transactions, since any hanging transactions would prevent them from being > able to join the group and receive their task assignment in the first place. > We could possibly accomplish this by having the leader proactively abort any > open transactions for tasks on workers that appear to have left the cluster > during a rebalance. This would not require us to wait for the scheduled > rebalance delay to elapse, since the intent of the delay is to provide a > buffer between when workers leave and when their connectors/tasks are > reallocated across the cluster (and, if the worker is able to rejoin before > that buffer is consumed, then give it back the same connectors/tasks it was > running previously); aborting transactions for tasks on these workers would > not interfere with that goal. > > It's also possible that we may have to handle the case where a > [cancelled|https://github.com/apache/kafka/blob/badfbacdd09a9ee8821847f4b28d98625f354ed7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L274-L287] > task leaves a transaction open; I have yet to confirm whether this is > possible, though. -- This message was sent by Atlassian Jira (v8.20.10#820010)