vamossagar12 commented on PR #13376: URL: https://github.com/apache/kafka/pull/13376#issuecomment-1487963852
> @vamossagar12 I don't think this is the right approach.
>
> 1. In the original design, hanging transactions blocking the global offsets topic was considered in detail, and the solution accepted at the time was per-connector offsets topics. I think this was in part because the framework couldn't make the right tradeoff of when to abort transactions in a way that was good enough for any use-case.

I agree, but if somebody does still use the global offsets topic, this issue remains for them, so I thought it might still be useful to fix. For example, automatic topic creation is disabled in some environments/enterprises, so whenever a new connector is created, a request for a new offsets topic needs to be raised, which can be a hassle at times.

> 2. Aborting transactions proactively on losing the task assignment treats every rebalance as a failure scenario, and does not permit the task to perform a clean end-of-lifetime commit, which is explicitly handled via the (internal) `TransactionBoundaryManager::shouldCommitFinalTransaction`, and is used in all of the different transaction boundary modes.

Yeah, in that sense it's expensive, but I have tried to do it only when we detect that there are missing workers during a rebalance. So what you are saying definitely applies when workers are coming up and down very frequently.

> 3. Fencing zombie source tasks is a very expensive and blocking operation, and it may delay rebalances that otherwise complete nearly instantly. I'm not sure how the cluster would react to assignments taking a significant amount of time to compute, but I don't think it would be more available and better behaved than it is now.

I see. I am not fully aware of how costly it is, but I feel that's a price we would have to pay at some point when using the global offsets topic. Otherwise, there would always be zombies lingering around.

> 4. This compromises the interface of the `IncrementalCooperativeAssignor` significantly. Previously, it only had an effect via the return value of `performAssignment`, and now it would have a side-effect on the (mutable) passed-in `Coordinator`.

Agreed, please check my comment [here](https://github.com/apache/kafka/pull/13376#issuecomment-1487939356).

> 5. This puts a solution to a problem in one implementation of an interface (`IncrementalCooperativeAssignor`) and not another (`EagerAssignor`). Since this isn't an Incremental-specific problem, why is the solution only present in incremental mode?

Yeah, I had assumed that Incremental mode is more prevalent and hence thought I would try to fix it there first. We can always extend it to Eager.
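For context on the per-connector offsets topics mentioned in point 1: with exactly-once source support, a connector can be pointed at its own offsets topic via the `offsets.storage.topic` property in its configuration. A minimal sketch of a connector-creation request body, with illustrative connector class and topic names:

```json
{
  "name": "my-source-connector",
  "config": {
    "connector.class": "org.example.MySourceConnector",
    "exactly.once.support": "required",
    "offsets.storage.topic": "my-source-connector-offsets"
  }
}
```

If automatic topic creation is disabled on the Kafka cluster, the topic named in `offsets.storage.topic` would need to be created out of band before the connector starts, which is the operational hassle described above.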