[ 
https://issues.apache.org/jira/browse/KAFKA-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976965#comment-16976965
 ] 

ASF GitHub Bot commented on KAFKA-8571:
---------------------------------------

hachikuji commented on pull request #7069: KAFKA-8571: Clean up purgatory when 
leader replica is kicked out of replica list.
URL: https://github.com/apache/kafka/pull/7069
 
 
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Not complete delayed produce requests when processing StopReplicaRequest 
> causing high produce latency for acks=all
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8571
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8571
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Zhanxiang (Patrick) Huang
>            Assignee: Zhanxiang (Patrick) Huang
>            Priority: Major
>
> Currently a broker will only attempt to complete delayed requests upon 
> high watermark changes and upon receiving a LeaderAndIsrRequest. When a 
> broker receives a StopReplicaRequest, it does not try to complete delayed 
> operations, including delayed produce requests for acks=all. This can cause 
> the producer to time out, even though it would have moved on to the new 
> leader sooner had a NotLeaderForPartition error been sent.
> This can happen during partition reassignment when the controller is trying 
> to kick the previous leader out of the replica set. In this case, the 
> controller will only send a StopReplicaRequest (not a LeaderAndIsrRequest) to 
> the previous leader in the replica set shrink phase. Here is an example:
> {noformat}
> Reassigning the replica set of partition A from [B1, B2] to [B2, B3]:
> t0: Controller expands the replica set to [B1, B2, B3]
> t1: B1 receives produce request PR on partition A with acks=all and timeout 
> T. B1 puts PR into the DelayedProducePurgatory with timeout T.
> t2: Controller elects B2 as the new leader and shrinks the replica set to 
> [B2, B3]. LeaderAndIsrRequests are sent to B2 and B3. StopReplicaRequest is 
> sent to B1.
> t3: B1 receives StopReplicaRequest but doesn't try to complete PR.
> If PR cannot be fulfilled by t3, and t1 + T > t3, PR will eventually time 
> out in the purgatory and the producer will eventually time out the produce 
> request.{noformat}
> Since it is possible for the leader to receive only a StopReplicaRequest 
> (without receiving any LeaderAndIsrRequest) when leaving the replica set, a 
> fix for this issue is to also try to complete delayed operations when 
> processing a StopReplicaRequest.
>  
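
The timeline in the description can be reproduced with a small toy model. This is only a sketch under stated assumptions: the Broker and DelayedProduce classes, their methods, and the complete_delayed flag are hypothetical names invented for illustration and are not Kafka's actual classes or the code in the pull request. The result strings mirror the names of Kafka error codes but are plain strings here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DelayedProduce:
    """A pending acks=all produce request (hypothetical model, not Kafka's class)."""
    partition: str
    deadline: float               # t1 + T in the timeline above
    result: Optional[str] = None  # None while the request sits in the purgatory


@dataclass
class Broker:
    """Toy broker holding a per-partition purgatory of delayed produce requests."""
    leader_for: set = field(default_factory=set)
    purgatory: Dict[str, List[DelayedProduce]] = field(default_factory=dict)

    def produce(self, partition: str, now: float, timeout: float) -> DelayedProduce:
        op = DelayedProduce(partition, now + timeout)
        if partition not in self.leader_for:
            op.result = "NOT_LEADER_FOR_PARTITION"
            return op
        # Replication is never satisfied in this model, so the request waits.
        self.purgatory.setdefault(partition, []).append(op)
        return op

    def check_and_complete(self, partition: str) -> None:
        # Fail pending requests right away if this broker no longer leads the partition.
        if partition in self.leader_for:
            return
        for op in self.purgatory.pop(partition, []):
            op.result = "NOT_LEADER_FOR_PARTITION"

    def stop_replica(self, partition: str, complete_delayed: bool) -> None:
        self.leader_for.discard(partition)
        if complete_delayed:  # the behaviour proposed in the issue
            self.check_and_complete(partition)

    def expire(self, now: float) -> None:
        # Time out every request whose deadline has passed.
        for partition, ops in list(self.purgatory.items()):
            remaining = []
            for op in ops:
                if now >= op.deadline:
                    op.result = "REQUEST_TIMED_OUT"
                else:
                    remaining.append(op)
            self.purgatory[partition] = remaining


def run(complete_delayed: bool):
    b1 = Broker(leader_for={"A"})
    pr = b1.produce("A", now=1.0, timeout=30.0)  # t1: PR enters the purgatory
    b1.stop_replica("A", complete_delayed)       # t3: StopReplicaRequest arrives
    after_stop = pr.result
    b1.expire(now=31.0)                          # t1 + T: purgatory timeout fires
    return after_stop, pr.result
```

In this model, run(False) leaves PR pending after the StopReplicaRequest and it only finishes at t1 + T with a timeout, while run(True) fails PR immediately with the not-leader result, so the producer could refresh metadata and retry against the new leader without waiting for T.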



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
