[ 
https://issues.apache.org/jira/browse/KAFKA-20058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sanghyeok An updated KAFKA-20058:
---------------------------------
    Labels: transaction  (was: )

> Fix race condition on backoffDeadlineMs on RPCProducerIdManager causing 
> premature retries
> -----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-20058
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20058
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: sanghyeok An
>            Assignee: sanghyeok An
>            Priority: Minor
>              Labels: transaction
>
> While investigating a flaky failure in 
> ProducerIdManagerTest.testRetryBackoffOnNoResponse, I found a race in 
> RPCProducerIdManager.maybeRequestNextBlock().
> maybeRequestNextBlock() currently does:
>  * sendRequest()
>  * backoffDeadlineMs.set(NO_RETRY) (unconditional)
> On the response path, handleUnsuccessfulResponse() does:
>  * backoffDeadlineMs.set(now + RETRY_BACKOFF_MS)
>  * requestInFlight.set(false)
>  
> Because sendRequest() is asynchronous, the unconditional backoffDeadlineMs 
> reset can run after handleUnsuccessfulResponse(), overwriting the newly-set 
> backoff deadline. If backoffDeadlineMs ends up as NO_RETRY, a subsequent 
> generateProducerId() call can re-send immediately, which may prefill 
> nextProducerIdBlock earlier than expected and lead to test flakiness (and 
> potentially unnecessary controller traffic).
>  
> In production, this race is less likely because the request/response path 
> typically has higher latency than in the unit test (which simulates the 
> controller response on a local executor). However, the code still has a 
> correctness window where a newly set backoff deadline can be clobbered by an 
> unconditional reset. Replacing the unconditional set with a compareAndSet, so 
> that the reset to NO_RETRY applies only if backoffDeadlineMs still holds the 
> value observed before sending, preserves the intended behavior, avoids 
> overwriting newer backoff values, and should have negligible performance 
> impact (the CAS runs only on the request path, and contention should be 
> rare). This also eliminates the observed test flakiness.
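
The interleaving described above can be replayed deterministically on a single thread. The following is a minimal sketch, not the actual RPCProducerIdManager code: the class name, the replay() helper, the sentinel value of NO_RETRY, and the 50 ms backoff are all assumptions made for illustration. It models only the backoffDeadlineMs field and compares the unconditional reset with a compareAndSet against the value read before sending.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified model of the backoff field; not the real Kafka class.
final class BackoffResetSketch {
    static final long NO_RETRY = -1L;          // assumed sentinel value
    static final long RETRY_BACKOFF_MS = 50L;  // assumed backoff duration

    // Replays the problematic ordering and returns the final deadline:
    //   1. request path reads the (already expired) deadline and sends
    //   2. response path records a failure and sets a new deadline
    //   3. request path resets the deadline
    static long replay(boolean useCompareAndSet, long nowMs) {
        AtomicLong backoffDeadlineMs = new AtomicLong(nowMs - 1);

        // Step 1: request path observes the deadline before sending.
        long observed = backoffDeadlineMs.get();

        // Step 2: the asynchronous failure response arrives first.
        backoffDeadlineMs.set(nowMs + RETRY_BACKOFF_MS);

        // Step 3: request path resets the deadline after sending.
        if (useCompareAndSet) {
            // Resets only if nobody changed the deadline since step 1.
            backoffDeadlineMs.compareAndSet(observed, NO_RETRY);
        } else {
            // Unconditional reset: clobbers the deadline set in step 2.
            backoffDeadlineMs.set(NO_RETRY);
        }
        return backoffDeadlineMs.get();
    }

    public static void main(String[] args) {
        System.out.println("unconditional set: " + replay(false, 1000L));
        System.out.println("compareAndSet:     " + replay(true, 1000L));
    }
}
```

In this sketch the unconditional variant ends at NO_RETRY, so the next caller would retry immediately, while the compareAndSet variant keeps the deadline written by the response path.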



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
