[jira] [Updated] (KAFKA-15459) Convert coordinator retriable errors to a known producer response error.

Justine Olshan (Jira) Tue, 12 Sep 2023 17:35:04 -0700


     [ https://issues.apache.org/jira/browse/KAFKA-15459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Justine Olshan updated KAFKA-15459:
-----------------------------------
    Description: 
KIP-890 Part 1 tries to address hanging transactions on old clients. Thus, the 
produce version cannot be bumped and no new errors can be added. Currently we 
use the Java client's notion of retriable and abortable errors: retriable 
errors are defined as such by extending the retriable error class, fatal errors 
are defined explicitly, and abortable errors are the remainder. However, many 
other clients treat unspecified errors as fatal, which means many retriable 
errors kill the application. This is not ideal.
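
The classification described above can be sketched roughly as follows. This is 
an illustrative sketch only, not the Java client's actual code: the exception 
classes are local stand-ins that merely mirror the naming style of the client's 
exception hierarchy, the fatal list is assumed, and ErrorClassification, 
Handling, classifyJavaStyle and classifyStrict are hypothetical names. The 
second method models a client that treats any error it does not explicitly 
recognize as fatal.

```java
import java.util.Set;

final class ErrorClassification {
    // Local stand-ins for illustration only; NOT the real client classes.
    static class ApiException extends RuntimeException {}
    static class RetriableException extends ApiException {}
    static class CoordinatorLoadInProgressException extends RetriableException {}
    static class SomeOtherRetriableException extends RetriableException {}
    static class ProducerFencedException extends ApiException {}
    static class InvalidTxnStateException extends ApiException {}

    enum Handling { RETRY, ABORT, FATAL }

    // Fatal errors are listed explicitly (assumed list, for illustration).
    private static final Set<Class<? extends ApiException>> FATAL_TYPES =
        Set.of(ProducerFencedException.class);

    // Java-client style: retriable by inheritance, fatal by explicit list,
    // and everything else is abortable.
    static Handling classifyJavaStyle(ApiException e) {
        if (e instanceof RetriableException) return Handling.RETRY;
        if (FATAL_TYPES.contains(e.getClass())) return Handling.FATAL;
        return Handling.ABORT;
    }

    // Strict client: only errors it explicitly knows to be retriable are
    // retried; anything unspecified is fatal.
    static Handling classifyStrict(ApiException e,
                                   Set<Class<? extends ApiException>> knownRetriable) {
        return knownRetriable.contains(e.getClass()) ? Handling.RETRY : Handling.FATAL;
    }

    public static void main(String[] args) {
        Set<Class<? extends ApiException>> known =
            Set.of(CoordinatorLoadInProgressException.class);
        ApiException unknownToStrictClient = new SomeOtherRetriableException();
        System.out.println(classifyJavaStyle(unknownToStrictClient));
        System.out.println(classifyStrict(unknownToStrictClient, known));
    }
}
```

The same retriable error is retried by the first classifier but is fatal for 
the strict one, which is the mismatch this ticket is concerned with.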

While reviewing [https://github.com/apache/kafka/pull/14370] I added some of 
the documentation for the returned errors in the produce response as well.

There were concerns about the new errors:
 * {@link Errors#COORDINATOR_LOAD_IN_PROGRESS}
 * {@link Errors#COORDINATOR_NOT_AVAILABLE}
 * {@link Errors#INVALID_TXN_STATE}
 * {@link Errors#INVALID_PRODUCER_ID_MAPPING}
 * {@link Errors#CONCURRENT_TRANSACTIONS}

The coordinator load, not available, and concurrent transactions errors should 
be retriable.

The invalid txn state and pid mapping errors should be abortable.

This is how older Java clients handle the errors, but it is unclear how other 
clients handle them. It seems that rdkafka (for example) treats the abortable 
errors as fatal instead. There, the coordinator errors are retriable but the 
concurrent transactions error is not. Generally, anything not specified 
otherwise is fatal.

It seems acceptable for the abortable errors to be fatal on some clients since 
the error is likely on a zombie producer or in a state that may be harder to 
recover from. However, for the retriable errors, we can return 
NOT_ENOUGH_REPLICAS, which is a known retriable response. We can use the 
produce API's response string to specify the real cause of the error for 
debugging.
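
A minimal sketch of the conversion described above, under stated assumptions: 
the Errors enum here is a local stand-in listing only the codes named in this 
ticket, and ProduceErrorConversion, PartitionResult, convert and the message 
wording are hypothetical names chosen for illustration, not the broker's actual 
implementation.

```java
import java.util.EnumSet;
import java.util.Set;

final class ProduceErrorConversion {
    // Local stand-in for the error codes named in this ticket.
    enum Errors {
        NONE,
        COORDINATOR_LOAD_IN_PROGRESS,
        COORDINATOR_NOT_AVAILABLE,
        CONCURRENT_TRANSACTIONS,
        INVALID_TXN_STATE,
        INVALID_PRODUCER_ID_MAPPING,
        NOT_ENOUGH_REPLICAS
    }

    // Hypothetical per-partition result: an error code plus an optional message.
    static final class PartitionResult {
        final Errors error;
        final String errorMessage; // null when there is nothing extra to report

        PartitionResult(Errors error, String errorMessage) {
            this.error = error;
            this.errorMessage = errorMessage;
        }
    }

    // The errors this ticket says should be retriable.
    private static final Set<Errors> RETRIABLE_COORDINATOR_ERRORS = EnumSet.of(
        Errors.COORDINATOR_LOAD_IN_PROGRESS,
        Errors.COORDINATOR_NOT_AVAILABLE,
        Errors.CONCURRENT_TRANSACTIONS);

    // Map retriable coordinator errors to NOT_ENOUGH_REPLICAS, keeping the real
    // cause in the message; pass every other error through unchanged.
    static PartitionResult convert(Errors original) {
        if (RETRIABLE_COORDINATOR_ERRORS.contains(original)) {
            return new PartitionResult(
                Errors.NOT_ENOUGH_REPLICAS,
                "Underlying error: " + original.name());
        }
        return new PartitionResult(original, null);
    }

    public static void main(String[] args) {
        PartitionResult r = convert(Errors.CONCURRENT_TRANSACTIONS);
        System.out.println(r.error + " (" + r.errorMessage + ")");
    }
}
```

In this sketch the abortable errors (INVALID_TXN_STATE and 
INVALID_PRODUCER_ID_MAPPING) are returned as-is, so a client sees the original 
code and decides for itself whether to abort or fail.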

There were trade-offs between keeping older clients working and keeping the 
errors clear. This seems to be the best compromise.

  was:
While reviewing [https://github.com/apache/kafka/pull/14370] I added some of 
the documentation for the returned errors in the produce response as well.

There were concerns about the new errors:
 * {@link Errors#COORDINATOR_LOAD_IN_PROGRESS}
 * {@link Errors#COORDINATOR_NOT_AVAILABLE}
 * {@link Errors#INVALID_TXN_STATE}
 * {@link Errors#INVALID_PRODUCER_ID_MAPPING}
 * {@link Errors#CONCURRENT_TRANSACTIONS}

The coordinator load, not available, and concurrent transactions errors should 
be retriable.

The invalid txn state and pid mapping errors should be abortable.

This is how older java clients handle the errors, but it is unclear how other 
clients handle them. It seems that rdkafka (for example) treats the abortable 
errors as fatal instead. The coordinator errors are retriable but not the 
concurrent transactions error.

It seems acceptable for the abortable errors to be fatal on some clients since 
the error is likely on a zombie producer or in a state that may be harder to 
recover from. However, for the retriable errors, we can return 
NOT_ENOUGH_REPLICAS which is a known retriable response. We can use the produce 
api's response string to specify the real cause of the error for debugging. 

There were trade-offs between making the older clients work and for clarity in 
errors. This seems to be the best compromise.


> Convert coordinator retriable errors to a known producer response error.
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-15459
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15459
>             Project: Kafka
>          Issue Type: Sub-task
>    Affects Versions: 3.6.0
>            Reporter: Justine Olshan
>            Assignee: Justine Olshan
>            Priority: Blocker
>             Fix For: 3.6.0
>
>
> KIP-890 Part 1 tries to address hanging transactions on old clients. Thus, 
> the produce version cannot be bumped and no new errors can be added. 
> Currently we use the Java client's notion of retriable and abortable errors: 
> retriable errors are defined as such by extending the retriable error class, 
> fatal errors are defined explicitly, and abortable errors are the remainder. 
> However, many other clients treat unspecified errors as fatal, which means 
> many retriable errors kill the application. This is not ideal.
>
> While reviewing [https://github.com/apache/kafka/pull/14370] I added some of 
> the documentation for the returned errors in the produce response as well.
>
> There were concerns about the new errors:
>  * {@link Errors#COORDINATOR_LOAD_IN_PROGRESS}
>  * {@link Errors#COORDINATOR_NOT_AVAILABLE}
>  * {@link Errors#INVALID_TXN_STATE}
>  * {@link Errors#INVALID_PRODUCER_ID_MAPPING}
>  * {@link Errors#CONCURRENT_TRANSACTIONS}
>
> The coordinator load, not available, and concurrent transactions errors 
> should be retriable.
>
> The invalid txn state and pid mapping errors should be abortable.
>
> This is how older Java clients handle the errors, but it is unclear how other 
> clients handle them. It seems that rdkafka (for example) treats the abortable 
> errors as fatal instead. There, the coordinator errors are retriable but the 
> concurrent transactions error is not. Generally, anything not specified 
> otherwise is fatal.
>
> It seems acceptable for the abortable errors to be fatal on some clients 
> since the error is likely on a zombie producer or in a state that may be 
> harder to recover from. However, for the retriable errors, we can return 
> NOT_ENOUGH_REPLICAS, which is a known retriable response. We can use the 
> produce API's response string to specify the real cause of the error for 
> debugging.
>
> There were trade-offs between keeping older clients working and keeping the 
> errors clear. This seems to be the best compromise.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
