Johnson Okorie created KAFKA-16692:
--------------------------------------
Summary: InvalidRequestException: ADD_PARTITIONS_TO_TXN with
version 4 which is not enabled when
Key: KAFKA-16692
URL: https://issues.apache.org/jira/browse/KAFKA-16692
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 3.6.2
Reporter: Johnson Okorie
We have a Kafka cluster running on version 3.5.2 that we are upgrading to
3.6.2. This cluster has a lot of clients with exactly-once semantics enabled and
hence creating transactions. As we replaced brokers with the new binaries, we
observed lots of clients in the cluster experiencing the following error:
{code:java}
2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=<client>,
transactionalId=<transactionalId>] Got error produce response with correlation
id 6402937 on topic-partition <topic-partition>, retrying (2147483512 attempts
left). Error: NETWORK_EXCEPTION. Error Message: The server disconnected before
a response was received.{code}
On inspecting the broker, we saw the following errors on brokers still running
Kafka version 3.5.2:
{code:java}
message:
Closing socket for <ChannelId> because of error
exception_exception_class:
org.apache.kafka.common.errors.InvalidRequestException
exception_exception_message:
Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not
enabled
exception_stacktrace:
org.apache.kafka.common.errors.InvalidRequestException: Received request api
key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
{code}
On the new brokers running 3.6.2 we saw the following errors:
{code:java}
[AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for
node 1043 with a network exception.{code}
We can also see this on the same brokers:
{code:java}
[AddPartitionsToTxnManager broker=1055]Cancelled in-flight
ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 being
disconnected (elapsed time since creation: 11ms, elapsed time since send: 4ms,
request timeout: 30000ms){code}
We started investigating this issue and digging through the changes in 3.6, we
came across some changes introduced as part of
[KAFKA-14402|https://issues.apache.org/jira/browse/KAFKA-14402] that we thought
might lead to this behaviour.
First, we could see that _transaction.partition.verification.enable_ is enabled
by default and enables a new code path that culminates in version 4
ADD_PARTITIONS_TO_TXN requests being sent to other brokers. These requests are generated
[here|https://github.com/apache/kafka/blob/cb35ddc5ca233d5cca6f51c1c41b952a7e9fe1a0/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].
From a
[discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x]
on the mailing list, [~jolshan] pointed out that this scenario shouldn't be
possible, as the following code paths should prevent version 4
ADD_PARTITIONS_TO_TXN requests being sent to other brokers:
[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]
However, these requests still seem to be sent to other brokers in our
environment.
On further inspection of the code, I am wondering if the following code path
could lead to this issue:
[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]
In this scenario, we don't have any _NodeApiVersions_ available for the
specified nodeId, so the _latestUsableVersion_ check is potentially skipped
and the latest version known to the sender is used instead. I am wondering if it is
possible that because _discoverBrokerVersions_ is set to false for the network
client of the AddPartitionsToTxnManager, it skips fetching ApiVersions? The
network client is created here:
[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]
This _NetworkUtils.buildNetworkClient_ seems to create a network client that
has _discoverBrokerVersions_ set to false.
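To make the suspected path concrete, here is a minimal, self-contained sketch of the version selection as I understand it. It is a simplified model, not Kafka's actual code: the class _VersionSelectionSketch_, the _SupportedRange_ type and the _recordApiVersions_ helper are hypothetical names, and the version ranges (0 to 4 known to the sender, 0 to 3 advertised by the old broker) are assumed values chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of the suspected behaviour; not Kafka's real classes.
class VersionSelectionSketch {

    // Hypothetical stand-in for the version range a broker advertises for one API.
    static final class SupportedRange {
        final short min;
        final short max;

        SupportedRange(short min, short max) {
            this.min = min;
            this.max = max;
        }
    }

    // Advertised ranges keyed by node id. Stays empty if ApiVersions is never fetched,
    // which is what I suspect happens when discoverBrokerVersions is false.
    private final Map<Integer, SupportedRange> apiVersions = new HashMap<>();

    void recordApiVersions(int nodeId, short min, short max) {
        apiVersions.put(nodeId, new SupportedRange(min, max));
    }

    short chooseVersion(int nodeId, short oldestAllowed, short latestAllowed) {
        SupportedRange range = apiVersions.get(nodeId);
        if (range == null) {
            // No version information for this node: fall back to the newest
            // version the sender knows, without consulting the receiver.
            return latestAllowed;
        }
        // Version information available: pick the highest version both sides support.
        short usable = (short) Math.min(latestAllowed, range.max);
        if (usable < Math.max(oldestAllowed, range.min)) {
            throw new IllegalStateException("No usable version for node " + nodeId);
        }
        return usable;
    }

    public static void main(String[] args) {
        VersionSelectionSketch client = new VersionSelectionSketch();

        // Assumed: sender knows versions 0..4, nothing recorded for node 1043.
        System.out.println(client.chooseVersion(1043, (short) 0, (short) 4)); // prints 4

        // Assumed: the old broker advertises 0..3 once ApiVersions is known.
        client.recordApiVersions(1043, (short) 0, (short) 3);
        System.out.println(client.chooseVersion(1043, (short) 0, (short) 4)); // prints 3
    }
}
```

If this model matches the real behaviour, a 3.6.2 broker with no version information for a 3.5.2 peer would send version 4, which is consistent with the InvalidRequestException above.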
I was hoping I could get some assistance debugging this issue.