Johnson Okorie created KAFKA-16692:
--------------------------------------

             Summary: InvalidRequestException: ADD_PARTITIONS_TO_TXN with 
version 4 which is not enabled when 
                 Key: KAFKA-16692
                 URL: https://issues.apache.org/jira/browse/KAFKA-16692
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.6.2
            Reporter: Johnson Okorie


We have a Kafka cluster running on version 3.5.2 that we are upgrading to 
3.6.2. This cluster has a lot of clients with exactly-once semantics enabled, 
which therefore create transactions. As we replaced brokers with the new 
binaries, we observed lots of clients in the cluster experiencing the following 
error:


{code:java}
2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=<client>, 
transactionalId=<transactionalId>] Got error produce response with correlation 
id 6402937 on topic-partition <topic-partition>, retrying (2147483512 attempts 
left). Error: NETWORK_EXCEPTION. Error Message: The server disconnected before 
a response was received.{code}
On inspecting the brokers, we saw the following errors on those still running 
Kafka version 3.5.2:

 
{code:java}
message:     
Closing socket for <ChannelId> because of error
exception_exception_class:    
org.apache.kafka.common.errors.InvalidRequestException
exception_exception_message:    
Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
enabled
exception_stacktrace:    
org.apache.kafka.common.errors.InvalidRequestException: Received request api 
key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
{code}
On the new brokers running 3.6.2 we saw the following errors:

 
{code:java}
[AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
node 1043 with a network exception.{code}
 

I can also see this:
{code:java}
[AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 being 
disconnected (elapsed time since creation: 11ms, elapsed time since send: 4ms, 
request timeout: 30000ms){code}

We started investigating this issue and, digging through the changes in 3.6, 
came across some changes introduced as part of 
[KAFKA-14402|https://issues.apache.org/jira/browse/KAFKA-14402] that we thought 
might lead to this behaviour. 

First, we could see that _transaction.partition.verification.enable_ is enabled 
by default and enables a new code path that culminates in us sending version 4 
ADD_PARTITIONS_TO_TXN requests to other brokers. These requests are generated 
[here|https://github.com/apache/kafka/blob/cb35ddc5ca233d5cca6f51c1c41b952a7e9fe1a0/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].

From a 
[discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
possible as the following code paths should prevent version 4 
ADD_PARTITIONS_TO_TXN requests being sent to other brokers:

[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
 
[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]
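
As I understand it, the check those code paths are expected to perform amounts to picking the highest request version inside both the sender's and the receiver's supported ranges. The following is a minimal, self-contained sketch of that idea and not the Kafka implementation: the class name, the method signature, and the assumption that the 3.5.2 broker advertises versions 0 to 3 are all hypothetical and used only for illustration.

```java
class VersionNegotiationSketch {

    // Returns the highest version inside both ranges,
    // or -1 if the two ranges do not overlap.
    static short latestUsableVersion(short clientMin, short clientMax,
                                     short brokerMin, short brokerMax) {
        short low = (short) Math.max(clientMin, brokerMin);
        short high = (short) Math.min(clientMax, brokerMax);
        return low <= high ? high : (short) -1;
    }

    public static void main(String[] args) {
        // Assumed ranges: the sender supports 0..4, the older broker 0..3.
        short version = latestUsableVersion((short) 0, (short) 4,
                                            (short) 0, (short) 3);
        System.out.println("negotiated version: " + version);
    }
}
```

Under those assumed ranges the negotiated version would be 3, so a version 4 request should never reach the older broker as long as this check actually runs.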

However, these requests still seem to be sent to other brokers in our 
environment. 

On further inspection of the code, I am wondering if the following code path 
could lead to this issue:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]

In this scenario, we don't have any _NodeApiVersions_ available for the 
specified nodeId, so the expected _latestUsableVersion_ check is potentially 
skipped. I am wondering if it is possible that, because 
_discoverBrokerVersions_ is set to false for the network client of the 
AddPartitionsToTxnManager, it skips fetching ApiVersions? We create the network 
client here:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]

This _NetworkUtils.buildNetworkClient_ seems to create a network client that 
has _discoverBrokerVersions_ set to false. 
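
To make the suspected behaviour concrete, here is a small self-contained model of it. This is a sketch of my hypothesis only, not code from Kafka: the class, the _chooseVersion_ method and the map of known broker versions are hypothetical, and the maximum versions (4 for the sender, 3 for node 1043) are assumptions taken from the errors above.

```java
import java.util.HashMap;
import java.util.Map;

class VersionFallbackSketch {
    // Assumed highest ADD_PARTITIONS_TO_TXN version the sending broker can build.
    static final short CLIENT_MAX = 4;

    // brokerMaxByNode holds the highest version each node is known to support.
    // A missing entry models a node whose ApiVersions were never fetched.
    static short chooseVersion(Map<Integer, Short> brokerMaxByNode, int nodeId) {
        Short brokerMax = brokerMaxByNode.get(nodeId);
        if (brokerMax == null) {
            // Nothing is known about the node, so no negotiation happens
            // and the sender's latest version is used as-is.
            return CLIENT_MAX;
        }
        // Versions are known: cap the request at what the node supports.
        return (short) Math.min(CLIENT_MAX, brokerMax);
    }

    public static void main(String[] args) {
        Map<Integer, Short> known = new HashMap<>();
        System.out.println("versions unknown: " + chooseVersion(known, 1043));

        known.put(1043, (short) 3);
        System.out.println("versions known:   " + chooseVersion(known, 1043));
    }
}
```

If the real client behaves like the first branch when version discovery is disabled, that would explain version 4 requests arriving at a 3.5.2 broker.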

I was hoping I could get some assistance debugging this issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
