[
https://issues.apache.org/jira/browse/KAFKA-20322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ritika Reddy updated KAFKA-20322:
---------------------------------
Description:
Bug: TransactionMarkerChannelManager does not discover broker API versions,
causing UnsupportedVersionException during rolling upgrades
[KIP-1228|https://cwiki.apache.org/confluence/display/KAFKA/KIP-1228%3A+Add+Transaction+Version+to+WriteTxnMarkersRequest]
(KAFKA-19446) added WriteTxnMarkersRequest v2 with a TransactionVersion field.
However, TransactionMarkerChannelManager creates its NetworkClient with
discoverBrokerVersions=false, which disables API version negotiation with peer
brokers. Without version discovery, the ApiVersions cache is never populated,
so apiVersions.get(nodeId) returns null in NetworkClient.doSend(). The client
then falls through to builder.latestAllowedVersion(), which uses the highest
version the sending broker knows about instead of negotiating a mutually
supported version. TransactionMarkerChannelManager is the only inter-broker
NetworkClient that sets discoverBrokerVersions=false; all others
(ReplicaFetcherThread, AlterPartitionManager, ForwardingManager, etc.)
use true.
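For illustration, the version choice described above can be modeled roughly as
follows. This is a simplified sketch and not the actual NetworkClient code:
VersionChoiceSketch, recordDiscovery and chooseVersion are hypothetical names,
and the peer maximum of 1 is an assumed value for a broker that predates
KIP-1228. The real latestUsableVersion() also checks minimum versions, which
is omitted here.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the request-version choice; none of these types are Kafka classes.
final class VersionChoiceSketch {
    // Hypothetical cache: nodeId -> highest request version that node accepts.
    // It stays empty when version discovery is disabled.
    private final Map<Integer, Short> discoveredMaxVersion = new HashMap<>();
    // Highest version the sending broker itself knows about.
    private final short latestAllowedVersion;

    VersionChoiceSketch(short latestAllowedVersion) {
        this.latestAllowedVersion = latestAllowedVersion;
    }

    // Stands in for processing an ApiVersionsResponse from a peer.
    void recordDiscovery(int nodeId, short maxSupported) {
        discoveredMaxVersion.put(nodeId, maxSupported);
    }

    short chooseVersion(int nodeId) {
        Short peerMax = discoveredMaxVersion.get(nodeId);
        if (peerMax == null) {
            // No discovery data: fall back to the sender's latest version.
            return latestAllowedVersion;
        }
        // Discovery data present: cap the version at what the peer supports.
        return (short) Math.min(latestAllowedVersion, peerMax);
    }

    public static void main(String[] args) {
        VersionChoiceSketch sender = new VersionChoiceSketch((short) 2);
        int olderBroker = 7; // hypothetical node id

        // Discovery disabled: the cache is empty, so v2 is sent regardless.
        System.out.println(sender.chooseVersion(olderBroker)); // prints 2

        // Discovery enabled: the peer reported an assumed maximum of v1.
        sender.recordDiscovery(olderBroker, (short) 1);
        System.out.println(sender.chooseVersion(olderBroker)); // prints 1
    }
}
```

With an empty cache the sketch returns the sender's latest version, which is
the path that leads to UnsupportedVersionException on the older broker.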
*IMPACT:* When a 4.2+ broker is the transaction coordinator and needs to write
markers to a partition leader running 4.1 or earlier, it sends
WriteTxnMarkersRequest v2, which the older broker does not support, causing
UnsupportedVersionException. Transaction markers are never written, leaving
transactions stuck in PrepareCommit/PrepareAbort. This prevents the LSO (Last
Stable Offset) from advancing, which blocks all read_committed consumers on
affected partitions. The issue is self-resolving once all brokers are upgraded
to the same version, but transactions stuck during the mixed-version window
remain blocked until the coordinator retries successfully.
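The effect on read_committed consumers can be sketched with a toy model of the
LSO. This assumes the usual definition, where the LSO is the first offset of
the earliest still-open transaction on the partition, or the high watermark
when no transaction is open. LastStableOffsetSketch and its methods are
hypothetical and are not Kafka classes.

```java
import java.util.TreeSet;

// Toy model of one partition's last stable offset; not Kafka code.
final class LastStableOffsetSketch {
    // First offsets of transactions that have no commit/abort marker yet.
    private final TreeSet<Long> openTxnFirstOffsets = new TreeSet<>();
    private long highWatermark;

    void beginTransaction(long firstOffset) {
        openTxnFirstOffsets.add(firstOffset);
    }

    // Called once the transaction marker has been written to the partition.
    void markerWritten(long firstOffset) {
        openTxnFirstOffsets.remove(firstOffset);
    }

    void setHighWatermark(long highWatermark) {
        this.highWatermark = highWatermark;
    }

    long lastStableOffset() {
        if (openTxnFirstOffsets.isEmpty()) {
            return highWatermark;
        }
        // The earliest open transaction pins the LSO.
        return Math.min(openTxnFirstOffsets.first(), highWatermark);
    }

    public static void main(String[] args) {
        LastStableOffsetSketch partition = new LastStableOffsetSketch();
        partition.beginTransaction(100L);
        partition.setHighWatermark(500L);
        System.out.println(partition.lastStableOffset()); // prints 100

        // More data arrives, but the marker was never written: LSO stays put.
        partition.setHighWatermark(900L);
        System.out.println(partition.lastStableOffset()); // prints 100

        // Once the marker is finally written, the LSO catches up.
        partition.markerWritten(100L);
        System.out.println(partition.lastStableOffset()); // prints 900
    }
}
```

In this model a read_committed consumer cannot read past offset 100 until the
marker is written, no matter how far the high watermark advances.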
*AFFECTED CODE:*
core/src/main/scala/kafka/coordinator/transaction/TransactionMarkerChannelManager.scala
https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionMarkerChannelManager.scala#L98
*SOLUTION:* Set discoverBrokerVersions to true in
TransactionMarkerChannelManager's NetworkClient constructor. This enables the
client to send ApiVersionsRequest when connecting to each broker, populate the
ApiVersions cache, and use latestUsableVersion() to negotiate the correct
request version. This is a one-line change.
was:
This bug was introduced by
[KIP-1228|https://cwiki.apache.org/confluence/display/KAFKA/KIP-1228%3A+Add+Transaction+Version+to+WriteTxnMarkersRequest],
which added a new WriteTxnMarkersRequest version (v2). However, the
TransactionMarkerChannelManager still created its NetworkClient with
discoverBrokerVersions=false, which disables API version negotiation with peer
brokers. This causes the transaction coordinator to blindly send
WriteTxnMarkersRequest at the latest supported version, which fails with
UnsupportedVersionException when the target broker is running an older version
that doesn't support that version. Transactions get permanently stuck in
PrepareCommit or PrepareAbort during rolling upgrades.
*ROOT CAUSE:* TransactionMarkerChannelManager creates NetworkClient with
discoverBrokerVersions=false, which disables API version discovery. Without
version discovery, NetworkClient blindly uses the latest API version for all
requests. Discovery should be set to true for compatibility.
*IMPACT:* When a 4.2 broker is the transaction coordinator and needs to write
markers to a 4.1 or earlier broker (partition leader), it sends
WriteTxnMarkersRequest v2, which the older broker doesn't support, causing
UnsupportedVersionException.
*AFFECTED CODE:* origin/trunk:
core/src/main/scala/kafka/coordinator/transaction/TransactionMarkerChannelManager.scala
https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionMarkerChannelManager.scala#L98
*SOLUTION:* Set version discovery flag to true in the Transaction Marker
Channel Manager
> TransactionMarkerChannelManager has discoverBrokerVersions=false causing
> UnsupportedVersionException during rolling upgrades
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-20322
> URL: https://issues.apache.org/jira/browse/KAFKA-20322
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 4.2.0
> Reporter: Ritika Reddy
> Assignee: Ritika Reddy
> Priority: Major
> Fix For: 4.2.1
>