Chia-Ping Tsai created KAFKA-19999:
--------------------------------------
Summary: Transaction coordinator livelock caused by invalid
producer epoch
Key: KAFKA-19999
URL: https://issues.apache.org/jira/browse/KAFKA-19999
Project: Kafka
Issue Type: Bug
Reporter: Chia-Ping Tsai
Assignee: Chia-Ping Tsai
Fix For: 4.2.0, 4.0.2, 4.1.2
*case 1: during recovery*
When a Transaction Coordinator fails over and reloads transactions in
PREPARE_COMMIT or PREPARE_ABORT state from the transaction log, it currently
reuses the logged producer epoch to send the transaction markers.
In Transaction V2, brokers enforce strict epoch monotonicity for control
batches (markers), requiring marker_epoch > current_epoch. Consequently, the
broker rejects the recovery marker with InvalidProducerEpochException.
{code:java}
private void checkProducerEpoch(short producerEpoch, long offset, short
transactionVersion) {
short current = updatedEntry.producerEpoch();
boolean invalidEpoch = (transactionVersion >= 2) ? (producerEpoch <=
current) : (producerEpoch < current);
if (invalidEpoch) {
String comparison = (transactionVersion >= 2) ? "<=" : "<";
String message = "Epoch of producer " + producerId + " at offset "
+ offset + " in " + topicPartition +
" is " + producerEpoch + ", which is " + comparison + " the
last seen epoch " + current +
" (TV" + transactionVersion + ")";
if (origin == AppendOrigin.REPLICATION) {
log.warn(message);
} else {
// Starting from 2.7, we replaced ProducerFenced error with
InvalidProducerEpoch in the
// producer send response callback to differentiate from the
former fatal exception,
// letting client abort the ongoing transaction and retry.
throw new InvalidProducerEpochException(message);
}
}
}
{code}
The coordinator handles this error by removing the pending transaction from
memory without writing a COMPLETE state to the transaction log. This leaves the
transaction permanently hanging in the PREPARE state. Clients attempting to
continue or commit offsets subsequently fail with CONCURRENT_TRANSACTIONS in an
infinite retry loop.
{code:scala}
case Errors.INVALID_PRODUCER_EPOCH |
Errors.TRANSACTION_COORDINATOR_FENCED => // producer
or coordinator epoch has changed, this txn can now be ignored
info(s"Sending $transactionalId's transaction marker for
partition $topicPartition has permanently failed with error
${error.exceptionName} " +
s"with the current coordinator epoch
${epochAndMetadata.coordinatorEpoch}; cancel sending any more transaction
markers $txnMarker to the brokers")
txnMarkerChannelManager.removeMarkersForTxn(pendingCompleteTxn)
abortSending = true
{code}
*case 2: due to disconnection*
If a WriteTxnMarkersRequest is persisted by a partition leader but the response
is lost (e.g., due to disconnection), the Transaction Coordinator (TC) retries
the request using the same producer epoch.
If the target broker restarts and restores its state before receiving the
retry, it rejects the request with InvalidProducerEpochException.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)