[ https://issues.apache.org/jira/browse/KAFKA-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18054490#comment-18054490 ]
sanghyeok An commented on KAFKA-20090:
--------------------------------------
[~jolshan] [~chia7712]
It might help to separate this discussion into two related but slightly
different concerns.
I may be misunderstanding some details, so please correct me if anything below
is off. :)
*A. Single-producer, epoch-exhaustion boundary case (the reproducer)*
The reproducer seems to show a race between a timeout-driven abort and a client
EndTxn ABORT near the max-epoch boundary, where the request can be interpreted
as a retry and we may end up returning usable MAX_EPOCH state to the client.
With TV2, that appears to open a path to an ONGOING transaction that can never
be completed or timed out.
For this issue, one option could be to keep the fix fairly targeted around the
retry/completion handling, and avoid returning an exhausted epoch to the
client, potentially steering the flow toward re-init or PID rotation using
existing mechanisms.
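To make that direction concrete, here is a minimal sketch of the kind of guard
I have in mind (all names here are hypothetical, not the actual coordinator
code):
{code:java}
// Illustrative sketch only (hypothetical names, not real Kafka code):
// on a would-be "retry", never hand an exhausted epoch back as usable state.
final case class TxnState(producerId: Long, producerEpoch: Short)

sealed trait RetryOutcome
case object ForceReInit extends RetryOutcome // steer the client toward InitProducerId / PID rotation
final case class ReturnState(state: TxnState) extends RetryOutcome

def classifyEndTxnRetry(current: TxnState): RetryOutcome =
  if (current.producerEpoch == Short.MaxValue) ForceReInit
  else ReturnState(current)
{code}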
*B. Ownership-change corner case (raised in the last comment)*
Separately, and independent of max epoch, I think there is a broader question
when the abort is caused by InitProducerId fencing. If ownership has changed (a
new producer takes over the transactionalId), then a late ABORT from the old
producer ideally should not be treated as a valid retry, and we probably want
to avoid returning the current pid/epoch to the old producer. This makes me
wonder if the retry classification could incorporate an ownership check in
addition to just comparing state/result.
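As a very rough sketch of what that could look like (again, every name is
hypothetical, and a real check would sit alongside the existing state/result
comparison rather than replace it):
{code:java}
// Illustrative sketch only (hypothetical names, not real Kafka code):
// a late ABORT counts as a retry only if it comes from the producer
// instance that still owns the transactionalId.
final case class OwnerInfo(producerId: Long, producerEpoch: Short, ownershipGen: Long)

def isRetryFromCurrentOwner(reqProducerId: Long,
                            reqEpoch: Short,
                            reqOwnershipGen: Long,
                            current: OwnerInfo): Boolean =
  reqOwnershipGen == current.ownershipGen && // ownership unchanged since the last InitProducerId
    reqProducerId == current.producerId &&
    reqEpoch == current.producerEpoch - 1    // TV2 bumps the epoch on every completed EndTxn
{code}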
As a related thought (more of a conceptual/long-term idea), I wonder if it
would be cleaner if each InitProducerId call for a given transactionalId always
resulted in a new producerId, so that producerId alone represents a unique
producer instance. Today we effectively identify ownership using producerId
plus epoch, but epoch can change both when a new producer takes over and when
the coordinator bumps epochs due to other reasons (e.g., timeouts /
fencing-related transitions). That can make the conceptual meaning of
producerId plus epoch feel a bit overloaded.
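As a toy model of that distinction (purely conceptual, nothing like the real
allocation path):
{code:java}
import java.util.concurrent.atomic.AtomicLong

// Toy model only: if every InitProducerId call handed out a fresh producerId,
// the pid alone would identify one producer instance, and the epoch would be
// just a per-pid sequence instead of also acting as an ownership signal.
object ToyPidAllocator {
  private val nextPid = new AtomicLong(0L)

  def initProducerId(transactionalId: String): Long =
    nextPid.getAndIncrement() // fresh pid per call, regardless of the prior owner
}
{code}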
I realize InitProducerId is retryable and always rotating producerId could have
compatibility/idempotency implications, so this may be better treated as a
separate discussion rather than part of the minimal fix for this issue.
I’m still learning the TV2/transaction internals,
so I’d really appreciate any corrections if I got something wrong. :)
> TV2 can allow for ongoing transactions with max epoch that never complete
> -------------------------------------------------------------------------
>
> Key: KAFKA-20090
> URL: https://issues.apache.org/jira/browse/KAFKA-20090
> Project: Kafka
> Issue Type: Task
> Reporter: Justine Olshan
> Assignee: sanghyeok An
> Priority: Major
>
> With transaction version 2, epoch bumps happen on every transaction.
> The original EndTransaction logic considers retries, and because of epoch
> bumps we wanted to be careful not to fence ourselves. This means that for
> EndTransaction retries, we have to check whether the epoch has been bumped
> in order to recognize a retry.
> The original logic returns the current producer ID and epoch in the
> transaction metadata when a retry has been identified. The normal end
> transaction case with max epoch - 1 was considered and accounted for – the
> state there is safe to return to the producer.
> However, we didn't consider the case of fencing epoch bumps with
> max epoch - 1, where we also bump the epoch but don't create a new
> producer ID and epoch. In this scenario the producer was expected to be
> fenced and to call init producer ID, so this state isn't a problem on its
> own, but it is a problem if we try to return it to the producer.
> There is a scenario where we race a timeout against an end transaction
> abort with max epoch - 1: we can consider the end transaction request a
> "retry" and return max epoch as the current producer's epoch instead of
> fencing.
> 1. The fencing abort on transactional timeout bumps the epoch to max
> 2. The EndTxn request with max epoch - 1 is considered a "retry" and we
> return max epoch
> 3. The producer can start a transaction since we don't check epochs on
> starting transactions
> 4. We cannot commit this transaction with TV2 and we cannot timeout the
> transaction. It is stuck in Ongoing forever.
> I modified
> [https://github.com/apache/kafka/blob/aad33e4e41aaa94b06f10a5be0094b717b98900f/core/src/test/scala/unit/kafka/coordinator/transaction/TransactionCoordinatorTest.scala#L1329]
> to capture this behavior. I added the following code to the end:
> {code:java}
> // Transition to COMPLETE_ABORT since we can't do it via the write-markers
> // response callback
> txnMetadata.completeTransitionTo(new TxnTransitMetadata(
>   txnMetadata.producerId(),
>   txnMetadata.prevProducerId(),
>   txnMetadata.nextProducerId(),
>   Short.MaxValue,               // producerEpoch: already exhausted
>   (Short.MaxValue - 1).toShort, // lastProducerEpoch (Int needs .toShort in Scala)
>   txnTimeoutMs,
>   txnMetadata.pendingState().get(),
>   new util.HashSet[TopicPartition](),
>   txnMetadata.txnLastUpdateTimestamp(),
>   txnMetadata.txnLastUpdateTimestamp(),
>   TV_2))
>
> coordinator.handleEndTransaction(transactionalId, producerId, epochAtMaxBoundary,
>   TransactionResult.ABORT, TV_2, endTxnCallback)
>
> assertEquals(10, newProducerId)
> assertEquals(Short.MaxValue, newEpoch)
> assertEquals(Errors.NONE, error)
> {code}