Thanks for the KIP.
A couple of clarification questions (I am not a broker expert, so maybe
some questions are obvious to others, but not to me given my lack of
broker knowledge).
(10)
> The delayed message case can also violate EOS if the delayed message comes in
> after the next addPartitionsToTxn request comes in. Effectively we may see a
> message from a previous (aborted) transaction become part of the next
> transaction.
What happens if the message comes in before the next addPartitionsToTxn
request? It seems the broker hosting the data partitions won't know
anything about it and will append it to the partition, too? What is the
difference between both cases?
Also, it seems a TX would only hang if there is no following TX that is
either committed or aborted? Thus, for the case above, the TX might
actually not hang (of course, we might get an EOS violation if the first
TX was aborted and the second committed, or the other way around).
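To make the two outcomes I am asking about concrete, here is a minimal sketch of how I picture them. It is a toy model only, not broker code: `PartitionLog`, `run`, and the record names are made up for illustration, it assumes a single producer on the partition, and it assumes the broker appends a delayed write without checking transaction state.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy model of one data partition (hypothetical, single producer)."""
    # Each entry is ("data", value) or ("marker", "COMMIT" | "ABORT").
    records: list = field(default_factory=list)

    def append_data(self, value):
        # Assumption: the append happens without consulting the coordinator.
        self.records.append(("data", value))

    def write_marker(self, result):
        self.records.append(("marker", result))

    def open_transaction(self):
        """Data records written after the last marker are still open."""
        pending = []
        for kind, value in self.records:
            if kind == "marker":
                pending = []
            else:
                pending.append(value)
        return pending

    def committed_values(self):
        """Values that a COMMIT marker made visible."""
        visible, pending = [], []
        for kind, value in self.records:
            if kind == "data":
                pending.append(value)
            else:
                if value == "COMMIT":
                    visible.extend(pending)
                pending = []
        return visible


def run(next_txn_touches_partition):
    log = PartitionLog()
    log.append_data("txn1-a")
    log.write_marker("ABORT")          # first TX is aborted
    log.append_data("txn1-delayed")    # delayed write arrives after the marker
    if next_txn_touches_partition:
        log.append_data("txn2-a")
        log.write_marker("COMMIT")     # second TX commits on this partition
    return log


hanging = run(next_txn_touches_partition=False)
leaked = run(next_txn_touches_partition=True)
```

In this model `hanging` keeps `txn1-delayed` open forever because no marker follows it, while `leaked` exposes `txn1-delayed` as part of the second, committed TX.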
(20)
> Of course, 1 and 2 require client-side changes, so for older clients, those
> approaches won’t apply.
For (1) I understand why a client change is necessary, but I am not sure why
we need a client change for (2). Can you elaborate? -- Later you explain
that we should send a DescribeTransactionRequest, but I am not sure why.
Can't we just do an implicit AddPartitionsToTxn, too? If the old
producer correctly registered the partition already, the TX-coordinator
can just ignore it, as it's an idempotent operation?
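What I mean by "idempotent" is roughly the following. This is a sketch under my own assumptions: `ToyTxnCoordinator` and `add_partitions` are hypothetical names, and the real TX-coordinator tracks much more state (epochs, transaction status, timeouts) than a set per transactional id.

```python
class ToyTxnCoordinator:
    """Hypothetical stand-in that only tracks registered partitions."""

    def __init__(self):
        # transactional_id -> set of (topic, partition) tuples
        self.partitions = {}

    def add_partitions(self, transactional_id, partitions):
        """Register partitions and return only the ones that were new."""
        registered = self.partitions.setdefault(transactional_id, set())
        newly_added = set(partitions) - registered
        registered |= newly_added
        return newly_added


coordinator = ToyTxnCoordinator()
first = coordinator.add_partitions("tx-1", [("topic", 0)])
# A second, implicit registration of the same partition changes nothing.
second = coordinator.add_partitions("tx-1", [("topic", 0)])
```

The sketch only covers the case where the TX is already ongoing and the partition is already registered. It says nothing about what an implicit add should do when no TX is ongoing.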
(30)
> To cover older clients, we will ensure a transaction is ongoing before we write
> to a transaction
Not sure what you mean by this? Can you elaborate?
(40)
> [the TX-coordinator] will write the prepare commit message with a bumped epoch
> and send WriteTxnMarkerRequests with the bumped epoch.
Why do we use the bumped epoch for both? It seems more intuitive to use
the current epoch, and only return the bumped epoch to the producer?
(50) "Implicit AddPartitionToTransaction"
Why does the implicitly sent request need to be synchronous? The KIP
also says
> in case we need to abort and need to know which partitions
What do you mean by this?
> we don’t want to write to it before we store in the transaction manager
Do you mean TX-coordinator instead of "manager"?
(60)
For older clients and ensuring that the TX is ongoing, you describe a
race condition. I am not sure if I can follow here. Can you elaborate?
-Matthias
On 11/18/22 1:21 PM, Justine Olshan wrote:
Hey all!
I'd like to start a discussion on my proposal to add some server-side
checks on transactions to avoid hanging transactions. I know this has been
an issue for some time, so I really hope this KIP will be helpful for many
users of EOS.
The KIP includes changes that will be compatible with old clients and
changes to improve performance and correctness on new clients.
Please take a look and leave any comments you may have!
KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense
JIRA: https://issues.apache.org/jira/browse/KAFKA-14402
Thanks!
Justine