Thanks for the details, Justine!
20)
The client side change for 2 is removing the addPartitions to transaction call. We don't need to make this from the producer to the txn coordinator, only server side.
I think I did not express myself clearly. I understand that we can (and should) change the producer to not send the `addPartitions` request any longer. But I don't think it's a requirement to change the broker?
What I am trying to say is: as a safe-guard and improvement for older producers, the partition leader can just send the `addPartitions` request to the TX-coordinator in any case -- if the old producer correctly did send the `addPartitions` request to the TX-coordinator already, the TX-coordinator can just "ignore" it as idempotent. However, if the old producer has a bug and forgot to send the `addPartitions` request, we would now ensure that the partition is indeed added to the TX and thus fix a potential producer bug (even if we don't get the fencing via the bump epoch). -- It seems to be a good improvement? Or is there a reason to not do this?
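To make the idempotency argument concrete, here is a minimal sketch of the coordinator-side handling I have in mind (class and method names are hypothetical, not actual broker code):

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the TX-coordinator's partition set for one ongoing
// transaction. Because adding is idempotent, a "redundant" add forwarded by
// the partition leader is a no-op for correct old producers, while for a
// buggy producer that never sent addPartitions it repairs the transaction.
class TxnMetadataSketch {
    private final Set<String> partitionsInTxn = new HashSet<>();

    synchronized void addPartitionToTxn(String topicPartition) {
        // Set.add() returns false if the partition was already registered;
        // either way the partition ends up in the transaction exactly once.
        partitionsInTxn.add(topicPartition);
    }
}

So the leader-side forward would cost nothing at the coordinator beyond the request itself.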
30)
Transaction is ongoing = partition was added to transaction via addPartitionsToTxn. We check this with the DescribeTransactions call. Let me know if this wasn't sufficiently explained here:
If we do what I propose in (20), we don't really need to make this `DescribeTransaction` call, as the partition leader adds the partition for older clients and we get this check for free.
40)
The idea here is that if any messages somehow come in before we get the new epoch to the producer, they will be fenced. However, if we don't think this is necessary, it can be discussed.
I agree that we should have epoch fencing. My question is different:
Assume we are at epoch 73, and we have an ongoing transaction that is committed. It seems natural to write the "prepare commit" marker and the `WriteTxMarkerRequest` both with epoch 73, too, as it belongs to the current transaction. Of course, we now also bump the epoch and expect the next requests to have epoch 74, and would reject a request with epoch 73, as the corresponding TX for epoch 73 was already committed.
It seems you propose to write the "prepare commit marker" and `WriteTxMarkerRequest` with epoch 74 though, which would work, but it seems confusing. Is there a reason why we would use the bumped epoch 74 instead of the current epoch 73?
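To make my question concrete, here are the two variants as I understand them, as a tiny runnable sketch (the write methods are stand-ins, not real coordinator APIs):

import java.util.function.IntConsumer;

public class EpochChoiceSketch {
    public static void main(String[] args) {
        // Stand-ins for the coordinator writing the prepare-commit record
        // and sending WriteTxnMarkerRequests.
        IntConsumer writePrepareCommit = e -> System.out.println("prepare commit @ epoch " + e);
        IntConsumer sendTxnMarkers = e -> System.out.println("txn markers @ epoch " + e);

        int currentEpoch = 73;

        // Variant A (what seems natural to me): complete the transaction
        // with the epoch it ran under, and bump only for what follows.
        writePrepareCommit.accept(currentEpoch);  // epoch 73
        sendTxnMarkers.accept(currentEpoch);      // epoch 73
        int nextExpectedEpoch = currentEpoch + 1;
        System.out.println("producer must use epoch " + nextExpectedEpoch + " from now on");

        // Variant B (what the KIP seems to propose): write both with the
        // already-bumped epoch.
        writePrepareCommit.accept(currentEpoch + 1); // epoch 74
        sendTxnMarkers.accept(currentEpoch + 1);     // epoch 74
    }
}

Both variants fence epoch-73 requests after the commit; the difference is only which epoch gets recorded in the markers.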
60)
When we are checking if the transaction is ongoing, we need to make a round trip from the leader partition to the transaction coordinator. In the time we are waiting for this message to come back, in theory we could have sent a commit/abort call that would make the original result of the check out of date. That is why we can check the leader state before we write to the log.
Thanks. Got it.
However, is this really an issue? We put the produce request in purgatory, so how could we process the `WriteTxnMarkerRequest` first? Don't we need to put the `WriteTxnMarkerRequest` into purgatory, too, for this case, and process both requests in order? (Again, my broker knowledge is limited and maybe we don't maintain request order for this case, which seems to be an issue IMHO, and I am wondering if changing request handling to preserve order for this case might be the cleaner solution?)
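For reference, this is the kind of last-moment re-check I understand the KIP to propose, as a minimal sketch (lock and field names are made up): even if the DescribeTransactions round trip said "ongoing", the leader validates its local state again right before appending, so a WriteTxnMarkerRequest processed in the meantime is caught.

import java.util.concurrent.locks.ReentrantLock;

class LeaderStateRecheckSketch {
    private final ReentrantLock leaderLock = new ReentrantLock();
    private boolean txnOngoing; // flipped off when a txn marker is written

    // Called when the delayed produce request is ready to append.
    boolean tryAppend(Runnable appendToLog) {
        leaderLock.lock();
        try {
            if (!txnOngoing) {
                return false; // txn completed while we waited; reject the write
            }
            appendToLog.run();
            return true;
        } finally {
            leaderLock.unlock();
        }
    }

    // Called from WriteTxnMarkerRequest handling, under the same lock.
    void onTxnMarker() {
        leaderLock.lock();
        try {
            txnOngoing = false;
        } finally {
            leaderLock.unlock();
        }
    }
}

If marker handling and the final append both take the leader lock, the pair is serialized even though the purgatory wait itself is not ordered.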
-Matthias
On 11/30/22 3:28 PM, Artem Livshits wrote:
Hi Justine,
I think the interesting part is not in this logic (because it tries to figure out when UNKNOWN_PRODUCER_ID is retriable, and if it's retriable, it's definitely not fatal), but what happens when this logic doesn't return 'true' and falls through. In the old clients it seems to be fatal, and if we keep the behavior in the new clients, I'd expect it would be fatal as well.
-Artem
On Tue, Nov 29, 2022 at 11:57 AM Justine Olshan
<jols...@confluent.io.invalid> wrote:
Hi Artem and Jeff,
Thanks for taking a look and sorry for the slow response.
You both mentioned the change to handle UNKNOWN_PRODUCER_ID errors. To be clear — this error code will only be sent again when the client's request version is high enough to ensure we handle it correctly.
The current (Java) client handles this by the following (somewhat long) code snippet:
// An UNKNOWN_PRODUCER_ID means that we have lost the producer state on the broker. Depending on the log start
// offset, we may want to retry these, as described for each case below. If none of those apply, then for the
// idempotent producer, we will locally bump the epoch and reset the sequence numbers of in-flight batches from
// sequence 0, then retry the failed batch, which should now succeed. For the transactional producer, allow the
// batch to fail. When processing the failed batch, we will transition to an abortable error and set a flag
// indicating that we need to bump the epoch (if supported by the broker).
if (error == Errors.UNKNOWN_PRODUCER_ID) {
    if (response.logStartOffset == -1) {
        // We don't know the log start offset with this response. We should just retry the request until we get it.
        // The UNKNOWN_PRODUCER_ID error code was added along with the new ProduceResponse which includes the
        // logStartOffset. So the '-1' sentinel is not for backward compatibility. Instead, it is possible for
        // a broker to not know the logStartOffset at when it is returning the response because the partition
        // may have moved away from the broker from the time the error was initially raised to the time the
        // response was being constructed. In these cases, we should just retry the request: we are guaranteed
        // to eventually get a logStartOffset once things settle down.
        return true;
    }

    if (batch.sequenceHasBeenReset()) {
        // When the first inflight batch fails due to the truncation case, then the sequences of all the other
        // in flight batches would have been restarted from the beginning. However, when those responses
        // come back from the broker, they would also come with an UNKNOWN_PRODUCER_ID error. In this case, we should not
        // reset the sequence numbers to the beginning.
        return true;
    } else if (lastAckedOffset(batch.topicPartition).orElse(NO_LAST_ACKED_SEQUENCE_NUMBER) < response.logStartOffset) {
        // The head of the log has been removed, probably due to the retention time elapsing. In this case,
        // we expect to lose the producer state. For the transactional producer, reset the sequences of all
        // inflight batches to be from the beginning and retry them, so that the transaction does not need to
        // be aborted. For the idempotent producer, bump the epoch to avoid reusing (sequence, epoch) pairs
        if (isTransactional()) {
            txnPartitionMap.startSequencesAtBeginning(batch.topicPartition, this.producerIdAndEpoch);
        } else {
            requestEpochBumpForPartition(batch.topicPartition);
        }
        return true;
    }

    if (!isTransactional()) {
        // For the idempotent producer, always retry UNKNOWN_PRODUCER_ID errors. If the batch has the current
        // producer ID and epoch, request a bump of the epoch. Otherwise just retry the produce.
        requestEpochBumpForPartition(batch.topicPartition);
        return true;
    }
}
I was considering keeping this behavior — but am open to simplifying it.
We are leaving changes to older clients off the table here since it caused many issues for clients in the past. Previously this was a fatal error and we didn't have the mechanisms in place to detect when this was a legitimate case vs some bug or gap in the protocol. Ensuring each transaction has its own epoch should close this gap.
And to address Jeff's second point:
does the typical produce request path append records to local log along with the currentTxnFirstOffset information? I would like to understand when the field is written to disk.
Yes, the first produce request populates this field and writes the offset as part of the record batch and also to the producer state snapshot. When we reload the records on restart and/or reassignment, we repopulate this field with the snapshot from disk along with the rest of the producer state.
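For readers following along, my mental model of that lifecycle as a sketch (field and method names are illustrative, not the actual ProducerStateManager code):

import java.util.OptionalLong;

class ProducerStateEntrySketch {
    private OptionalLong currentTxnFirstOffset = OptionalLong.empty();

    // Typical produce path: the first transactional append records where
    // the transaction starts in the log.
    void onTransactionalAppend(long baseOffset) {
        if (currentTxnFirstOffset.isEmpty()) {
            currentTxnFirstOffset = OptionalLong.of(baseOffset);
        }
    }

    // The field is persisted with the producer state snapshot...
    OptionalLong toSnapshot() {
        return currentTxnFirstOffset;
    }

    // ...and repopulated from the snapshot on restart or reassignment.
    void fromSnapshot(OptionalLong snapshotted) {
        currentTxnFirstOffset = snapshotted;
    }
}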
Let me know if there are further comments and/or questions.
Thanks,
Justine
On Tue, Nov 22, 2022 at 9:00 PM Jeff Kim <jeff....@confluent.io.invalid> wrote:
Hi Justine,
Thanks for the KIP! I have two questions:
1) For new clients, we can once again return an error UNKNOWN_PRODUCER_ID for sequences that are non-zero when there is no producer state present on the server. This will indicate we missed the 0 sequence and we don't yet want to write to the log.
I would like to understand the current behavior to handle older clients, and if there are any changes we are making. Maybe I'm missing something, but we would want to identify whether we missed the 0 sequence for older clients, no?
2) Upon returning from the transaction coordinator, we can set the transaction as ongoing on the leader by populating currentTxnFirstOffset through the typical produce request handling.
Does the typical produce request path append records to local log along with the currentTxnFirstOffset information? I would like to understand when the field is written to disk.
Thanks,
Jeff
On Tue, Nov 22, 2022 at 4:44 PM Artem Livshits
<alivsh...@confluent.io.invalid> wrote:
Hi Justine,
Thank you for the KIP. I have one question.
5) For new clients, we can once again return an error UNKNOWN_PRODUCER_ID
I believe we had problems in the past with returning UNKNOWN_PRODUCER_ID because it was considered fatal and required client restart. It would be good to spell out the new client behavior when it receives the error.
-Artem
On Tue, Nov 22, 2022 at 10:00 AM Justine Olshan
<jols...@confluent.io.invalid> wrote:
Thanks for taking a look, Matthias. I've tried to answer your questions below:
10)
Right — so the hanging transaction only occurs when we have a late message come in and the partition is never added to a transaction again. If we never add the partition to a transaction, we will never write a marker and never advance the LSO.
If we do end up adding the partition to the transaction (I suppose this can happen before or after the late message comes in) then we will include the late message in the next (incorrect) transaction.
So perhaps it is clearer to make the distinction between messages that eventually get added to the transaction (but the wrong one) or messages that never get added and become hanging.
20)
The client side change for 2 is removing the addPartitions to transaction call. We don't need to make this from the producer to the txn coordinator, only server side.
In my opinion, the issue with the addPartitionsToTxn call for older clients is that we don't have the epoch bump, so we don't know if the message belongs to the previous transaction or this one. We need to check if the partition has been added to this transaction. Of course, this means we won't completely cover the case where we have a really late message and we have added the partition to the new transaction, but that's unfortunately something we will need the new clients to cover.
30)
Transaction is ongoing = partition was added to transaction via addPartitionsToTxn. We check this with the DescribeTransactions call. Let me know if this wasn't sufficiently explained here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense#KIP890:TransactionsServerSideDefense-EnsureOngoingTransactionforOlderClients(3)
40)
The idea here is that if any messages somehow come in before we get the new epoch to the producer, they will be fenced. However, if we don't think this is necessary, it can be discussed.
50)
It should be synchronous because if we have an event (ie, an error) that causes us to need to abort the transaction, we need to know which partitions to send transaction markers to. We know the partitions because we added them to the coordinator via the addPartitionsToTxn call. We have had asynchronous calls in the past (ie, writing the commit markers when the transaction is completed) but often this just causes confusion as we need to wait for some operations to complete. In the writing commit markers case, clients often see CONCURRENT_TRANSACTIONS error messages and that can be confusing. For that reason, it may be simpler to just have synchronous calls — especially if we need to block on some operation's completion anyway before we can start the next transaction.
And yes, I meant coordinator. I will fix that.
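To illustrate why the synchronous order matters here, a small sketch (names hypothetical): the partition must be registered at the coordinator before the write proceeds, so that an abort always knows the full partition set.

import java.util.HashSet;
import java.util.Set;

class SyncAddSketch {
    private final Set<String> partitionsKnownToCoordinator = new HashSet<>();

    void produceTransactional(String partition, Runnable appendToLog) {
        // Step 1 (synchronous): make sure the coordinator knows about the
        // partition. If this fails, the append never happens.
        addPartitionToTxnAtCoordinator(partition);
        // Step 2: only now append; an abort after this point will include
        // this partition when markers are written.
        appendToLog.run();
    }

    private void addPartitionToTxnAtCoordinator(String partition) {
        partitionsKnownToCoordinator.add(partition); // stand-in for the RPC
    }

    void abortTransaction() {
        // Markers go to every partition the coordinator knows about; nothing
        // was appended to a partition it doesn't know.
        for (String p : partitionsKnownToCoordinator) {
            System.out.println("write ABORT marker to " + p);
        }
    }
}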
60)
When we are checking if the transaction is ongoing, we need to make a round trip from the leader partition to the transaction coordinator. In the time we are waiting for this message to come back, in theory we could have sent a commit/abort call that would make the original result of the check out of date. That is why we can check the leader state before we write to the log.
I'm happy to update the KIP if some of these things were not clear.
Thanks,
Justine
On Mon, Nov 21, 2022 at 7:11 PM Matthias J. Sax <mj...@apache.org>
wrote:
Thanks for the KIP.
Couple of clarification questions (I am not a broker expert, so maybe some questions are obvious for others, but not for me with my lack of broker knowledge).
(10)
The delayed message case can also violate EOS if the delayed message comes in after the next addPartitionsToTxn request comes in. Effectively we may see a message from a previous (aborted) transaction become part of the next transaction.
What happens if the message comes in before the next addPartitionsToTxn request? It seems the broker hosting the data partitions won't know anything about it and append it to the partition, too? What is the difference between both cases?
Also, it seems a TX would only hang if there is no following TX that is either committed or aborted? Thus, for the case above, the TX might actually not hang (of course, we might get an EOS violation if the first TX was aborted and the second committed, or the other way around).
(20)
Of course, 1 and 2 require client-side changes, so for older clients, those approaches won’t apply.
For (1) I understand why a client change is necessary, but not sure why we need a client change for (2). Can you elaborate? -- Later you explain that we should send a DescribeTransactionRequest, but I am not sure why? Can't we just do an implicit AddPartitionToTxn, too? If the old producer correctly registered the partition already, the TX-coordinator can just ignore it as it's an idempotent operation?
(30)
To cover older clients, we will ensure a transaction is ongoing before we write to a transaction
Not sure what you mean by this? Can you elaborate?
(40)
[the TX-coordinator] will write the prepare commit message with a bumped epoch and send WriteTxnMarkerRequests with the bumped epoch.
Why do we use the bumped epoch for both? It seems more intuitive to use the current epoch, and only return the bumped epoch to the producer?
(50) "Implicit AddPartitionToTransaction"
Why does the implicitly sent request need to be synchronous? The KIP also says
in case we need to abort and need to know which partitions
What do you mean by this?
we don’t want to write to it before we store in the transaction manager
Do you mean TX-coordinator instead of "manager"?
(60)
For older clients and ensuring that the TX is ongoing, you describe a race condition. I am not sure if I can follow here. Can you elaborate?
-Matthias
On 11/18/22 1:21 PM, Justine Olshan wrote:
Hey all!
I'd like to start a discussion on my proposal to add some server-side checks on transactions to avoid hanging transactions. I know this has been an issue for some time, so I really hope this KIP will be helpful for many users of EOS.
The KIP includes changes that will be compatible with old clients and changes to improve performance and correctness on new clients.
Please take a look and leave any comments you may have!
KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense
JIRA: https://issues.apache.org/jira/browse/KAFKA-14402
Thanks!
Justine