Hey,

just catching up on the latest discussion, and I think we are going in circles? And maybe over-complicating things a little bit too much?


My take on the KIP is as follows:

 - As a minimum, we want to solve the client-side RecordTooLargeException

- If we can tackle other client-side errors "for free", why not include them? I am ok with either approach: list all errors we cover, or list all errors we want to exclude. If we cannot reach a quick consensus on this question, I would pragmatically limit the scope to RecordTooLargeException, just to unblock the KIP and to address an issue users have actually hit (we can also update the enum to express this clearly, following a "list all errors we cover explicitly" approach) -- we can do follow-up KIPs for other errors on an on-demand basis and fix-forward / enlarge the scope successively.

(The question of what "poison pill" means might not have a precise answer, but a single too-large record is for sure a poison pill...)


 - The KIP is pretty clear that it covers "send" errors, not TX errors

- The KIP was a little unspecific about client- vs broker-side errors, but Alieh clearly expressed on the DISCUSS thread that she intends to only address client-side errors, so I would propose to just stick with this for now (if we want to cover broker-side errors, we can always do a follow-up KIP)

- Chris, you mentioned that you don't want us to get into a corner that we cannot get out of with a follow-up KIP. This is basically always a risk that I think we just need to accept. If Alieh wants to limit the scope to client-side errors, it would be a good incremental step forward, and I think we should take it. Or is there any justification why we _need_ to expand the scope to broker-side errors? (I agree it would be nice to include them, but I don't see it as a must, and it seems it raises too many open questions, preventing us from making progress...)

Btw: I also find the argument "it's much more difficult to implement" totally valid for limiting the scope of some work... we do this ALL the time...


- The proposed API makes sense (as I already expressed previously); using a `send()` overload with an expressive enum name and clear documentation makes the semantics and usage pattern easy to understand.
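
To make the usage pattern concrete, here is a minimal sketch (assuming the enum from the POC PR is called `TxnSendOption` with an `IGNORE_SEND_ERRORS` constant; the exact names and signature are not final):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;

    public class IgnoreSendErrorsSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("transactional.id", "tx-1");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                producer.beginTransaction();
                try {
                    // With the hypothetical option, a client-side
                    // RecordTooLargeException for this record no longer moves
                    // the TX into the error state; the record is dropped and
                    // the rest of the batch can still be committed.
                    producer.send(new ProducerRecord<>("topic", "key", "huge-value"),
                                  TxnSendOption.IGNORE_SEND_ERRORS);
                    producer.commitTransaction();
                } catch (KafkaException e) {
                    producer.abortTransaction();
                }
            }
        }
    }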



-Matthias



On 7/8/24 12:44 PM, Alieh Saeedi wrote:
Hey Chris


Thanks for the comprehensive answer.


I don’t believe that things were better/worse before/after the fix of X/Y.
I strongly believe that things get better and better when we intend to
improve our current solutions. I am happy that experts, including you, are
so cautious and, at the same time, so helpful with the changes and
improvements.


Regarding considering broker-side errors: As you mentioned, it’s a bit
difficult since we must overload both `send` and `commitTnx`. It won’t be
doable only in `commitTnx` (see this
<https://lists.apache.org/thread/zgdoxx93gj410zhh1wso5o5w44vy9wyg> please).
Also, the broker and the producer have different configurations and
consequently different constraints for a too-large record. Long story
short, the conclusion is what you mentioned: it will be harder to
implement and, in my opinion, even harder to clarify for the user (the
user must set `ignore-send-errors` in both methods, otherwise we throw
exceptions; or if they use it only in `send` we do action X, and if they
use it only in `commitTnx` we do action Y, and so on).


Regarding clarifying the new `send` for users: as you suggested before, we
must use a reasonable heuristic that would cover the errors (without
explicitly listing them). We can say something like: “setting
`ignore-send-errors` prevents failing a transaction because of a single
poison pill record that violates producer configs. The faulty record is
omitted from the batch and the transaction is committed with the rest of
the batch. The broker can still reject committing a transaction due to its
own rules/constraints.”

The main idea (as you called it, a heuristic) must be built around the
notion of records that never make it into the batch. Also, explicitly
mentioning `Producer` configs helps avoid the assumption that broker-side
errors are covered. This is for too-large records; of course, we can
re-word it and add the other cases as well to make it more accurate.
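
To be concrete, the Javadoc could look something like this (just a sketch; the wording is adapted from the above, and the signature and option name are the proposed ones, not final):

    /**
     * Sketch only; signature and option name are the proposed ones, not final.
     *
     * Setting {@code IGNORE_SEND_ERRORS} prevents failing a transaction
     * because of a single poison pill record that violates *producer* configs
     * (e.g. {@code max.request.size}). The faulty record is omitted from the
     * batch and the transaction is committed with the rest of the batch. The
     * broker can still reject the transaction due to its own rules/constraints
     * (e.g. {@code message.max.bytes}).
     */
    Future<RecordMetadata> send(ProducerRecord<K, V> record, TxnSendOption option);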


About the current votes: I assume that both the new PR
<https://github.com/apache/kafka/pull/16465> (changes in `send`) and the
former PR <https://github.com/apache/kafka/pull/16332> (changes in
`commitTnx`) do not consider broker-side errors. But of course this was a
main concern of all experts (specifically Andrew) since KIP-1038, and
through the discussions we came to this agreement. BTW, this is my
opinion/assumption and might be wrong.


Cheers,

Alieh

On Mon, Jul 8, 2024 at 5:38 PM Chris Egerton <chr...@aiven.io.invalid> wrote:

Hi Alieh,

Can you clarify why broker-side errors shouldn't be covered? The only real
rationale I can come up with is that it's easier to implement.

"Things were better for Kafka Streams before KAFKA-9279 was fixed" isn't
very convincing, because Kafka Streams is not the only user of the Java
producer client. And for others, especially new users, I doubt that this
new API we're proposing would make sense without having to consult a lot of
historical context.

I also don't think that most users will know or even care about the
distinction between errors that cause a record to fail before it's added to
a batch vs. after. If you were writing a producer application of your own,
and you wanted to handle RecordTooLargeException instances by dropping a
record without aborting a transaction, would you care about whether it was
your client or your broker that balked? Would you be happy if you wrote
logic expecting that that problem was solved once and for all, only to
learn that it could still affect you in other circumstances? Or,
alternatively, would you be happy if you wanted to solve that problem and
found an API that seemed to do exactly what you wanted, but after reading
the fine print, realized you'd have to do it yourself instead?

Ultimately, the more I think about this, the more I believe that we're
adding noise to the API (with the new overloaded variant of send) for a
feature that will likely bring confusion and even frustration to anyone
besides maintainers of Kafka Streams who tries to use it.

If the only concern about covering broker-side errors is that it would be
more difficult to implement, I believe we should strongly reconsider that
alternative. That said, if there is a straightforward way to explain this
feature to new users that won't mislead them or require them to do research
on producer internals, then I can still live with it.

Regarding a list of recoverable vs. irrecoverable errors, this is actually
the subject of another recently-introduced KIP:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1050%3A+Consistent+error+handling+for+Transactions

Finally, I'd also like to ask the people who have already voted (Andrew,
Matthias) if, at the time they voted, they believed that the API would
handle all errors, or only the subset of errors that would cause a record
to be rejected from a batch before it can be sent to a broker.

Best,

Chris

On Thu, Jul 4, 2024 at 12:43 PM Alieh Saeedi <asae...@confluent.io.invalid> wrote:

Salut from the KIP’s author


Clarifying two points:


1) broker-side errors:

As far as I remember we are not going to cover the errors originating from the broker!

A historical fact: One of the debate points in KIP-1038 was that by
defining a producer custom handler, the user may assume that broker-side
errors must be covered as well. They may define a handler for handling
`RecordTooLargeException` and still see such errors not being handled as
they wish.


2) Regarding irrecoverable/recoverable errors:

Before the fix of `KAFKA-9279`, errors such as `RecordTooLargeException` or errors related to missing metadata (both originating from Producer `send()`) were considered recoverable, but after that fix they became irrecoverable, without any Javadoc change or KIP. All the effort made in this KIP and the former one has been towards returning to the former state.


I am sure it is clear to you which sort of errors we are going to cover: a single record may happen to NOT get added to the batch due to issues with the record or its corresponding topic. The point was that if the record is not added to the batch, let's not fail the whole batch because of that non-existing record. We never intended to do anything on the broker side or to ignore more important errors. But I agree with you, Chris: if we are adding a new API, we must have good documentation for it. The sentence `all irrecoverable transactional errors will still be fatal` that you suggested is good. What do you think? I am totally against enumerating errors in Javadocs since this sort of error can change over time. Moreover, have you ever seen any list of recoverable or irrecoverable errors anywhere so far?


Bests,

Alieh

On Wed, Jul 3, 2024 at 6:07 PM Chris Egerton <chr...@aiven.io.invalid> wrote:

Hi Justine,

I agree that enumerating a list of errors that should be covered by the KIP is difficult; I was thinking it might be easier if we list the errors that should _not_ be covered by the KIP, and only if we can't define a reasonable heuristic that would cover them without having to explicitly list them. Could it be enough to say "all irrecoverable transactional errors will still be fatal", or even just "all transactional errors (as opposed to errors related to this specific record) will still be fatal"?

Cheers,

Chris

On Wed, Jul 3, 2024 at 11:56 AM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey Chris,

I think what you say makes sense. I agree that defining the behavior based on code that can possibly change is not a good idea, and I was trying to get a clearer definition from the KIP's author :)

I think it can always be hard to ensure that only specific errors are handled unless they are explicitly enumerated in code, as the code can change and can be changed by folks who are not aware of this KIP or conversation. I personally don't have the bandwidth to do this definition/enumeration of errors, so hopefully Alieh can expand upon this.

Justine

On Wed, Jul 3, 2024 at 8:28 AM Chris Egerton <chr...@aiven.io.invalid> wrote:

Hi Alieh,

I don't love defining the changes for this KIP in terms of a catch clause in the KafkaProducer class, for two reasons. First, the set of errors that are handled by that clause may shift over time as the code base is modified, and second, it would be fairly opaque to users who want to understand whether an error would be affected by using this API or not.

It also seems strange that we'd handle some types of RecordTooLargeException (i.e., ones reported client-side) with this API, but not others (i.e., ones reported by a broker).

I think this kind of API would be most powerful, most intuitive to users, and easiest to document if we expanded the scope to all record-send-related errors, except anything indicating issues with exactly-once semantics. That would include records that are too large (when caught both client- and server-side), records that can't be sent due to authorization failures, records sent to nonexistent topics/topic partitions, and keyless records sent to compacted topics. It would not include ProducerFencedException, InvalidProducerEpochException, UnsupportedVersionException, and possibly others.
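
Expressed as concrete exception types, one illustrative reading of that split could look like the sketch below (not an agreed-upon list; all classes referenced exist in org.apache.kafka.common.errors, and the null-key-on-compacted-topic case would surface as a broker-side invalid-record error):

    import java.util.Set;
    import org.apache.kafka.common.errors.InvalidProducerEpochException;
    import org.apache.kafka.common.errors.ProducerFencedException;
    import org.apache.kafka.common.errors.RecordTooLargeException;
    import org.apache.kafka.common.errors.TopicAuthorizationException;
    import org.apache.kafka.common.errors.UnknownTopicOrPartitionException;
    import org.apache.kafka.common.errors.UnsupportedVersionException;

    final class SendErrorScopeSketch {
        // Record-level errors a user could plausibly opt to drop:
        static final Set<Class<? extends Exception>> RECORD_LEVEL = Set.of(
            RecordTooLargeException.class,           // client- or broker-side
            TopicAuthorizationException.class,       // missing ACLs
            UnknownTopicOrPartitionException.class); // nonexistent topic/partition

        // Errors about the transaction/session itself, which stay fatal:
        static final Set<Class<? extends Exception>> ALWAYS_FATAL = Set.of(
            ProducerFencedException.class,
            InvalidProducerEpochException.class,
            UnsupportedVersionException.class);
    }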

@Justine -- do you think it would be possible to develop either a better definition for the kinds of "excluded" errors that should not be covered by this API, or, barring that, a comprehensive list of exact error types? And do you think this would be acceptable in terms of risk and complexity?

Cheers,

Chris

On Tue, Jul 2, 2024 at 5:05 PM Alieh Saeedi <asae...@confluent.io.invalid> wrote:

Hey Justine,

About the consequences: the consequences will be like when we did not have the fix made in `KAFKA-9279`: silent loss of data! Obviously, when the user intentionally chose to ignore errors, that would not be silent any more. Right?

Of course, considering all types of `ApiException`s would be too broad. But are the exceptions caught in `catch(ApiException e)` of the `doSend()` method also too broad?
-Alieh

On Tue, Jul 2, 2024 at 9:45 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey Alieh,

If we want to allow any error to be ignored we should probably run through all the errors to make sure they make sense. I just want to feel confident that we aren't just making a decision without considering the consequences carefully.

Justine

On Tue, Jul 2, 2024 at 12:30 PM Alieh Saeedi <asae...@confluent.io.invalid> wrote:

Hey Justine,

yes we talked about `RecordTooLargeException` as an example, but did we ever limit ourselves to only this specific exception? I think neither in the KIP nor in the PR. As Chris mentioned, this KIP is going to undo what we have done in `KAFKA-9279` in case 1) the user is in a transaction and 2) he decides to ignore the errors in which the record was not even added to the batch. Yes, and we suggested some methods for undoing or, in fact, moving back the transaction from the error state in `flush` or in `commitTnx`, and we finally came to the idea of not even doing the changes (better than undoing) in `send`.

Bests,
Alieh

On Tue, Jul 2, 2024 at 8:03 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey folks,

I understand where you are coming from by asking for specific use cases. My understanding based on previous conversations was that there were a few different errors that have been seen. One example I heard some information about was when the record was too large and it fails the batch. Besides that, I'm not really sure if there are cases in mind, though it is fair to ask on those and bring them up.

Does a record qualify as a poison pill if it targets a topic that doesn't exist? Or if it targets a topic that the producer principal lacks ACLs for? What if it fails broker-side validation (e.g., has a null key for a compacted topic)?

I think there was some parallel work with addressing the UnknownTopicOrPartitionError in another way. As for the other checks, acls, validation etc. I am not aware of that being in Alieh's scope, but we should be clear about exactly what we are doing.

All errors that fall into ApiException seems too broad to me.

Justine

On Tue, Jul 2, 2024 at 10:51 AM Alieh Saeedi <asae...@confluent.io.invalid> wrote:

Hey Chris,
thanks for sharing your concerns.

1) About the language of the KIP (or maybe later in Javadocs): Is it alright if I write all errors that fall into the `ApiException` category thrown (actually returned) by the Producer?
2) About future expansion: do you have any better suggestions for enum names? Do you think `IGNORE_API_EXCEPTIONS` or something like that is a "better/more accurate" one?

Bests,
Alieh

On Tue, Jul 2, 2024 at 7:29 PM Chris Egerton <chr...@aiven.io.invalid> wrote:

Hi Alieh and Justine,

I'm concerned that we're settling on a definition of "poison pill" that's easiest to tackle right now but may lead to shortcomings down the road. I understand the relationship between this KIP and KAFKA-9279, and I can totally get behind the desire to keep things small, focused, and simple in the name of avoiding bugs. However, what I don't think is clear at all is what the "specific circumstances" are that Justine mentioned. I had a drastically different idea of what the intended behavioral change would be before looking at the draft PR.

I would like 1) for us to be clearer about the categories of errors that we want to cover with this new API (especially since we'll have to find a clear, succinct way to document this for users), and 2) to make sure that if we do try to expand this API in the future, that we won't be painted into a corner.

For item 1, hopefully we can agree that the language in the KIP for IGNORE_SEND_ERRORS ("The records causing irrecoverable errors are excluded from the batch and the transaction is committed successfully.") is pretty vague. If we start using the phrase "poison pill record" that could help, but IMO more detail would still be needed. We know that we want to include records that are so large that they can be immediately rejected by the producer. But there are other cases that users might expect to be handled. Does a record qualify as a poison pill if it targets a topic that doesn't exist? Or if it targets a topic that the producer principal lacks ACLs for? What if it fails broker-side validation (e.g., has a null key for a compacted topic)?

For item 2, this really depends on how narrow the scope of what we're doing right now is. If we only handle a subset of the examples I laid out above that could possibly be considered poison pills with this KIP, do we want to lock ourselves in to never addressing more in the future, or can we choose an API (probably just enum names would be the only important decision here) that leaves room for more later?

Best,

Chris



On Tue, Jul 2, 2024 at 12:28 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Chris and Alieh,

My understanding is that this KIP is really only trying to solve an issue of a "poison pill" record that fails send(). We've talked a lot about having a generic framework for all errors, but I don't think that is what this KIP is trying to do. Essentially the request is to undo the change from KAFKA-9279 <https://issues.apache.org/jira/browse/KAFKA-9279> but under specific circumstances that are controlled. I really am concerned about opening new avenues for bugs with EOS and hesitate to handle any other types of errors. I think if we all agree on the problem that we are trying to solve, it is easier to agree on solutions.

Justine

On Mon, Jul 1, 2024 at 2:20 AM Alieh Saeedi <asae...@confluent.io.invalid> wrote:

Hi Matthias,
Thanks for the valid points you mentioned. I updated the KIP and the PR with:
1) mentioning that the new overloaded `send` throws an `IllegalStateException` if the user tries to ignore `send()` errors outside of a transaction.
2) the default implementation in the `Producer` interface throws an `UnsupportedOperationException`.
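
In code, the two points roughly look like this (a self-contained toy model only; the real signatures in the PR may differ):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Future;

    public class SendOverloadSketch {
        enum TxnSendOption { NONE, IGNORE_SEND_ERRORS }

        interface MiniProducer {
            // Point 2: the default implementation throws, so existing Producer
            // implementations outside of Kafka keep compiling unchanged.
            default Future<String> send(String record, TxnSendOption option) {
                throw new UnsupportedOperationException();
            }
        }

        static class MiniKafkaProducer implements MiniProducer {
            private boolean inTransaction = false;

            void beginTransaction() { inTransaction = true; }

            @Override
            public Future<String> send(String record, TxnSendOption option) {
                // Point 1: ignoring send errors is only legal inside a TX.
                if (option == TxnSendOption.IGNORE_SEND_ERRORS && !inTransaction) {
                    throw new IllegalStateException(
                        "IGNORE_SEND_ERRORS requires an active transaction");
                }
                return CompletableFuture.completedFuture(record);
            }
        }
    }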

Hi Chris,
Thanks for the feedback. I tried to clarify the points you listed:

-------> we've narrowed the scope from any error that might take place with producing a record to Kafka, to only the ones that are thrown directly from Producer::send;

From the very beginning and even since KIP-1038, the main purpose was to have "more flexibility," or, in other words, "giving the user the authority" to handle some specific exceptions thrown from the `Producer`. Due to the specific cases we had in mind, KIP-1038 was discarded and we decided to not define a `CustomExceptionHandler` for `Producer` and instead treat the `send` failures in a different way. The main issue is that `send` makes a transition to the error state, which then has to be undone. In fact, one single poison pill record makes the whole batch fail. The former suggestions that you agreed with have been all about un-doing this transition in `flush` or `commit`. The new suggestion is to un-do (or better, NOT do) it in `send`, due to the reasons listed in the discussions above.

Moreover, I would say that having such a large scope as you mentioned is impossible. In the best case, we may have control over the `Producer`. What shall we do with the broker? The `any error that might take place with producing a record to Kafka` is too much, I think.

-------> is this all we want to handle, and will it prevent us from handling more in the future in an intuitive way?

I think yes. This is all we want. Other sorts of errors, such as problems with partition addition, producer fenced exceptions, etc., seem to be more serious issues. The intention was to handle problems created by (maybe) a single poison pill record. BTW, I do not see any obstacles to future changes.

Bests,
Alieh

On Sat, Jun 29, 2024 at 3:03 AM Chris Egerton <chr...@aiven.io.invalid> wrote:

Ah, sorry--spoke too soon. The PR doesn't show that errors thrown from Producer::send are handled, but instead, ApiException instances that are caught inside KafkaProducer::doSend and are handled by returning an already-failed future are. I think the same question still applies (is this all we want to handle, and will it prevent us from handling more in the future in an intuitive way), though.

On Fri, Jun 28, 2024 at 8:57 PM Chris Egerton <chr...@aiven.io> wrote:

Hi Alieh,

This KIP has evolved a lot since I last looked at it, but the changes seem well thought-out both in semantics and API. One clarifying question I have is that it looks based on the draft PR that we've narrowed the scope from any error that might take place with producing a record to Kafka, to only the ones that are thrown directly from Producer::send; is that the intended behavior here? And if so, do you have thoughts on how we might design a follow-up KIP that would catch all errors (including ones reported asynchronously instead of synchronously)? I'd like it if we could leave the door open for that without painting ourselves into too much of a corner with the API design for this KIP.

Cheers,

Chris

On Fri, Jun 28, 2024 at 6:31 PM Matthias J. Sax <mj...@apache.org> wrote:

Thanks Alieh,

it seems this KIP can just pick between a couple of tradeoffs. Adding an overloaded `send()` as the KIP proposes makes sense to me and seems to provide the cleanest solution compared to the other options we discussed.

Given the explicit name of the passed-in option that highlights that the option is for TX only, it is pretty clear and avoids the issue of `flush()` ambiguity.

Nit: We should make clear on the KIP though, that the new `send()` overload would throw an `IllegalStateException` if TX are not used (similar to other TX methods like initTx(), etc.)

About the `Producer` interface, I am not sure how this was done in the past (eg, KIP-266 added `Consumer.poll(Duration)` w/o a default implementation), if we need a default implementation for backward compatibility or not? If we do want to add one, I think it would be appropriate to throw an `UnsupportedOperationException` by default, instead of just keeping the default impl empty?

My points are rather minor, and should not block this KIP though. Overall LGTM.



-Matthias

On 6/27/24 1:28 PM, Alieh Saeedi wrote:
Hi Justine,

Thanks for the suggestion.
Making applications validate every single record is not the best way, from an efficiency point of view. Moreover, between changing the behavior of the Producer in `send` and `commitTnx`, the former seems more reasonable and clean.

Bests,
Alieh

On Thu, Jun 27, 2024 at 8:14 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey Alieh,

I see there are two options now. So folks will be discussing the approaches and deciding the best way forward before we vote?
I do think there could be a problem with the approach on commit if we get stuck on an earlier error and have more records (potentially on new partitions) to commit, as the current PR is implemented.

I guess this takes us back to the question of whether the error should be cleared on send.

(And I guess at the back of my mind, I'm wondering if there is a way we can validate the "poison pill" records application-side before we even try to send them)

Justine

On Wed, Jun 26, 2024 at 4:38 PM Alieh Saeedi <asae...@confluent.io.invalid> wrote:
wrote:

Hi Justine,

I did not update the KIP with `TxnSendOption` since I thought it'd be better discussed here beforehand.
Right now, there are 2 PRs:
- the PR that implements the current version of the KIP: https://github.com/apache/kafka/pull/16332
- the POC PR that clarifies the `TxnSendOption`: https://github.com/apache/kafka/pull/16465

Bests,
Alieh

On Thu, Jun 27, 2024 at 12:42 AM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey Alieh,

I think I am a little confused. Are the 3 points above addressed by the KIP or did something change? The PR seems to not include this change and still has the CommitOption as well.

Thanks,
Justine

On Wed, Jun 26, 2024 at 2:15 PM Alieh Saeedi <asae...@confluent.io.invalid> wrote:
wrote:

Hi all,

Looking at the PR <https://github.com/apache/kafka/pull/16332> corresponding to the KIP, there are some points worthy of mention:

1) Clearing the error ends up in dirty/messy code in `TransactionManager`.

2) By clearing the error, we are actually doing an illegal transition from `ABORTABLE_ERROR` to `IN_TRANSACTION`, which is conceptually not acceptable. This can be the root cause of some issues, with perhaps further future changes by others.

3) If the poison pill record `r1` causes a transition to the error state and then the next record `r2` requires adding a partition to the transaction, the action fails due to being in the error state. In this case, clearing errors during `commitTnx(CLEAR_SEND_ERROR)` is too late. However, this case can NOT be the main concern as soon as KIP-890 is fully implemented.

My suggestion is to solve the problem where it arises. If the transition to the error state does not happen during `send()`, we do not need to clear the error later. Therefore, instead of `CommitOption`, we can define a `TxnSendOption` and prevent the `send()` method from going to the error state in case 1) we're in a transaction and 2) the user asked for IGNORE_SEND_ERRORS. For more clarity, you can take a look at the POC PR <https://github.com/apache/kafka/pull/16465>.
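
As a toy model of the difference (the states mirror the ones in `TransactionManager`, but this is only an illustration, not the actual implementation):

    // Clearing the error in `commitTnx` needs the illegal ABORTABLE_ERROR ->
    // IN_TRANSACTION transition from point 2; with `TxnSendOption`, an ignored
    // record simply never leaves IN_TRANSACTION, so points 1-3 do not arise.
    enum TxnState { IN_TRANSACTION, ABORTABLE_ERROR }

    class ToyTransactionManager {
        TxnState state = TxnState.IN_TRANSACTION;

        void onSendError(boolean ignoreSendErrors) {
            if (ignoreSendErrors) {
                return; // error never recorded; r2 can still add partitions
            }
            state = TxnState.ABORTABLE_ERROR; // today's behavior after KAFKA-9279
        }
    }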

Cheers,
Alieh






















