Hey,

just catching up on the latest discussion, and I think we are going in circles? And maybe over-complicating things a little bit too much?


My take on the KIP is as follows:

 - As a minimum, we want to solve the client-side RecordTooLargeException

- If we can tackle other client-side errors "for free", why not include them? I am ok with either approach: list all errors we cover, or list all errors we want to exclude. If we cannot reach a quick consensus on this question, I would pragmatically limit the scope to RecordTooLargeException, just to unblock the KIP and to address an issue users have actually hit (we can also update the enum to express this clearly, following a "list all errors we cover explicitly" approach) -- we can do follow-up KIPs for other errors on an on-demand basis and fix-forward / enlarge the scope successively.

(The question of what "poison pill" means might not have a precise answer, but a single too-large record is for sure a poison pill...)


 - The KIP is pretty clear that it covers "send" errors, not TX errors

- The KIP was a little unspecific about client- vs broker-side errors, but Alieh clearly expressed on the DISCUSS thread that she intends to only address client-side errors, so I would propose to just stick with this for now (if we want to cover broker-side errors, we can always do a follow-up KIP)

- Chris, you mentioned that you don't want us to get into a corner that we cannot get out of with a follow-up KIP. This is basically always a risk that I think we just need to accept. If Alieh wants to limit the scope to client-side errors, it would be a good incremental step forward, and I think we should take it. Or is there any justification why we _need_ to expand the scope to broker-side errors? (I agree it would be nice to include them, but I don't see it as a must, and it seems it raises too many open questions, preventing us from making progress...)

Btw: I also find the argument "it's much more difficult to implement" totally valid for limiting the scope of some work... we do this ALL the time...


- The proposed API makes sense (as I already expressed previously); using a `send()` overload with an expressive enum name and clear documentation makes the semantics and usage pattern easy to understand.
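
To make the usage pattern concrete, here is a minimal sketch (assuming the enum from the POC PR is called `TxnSendOption` with an `IGNORE_SEND_ERRORS` constant; the exact names and signature are not final):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;

    public class IgnoreSendErrorsSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("transactional.id", "tx-1");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                producer.beginTransaction();
                try {
                    // With the hypothetical option, a client-side
                    // RecordTooLargeException for this record no longer moves
                    // the TX into the error state; the record is dropped and
                    // the rest of the batch can still be committed.
                    producer.send(new ProducerRecord<>("topic", "key", "huge-value"),
                                  TxnSendOption.IGNORE_SEND_ERRORS);
                    producer.commitTransaction();
                } catch (KafkaException e) {
                    producer.abortTransaction();
                }
            }
        }
    }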



-Matthias



On 7/8/24 12:44 PM, Alieh Saeedi wrote:
Hey Chris


Thanks for the comprehensive answer.


I don’t believe that things were better/worse before/after the fix of X/Y.
I strongly believe that things get better and better when we intend to
improve our current solutions. I am happy that experts, including you, are
so cautious and, at the same time, so helpful with the changes and
improvements.


Regarding considering broker-side errors: As you mentioned, it’s a bit
difficult since we must overload both `send` and `commitTnx`. It won’t be
doable only in `commitTnx` (see this
<https://lists.apache.org/thread/zgdoxx93gj410zhh1wso5o5w44vy9wyg> please).
Also, the broker and the producer have different configurations and
consequently different constraints for a too-large record. Long story
short, the conclusion is what you mentioned: it will be harder to
implement and, in my opinion, even harder to clarify for the user (the
user must set `ignore-send-errors` in both methods, otherwise we throw
exceptions; or if they use it only in `send` we do action X, and if they
use it only in `commitTnx` we do action Y, and so on).


Regarding clarifying the new `send` for users: as you suggested before, we
must use a reasonable heuristic that would cover the errors (without
explicitly listing them). We can say something like: “setting
`ignore-send-errors` prevents failing a transaction because of a single
poison pill record that violates producer configs. The faulty record is
omitted from the batch and the transaction is committed with the rest of
the batch. The broker can still reject committing a transaction due to its
own rules/constraints.”

The main idea (as you called it, a heuristic) must be built around the
notion of records that never make it into the batch. Also, explicitly
mentioning `Producer` configs helps avoid the assumption that broker-side
errors are covered. This is for too-large records; of course, we can
re-word it and add the other cases as well to make it more accurate.
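
To be concrete, the Javadoc could look something like this (just a sketch; the wording is adapted from the above, and the signature and option name are the proposed ones, not final):

    /**
     * Sketch only; signature and option name are the proposed ones, not final.
     *
     * Setting {@code IGNORE_SEND_ERRORS} prevents failing a transaction
     * because of a single poison pill record that violates *producer* configs
     * (e.g. {@code max.request.size}). The faulty record is omitted from the
     * batch and the transaction is committed with the rest of the batch. The
     * broker can still reject the transaction due to its own rules/constraints
     * (e.g. {@code message.max.bytes}).
     */
    Future<RecordMetadata> send(ProducerRecord<K, V> record, TxnSendOption option);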


About the current votes: I assume that both the new PR
<https://github.com/apache/kafka/pull/16465> (changes in `send`) and the
former PR <https://github.com/apache/kafka/pull/16332> (changes in
`commitTnx`) do not consider broker-side errors. But of course this was a
main concern of all experts (specifically Andrew) since KIP-1038, and
through the discussions we came to this agreement. BTW, this is my
opinion/assumption and might be wrong.


Cheers,

Alieh

On Mon, Jul 8, 2024 at 5:38 PM Chris Egerton <chr...@aiven.io.invalid> wrote:

Hi Alieh,

Can you clarify why broker-side errors shouldn't be covered? The only real
rationale I can come up with is that it's easier to implement.

"Things were better for Kafka Streams before KAFKA-9279 was fixed" isn't
very convincing, because Kafka Streams is not the only user of the Java
producer client. And for others, especially new users, I doubt that this
new API we're proposing would make sense without having to consult a lot of
historical context.

I also don't think that most users will know or even care about the
distinction between errors that cause a record to fail before it's added to
a batch vs. after. If you were writing a producer application of your own,
and you wanted to handle RecordTooLargeException instances by dropping a
record without aborting a transaction, would you care about whether it was
your client or your broker that balked? Would you be happy if you wrote
logic expecting that that problem was solved once and for all, only to
learn that it could still affect you in other circumstances? Or,
alternatively, would you be happy if you wanted to solve that problem and
found an API that seemed to do exactly what you wanted, but after reading
the fine print, realized you'd have to do it yourself instead?

Ultimately, the more I think about this, the more I believe that we're
adding noise to the API (with the new overloaded variant of send) for a
feature that will likely bring confusion and even frustration to anyone
besides maintainers of Kafka Streams who tries to use it.

If the only concern about covering broker-side errors is that it would be
more difficult to implement, I believe we should strongly reconsider that
alternative. That said, if there is a straightforward way to explain this
feature to new users that won't mislead them or require them to do research
on producer internals, then I can still live with it.

Regarding a list of recoverable vs. irrecoverable errors, this is actually
the subject of another recently-introduced KIP:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1050%3A+Consistent+error+handling+for+Transactions

Finally, I'd also like to ask the people who have already voted (Andrew,
Matthias) if, at the time they voted, they believed that the API would
handle all errors, or only the subset of errors that would cause a record
to be rejected from a batch before it can be sent to a broker.

Best,

Chris

On Thu, Jul 4, 2024 at 12:43 PM Alieh Saeedi <asae...@confluent.io.invalid> wrote:

Salut from the KIP’s author


Clarifying two points:


1) broker-side errors:

As far as I remember we are not going to cover the errors originating from the broker!

A historical fact: One of the debate points in KIP-1038 was that by
defining a producer custom handler, the user may assume that broker-side
errors must be covered as well. They may define a handler for handling
`RecordTooLargeException` and still see such errors not being handled as
they wish.


2) Regarding irrecoverable/recoverable errors:

Before the fix of `KAFKA-9279`, errors such as `RecordTooLargeException` or errors related to missing metadata (both originating from Producer `send()`) were considered recoverable, but after that fix they became irrecoverable, without any Javadoc change or KIP. All the effort made in this KIP and the former one has been towards returning to the former state.


I am sure it is clear to you which sort of errors we are going to cover: a single record may happen to NOT get added to the batch due to issues with the record or its corresponding topic. The point was that if the record is not added to the batch, let's not fail the whole batch because of that non-existing record. We never intended to do anything on the broker side or to ignore more important errors. But I agree with you, Chris: if we are adding a new API, we must have good documentation for it. The sentence `all irrecoverable transactional errors will still be fatal` that you suggested is good. What do you think? I am totally against enumerating errors in Javadocs since this sort of error can change over time. Moreover, have you ever seen any list of recoverable or irrecoverable errors anywhere so far?


Bests,

Alieh

On Wed, Jul 3, 2024 at 6:07 PM Chris Egerton <chr...@aiven.io.invalid> wrote:

Hi Justine,

I agree that enumerating a list of errors that should be covered by the KIP is difficult; I was thinking it might be easier if we list the errors that should _not_ be covered by the KIP, and only if we can't define a reasonable heuristic that would cover them without having to explicitly list them. Could it be enough to say "all irrecoverable transactional errors will still be fatal", or even just "all transactional errors (as opposed to errors related to this specific record) will still be fatal"?

Cheers,

Chris

On Wed, Jul 3, 2024 at 11:56 AM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey Chris,

I think what you say makes sense. I agree that defining the behavior based on code that can possibly change is not a good idea, and I was trying to get a clearer definition from the KIP's author :)

I think it can always be hard to ensure that only specific errors are handled unless they are explicitly enumerated in code, as the code can change and can be changed by folks who are not aware of this KIP or conversation. I personally don't have the bandwidth to do this definition/enumeration of errors, so hopefully Alieh can expand upon this.

Justine

On Wed, Jul 3, 2024 at 8:28 AM Chris Egerton <chr...@aiven.io.invalid> wrote:

Hi Alieh,

I don't love defining the changes for this KIP in terms of a catch clause in the KafkaProducer class, for two reasons. First, the set of errors that are handled by that clause may shift over time as the code base is modified, and second, it would be fairly opaque to users who want to understand whether an error would be affected by using this API or not.

It also seems strange that we'd handle some types of RecordTooLargeException (i.e., ones reported client-side) with this API, but not others (i.e., ones reported by a broker).

I think this kind of API would be most powerful, most intuitive to users, and easiest to document if we expanded the scope to all record-send-related errors, except anything indicating issues with exactly-once semantics. That would include records that are too large (when caught both client- and server-side), records that can't be sent due to authorization failures, records sent to nonexistent topics/topic partitions, and keyless records sent to compacted topics. It would not include ProducerFencedException, InvalidProducerEpochException, UnsupportedVersionException, and possibly others.
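
Expressed as concrete exception types, one illustrative reading of that split could look like the sketch below (not an agreed-upon list; all classes referenced exist in org.apache.kafka.common.errors, and the null-key-on-compacted-topic case would surface as a broker-side invalid-record error):

    import java.util.Set;
    import org.apache.kafka.common.errors.InvalidProducerEpochException;
    import org.apache.kafka.common.errors.ProducerFencedException;
    import org.apache.kafka.common.errors.RecordTooLargeException;
    import org.apache.kafka.common.errors.TopicAuthorizationException;
    import org.apache.kafka.common.errors.UnknownTopicOrPartitionException;
    import org.apache.kafka.common.errors.UnsupportedVersionException;

    final class SendErrorScopeSketch {
        // Record-level errors a user could plausibly opt to drop:
        static final Set<Class<? extends Exception>> RECORD_LEVEL = Set.of(
            RecordTooLargeException.class,           // client- or broker-side
            TopicAuthorizationException.class,       // missing ACLs
            UnknownTopicOrPartitionException.class); // nonexistent topic/partition

        // Errors about the transaction/session itself, which stay fatal:
        static final Set<Class<? extends Exception>> ALWAYS_FATAL = Set.of(
            ProducerFencedException.class,
            InvalidProducerEpochException.class,
            UnsupportedVersionException.class);
    }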

@Justine -- do you think it would be possible to develop either a better definition for the kinds of "excluded" errors that should not be covered by this API, or, barring that, a comprehensive list of exact error types? And do you think this would be acceptable in terms of risk and complexity?

Cheers,

Chris

On Tue, Jul 2, 2024 at 5:05 PM Alieh Saeedi <asae...@confluent.io.invalid> wrote:

Hey Justine,

About the consequences: the consequences will be like when we did not have the fix made in `KAFKA-9279`: silent loss of data! Obviously, when the user intentionally chose to ignore errors, that would not be silent any more. Right?

Of course, considering all types of `ApiException`s would be too broad. But are the exceptions caught in `catch(ApiException e)` of the `doSend()` method also too broad?
-Alieh

On Tue, Jul 2, 2024 at 9:45 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey Alieh,

If we want to allow any error to be ignored we should probably run through all the errors to make sure they make sense. I just want to feel confident that we aren't just making a decision without considering the consequences carefully.

Justine

On Tue, Jul 2, 2024 at 12:30 PM Alieh Saeedi <asae...@confluent.io.invalid> wrote:

Hey Justine,

yes we talked about `RecordTooLargeException` as an example, but did we ever limit ourselves to only this specific exception? I think neither in the KIP nor in the PR. As Chris mentioned, this KIP is going to undo what we have done in `KAFKA-9279` in case 1) the user is in a transaction and 2) he decides to ignore the errors in which the record was not even added to the batch. Yes, and we suggested some methods for undoing or, in fact, moving back the transaction from the error state in `flush` or in `commitTnx`, and we finally came to the idea of not even doing the changes (better than undoing) in `send`.

Bests,
Alieh

On Tue, Jul 2, 2024 at 8:03 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey folks,

I understand where you are coming from by asking for specific use cases. My understanding based on previous conversations was that there were a few different errors that have been seen. One example I heard some information about was when the record was too large and it fails the batch. Besides that, I'm not really sure if there are cases in mind, though it is fair to ask on those and bring them up.

Does a record qualify as a poison pill if it targets a topic that doesn't exist? Or if it targets a topic that the producer principal lacks ACLs for? What if it fails broker-side validation (e.g., has a null key for a compacted topic)?

I think there was some parallel work with addressing the UnknownTopicOrPartitionError in another way. As for the other checks, acls, validation etc. I am not aware of that being in Alieh's scope, but we should be clear about exactly what we are doing.

All errors that fall into ApiException seems too broad to me.

Justine

On Tue, Jul 2, 2024 at 10:51 AM Alieh Saeedi <asae...@confluent.io.invalid> wrote:

Hey Chris,
thanks for sharing your concerns.

1) About the language of the KIP (or maybe later in Javadocs): Is it alright if I write all errors that fall into the `ApiException` category thrown (actually returned) by the Producer?
2) About future expansion: do you have any better suggestions for enum names? Do you think `IGNORE_API_EXCEPTIONS` or something like that is a "better/more accurate" one?

Bests,
Alieh

On Tue, Jul 2, 2024 at 7:29 PM Chris Egerton <chr...@aiven.io.invalid> wrote:

Hi Alieh and Justine,

I'm concerned that we're settling on a definition of "poison pill" that's easiest to tackle right now but may lead to shortcomings down the road. I understand the relationship between this KIP and KAFKA-9279, and I can totally get behind the desire to keep things small, focused, and simple in the name of avoiding bugs. However, what I don't think is clear at all is what the "specific circumstances" are that Justine mentioned. I had a drastically different idea of what the intended behavioral change would be before looking at the draft PR.

I would like 1) for us to be clearer about the categories of errors that we want to cover with this new API (especially since we'll have to find a clear, succinct way to document this for users), and 2) to make sure that if we do try to expand this API in the future, that we won't be painted into a corner.

For item 1, hopefully we can agree that the language in the KIP for IGNORE_SEND_ERRORS ("The records causing irrecoverable errors are excluded from the batch and the transaction is committed successfully.") is pretty vague. If we start using the phrase "poison pill record" that could help, but IMO more detail would still be needed. We know that we want to include records that are so large that they can be immediately rejected by the producer. But there are other cases that users might expect to be handled. Does a record qualify as a poison pill if it targets a topic that doesn't exist? Or if it targets a topic that the producer principal lacks ACLs for? What if it fails broker-side validation (e.g., has a null key for a compacted topic)?

For item 2, this really depends on how narrow the scope of what we're doing right now is. If we only handle a subset of the examples I laid out above that could possibly be considered poison pills with this KIP, do we want to lock ourselves in to never addressing more in the future, or can we choose an API (probably just enum names would be the only important decision here) that leaves room for more later?

Best,

Chris



On Tue, Jul 2, 2024 at 12:28 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Chris and Alieh,

My understanding is that this KIP is really only trying to solve an issue of a "poison pill" record that fails send(). We've talked a lot about having a generic framework for all errors, but I don't think that is what this KIP is trying to do. Essentially the request is to undo the change from KAFKA-9279 <https://issues.apache.org/jira/browse/KAFKA-9279> but under specific circumstances that are controlled. I really am concerned about opening new avenues for bugs with EOS and hesitate to handle any other types of errors. I think if we all agree on the problem that we are trying to solve, it is easier to agree on solutions.

Justine

On Mon, Jul 1, 2024 at 2:20 AM Alieh Saeedi <asae...@confluent.io.invalid> wrote:

Hi Matthias,
Thanks for the valid points you mentioned. I updated the KIP and the PR with:
1) mentioning that the new overloaded `send` throws an `IllegalStateException` if the user tries to ignore `send()` errors outside of a transaction.
2) the default implementation in the `Producer` interface throws an `UnsupportedOperationException`.
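
In code, the two points roughly look like this (a self-contained toy model only; the real signatures in the PR may differ):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Future;

    public class SendOverloadSketch {
        enum TxnSendOption { NONE, IGNORE_SEND_ERRORS }

        interface MiniProducer {
            // Point 2: the default implementation throws, so existing Producer
            // implementations outside of Kafka keep compiling unchanged.
            default Future<String> send(String record, TxnSendOption option) {
                throw new UnsupportedOperationException();
            }
        }

        static class MiniKafkaProducer implements MiniProducer {
            private boolean inTransaction = false;

            void beginTransaction() { inTransaction = true; }

            @Override
            public Future<String> send(String record, TxnSendOption option) {
                // Point 1: ignoring send errors is only legal inside a TX.
                if (option == TxnSendOption.IGNORE_SEND_ERRORS && !inTransaction) {
                    throw new IllegalStateException(
                        "IGNORE_SEND_ERRORS requires an active transaction");
                }
                return CompletableFuture.completedFuture(record);
            }
        }
    }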

Hi Chris,
Thanks for the feedback. I tried to clarify the points you listed:

-------> we've narrowed the scope from any error that might take place with producing a record to Kafka, to only the ones that are thrown directly from Producer::send;

From the very beginning and even since KIP-1038, the main purpose was to have "more flexibility," or, in other words, "giving the user the authority" to handle some specific exceptions thrown from the `Producer`. Due to the specific cases we had in mind, KIP-1038 was discarded and we decided to not define a `CustomExceptionHandler` for `Producer` and instead treat the `send` failures in a different way. The main issue is that `send` makes a transition to the error state, which then has to be undone. In fact, one single poison pill record makes the whole batch fail. The former suggestions that you agreed with have been all about un-doing this transition in `flush` or `commit`. The new suggestion is to un-do (or better, NOT do) it in `send`, due to the reasons listed in the discussions above.

Moreover, I would say that having such a large scope as you mentioned is impossible. In the best case, we may have control over the `Producer`. What shall we do with the broker? The `any error that might take place with producing a record to Kafka` is too much, I think.

-------> is this all we want to handle, and will it prevent us from handling more in the future in an intuitive way?

I think yes. This is all we want. Other sorts of errors, such as problems with partition addition, producer fenced exceptions, etc., seem to be more serious issues. The intention was to handle problems created by (maybe) a single poison pill record. BTW, I do not see any obstacles to future changes.

Bests,
Alieh

On Sat, Jun 29, 2024 at 3:03 AM Chris Egerton <chr...@aiven.io.invalid> wrote:

Ah, sorry--spoke too soon. The PR doesn't show that errors thrown from Producer::send are handled, but instead, ApiException instances that are caught inside KafkaProducer::doSend and are handled by returning an already-failed future are. I think the same question still applies (is this all we want to handle, and will it prevent us from handling more in the future in an intuitive way), though.

On Fri, Jun 28, 2024 at 8:57 PM Chris Egerton <chr...@aiven.io> wrote:

Hi Alieh,

This KIP has evolved a lot since I last looked at it, but the changes seem well thought-out both in semantics and API. One clarifying question I have is that it looks based on the draft PR that we've narrowed the scope from any error that might take place with producing a record to Kafka, to only the ones that are thrown directly from Producer::send; is that the intended behavior here? And if so, do you have thoughts on how we might design a follow-up KIP that would catch all errors (including ones reported asynchronously instead of synchronously)? I'd like it if we could leave the door open for that without painting ourselves into too much of a corner with the API design for this KIP.

Cheers,

Chris

On Fri, Jun 28, 2024 at 6:31 PM Matthias J. Sax <mj...@apache.org> wrote:

Thanks Alieh,

it seems this KIP can just pick between a couple of tradeoffs. Adding an overloaded `send()` as the KIP proposes makes sense to me and seems to provide the cleanest solution compared to the other options we discussed.

Given the explicit name of the passed-in option that highlights that the option is for TX only, it is pretty clear and avoids the issue of `flush()` ambiguity.

Nit: We should make clear on the KIP though, that the new `send()` overload would throw an `IllegalStateException` if TX are not used (similar to other TX methods like initTx(), etc.)

About the `Producer` interface, I am not sure how this was done in the past (eg, KIP-266 added `Consumer.poll(Duration)` w/o a default implementation), if we need a default implementation for backward compatibility or not? If we do want to add one, I think it would be appropriate to throw an `UnsupportedOperationException` by default, instead of just keeping the default impl empty?

My points are rather minor, and should not block this KIP though. Overall LGTM.



-Matthias

On 6/27/24 1:28 PM, Alieh Saeedi wrote:
Hi Justine,

Thanks for the suggestion.
Making applications validate every single record is not the best way, from an efficiency point of view. Moreover, between changing the behavior of the Producer in `send` and `commitTnx`, the former seems more reasonable and clean.

Bests,
Alieh

On Thu, Jun 27, 2024 at 8:14 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey Alieh,

I see there are two options now. So folks will be discussing the approaches and deciding the best way forward before we vote?
I do think there could be a problem with the approach on commit if we get stuck on an earlier error and have more records (potentially on new partitions) to commit, as the current PR is implemented.

I guess this takes us back to the question of whether the error should be cleared on send.

(And I guess at the back of my mind, I'm wondering if there is a way we can validate the "poison pill" records application-side before we even try to send them)

Justine

On Wed, Jun 26, 2024 at 4:38 PM Alieh Saeedi <asae...@confluent.io.invalid> wrote:
wrote:

Hi Justine,

I did not update the KIP with `TxnSendOption` since I thought it'd be better discussed here beforehand.
Right now, there are 2 PRs:
- the PR that implements the current version of the KIP: https://github.com/apache/kafka/pull/16332
- the POC PR that clarifies the `TxnSendOption`: https://github.com/apache/kafka/pull/16465

Bests,
Alieh

On Thu, Jun 27, 2024 at 12:42 AM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey Alieh,

I think I am a little confused. Are the 3 points above addressed by the KIP or did something change? The PR seems to not include this change and still has the CommitOption as well.

Thanks,
Justine

On Wed, Jun 26, 2024 at 2:15 PM Alieh Saeedi <asae...@confluent.io.invalid> wrote:
wrote:

Hi all,

Looking at the PR <https://github.com/apache/kafka/pull/16332> corresponding to the KIP, there are some points worthy of mention:

1) Clearing the error ends up in dirty/messy code in `TransactionManager`.

2) By clearing the error, we are actually doing an illegal transition from `ABORTABLE_ERROR` to `IN_TRANSACTION`, which is conceptually not acceptable. This can be the root cause of some issues, with perhaps further future changes by others.

3) If the poison pill record `r1` causes a transition to the error state and then the next record `r2` requires adding a partition to the transaction, the action fails due to being in the error state. In this case, clearing errors during `commitTnx(CLEAR_SEND_ERROR)` is too late. However, this case can NOT be the main concern as soon as KIP-890 is fully implemented.

My suggestion is to solve the problem where it arises. If the transition to the error state does not happen during `send()`, we do not need to clear the error later. Therefore, instead of `CommitOption`, we can define a `TxnSendOption` and prevent the `send()` method from going to the error state in case 1) we're in a transaction and 2) the user asked for IGNORE_SEND_ERRORS. For more clarity, you can take a look at the POC PR <https://github.com/apache/kafka/pull/16465>.
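
As a toy model of the difference (the states mirror the ones in `TransactionManager`, but this is only an illustration, not the actual implementation):

    // Clearing the error in `commitTnx` needs the illegal ABORTABLE_ERROR ->
    // IN_TRANSACTION transition from point 2; with `TxnSendOption`, an ignored
    // record simply never leaves IN_TRANSACTION, so points 1-3 do not arise.
    enum TxnState { IN_TRANSACTION, ABORTABLE_ERROR }

    class ToyTransactionManager {
        TxnState state = TxnState.IN_TRANSACTION;

        void onSendError(boolean ignoreSendErrors) {
            if (ignoreSendErrors) {
                return; // error never recorded; r2 can still add partitions
            }
            state = TxnState.ABORTABLE_ERROR; // today's behavior after KAFKA-9279
        }
    }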

Cheers,
Alieh






















