Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Boyang Chen Fri, 22 Jan 2021 10:47:23 -0800

Hey Guozhang,

Jason and I discussed the new API offline and decided to take another
approach. First, the reason not to introduce a new API that returns a boolean
flag is compatibility: old EOS code would not know that a given transaction
commit had failed internally, since it does not check the returned flag. Our
proposed alternative is to let *commitTransaction throw CommitFailedException
whenever the commit fails with a non-fatal exception*, so that on the caller
side the handling logic becomes:


    try {
        if (shouldCommit) {
            producer.commitTransaction();
        } else {
            resetToLastCommittedPositions(consumer);
            producer.abortTransaction();
        }
    } catch (CommitFailedException e) {
        // Transaction commit failed with abortable error, user could reset
        // the application state and resume with a new transaction. The root
        // cause was wrapped in the thrown exception.
        resetToLastCommittedPositions(consumer);
        producer.abortTransaction();
    } catch (KafkaException e) {
        producer.close();
        consumer.close();
        throw e;
    }

This approach looks cleaner, as all exception types other than
CommitFailedException are guaranteed to be fatal, which makes it very easy
for users to adopt. Meanwhile, commitTxn still throws on failure instead of
failing silently.
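
To make the control flow of the template concrete, here is a minimal,
self-contained sketch in plain Java. It does not use the Kafka client:
ToyKafkaException, ToyCommitFailedException, ToyProducer, Outcome, runOnce
and producerFailingWith are all hypothetical names invented for this
illustration, and the sketch returns an outcome where the template above
rethrows the fatal exception.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every type here is a hypothetical stand-in, not the Kafka client API.
class CommitTemplateSketch {
    // Stands in for KafkaException (fatal unless it is the subtype below).
    static class ToyKafkaException extends RuntimeException {
        ToyKafkaException(String message) { super(message); }
    }

    // Stands in for the proposed abortable CommitFailedException.
    static class ToyCommitFailedException extends ToyKafkaException {
        ToyCommitFailedException(String message) { super(message); }
    }

    interface ToyProducer {
        void commitTransaction();
        void abortTransaction();
        void close();
    }

    enum Outcome { COMMITTED, ABORTED, SHUT_DOWN }

    // Mirrors the template: the subtype catch must come before the supertype catch.
    static Outcome runOnce(ToyProducer producer, boolean shouldCommit, List<String> log) {
        try {
            if (shouldCommit) {
                producer.commitTransaction();
                log.add("commit");
                return Outcome.COMMITTED;
            }
            log.add("reset");
            producer.abortTransaction();
            log.add("abort");
            return Outcome.ABORTED;
        } catch (ToyCommitFailedException e) {
            // Abortable: reset application state, abort, resume with a new transaction.
            log.add("reset");
            producer.abortTransaction();
            log.add("abort");
            return Outcome.ABORTED;
        } catch (ToyKafkaException e) {
            // Anything else is treated as fatal (the template rethrows here).
            producer.close();
            log.add("close");
            return Outcome.SHUT_DOWN;
        }
    }

    // Builds a stub whose commit throws the given exception, or succeeds when null.
    static ToyProducer producerFailingWith(RuntimeException onCommit) {
        return new ToyProducer() {
            public void commitTransaction() { if (onCommit != null) throw onCommit; }
            public void abortTransaction() { }
            public void close() { }
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        System.out.println(runOnce(producerFailingWith(null), true, log));
        System.out.println(runOnce(
            producerFailingWith(new ToyCommitFailedException("abortable")), true, log));
        System.out.println(runOnce(
            producerFailingWith(new ToyKafkaException("fatal")), true, log));
        System.out.println(log);
    }
}
```

The caller only has to distinguish one abortable type from everything else,
which is the simplification the proposal is after.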

In addition, we decided to drop the recommendation to handle
TimeoutException and leave it to users to make the call. The downside of
blindly calling abortTxn upon timeout is that we could end up in an illegal
state if the commit had already succeeded on the broker side. Without a good
guarantee on the outcome, we should not overcomplicate the template, IMHO.
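
The timeout concern can be illustrated with a toy state machine. This is
only a sketch under stated assumptions: ToyProducer, ToyTimeoutException,
State, blindAbortOnTimeout and retryCommitOnTimeout are hypothetical names,
and the model simply assumes that a timed-out commit leaves the outcome
unknown to the client, so a later abort is rejected. Retrying the commit is
shown as one possible user choice, not as a recommendation.

```java
// Toy model only: hypothetical stand-ins, not the Kafka client API.
class TimeoutAbortSketch {
    static class ToyTimeoutException extends RuntimeException {
        ToyTimeoutException(String message) { super(message); }
    }

    enum State { IN_TRANSACTION, COMMIT_OUTCOME_UNKNOWN, READY }

    static class ToyProducer {
        State state = State.IN_TRANSACTION;

        // Assumption: the first commit attempt times out on the client while the
        // outcome on the broker is unknown; a retry then observes completion.
        void commitTransaction() {
            if (state == State.IN_TRANSACTION) {
                state = State.COMMIT_OUTCOME_UNKNOWN;
                throw new ToyTimeoutException("commit timed out on the client");
            }
            if (state == State.COMMIT_OUTCOME_UNKNOWN) {
                state = State.READY;
                return;
            }
            throw new IllegalStateException("no transaction in progress");
        }

        // Abort is only legal while the transaction is still open.
        void abortTransaction() {
            if (state != State.IN_TRANSACTION) {
                throw new IllegalStateException("cannot abort in state " + state);
            }
            state = State.READY;
        }
    }

    // Policy A: blindly abort after a timeout, which hits the illegal state.
    static String blindAbortOnTimeout(ToyProducer producer) {
        try {
            producer.commitTransaction();
            return "committed";
        } catch (ToyTimeoutException e) {
            try {
                producer.abortTransaction();
                return "aborted";
            } catch (IllegalStateException illegal) {
                return "illegal state";
            }
        }
    }

    // Policy B: retry the same commit once after a timeout.
    static String retryCommitOnTimeout(ToyProducer producer) {
        try {
            producer.commitTransaction();
        } catch (ToyTimeoutException e) {
            producer.commitTransaction();
        }
        return "committed";
    }

    public static void main(String[] args) {
        System.out.println(blindAbortOnTimeout(new ToyProducer()));
        System.out.println(retryCommitOnTimeout(new ToyProducer()));
    }
}
```

Because the right reaction depends on what the application can tolerate, the
template leaves this branch out rather than prescribing one policy.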

Let me know your thoughts on the new approach here, thank you!

Best,
Boyang

On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <[email protected]> wrote:

> Thanks for your clarification on 2)/3), that makes sense.
>
> On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <[email protected]>
> wrote:
>
> > Thanks for the input Guozhang, replied inline.
> >
> > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <[email protected]>
> wrote:
> >
> > > Hello Boyang,
> > >
> > > Thanks for the updated KIP. I read it again and have the following
> > > thoughts:
> > >
> > > > 0. I'm a bit concerned that if commitTxn does not throw any non-fatal
> > > > exception, and instead we rely on the subsequent beginTxn call to
> > > > throw, it may break callers that rely on commitTxn succeeding before
> > > > performing some non-rollback operations. For example:
> > >
> > > beginTxn()
> > > // do some read-write on my local DB
> > > commitTxn()
> > > // if commitTxn succeeds, then commit the DB
> > >
> > > -------------
> > >
> > > The issue is that, committing DB is a non-rollback operation, and users
> > may
> > > just rely on commitTxn to return without error to make this
> non-rollback
> > > call. Of course we can just claim this pattern is not legitimate and is
> > not
> > > the right way of doing things, but many users may naturally adopt this
> > > pattern.
> > >
> > > So maybe we should still let commitTxn also throw non-fatal exceptions,
> > in
> > > which case we would then call abortTxn again.
> > >
> > > But if we do this, the pattern becomes:
> > >
> > > try {
> > >    beginTxn()
> > >    // do something
> > > } catch (Exception) {
> > >    shouldCommit = false;
> > > }
> > >
> > > if (shouldCommit) {
> > >     try {
> > >         commitTxn()
> > >     } catch (...) {        // enumerate all fatal exceptions
> > >         shutdown()
> > >     } catch (KafkaException) {
> > >         // non-fatal
> > >         shouldCommit = false;
> > >     }
> > > }
> > >
> > > if (!shouldCommit && running()) {
> > > >     try {
> > >         abortTxn()
> > >     } catch (KafkaException) {
> > >         // only throw fatal
> > >         shutdown()
> > >     }
> > > }
> > >
> > > ---------------------
> > >
> > > Which is much more complicated.
> > >
> > > > That makes me think the alternative we discussed offline may be
> > > > better: let commitTxn() return a boolean flag.
> > >
> > > * If it returns true, it means the commit succeeded. Users can
> > comfortably
> > > continue and do any external non-rollback operations if they like.
> > > * If it returns false, it means the commit failed with non-fatal error
> > *AND
> > > the txn has been aborted*. Users do not need to call abortTxn again.
> > > * If it throws, then it means fatal errors. Users should shut down the
> > > client.
> > >
> > > In this case, the pattern becomes:
> > >
> > > try {
> > >    beginTxn()
> > >    // do something
> > > } catch (Exception) {
> > >    shouldCommit = false;
> > > }
> > >
> > > try {
> > >     if (shouldCommit) {
> > >         commitSucceeded = commitTxn()
> > >     } else {
> > >         // reset offsets, rollback operations, etc
> > >         abortTxn()
> > >     }
> > > } catch (KafkaException) {
> > >     // only throw fatal
> > >     shutdown()
> > > }
> > >
> > > if (commitSucceeded)
> > >    // do other non-rollbackable things
> > > else
> > >    // reset offsets, rollback operations, etc
> > >
> > > -------------------------
> > >
> > > > Of course, if we want better visibility into what caused the commit
> > > > to fail and the txn to abort, we can let the return type be an enum
> > > > instead of a flag. But my main idea is to still let commitTxn be the
> > > > final point where users can tell whether this txn succeeded or not,
> > > > instead of relying on the next beginTxn() call.
> > >
> > I agree that adding a boolean flag is indeed useful in this case. Will
> > update the KIP.
> >
> > > > 1. Re: "while maintaining the behavior to throw fatal exception in raw":
> > > > not sure what you mean by "throw" here. Are you proposing that the
> > > > callback would *pass* (not throw) in any fatal exceptions when
> > > > triggered, without wrapping?
> > >
> > Yes, I meant *pass*; the benefit is to keep the end user's expectations
> > consistent regarding exception handling.
> >
> >
> > > > 2. I'm not sure I understand the claim regarding the callback that "In
> > > > EOS setup, it is not required to handle the exception". Are you
> > > > proposing that, e.g. in Streams, we no longer try to handle any
> > > > exceptions in the callback if EOS is enabled, but just let commitTxn()
> > > > itself fail to be notified about the problem? Personally I think in
> > > > Streams we should just make the handling logic of the callback
> > > > consistent regardless of the EOS settings (today we have different
> > > > logic depending on this setting, which I think could be unified as
> > > > well).
> > >
> > My idea originates from the claim in the send API documentation:
> > "When used as part of a transaction, it is not necessary to define a
> > callback or check the result of the future in order to detect errors from
> > <code>send</code>."
> > My understanding is that for EOS, the exception will be detected by calling
> > the transactional APIs either way, so tracking all the sendExceptions in
> > RecordCollector is duplicate handling. However, I looked up how
> > sendException is used today:
> >
> > 1. Pass to "ProductionExceptionHandler" for any default or customized
> > exception handler to handle
> > 2. Stop collecting offset info or new exceptions
> > 3. Check and rethrow exceptions in close(), flush() or new send() calls
> >
> > To my understanding, we should still honor the commitment to #1 for any
> > user-customized implementation. #2 does not really affect our decision
> > on EOS. #3 is still valuable, as it could help us fail fast in a new
> > send() instead of waiting until a later stage of processing. In that
> > sense, I agree we should continue to record send exceptions even in the
> > EOS case to ensure the strength of the stream-side Producer logic. On the
> > safer side, we no longer need to wrap certain fatal exceptions like
> > ProducerFenced as TaskMigrated, since they should not crash the stream
> > thread if thrown in raw format, once we adopt the new processing model in
> > the send phase.
> >
> >
> > >
> > > Guozhang
> > >
> > >
> > >
> > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> [email protected]>
> > > wrote:
> > >
> > > > Thanks for everyone's feedback so far. I have polished the KIP after
> > > > offline discussion with some folks working on EOS to make the
> exception
> > > > handling more lightweight. The essential change is that we are not
> > > > inventing a new intermediate exception type, but instead separating a
> > > full
> > > > transaction into two phases:
> > > >
> > > > 1. The data transmission phase
> > > > 2. The commit phase
> > > >
> > > > > This way any exception thrown from phase 1 will be an indicator in
> > > > > phase 2 of whether we should commit or abort, and from now on
> > > > > `commitTransaction` should only throw fatal exceptions, similar to
> > > > > `abortTransaction`, so that any KafkaException caught in the commit
> > > > > phase will definitely be fatal and crash the app. For more advanced
> > > > > users such as Streams, we have the ability to further wrap selected
> > > > > types of fatal exceptions to trigger task migration if necessary.
> > > >
> > > > More details in the KIP, feel free to take another look, thanks!
> > > >
> > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> > [email protected]>
> > > > wrote:
> > > >
> > > > > Thanks Bruno for the feedback.
> > > > >
> > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <[email protected]>
> > > wrote:
> > > > >
> > > > >> Thanks Boyang for the KIP!
> > > > >>
> > > > >> Like Matthias, I do also not know the producer internal well
> enough
> > to
> > > > >> comment on the categorization. However, I think having a super
> > > exception
> > > > >> (e.g. RetriableException) that  encodes if an exception is fatal
> or
> > > not
> > > > >> is cleaner, better understandable and less error-prone, because
> > > ideally
> > > > >> when you add a new non-fatal exception in future you just need to
> > > think
> > > > >> about letting it inherit from the super exception and all the rest
> > of
> > > > >> the code will just behave correctly without the need to wrap the
> new
> > > > >> exception into another exception each time it is thrown (maybe it
> is
> > > > >> thrown at different location in the code).
> > > > >>
> > > > >> As far as I understand the following statement from your previous
> > > e-mail
> > > > >> is the reason that currently such a super exception is not
> possible:
> > > > >>
> > > > >> "Right now we have RetriableException type, if we are going to
> add a
> > > > >> `ProducerRetriableException` type, we have to put this new
> interface
> > > as
> > > > >> the parent of the RetriableException, because not all thrown
> > non-fatal
> > > > >> exceptions are `retriable` in general for producer"
> > > > >>
> > > > >>
> > > > >> In the list of exceptions in your KIP, I found non-fatal
> exceptions
> > > that
> > > > >> do not inherit from RetriableException. I guess those are the ones
> > you
> > > > >> are referring to in your statement:
> > > > >>
> > > > >> InvalidProducerEpochException
> > > > >> InvalidPidMappingException
> > > > >> TransactionAbortedException
> > > > >>
> > > > > >> All of those exceptions are non-fatal and do not inherit from
> > > > > >> RetriableException. Is there a reason for that? If they inherited
> > > > > >> from RetriableException we would be a bit closer to the super
> > > > > >> exception I mention above.
> > > > >>
> > > > > > The reason is that the sender may catch those exceptions in the
> > > > > ProduceResponse, and it currently does infinite
> > > > > retries on RetriableException. To make sure we could trigger the
> > > > > abortTransaction() by catching non-fatal thrown
> > > > > exceptions and reinitialize the txn state, we chose not to let
> those
> > > > > exceptions inherit RetriableException so that
> > > > > they won't cause infinite retry on sender.
> > > > >
> > > > >
> > > > >> With OutOfOrderSequenceException and UnknownProducerIdException, I
> > > think
> > > > >> to understand that their fatality depends on the type (i.e.
> > > > >> configuration) of the producer. That makes it difficult to have a
> > > super
> > > > >> exception that encodes the retriability as mentioned above. Would
> it
> > > be
> > > > >> possible to introduce exceptions that inherit from
> > RetriableException
> > > > >> and that are thrown when those exceptions are caught from the
> > brokers
> > > > >> and the type of the producer is such that the exceptions are
> > > retriable?
> > > > >>
> > > > > > Yea, I think in general it is difficult to get the exception type
> > > > > > mixing all right. I have already proposed another solution based on my
> > > > > offline discussion with some folks working on EOS
> > > > > to make the handling more straightforward for end users without the
> > > need
> > > > > to distinguish exception fatality.
> > > > >
> > > > >> Best,
> > > > >> Bruno
> > > > >>
> > > > >>
> > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > > >> > Thanks Boyang for the proposal! I made a pass over the list and
> > here
> > > > are
> > > > >> > some thoughts:
> > > > >> >
> > > > >> > 0) Although this is not part of the public API, I think we
> should
> > > make
> > > > >> sure
> > > > >> > that the suggested pattern (i.e. user can always call abortTxn()
> > > when
> > > > >> > handling non-fatal errors) are indeed supported. E.g. if the txn
> > is
> > > > >> already
> > > > >> > aborted by the broker side, then users can still call abortTxn
> > which
> > > > >> would
> > > > >> > not throw another exception but just be treated as a no-op.
> > > > >> >
> > > > >> > 1) *ConcurrentTransactionsException*: I think this error can
> also
> > be
> > > > >> > returned but not documented yet. This should be a non-fatal
> error.
> > > > >> >
> > > > >> > 2) *InvalidTxnStateException*: this error is returned from
> broker
> > > when
> > > > >> txn
> > > > >> > state transition failed (e.g. it is trying to transit to
> > > > complete-commit
> > > > > >> > while the current state is not prepare-commit). This error could
> > > > > >> > indicate a bug in the client internal code or the broker code, OR a
> > > > > >> > user error --- a
> > > > >> > similar error is ConcurrentTransactionsException, i.e. if Kafka
> is
> > > > >> bug-free
> > > > >> > these exceptions would only be returned if users try to do
> > something
> > > > >> wrong,
> > > > >> > e.g. calling abortTxn right after a commitTxn, etc. So I'm
> > thinking
> > > it
> > > > >> > should be a non-fatal error instead of a fatal error, wdyt?
> > > > >> >
> > > > >> > 3) *KafkaException*: case i "indicates fatal transactional
> > sequence
> > > > >> > (Fatal)", this is a bit conflicting with the
> > > *OutOfSequenceException*
> > > > >> that
> > > > >> > is treated as non-fatal. I guess your proposal is that
> > > > >> > OutOfOrderSequenceException would be treated either as fatal
> with
> > > > >> > transactional producer, or non-fatal with idempotent producer,
> is
> > > that
> > > > >> > right? If the producer is only configured with idempotency but
> not
> > > > >> > transaction, then throwing a TransactionStateCorruptedException
> > for
> > > > >> > non-fatal errors would be confusing since users are not using
> > > > >> transactions
> > > > >> > at all.. So I suggest we always throw OutOfSequenceException
> as-is
> > > > (i.e.
> > > > >> > not wrapped) no matter how the producer is configured, and let
> the
> > > > >> caller
> > > > >> > decide how to handle it based on whether it is only idempotent
> or
> > > > >> > transactional itself.
> > > > >> >
> > > > >> > 4) Besides all the txn APIs, the `send()` callback / future can
> > also
> > > > >> throw
> > > > >> > txn-related exceptions, I think this KIP should also cover this
> > API
> > > as
> > > > >> well?
> > > > >> >
> > > > >> > 5) This is related to 1/2) above: sometimes those non-fatal
> errors
> > > > like
> > > > >> > ConcurrentTxn or InvalidTxnState are not due to the state being
> > > > >> corrupted
> > > > >> > at the broker side, but maybe users are doing something wrong.
> So
> > > I'm
> > > > >> > wondering if we should further distinguish those non-fatal
> errors
> > > > >> between
> > > > >> > a) those that are caused by Kafka itself, e.g. a broker timed
> out
> > > and
> > > > >> > aborted a txn and later an endTxn request is received, and b)
> the
> > > > user's
> > > > >> > API call pattern is incorrect, causing the request to be
> rejected
> > > with
> > > > >> an
> > > > >> > error code from the broker. *TransactionStateCorruptedException*
> > > feels
> > > > >> to
> > > > >> > me more like for case a), but not case b).
> > > > >> >
> > > > >> >
> > > > >> > Guozhang
> > > > >> >
> > > > >> >
> > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> > > > [email protected]>
> > > > >> > wrote:
> > > > >> >
> > > > >> >> Thanks Matthias, I think your proposal makes sense as well, on
> > the
> > > > pro
> > > > >> side
> > > > >> >> we could have a universally agreed exception type to be caught
> by
> > > the
> > > > >> user,
> > > > >> >> without having an extra layer on top of the actual exceptions.
> I
> > > > could
> > > > >> see
> > > > >> >> some issue on downsides:
> > > > >> >>
> > > > >> >> 1. The exception hierarchy will be more complex. Right now we
> > have
> > > > >> >> RetriableException type, if we are going to add a
> > > > >> >> `ProducerRetriableException` type, we have to put this new
> > > interface
> > > > >> as the
> > > > >> >> parent of the RetriableException, because not all thrown
> > non-fatal
> > > > >> >> exceptions are `retriable` in general for producer, for example
> > > > >> >> <
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > >> >>> .
> > > > >> >> We could have an empty interface `ProducerRetriableException`
> to
> > > let
> > > > >> all
> > > > >> >> the thrown exceptions implement for sure, but it's a bit
> > unorthodox
> > > > in
> > > > >> the
> > > > >> >> way we deal with exceptions in general.
> > > > >> >>
> > > > >> >> 2. There are cases where we throw a KafkaException wrapping
> > another
> > > > >> >> KafkaException as either fatal or non-fatal. If we use an
> > interface
> > > > to
> > > > >> >> solve #1, it is also required to implement another bloated
> > > exception
> > > > >> class
> > > > >> >> which could replace KafkaException type, as we couldn't mark
> > > > >> KafkaException
> > > > >> >> as retriable for sure.
> > > > >> >>
> > > > >> >> 3. In terms of the encapsulation, wrapping means we could limit
> > the
> > > > >> scope
> > > > >> >> of affection to the producer only, which is important since we
> > > don't
> > > > >> want
> > > > >> >> shared exception types to implement a producer-related
> interface,
> > > > such
> > > > >> >> as UnknownTopicOrPartitionException.
> > > > >> >>
> > > > >> >> Best,
> > > > >> >> Boyang
> > > > >> >>
> > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <
> [email protected]
> > >
> > > > >> wrote:
> > > > >> >>
> > > > >> >>> Thanks for the KIP Boyang!
> > > > >> >>>
> > > > >> >>> Overall, categorizing exceptions makes a lot of sense. As I
> > don't
> > > > know
> > > > >> >>> the producer internals well enough, I cannot comment on the
> > > > >> >>> categorization in detail though.
> > > > >> >>>
> > > > >> >>> What I am wondering is, if we should introduce an exception
> > > > interface
> > > > >> >>> that non-fatal exception would implement instead of creating a
> > new
> > > > >> class
> > > > >> >>> that will wrap non-fatal exceptions? What would be the
> pros/cons
> > > for
> > > > >> >>> both designs?
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> -Matthias
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > > > >> >>>> Hey there,
> > > > >> >>>>
> > > > >> >>>> I would like to start a discussion thread for KIP-691:
> > > > >> >>>>
> > > > >> >>>
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > >> >>>>
> > > > >> >>>> The KIP is aiming to simplify the exception handling logic
> for
> > > > >> >>>> transactional Producer users by classifying fatal and
> non-fatal
> > > > >> >>> exceptions
> > > > >> >>>> and throw them correspondingly for easier catch and retry.
> Let
> > me
> > > > >> know
> > > > >> >>> what
> > > > >> >>>> you think.
> > > > >> >>>>
> > > > >> >>>> Best,
> > > > >> >>>> Boyang
> > > > >> >>>>
> > > > >> >>>
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>
