That's a good suggestion Jason. Adding a dedicated PRODUCER_FENCED error
should help distinguish exceptions and could safely mark
INVALID_PRODUCER_EPOCH exception as non-fatal in the new code. Updated the
KIP.

Boyang

On Wed, Apr 8, 2020 at 12:18 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Boyang,
>
> Thanks for the KIP. I think the main problem we've identified here is that
> the current errors conflate transaction timeouts with producer fencing. The
> first of these ought to be recoverable, but we cannot distinguish it. The
> suggestion to add a new error code makes sense to me, but it leaves this
> bit of awkwardness:
>
> > One extra issue that needs to be addressed is how to handle
> `ProducerFenced` from Produce requests.
>
> In fact, the underlying error code here is INVALID_PRODUCER_EPOCH. It's
> just that the code treats this as equivalent to `ProducerFenced`. One
> thought I had is maybe PRODUCER_FENCED needs to be a separate error code as
> well. After all, only the transaction coordinator knows whether a producer
> has been fenced or not. So maybe the handling could be something like the
> following:
>
> 1. Produce requests may return INVALID_PRODUCER_EPOCH. The producer
> recovers by following KIP-360 logic to see whether the epoch can be bumped.
> If it cannot because the broker version is too old, we fail.
> 2. Transactional APIs may return either TRANSACTION_TIMEOUT or
> PRODUCER_FENCED. In the first case, we do the same as above. We try to
> recover by bumping the epoch. If the error is PRODUCER_FENCED, it is fatal.
> 3. Older brokers may return INVALID_PRODUCER_EPOCH as well from
> transactional APIs. We treat this the same as 1.
>
> What do you think?
>
> Thanks,
> Jason
>
>
>
>
>
>
>
>
>
>
> On Mon, Apr 6, 2020 at 3:41 PM Boyang Chen <reluctanthero...@gmail.com>
> wrote:
>
> > Yep, updated the KIP, thanks!
> >
> > On Mon, Apr 6, 2020 at 3:11 PM Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > Regarding 2), sounds good, I saw UNKNOWN_PRODUCER_ID is properly
> handled
> > > today in produce / add-partitions-to-txn / add-offsets-to-txn / end-txn
> > > responses, so that should be well covered.
> > >
> > > Could you reflect this in the wiki page that the broker should be
> > > responsible for using different error codes given client request
> versions
> > > as well?
> > >
> > >
> > >
> > > Guozhang
> > >
> > > On Mon, Apr 6, 2020 at 9:20 AM Boyang Chen <reluctanthero...@gmail.com
> >
> > > wrote:
> > >
> > > > Thanks Guozhang for the review!
> > > >
> > > > On Sun, Apr 5, 2020 at 5:47 PM Guozhang Wang <wangg...@gmail.com>
> > wrote:
> > > >
> > > > > Hello Boyang,
> > > > >
> > > > > Thank you for the proposed KIP. Just some minor comments below:
> > > > >
> > > > > 1. Could you also describe which producer APIs could potentially
> > throw
> > > > the
> > > > > new TransactionTimedOutException, and also how should callers
> handle
> > > them
> > > > > differently (i.e. just to make your description more concrete as
> > > > javadocs).
> > > > >
> > > > > Good point, I will add example java doc changes.
> > > >
> > > >
> > > > > 2. It's straight-forward if client is on newer version while
> broker's
> > > on
> > > > > older version; however If the client is on older version while
> > broker's
> > > > on
> > > > > newer version, today would the internal module of producers treat
> it
> > > as a
> > > > > general fatal error or not? If not, should the broker set a
> different
> > > > error
> > > > > code upon detecting older request versions?
> > > > >
> > > > > That's a good suggestion, my understanding is that the prerequisite
> > of
> > > > this change is the new KIP-360 API which is going out with 2.5,
> > > > so we could just return UNKNOWN_PRODUCER_ID instead of
> PRODUCER_FENCED
> > as
> > > > it could be interpreted as abortable error
> > > > in 2.5 producer and retry. Producers older than 2.5 will not be
> > covered.
> > > > WDYT?
> > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Thu, Apr 2, 2020 at 1:40 PM Boyang Chen <
> > reluctanthero...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hey there,
> > > > > >
> > > > > > I would like to start discussion for KIP-588:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-588%3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts
> > > > > >
> > > > > > which aims to improve Producer resilience to transaction timeout
> > due
> > > to
> > > > > > transient system gaps.
> > > > > >
> > > > > > Best,
> > > > > > Boyang
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>

Reply via email to