Hi Jason,

I've made another pass on the wiki page and it reads much better now. One
more clarification about the "Simplified error handling" section:

1. There will be no "retriable errors" from the broker side for any
send requests and txn requests (to txn coordinators). All errors would
cause the corresponding txn to eventually be aborted.
2. Some errors (UNKNOWN_PRODUCER, INVALID_PID_MAPPING) would cause the
producer to enter the ABORTABLE_ERROR state, aborting only the current
txn; some others (INVALID_PRODUCER_EPOCH) would cause the producer to
enter the FATAL_ERROR state, which additionally causes all future txns
to be aborted.
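If it helps, the distinction between the two states could be sketched roughly like this (a minimal Python sketch; the state names follow the wiki, everything else is my own naming, not actual client code):

```python
# Abortable errors fail only the in-flight txn; fatal errors (fencing)
# doom all future txns as well, so the producer can never recover.
ABORTABLE_ERRORS = {"UNKNOWN_PRODUCER", "INVALID_PID_MAPPING"}
FATAL_ERRORS = {"INVALID_PRODUCER_EPOCH"}

class ProducerStateMachine:
    def __init__(self):
        self.state = "READY"

    def on_error(self, error):
        # Fatal errors take precedence and are sticky.
        if error in FATAL_ERRORS:
            self.state = "FATAL_ERROR"
        elif error in ABORTABLE_ERRORS and self.state != "FATAL_ERROR":
            self.state = "ABORTABLE_ERROR"
        return self.state

    def abort_transaction(self):
        # Aborting clears an abortable error, but a fenced producer
        # cannot continue: every future txn would be aborted anyway.
        if self.state == "FATAL_ERROR":
            raise RuntimeError("producer is fenced; it must be closed")
        self.state = "READY"
```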

Is that right?


Guozhang


On Wed, Aug 21, 2019 at 3:52 PM Matthias J. Sax <matth...@confluent.io>
wrote:
>
> Thanks Jason!
>
> LGTM.
>
> On 8/21/19 3:07 PM, Jason Gustafson wrote:
> > Hi Matthias,
> >
> > Thanks, I appreciate the thorough review. I've revised the section to
> > make the logic clearer. I think you have it right except for (1). We only
> > generate a new PID if the epoch cannot be incremented without overflow.
> >
> > -Jason
> >
> > On Tue, Aug 20, 2019 at 5:45 PM Matthias J. Sax <matth...@confluent.io>
> > wrote:
> >
> >> Thanks for the KIP. I just have some clarification questions to make
> >> sure I understand the proposal correctly:
> >>
> >> 1) "Safe Epoch Incrementing"
> >>
> >>> When the coordinator receives a new InitProducerId request, we will
> >>> use the following logic to update the epoch:
> >>>
> >>> 1. No epoch is provided: the current epoch will be bumped and the last
> >>> epoch will be set to -1.
> >>> 2. Epoch and producerId are provided, and the provided producerId
> >>> matches the current producerId or the provided producerId matches the
> >>> previous producerId and the provided epoch is exhausted:
> >>>       a. Provided epoch matches current epoch: the last epoch will be
> >>> set to the current epoch, and the current epoch will be bumped.
> >>>       b. Provided epoch matches last epoch: the current epoch will be
> >>> returned
> >>>       c. Else: return INVALID_PRODUCER_EPOCH
> >>> 3. Otherwise, return INVALID_PRODUCER_EPOCH
> >>
> >> Case (1) would be for a new producer. Hence, should we state that "no
> >> PID" is provided (instead of "no epoch")? That might be clearer, and it
> >> implies that there is no epoch anyway.
> >>
> >> Case (2) would be for a producer that received an `UNKNOWN_PRODUCER_ID`
> >> error and tries to re-initialize itself.
> >>
> >> Case (2a) implies that the producer sent its first request and is not
> >> fenced. Case (2b) implies that the producer retries to re-initialize
> >> itself, ie, its first request to re-initialize did not get a response but
> >> was processed by the transaction coordinator. Case (2c) implies that a
> >> producer was fenced (similar to case 3, even if I am not sure what case 3
> >> actually would be?)
> >>
> >> Please let me know if my understanding is correct.
> >>
> >> What is still unclear to me is, why does case (2 -- or is it only 2b?)
> >> require that the "provided epoch is exhausted"?
> >>
> >> For case 2b:
> >>
> >> Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an
> >> `UNKNOWN_PRODUCER_ID`. It sends an initPidRequest with the corresponding
> >> PID/epoch pair. The TC processes the request and creates a new PID=101
> >> with new epoch=0; however, the response to the producer is lost. The TC
> >> still stores `currentPid=101`, `currentEpoch=0` and `previousPid=100`,
> >> `previousEpoch=MAX_EPOCH`. When the producer retries, the sent
> >> PID/epoch still matches the previous PID/epoch pair and hence the TC
> >> knows it's a retry?
> >>
> >> If this reasoning is correct, should the logic be as follows:
> >>
> >> 1. No PID is provided: create a new PID with epoch=0 and set the last
> >> epoch to -1.
> >> 2. Epoch and producerId are provided
> >>    a) the provided producerId/epoch matches the current producerId/epoch:
> >>       i) if the epoch is not exhausted, bump the epoch
> >>       ii) if the epoch is exhausted, create a new PID with epoch=0
> >>    b) the provided producerId/epoch matches the previous
> >> producerId/epoch: respond with current PID/epoch
> >>    c) Otherwise, return INVALID_PRODUCER_EPOCH
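For what it's worth, the restated logic above could be sketched in Python like this (MAX_EPOCH, the state layout, and all names are my assumptions for illustration, not the actual coordinator code):

```python
import itertools

MAX_EPOCH = 32767             # assumption: epoch is a 16-bit value
_pids = itertools.count(101)  # toy PID allocator for the sketch

def init_producer_id(state, pid=None, epoch=None):
    """state holds current_pid/current_epoch plus the previous pair,
    so that a retried InitProducerId request can be recognized."""
    if pid is None:
        # (1) new producer: fresh PID, epoch 0, no previous pair
        state.update(current_pid=next(_pids), current_epoch=0,
                     prev_pid=None, prev_epoch=-1)
    elif pid == state["current_pid"] and epoch == state["current_epoch"]:
        if epoch < MAX_EPOCH:
            # (2a-i) remember the old pair, then bump the epoch
            state.update(prev_pid=pid, prev_epoch=epoch)
            state["current_epoch"] += 1
        else:
            # (2a-ii) epoch exhausted: roll to a brand-new PID
            state.update(current_pid=next(_pids), current_epoch=0,
                         prev_pid=pid, prev_epoch=epoch)
    elif pid == state["prev_pid"] and epoch == state["prev_epoch"]:
        # (2b) retry of a request we already processed: return as-is
        pass
    else:
        # (2c) a stale/zombie producer: fence it
        raise RuntimeError("INVALID_PRODUCER_EPOCH")
    return state["current_pid"], state["current_epoch"]
```

Under this sketch, the lost-response scenario from above resolves via the (2b) branch: the retry carries the previous PID/epoch pair and simply gets the already-bumped state back.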
> >>
> >>
> >>
> >> -Matthias
> >>
> >>
> >>
> >>
> >> On 4/4/19 3:47 PM, Jason Gustafson wrote:
> >>> Hi Everyone,
> >>>
> >>> Sorry for the long delay on this KIP. I have updated it to include the
> >>> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there
> >>> are no further comments, I will plan to start a vote early next week.
> >>>
> >>> Thanks!
> >>> Jason
> >>>
> >>> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare <adam.bellem...@gmail.com>
> >>> wrote:
> >>>
> >>>> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
> >>>>
> >>>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <adam.bellem...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi John
> >>>>>
> >>>>> What is the status of this KIP?
> >>>>>
> >>>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error
> >>>>> on 2.1.1 for a multitude of our internal topics, and I suspect that a
> >>>>> proper fix is needed.
> >>>>>
> >>>>> Adam
> >>>>>
> >>>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wangg...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Thanks Jason. The proposed solution sounds good to me.
> >>>>>>
> >>>>>>
> >>>>>> Guozhang
> >>>>>>
> >>>>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <ja...@confluent.io>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hey Guozhang,
> >>>>>>>
> >>>>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING
> >>>>>>> error occurs following expiration of the producerId. It's possible
> >>>>>>> that another producerId has been installed in its place following
> >>>>>>> expiration (if another producer instance has become active), or the
> >>>>>>> mapping is empty. We can safely retry the InitProducerId with the
> >>>>>>> logic in this KIP in order to detect which case it is. So I'd
> >>>>>>> suggest something like this:
> >>>>>>>
> >>>>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can
> >>>>>>> send InitProducerId using the current producerId and epoch.
> >>>>>>> 2. If no mapping exists, the coordinator can generate a new
> >>>>>>> producerId and return it. If a transaction is in progress on the
> >>>>>>> client, it will have to be aborted, but the producer can continue
> >>>>>>> afterwards.
> >>>>>>> 3. Otherwise, if a different producerId has been assigned, then we
> >>>>>>> can return INVALID_PRODUCER_ID_MAPPING. To simplify error handling,
> >>>>>>> we can probably raise this as ProducerFencedException since that is
> >>>>>>> effectively what has happened. Ideally this is the only fatal case
> >>>>>>> that users have to handle.
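The coordinator-side decision in the three steps above could look roughly like this (a hypothetical Python sketch; the function name, the mapping shape, and the toy PID allocator are all mine, not the real implementation):

```python
import itertools

_pids = itertools.count(200)  # toy PID allocator for the sketch

def handle_init_pid_after_mapping_error(mapping, client_pid, client_epoch):
    """mapping: the coordinator's (pid, epoch) for the transactional.id,
    or None if the mapping has expired and nothing replaced it."""
    if mapping is None:
        # step 2: mapping is empty, so the retry is safe: issue a new PID
        return ("NEW_PID", (next(_pids), 0))
    pid, epoch = mapping
    if pid == client_pid:
        # still the installed producer: bump the epoch and continue
        return ("OK", (pid, epoch + 1))
    # step 3: another producer took over; surface it as a fencing error
    return ("FENCED", None)
```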
> >>>>>>>
> >>>>>>> I'll give it a little more thought and update the KIP.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Jason
> >>>>>>>
> >>>>>>> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wangg...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> You're right about the dangling txn since it will actually block
> >>>>>>>> read-committed consumers from proceeding at all. I'd agree that
> >>>>>>>> since this is a very rare case, we can consider fixing it not via
> >>>>>>>> broker-side logic but via tooling in a future work.
> >>>>>>>>
> >>>>>>>> I've also discovered some related error handling logic inside the
> >>>>>>>> producer that may be addressed together with this KIP (since it is
> >>>>>>>> mostly for internal implementations, the wiki itself does not need
> >>>>>>>> to be modified):
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
> >>>>>>>>
> >>>>>>>> Guozhang
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <ja...@confluent.io>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Guozhang,
> >>>>>>>>>
> >>>>>>>>> To clarify, the broker does not actually use the ApiVersion API
> >>>>>>>>> for inter-broker communications. The use of an API and its
> >>>>>>>>> corresponding version is controlled by `inter.broker.protocol.version`.
> >>>>>>>>>
> >>>>>>>>> Nevertheless, it sounds like we're on the same page about
> >>>>>>>>> removing DescribeTransactionState. The impact of a dangling
> >>>>>>>>> transaction is a little worse than what you describe, though.
> >>>>>>>>> Consumers with the read_committed isolation level will be stuck.
> >>>>>>>>> Still, I think we agree that this case should be rare and we can
> >>>>>>>>> reconsider for future work. Rather than preventing dangling
> >>>>>>>>> transactions, perhaps we should consider options which allow us
> >>>>>>>>> to detect them and recover. Anyway, this needs more thought. I
> >>>>>>>>> will update the KIP.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jason
> >>>>>>>>>
> >>>>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wangg...@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> 0. My original question is about the implementation details
> >>>>>>>>>> primarily, since currently the handling logic of the
> >>>>>>>>>> APIVersionResponse is simply "use the highest supported version
> >>>>>>>>>> of the corresponding request", but if the returned response from
> >>>>>>>>>> APIVersionRequest says "I don't even know about the
> >>>>>>>>>> DescribeTransactionStateRequest at all", then we need additional
> >>>>>>>>>> logic for the fallback. Currently this logic is embedded in
> >>>>>>>>>> NetworkClient, which is shared by all clients, so I'd like to
> >>>>>>>>>> avoid making this logic more complicated.
> >>>>>>>>>>
> >>>>>>>>>> As for the general issue that a broker does not recognize a
> >>>>>>>>>> producer with sequence number 0, here's my thinking: as you
> >>>>>>>>>> mentioned in the wiki, this is only a concern for the
> >>>>>>>>>> transactional producer, since an idempotent producer can just
> >>>>>>>>>> bump the epoch and go. For a transactional producer, even if the
> >>>>>>>>>> produce request from a fenced producer gets accepted, its
> >>>>>>>>>> transaction will never be committed, and hence its messages are
> >>>>>>>>>> not exposed to read-committed consumers either. The drawbacks
> >>>>>>>>>> are, though: 1) read-uncommitted consumers will still read those
> >>>>>>>>>> messages, 2) unnecessary storage for those fenced produce
> >>>>>>>>>> messages, though in practice this should not accumulate to a
> >>>>>>>>>> large amount since the producer should soon try to commit, be
> >>>>>>>>>> told it is fenced, and then stop, 3) there will never be markers
> >>>>>>>>>> for those transactional messages. Looking at the list and
> >>>>>>>>>> thinking about the likelihood it may happen, assuming we retain
> >>>>>>>>>> the producer state up to transactional.id.timeout (default is
> >>>>>>>>>> 7 days), I feel comfortable leaving it as is.
> >>>>>>>>>>
> >>>>>>>>>> Guozhang
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <ja...@confluent.io>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey Guozhang,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the comments. Responses below:
> >>>>>>>>>>>
> >>>>>>>>>>> 0. The new API is used between brokers, so we govern its usage
> >>>>>>>>>>> using `inter.broker.protocol.version`. If the other broker
> >>>>>>>>>>> hasn't upgraded, we will just fall back to the old logic, which
> >>>>>>>>>>> is to accept the write. This is similar to how we introduced
> >>>>>>>>>>> the OffsetsForLeaderEpoch API. Does that seem reasonable?
> >>>>>>>>>>>
> >>>>>>>>>>> To tell the truth, after digging this KIP up and reading it
> >>>>>>>>>>> over, I am doubting how crucial this API is. It is attempting
> >>>>>>>>>>> to protect a write from a zombie which has just reset its
> >>>>>>>>>>> sequence number after that producer had had its state cleaned
> >>>>>>>>>>> up. However, one of the other improvements in this KIP is to
> >>>>>>>>>>> maintain producer state beyond its retention in the log. I
> >>>>>>>>>>> think that makes this case sufficiently unlikely that we can
> >>>>>>>>>>> leave it for future work. I am not 100% sure this is the only
> >>>>>>>>>>> scenario where transaction state and log state can diverge
> >>>>>>>>>>> anyway, so it would be better to consider this problem more
> >>>>>>>>>>> generally. What do you think?
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Thanks. From memory, the term changed after the first
> >>>>>>>>>>> iteration. I'll make a pass and try to clarify usage.
> >>>>>>>>>>> 2. I was attempting to handle some edge cases since this check
> >>>>>>>>>>> would be asynchronous. In any case, if we drop this validation
> >>>>>>>>>>> as suggested above, then we can ignore this.
> >>>>>>>>>>>
> >>>>>>>>>>> -Jason
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <wangg...@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hello Jason, thanks for the great write-up.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 0. One question about the migration plan: "The new
> >>>>>>>>>>>> GetTransactionState API and the new version of the transaction
> >>>>>>>>>>>> state message will not be used until the inter-broker version
> >>>>>>>>>>>> supports it." I'm not so clear about the implementation
> >>>>>>>>>>>> details here: say a broker is on the newer version and the
> >>>>>>>>>>>> txn-coordinator is still on an older version. Today the
> >>>>>>>>>>>> APIVersionsRequest can only help upgrade / downgrade the
> >>>>>>>>>>>> request version, not forbid sending it at all. Are you
> >>>>>>>>>>>> suggesting we add additional logic on the broker side to
> >>>>>>>>>>>> handle the case of "not sending the request"? If yes, my
> >>>>>>>>>>>> concern is that this will be some tech-debt code that will
> >>>>>>>>>>>> live long before being removed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Some additional minor comments:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring to
> >>>>>>>>>>>> the same thing in your wiki.
> >>>>>>>>>>>> 2. "The broker must verify after receiving the response that
> >>>>>>>>>>>> the producer state is still unknown.": not sure why we have to
> >>>>>>>>>>>> validate? If the metadata returned from the txn-coordinator
> >>>>>>>>>>>> can always be considered the source of truth, can't we just
> >>>>>>>>>>>> blindly use it to update the cache?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <matth...@confluent.io>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I am +1 on this :)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote:
> >>>>>>>>>>>>>> Bump. Thanks to Magnus for noticing that I forgot to link
> >>>>>>>>>>>>>> to the KIP:
> >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Jason
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <ja...@confluent.io>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi All,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I have a proposal to improve the transactional/idempotent
> >>>>>>>>>>>>>>> producer's handling of the UNKNOWN_PRODUCER error, which is
> >>>>>>>>>>>>>>> the result of losing producer state following segment
> >>>>>>>>>>>>>>> removal. The current behavior is both complex and limited.
> >>>>>>>>>>>>>>> Please take a look and let me know what you think.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the
> >>>>>>>>>>>>>>> initial draft.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Jason
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> -- Guozhang
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> -- Guozhang
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> -- Guozhang
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >
>


--
-- Guozhang
