Using the session expiration logic we already have seems like the simplest
option (this is probably a one or two line change). The rejoin should be
quick anyway, so I don't think it's worth optimizing for unjoined new
members. Just my two cents. This is more of an implementation detail, so
need not necessarily be resolved here.

-Jason

On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bche...@outlook.com> wrote:

> Thanks Matthias for the question. I'm thinking of having a separate hash
> set called `registeredMemberIds` which
> will be cleared out every time a group finishes one round of rebalance.
> Since storing one id is pretty trivial, using
> purgatory to track the id removal is a bit wasteful in my opinion.
> ________________________________
> From: Matthias J. Sax <matth...@confluent.io>
> Sent: Friday, November 30, 2018 10:26 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
> request
>
> Thanks! Makes sense.
>
> I missed that fact, that the `member.id` is added on the second
> joinGroup request that contains the `member.id`.
>
> However, it seems there is another race condition for this design:
>
> If two consumers join at the same time, it it possible that the broker
> assigns the same `member.id` to both (because none of them have joined
> the group yet--ie, second joinGroup request not sent yet--, the
> `member.id` is not store broker side yes and broker cannot check for
> duplicates when creating a new `member.id`.
>
> The probability might be fairly low thought. However, what Stanislav
> proposed, to add the `member.id` directly, and remove it after
> `session.timeout.ms` sound like a save option that avoids this issue.
>
> Thoughts?
>
>
> -Matthias
>
> On 11/28/18 8:15 PM, Boyang Chen wrote:
> > Thanks Matthias for the question, and Stanislav for the explanation!
> >
> > For the scenario described, we will never let a member join the
> GroupMetadata map
> > if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
> >
> >   1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
> >   2.  Broker rejects while allocating a member.id to c1 in response (c1
> protocol version is current);
> >   3.  c1 handles the error and rejoins with assigned member.id;
> >   4.  Broker stores c1 in its group metadata;
> >   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
> >   6.  Broker rejects while allocating a member.id to c2 in response (c2
> protocol version is current);
> >   7.  c2 fails to get the response/crashes in the middle;
> >   8.  After certain time, c2 restarts a join request with
> UNKNOWN_MEMBER_ID;
> >
> > As you could see, c2 will repeat step 6~8 until successfully send back a
> join group request with allocated id.
> > By then broker will include c2 within the broker metadata map.
> >
> > Does this sound clear to you?
> >
> > Best,
> > Boyang
> > ________________________________
> > From: Stanislav Kozlovski <stanis...@confluent.io>
> > Sent: Wednesday, November 28, 2018 7:39 PM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> group request
> >
> > Hey Matthias,
> >
> > I think the notion is to have the `session.timeout.ms` to start ticking
> > when the broker responds with the member.id. Then, the broker would
> > properly expire consumers and not hold too many stale ones.
> > This isn't mentioned in the KIP though so it is worth to wait for Boyang
> to
> > confirm
> >
> > On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <matth...@confluent.io>
> > wrote:
> >
> >> Thanks for the KIP Boyang.
> >>
> >> I guess I am missing something, but I am still learning more details
> >> about the rebalance protocol, so maybe you can help me out?
> >>
> >> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
> >> The broker generates a `member.id` and sends it back via
> >> `MEMBER_ID_REQUIRED` error response. This response might never reach the
> >> client or the client fails before it can send the second joinGroup
> >> request. Thus, a client would need to start over with a new
> >> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
> >> generate a new `member.id` again.
> >>
> >> So it seems the problem is moved, but not resolved? The motivation of
> >> the KIP is:
> >>
> >>> The edge case is that if initial join group request keeps failing due
> to
> >> connection timeout, or the consumer keeps restarting,
> >>
> >> From my understanding, this KIP move the issue from the first to the
> >> second joinGroup request (or broker joinGroup response).
> >>
> >> But maybe I am missing something. Can you help me out?
> >>
> >>
> >> -Matthias
> >>
> >>
> >> On 11/27/18 6:00 PM, Boyang Chen wrote:
> >>> Thanks Stanislav and Jason for the suggestions!
> >>>
> >>>
> >>>> Thanks for the KIP. Looks good overall. I think we will need to bump
> the
> >>>> version of the JoinGroup protocol in order to indicate compatibility
> >> with
> >>>> the new behavior. The coordinator needs to know when it is safe to
> >> assume
> >>>> the client will handle the error code.
> >>>>
> >>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> error
> >>>> code. When the client sees this error code, it will take the memberId
> >> from
> >>>> the response and rejoin. We'd still need the protocol bump since older
> >>>> consumers do not have this logic.
> >>>
> >>> I will add the join group protocol version change to the KIP. Meanwhile
> >> I feel for
> >>> understandability it's better to define a separate error code since
> >> REBALANCE_IN_PROGRESS
> >>> is not the actual cause of the returned error.
> >>>
> >>>> One small question I have is now that we have one and a half
> round-trips
> >>>> needed to join in a rebalance (1 full RT addition), is it worth it to
> >>>> consider increasing the default value of `
> >> group.initial.rebalance.delay.ms`?
> >>> I guess we could keep it for now. After KIP-345 and incremental
> >> cooperative rebalancing
> >>> work we should be safe to deprecate `group.initial.rebalance.delay.ms
> `.
> >> Also one round trip
> >>> shouldn't increase the latency too much IMO.
> >>>
> >>> Best,
> >>> Boyang
> >>> ________________________________
> >>> From: Stanislav Kozlovski <stanis...@confluent.io>
> >>> Sent: Wednesday, November 28, 2018 2:32 AM
> >>> To: dev@kafka.apache.org
> >>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> >> group request
> >>>
> >>> Hi Boyang,
> >>>
> >>> The KIP looks very good.
> >>> One small question I have is now that we have one and a half
> round-trips
> >>> needed to join in a rebalance (1 full RT addition), is it worth it to
> >>> consider increasing the default value of `
> >> group.initial.rebalance.delay.ms`?
> >>>
> >>> Best,
> >>> Stanislav
> >>>
> >>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
> >> wrote:
> >>>
> >>>> Hi Boyang,
> >>>>
> >>>> Thanks for the KIP. Looks good overall. I think we will need to bump
> the
> >>>> version of the JoinGroup protocol in order to indicate compatibility
> >> with
> >>>> the new behavior. The coordinator needs to know when it is safe to
> >> assume
> >>>> the client will handle the error code.
> >>>>
> >>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> error
> >>>> code. When the client sees this error code, it will take the memberId
> >> from
> >>>> the response and rejoin. We'd still need the protocol bump since older
> >>>> consumers do not have this logic.
> >>>>
> >>>> Thanks,
> >>>> Jason
> >>>>
> >>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bche...@outlook.com>
> >> wrote:
> >>>>
> >>>>> Hey friends,
> >>>>>
> >>>>>
> >>>>> I would like to start a discussion thread for KIP-394 which is trying
> >> to
> >>>>> mitigate broker cache bursting issue due to anonymous join group
> >>>> requests:
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>
> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C046b858d134f4c1c576108d655277552%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636790025286277394&amp;sdata=QyBzqnislL%2B9fK1mXaRuJ0xpi9Y2JDvHrM881hzjq3A%3D&amp;reserved=0
> >>>>>
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> Boyang
> >>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Best,
> >>> Stanislav
> >>>
> >>
> >>
> >
> > --
> > Best,
> > Stanislav
> >
>
>

Reply via email to