Thanks! Makes sense.

I missed that fact, that the `member.id` is added on the second
joinGroup request that contains the `member.id`.

However, it seems there is another race condition for this design:

If two consumers join at the same time, it it possible that the broker
assigns the same `member.id` to both (because none of them have joined
the group yet--ie, second joinGroup request not sent yet--, the
`member.id` is not store broker side yes and broker cannot check for
duplicates when creating a new `member.id`.

The probability might be fairly low thought. However, what Stanislav
proposed, to add the `member.id` directly, and remove it after
`session.timeout.ms` sound like a save option that avoids this issue.

Thoughts?


-Matthias

On 11/28/18 8:15 PM, Boyang Chen wrote:
> Thanks Matthias for the question, and Stanislav for the explanation!
> 
> For the scenario described, we will never let a member join the GroupMetadata 
> map
> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
> 
>   1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
>   2.  Broker rejects while allocating a member.id to c1 in response (c1 
> protocol version is current);
>   3.  c1 handles the error and rejoins with assigned member.id;
>   4.  Broker stores c1 in its group metadata;
>   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
>   6.  Broker rejects while allocating a member.id to c2 in response (c2 
> protocol version is current);
>   7.  c2 fails to get the response/crashes in the middle;
>   8.  After certain time, c2 restarts a join request with UNKNOWN_MEMBER_ID;
> 
> As you could see, c2 will repeat step 6~8 until successfully send back a join 
> group request with allocated id.
> By then broker will include c2 within the broker metadata map.
> 
> Does this sound clear to you?
> 
> Best,
> Boyang
> ________________________________
> From: Stanislav Kozlovski <stanis...@confluent.io>
> Sent: Wednesday, November 28, 2018 7:39 PM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group 
> request
> 
> Hey Matthias,
> 
> I think the notion is to have the `session.timeout.ms` to start ticking
> when the broker responds with the member.id. Then, the broker would
> properly expire consumers and not hold too many stale ones.
> This isn't mentioned in the KIP though so it is worth to wait for Boyang to
> confirm
> 
> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <matth...@confluent.io>
> wrote:
> 
>> Thanks for the KIP Boyang.
>>
>> I guess I am missing something, but I am still learning more details
>> about the rebalance protocol, so maybe you can help me out?
>>
>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
>> The broker generates a `member.id` and sends it back via
>> `MEMBER_ID_REQUIRED` error response. This response might never reach the
>> client or the client fails before it can send the second joinGroup
>> request. Thus, a client would need to start over with a new
>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
>> generate a new `member.id` again.
>>
>> So it seems the problem is moved, but not resolved? The motivation of
>> the KIP is:
>>
>>> The edge case is that if initial join group request keeps failing due to
>> connection timeout, or the consumer keeps restarting,
>>
>> From my understanding, this KIP move the issue from the first to the
>> second joinGroup request (or broker joinGroup response).
>>
>> But maybe I am missing something. Can you help me out?
>>
>>
>> -Matthias
>>
>>
>> On 11/27/18 6:00 PM, Boyang Chen wrote:
>>> Thanks Stanislav and Jason for the suggestions!
>>>
>>>
>>>> Thanks for the KIP. Looks good overall. I think we will need to bump the
>>>> version of the JoinGroup protocol in order to indicate compatibility
>> with
>>>> the new behavior. The coordinator needs to know when it is safe to
>> assume
>>>> the client will handle the error code.
>>>>
>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
>>>> code. When the client sees this error code, it will take the memberId
>> from
>>>> the response and rejoin. We'd still need the protocol bump since older
>>>> consumers do not have this logic.
>>>
>>> I will add the join group protocol version change to the KIP. Meanwhile
>> I feel for
>>> understandability it's better to define a separate error code since
>> REBALANCE_IN_PROGRESS
>>> is not the actual cause of the returned error.
>>>
>>>> One small question I have is now that we have one and a half round-trips
>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>>> consider increasing the default value of `
>> group.initial.rebalance.delay.ms`?
>>> I guess we could keep it for now. After KIP-345 and incremental
>> cooperative rebalancing
>>> work we should be safe to deprecate `group.initial.rebalance.delay.ms`.
>> Also one round trip
>>> shouldn't increase the latency too much IMO.
>>>
>>> Best,
>>> Boyang
>>> ________________________________
>>> From: Stanislav Kozlovski <stanis...@confluent.io>
>>> Sent: Wednesday, November 28, 2018 2:32 AM
>>> To: dev@kafka.apache.org
>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
>> group request
>>>
>>> Hi Boyang,
>>>
>>> The KIP looks very good.
>>> One small question I have is now that we have one and a half round-trips
>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>> consider increasing the default value of `
>> group.initial.rebalance.delay.ms`?
>>>
>>> Best,
>>> Stanislav
>>>
>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
>> wrote:
>>>
>>>> Hi Boyang,
>>>>
>>>> Thanks for the KIP. Looks good overall. I think we will need to bump the
>>>> version of the JoinGroup protocol in order to indicate compatibility
>> with
>>>> the new behavior. The coordinator needs to know when it is safe to
>> assume
>>>> the client will handle the error code.
>>>>
>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
>>>> code. When the client sees this error code, it will take the memberId
>> from
>>>> the response and rejoin. We'd still need the protocol bump since older
>>>> consumers do not have this logic.
>>>>
>>>> Thanks,
>>>> Jason
>>>>
>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bche...@outlook.com>
>> wrote:
>>>>
>>>>> Hey friends,
>>>>>
>>>>>
>>>>> I would like to start a discussion thread for KIP-394 which is trying
>> to
>>>>> mitigate broker cache bursting issue due to anonymous join group
>>>> requests:
>>>>>
>>>>>
>>>>>
>>>>
>> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C046b858d134f4c1c576108d655277552%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636790025286277394&amp;sdata=QyBzqnislL%2B9fK1mXaRuJ0xpi9Y2JDvHrM881hzjq3A%3D&amp;reserved=0
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Boyang
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best,
>>> Stanislav
>>>
>>
>>
> 
> --
> Best,
> Stanislav
> 

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to