@Becket: I am not sure if I understand this correctly. Instead of applying a
fixed delay that starts when the first consumer of an (empty) group joins,
you suggest re-triggering/re-setting the delay each time a new consumer joins?
This sounds like a good strategy to me, if the config is on the broker side.

@Eno: I think that's a valid point and I like this idea!

-Matthias

On 3/24/17 1:23 PM, Eno Thereska wrote:
> Thanks Damian,
>
> This KIP deals with the initial phase only. What about the cases when several
> consumers leave a group? Won't there be several expensive rebalances then as
> well? I'm wondering if it makes sense for the delay to hold anytime the "set"
> of consumers in a group changes, be it addition to the group or removal from
> the group.
>
> Thanks
> Eno
>
>> On 24 Mar 2017, at 20:04, Becket Qin <[email protected]> wrote:
>>
>> Thanks for the KIP, Damian.
>>
>> My two cents on this. It seems there are two things worth thinking about here:
>>
>> 1. Better rebalance timing. We will try to rebalance only when all the
>> consumers in a group have joined. The challenge is that someone has to
>> define what ALL consumers means; it could be either a time or a number of
>> consumers, etc.
>>
>> 2. Avoid frequent rebalances. For example, if there are 100 consumers in a
>> group, today, in the worst case, we may end up with 100 rebalances even if
>> all the consumers joined the group in a reasonably small amount of time.
>> Frequent rebalances are also a bad thing for brokers.
>>
>> Having a client side configuration may solve problem 1 better because each
>> consumer group can potentially configure its own timing. However, it does
>> not really prevent frequent rebalances in general because some of the
>> consumers can be misconfigured. (This may have something to do with KIP-124
>> as well. But if a quota is applied on the JoinGroup/SyncGroup request it may
>> cause some unwanted cascading effects.)
>>
>> Having a broker side configuration may result in less flexibility for each
>> consumer group, but it can prevent frequent rebalances better. I think with
>> some reasonable design, the rebalance timing issue can be resolved on the
>> broker side as well. Matthias had a good point on extending the delay when
>> a new consumer joins a group (we actually did something similar to batch
>> ISR change propagation). For example, let's say on the broker side, we will
>> always delay 2 seconds each time we see a new consumer joining a consumer
>> group. This would probably work for most of the consumer groups and will
>> also limit the rebalance frequency to protect the brokers.
>>
>> I am not sure about the streams use case here, but if something like 2
>> seconds of delay is acceptable for streams, I would prefer adding the
>> configuration to the broker so that we can address both problems.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Fri, Mar 24, 2017 at 5:30 AM, Damian Guy <[email protected]> wrote:
>>
>>> Thanks for the feedback.
>>>
>>> Ewen: I'm happy to make it a client side config. Other than the protocol
>>> bump I think the effort is almost the same. Personally I see no other
>>> issues, but based on discussions with others this is what we came up with.
>>>
>>> True, it can probably be tested easily via an integration test.
>>>
>>> Matthias: Yes I agree, the delay could be extended as each new member joins
>>> the group.
>>>
>>> Thanks,
>>> Damian
>>>
>>> On Fri, 24 Mar 2017 at 05:14 Ewen Cheslack-Postava <[email protected]> wrote:
>>>
>>>> I have the same initial response as Ismael re: broker vs consumer settings.
>>>> The global setting seems questionable.
>>>>
>>>> Could we maybe summarize what the impact of making this a client config
>>>> would be? Protocol bump is obvious, but is there any other significant
>>>> issue? For the protocol bump in particular, I think this change is
>>>> currently really critical for streams; it will be valuable elsewhere, but
>>>> the immediate demand is streams, so a protocol bump while being backwards
>>>> compatible wouldn't affect any other clients. Is this still actually
>>>> compatible with different clients given that they would now expect
>>>> different timeouts? (I think it's strictly compatible if you wait for
>>>> responses, but if you enforce any client side timeouts, I'm not so sure.)
>>>>
>>>> re: test plan, I'm sure this will come as a surprise, but is the system
>>>> test even necessary? Validating # of rebalances seems messy as other things
>>>> can cause rebalances (though admittedly not in a "clean" case). But really
>>>> it seems like an integration test could validate this by making sure only 1
>>>> rebalance occurred when 2 members joined with a sufficient time gap.
>>>>
>>>> -Ewen
>>>>
>>>> On Thu, Mar 23, 2017 at 3:53 PM, Matthias J. Sax <[email protected]> wrote:
>>>>
>>>>> Thanks for the KIP Damian!
>>>>>
>>>>> My two cents:
>>>>>
>>>>> - we should have an explicit parameter for this -- implicit settings are
>>>>> always tricky (the "importance" of this parameter would be LOW)
>>>>>
>>>>> - the config should be different for each consumer group:
>>>>> * assume you have a stateless app, you want to rebalance immediately
>>>>> * if you start up in a virtualized environment using some tools like
>>>>> Mesos you might need a different value than on bare metal (no VM to be
>>>>> started)
>>>>> * it also depends on how many consumer instances you expect -- it's
>>>>> harder to start up 100 instances in 3 seconds than 5
>>>>>
>>>>> - the default value should be zero
>>>>>
>>>>> One more thought: what about scaling scenarios? If a consumer group has
>>>>> 10 instances and should be scaled up to 20, it would make sense to do
>>>>> this with a single rebalance, too. Thus, I am wondering if it would
>>>>> make sense to apply this delay each time a new consumer joins a group,
>>>>> even if the group is not empty?
>>>>>
>>>>> -Matthias
>>>>>
>>>>> On 3/23/17 10:19 AM, Damian Guy wrote:
>>>>>> Thanks Guozhang - I think another problem with this is that it is
>>>>>> overloading session.timeout.ms to mean multiple things. I'm not sure
>>>>>> that is a good thing.
>>>>>>
>>>>>> On Thu, 23 Mar 2017 at 17:14 Guozhang Wang <[email protected]> wrote:
>>>>>>
>>>>>>> The downside of it, though, is that although it "hides" this from most
>>>>>>> of the users needing to be aware of it, by default session timeout i.e.
>>>>>>> the rebalance timeout is 10 seconds which could arguably be too long.
>>>>>>>
>>>>>>> Guozhang
>>>>>>>
>>>>>>> On Thu, Mar 23, 2017 at 10:12 AM, Guozhang Wang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Just throwing another alternative idea here: we can consider using the
>>>>>>>> rebalance timeout value which is already included in the join request
>>>>>>>> protocol (and on the current Java client it is always written as the
>>>>>>>> session timeout value), so that the first member joining will always
>>>>>>>> force the coordinator to wait that long. By doing this we do not need
>>>>>>>> to bump up the protocol either.
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>> On Thu, Mar 23, 2017 at 5:49 AM, Damian Guy <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Ismael,
>>>>>>>>>
>>>>>>>>> Mostly to avoid the protocol bump.
>>>>>>>>>
>>>>>>>>> I agree that it may be difficult to choose the right delay for all
>>>>>>>>> consumer groups, but we wanted to make this something that most users
>>>>>>>>> don't really need to think about, i.e., a small enough default delay
>>>>>>>>> that works in the majority of cases. However it would be much more
>>>>>>>>> flexible as a consumer config, which I'm happy to pursue if this
>>>>>>>>> change is worthy of a protocol bump.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Damian
>>>>>>>>>
>>>>>>>>> On Thu, 23 Mar 2017 at 12:35 Ismael Juma <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the KIP, Damian. It makes sense to avoid multiple
>>>>>>>>>> rebalances during start-up. One issue with having this as a broker
>>>>>>>>>> config is that it may be difficult to choose the right delay for all
>>>>>>>>>> consumer groups. Can you elaborate a little more on why the first
>>>>>>>>>> alternative (add a consumer config) was rejected? We bump protocol
>>>>>>>>>> versions regularly (when it makes sense), so it would be good to get
>>>>>>>>>> a bit more detail.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ismael
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 23, 2017 at 12:24 PM, Damian Guy <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I've prepared a KIP to add a configurable delay to the initial
>>>>>>>>>>> consumer group rebalance.
>>>>>>>>>>>
>>>>>>>>>>> Please have a look here:
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-134%3A+Delay+initial+consumer+group+rebalance
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Damian
>>>>>>>>>>>
>>>>>>>>>>> BTW, I apologize if this appears twice. Seems the first one may
>>>>>>>>>>> have not made it.
>>>>>>>>
>>>>>>>> --
>>>>>>>> -- Guozhang
>>>>>>>
>>>>>>> --
>>>>>>> -- Guozhang
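
For illustration only, here is a rough sketch of the "extend the delay on every join" idea discussed in this thread, as opposed to a fixed delay started by the first member. It is a toy model, not coordinator code: the function name `rebalance_times`, the millisecond timestamps, and the rule that a join arriving exactly at the deadline still extends it are all assumptions made for the example.

```python
from typing import List


def rebalance_times(join_times_ms: List[int], delay_ms: int) -> List[int]:
    """Return the times (ms) at which a rebalance would fire.

    Toy model (hypothetical, not Kafka's implementation): every join
    re-arms a timer to `join time + delay_ms`; a rebalance fires when the
    timer expires without another join arriving first.
    """
    fired: List[int] = []
    deadline = None  # pending rebalance deadline, if any
    for t in sorted(join_times_ms):
        if deadline is not None and t > deadline:
            # The pending timer expired before this join arrived.
            fired.append(deadline)
        # Start or extend the delay for this join.
        deadline = t + delay_ms
    if deadline is not None:
        fired.append(deadline)
    return fired


# 100 consumers joining 100 ms apart.
joins = [i * 100 for i in range(100)]

# No delay: every join triggers its own rebalance (the worst case above).
print(len(rebalance_times(joins, 0)))

# 2 second delay, extended on each join: one rebalance after the last join.
print(rebalance_times(joins, 2000))
```

Under these assumptions, a later scale-up burst (say, joins at 0 ms, 500 ms and 10000 ms with a 2000 ms delay) produces one rebalance per burst rather than one per member. The sketch also shows a side effect: a steady trickle of joins keeps pushing the deadline out, and nothing in this thread specifies an upper bound for that.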
