[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961597#comment-14961597 ]

ASF GitHub Bot commented on KAFKA-2397:

Github user asfgit closed the pull request at:

    https://github.com/apache/kafka/pull/103

> leave group request
> -------------------
>
>                 Key: KAFKA-2397
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2397
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: consumer
>            Reporter: Onur Karaman
>            Assignee: Onur Karaman
>            Priority: Minor
>             Fix For: 0.9.0.0
>
> Let's say every consumer in a group has session timeout s. Currently, if a
> consumer leaves the group, the worst case time to stabilize the group is 2s
> (s to detect the consumer failure + s for the rebalance window). If a
> consumer instead can declare they are leaving the group, the worst case time
> to stabilize the group would just be the s associated with the rebalance
> window.
> This is a low priority optimization!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
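The timing argument in the issue description can be checked with a few lines of arithmetic. The following is a minimal sketch, not Kafka code: the function name is hypothetical, and the 30-second value is only an assumed session timeout used for illustration.

```python
def worst_case_stabilize_seconds(session_timeout_s: float, explicit_leave: bool) -> float:
    """Worst-case time for a group to stabilize after a member departs.

    Without an explicit leave, the coordinator may need a full session
    timeout to detect the departure, plus up to one more session timeout
    for the rebalance window. With an explicit leave, detection is
    immediate and only the rebalance window remains.
    """
    detection = 0.0 if explicit_leave else session_timeout_s
    rebalance_window = session_timeout_s
    return detection + rebalance_window


s = 30.0  # assumed session timeout, in seconds
print(worst_case_stabilize_seconds(s, explicit_leave=False))  # 2 * s
print(worst_case_stabilize_seconds(s, explicit_leave=True))   # s
```

Under these assumptions the explicit leave halves the worst case, which matches the "2s versus s" claim in the description.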
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957784#comment-14957784 ]

Onur Karaman commented on KAFKA-2397:

Cool. It sounds like we all generally agree on the explicit request. Does a committer want to review the pull request?
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957718#comment-14957718 ]

Guozhang Wang commented on KAFKA-2397:

For consumer shutdown, the ZK-based consumer immediately deletes its ephemeral node so that other members are notified; hence, without this fix, the new consumer effectively introduces a regression.

For consumer hard-failure cases, though, the old ZK-based consumer also takes a session timeout period to detect the failure, so I feel that modifying the socket server to propagate client-id information to the coordinator in order to improve on this case may be overkill.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957550#comment-14957550 ]

Jay Kreps commented on KAFKA-2397:

Dunno if we closed the loop on the approach.

[~onurkaraman] Yeah, the pro of the TCP approach is that all clients get it automatically, even in hard app failure cases. The downside is that the implementation on the server side is more involved, and there is some risk of unnecessary rebalances if there are situations that cause the connection to be lost.

Another aspect is the ability to implement a shutdown without rebalance for quick restarts. This is particularly useful for stream processing, where there is associated state that takes work to rebuild. I don't think this can be implemented easily with the TCP connection approach.

I think I'm on board with doing it explicitly. I think the other question is whether it is cleaner to add a field to the ack or make a custom request. I don't have a strong opinion on that, though in general I think fewer requests is better.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955673#comment-14955673 ]

Onur Karaman commented on KAFKA-2397:

My pull request had diverged again from trunk, so I force pushed a rebase that just cleans up the conflicts.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731373#comment-14731373 ]

Jiangjie Qin commented on KAFKA-2397:

[~jkreps] Using TCP close to signal disconnect does have merits. It works whether the client process crashes or closes normally. It is just not very clear to me whether it is worth doing here. The price we pay is that we have to propagate every connection close at the network layer to the coordinator. From the server logs I saw at LinkedIn, socket closure is quite frequent. Todd even submitted a patch to change that particular log to debug level. Those closures could just be the ad-hoc SyncProducer in the old consumer refreshing metadata. Maybe I'm overly concerned, but I am a bit worried about the noise here.

I don't know in which cases a TCP connection might be closed. A proxy was mentioned earlier; maybe also some load balancer, firewall, gateway, etc. I feel it might be another unnecessary assumption/dependency that is not buying us much.

Another thing I am not sure about is how often an application process crashes, other than when people do a kill -9. In most cases there are multiple threads in an application. If an uncaught exception is thrown, usually only that thread dies and the process hangs but does not exit unless people make it exit explicitly, as mirror maker does. In that case, is it reasonable to expect client.close() to be called in the application shutdown hook or some finally block? (It may not be the case for some other languages like C, though.) If using TCP close mainly addresses kill -9, it is very likely that the session timeout has already been reached by the time people manually kill the process.
> leave group request > --- > > Key: KAFKA-2397 > URL: https://issues.apache.org/jira/browse/KAFKA-2397 > Project: Kafka > Issue Type: Sub-task > Components: consumer >Reporter: Onur Karaman >Assignee: Onur Karaman >Priority: Minor > Fix For: 0.8.3 > > > Let's say every consumer in a group has session timeout s. Currently, if a > consumer leaves the group, the worst case time to stabilize the group is 2s > (s to detect the consumer failure + s for the rebalance window). If a > consumer instead can declare they are leaving the group, the worst case time > to stabilize the group would just be the s associated with the rebalance > window. > This is a low priority optimization! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731364#comment-14731364 ]

Onur Karaman commented on KAFKA-2397:

Here's a summary of what I think each approach has to offer over the other.

Pros of LeaveGroupRequest:
1. Simplicity. I think the logic is pretty straightforward, keeps the network and API layers separated, fits well with existing patterns, and doesn't require a complicated refactoring.
2. Opens up the possibility for tooling that lets you kick out a consumer and trigger a rebalance. This might be a useful admin tool for when things go wrong.
3. Opens up the possibility for rolling bounces without triggering a rebalance. We can modify the consumer to have a close and closeNow (close sends out a LeaveGroupRequest, closeNow doesn't). The application can persist the consumer id somewhere. The consumer can initially try out the persisted consumer id after it comes back up.

Pros of TCP disconnect:
1. Rebalance gets triggered on process death. This would be a con if you want the possibility for rolling bounces without triggering a rebalance.

P.S.: I'm going to be on vacation from tonight until Tuesday, so my responsiveness will be a bit spotty during that time. I think Jiangjie may be in a similar situation.
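The close/closeNow idea and the persisted consumer id can be illustrated with a toy model. This is a rough sketch under loud assumptions: `ToyCoordinator` and `ToyConsumer` are hypothetical in-memory stand-ins, not Kafka classes, session timeouts are ignored entirely, and a rebalance is simply counted whenever membership changes.

```python
class ToyCoordinator:
    """Hypothetical stand-in for the group coordinator (no timeouts modeled)."""

    def __init__(self):
        self.members = set()
        self.rebalances = 0
        self._next_id = 0

    def join(self, consumer_id=None):
        # A still-registered id rejoins without changing membership.
        if consumer_id is not None and consumer_id in self.members:
            return consumer_id
        # Otherwise hand out a fresh id; membership changed, so rebalance.
        new_id = f"consumer-{self._next_id}"
        self._next_id += 1
        self.members.add(new_id)
        self.rebalances += 1
        return new_id

    def leave(self, consumer_id):
        # Models the effect of a LeaveGroupRequest.
        if consumer_id in self.members:
            self.members.remove(consumer_id)
            self.rebalances += 1


class ToyConsumer:
    def __init__(self, coordinator, persisted_id=None):
        self.coordinator = coordinator
        self.consumer_id = coordinator.join(persisted_id)

    def close(self):
        # Graceful departure: tell the coordinator explicitly.
        self.coordinator.leave(self.consumer_id)

    def close_now(self):
        # Quick restart: send nothing, so the membership entry survives.
        pass


coordinator = ToyCoordinator()
consumer = ToyConsumer(coordinator)
saved_id = consumer.consumer_id          # the application persists this
consumer.close_now()                     # bounce without leaving
consumer = ToyConsumer(coordinator, persisted_id=saved_id)
print(coordinator.rebalances)            # only the initial join counted
```

In a real deployment the persisted id would only help if the consumer came back before its session timed out; that part is deliberately left out of the sketch.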
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731091#comment-14731091 ]

Jay Kreps commented on KAFKA-2397:

[~hachikuji] Makes sense.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731073#comment-14731073 ]

Jason Gustafson commented on KAFKA-2397:

[~jkreps] Not having a way to properly leave the group is pretty painful in testing, which I imagine users are going to be doing a lot of initially. I think it also makes rolling upgrades trickier, since without it you have to allow additional time for the group to stabilize after each machine is upgraded. The ideal workflow to minimize rebalance overhead would probably be to shut down one instance, let the group stabilize, then restart it. If you just restart the instance, the whole group has to pause until the old member's session timeout has expired (although you can also get around this by persisting the consumer id).

Anyway, I'd rather have something if possible, but I agree that it could be pushed to another release if we think the TCP option is the way forward.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731052#comment-14731052 ]

Jay Kreps commented on KAFKA-2397:

Couple thoughts:
1. [~hachikuji] Does this need to be in the next release? This is really an optimization we can do at any time right? How bad of a user experience is it not to have it?
2. Does anyone have a concrete idea of where using TCP close to signal disconnect falls short? [~becket_qin] I think you are saying this is a problem but when is it actually a problem? This might be one where broader input could help...
3. We shouldn't end up with two different ways to do the same thing just because two ways have been proposed and we aren't sure yet which is best. This just means we aren't done thinking through the design. I think likely zero is better than two, right?
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730633#comment-14730633 ]

Onur Karaman commented on KAFKA-2397:

My pull request had diverged from trunk, so I force pushed a rebase that just cleans up the conflicts.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729800#comment-14729800 ]

Jason Gustafson commented on KAFKA-2397:

[~becket_qin] I'm not sure anyone was suggesting replacing the session timeout if TCP disconnect was used to signal group departure. I think you need the session timeout regardless of whether we have an explicit leave group request or use the TCP disconnect. I also share some concern about #3, but I don't actually know of any cases where network issues will cause a disconnect.

In general, my feeling is that the advantages of the TCP disconnect (in particular, the ability to detect hard crashes more swiftly) are not worth the cost of exposing the lower-level network layer in the consumer coordinator. At the moment, however, my main concern is more pragmatic: the window for a big change like that is starting to close.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729769#comment-14729769 ]

Jiangjie Qin commented on KAFKA-2397:

[~ewencp] [~hachikuji] Some thoughts on this. I agree with [~ewencp] that we should follow one protocol, not both. Personally, I like the explicit leave group request better.

The goals we want to achieve are:
1. When a consumer actually dies, we don't want to wait too long before a rebalance.
2. When a consumer exits normally, we want to trigger a rebalance soon.
3. If there are jitters, network issues, etc., we want to have some tolerance for them.

Using the TCP connection to signify liveness will satisfy 2. For 1, it won't work if the TCP connection timeout is very long; that's why we introduced the session timeout. For 3, using the TCP connection to signify liveness might cause problems.

With an explicit leave group request, it is clear that a member is excluded from a group only when it exits normally or its session times out. So all three goals are met.

An important related scenario worth thinking about is bouncing a consumer. Without a leave group request, it is possible to bounce a client without triggering a rebalance, as long as the consumer shuts down and then comes back before the session timeout. If we send a leave group request explicitly, bouncing a consumer means there will be two rebalances (which I think is the correct behavior). So making rebalances cheap and fast is very important.
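The three goals above, and the two-rebalance bounce, can be traced with a small simulation. This is only a sketch under stated assumptions: `SessionTrackingCoordinator` is a hypothetical class, time is passed in explicitly as seconds rather than read from a clock, and a "rebalance" is just a counter bumped on each membership change.

```python
class SessionTrackingCoordinator:
    """Hypothetical coordinator: membership via heartbeats, leave, and expiry."""

    def __init__(self, session_timeout_s):
        self.session_timeout_s = session_timeout_s
        self.last_heartbeat = {}  # member id -> time of last heartbeat
        self.rebalances = 0

    def heartbeat(self, member_id, now):
        # The first heartbeat from an unknown member acts as a join.
        if member_id not in self.last_heartbeat:
            self.rebalances += 1
        self.last_heartbeat[member_id] = now

    def leave_group(self, member_id):
        # Explicit departure: remove immediately (goal 2).
        if self.last_heartbeat.pop(member_id, None) is not None:
            self.rebalances += 1

    def expire_sessions(self, now):
        # Silent members are removed only after the session timeout
        # (goal 1), which also tolerates short gaps (goal 3).
        expired = [m for m, t in self.last_heartbeat.items()
                   if now - t > self.session_timeout_s]
        for m in expired:
            del self.last_heartbeat[m]
        if expired:
            self.rebalances += 1
        return expired


coordinator = SessionTrackingCoordinator(session_timeout_s=30.0)
coordinator.heartbeat("a", now=0.0)          # join
print(coordinator.expire_sessions(now=20.0))  # 20 s of silence is tolerated
coordinator.leave_group("a")                 # bounce: explicit leave ...
coordinator.heartbeat("a", now=21.0)         # ... then rejoin
print(coordinator.rebalances)                # join + leave + rejoin
```

The bounce costs two extra membership changes in this model, which is the behavior the comment describes as correct for an explicit leave.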
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728410#comment-14728410 ]

Ewen Cheslack-Postava commented on KAFKA-2397:

[~hachikuji] My primary objection to that plan is that it might lead to us maintaining more complicated code if we support leave group via two mechanisms instead of one, and it's also more stuff the user has to understand.

On the other hand, I can see a case for supporting both: explicit leave group via a message is great for forcing the coordinator to trigger a rebalance ASAP, whereas an implicit leave group is a nice way to allow for fast reconnect in the case of a network hiccup without affecting membership/requiring another round but also allows the broker to boot the consumer from the group without waiting for the full session interval (and the client can also take this into account, stopping consumption after a heartbeat interval during which it cannot connect rather than waiting for a full session timeout). But since I'm not really clear on when we'd see such network hiccups that wouldn't be masked by TCP anyway, I'm not sure this is worth the more complicated model. It does sound like it's probably complicated -- or at least a lot of code changes -- to make the lower level connection management and higher level protocol stuff coordinate.

Since this issue actually slows things down for me on a daily basis now, I think the explicit leave group would make sense to get committed.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727809#comment-14727809 ]

Jason Gustafson commented on KAFKA-2397:

Bumping this issue. One nice thing about the current patch is its simplicity (should be similar with the un-heartbeat approach). I wonder if it would be a bad thing to support explicit group departure with this patch and implicit departure with TCP disconnect? Then we could let this patch go through and consider the TCP disconnect in another JIRA.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696602#comment-14696602 ]

Ewen Cheslack-Postava commented on KAFKA-2397:

[~becket_qin] I was more worried about figuring out what behavior was preferable first, then figuring out how to make it work with our code. I realize we'd need to expose some events in the lower level up to the code layered on it, but I don't see anything wrong with doing that; it just requires tracking some more state and relaying events as you described. Kicking the member out based on TCP disconnection seemed to cover more cases, so unless there was a problem with it, I figured it's worth the effort to try to make it work that way.

Any system tests that forcibly kill Copycat workers are going to have the same issues I'm running into now. That isn't a huge problem since it's OK for some tests to take a long time, but it does have other impacts as well. For example, a crashed process will hold on to any assignments it has for up to the full session timeout, during which those assignments will not be processed (which, for Copycat, could potentially mean 30s worth of data lost if the source data is ephemeral, such as metrics).

[~hachikuji] I thought about proxies, but I couldn't come up with a scenario where the TCP connection to the coordinator would be closed due to a very short transient issue. But I definitely won't claim I know that it will never be the case or that I know all the weird things proxies might do under a variety of scenarios or configurations...

One problem with requiring an explicit leave group request/flag is that any crash still takes a lot of time to free up assigned partitions and keeps any members who are behaving properly from continuing to process their assigned work (since they discover the need for a rebalance, invoke the rebalance revoked callback, and join the group immediately). This means any process that crashes can gum up the works for all the other processes. And some people prefer the fail-fast approach of crashing and recovering by restarting the process, so while they would obviously prefer crashes not happen, they also might expect to encounter this scenario semi-regularly and then find things grinding to a halt for 30s at a time.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696551#comment-14696551 ]

Onur Karaman commented on KAFKA-2397:

Sooo... ship it?!
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696351#comment-14696351 ]

Jason Gustafson commented on KAFKA-2397:

[~ewencp] The only case I can come up with is an application timeout on the client (e.g. if the heartbeat was delayed by a transient network issue), and that can be fixed by always ensuring that the timeout for coordinator requests is longer than the session timeout. My unease mostly has to do with proxy/tunnel situations where I don't know that TCP always behaves properly. Perhaps you know enough to know whether this is an issue?

In any case, it seems like we all agree that we need a way to leave the group properly. My preference is probably for Onur's patch as it stands now.
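The rule that the coordinator request timeout should exceed the session timeout is easy to express as a configuration check. The sketch below is illustrative only: `validate_timeouts` is a hypothetical helper, and although the key names mirror the Java consumer's `session.timeout.ms` and `request.timeout.ms` settings, the dictionary-based config is an assumption for the example.

```python
def validate_timeouts(config: dict) -> None:
    """Reject configs where a request could time out before the session does."""
    session_ms = config["session.timeout.ms"]
    request_ms = config["request.timeout.ms"]
    if request_ms <= session_ms:
        raise ValueError(
            f"request.timeout.ms ({request_ms}) must be greater than "
            f"session.timeout.ms ({session_ms})"
        )


# Passes: the client waits longer for a response than the session lasts.
validate_timeouts({"session.timeout.ms": 30000, "request.timeout.ms": 40000})

try:
    # Fails: a delayed heartbeat response could trigger a client-side
    # timeout (and a disconnect) while the session is still valid.
    validate_timeouts({"session.timeout.ms": 30000, "request.timeout.ms": 10000})
except ValueError as err:
    print(err)
```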
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696334#comment-14696334 ]

Jiangjie Qin commented on KAFKA-2397:

[~ewencp] Are you saying that the coordinator should kick the consumer out of the group once its TCP connection is closed? I think the problem is that this breaks the layering we have on the broker side. TCP connections are maintained only by SocketServer and are not exposed to KafkaApiThreads, so SocketServer does not know which consumer a particular connection is associated with. If you want the coordinator to know about TCP connection closure, the coordinator needs to keep a consumer-socket map, SocketServer needs to put an event back on the request queue to signal a disconnection, and the coordinator needs to check whether that socket is associated with a consumer.
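The extra plumbing described in this comment (a consumer-socket map plus disconnect events on the request queue) might look roughly like the following. Everything here is hypothetical: `ConnectionAwareCoordinator`, `drain`, and the tuple-shaped events are inventions for illustration and do not reflect the broker's actual classes.

```python
import queue


class ConnectionAwareCoordinator:
    """Hypothetical coordinator that tracks which connection serves which consumer."""

    def __init__(self):
        self.consumer_by_connection = {}  # the consumer-socket map
        self.members = set()

    def register(self, connection_id, consumer_id):
        self.consumer_by_connection[connection_id] = consumer_id
        self.members.add(consumer_id)

    def on_disconnect(self, connection_id):
        # Most closed connections belong to producers or metadata
        # refreshes, so a missing entry is the common case.
        consumer_id = self.consumer_by_connection.pop(connection_id, None)
        if consumer_id is None:
            return None
        self.members.discard(consumer_id)
        return consumer_id


def drain(request_queue, coordinator):
    """Process queued events; return the consumers removed by disconnects."""
    removed = []
    while not request_queue.empty():
        kind, connection_id = request_queue.get()
        if kind == "disconnect":
            consumer_id = coordinator.on_disconnect(connection_id)
            if consumer_id is not None:
                removed.append(consumer_id)
    return removed


request_queue = queue.Queue()
coordinator = ConnectionAwareCoordinator()
coordinator.register("conn-1", "consumer-A")

# The network layer would enqueue one event per closed socket.
request_queue.put(("disconnect", "conn-9"))  # unrelated connection: ignored
request_queue.put(("disconnect", "conn-1"))  # consumer-A's connection
print(drain(request_queue, coordinator))
```

Even in this toy form, every socket closure costs a queue entry and a map lookup, which is the overhead and layering concern raised in the comment.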
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696327#comment-14696327 ] Ewen Cheslack-Postava commented on KAFKA-2397: Can we be more concrete about what we think the odd side effects would be if it were tied to the TCP session? What would happen that would cause the TCP connection to actually close rather than waiting around for a long time for the normal TCP timeout? I'm struggling to come up with a scenario that would actually kill the connection where I wouldn't want to kick the member out of the group. I'm running into annoying issues since we don't currently have any mechanism to leave the group. Initially it was just manifesting in my manual testing with Copycat: if I restarted a sink task, it would take a while for it to start processing data because it had to wait for the process that I had just killed to be kicked out of the group. This is a bit annoying with the default session timeout of 30s, but workable. However, I'm also working on system tests, and now it's making me set quite large timeouts for some steps (which otherwise should take more like < 1s to complete), which makes the tests run a lot slower.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660279#comment-14660279 ] Jay Kreps commented on KAFKA-2397: [~guozhang] Basically "session" would be the concept that ties a connection to the business logic layer. The session would be exposed with the request. I haven't thought through how this would work, but maybe in handleProduce you could do session.addOnClose("clear-produce-purgatory", () => removeRequestFromPurgatory(id)) and in the purgatory, when the request is completed, you'd do session.removeOnClose("clear-produce-purgatory"). A similar mechanism would work based on the join-group request to rebalance the group on connection close. Like I said, I didn't think this through and don't really advocate it. Like Jason, I think there could be odd side effects.
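Editor's sketch of the on-close hook idea from the comment above, written in Python purely as an illustration. The `Session` class, the `purgatory` dict, and keying each action by request id are assumptions made for this example; the comment itself only names addOnClose and removeOnClose and explicitly says the design was not thought through.

```python
class Session:
    """Hypothetical per-connection session holding named on-close actions."""

    def __init__(self):
        self._on_close = {}
        self.closed = False

    def add_on_close(self, name, action):
        self._on_close[name] = action

    def remove_on_close(self, name):
        self._on_close.pop(name, None)

    def close(self):
        # Run each registered action exactly once when the connection closes.
        if self.closed:
            return
        self.closed = True
        actions, self._on_close = self._on_close, {}
        for action in actions.values():
            action()


purgatory = {}  # request_id -> state; stand-in for the produce purgatory


def handle_produce(session, request_id):
    purgatory[request_id] = "pending"
    # Assumption: one action per request, so the name includes the id.
    session.add_on_close(
        f"clear-produce-purgatory-{request_id}",
        lambda: purgatory.pop(request_id, None),
    )


def complete_request(session, request_id):
    purgatory.pop(request_id, None)
    session.remove_on_close(f"clear-produce-purgatory-{request_id}")


session = Session()
handle_produce(session, 1)
handle_produce(session, 2)
complete_request(session, 1)         # normal completion drops its hook
before_close = dict(purgatory)       # request 2 is still parked
session.close()                      # connection closes, hook purges it
after_close = dict(purgatory)
```

A join-group handler could register a similar action that triggers a rebalance, which is where the concern about spurious rebalances on transient disconnects would apply.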
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660269#comment-14660269 ] Jason Gustafson commented on KAFKA-2397: [~jkreps] Yeah, TCP is pretty resilient to network weirdness. I was mostly thinking of client-side timeouts, which may end up exposed in configuration. The only thing the client can do if a request times out is disconnect and try again. Perhaps we'd want to keep any timeouts with the coordinator out of configuration if we tried this approach. I was also wondering if there were some tunneling situations which could make the connection unstable.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659525#comment-14659525 ] Guozhang Wang commented on KAFKA-2397: [~jkreps] I need to look into the session implementation in the socket server, but just to be more concrete: are you suggesting adding the session-id to all handleXXXRequest methods in KafkaApis, and adding another handleSessionTimeout to KafkaApis as well?
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659470#comment-14659470 ] Jay Kreps commented on KAFKA-2397: [~guozhang] I haven't thought this through, but here is the basic idea. For the purgatory case, I think when you added something to the purgatory you would also add a "shutdown action" to the session that would delete the item on session termination. The session concept is what ties the KafkaApi layer to the network layer, so this could be added in handleProduce(). [~hachikuji] Makes sense. Theoretically TCP should handle it, but yeah, anything which broke the TCP connections would trigger a rebalance storm.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659275#comment-14659275 ] Jason Gustafson commented on KAFKA-2397: [~jkreps] I think the disconnect approach could be interesting if it was tractable in the code, but I'm a little concerned that it would lead to spurious rebalancing due to ephemeral network events. This might not be a big deal when the consumers are in the same data center as the Kafka cluster, but it could be a bigger problem if they have to cross the Internet. I wonder if you could even get into some bad situations where network instability leads to constant rebalancing as consumers leave and immediately join repeatedly.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659185#comment-14659185 ] Gwen Shapira commented on KAFKA-2397: No, SocketServer may be aware of host:port of client, but not clientID
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659158#comment-14659158 ] Guozhang Wang commented on KAFKA-2397: I once considered letting KafkaApis handle connection closure when I was working on the purgatory re-design, in order to purge the requests as mentioned by Jay. The difficulty is that at the socket server the connection is not logically tied to a client (although in fact it is), while for handling request / client failure events we need to pass a client-id into the API layer. A lot has changed in SocketServer since then, so I do not know whether we can now infer the client-id (or more specifically the consumer-id in this case) from SocketServer.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652686#comment-14652686 ] Jay Kreps commented on KAFKA-2397:
Nice summary [~onurkaraman]. I agree that adding a field to heartbeat is functionally equivalent to a leave_group request/resp. The reason for preferring that was just to reduce the conceptual weight of the protocol.

A second idea that I'm not sure is good: rather than having either a new request or a heartbeat, it would be possible to use the TCP connection closure for this. The advantage would be that ANY process death that didn't also kill the OS would then be detectable without any client participation needed. The downside is that (1) the server change would be slightly more involved, and (2) you wouldn't be able to close the connection for other reasons.

The complexity of implementation is that currently only the network layer knows about socket closes. However, we were already introducing a session concept for the security work which allows the KafkaApi layer to have access to cross-request state such as the authenticated identity. We could make it possible to add shutdown actions to the session that would make it possible to trigger this; or alternately we could add a way to add onSocketClose actions directly to the network layer.

This same feature would actually be useful for the purgatory. Currently when a connection is closed, I don't think that requests in purgatory are removed. If the purgatory timeout is very small this is okay, but it is very common for people to ask for NO timeout, in which case each connection close potentially leaks memory. I think we kind of "fixed" this by just overriding the max wait time, but purging purgatory on shutdown is obviously preferable.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652386#comment-14652386 ] Onur Karaman commented on KAFKA-2397:
Hey everyone. There's a difference between the best, expected, and worst case rebalance time.

Trunk - A consumer leaves at t = 0 and the coordinator detects the failure at t = s. The rebalance window can close as soon as all the existing consumers rejoin and as late as the maximum member session timeout. The time to stabilize since the consumer failure is something like:
{code}
t = s + rebalance_timeout
{code}
Best case: The coordinator receives all of the remaining consumers' heartbeats immediately after t = s. All of the remaining consumers rejoin immediately after receiving the heartbeat response. So everything is done by *t ~= s*.
Expected case: The coordinator receives all of the remaining heartbeats at t = 4s/3 because consumers will typically figure out the rebalance after s/3 (an oversimplification: consumers of a group actually have staggered heartbeat intervals). All of the remaining consumers eventually rejoin (coordinator_join_group_request_receival_delay). So everything is done by *t ~= s + (s/3 + coordinator_join_group_request_receival_delay)*.
Worst case: All of the consumers in the group somehow fail to get notified of the rebalance until the very last possible moment and rejoin the group just before the rebalance window ends: *t = s + s*.

LeaveGroupRequest - A consumer leaves at t = 0 and sends out the LeaveGroupRequest. The rebalance window can close as soon as all the existing consumers rejoin and as late as the maximum member session timeout. The LeaveGroupRequest would cut down the time to stabilize since the consumer failure to something like:
{code}
t = coordinator_leave_group_request_receival_delay + rebalance_timeout
{code}
Best case: A consumer leaves at t = 0, sends out the LeaveGroupRequest, and the coordinator immediately receives the LeaveGroupRequest. The coordinator receives all of the remaining consumers' heartbeats immediately after t = 0. All of the remaining consumers rejoin immediately after receiving the heartbeat response. So everything is done by *t ~= 0*.
Expected case: A consumer leaves at t = 0, sends out the LeaveGroupRequest, and the coordinator receives the LeaveGroupRequest at t = coordinator_leave_group_request_receival_delay. All of the remaining consumers eventually rejoin (coordinator_join_group_request_receival_delay). So everything is done by *t ~= coordinator_leave_group_request_receival_delay + (s/3 + coordinator_join_group_request_receival_delay)*. I'm assuming coordinator_leave_group_request_receival_delay << s.
Worst case: A consumer leaves at t = 0, sends out the LeaveGroupRequest, and the coordinator receives the LeaveGroupRequest at t = coordinator_leave_group_request_receival_delay. All of the consumers in the group somehow fail to get notified of the rebalance until the very last possible moment and rejoin the group just before the rebalance window ends: *t = coordinator_leave_group_request_receival_delay + s*. I'm assuming coordinator_leave_group_request_receival_delay << s.
Absolute worst case: The LeaveGroupRequest somehow got dropped before reaching the coordinator. The heartbeat would time out on the coordinator anyway and hit the existing *t = s + s* behavior.

Summary - So I guess the absolute worst case behavior hasn't changed if the LeaveGroupRequest was somehow dropped, but everything else should get better by about s.

P.S.: To avoid confusion, it's probably best to state whether you're talking about the behavior in trunk or the proposed behavior with LeaveGroupRequest. I prefer having a separate LeaveGroupRequest, but that's less of the focus here.
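Editor's worked example of the timing formulas in the comment above, in Python. The session timeout of 30s is the default mentioned elsewhere in this thread; the two 50 ms delays are made-up values chosen only to satisfy the stated assumption that the receival delays are much smaller than s. Times are kept in integer milliseconds to avoid rounding noise.

```python
def trunk_times(s_ms, join_delay_ms):
    """Stabilization time on trunk: s to detect the failure, then rebalance."""
    return {
        "best": s_ms,
        "expected": s_ms + (s_ms // 3 + join_delay_ms),
        "worst": s_ms + s_ms,
    }


def leave_group_times(s_ms, leave_delay_ms, join_delay_ms):
    """Stabilization time when the consumer sends a LeaveGroupRequest."""
    return {
        "best": 0,  # request and heartbeats arrive immediately (t ~= 0)
        "expected": leave_delay_ms + (s_ms // 3 + join_delay_ms),
        "worst": leave_delay_ms + s_ms,
        "request_dropped": s_ms + s_ms,  # falls back to trunk worst case
    }


S_MS = 30_000          # session timeout s
LEAVE_DELAY_MS = 50    # hypothetical coordinator_leave_group_request_receival_delay
JOIN_DELAY_MS = 50     # hypothetical coordinator_join_group_request_receival_delay

trunk = trunk_times(S_MS, JOIN_DELAY_MS)
leave = leave_group_times(S_MS, LEAVE_DELAY_MS, JOIN_DELAY_MS)

# Improvement per case; each comes out at roughly one session timeout.
savings = {case: trunk[case] - leave[case] for case in ("best", "expected", "worst")}

for case in ("best", "expected", "worst"):
    print(f"{case}: trunk={trunk[case]} ms, leave_group={leave[case]} ms")
```

With these inputs the expected case drops from 40050 ms to 10100 ms and the worst case from 60000 ms to 30050 ms, which matches the summary that everything except the dropped-request case improves by about s.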
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652231#comment-14652231 ] Jiangjie Qin commented on KAFKA-2397: I would prefer extending heartbeat to indicate leaving the group. There will always be a delay of up to 1/3 of the session timeout before the rebalance is triggered on all the consumers in the group, given that the broker always triggers rebalance on the heartbeat response. That is probably fine.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652198#comment-14652198 ] Jason Gustafson commented on KAFKA-2397: [~onurkaraman] I like this idea. Wouldn't the expected rebalance time actually be just the heartbeat interval since that's how long it would take the other group members to see the need to rebalance and send the new join group request? I think [~jkreps] was also suggesting to implement this as an "un-heartbeat" (i.e. with a flag on the heartbeat request), but I'm not sure if there was a strong reason to prefer that over the explicit request.
[jira] [Commented] (KAFKA-2397) leave group request
[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650693#comment-14650693 ] ASF GitHub Bot commented on KAFKA-2397: GitHub user onurkaraman opened a pull request: https://github.com/apache/kafka/pull/103 KAFKA-2397: leave group request You can merge this pull request into a Git repository by running: $ git pull https://github.com/onurkaraman/kafka leave-group Alternatively you can review and apply these changes as the patch at: https://github.com/apache/kafka/pull/103.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #103 commit 24d7c931f17f34211e3cac69a678ae0d3980396a Author: Onur Karaman Date: 2015-07-31T08:52:44Z leave group request